Monday, July 25, 2016

How Does the Bag of Words Model Work in Text Mining with Python?

The Bag of Words (BOW) model is a popular representation commonly used in natural language processing applications. In this model, a text document, possibly containing several sentences and paragraphs, is represented as the bag of the words it contains, without considering syntax or the order of the words. The main idea is to quantize each extracted token into one of a set of categorical words, and then represent each document by a histogram of those categorical words. The BOW feature design creates a vocabulary out of all the words in the training set and, for each example, builds a vector of word counts for that instance. Since the vector holds a place for each word in the vocabulary, the resulting BOW matrix is sparse (mostly zeros), because most words in the vocabulary don't appear in a given example.
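As a side note (not part of the pipeline built below), scikit-learn's CountVectorizer constructs exactly this sparse document-term count matrix; here is a minimal sketch with three toy documents:

from sklearn.feature_extraction.text import CountVectorizer

docs = ['the cat sat on the mat',
        'the dog sat on the log',
        'cats and dogs at home']

vectorizer = CountVectorizer()            # builds the vocabulary from the documents
bow = vectorizer.fit_transform(docs)      # sparse matrix: one row per document, one column per word

print(vectorizer.vocabulary_)             # maps each word to its column index
print(bow.toarray())                      # dense view: mostly zeros, as described above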
One of its most common applications is document classification, where the frequency of occurrence of each word from a dictionary is used as a feature for learning a classifier [1]. Salton and McGill proposed a keyword classification method that classifies documents into different categories using a synonym dictionary for document indexing and query retrieval [2]. Furthermore, Zhang et al. present a statistical framework for generalizing the BOW representation to image processing, inspired by text mining applications [3]. Voloshynovskiy et al. also survey the performance of the BOW method to provide a better understanding of this technique in terms of robustness and accuracy in decision making, and report successful experiments on real image data [4].
After we determine how many documents we have, we simply put them all into a single text file. We then replace punctuation such as $, %, & and @, as well as digits and specific terms such as WWW and com, with blank space. After cleaning the text file, we count the frequency of each word and rank the words by frequency. Finally, we choose the top N words from the text file, where N is a user-defined parameter.
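A minimal sketch of that cleaning step is given below; the exact characters and terms to strip are application-dependent, and the pattern here simply mirrors the examples listed above:

import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[\$%&@]', ' ', text)           # punctuation listed above
    text = re.sub(r'\b(?:www|com)\b', ' ', text)   # specific terms such as WWW and com
    text = re.sub(r'\d+', ' ', text)               # digits
    return re.sub(r'\s+', ' ', text).strip()       # collapse repeated blanks

print(clean_text('Visit WWW.example.com & send $100 now!'))
# -> 'visit .example. send now!'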

Here is a demo of how BOW works on text data. For the first paragraph of this blog post, we have:


from __future__ import print_function
import numpy as np
import pandas as pd
from collections import Counter


def get_user_terms_stops(user_words):
    '''Extract the token set for each user from its documents:
    tokenize, then remove stop words and punctuation (stemming is optional).
    '''
    import re
    import string
    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer

    print('extracting the token set for each user ...')
    user_terms_stops = []

    # build the stop-word list and the stemmer once, outside the loops
    stop = stopwords.words('english') + list(string.punctuation)
    p_stemmer = PorterStemmer()    # only needed if stemming is enabled below

    for i in range(len(user_words)):
        terms_stops = []
        u = ' '.join(np.array(user_words[i])).split()  # only keep the non-empty elements
        for j in range(len(u)):
            # tokenizing:
            tokens = re.findall(r'\w+', u[j])
            # remove stop words and punctuation:
            terms_stop = [word for word in tokens if word not in stop]
            terms_stops = terms_stops + terms_stop
        # optional stemming (left out here, so the output below keeps the original word forms):
        # terms_stops = [p_stemmer.stem(k) for k in terms_stops]
        user_terms_stops.append(terms_stops)
    print('accomplished!')
    # flatten the per-user lists into a single list of tokens
    user_terms_stops = [item for sublist in user_terms_stops for item in sublist]
    return user_terms_stops


if __name__ == '__main__':
       
    X = [['The Bag of Words model',
          'BOW, is a popular representation which is commonly used in natural language processing applications.',
          'In this model',
          'a text document including several sentences and paragraphs is represented as the bag of the words',
          'included in that document, without considering syntax and the order of words.'],
         ['The main idea is to quantize each extracted key point into one of categorical words',
          'and then represent each document by a histogram of the categorical words.'],
         ['The BOW feature design creates a vocabulary out of all the words in the training set and',
          'for each example, creates a vector of word counts pertaining to that instance.',
          'Since the vector holds a place for each word in the vocabulary',
          'the resulting BOW matrix is sparse (mostly zeros) because most words in the vocabulary do not appear in a given example.']
        ]
   
   
    user_terms_stops = get_user_terms_stops(X)
    user_terms_stops   # in a notebook/REPL this displays the flattened token list below

['The',
 'Bag',
 'Words',
 'model',
 'BOW',
 'popular',
 'representation',
 'commonly',
 'used',
 'natural',
 'language',
 'processing',
 'applications',
 'In',
 'model',
 'text',
 'document',
 'including',
 'several',
 'sentences',
 'paragraphs',
 'represented',
 'bag',
 'words',
 'included',
 'document',
 'without',
 'considering',
 'syntax',
 'order',
 'words',
 'The',
 'main',
 'idea',
 'quantize',
 'extracted',
 'key',
 'point',
 'one',
 'categorical',
 'words',
 'represent',
 'document',
 'histogram',
 'categorical',
 'words',
 'The',
 'BOW',
 'feature',
 'design',
 'creates',
 'vocabulary',
 'words',
 'training',
 'set',
 'example',
 'creates',
 'vector',
 'word',
 'counts',
 'pertaining',
 'instance',
 'Since',
 'vector',
 'holds',
 'place',
 'word',
 'vocabulary',
 'resulting',
 'BOW',
 'matrix',
 'sparse',
 'mostly',
 'zeros',
 'words',
 'vocabulary',
 'appear',
 'given',
 'example']


c = Counter(user_terms_stops)
c




The Counter object gives the frequency of each token across the documents:
Counter({'BOW': 3,
         'Bag': 1,
         'In': 1,
         'Since': 1,
         'The': 3,
         'Words': 1,
         'appear': 1,
         'applications': 1,
         'bag': 1,
         'categorical': 2,
         'commonly': 1,
         'considering': 1,
         'counts': 1,
         'creates': 2,
         'design': 1,
         'document': 3,
         'example': 2,
         'extracted': 1,
         'feature': 1,
         'given': 1,
         'histogram': 1,
         'holds': 1,
         'idea': 1,
         'included': 1,
         'including': 1,
         'instance': 1,
         'key': 1,
         'language': 1,
         'main': 1,
         'matrix': 1,
         'model': 2,
         'mostly': 1,
         'natural': 1,
         'one': 1,
         'order': 1,
         'paragraphs': 1,
         'pertaining': 1,
         'place': 1,
         'point': 1,
         'popular': 1,
         'processing': 1,
         'quantize': 1,
         'represent': 1,
         'representation': 1,
         'represented': 1,
         'resulting': 1,
         'sentences': 1,
         'set': 1,
         'several': 1,
         'sparse': 1,
         'syntax': 1,
         'text': 1,
         'training': 1,
         'used': 1,
         'vector': 2,
         'vocabulary': 3,
         'without': 1,
         'word': 2,
         'words': 6,
         'zeros': 1})


The following code returns the N most frequent words in our documents, which can be used in many applications such as text tagging and text categorization.
top_N = 6
top_words = c.most_common(top_N)
top_words


[('words', 6),
 ('BOW', 3),
 ('vocabulary', 3),
 ('The', 3),
 ('document', 3),
 ('creates', 2)]
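To turn these top-N words into the per-document histograms mentioned in the first paragraph, we can count, for each document, how often each selected word occurs. A minimal sketch reusing the X, top_words and get_user_terms_stops objects defined above:

vocabulary = [word for word, _ in top_words]      # the top-N words selected above

bow_vectors = []
for doc in X:                                     # one histogram per document
    counts = Counter(get_user_terms_stops([doc]))
    bow_vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
print(np.array(bow_vectors))                      # one row per document, one column per top word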





References
[1] Zellig Harris. "Distributional Structure". Word, 10(2/3): 146-162, 1954.
[2] Gerard Salton and Michael J. McGill. "Introduction to Modern Information Retrieval". McGraw-Hill Book Company, New York, 1983.
[3] Yin Zhang, Rong Jin, and Zhi-Hua Zhou. "Understanding bag-of-words model: a statistical framework". International Journal of Machine Learning and Cybernetics, 1(1-4): 43-52, 2010.
[4] Sviatoslav Voloshynovskiy, Maurits Diephuis, Dimche Kostadinov, Farzad Farhadzadeh and Taras Holotyak. "On accuracy, robustness and security of bag-of-word search systems".


Monte Carlo tree search (MCTS)


Ever wondered what algorithm powers the core engine of games like Othello, Go, and almost every board game you can think of? Then I encourage you to read this blog post ;)

Recently, AI (Artificial Intelligence) and ML (Machine Learning) have produced a tool that defeats human experts at Go, a game perceived for many years as one of the most difficult board games, thanks to deep RL (Reinforcement Learning) combined with MCTS as its core search procedure. Apart from the undeniable role of deep learning in this result, MCTS has been the main search algorithm used to address this problem and similar ones since 2006. The formalization by Kocsis and Szepesvári of upper confidence bounds for trees (UCT), which owes much to John von Neumann's 1928 minimax theorem, the breakthrough behind adversarial tree search methods, paved the way to exploring and exploiting an optimal policy that won 4 out of 5 games against the professional Go player Lee Sedol, one of the world's strongest players.
The goal of Monte Carlo tree search is to analyse the most promising moves, growing the search tree by random sampling of the search space. MCTS is limited to games that have a finite number of possible moves in each state and a finite length. The tree represents the possible states of the game: the root node is the initial game state, and for every valid move from a node there is a child node representing the game state reached by that move. More generally, MCTS can be applied to any domain that can be described in terms of (state, action) pairs, with simulations used to forecast outcomes.

The MCTS algorithm can be broken down into the following steps:



  • Selection: starting at the root node R, recursively select optimal child nodes until a leaf node L is reached. 
  • Expansion: if L is not a terminal node, create one or more child nodes and select one of them, C. 
  • Simulation: run a simulated playout from C until a result is achieved. 
  • Backpropagation: update the nodes along the current move sequence with the simulation result.
Kocsis and Szepesvári recommend, in the selection step of the algorithm above, choosing at each node the child that maximizes the following expression,

\[ \frac{w_i}{n_i} + c \sqrt{\frac{\ln{t}}{n_i}} \]

where \(w_i\) stands for the number of wins after the i-th move, \(n_i\) stands for the number of simulations after the i-th move, \(c\) is the exploration parameter (theoretically equal to \(\sqrt{2}\), in practice usually chosen empirically), and \(t\) stands for the total number of simulations for the node considered, equal to the sum of all the \(n_i\).
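To make the four steps and the selection rule concrete, here is a minimal Python sketch of UCT-style MCTS. The game interface (legal_moves(), play(), is_terminal(), result()) is a hypothetical placeholder, the playout policy is purely random, and rewards are counted from a single player's point of view for simplicity:

import math
import random

class Node:
    '''One game state in the search tree; `state` is assumed to expose
    legal_moves(), play(move) -> new state, is_terminal() and result(player).'''
    def __init__(self, state, parent=None, move=None):
        self.state = state
        self.parent = parent
        self.move = move                          # move that led from parent to this node
        self.children = []
        self.untried = list(state.legal_moves())
        self.wins = 0.0                           # w_i in the formula above
        self.visits = 0                           # n_i in the formula above

    def uct(self, c=math.sqrt(2)):
        # exploitation + exploration; self.parent.visits plays the role of t
        return (self.wins / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts(root_state, player, n_iterations=1000):
    root = Node(root_state)
    for _ in range(n_iterations):
        node = root
        # 1. Selection: descend through fully expanded nodes by maximizing UCT
        while not node.untried and node.children:
            node = max(node.children, key=Node.uct)
        # 2. Expansion: add one child for a move that has not been tried yet
        if node.untried:
            move = node.untried.pop(random.randrange(len(node.untried)))
            node.children.append(Node(node.state.play(move), parent=node, move=move))
            node = node.children[-1]
        # 3. Simulation: random playout from the new node to a terminal state
        state = node.state
        while not state.is_terminal():
            state = state.play(random.choice(list(state.legal_moves())))
        reward = state.result(player)             # e.g. 1 for a win, 0 for a loss
        # 4. Backpropagation: update the statistics along the path back to the root
        while node is not None:
            node.visits += 1
            node.wins += reward
            node = node.parent
    # play the most visited move at the root
    return max(root.children, key=lambda ch: ch.visits).move

# usage (hypothetical): best_move = mcts(initial_state, player=1, n_iterations=2000)

A full two-player implementation would alternate the reward's perspective during backpropagation, but the skeleton above is enough to see how the four steps and the UCT formula fit together.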
And that’s it!

Sunday, July 24, 2016

SVM is NOT robust to outliers but median regression is

This text is a bit technical. It is the result of a discussion with Andrea Lodi concerning the robustness of the support vector machine (SVM), the famous and widely used classifier in machine learning.



It is well known that the median is more robust than the mean. The mean is the solution to an L2, quadratic minimization (least squares), while the median is the solution to an L1, linear minimization (least absolute deviation). Quantile regression coincides with median regression for tau = 0.5. The hinge loss (SVM) is a combination of a constant and a linear loss, so its behaviour looks like that of a skewed median.
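To make the comparison concrete, these are the criteria being minimized, writing \(m\) for the location estimate and \(f(x)\) for the classifier's decision value (the SVM penalty term is omitted here for brevity):

\[ \text{mean: } \min_m \sum_i (y_i - m)^2, \qquad \text{median: } \min_m \sum_i |y_i - m|, \qquad \text{hinge (SVM): } \sum_i \max\big(0,\, 1 - y_i f(x_i)\big). \]

The hinge loss is constant for correctly classified points beyond the margin and linear otherwise, which is where the skewed-median behaviour comes from.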

Here I examine the support vector machine and compare it with these two linear techniques for estimating a line.
The robustness of the SVM is rarely studied, and we do not know much about it. It is interesting to see how much outliers affect the SVM classifier. Let's take the simplest version, i.e. the linear SVM. Since SVM classification does not correspond to a statistical model, it is difficult to say what an outlier means in this context, but at least we can talk about influential observations and build a sort of Cook's distance by removing data and re-fitting.
The linear SVM minimizes the hinge loss with an L2 penalty on the classification coefficients, whereas least squares and median regression include no penalization. Studying the behaviour of the SVM in the presence of outliers is therefore closely related to studying the influence function of an L2-penalized S-estimator.
The support vector machine is a hinge-loss regression while the data are separable; in that case the penalization is only there to maximize the margin. For non-separable data it plays a totally different role: it maximizes a hypothetical generalized margin which has no geometrical interpretation.
For non-separable data the L2-norm penalization plays a larger role, as more data fall on this hypothetical generalized margin.
Let's start by generating some data and fitting a separating line that maximizes the margin. Now let's fit an SVM classifier. The support vectors are highlighted in blue, least squares in green, and least absolute deviation (median regression) in magenta. Pay attention to the magenta line as you go through the following plots.
rm(list=ls(all=TRUE))
set.seed(150)
n <- 50
x1s <- c(rnorm(n,-3), rnorm(n,3))
x2s <- c(rnorm(n,-3), rnorm(n,3))
ys <- c(rep(+1,n), rep(-1,n))
my.data <- data.frame(x1=x1s, x2=x2s, type=ys)
plot(my.data[,-3],col=(ys+3)/2, pch=19, xlim=c(-10,10), ylim=c(-10,10))


library('e1071')
library('quantreg')
svm.model <- svm(type ~ ., data=my.data,
                 type='C-classification',
                 kernel='linear',scale=FALSE)


w <- t(svm.model$coefs) %*% svm.model$SV
b <- -svm.model$rho
p <- svm.model$SV

#Fit Least Squares and Median Regression 
lscoef <- coef(lm(type ~., data=my.data))
medcoef <- coef(rq(type ~., data=my.data,tau=0.5))

# Plot
plot(my.data[,-3],col=(ys+3)/2, 
     pch=19, xlim=c(-10,10), ylim=c(-10,10))
points(my.data[svm.model$index,c(1,2)],cex=2, col="blue",lwd=2) 

abline(a=-lscoef[1]/lscoef[3],b=-lscoef[2]/lscoef[3],
       col="green")
abline(a=-medcoef[1]/medcoef[3],b=-medcoef[2]/medcoef[3],
       col="magenta")
abline(a=-b/w[1,2], b=-w[1,1]/w[1,2], col="blue", lty=1)

legend(-5,10, legend=c("SVM", "L2", "L1"), 
       col=c("blue","green","magenta"), lwd=3)

SVM Robustness

Let's add some contamination that is correctly classified and see if the SVM line changes. As long as the support vectors (blue data) remain on the margin, the SVM fit remains the same.
outlier<-5
x1s <- c(x1s, rnorm(outlier,-10))
x2s <- c(x2s, rnorm(outlier,0))
ys <- c(ys,rep(+1,outlier))
my.data <- data.frame(x1=x1s, x2=x2s, type=ys)

# SVM
svm.model <- svm(type ~ ., data=my.data,
                 type='C-classification',
                 kernel='linear',scale=FALSE)


w <- t(svm.model$coefs) %*% svm.model$SV
b <- -svm.model$rho
p <- svm.model$SV

#Fit Least Squares and Median Regression 
lscoef <- coef(lm(type ~., data=my.data))
medcoef <- coef(rq(type ~., data=my.data,tau=0.5))

# Plot
plot(my.data[,-3],col=(ys+3)/2, 
     pch=19, xlim=c(-10,10), ylim=c(-10,10))
points(my.data[svm.model$index,c(1,2)],cex=2, col="blue",lwd=2) 

abline(a=-lscoef[1]/lscoef[3],b=-lscoef[2]/lscoef[3],
       col="green")
abline(a=-medcoef[1]/medcoef[3],b=-medcoef[2]/medcoef[3],
       col="magenta")
abline(a=-b/w[1,2], b=-w[1,1]/w[1,2], col="blue", lty=1)

legend(-5,10, legend=c("SVM", "L2", "L1"), 
       col=c("blue","green","magenta"), lwd=3)

Non-separable data

Of course, most classification data are not separable. It is also possible that somebody made a mistake and swapped the +1 class with -1. It is therefore more realistic to consider a non-separable data set, so we modify our simulated data a bit.
#make data non-separable
x1s[1] <- 2
x2s[1] <- 2

x1s[n+1] <- -2
x2s[n+1] <- -2

my.data <- data.frame(x1=x1s, x2=x2s, type=ys)

svm.model <- svm(type ~ ., data=my.data,
                 type='C-classification',
                 kernel='linear',scale=FALSE)


w <- t(svm.model$coefs) %*% svm.model$SV
b <- -svm.model$rho
p <- svm.model$SV

#Fit Least Squares and Median Regression 
lscoef <- coef(lm(type ~., data=my.data))
medcoef <- coef(rq(type ~., data=my.data,tau=0.5))

# Plot
plot(my.data[,-3],col=(ys+3)/2, 
     pch=19, xlim=c(-10,10), ylim=c(-10,10))
points(my.data[svm.model$index,c(1,2)],cex=2, col="blue",lwd=2) 

abline(a=-lscoef[1]/lscoef[3],b=-lscoef[2]/lscoef[3],
       col="green")
abline(a=-medcoef[1]/medcoef[3],b=-medcoef[2]/medcoef[3],
       col="magenta")
abline(a=-b/w[1,2], b=-w[1,1]/w[1,2], col="blue", lty=1)

legend(-5,10, legend=c("SVM", "L2", "L1"), 
       col=c("blue","green","magenta"), lwd=3)

As you can see, there are more blue dots now, since misclassified data fall on the hypothetical generalized margin. It seems the SVM is more sensitive to data separability than least squares or median regression.

SVM Breakdown

Now let's see if we can add some contamination to our data and influence the separating line.
outlier <-10
x1s <- c(x1s, rnorm(outlier,10))
x2s <- c(x2s, rnorm(outlier,5))
ys <- c(ys,rep(+1,outlier))
my.data <- data.frame(x1=x1s, x2=x2s, type=ys)

svm.model <- svm(type ~ ., data=my.data,
                 type='C-classification',
                 kernel='linear',scale=FALSE)

w <- t(svm.model$coefs) %*% svm.model$SV
b <- -svm.model$rho
p <- svm.model$SV

#Fit Least Squares and Median Regression 
lscoef <- coef(lm(type ~., data=my.data))
medcoef <- coef(rq(type ~., data=my.data,tau=0.5))

# Plot
plot(my.data[,-3],col=(ys+3)/2, 
     pch=19, xlim=c(-10,10), ylim=c(-10,10))
points(my.data[svm.model$index,c(1,2)],cex=2, col="blue",lwd=2) 

abline(a=-lscoef[1]/lscoef[3],b=-lscoef[2]/lscoef[3],
       col="green")
abline(a=-medcoef[1]/medcoef[3],b=-medcoef[2]/medcoef[3],
       col="magenta")
abline(a=-b/w[1,2], b=-w[1,1]/w[1,2], col="blue", lty=1)

legend(-5,10, legend=c("SVM", "L2", "L1"), 
       col=c("blue","green","magenta"), lwd=3)

Lesson

The lesson is hidden in the magenta lines: compare them across the plots above. Most real data are non-separable. Be careful when using the SVM in the presence of mislabelled data or outliers. If you are unsure whether outliers are present, always keep median regression (magenta) as a competitor to the SVM.