Sunday, July 24, 2016

SVM is NOT robust to outliers but median regression is

This text is a bit technical. It is the result of a discussion with Andrea Lodi concerning the robustness of the support vector machine (SVM), the famous and widely used classifier in machine learning.



It is well known that the median is more robust than the mean. The mean is the solution to an L2 (quadratic) minimization, i.e. least squares, and the median is the solution to an L1 (linear) minimization, i.e. least absolute deviation. Quantile regression coincides with median regression for tau = 0.5. The hinge loss used by the SVM is a combination of a constant piece and a linear piece, so its behaviour looks like that of a skewed median.
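To make the comparison concrete, here is a small sketch of the three losses (our own illustration, with our own function names, not part of the original discussion); the colours match the plots below: green for least squares, magenta for the median, blue for the hinge.

# Squared loss (mean / least squares), absolute loss (median / least absolute
# deviation), and the hinge loss (SVM), drawn on the same axes.
loss.l2    <- function(r) r^2
loss.l1    <- function(r) abs(r)
loss.hinge <- function(m) pmax(0, 1 - m)   # m = y * f(x); flat beyond the margin, linear inside
curve(loss.l2, -3, 3, ylab = "loss", col = "green")
curve(loss.l1, -3, 3, add = TRUE, col = "magenta")
curve(loss.hinge, -3, 3, add = TRUE, col = "blue")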

Here I examine the support vector machine and compare it with these two linear techniques for estimating a line.
The robustness of the SVM is rarely studied, and we do not know much about it. It is interesting to see how much outliers affect the SVM classifier. Let's take the simplest version, i.e. the linear SVM. As SVM classification does not correspond to a statistical model, it is difficult to say what an outlier means in this context, but we can at least talk about influential observations and build a sort of Cook's distance by removing data and re-fitting.
The linear SVM minimizes the hinge loss with an L2 penalty on the classification coefficients, whereas least squares and median regression involve no penalization. Studying the behaviour of the SVM in the presence of outliers is therefore closely related to studying the influence function of an L2-penalized S-estimator.
For separable data the SVM is a hinge-loss regression, and the penalty is only there to maximize the margin. For non-separable data the penalty plays a totally different role: it maximizes a hypothetical generalized margin which has no geometric interpretation, and the L2 penalty matters more because more data fall on this generalized margin.
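To be explicit (this is our own summary of the standard soft-margin formulation behind C-classification, not something from the original discussion), the linear SVM minimizes half the squared norm of the coefficients plus the cost times the summed hinge losses:

# Sketch: the soft-margin linear SVM objective,
#   (1/2)*||w||^2 + cost * sum_i max(0, 1 - y_i*(w'x_i + b)),
# written as an R function of a 1 x p coefficient matrix w and intercept b.
svm.objective <- function(w, b, X, y, cost = 1) {
  margins <- y * (X %*% t(w) + b)              # y_i * f(x_i)
  0.5 * sum(w^2) + cost * sum(pmax(0, 1 - margins))
}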
Let's start by generating some data and fitting a separating line that maximizes the margin, i.e. an SVM classifier. The support vectors are highlighted in blue, the least-squares line in green, and the least-absolute-deviation (median regression) line in magenta. Pay attention to the magenta line as you go through the following plots.
rm(list=ls(all=TRUE))
set.seed(150)
n <- 50
# two Gaussian clouds around (-3,-3) and (3,3), labelled +1 and -1
x1s <- c(rnorm(n, -3), rnorm(n, 3))
x2s <- c(rnorm(n, -3), rnorm(n, 3))
ys <- c(rep(+1, n), rep(-1, n))
my.data <- data.frame(x1=x1s, x2=x2s, type=ys)
plot(my.data[,-3],col=(ys+3)/2, pch=19, xlim=c(-10,10), ylim=c(-10,10))


library('e1071')
library('quantreg')
svm.model <- svm(type ~ ., data=my.data,
                 type='C-classification',
                 kernel='linear',scale=FALSE)


w <- t(svm.model$coefs) %*% svm.model$SV   # hyperplane normal: sum of coef_i * x_i over support vectors
b <- -svm.model$rho                        # hyperplane intercept
p <- svm.model$SV                          # the support vectors themselves
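
# For reference (our own comments and helper, not in the original post): the
# fitted hyperplane is w[1]*x1 + w[2]*x2 + b = 0, so the line drawn below has
# intercept -b/w[2] and slope -w[1]/w[2].
hyperplane.line <- function(w, b) {
  c(intercept = -b / w[1, 2], slope = -w[1, 1] / w[1, 2])
}
hyperplane.line(w, b)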

#Fit Least Squares and Median Regression 
lscoef <- coef(lm(type ~., data=my.data))
medcoef <- coef(rq(type ~., data=my.data,tau=0.5))

# Plot
plot(my.data[,-3],col=(ys+3)/2, 
     pch=19, xlim=c(-10,10), ylim=c(-10,10))
points(my.data[svm.model$index,c(1,2)],cex=2, col="blue",lwd=2) 

abline(a=-lscoef[1]/lscoef[3],b=-lscoef[2]/lscoef[3],
       col="green")
abline(a=-medcoef[1]/medcoef[3],b=-medcoef[2]/medcoef[3],
       col="magenta")
abline(a=-b/w[1,2], b=-w[1,1]/w[1,2], col="blue", lty=1)

legend(-5,10, legend=c("SVM", "L2", "L1"), 
       col=c("blue","green","magenta"), lwd=3)

SVM Robustness

Let's add some contaminating points that are correctly classified and see if the SVM line changes. As long as the support vectors (blue points) remain on the margin, the SVM fit remains the same.
outlier<-5
x1s <- c(x1s, rnorm(outlier,-10))
x2s <- c(x2s, rnorm(outlier,0))
ys <- c(ys,rep(+1,outlier))
my.data <- data.frame(x1=x1s, x2=x2s, type=ys)

# SVM
svm.model <- svm(type ~ ., data=my.data,
                 type='C-classification',
                 kernel='linear',scale=FALSE)


w <- t(svm.model$coefs) %*% svm.model$SV
b <- -svm.model$rho
p <- svm.model$SV

#Fit Least Squares and Median Regression 
lscoef <- coef(lm(type ~., data=my.data))
medcoef <- coef(rq(type ~., data=my.data,tau=0.5))

# Plot
plot(my.data[,-3],col=(ys+3)/2, 
     pch=19, xlim=c(-10,10), ylim=c(-10,10))
points(my.data[svm.model$index,c(1,2)],cex=2, col="blue",lwd=2) 

abline(a=-lscoef[1]/lscoef[3],b=-lscoef[2]/lscoef[3],
       col="green")
abline(a=-medcoef[1]/medcoef[3],b=-medcoef[2]/medcoef[3],
       col="magenta")
abline(a=-b/w[1,2], b=-w[1,1]/w[1,2], col="blue", lty=1)

legend(-5,10, legend=c("SVM", "L2", "L1"), 
       col=c("blue","green","magenta"), lwd=3)

Non-separable data

Of course, most classification data are not separable. It is also possible that somebody made a mistake and swapped the +1 class with -1. It is therefore more realistic to consider a non-separable data set, so we modify our simulated data a bit.
#make data non-separable
x1s[1] <- 2
x2s[1] <- 2

x1s[n+1] <- -2
x2s[n+1] <- -2

my.data <- data.frame(x1=x1s, x2=x2s, type=ys)

svm.model <- svm(type ~ ., data=my.data,
                 type='C-classification',
                 kernel='linear',scale=FALSE)


w <- t(svm.model$coefs) %*% svm.model$SV
b <- -svm.model$rho
p <- svm.model$SV

#Fit Least Squares and Median Regression 
lscoef <- coef(lm(type ~., data=my.data))
medcoef <- coef(rq(type ~., data=my.data,tau=0.5))

# Plot
plot(my.data[,-3],col=(ys+3)/2, 
     pch=19, xlim=c(-10,10), ylim=c(-10,10))
points(my.data[svm.model$index,c(1,2)],cex=2, col="blue",lwd=2) 

abline(a=-lscoef[1]/lscoef[3],b=-lscoef[2]/lscoef[3],
       col="green")
abline(a=-medcoef[1]/medcoef[3],b=-medcoef[2]/medcoef[3],
       col="magenta")
abline(a=-b/w[1,2], b=-w[1,1]/w[1,2], col="blue", lty=1)

legend(-5,10, legend=c("SVM", "L2", "L1"), 
       col=c("blue","green","magenta"), lwd=3)

As you see, there are more blue points now, since misclassified data fall on the hypothetical generalized margin. It seems the SVM is more sensitive to data separability than least squares or median regression.
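How much the penalty matters can be probed through the cost argument of svm(), the inverse of the regularization strength. A small sketch of ours (not in the original post): re-fit the same non-separable data with different cost values and count the support vectors.

# Smaller cost = heavier L2 penalty = wider generalized margin, so typically
# more points become support vectors on non-separable data.
sapply(c(0.01, 1, 100), function(C) {
  fit.C <- svm(type ~ ., data = my.data, type = 'C-classification',
               kernel = 'linear', scale = FALSE, cost = C)
  length(fit.C$index)   # number of support vectors at this cost
})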

SVM Breakdown

Now let's see if we can add some contamination to our data and influence the separating line.
outlier <-10
x1s <- c(x1s, rnorm(outlier,10))
x2s <- c(x2s, rnorm(outlier,5))
ys <- c(ys,rep(+1,outlier))
my.data <- data.frame(x1=x1s, x2=x2s, type=ys)

svm.model <- svm(type ~ ., data=my.data,
                 type='C-classification',
                 kernel='linear',scale=FALSE)

w <- t(svm.model$coefs) %*% svm.model$SV
b <- -svm.model$rho
p <- svm.model$SV

#Fit Least Squares and Median Regression 
lscoef <- coef(lm(type ~., data=my.data))
medcoef <- coef(rq(type ~., data=my.data,tau=0.5))

# Plot
plot(my.data[,-3],col=(ys+3)/2, 
     pch=19, xlim=c(-10,10), ylim=c(-10,10))
points(my.data[svm.model$index,c(1,2)],cex=2, col="blue",lwd=2) 

abline(a=-lscoef[1]/lscoef[3],b=-lscoef[2]/lscoef[3],
       col="green")
abline(a=-medcoef[1]/medcoef[3],b=-medcoef[2]/medcoef[3],
       col="magenta")
abline(a=-b/w[1,2], b=-w[1,1]/w[1,2], col="blue", lty=1)

legend(-5,10, legend=c("SVM", "L2", "L1"), 
       col=c("blue","green","magenta"), lwd=3)

Lesson

The lesson is hidden in the magenta lines: compare them across the plots above. Most real data are non-separable. Be careful when using the SVM in the presence of mistakenly assigned labels or outliers. If you are unsure about the presence of outliers, always keep median regression (magenta) as a competitor to the SVM.
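If you want a number to go with the pictures, here is a rough sketch (the variable names are ours, not from the post): compare the misclassification rates of the fitted median-regression line and the fitted SVM on the final contaminated data.

# Error rates on the final, contaminated data set (a rough check, not a proper
# out-of-sample evaluation).
med.pred <- sign(cbind(1, x1s, x2s) %*% medcoef)                   # classify with the L1 line
svm.pred <- as.numeric(as.character(predict(svm.model, my.data)))  # SVM predictions
mean(med.pred != ys)   # median-regression (magenta) error rate
mean(svm.pred != ys)   # SVM (blue) error rate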

5 comments:

  1. Thanks for your interesting post!
    Do you have any real world dataset example where this hypothesis works too?

  2. And my second question is about the penalization: do you think your suggestion is superior to the soft-margin assumption?
    As far as I remember, several other remedies have already addressed this problem, such as the following old one,
    https://www.aaai.org/Papers/AAAI/2006/AAAI06-086.pdf
    Or in the case of your last example, the kernel trick could easily handle it.

    To me, all these sorts of issues with handling the data could be resolved by the penalization term, which is somehow equivalent to prior knowledge about the data.

    Anyway, if we want to solve the SVM problem by changing the methodology, why not think about Gaussian processes?

  3. In this paper, it looks like they use the same assumption to develop a new objective function for the SVM,
    https://www.cs.cmu.edu/~yifeim/resources/IFAC11_3467_FI.pdf

  4. Here is another remedy with mathematical proof,

    Support Vector Machines with the Ramp Loss and the Hard Margin Loss
    http://www.optimization-online.org/DB_FILE/2008/11/2134.pdf

    Replies
    1. The Ramp Loss Linear Programming Support Vector Machine in more detail,
      http://jmlr.org/papers/volume15/huang14a/huang14a.pdf
