Friday, May 8, 2015

PCA: “Principal Component Analysis”

Real data sets usually exhibit relationships (e.g. linear) among their variables. PCA is a statistical technique that rotates the original data into new coordinates that are linearly uncorrelated, making the data as flat as possible. Each principal component is a linear transformation of the entire data. The principal components are the eigenvectors of the covariance matrix; since the covariance matrix is symmetric, these eigenvectors are orthogonal. PCA is a useful tool in Machine Learning for visualization, data reduction, noise removal, and data compression. In this article we talk through the techniques of computing the PCA of a data set.
In MATLAB, we apply the “princomp” function, part of the Statistics Toolbox, to calculate the PCA; it can be used in the following way:

[n m] = size(OriginalData);
XMean = mean(OriginalData);
% compute the mean of each column of OriginalData
XStd = std(OriginalData);
% compute the standard deviation of each column of OriginalData
Data = (OriginalData - repmat(XMean,[n 1]))./ repmat(XStd,[n 1]);
% standardize OriginalData by subtracting the mean from each observation and dividing by the standard deviation, to center and scale the data
[Coeff Score Latent] = princomp(Data);

where the “princomp” function returns “Coeff” as the principal component coefficients, and “Score” as the principal component scores, i.e. the representation of “Data” in the principal component space, such that its rows correspond to observations and its columns to components. “Latent” is the vector of eigenvalues of the covariance matrix of “Data”. The coefficients of the principal components are calculated so that the first principal component captures the maximum variance, which we may tentatively think of as the maximum information. The second component is calculated to have the second most variance and, importantly, is uncorrelated (in the linear sense) with the first principal component. Further principal components, if there are any, capture decreasing variance and are uncorrelated with all other principal components.
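These outputs are related in a simple way: “princomp” centers its input internally, and since “Data” is already standardized (zero mean), the scores are just the data projected onto the coefficients, and “Latent” matches the column variances of the scores. A quick sanity check (a sketch; variable names as defined above):

```
% Score is the projection of the zero-mean data onto the coefficients:
% Score = Data * Coeff, up to floating-point error.
ProjectionError = max(max(abs(Score - Data*Coeff)));

% Latent should match the variances of the columns of Score:
VarianceError = max(abs(Latent' - var(Score)));
```

Both quantities should be on the order of machine precision.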

PCA is completely reversible, meaning that the original data can be recovered exactly from the full set of principal components. When we keep only the first components, we can compute the reconstruction error with the following code:

C = Coeff(:,1:dim);
ReconstructedData = (Data*C*C').*repmat(XStd,[n 1]) + repmat(XMean,[n 1]);

Error = sqrt(sum(sum((OriginalData - ReconstructedData).^2)));

where dim is the number of principal components we want to retain.
To find the dimensionality of a given data set, we first calculate the ratio of sum(Latent(1:i)) to sum(Latent) for each i<n, and we observe that a large gap occurs in the ratio of consecutive eigenvalues. Wherever this gap occurs, that index gives the dimensionality. The following code sketches this calculation for a given data set:

explained = cumsum(Latent)/sum(Latent);  % cumulative fraction of variance explained
Ndimension = length(Latent);             % default: keep everything if no gap is found
for i = 1:length(Latent)-1
    if Latent(i)/Latent(i+1) > Threshold  % Threshold is problem-dependent
        Ndimension = i; break;
    end
end

We can also use cross-validation to pick the dimension that yields the minimum reconstruction error when reconstructing the original data from the principal components (by running the algorithm on a validation set).

We split the data into 80% for the training set and 20% for the test set. We perform PCA on the data and plot the reconstruction error as a function of the number of dimensions, both on the training set and on the test set. Following is the list of results extracted from our observations:
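The procedure above can be sketched as follows (an illustration, not the exact experiment; variable names are hypothetical, and strictly the standardization statistics should also be computed on the training set only):

```
% split the standardized data 80/20 into train and test sets
idx = randperm(n);
nTrain = round(0.8*n);
Train = Data(idx(1:nTrain),:);
Test  = Data(idx(nTrain+1:end),:);

Coeff = princomp(Train);            % fit PCA on the training set only
TrainErr = zeros(1,m); TestErr = zeros(1,m);
for dim = 1:m
    C = Coeff(:,1:dim);             % keep the first dim components
    TrainErr(dim) = sqrt(sum(sum((Train - Train*C*C').^2)));
    TestErr(dim)  = sqrt(sum(sum((Test  - Test*C*C').^2)));
end

plot(1:m, TrainErr, 1:m, TestErr);
legend('train','test');
xlabel('number of components'); ylabel('reconstruction error');
```

The training error always decreases as dim grows; the dimension where the test-set error flattens out is a reasonable choice.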

  • One important property of the principal components is that they are “completely uncorrelated”, which we can test by calculating their correlation matrix via the “corrcoef” function in MATLAB.
  • PCA compresses as much information as possible into the first principal components. In many cases, only a small number of principal components needs to be stored.
  • The number of components needed to capture the majority of the variance is often small compared to the number of features.
  • PCA is built from quantities such as the sample variance, which are not robust. This means that PCA may be thrown off by outliers.
  • Though PCA can cram much of the variance in our data set into fewer variables, it still requires all of the variables to generate the principal components of future observations, regardless of how many principal components are retained for our application.
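The “completely uncorrelated” claim in the first bullet can be checked directly: the correlation matrix of the scores should be numerically the identity. A minimal sketch, using the “Score” matrix computed earlier:

```
R = corrcoef(Score);           % correlation matrix of the principal component scores
OffDiag = R - eye(size(R));    % should be ~0 everywhere off the diagonal
MaxCorrelation = max(abs(OffDiag(:)))
```

MaxCorrelation should be on the order of machine precision, confirming that the components are pairwise linearly uncorrelated.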
