In my experience, when you receive a new data set you often simply do not know where to start.
Visualization (ideally combined with a general data analysis tool) is always a good first step. It gives you ideas about possible further steps in knowledge discovery using more advanced computational or statistical techniques.
Most of the time one can arrange data in a matrix, with subjects in rows and variables in columns. In practice, I have found that it is sometimes counter-intuitive to decide what should be the subject and what should be the variable. Let's discuss this fundamental issue later and do some analysis first.
Let's start with a simple example, the iris data. It is already arranged in a matrix, with subjects in rows and variables in columns.
# load data
data(iris)
# see the data size
dim(iris)
## [1] 150 5
The iris data in R contains 150 rows and 5 variables. I am going to use only 4 of these 5 variables (you will learn why at the end of this post).
I find a heat map to be a great tool for starting a data analysis. It visualizes the data with heat colours and draws two clustering trees (dendrograms), one for the rows and one for the columns. Be careful: this tool does not work well for large data (by large I mean matrices with dimensions of more than 1000 x 1000). That's why I checked the dimensions first.
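The dimension check can be wrapped in a small guard. This is only a sketch: `heatmap_if_small` and its `max_dim` argument are hypothetical names made up for illustration, and reading the 1000 limit as "neither dimension above 1000" is an assumption.

```r
# Hypothetical helper (not part of base R): draw a heatmap only when
# neither dimension exceeds max_dim.
heatmap_if_small <- function(x, max_dim = 1000, ...) {
  x <- as.matrix(x)
  if (nrow(x) > max_dim || ncol(x) > max_dim) {
    message("Matrix is ", nrow(x), " x ", ncol(x), "; skipping the heatmap.")
    return(invisible(FALSE))
  }
  heatmap(x, ...)
  invisible(TRUE)
}

# The four numeric iris columns are far below the limit, so the plot is drawn.
data(iris)
drawn <- heatmap_if_small(iris[, 1:4], cexRow = 0.5, cexCol = 0.8)
```

Extra arguments such as cexRow and cexCol are passed straight through to heatmap, so the guard can stand in for a direct call.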
To run the heatmap command on the iris data I have to use the as.matrix function, since the iris data is stored as a data.frame, whereas heatmap accepts only a numeric matrix. The data.frame and matrix formats in R are quite similar. I suggest using the matrix format when all of your data are quantitative.
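A quick way to see what the conversion does is to compare the matrix built from the four measurement columns with the matrix built from the whole data frame. This is a small illustration; the variable names `measurements` and `everything` are my own.

```r
data(iris)

# Converting only the four measurement columns gives a numeric matrix.
measurements <- as.matrix(iris[, 1:4])

# Converting the whole data frame, including the non-numeric 5th column,
# forces every entry to character.
everything <- as.matrix(iris)

is.numeric(measurements)
is.character(everything)
dim(measurements)
```

Only the first matrix is suitable as input for heatmap.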
# heatmap data
heatmap(as.matrix(iris[, 1:4]), cexRow=0.5, cexCol=0.8)
You see three blocks of data on the right. This discovery is remarkable, because the iris data matrix does in fact contain measurements of three different plant species, and the heat map was never told so.
Truth about the iris data: rows 1 to 50 are measurements from the setosa category, 51 to 100 from versicolor, and 101 to 150 from virginica.
Question: Now think about why I excluded the 5th column from the analysis. (See below.)
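To compare the blocks with these known row ranges, one can rebuild the row tree by hand and cut it into three groups. The sketch below assumes the heatmap defaults of dist for distances and hclust for clustering. Choosing three groups with cutree is my own addition, not something heatmap does.

```r
data(iris)
x <- as.matrix(iris[, 1:4])

# Row and column trees, built with the same defaults heatmap uses.
row_tree <- hclust(dist(x))
col_tree <- hclust(dist(t(x)))

# Cut the row tree into three groups and cross-tabulate with the species.
groups <- cutree(row_tree, k = 3)
comparison <- table(cluster = groups, species = iris$Species)
print(comparison)
```

The closer each species sits to a single cluster row in the table, the better the blocks match the true categories.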
# 5th column
iris[ ,5]
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa setosa setosa setosa
## [19] setosa setosa setosa setosa setosa setosa
## [25] setosa setosa setosa setosa setosa setosa
## [31] setosa setosa setosa setosa setosa setosa
## [37] setosa setosa setosa setosa setosa setosa
## [43] setosa setosa setosa setosa setosa setosa
## [49] setosa setosa versicolor versicolor versicolor versicolor
## [55] versicolor versicolor versicolor versicolor versicolor versicolor
## [61] versicolor versicolor versicolor versicolor versicolor versicolor
## [67] versicolor versicolor versicolor versicolor versicolor versicolor
## [73] versicolor versicolor versicolor versicolor versicolor versicolor
## [79] versicolor versicolor versicolor versicolor versicolor versicolor
## [85] versicolor versicolor versicolor versicolor versicolor versicolor
## [91] versicolor versicolor versicolor versicolor versicolor versicolor
## [97] versicolor versicolor versicolor versicolor virginica virginica
## [103] virginica virginica virginica virginica virginica virginica
## [109] virginica virginica virginica virginica virginica virginica
## [115] virginica virginica virginica virginica virginica virginica
## [121] virginica virginica virginica virginica virginica virginica
## [127] virginica virginica virginica virginica virginica virginica
## [133] virginica virginica virginica virginica virginica virginica
## [139] virginica virginica virginica virginica virginica virginica
## [145] virginica virginica virginica virginica virginica virginica
## Levels: setosa versicolor virginica
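One way to approach the question is to look at the class of each column. A minimal check, using only base R:

```r
data(iris)

# Class of every column: four numeric measurements and one factor.
column_classes <- sapply(iris, class)
print(column_classes)

# The 5th column is a label with three levels, not a measurement.
is.factor(iris[, 5])
levels(iris[, 5])
```

The 5th column holds the species label itself, so it is not a quantitative variable and cannot go into a numeric matrix. Leaving it out also means the heat map has to find the three groups without being given the answer.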
Ecole Polytechnique de Montreal: Statistics, Machine Learning, and Data Mining Graduate Students Blog
Wednesday, March 11, 2015
First Step in Data Analysis!
Thanks, that is useful.