Principal component analysis
In this chapter we will show how to use the PCA method implemented in mdatools. We will also use the PCA examples to introduce some principles common to most of the other methods available in this toolbox (e.g. PLS, SIMCA, PLS-DA, etc.). This includes such things as model and result objects, showing performance statistics for models and results, validation, different kinds of plots, and so on.
Principal component analysis is one of several methods that decompose a data matrix X as $$X = TP' + E$$. Here $$P$$ is a matrix of unit vectors defined in the original variable space. These unit vectors form a new basis onto which all data points are projected. Matrix $$T$$ contains the coordinates of the projections in the new basis, and the product of the two matrices, $$TP'$$, represents the coordinates of the projections in the original variable space. Matrix $$E$$ contains the residuals: the differences between the positions of the projected data points and their original locations.
In terms of PCA, the unit vectors defining the new coordinate space are called loadings, and the coordinate axes oriented along the loadings are principal components (PCs). The coordinates of the data points projected onto the principal components are called scores.
There are several other methods, such as Projection Pursuit (PP) and Independent Component Analysis (ICA), that work in a similar way and result in the data decomposition shown above. The principal difference among the methods is how they find the orientation of the unit vectors. PCA finds them as the directions of maximum variance of the data points; in addition, all PCA loadings are orthogonal to each other. PP and ICA use other criteria to orient the basis vectors, and in ICA, for example, the vectors are not orthogonal.
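To make the decomposition concrete, here is a minimal base R sketch (using svd() purely as an illustration, not the way mdatools is implemented internally) that builds the scores, loadings and residuals for two components and checks that the loadings are orthonormal:

```r
# Illustration of X = TP' + E using base R SVD (not mdatools code)
set.seed(42)
X <- matrix(rnorm(50), 10, 5)
X <- sweep(X, 2, colMeans(X))     # mean-center the columns

s <- svd(X)
ncomp <- 2
P <- s$v[, 1:ncomp]               # loadings (P): orthonormal basis vectors
Tscores <- X %*% P                # scores (T): coordinates in the new basis
E <- X - Tscores %*% t(P)         # residuals (E): part not captured by 2 PCs

# loadings are orthonormal and the decomposition is exact
all.equal(crossprod(P), diag(ncomp))     # TRUE
all.equal(X, Tscores %*% t(P) + E)       # TRUE
```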
Several such methods, including ICA (and, in the future, PP), have been put under the PCA umbrella. This was done primarily to reduce the amount of code, since interpretation and analysis of the results these methods return are very similar. To select which method (algorithm) is used for the decomposition, there is a parameter Method that can be defined by the user, as will be shown in the examples below.
Models and results
In mdatools, any method for data analysis, such as PCA, PLS regression, SIMCA classification and so on, can create two types of objects: a model and a result. Every time you build a model you get a model object; every time you apply the model to a dataset you get a result object. Thus for PCA, the objects have classes pca and pcares respectively.
Each object includes a list of fields (e.g. loadings for a model; scores and explained variance for a result) and provides a number of methods for investigating them.
Model calibration
Let's see how this works using a simple example: the People data. The data consists of 32 objects (persons from Scandinavian and Mediterranean countries, 16 male and 16 female) and 12 variables (height, weight, shoe size, annual income, beer and wine consumption and so on). More information about the data can be found using ?people. We will first load the data matrix and split it into two subsets as follows:
library(mdatools)
data(people)
idx = seq(4, 32, 4)
cset = people[-idx, ]
tset = people[idx, ]
So cset is our calibration subset, which we will use to create a PCA model, and tset is the subset we will apply the calibrated model to. Now let's calibrate the model and show some information about the model object:
model = pca(cset, 10, scale = T, info = "People PCA model")
model = selectCompNum(model, 5)
print(model)
Here pca is the function that builds (calibrates) a PCA model and returns the model object. Function selectCompNum allows you to select an "optimal" number of components for the model. In our case we calibrate the model with 10 principal components (the second argument of pca()); however, after e.g. investigating the explained variance we found that 5 components is optimal. In this case we have two choices: either recalibrate the model using 5 components, or use the already calibrated model but "tell" it that 5 is the optimal number. With the latter, the model keeps all results calculated for 10 components but uses the selected number when necessary, for example when showing a residuals plot for the model, or when the PCA model is used in SIMCA classification.
Finally, function print prints the model object info:
PCA model (class pca)
Info:
People PCA model
Call:
pca(x = cset, 10, scale = T, info = "People PCA model")
Major fields:
$loadings - matrix with loadings
$eigenvals - eigenvalues for components
$ncomp - number of calculated components
$ncomp.selected - selected number of components
$center - values for centering data
$scale - values for scaling data
$cv - number of segments for cross-validation
$alpha - significance level for Q residuals
$calres - results (scores, etc) for calibration set
As you can see, there are no scores, explained variance values, residuals and so on, because they are not actually part of a PCA model; they are results of applying the model to the calibration set. Loadings, eigenvalues, the number of calculated and selected principal components, the vectors for centering and scaling the data, the number of segments for cross-validation and the significance level, however, are model fields:
model$loadings[1:4, 1:4]
Comp 1 Comp 2 Comp 3 Comp 4
Height -0.3792846 0.08004057 -0.06676611 0.04512380
Weight -0.3817929 0.08533809 -0.08527883 -0.04051629
Hairleng 0.3513874 -0.22676635 -0.02273504 0.01575716
Shoesize -0.3776985 0.12503739 -0.02117369 0.09929010
One can also notice that the model object has a particular field, calres, which is in fact a PCA result object containing the results of applying the model to the calibration set. If we look at the object description we get the following:
print(model$calres)
Results for PCA decomposition (class pcares)
Major fields:
$scores - matrix with score values
$T2 - matrix with T2 distances
$Q - matrix with Q residuals
$ncomp.selected - selected number of components
$expvar - explained variance for each component
$cumexpvar - cumulative explained variance
And if we want to look at scores, here is the way:
model$calres$scores[1:4, 1:4]
Comp 1 Comp 2 Comp 3 Comp 4
Lars -5.108742 -1.2714943 1.0765871 1.08910438
Peter -3.021811 -0.3163758 -0.2958259 -1.36053121
Rasmus -2.887335 -0.4428721 0.1231706 -1.15070563
Mette 1.116457 -1.3716444 -1.6344512 -0.03803356
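In fact, the scores are just the projections of the preprocessed data onto the loadings, so they can be reproduced by hand from the model fields listed earlier. This is a sketch, assuming model$center and model$scale hold the mean and standard deviation vectors used for autoscaling:

```r
# reproduce the calibration scores manually from the model fields
# (assumes model$center/model$scale are the preprocessing vectors)
X <- scale(cset, center = model$center, scale = model$scale)
scores.manual <- X %*% model$loadings
scores.manual[1:4, 1:4]   # should match model$calres$scores shown above
```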
Both model and result objects also have related functions, first of all for visualizing various values (e.g. scores plot, loadings plot, etc.). Some of these functions are discussed later in this chapter; a full list can be found in the help text for the corresponding method.
Applying a model to new data
A result object is also created every time you apply a model to new data. As with many built-in R methods, predict() is used for this. The first argument of the method is always a model object. Here is a PCA example (assuming we have already built the model):
res = predict(model, tset)
print(res)
Results for PCA decomposition (class pcares)
Major fields:
$scores - matrix with score values
$T2 - matrix with T2 distances
$Q - matrix with Q residuals
$ncomp.selected - selected number of components
$expvar - explained variance for each component
$cumexpvar - cumulative explained variance
Model validation
Any model can be validated with cross-validation and/or test set validation. The validation results are, of course, represented by result objects, which are fields of a model object similar to calres, but with the names cvres and testres respectively.
Here is how to build a PCA model with both full cross-validation and test set validation (we will use tset as test data):
model = pca(cset, 10, scale = T, cv = 1, x.test = tset, info = "PCA model")
model = selectCompNum(model, 5)
Parameter cv specifies the options for cross-validation. If a numeric value is provided, it is used as the number of segments for random cross-validation, e.g. cv = 2 means cross-validation with two segments. For full (leave-one-out) cross-validation use cv = 1, as we did in the example above. For more advanced options you can provide a list with the name of the cross-validation method, the number of segments and the number of iterations, e.g. cv = list('rand', 4, 4) for random cross-validation with four segments and four repetitions.
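For instance, the model above could equally well be validated with random cross-validation using the list form of cv (a sketch, reusing the cset subset from before):

```r
# same model, but with random cross-validation: 4 segments, 4 repetitions
model2 = pca(cset, 10, scale = T, cv = list('rand', 4, 4), info = "PCA model")
model2 = selectCompNum(model2, 5)
summary(model2)
```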
And here is the model object info:
print(model)
PCA model (class pca)
Call:
pca(x = cset, scale = T, cv = 1, x.test = tset)
Major fields:
$loadings - matrix with loadings
$eigenvals - eigenvalues for components
$ncomp - number of calculated components
$ncomp.selected - selected number of components
$center - values for centering data
$scale - values for scaling data
$cv - number of segments for cross-validation
$alpha - significance level for Q residuals
$calres - results (scores, etc) for calibration set
$cvres - results for cross-validation
$testres - results for test set
As you can see, we now have all three types of results: calibration (calres), cross-validation (cvres) and test set validation (testres). Let us compare, for example, the explained variance values for the results:
cbind(model$calres$expvar, model$cvres$expvar, model$testres$expvar)
[,1] [,2] [,3]
Comp 1 54.23864691 43.30674751 44.8183978
Comp 2 20.28232983 20.99943694 17.1681782
Comp 3 13.10288913 14.70622977 16.9932063
Comp 4 7.88223195 13.00109104 8.0441691
Comp 5 2.26858962 3.57562285 4.4311656
Comp 6 1.14031431 2.01785501 2.4012850
Comp 7 0.47959862 0.76529214 0.7288616
Comp 8 0.24185482 0.05667122 0.5221894
Comp 9 0.22088015 0.77905369 0.5064012
Comp 10 0.11338134 0.65262040 0.2456456
Comp 11 0.02928332 0.13937943 0.3145322
Comp 12 0.00000000 0.00000000 3.8259682
Every model and every result has a method summary(), which shows some statistics for evaluating model performance. Here are some examples.
summary(model)
PCA model (class pca) summary
Info:
PCA model
Eigvals Expvar Cumexpvar
Comp 1 6.509 54.24 54.24
Comp 2 2.434 20.28 74.52
Comp 3 1.572 13.10 87.62
Comp 4 0.946 7.88 95.51
Comp 5 0.272 2.27 97.77
Comp 6 0.137 1.14 98.92
Comp 7 0.058 0.48 99.39
Comp 8 0.029 0.24 99.64
Comp 9 0.027 0.22 99.86
Comp 10 0.014 0.11 99.97
summary(model$calres)
Summary for PCA results
Selected components: 5
Expvar Cumexpvar
Comp 1 54.24 54.24
Comp 2 20.28 74.52
Comp 3 13.10 87.62
Comp 4 7.88 95.51
Comp 5 2.27 97.77
Comp 6 1.14 98.92
Comp 7 0.48 99.39
Comp 8 0.24 99.64
Comp 9 0.22 99.86
Comp 10 0.11 99.97
The same methodology is used for any other method, e.g. PLS or SIMCA. In the next section we will look at how to use plotting functions for models and results.
Plotting tools
Every model and result class has a number of functions for visualization. Thus for PCA the list includes scores and loadings plots, explained variance and cumulative explained variance plots, a T2 vs. Q residuals plot and many others.
A function that does the same job for different models and results always has the same name. For example, plotPredictions shows a predicted vs. measured plot for a PLS model and PLS result, an MLR model and MLR result, a PCR model and PCR result, and so on. The first argument must always be a model or a result object.
The major difference between plots for a model and plots for a result is the following. A plot for a result always shows one set of data objects: one set of points, lines or bars, for example predicted vs. measured values for a calibration set, or score values for a test set. A plot for a model in most cases shows several sets of data objects, e.g. predicted values for calibration and validation.
Let's call the first type a normal plot and the second type a group plot. If a plot needs a legend, it is a group plot; otherwise it is a normal plot. Both normal and group plots can be scatter, line or bar plots.
In this package the functionality of normal plots is implemented in the function mdaplot, and group plots are made with the function mdaplotg. This means that every time you make a plot for a model or a result, e.g. with plotScores or plotPredictions, either mdaplot or mdaplotg is used. It is not necessary to use these functions directly (although they provide extra tools compared with the standard R plotting functions, so it may be a good idea anyway), but it is useful to look at their help text and learn their parameters.
Some of the properties and parameters are described in the following subsections. The same PCA example will be used to show how they work.
Plotting options for result objects (normal plots)
A result object in most cases uses plots of the normal type. Besides general settings similar to the standard R function plot (marker type, line type and size, etc.), a normal plot can show vertical and horizontal lines with automatic adjustment of the axis limits, show labels for data objects (points or bars) and do color grouping with a color bar legend. For more details, please see ?mdaplot.
Most of the parameters of the standard R plot function, such as pch, lty, lwd, xlab, ylab, main, xlim, ylim, and many others can be used. All plots have default values for all parameters, including limits, labels and the plot title; however, if the user specifies any of these, the default values are overwritten.
data(people)
model = pca(people, scale = T)
res = model$calres
par(mfrow = c(1, 2))
plotScores(res, xlab = 'PC1', ylab = 'PC2', main = 'Scores plot')
plotScores(res, pch = 18, col = 'red')
The following additional parameters can be used (we assume that the object res is already available).
Labels
Parameters show.labels and labels can be used to show labels for the data objects on a plot. The first is a logical parameter that turns the labels on and off. Parameter labels is a vector of labels for the data objects; if it is not specified, row or column names are used instead. Here is an example with the scores plot of the People data:
# make manual labels for scores
slabels = paste('Obj', 1:nrow(people))
par(mfrow = c(1, 2))
plotScores(res, show.labels = T)
plotScores(res, show.labels = T, labels = slabels)
Color groups
It is possible to make color groups of points or lines using a vector of values, one per data object, specified with parameter cgroup. Up to eight groups will be made; if there are more than eight unique values in the vector, the whole range is divided into eight equal pieces. A color bar legend, showing which color corresponds to a particular range, can be toggled with the logical parameter show.colorbar. Here is an example coloring scores plots by the height of the people:
par(mfrow = c(1, 2))
plotScores(res, cgroup = people[, 'Height'])
plotScores(res, cgroup = people[, 'Height'], show.colorbar = F)
By default a built-in color palette, based on a diverging scheme from colorbrewer2.org, is used. Alternatively, a gray or user-defined palette can be applied using parameter colmap. If the user specifies their own colors, the palette is generated as a gradient among them (from two to eight colors can be specified):
par(mfrow = c(1, 2))
plotScores(res, cgroup = people[, 'Height'], colmap = 'gray')
plotScores(res, cgroup = people[, 'Height'], colmap = c('red', 'green', 'blue'))
From version 0.7.0 it is also possible to use a factor with text level labels to make the groups, as shown in the example below.
g = factor(people[, 'Sex'], labels = c('Male', 'Female'))
plotScores(res, cgroup = g)
The color grouping option is not available for the group (model) plots, as there the colors are used to distinguish the groups.
Plotting options for model objects
A model object in most cases uses plots of the group type. The general settings, such as marker type, line type and size, color, etc., work similarly to the standard R function matplot: each can be either a single value (the same for all groups) or a vector of values (one for each group). A group plot can show a legend and most of the things a normal plot can, except color grouping. For more details, please see ?mdaplotg.
As in the standard R matplot function, the parameters pch, lty, lwd, xlab, ylab, main, xlim, ylim, and many others can be used. All plots have default values for all parameters, including limits, labels and the plot title; however, if the user specifies any of these, the default values are overwritten.
data(people)
model = pca(people, scale = T, cv = 1)
model = selectCompNum(model, 5)
par(mfrow = c(1, 2))
plotResiduals(model, main = 'Residuals plot', pch = 15)
plotResiduals(model, pch = c(18, 15), col = c('red', 'green'))
The following additional parameters can be used (we assume that the object model is already available).
Labels
The same parameters as for normal plots, show.labels and labels, can be used to show labels for the data objects. However, here parameter labels should be a matrix of values, one column for each group. Normally the labels are generated automatically, based on the names of the objects and variables:
par(mfrow = c(1, 2))
plotResiduals(model)
plotResiduals(model, show.labels = T)
Legend
The legend is generated automatically, but the user can change its position. Logical parameter show.legend turns the legend on (default) and off:
par(mfrow = c(1, 2))
plotResiduals(model, pch = c(15, 17))
plotResiduals(model, pch = c(15, 17), show.legend = F)
Parameter legend.position allows changing the position of the legend (the possible values are: topright, top, topleft, bottomright, bottom, bottomleft):
par(mfrow = c(1, 2))
plotResiduals(model, legend.position = 'bottomright')
plotResiduals(model, legend.position = 'top')
Color maps
As in normal plots, there are two built-in color maps: the default one, based on colorbrewer2.org (colmap = "default"), and grayscale (colmap = "gray"). Similarly, a user can define their own color map, e.g. colmap = c("red", "green").
Colors from the map are used for visual separation of the groups of points, lines and bars. It is not possible to colorize the plot elements by values, as with cgroup in normal plots.
Common options for both types of plots
Axes and other lines
Logical parameter show.axes shows axis lines crossing the origin (0, 0). The parameter is available for scores and loadings scatter plots. There are also some plot-specific parameters for turning lines on a plot on and off, e.g. show.limits for the significance limits on a residuals plot or show.line for the target line on a predicted vs. measured plot. If a function shows some specific lines on a plot, look at its help text to find how to turn this option off.
par(mfrow = c(2, 2))
plotScores(model)
plotScores(model, show.axes = F)
plotResiduals(model)
plotResiduals(model, show.limits = F)
Grid
By default the grid is shown on most plots; if you want to get rid of it, use the logical parameter show.grid.
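For example, reusing the model object from above to compare a scores plot with and without the grid:

```r
par(mfrow = c(1, 2))
plotScores(model)                  # grid is shown by default
plotScores(model, show.grid = F)   # same plot without the grid
```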
Plot type
Many plots allow different representations depending on the plot type. For example, plotLoadings by default shows a scatter plot with loadings for the first and second principal components. But if the plot type is changed to bar or line, it shows the loading values vs. the variable number:
par(mfrow = c(2, 2))
plotLoadings(model)
plotLoadings(model, type = "h")