Quick start guide
This document gives a brief introduction to the basic functionality of mdatools toolbox as well as presents some ideas about how it deals with data, models and results. It is assumed that after reading the text and doing all exercises from the document one can easily start working with toolbox and then learn more features gradually.
Introduction to datasets
One of the most important things in the MDA toolbox is a dataset object. MATLAB is a great to deal with matrices and arrays, however one has to do a lot of routine operations to represent e.g. matrix values properly. Usually we have names for variables and objects or labels for measurements in our data. However, dealing with the names and labels in MATLAB is not an easy job. One of the ways to solve this issue is to use dataset - a specific object, which is a wrapper for a conventional numeric matrix, giving possibilities to use names more or less easily. Such objects exist in many toolboxes (including Statistics Toolbox) and the mdatools is not an exception here. Moreover, in mdatools dataset is a main way to represent any values. Scores, loadings, residuals, regression coefficients and so on are datasets in mdatools. Therefore is important to start with introduction what dataset is, how to create and manipulate datasets.
Actually in mdatools datasets give a lot more options than just a possibility to have names for rows and columns of a matrix. They allow to hide rows and columns without removing them (for example when one need to remove an outlier or do variable selection), to use qualitative data, factors, for grouping values and do many other things. But in this quick guide we will talk only about most important features of datasets, namely names/labels for rows and columns, making subsets, displaying data values, doing mathematical calculations and making plots.
Dataset can be created from any matrix (array with two dimensions: rows and columns). Actually it is not necessary to provide names for the dimensions. Column names will be generated automatically and row names will remain empty. Here is an example for 3x2 matrix with height and weight values for four persons:
values = [180 84; 170 68; 165 71; 172 75];
d = mdadata(values);
show(d)
Variables
1 2
---- ---
180 84
170 68
165 71
172 75
`
Function show()
displays data values as a table. By default it uses three significant figures but this can be changed by providing extra argument, e.g. show(d, 5)
.
To specify names for rows and objects one can provide them as a second and third arguments of the mdadata()
method. The names can be either cell arrays with text values or numeric vectors. Numbers will be converted to text automatically.
values = [180 84; 170 68; 165 71; 172 75];
d = mdadata(values, 1:4, {'Height', 'Weight'});
show(d)
Variables
Height Weight
------- -------
1 180 84
2 170 68
3 165 71
4 172 75
It is mandatory that row and column names are unique. It is recommended also not to use spaces and other special symbols, especially for column names, to avoid ambiguity. Actually the names may have two forms: full, with spaces and special symbols and short, with only letters and numbers. If one provides names with spaces and special symbols they will be converted to the short form automatically. More on that can be found in the User Guide.
The mdadata
is a MATLAB object which has several properties and many methods. You can see some of the properties by using disp()
.
disp(d)
mdadata handle
Properties:
name: ''
info: []
dimNames: {'Objects' 'Variables'}
values: [4x2 double]
nCols: 2
nRows: 4
nFactors: 0
rowNames: {4x1 cell}
colNames: {'Height' 'Weight'}
rowFullNames: {4x1 cell}
colFullNames: {'Height' 'Weight'}
The most important ones are values
, which is a matrix with data values, rowNames
- cell array with row names and colNames
- cell array with column names. All three can be changed manually for the whole object or for particular rows or columns. You can also specify a name for the dataset, short information text and labels for each of the two dimensions. Here are some examples:
d.rowNames = {'Lars', 'Peter', 'Anna', 'Kim'};
show(d)
Variables
Height Weight
------- -------
Lars 180 84
Peter 170 68
Anna 165 71
Kim 172 75
d(1, :).rowNames = 'Mike';
show(d)
Variables
Height Weight
------- -------
Mike 180 84
Peter 170 68
Anna 165 71
Kim 172 75
d.name = 'People';
d.info = 'People data for quick start guide';
d.dimNames = {'Persons', 'Parameters'};
show(d)
People:
People data for quick start guide
Parameters
Height Weight
------- -------
Mike 180 84
Peter 170 68
Anna 165 71
Kim 172 75
You can subset datasets using the same way as with matrices: by specifying indices for rows and columns. All special names and symbols, like : and end will work properly. Alternatively column and row names can be used for the same purpose.
show(d(1:2, :))
People:
People data for quick start guide
Parameters
Height Weight
------- -------
Mike 180 84
Peter 170 68
show(d({'Mike', 'Anna'}, 'Height'))
People:
People data for quick start guide
Height
-------
Mike 180
Anna 165
The mdadata
class has most of the standard mathematical and statistical methods overridden. This means that you can work with datasets just as with conventional matrices in MATLAB. Result of any operation is also a dataset (object of class mdadata
). For example, let's calculate BMI index for our data values.
bmi = d(:, 'Weight') ./ (d(:, 'Height') / 100) .^ 2;
bmi.colNames = 'BMI';
show(bmi)
BMI
-----
Mike 25.9
Peter 23.5
Anna 26.1
Kim 25.4
Simple plots
The mdadata
also overrides some plotting methods, including scatter()
, plot()
, bar()
and several others. Besides that, statistical plots, such as hist()
, boxplot()
and qqplot()
. It means that if one provided an mdadata
object as a first argument for these functions, a specially written version will be used instead of conventional MATLAB method. Thus to make a scatter plot one has to provide a dataset with one or two columns. If more than two are available, scatter()
method will ignore them.
figure
subplot(1, 2, 1)
scatter(d)
subplot(1, 2, 2)
plot(d)
As you can see the labels for axes, ticks, as well as title for the plot were set using dataset names. Color of data points, lines and bars are selected automatically but one can specify these and several other most important parameters for each plot. There are also additional options, allowing, for example, color grouping of data points and lines according to a vector of values. Look at description of plotting methods for the mdadata
class for details. One of the most useful option is a possibility to show labels for data points or bars. Labels can be names ('names'
), numbers ('numbers'
) or values ('values'
, this can be used only with bar plot).
figure
subplot(1, 2, 1)
scatter(d, 'Marker', 'd', 'Color', 'g', 'Labels', 'names')
subplot(1, 2, 2)
bar(d('Mike', :), 'FaceColor', 'b', 'Labels', 'values')
Univariate statistics
There are several statistic methods also available for the mdadata datasets. To demonstrate this we will use a subset of dataset People, which is provided with the toolbox. In the dataset there are values for 32 persons from scandinavian and medditeranian regions (50% males, 50% females). Here are some examples.
load('people')
d = people(:, {'Height', 'Weight', 'Shoesize'});
show( d(1:5, :) )
People:
People dataset
Variables
Height Weight Shoesize
------- ------- ---------
Lars 198 92 48
Peter 184 84 44
Rasmus 183 83 44
Lene 166 47 36
Mette 170 60 38
show( mean(d) )
Variables
Height Weight Shoesize
------- ------- ---------
Mean 173 64.5 39.9
show( std(d) )
Variables
Height Weight Shoesize
------- ------- ---------
Stdev 10.1 15.2 3.9
show( se(d) )
Variables
Height Weight Shoesize
------- ------- ---------
Std. error 1.78 2.69 0.689
show( percentile(d, 25) )
Percentiles:
Variables
Height Weight Shoesize
------- ------- ---------
25% 164 50 36
show( summary(d) )
Summary statistics:
Variables
Height Weight Shoesize
------- ------- ---------
Min 157 46 34
Q1 164 50 36
Median 174 64.5 40
Mean 173 64.5 39.9
Q3 180 80.5 43
Max 198 92 48
As well as several statistical plots.
figure
subplot(2, 2, 1)
hist( d(:, 'Height') )
subplot(2, 2, 2)
qqplot( d(:, 'Height') )
subplot(2, 2, 3)
boxplot( d )
We hope that this brief overview of mdadata
class gave an overall impression on how it works and how to use it for storing and visualisation of data values. To learn more, please, look at the User Guide and full description of the mdadata
class and its methods.
Principal component analysis
The next step is to learn how to build and use models in mdatools
. We will employ PCA to demonstrate the most important things, as we believe it is most known, and then will show some peculiarities and issues on how this methodology works with regression models.
The basic idea behind creating and using any model is following. For most of the methods, mdatools
has two classes (objects). One for model, that can be calibrated using this method, and one for result of applying this model to any dataset(s). The first (model) object has properties related to the model only. The second (result) object, contains properties related to the results. Thus for PCA model contains: loadings and their eigenvalues, number of components, which preprocessing methods to use and so on. The PCA result object mainly contains: scores, variance, and Q2/T2 residuals.
Of course the results of calibration, cross-validation or test-set validation are also part of a model. Therefore any model may have three result objects, as the model properties. One for calibration set (calres
) is always exist, since it is not possible to make a model without calibration set. The other two, cvres
and testres
, are optional and are empty if none of the validation methods is used.
Any model and result object has also methods for making various plots and showing statistics. Thus method plotscores()
for result object shows scores only for particular results. But the same plotscores()
method being called for a model will show scores for each of the results available. The same for, e.g. explained variance. Method plotexpvar()
shows explained variance for each component either for particular result or for the whole model (calibration, test and cross-validation if last two are available).
Let's look how to make and explore PCA model and results using the People data.
load('people')
m = mdapca(people, 6, 'Scale', 'on');
disp(m)
mdapca handle
Properties:
info: []
nComp: 6
loadings: [12x6 mdadata]
eigenvalues: [6x1 mdadata]
prep: [1x1 prep]
alpha: 0.05
cv: []
calres: [1x1 pcares]
cvres: []
testres: []
limits: [2x6 mdadata]
method: 'svd'
disp(m.calres)
pcares handle
Properties:
info: 'Results for calibration set'
scores: [32x6 mdadata]
variance: [6x2 mdadata]
modpower: [32x6 mdadata]
T2: [32x6 mdadata]
Q2: [32x6 mdadata]
As you can see, indeed the m object has mdapca
class and contains, among others, loadings of the principal component space. It also has three objects with results: calres
, cvres
and testrest
but the last two are empty, since we did not use any validation here.
For cross-validation one need to specify parameter 'CV'
with cell array as a value. The first element of the cell array is how to split data. Possible values are 'rand'
for random splits, 'ven'
for systematic (venetian blinds) splits and 'full'
for full cross-validation (leave-one-out). For the first two one can specify a second value which is number of segments to split the data into. Finally, for the random splits, we can also specify a number of repetitions. Thus the following example will make cross-validation with random splits to eight segments and four repetitions.
load('people')
m = mdapca(people, 6, 'Scale', 'on', 'CV', {'rand', 8, 4});
disp(m)
mdapca handle
Properties:
info: []
nComp: 6
loadings: [12x6 mdadata]
eigenvalues: [6x1 mdadata]
prep: [1x1 prep]
alpha: 0.05
cv: {'rand' [8] [4]}
calres: [1x1 pcares]
cvres: [1x1 pcares]
testres: []
limits: [2x6 mdadata]
method: 'svd'
Now model contains two types of results which are not empty - calres
and cvres
.
How to explore models and results? All numerical values are available as mdadata
objects, so if you want to look at e.g. scores for the first five rows of calibration set, just use
show(m.calres.scores(1:5, :))
Scores:
Components
Comp 1 Comp 2 Comp 3 Comp 4 Comp 5 Comp 6
------- ------- ------- ------- ------- -------
Lars -5.33 -0.677 1.07 1.1 -1.06 0.017
Peter -3.11 -0.293 -0.671 -1.31 -0.435 -0.119
Rasmus -3 -0.36 -0.212 -1.12 -0.204 0.0166
Lene 1.08 -1.84 -0.409 -0.123 1.32 -0.872
Mette 0.981 -1.43 -1.65 0.526 -0.714 0.0346
However it is much easier with plots. Here is how to show summary and plot overview for a PCA model:
summary(m)
figure
plot(m)
Eigenvalues Expvar Cumexpvar Expvar (CV) Cumexpvar (CV)
------------ ------- ---------- ------------ ---------------
Comp 1 6.43 53.6 53.6 45.5 45.5
Comp 2 2.24 18.7 72.3 17.8 63.3
Comp 3 1.62 13.5 85.7 14.4 77.7
Comp 4 0.998 8.32 94.1 13.3 91
Comp 5 0.319 2.66 96.7 3.36 94.3
Comp 6 0.165 1.38 98.1 1.87 96.2
If one need to look at the scores and loadings for other set of components, just specify this as a second argument:
figure
plot(m, [1 3])
Examples of how to use scores and loadings with extra options
figure
subplot(2, 2, 1)
plotscores(m, 1, 'Labels', 'names')
subplot(2, 2, 2)
plotscores(m, [1 3], 'Marker', 's', 'Color', 'g')
subplot(2, 2, 3)
plotloadings(m)
subplot(2, 2, 4)
plotloadings(m, [1 3], 'Labels', 'names')
Scores are not calculated for the cross-validated results, so they are not shown on the plot. In the current version scores for model can be plotted as scatter or density scatter plot. Loadings can be shown as scatter, line and bar plot.
figure
subplot(2, 1, 1)
plotloadings(m, 1, 'Type', 'line', 'Marker', '.')
subplot(2, 1, 2)
plotloadings(m, [1 2], 'Type', 'bar')
Examples for residuals:
figure
subplot(1, 2, 1)
plotresiduals(m, 'Labels', 'names', 'Marker', 's')
subplot(1, 2, 2)
plotresiduals(m, 2, 'Labels', 'names', 'Marker', 'sdo', 'Color', 'rgb')
Since cross-validated values can be shown on residuals plot (as well as test set results) here we need to specify color or/and marker either one for all results, as it is done in first plot, or three (one for each type) as in the second plot.
One can also make similar plots for any results. One important feature of e.g. scores and residuals plots for a particular type of results is that they can be colorised by any vector of values. For model plots this option is not available, since color is used to separate type of results, but here we can do it easily using specific options.
figure
subplot(1, 2, 1)
plotscores(m.calres, 'Labels', 'names')
subplot(1, 2, 2)
plotscores(m.calres, 'Labels', 'names', 'Colorby', people(:, 'Beer'))
Thus in the second plot data points are colorised according to annual beer consumption by the persons and one can also see a colobar with legend.
More details about PCA model and result objects and methods can be found in class descriptions (mdapca
and pcares
).
Working with images
The mdatools may work naturally with images. Image can be represented as a 2-way dataset by unfolding 3-way cube, so all pixels become rows (objects) and all channels — columns (variables). In mdatools there is a specific object to work with images, mdaimage
. It is based on mdadata
class and inherits all its properties and methods. So all examples above will also work with mdaimage
obects.
However there are also some important things to know. First of all, image has no row names, since number of pixels is very large, it would slow manipulations with such objects down if we used names. Second difference is when you subset an mdaimage
you have to use three indices: width, height and channels. Finally mdaimage
has an extra method imagesc()
allowing to show an image for any channel. Let's play with that:
img = imread('test.jpg');
img = mdaimage(img, {'Red', 'Green', 'Blue'});
disp(img)
mdaimage handle
Properties:
width: 400
height: 285
image: [285x400x3 double]
name: ''
info: []
dimNames: {'Pixels' 'Channels'}
values: [114000x3 double]
nCols: 3
nRows: 114000
nFactors: 0
rowNames: {}
colNames: {'Red' 'Green' 'Blue'}
rowFullNames: {}
colFullNames: {'Red' 'Green' 'Blue'}
Show color values for 3x3 pixels from left top corner:
show(img(1:3, 1:3, :))
Channels
Red Green Blue
---- ------ -----
183 80 101
191 76 105
187 73 99
187 77 102
185 75 102
183 73 100
183 73 100
182 72 97
223 119 142
Method imagesc
shows images for separate channels. If it is needed to show a color image, use .image
property, but do not forget to scale intensities, since all values of mdaimage
are double.
figure
subplot(2, 2, 1)
imagesc(img(:, :, 'Red'))
title('Red')
subplot(2, 2, 2)
imagesc(img(:, :, 'Green'))
title('Green')
subplot(2, 2, 3)
imagesc(img(:, :, 'Blue'))
title('Blue')
subplot(2, 2, 4)
imshow(img.image/255)
title('Color image')
colormap(gray)
Since image is just an extension of mdadata
it can be treated as a just a dataset, e.g. here is how to make a conventional and density scatter plot.
figure
subplot 121
scatter( img(:, :, {'Red', 'Blue'}));
subplot 122
densscatter( img(:, :, {'Red', 'Blue'}));
When one make a PCA model for an object of mdaimage
class, all results for objects (pixels), such as scores and residuals will be automatically converted to mdaimage
objects as well. It means, we can make scatter image for particular component.
m = mdapca(img);
figure
subplot(1, 2, 1)
plotscores(m.calres, 'Type', 'densscatter');
subplot(1, 2, 2)
imagesc(m.calres.scores(:, :, 2));