Factors and groups
The dataset class lets you mark one or several columns as factors. A factor is a qualitative variable: it has a fixed set of values (levels) and normally cannot be treated as a quantitative variable.
Factors can be used to split datasets, combine data values into groups, and calculate statistics and show plots for the groups. In addition, one can calculate qualitative statistics for factors, such as frequencies, contingency tables, the chi-square test for association and so on. All arithmetic operators and functions, as well as methods for quantitative statistics, ignore factors in calculations.
To add a factor, the dataset must have a column with discrete numeric values, such as the variables Sex and Region in the People data. It is possible to define text labels for each of the levels. Keep level names as simple as possible and avoid spaces and other special symbols. The column name of a factor is marked with an asterisk when the dataset is displayed.
load people
d = people(1:8, :);
show(d)
People:
People dataset
Variables
Height Weight Hairleng Shoesize Age Income Beer Wine Sex Swim Region IQ
------- ------- --------- --------- ---- -------- ----- ----- ---- ----- ------- ----
Lars 198 92 -1 48 48 4.5e+04 420 115 -1 98 -1 100
Peter 184 84 -1 44 33 3.3e+04 350 102 -1 92 -1 130
Rasmus 183 83 -1 44 37 3.4e+04 320 98 -1 91 -1 127
Lene 166 47 -1 36 32 2.8e+04 270 78 1 75 -1 112
Mette 170 60 1 38 23 2e+04 312 99 1 81 -1 110
Gitte 172 64 1 39 24 2.2e+04 308 91 1 82 -1 102
Jens 182 80 -1 42 35 3e+04 398 65 -1 85 -1 140
Erik 180 80 -1 43 36 3e+04 388 63 -1 84 -1 129
Let us mark 'Sex' as a factor and display the data again:
d.factor('Sex');
show(d)
People:
People dataset
Variables
Height Weight Hairleng Shoesize Age Income Beer Wine * Sex Swim Region IQ
------- ------- --------- --------- ---- -------- ----- ----- ------ ----- ------- ----
Lars 198 92 -1 48 48 4.5e+04 420 115 -1 98 -1 100
Peter 184 84 -1 44 33 3.3e+04 350 102 -1 92 -1 130
Rasmus 183 83 -1 44 37 3.4e+04 320 98 -1 91 -1 127
Lene 166 47 -1 36 32 2.8e+04 270 78 1 75 -1 112
Mette 170 60 1 38 23 2e+04 312 99 1 81 -1 110
Gitte 172 64 1 39 24 2.2e+04 308 91 1 82 -1 102
Jens 182 80 -1 42 35 3e+04 398 65 -1 85 -1 140
Erik 180 80 -1 43 36 3e+04 388 63 -1 84 -1 129
The column Sex is now marked with an asterisk. Next, let us mark two columns as factors and provide text labels for the levels:
d.factor('Sex', {'Male', 'Female'})
d.factor('Hairleng', {'Short', 'Long'})
show(d)
People:
People dataset
Variables
Height Weight * Hairleng Shoesize Age Income Beer Wine * Sex Swim Region IQ
------- ------- ----------- --------- ---- -------- ----- ----- ------- ----- ------- ----
Lars 198 92 Short 48 48 4.5e+04 420 115 Male 98 -1 100
Peter 184 84 Short 44 33 3.3e+04 350 102 Male 92 -1 130
Rasmus 183 83 Short 44 37 3.4e+04 320 98 Male 91 -1 127
Lene 166 47 Short 36 32 2.8e+04 270 78 Female 75 -1 112
Mette 170 60 Long 38 23 2e+04 312 99 Female 81 -1 110
Gitte 172 64 Long 39 24 2.2e+04 308 91 Female 82 -1 102
Jens 182 80 Short 42 35 3e+04 398 65 Male 85 -1 140
Erik 180 80 Short 43 36 3e+04 388 63 Male 84 -1 129
As you will see in the example below, Sex and Hairleng are now ignored in calculations.
show(mean(d))
show(d * 10)
Variables
Height Weight Shoesize Age Income Beer Wine Swim Region IQ
------- ------- --------- ----- --------- ----- ----- ----- ------- ----
Mean 179 73.8 41.8 33.5 3.02e+04 346 88.9 86 -1 119
Variables
Height Weight Shoesize Age Income Beer Wine Swim Region IQ
--------- ------- --------- ---- -------- --------- --------- ----- ------- ---------
Lars 1.98e+03 920 480 480 4.5e+05 4.2e+03 1.15e+03 980 -10 1e+03
Peter 1.84e+03 840 440 330 3.3e+05 3.5e+03 1.02e+03 920 -10 1.3e+03
Rasmus 1.83e+03 830 440 370 3.4e+05 3.2e+03 980 910 -10 1.27e+03
Lene 1.66e+03 470 360 320 2.8e+05 2.7e+03 780 750 -10 1.12e+03
Mette 1.7e+03 600 380 230 2e+05 3.12e+03 990 810 -10 1.1e+03
Gitte 1.72e+03 640 390 240 2.2e+05 3.08e+03 910 820 -10 1.02e+03
Jens 1.82e+03 800 420 350 3e+05 3.98e+03 650 850 -10 1.4e+03
Erik 1.8e+03 800 430 360 3e+05 3.88e+03 630 840 -10 1.29e+03
One can also convert a factor back to a quantitative variable by using the method notfactor().
d.notfactor('Sex');
% now 'Sex' is used for calculations again
show(d)
show(mean(d))
People:
People dataset
Variables
Height Weight * Hairleng Shoesize Age Income Beer Wine Sex Swim Region IQ
------- ------- ----------- --------- ---- -------- ----- ----- ---- ----- ------- ----
Lars 198 92 Short 48 48 4.5e+04 420 115 -1 98 -1 100
Peter 184 84 Short 44 33 3.3e+04 350 102 -1 92 -1 130
Rasmus 183 83 Short 44 37 3.4e+04 320 98 -1 91 -1 127
Lene 166 47 Short 36 32 2.8e+04 270 78 1 75 -1 112
Mette 170 60 Long 38 23 2e+04 312 99 1 81 -1 110
Gitte 172 64 Long 39 24 2.2e+04 308 91 1 82 -1 102
Jens 182 80 Short 42 35 3e+04 398 65 -1 85 -1 140
Erik 180 80 Short 43 36 3e+04 388 63 -1 84 -1 129
Variables
Height Weight Shoesize Age Income Beer Wine Sex Swim Region IQ
------- ------- --------- ----- --------- ----- ----- ------ ----- ------- ----
Mean 179 73.8 41.8 33.5 3.02e+04 346 88.9 -0.25 86 -1 119
Factors can be used to group your data according to combinations of the factor levels. The method getgroups() creates a dataset with binary values (0, 1) and one column for each combination of levels of the selected factors; in the example below only the three combinations that occur in the data get a column. Even though there is normally no need to call this method directly, it gives a good idea of how the splitting is done.
d = people(1:10, :);
d.factor('Sex', {'Male', 'Female'})
d.factor('Hairleng', {'Short', 'Long'})
show(d(:, {'Sex', 'Hairleng'}).getgroups())
Groups (Sex, Hairleng)
          Male, Short  Female, Short  Female, Long
         ------------ -------------- -------------
Lars                1              0             0
Peter               1              0             0
Rasmus              1              0             0
Lene                0              1             0
Mette               0              0             1
Gitte               0              0             1
Jens                1              0             0
Erik                1              0             0
Lotte               0              0             1
Heidi               0              0             1
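To make the splitting more concrete, here is a minimal sketch in plain MATLAB of how such a table of binary indicators could be built from two numeric factor columns. It does not use the dataset class and is not the actual implementation of getgroups(); the variable names (sex, hairleng, combos, groups) are chosen for illustration only, and the values are copied from the first ten rows of the People data shown earlier.

```matlab
% Numeric codes copied from the People data (first ten rows):
% Sex:      -1 = Male,  1 = Female
% Hairleng: -1 = Short, 1 = Long
sex      = [-1 -1 -1  1  1  1 -1 -1  1  1]';
hairleng = [-1 -1 -1 -1  1  1 -1 -1  1  1]';

% Every distinct row of [sex hairleng] is one combination of levels.
% idx(i) is the number of the combination that row i belongs to;
% 'stable' keeps the combinations in order of first appearance.
[combos, ~, idx] = unique([sex hairleng], 'rows', 'stable');

% One binary column per combination that occurs in the data.
groups = zeros(numel(idx), size(combos, 1));
for k = 1:size(combos, 1)
    groups(:, k) = (idx == k);
end

disp(combos)   % level codes of each combination
disp(groups)   % binary indicators, one row per person
```

Each row of groups contains exactly one 1, so every observation belongs to exactly one group.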
The getgroups() method is widely used in statistical and graphical methods. Here we show how to use groups to calculate quantitative statistics; graphical methods are discussed in the next section.
Quantitative statistics
The idea is rather simple: if one provides a dataset with one or several factors as the second argument of a statistical method, the statistics are calculated for all columns of the data, but separately for the rows belonging to each group. Here is an example:
people.factor('Sex', {'Male', 'Female'});
people.factor('Region', {'A', 'B'});
d = people(8:20, :);
show(d)
People:
People dataset
Variables
Height Weight Hairleng Shoesize Age Income Beer Wine * Sex Swim * Region IQ
------- ------- --------- --------- ---- --------- ----- ----- ------- ----- --------- ----
Erik 180 80 -1 43 36 3e+04 388 63 Male 84 A 129
Lotte 169 51 1 36 24 2.3e+04 250 89 Female 78 A 98
Heidi 168 52 1 37 27 2.35e+04 260 86 Female 78 A 100
Kaj 183 81 -1 42 37 3.5e+04 345 45 Male 90 A 105
Gerda 157 47 1 36 32 3.2e+04 235 92 Female 70 A 127
Anne 164 50 1 38 41 3.4e+04 255 134 Female 76 A 101
Britta 162 49 1 37 40 3.4e+04 265 124 Female 75 A 108
Magnus 180 82 -1 44 43 3.7e+04 355 82 Male 88 A 109
Casper 180 81 -1 44 46 4.2e+04 362 90 Male 86 A 113
Luka 185 82 -1 45 26 1.6e+04 295 180 Male 92 B 109
Federico 187 84 -1 46 27 1.65e+04 299 178 Male 95 B 119
Dona 168 50 1 37 49 3.4e+04 170 162 Female 76 B 135
Fabrizia 166 49 1 36 21 1.4e+04 150 245 Female 75 B 123
% ordinary use of mean: one value for each non-factor column
m = mean(d);
show(m)
People:
Variables
Height Weight Hairleng Shoesize Age Income Beer Wine Swim IQ
------- ------- --------- --------- ----- --------- ----- ----- ----- ----
Mean 173 64.5 0.0769 40.1 34.5 2.85e+04 279 121 81.8 114
% grouping factors are provided
m = mean(d, d(:, {'Sex'}));
show(m)
Mean for People:
Variables
Height Weight Hairleng Shoesize Age Income Beer Wine Swim IQ
------- ------- --------- --------- ----- --------- ----- ----- ----- ----
Male 182 81.7 -1 44 35.8 2.94e+04 341 106 89.2 114
Female 165 49.7 1 36.7 33.4 2.78e+04 226 133 75.4 113
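The grouped mean can be checked by hand. The following is a small sketch in plain MATLAB, assuming only that a group mean is the ordinary mean over the rows of that group; it does not use the dataset class, and the variable names are illustrative. The Height and Sex values are copied from rows 8 to 20 of the table above.

```matlab
% Height and Sex (-1 = Male, 1 = Female) for rows 8 to 20 of the People data
height = [180 169 168 183 157 164 162 180 180 185 187 168 166]';
sex    = [ -1   1   1  -1   1   1   1  -1  -1  -1  -1   1   1]';

% Logical indexing selects the rows of each group
meanMale   = mean(height(sex == -1));
meanFemale = mean(height(sex ==  1));

fprintf('Male:   %.1f\n', meanMale);
fprintf('Female: %.1f\n', meanFemale);
```

The two values agree, up to the three significant digits used by show(), with the Height column of the grouped table.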
If a method requires additional parameters, specify them after the dataset with factors.
p = percentile(d, d(:, 'Sex'), [25 75]);
show(p)
Percentiles for People:
Variables
Height Weight Hairleng Shoesize Age Income Beer Wine Swim IQ
------- ------- --------- --------- ----- --------- ----- ----- ----- ----
25% Male 180 80.5 -1 42.5 26.5 1.62e+04 297 54 85 107
75% Male 186 83 -1 45.5 44.5 3.95e+04 375 179 93.5 124
25% Female 162 49 1 36 24 2.3e+04 170 89 75 100
75% Female 168 51 1 37 41 3.4e+04 260 162 78 127
Several factors can be used at the same time.
s = ci(d, d(:, {'Sex', 'Region'}));
show(s)
Confidence intervals (95%) for People:
Variables
Height Weight Hairleng Shoesize Age Income Beer Wine Swim IQ
------- ------- --------- --------- ----- ---------- ----- ----- ----- -----
Lower Male, A 178 79.7 -1 41.7 32.9 2.81e+04 333 37.9 82.9 97.3
Upper Male, A 183 82.3 -1 44.8 48.1 4.39e+04 392 102 91.1 131
Lower Male, B 173 70.3 -1 39.1 20.1 1.31e+04 272 166 74.4 50.5
Upper Male, B 199 95.7 -1 51.9 32.9 1.94e+04 322 192 113 178
Lower Female, A 158 47.4 1 35.8 23.4 2.24e+04 239 77.3 71.3 92
Upper Female, A 170 52.2 1 37.8 42.2 3.62e+04 267 133 79.5 122
Lower Female, B 154 43.1 1 30.1 -143 -1.03e+05 32.9 -324 69.1 52.8
Upper Female, B 180 55.9 1 42.9 213 1.51e+05 287 731 81.9 205
The significance level can be given as an additional argument, for example 0.10 for 90% intervals.
s = ci(d, d(:, {'Sex', 'Region'}), 0.10);
show(s)
Confidence intervals (90%) for People:
Variables
Height Weight Hairleng Shoesize Age Income Beer Wine Swim IQ
------- ------- --------- --------- ------ ---------- ----- ------ ----- -----
Lower Male, A 179 80 -1 42.1 34.9 3.02e+04 341 46.3 84 102
Upper Male, A 183 82 -1 44.4 46.1 4.18e+04 384 93.7 90 126
Lower Male, B 180 76.7 -1 42.3 23.3 1.47e+04 284 173 84 82.4
Upper Male, B 192 89.3 -1 48.7 29.7 1.78e+04 310 185 103 146
Lower Female, A 159 48 1 36 25.6 2.4e+04 242 83.7 72.3 95.5
Upper Female, A 169 51.6 1 37.6 40 3.46e+04 264 126 78.5 118
Lower Female, B 161 46.3 1 33.3 -53.4 -3.91e+04 96.9 -58.5 72.3 91.1
Upper Female, B 173 52.7 1 39.7 123 8.71e+04 223 466 78.7 167
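The printed limits are consistent with the usual confidence interval for a mean based on Student's t-distribution, mean ± t * s / sqrt(n). This is an inference from the numbers above, not a statement about how ci() is implemented. The sketch below reproduces the Height interval for the group (Male, B), which contains only Luka and Federico. The t quantiles for one degree of freedom are hard-coded so that no toolbox is needed; the variable names are illustrative.

```matlab
% Height of the two males from region B (Luka and Federico)
x = [185 187];
n = numel(x);

m  = mean(x);             % group mean
se = std(x) / sqrt(n);    % standard error of the mean

% Student's t quantiles for n - 1 = 1 degree of freedom (hard-coded)
t95 = 12.7062;            % p = 0.975, used for the 95% interval
t90 = 6.3138;             % p = 0.95,  used for the 90% interval

ci95 = m + [-1 1] * t95 * se;
ci90 = m + [-1 1] * t90 * se;

fprintf('95%%: %.1f ... %.1f\n', ci95(1), ci95(2));
fprintf('90%%: %.1f ... %.1f\n', ci90(1), ci90(2));
```

With only two observations per group the t quantile is very large, which explains why the intervals for the groups from region B are so wide and can even contain negative values for Age or Income.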
Qualitative statistics
Factors can also be used to calculate qualitative statistics, including frequencies and relative frequencies (proportions) of factor levels, confidence intervals for proportions, contingency tables for a combination of two factors, the chi-square test for association of two factors, and standardized residuals for observed and expected frequencies.
Let us take a part of the People data in which the numbers of males and females are different.
load people
data = people(6:20, {'Sex', 'Region'});
data.factor('Sex', {'Male', 'Female'})
data.factor('Region', {'A', 'B'})
Here is an example of how to calculate a frequency table, which includes the observed frequencies for each level, the relative frequencies (proportions), and a confidence interval for the proportions. The optional second argument is the significance level (alpha) for the interval.
f = freq(data(:, 'Sex'));
show(f)
Observed frequencies:
                 Sex
               Male  Female
             ------ -------
Freq              7       8
Rel. Freq     0.467   0.533
Lower (95%)   0.214   0.281
Upper (95%)   0.719   0.786
f = freq(data(:, 'Sex'), 0.1);
show(f)
Observed frequencies:
                 Sex
               Male  Female
             ------ -------
Freq              7       8
Rel. Freq     0.467   0.533
Lower (90%)   0.255   0.321
Upper (90%)   0.679   0.745
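The limits in both tables agree with the normal-approximation interval for a proportion, p ± z * sqrt(p * (1 - p) / n). Again, this is inferred from the printed numbers and is not a description of the freq() source code. A minimal sketch in plain MATLAB, with illustrative variable names, looks like this:

```matlab
counts = [7 8];           % observed frequencies: Male, Female
n      = sum(counts);
p      = counts / n;      % relative frequencies (proportions)

alpha = 0.05;             % significance level, use 0.1 for 90% intervals

% Standard normal quantile for 1 - alpha/2, computed with base MATLAB only
z = -sqrt(2) * erfcinv(2 * (1 - alpha / 2));

se    = sqrt(p .* (1 - p) / n);   % standard error of each proportion
lower = p - z * se;
upper = p + z * se;

disp([p; lower; upper])
```

Setting alpha = 0.1 in the same code gives the narrower limits of the second table.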
For investigation of association between two factors one can calculate the contingency table.
ct = crosstable(data);
show(ct)
Contingency table (Region, Sex):
            Sex
      Male  Female  Sum
     ----- ------- ----
A        5       6   11
B        2       2    4
Sum      7       8   15
One can also use the chi-square test for association and calculate standardized residuals.
ch = chi2test(data);
show(ch)
Chi2 test (Sex, Region):
      Statistics
      -----------
p            0.56
chi2       0.0244
res = crossresid(data);
show(res)
Standardized residuals (Region, Sex):
          Sex
     Male   Female
  ------- -------
A  -0.156   0.156
B   0.156  -0.156
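The chi-square statistic and the residuals can be recomputed from the contingency table. The sketch below uses plain MATLAB and illustrative variable names. It assumes that the expected frequencies are the products of the row and column sums divided by the total, and that the residuals are the adjusted standardized residuals, (O - E) / sqrt(E * (1 - row proportion) * (1 - column proportion)). Both assumptions are inferred from the printed values, not taken from the source of chi2test() or crossresid(). The p-value is not recomputed here.

```matlab
% Observed frequencies: rows are Region (A, B), columns are Sex (Male, Female)
observed = [5 6; 2 2];

n      = sum(observed(:));
rowSum = sum(observed, 2);        % 2 x 1
colSum = sum(observed, 1);        % 1 x 2

% Expected frequencies under independence of the two factors
expected = rowSum * colSum / n;   % 2 x 2

% Pearson chi-square statistic
chi2 = sum((observed(:) - expected(:)).^2 ./ expected(:));

% Adjusted standardized residuals
rowProp = rowSum / n;
colProp = colSum / n;
resid = (observed - expected) ./ ...
    sqrt(expected .* ((1 - rowProp) * (1 - colProp)));

fprintf('chi2 = %.4f\n', chi2);
disp(resid)
```

Residuals this close to zero indicate that the observed frequencies barely differ from the ones expected for independent factors.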