name: inverse
layout: true
class: center, middle, inverse

---
name: cover

# Data Preprocessing

QING Pei
[edwardtoday@gmail.com](mailto:edwardtoday@gmail.com)
.footnote[June 5, 2013 @ PolyU BRC]

---
name: agenda
layout: false

.left-column[
## Agenda
]
.right-column[
1. Why?
2. How?
3. Data sets
4. Experiments
5. Summary
]

---
.left-column[
## Agenda
## Why?
]
.right-column[
Real-world data are

- incomplete
- noisy
- inconsistent
- redundant
]

---
.left-column[
## Agenda
## Why?
## How?
]
.right-column[
Tasks in data preprocessing

- Data cleaning: filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
- Data integration: combining multiple databases, data cubes, or files.
- Data transformation: normalization and aggregation.
- Data reduction: reducing the volume of data while producing the same or similar analytical results.
- Data discretization: part of data reduction, replacing numerical attributes with nominal ones.
]

---
.left-column[
## Agenda
## Why?
## How?
## Data sets
]
.right-column[
4 data sets

- Face
- Merged pulse
- Odor
- Tongue

These are not raw data: they are already complete and consistent. I therefore decided to apply .red[discretization] and .red[reduction].
]

---
.left-column[
## Agenda
## Why?
## How?
## Data sets
]
.right-column[
### Discretization

.green[Supervised discretization]

- uses the values of the class variable
- uses class boundaries

Three steps:

1. Sort the values.
2. Place breakpoints between values belonging to different classes.
3. If there are too many intervals, merge intervals with equal or similar class distributions.

If class labels are not available, unsupervised discretization is worth trying.

- Equal-interval (equiwidth) binning: split the whole range of values into intervals of equal size.
- Equal-frequency (equidepth) binning: use intervals containing an equal number of values.

A short code sketch of both binning schemes appears after the attribute-selection slide.

*Discretization is also a kind of data reduction. It reduces the number of attribute values from many sampled values to a much smaller number of bins.*
]

---
.left-column[
## Agenda
## Why?
## How?
## Data sets
]
.right-column[
### Data reduction

Apart from reducing the number of attribute values, data reduction can take other approaches.

- Reducing the number of instances
    - Sampling
- Reducing the number of attributes
    - Data cube aggregation: applying roll-up, slice or dice operations.
    - Removing irrelevant attributes: attribute selection (filter and wrapper methods), searching the attribute space.
    - Principal component analysis (numeric attributes only): searching for a lower-dimensional space that can best represent the data.

The number of samples we have is in the hundreds, which is not large. Attribute selection is therefore adopted in addition to discretization to further reduce the data sets.
]

---
.left-column[
## Agenda
## Why?
## How?
## Data sets
]
.right-column[
### Approaches to attribute selection

1. Wrapper
    - A *wrapper* uses the intended learning algorithm itself to evaluate the usefulness of features.
    - Slow
2. Filter
    - A *filter* evaluates features according to heuristics based on general characteristics of the data.
    - Fast

I have tried both in previous experiments (the wrapper approach for Simple Logistics and the filter approach for odor feature optimization). This time, I use the Correlation-based Filter Approach .red[*] by Mark A. Hall.

.footnote[.red[*] Hall, Mark A. _Correlation-based feature selection for machine learning_. Diss. The University of Waikato, 1999.]
]
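---
.left-column[
## Agenda
## Why?
## How?
## Data sets
]
.right-column[
### Discretization: a binning sketch

A minimal sketch of the two unsupervised binning schemes from the discretization slide, written in Python/NumPy on a made-up vector. It only illustrates the idea and is not the exact tooling used for the experiments.

```python
import numpy as np

def equal_width_bins(values, k):
    """Equal-interval (equiwidth) binning: split the value range into
    k intervals of equal size and return each value's bin index."""
    edges = np.linspace(values.min(), values.max(), k + 1)
    # np.digitize maps each value to the interval it falls into
    return np.clip(np.digitize(values, edges[1:-1]), 0, k - 1)

def equal_freq_bins(values, k):
    """Equal-frequency (equidepth) binning: choose cut points so that
    each bin holds (roughly) the same number of values."""
    edges = np.quantile(values, np.linspace(0, 1, k + 1))
    return np.clip(np.digitize(values, edges[1:-1]), 0, k - 1)

x = np.array([1.0, 1.2, 1.3, 5.0, 5.1, 9.8, 10.0])
print(equal_width_bins(x, 3))  # [0 0 0 1 1 2 2] -- wide gaps dominate
print(equal_freq_bins(x, 3))   # [0 0 1 1 2 2 2] -- roughly equal counts
```

In the supervised variant, the cut points would instead be placed at class boundaries, as described on the discretization slide.
]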
---
.left-column[
## Agenda
## Why?
## How?
## Data sets
]
.right-column[
### Correlation-based Filter Approach

- Hypothesis on which the heuristic is based:

> _Good feature subsets contain features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other._

In my [last presentation](/talks/odor-optim/), I calculated Pearson's correlation coefficient to represent the usefulness of feature columns. I have since learned that this idea was formally investigated by Ghiselli, Hogarth and Zajonc.

- Ghiselli, Edwin Ernest. _Theory of psychological measurement_. Vol. 13. New York: McGraw-Hill, 1964.
- Hogarth, Robert M. _Methods for aggregating opinions_. Springer Netherlands, 1977.
- Zajonc, Robert B. "A note on group judgments and group size." _Human Relations_ (1962).

Dr. Mark Hall uses three different measures: **minimum description length (MDL)**, **symmetrical uncertainty** and **relief**.
]

---
.left-column[
## Agenda
## Why?
## How?
## Data sets
]
.right-column[
#### Symmetric Uncertainty

The entropy (uncertainty or unpredictability) of feature `\(Y\)` is given by

$$H(Y)=-\sum\_{y\in Y} p(y) \log\_{2}(p(y))$$

If the observed values of `\(Y\)` in the training data are partitioned according to the values of a second feature `\(X\)`, and the entropy of `\(Y\)` with respect to the partitions induced by `\(X\)` is less than the entropy of `\(Y\)` prior to partitioning, then there is a relationship between features `\(Y\)` and `\(X\)`. The entropy of `\(Y\)` after observing `\(X\)` is given by

$$H(Y|X)=-\sum\_{x\in X} p(x) \sum\_{y\in Y} p(y|x) \log\_{2} (p(y|x))$$
]

---
.left-column[
## Agenda
## Why?
## How?
## Data sets
]
.right-column[
The amount by which the entropy of `\(Y\)` decreases reflects the additional information about `\(Y\)` provided by `\(X\)` and is called the *information gain* or *mutual information*. Information gain is given by

$$gain = H(Y) - H(Y|X) = H(X) - H(X|Y) = H(Y) + H(X) - H(X,Y)$$

Symmetrical uncertainty .red[*] compensates for information gain's bias toward attributes with more values and normalizes its value to the range [0,1]:

$$symmetrical\ uncertainty\ coefficient = 2.0 \times \left[\frac{gain}{H(Y) + H(X)} \right]$$

A small worked sketch of these quantities follows the Relief slide.

.footnote[.red[*] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. _Numerical Recipes in C_. Cambridge University Press, Cambridge, 1988.]
]

---
.left-column[
## Agenda
## Why?
## How?
## Data sets
]
.right-column[
#### Relief

RELIEF .red[*] is a feature weighting algorithm that is sensitive to feature interactions.

$$Relief\_{X} = P(different\ value\ of\ X \mid different\ class) - P(different\ value\ of\ X \mid same\ class)$$

which can be reformulated as

$$Relief\_{X} = \frac{Gini' \times \sum\_{x\in X} p(x)^2}{(1-\sum\_{c\in C}p(c)^2)\sum\_{c\in C}p(c)^2}$$

where `\(C\)` is the class variable and

$$Gini' = \left[\sum\_{c \in C}p(c)(1-p(c))\right] - \sum\_{x \in X} \left(\frac{p(x)^2}{\sum\_{x\in X} p(x)^2}\sum\_{c \in C}p(c|x)(1-p(c|x))\right)$$

To use *relief* symmetrically for two features, the measure can be calculated twice (each feature is treated in turn as the "class") and the results averaged.

.footnote[.red[*] Kira, Kenji, and Larry A. Rendell. "A practical approach to feature selection." _Proceedings of the Ninth International Workshop on Machine Learning_. Morgan Kaufmann Publishers Inc., 1992.]
]
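---
.left-column[
## Agenda
## Why?
## How?
## Data sets
]
.right-column[
#### Symmetrical uncertainty: a worked sketch

A minimal Python sketch of the quantities defined above, on a made-up discretized feature and class label. It is meant only to make the formulas concrete; it is not the code used for the experiments.

```python
import numpy as np
from collections import Counter

def entropy(values):
    """H(Y) = -sum p(y) log2 p(y), estimated from value frequencies."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def conditional_entropy(y, x):
    """H(Y|X): entropy of y inside each partition induced by x, weighted by p(x)."""
    y, x = np.asarray(y), np.asarray(x)
    n = len(x)
    return sum((np.sum(x == v) / n) * entropy(y[x == v]) for v in np.unique(x))

def symmetrical_uncertainty(y, x):
    """2 * gain / (H(Y) + H(X)), where gain = H(Y) - H(Y|X)."""
    gain = entropy(y) - conditional_entropy(y, x)
    return 2.0 * gain / (entropy(y) + entropy(x))

x = ["low", "low", "high", "high", "mid", "mid"]               # hypothetical feature
y = ["healthy", "healthy", "sick", "sick", "healthy", "sick"]  # class label
print(symmetrical_uncertainty(y, x))  # ~0.516
```

Because information gain is symmetric in `x` and `y`, the same score is obtained if the two are swapped, which is what makes the measure usable as a symmetric feature-feature (and feature-class) correlation.
]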
---
.left-column[
## Agenda
## Why?
## How?
## Data sets
]
.right-column[
#### MDL

The minimum description length (MDL) principle states that the “best” theory to infer from training data is the one that minimizes the length (complexity) of the theory and the length of the data encoded with respect to the theory. More formally, if `\(T\)` is a theory inferred from data `\(D\)`, then the total description length is given by

$$DL(T,D) = DL(T) + DL(D|T)$$

The best models (according to the MDL principle) are those which are predictive of the data and, at the same time, capture the underlying structure in a compact fashion. Kononenko defines an MDL measure of attribute quality:

$$MDL =\frac{Prior\\_MDL - Post\\_MDL}{n}$$
]

---
.left-column[
## Agenda
## Why?
## How?
## Data sets
]
.right-column[
![Prior_MDL](prior_mdl.svg)

![Post_MDL](post_mdl.svg)

where `\(n\)` is the number of training instances, `\(C\)` is the number of class values, `\(n_i\)` is the number of training instances from class `\(C_i\)`, `\(n_j\)` is the number of training instances with the `\(j\)`-th value of the given attribute, and `\(n_{ij}\)` is the number of training instances of class `\(C_i\)` having the `\(j\)`-th value for the given attribute.

To obtain a measure that lies between 0 and 1, the first equation can be normalized by dividing by `\(Prior\_MDL /n\)`. This gives the **fraction** by which the *average description length* of the class label is **reduced** through partitioning on the values of an attribute.

MDL is asymmetric. To use the measure symmetrically, it can be calculated twice (treating each feature in turn as the "class") and the results averaged. A small sketch of this normalization and symmetrization follows.
]
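---
.left-column[
## Agenda
## Why?
## How?
## Data sets
]
.right-column[
#### MDL measure: a small sketch

The sketch below covers only the normalization and symmetrization steps from the previous slide; the `prior_mdl` and `post_mdl` description lengths are hypothetical inputs, assumed to have been computed already from the expressions in the figures. Averaging the two *normalized* values is one reasonable reading of "calculated twice and the results averaged".

```python
def mdl_quality(prior_mdl, post_mdl, n):
    """Kononenko's attribute quality: average saving in description
    length per instance, (Prior_MDL - Post_MDL) / n."""
    return (prior_mdl - post_mdl) / n

def normalized_mdl(prior_mdl, post_mdl):
    """Divide by Prior_MDL / n: the fraction (between 0 and 1) by which
    the average description length of the class label is reduced."""
    return (prior_mdl - post_mdl) / prior_mdl

def symmetric_mdl(prior_yx, post_yx, prior_xy, post_xy):
    """MDL is asymmetric: evaluate it twice, treating each feature in
    turn as the 'class', and average the two normalized values."""
    return 0.5 * (normalized_mdl(prior_yx, post_yx)
                  + normalized_mdl(prior_xy, post_xy))

# hypothetical description lengths (in bits) for a 100-instance data set
print(mdl_quality(140.0, 95.0, 100))   # 0.45 bits saved per instance
print(normalized_mdl(140.0, 95.0))     # ~0.32 of the prior length removed
```
]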
---
.left-column[
## Agenda
## Why?
## How?
## Data sets
## Experiments
]
.right-column[
### Discretization

Error rate (%) in classification and clustering

| Data set | BayesNet | J48 | EM | K-means |
|----------|----------|-----|----|---------|
| face | 1.1737 | 0.7042 | 5.6338 | 5.1643 |
| face-d | .green[0.939] | .green[0.4695] | .green[0.939] | .green[1.4085] |
| pulse | 26.7267 | 28.8288 | 37.8378 | 40.2402 |
| pulse-d | .green[24.6246] | .green[23.4234] | .green[30.03] | .green[25.5255] |
]

---
.left-column[
## Agenda
## Why?
## How?
## Data sets
## Experiments
]
.right-column[
### Attribute Selection

Error rate (%) in classification and clustering

| Data set | BayesNet | J48 | EM | K-means |
|----------|----------|-----|----|---------|
| face | 1.1737 | 0.7042 | 5.6338 | 5.1643 |
| face-as | .green[0.4695] | 0.7042 | .green[1.8779] | .red[12.2066] |
| pulse | 26.7267 | 28.8288 | 37.8378 | 40.2402 |
| pulse-as | .green[24.6246] | .green[24.9249] | .green[24.6246] | .green[39.9399] |
]

---
.left-column[
## Agenda
## Why?
## How?
## Data sets
## Experiments
]
.right-column[
### Discretization + Attribute Selection

Error rate (%) in classification and clustering

| Data set | BayesNet | J48 | EM | K-means |
|----------|----------|-----|----|---------|
| face | 1.1737 | 0.7042 | 5.6338 | 5.1643 |
| face-d-as | .green[0.2347] | .green[0.4695] | .green[0.2347] | .green[0.939] |
| pulse | 26.7267 | 28.8288 | 37.8378 | 40.2402 |
| pulse-d-as | .green[23.7237] | .green[21.9219] | .green[23.4234] | .red[48.6486] |
]

---
.left-column[
## Agenda
## Why?
## How?
## Data sets
## Experiments
## Overall Comparison
]
.right-column[
Error rate (%) in classification and clustering

| Data set | BayesNet | J48 | EM | K-means |
|----------|----------|-----|----|---------|
| face | 1.1737 | 0.7042 | 5.6338 | 5.1643 |
| face-d | 0.939 | .green[0.4695] | 0.939 | 1.4085 |
| face-as | 0.4695 | 0.7042 | 1.8779 | .red[12.2066] |
| face-d-as | .green[0.2347] | .green[0.4695] | .green[0.2347] | .green[0.939] |
| pulse | 26.7267 | 28.8288 | 37.8378 | 40.2402 |
| pulse-d | 24.6246 | 23.4234 | 30.03 | .green[25.5255] |
| pulse-as | 24.6246 | 24.9249 | 24.6246 | 39.9399 |
| pulse-d-as | .green[23.7237] | .green[21.9219] | .green[23.4234] | .red[48.6486] |
| odor | 6.4362 | 8.1049 | 48.0334 | 48.8677 |
| odor-d | 4.41 | .green[3.8141] | 27.2944 | 34.2074 |
| odor-as | 3.0989 | 4.7676 | .red[48.3909] | 47.0799 |
| odor-d-as | .green[2.9797] | .green[3.8141] | .green[2.9797] | .green[4.1716] |
| tongue | 27.4648 | 27.23 | 40.1408 | 43.4272 |
| tongue-d | 24.8826 | 24.8826 | 36.6197 | .red[44.3662] |
| tongue-as | 23.0047 | 27.23 | .red[48.8263] | .red[43.8967] |
| tongue-d-as | .green[22.3005] | .green[24.4131] | .green[22.5352] | .green[25.8216] |
]

---
.left-column[
## Agenda
## Why?
## How?
## Data sets
## Experiments
## Overall Comparison
]
.right-column[
Number of features

| Data set | Original | Attribute Selection | Discretization + Attribute Selection |
|----------|----------|---------------------|--------------------------------------|
| face | 24 | 10 | 7 |
| merged_pulse | 51 | 13 | 10 |
| odor | 931 | 17 | 20 |
| tongue | 41 | 8 | 8 |
]

---
.left-column[
## Agenda
## Why?
## How?
## Data sets
## Experiments
## Overall Comparison
## Summary
]
.right-column[
By preprocessing the data, we successfully

- improved classification / clustering performance
    - in some cases, quite dramatically
- reduced computation time and storage space
]

---
name: last-page
template: inverse

## Thank you.

.footnote[Slideshow created using [remark](http://github.com/gnab/remark).]