name: inverse
layout: true
class: center, middle, inverse

---
name: cover

# Notes on KDD

QING Pei
[edwardtoday@gmail.com](mailto:edwardtoday@gmail.com)

View the [notes page](https://edwardtoday.github.io/BioFeatureFusion/kdd_notes.html) directly.
.footnote[April 17, 2013 @ PolyU BRC]

---
name: agenda
layout: false

.left-column[
## Agenda
]
.right-column[
These are mainly the notes I took while reading *Data Mining: Concepts and Techniques* by J. Han, M. Kamber, and J. Pei.

1. What is KDD?
2. What can be discovered?
3. Interestingness
4. Data preprocessing
5. Data transformation
]

---
.left-column[
## Agenda
## What is KDD?
]
.right-column[
What is KDD?

- KDD is a process that attempts to discover **patterns** in large data sets.
- KDD's overall goal is to extract information from a data set and transform it into an **understandable structure** for further use, e.g. a decision support system.
]

---
.left-column[
## Agenda
## What is KDD?
]
.right-column[
The KDD process can be shown as an iterative sequence of the following steps [@Han:2011wk]:

1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation
]

---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
]
.right-column[
### Associations and Correlations

* **Apriori** [@Agrawal:1994ut]: finding frequent itemsets by confined candidate generation. Its efficiency can be improved by:
  1. hashing itemsets into buckets [PCY95a]
  2. transaction reduction
  3. partitioning the data to find candidate itemsets [SON95]
  4. sampling: mining on a subset of the given data [Toi96]
  5. dynamic itemset counting: start with count-so-far; fewer database scans [BMUT97]
  6. parallel and distributed association mining [PCY95b] [AS96] [CHN+96] [ZPOL97]
* **A-Close** [PBTL99]: finding frequent closed itemsets
* **FPgrowth** [HPY00]: pattern-growth approach for mining frequent itemsets
* **CLOSET** [PHM00]: closed itemset mining based on FPgrowth
* **Eclat** (Equivalence Class Transformation) [Zak00]: mining frequent itemsets using the vertical data format
]

---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
]
.right-column[
Interestingness

A pattern is interesting if it is

1. *easily understood* by humans,
2. *valid* on new or test data with some degree of *certainty*,
3. potentially *useful*, and
4. *novel*, or if it validates a hypothesis that the user *sought to confirm*.

An interesting pattern represents **knowledge**.
]

---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
]
.right-column[
Interestingness of association rules

* **Lift** assesses the degree to which the occurrence of `\(A\)` "lifts" the occurrence of `\(B\)`. [AY99]

$$lift(A,B) = \frac{P(A∪B)}{P(A)P(B)}$$

*Lift* is sensitive to transactions that contain none of the itemsets of interest (**null-transactions**), so it can produce unstable results. Therefore we need other measures that are **null-invariant**.
]

---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
]
.right-column[
* **All confidence**

$$all\\_conf(A,B) = \frac{sup(A∪B)}{max(sup(A), sup(B))} = min(P(A|B),P(B|A))$$

> `\(all\_conf(A,B)\)` is the minimum confidence of the two association rules related to `\(A\)` and `\(B\)`, namely, `\(A ⇒ B\)` and `\(B ⇒ A\)`. [Omi03; LKCH03]

* **Max confidence**

$$max\\_conf(A,B) = max(P(A|B),P(B|A))$$

> The `\(max\_conf(A,B)\)` measure is the maximum confidence of the two association rules, `\(A ⇒ B\)` and `\(B ⇒ A\)`.
]
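---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
]
.right-column[
A minimal sketch (my own, not from the book; the toy counts are made up) showing why *lift* is not null-invariant while the confidence-based measures are:

```python
# Compute lift, all_confidence and max_confidence from raw counts.
def measures(n_a, n_b, n_ab, n_total):
    # n_a, n_b, n_ab: transactions containing A, B, and both
    p_a, p_b, p_ab = n_a / n_total, n_b / n_total, n_ab / n_total
    lift = p_ab / (p_a * p_b)
    all_conf = n_ab / max(n_a, n_b)         # min(P(A|B), P(B|A))
    max_conf = max(n_ab / n_a, n_ab / n_b)
    return lift, all_conf, max_conf

print(measures(1000, 1000, 100, 10000))    # baseline: lift = 1.0
print(measures(1000, 1000, 100, 100000))   # + 90,000 null-transactions
# lift jumps to 10.0, while all_conf and max_conf stay at 0.1
```
]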
---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
]
.right-column[
* **Kulczynski** [WCH10]

$$Kulc(A,B) = \frac{1}{2} (P(A|B) + P(B|A))$$

* **Cosine**

$$cosine(A,B) = \frac{P(A∪B)}{\sqrt{P(A) \times P(B)}} = \frac{sup(A∪B)}{\sqrt{sup(A) \times sup(B)}} = \sqrt{P(A|B) \times P(B|A)}$$

> The cosine measure can be viewed as a *harmonized lift* measure.
]

---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
]
.right-column[
### Classification and Regression: Decision Tree

A **decision tree** is a flowchart-like tree structure, where each **internal node** (non-leaf node) denotes a test on an attribute, each **branch** represents an outcome of the test, and each **leaf node** (or terminal node) holds a class label.

* ID3, *information gain*
* C4.5, *gain ratio*
* CART, *Gini index*
* CHAID, *a measure based on the chi-square test* .red[*]
* Minimum Description Length (MDL)

.footnote[.red[*] popular in marketing]
]

---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
]
.right-column[
Problems with decision trees:

* **Repetition**: the same attribute is tested repeatedly along a branch, e.g. "price > 50?", then "price < 100?", followed by "100 > price > 80", ...
* **Replication**: duplicate subtrees exist within the tree, possibly in different branches.

Use multivariate splits .red[*] to relieve these problems, or use a rule-based classifier instead of a decision tree classifier.

.footnote[.red[*] CART can find multivariate splits based on a linear combination of attributes. This is a form of attribute construction.]
]

---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
]
.right-column[
* **Scalability**: ID3, C4.5 and CART work for small data sets. What if the data set is disk-resident?
  + RainForest: maintains an AVC-set (Attribute-Value, Classlabel) per attribute at each node
  + BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)
    - 2 to 3 times faster than RainForest while constructing exactly the same tree
    - incremental updates: BOAT can take new insertions and deletions of the training data without having to reconstruct the tree
]

---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
]
.right-column[
### Classification and Regression: Bayesian Classification

`\(P(H)\)`, `\(P(X|H)\)` and `\(P(X)\)` may be estimated from the given data. The posterior probability, `\(P(H|X)\)`, can then be calculated from them using Bayes' theorem:

$$P(H|X) = \frac{P(X|H)P(H)}{P(X)}$$

Naive Bayesian classifier

* assumption: **class-conditional independence**
* predict the class `\(C_i\)` that maximizes `\(P(X|C_i)P(C_i)\)`
* Note: if any probability estimate is zero, the **Laplacian correction** can help.
]

---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
]
.right-column[
### Classification and Regression: Rule-based Classification

An IF-THEN rule:

> IF *condition* THEN *conclusion*.

A rule `\(R\)` can be assessed by its **coverage**

$$coverage(R) = \frac{n\_{covers}}{|D|}$$

and **accuracy**

$$accuracy(R) = \frac{n\_{correct}}{n\_{covers}}$$
]

---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
]
.right-column[
Rule quality measures

- Accuracy alone is not a good measure: a rule can sacrifice coverage for higher accuracy.
- Entropy, or the information gain measure, is better.
- Another measure based on information gain is First Order Inductive Learner (FOIL).

Conditions that do not improve the estimated accuracy of a given rule are pruned. Rules that do not improve the estimated accuracy of the rule set are pruned.

After pruning, rule ordering matters. C4.5 adopts a **class-based ordering scheme**.
]
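---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
]
.right-column[
To make coverage and accuracy concrete, a minimal sketch (my own, not from the book; the six training tuples and the rule are invented for illustration):

```python
# Each tuple: (age, student, buys_computer) -- a made-up data set D.
D = (("youth", "yes", "yes"), ("youth", "no", "no"),
     ("middle", "yes", "yes"), ("senior", "no", "no"),
     ("youth", "yes", "no"), ("senior", "yes", "yes"))

# Rule R: IF age = youth AND student = yes THEN buys_computer = yes
n_covers = n_correct = 0
for age, student, buys in D:
    if age == "youth" and student == "yes":  # R covers this tuple
        n_covers += 1
        if buys == "yes":                    # R's conclusion holds
            n_correct += 1

print(n_covers / len(D))     # coverage(R) = 2/6, about 0.33
print(n_correct / n_covers)  # accuracy(R) = 1/2 = 0.50
```
]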
---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
## Classifier Evaluation
]
.right-column[
Terminology

* **positive tuples** (P): tuples of the main class of interest
* **negative tuples** (N): all other tuples
* **true positives** (TP): positive tuples that were correctly labeled by the classifier
* **true negatives** (TN): negative tuples that were correctly labeled by the classifier
* **false positives** (FP): negative tuples that were incorrectly labeled as positive
* **false negatives** (FN): positive tuples that were mislabeled as negative

We have `\(P = TP + FN\)` and `\(N = FP + TN\)`.
]

---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
## Classifier Evaluation
]
.right-column[
Evaluation measures

* accuracy

$$accuracy = \frac{TP + TN}{P + N}$$

* error rate, `\(1 - accuracy\)`

$$error\ rate = \frac{FP + FN}{P + N}$$

  + If we use the training set to estimate the error rate of a model, this quantity is known as the **resubstitution error**.

* sensitivity, also called true positive rate or recall

$$sensitivity = \frac{TP}{P}$$
]

---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
## Classifier Evaluation
]
.right-column[
* specificity, or true negative rate

$$specificity = \frac{TN}{N}$$

  + Sensitivity and specificity are useful for dealing with the class imbalance problem.

* precision

$$precision = \frac{TP}{TP+FP}$$

  + Precision can be thought of as a measure of *exactness* (i.e., what percentage of tuples labeled as positive are actually such), whereas recall is a measure of *completeness* (what percentage of positive tuples are labeled as such).
]

---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
## Classifier Evaluation
]
.right-column[
* `\(F_{1}\)`, the harmonic mean of precision and recall

$$F_{1} = \frac{2 \times precision \times recall}{precision + recall}$$

* `\(F_{\beta}\)`, where `\(\beta\)` is a non-negative real number

$$F_{\beta} = \frac{(1 + {\beta}^{2}) \times precision \times recall}{{\beta}^{2} \times precision + recall}$$

  - These combine precision and recall into a single measure. Commonly used `\(F_{\beta}\)` measures are `\(F_{2}\)`, which weights recall twice as much as precision, and `\(F_{0.5}\)`, which weights precision twice as much as recall.

If tuples can belong to multiple classes, the accuracy measure is not appropriate. It is then useful to return a **probability class distribution** instead.
]

---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
## Classifier Evaluation
]
.right-column[
In addition to accuracy-based measures, classifiers can also be compared with respect to the following aspects:

* **Speed**: the computational cost of generating and using the classifier
* **Robustness**: the ability to make correct predictions given noisy data or data with missing values
* **Scalability**: the ability to construct the classifier efficiently given large amounts of data
* **Interpretability**: the level of understanding and insight provided by the classifier or predictor; this is subjective and therefore difficult to assess
]

---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
## Classifier Evaluation
]
.right-column[
### Cross-validation and Statistical Tests of Significance

10-fold cross-validation is recommended.

Question:

> We have two different models to compare.
> Is the difference **statistically significant**?
> (Otherwise we should not claim that one is better.)
]

---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
## Classifier Evaluation
]
.right-column[
Suppose we have trained and tested two models, `\(M_1\)` and `\(M_2\)`. In such cases, we do a **pairwise comparison** of the two models for *each* 10-fold cross-validation round.

> That is, for the `\(i\)`th round of 10-fold cross-validation, the *same cross-validation partitioning* is used to obtain an error rate for `\(M_1\)` and for `\(M_2\)`.

Let `\(err(M_1)_i\)` (or `\(err(M_2)_i\)`) be the error rate of model `\(M_1\)` (or `\(M_2\)`) on round `\(i\)`. The error rates for `\(M_1\)` are averaged to obtain a mean error rate for `\(M_1\)`, denoted `\(\overline{err}(M_1)\)`. Similarly, we can obtain `\(\overline{err}(M_2)\)`. The variance of the difference between the two models is denoted `\(var(M_1 - M_2)\)`.
]

---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
## Classifier Evaluation
]
.right-column[
The *t*-test computes the *t-statistic with `\(k-1\)` degrees of freedom* for `\(k\)` samples.

> In our example, we have `\(k=10\)`.

The *t*-statistic for pairwise comparison is computed as follows:

$$t = \frac{\overline{err}(M_1)-\overline{err}(M_2)}{\sqrt{var(M_1 - M_2)/k}},$$

where

$$var(M\_1 - M\_2) = \frac{1}{k} \sum\_{i=1}^{k}{[err(M\_1)\_i - err(M\_2)\_i - (\overline{err}(M\_1) - \overline{err}(M\_2))]^2}.$$
]
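---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
## Classifier Evaluation
]
.right-column[
A minimal sketch (my own, not from the book) of this paired *t*-statistic; the ten per-fold error-rate pairs are made up:

```python
from math import sqrt

# Error rates of M1 and M2 on the same 10 cross-validation partitions.
err_m1 = (0.12, 0.10, 0.14, 0.11, 0.13, 0.12, 0.10, 0.15, 0.11, 0.12)
err_m2 = (0.14, 0.13, 0.15, 0.12, 0.16, 0.13, 0.12, 0.16, 0.14, 0.15)
k = len(err_m1)

diffs = tuple(a - b for a, b in zip(err_m1, err_m2))
mean_diff = sum(diffs) / k                          # mean of err(M1)_i - err(M2)_i
var = sum((d - mean_diff) ** 2 for d in diffs) / k  # var(M1 - M2) as defined above
t = mean_diff / sqrt(var / k)

print(t)  # about -7.07 here; |t| > 3.25, the 1% critical value at
          # 9 degrees of freedom, so this difference would be significant
```
]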
---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
## Classifier Evaluation
]
.right-column[
Compute `\(t\)`, select a *significance level* .red[\*], and consult a table of the *t-distribution*. Then we can tell whether the difference between `\(M_1\)` and `\(M_2\)` is statistically significant.

Nonpaired version:

$$var(M_1 - M_2) = \sqrt{\frac{var(M_1)}{k_1} + \frac{var(M_2)}{k_2}},$$

where `\(k_1\)` and `\(k_2\)` are the number of cross-validation samples used for `\(M_1\)` and `\(M_2\)`, respectively.

> When consulting the table of the *t-distribution*, the number of degrees of freedom is taken as the minimum of the two models' degrees of freedom.

.footnote[.red[\*] usually 5% or 1% in practice]
]

---
.left-column[
## Agenda
## What is KDD?
## Kinds of Patterns
## Classifier Evaluation
]
.right-column[
### ROC Curve

![ROC Curve](Roccurves.png)
]

---
.left-column[
## What is KDD?
## Kinds of Patterns
## Classifier Evaluation
## Improving Classifiers
## Next time
]
.right-column[
Topics to be covered:

- Ensemble methods
- Cluster analysis
- Outlier analysis
- Data preprocessing
  + Cleaning
  + Reduction
  + Discretization
]

---
name: last-page
template: inverse

## That's all (for now)!

.footnote[Slideshow created using [remark](https://github.com/gnab/remark).]