38 Cards in this Set
Idea behind FP-Tree? |
Compress the transaction database into a compact prefix tree (items sorted by frequency, shared prefixes merged); frequent patterns are then mined recursively from conditional FP-trees without candidate generation. |
|
Problem of Apriori? |
A huge number of candidate itemsets has to be generated and tested, and the database is scanned repeatedly (once per level of the breadth-first search). |
|
Why is frequent pattern growth fast? |
- No candidate generation, no candidate tests (the Apriori algorithm has to proceed breadth-first) - uses a compact data structure (the FP-tree) - eliminates repeated database scans - the basic operations are counting and FP-tree building (see the sketch below) |
|
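A minimal usage sketch of FP-growth, assuming the mlxtend library is available; the toy transactions and the minimum support of 0.5 are invented for illustration.

```python
# Minimal FP-growth example (assumes the mlxtend library is installed;
# the toy transactions below are made up for illustration).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "bread"],
    ["milk", "butter"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
one_hot = pd.DataFrame(te.fit(transactions).transform(transactions),
                       columns=te.columns_)

# Mine frequent itemsets without candidate generation (FP-tree based).
print(fpgrowth(one_hot, min_support=0.5, use_colnames=True))
```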
Why summarize itemsets? Name three representations! |
- The search space of frequent itemsets is large and grows exponentially - low minimum support values result in an extremely high number of frequent itemsets - summarization reduces computational and storage demands - representations: maximal, closed and non-derivable itemsets, with M ⊆ C ⊆ F |
|
Maximal frequent itemset? |
- F set of frequent itemsets - X ∈ F is maximal if there is no Y ∈ F of which X is a proper subset - The set of maximal itemsets does not contain the complete support information but is compact |
|
Closed frequent itemset? |
c(X) = i(t(X)), where t(X) is the set of transactions containing X and i(T) the set of items shared by all transactions in T; X is closed iff c(X) = X - An itemset X is closed in a data set D if there exists no proper super-itemset Y such that support(X) = support(Y) in D. - The set of closed frequent itemsets contains complete support information for all frequent itemsets. |
|
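A small sketch of the closed/maximal distinction using plain Python set operations; the item names and support values (minsup = 2) are invented for illustration.

```python
# Sketch: classify frequent itemsets as closed / maximal, given their supports.
# The support values below are hypothetical (minsup = 2).
supports = {
    frozenset("A"): 4, frozenset("B"): 5, frozenset("C"): 3,
    frozenset("AB"): 4, frozenset("AC"): 2, frozenset("BC"): 3,
    frozenset("ABC"): 2,
}

def is_closed(x):
    # Closed: no proper superset with the same support.
    return not any(x < y and supports[y] == supports[x] for y in supports)

def is_maximal(x):
    # Maximal: no proper superset that is frequent at all.
    return not any(x < y for y in supports)

for x in sorted(supports, key=len):
    print(set(x), supports[x],
          "closed" if is_closed(x) else "",
          "maximal" if is_maximal(x) else "")
```

In this toy table the maximal itemsets are a subset of the closed ones, matching M ⊆ C ⊆ F from the card above.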
GenMax? |
Algorithm for mining maximal frequent itemsets (M): depth-first backtracking search with pruning against the set of maximal itemsets found so far. |
|
Charm? |
Algorithm for mining closed frequent itemsets (C): searches over itemset-tidset pairs depth-first and uses tidset comparisons to skip itemsets that cannot be closed. |
|
support, confidence: how are they computed? |
sup(X): number of transactions containing X divided by the total number of transactions; conf(X -> Y) = sup(X ∪ Y) / sup(X) (see the example below) |
|
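A minimal sketch of the two formulas on a made-up transaction database; the transactions and itemsets are only illustrative.

```python
# Sketch: support and confidence over a toy transaction database (made-up data).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk", "butter"},
]

def support(itemset):
    # Fraction of transactions that contain the whole itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    # conf(X -> Y) = sup(X ∪ Y) / sup(X)
    return support(x | y) / support(x)

print(support({"milk", "bread"}))          # 0.5
print(confidence({"milk"}, {"bread"}))     # 2/3
```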
- KDD Process |
- Knowledge Discovery in Databases
1) data cleaning and integration -- about 60% of the effort -- remove noise and inconsistencies -- estimate missing values
2) focusing on task-relevant data -- selection (e.g., sales data for the year 2001) -- projection (select the relevant attributes/columns) -- transformation (normalization, discretization)
3) data mining: -- clustering -- classification -- frequent pattern mining -- outlier mining
4) evaluation and knowledge presentation -- visualization |
|
What is data warehouse? |
- A decision support database that is maintained separately from the organization’s operational database(s) - source for data mining
-- subject-oriented -- integrated (data from many sources) -- time-variant (covers a long time period) -- nonvolatile (only two operations: load & analyze) |
|
Properties of cube measures? |
- distributive -- the global value is obtained by combining per-partition results, e.g. count, sum, min, max: sum(A ∪ B) = sum(A) + sum(B) - algebraic -- computed by an algebraic function of a bounded number of distributive measures, e.g. avg = sum / count - holistic -- no constant-size summary of the partitions suffices, e.g. median, mode, rank |
|
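A small sketch of the distributive/algebraic/holistic distinction on invented, partitioned data: sum and count combine from per-partition results, avg is derived from them, while the median needs all raw values.

```python
# Sketch: distributive vs. algebraic vs. holistic measures on partitioned data (made-up values).
import statistics

partitions = [[3, 5, 2], [7, 1], [4, 4, 4]]

# Distributive: the global value is obtained by combining per-partition results.
partial_sums   = [sum(p) for p in partitions]
partial_counts = [len(p) for p in partitions]
total_sum, total_count = sum(partial_sums), sum(partial_counts)

# Algebraic: computable from a fixed number of distributive measures (avg = sum / count).
print(total_sum / total_count)

# Holistic (e.g. the median) cannot be combined from such constant-size summaries;
# it needs the full data of all partitions.
print(statistics.median([x for p in partitions for x in p]))
```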
Models (schemas) in a data warehouse? |
- Star - Snowflake - Galaxy |
|
OLAP Operations? |
- Drill down - roll up - slice and dice - pivot
|
|
curse of dimensionality |
Complexity grows exponentially with the number of dimensions. |
|
options for materialization? |
- no materialization (compute aggregates on the fly) - full materialization (precompute all cuboids) - partial materialization (precompute a selected subset of cuboids according to some strategy)
|
|
Bitmap Index? |
For a chosen column, build a table with all possible values and, for every row, record as a binary flag whether the row has that value. |
|
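A minimal sketch of a bitmap index in plain Python; the "city" column and its rows are made up.

```python
# Sketch of a bitmap index on one column (hypothetical "city" column, made-up rows).
rows = ["Berlin", "Munich", "Berlin", "Hamburg", "Munich", "Berlin"]

# One bit vector per distinct value: bit i is 1 iff row i has that value.
bitmap = {value: [int(r == value) for r in rows] for value in set(rows)}
print(bitmap["Berlin"])   # [1, 0, 1, 0, 0, 1]

# Queries combine bit vectors with cheap bitwise operations, e.g. city = Berlin OR Munich.
combined = [a | b for a, b in zip(bitmap["Berlin"], bitmap["Munich"])]
print(combined)           # [1, 1, 1, 0, 1, 1]
```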
discovery strategies? |
- hypothesis-driven (the user specifies what to look for) - discovery-driven -- exception measures are precomputed -- SelfExp, InExp, PathExp -- limited because no assumptions about the data distribution are made, e.g. trends are not captured |
|
categorization of data |
- values -- categorical --- nominal (no order) --- ordinal (ordered) -- numerical --- discrete --- continuous
|
|
dimensions? |
- 1-dimensional, e.g. age: 12 - multidimensional, e.g. (x, y) - high-dimensional - non-dimensional (only metric data, i.e. pairwise distances) |
|
chi-square? |
chi^2 = sum_i (n_i - e_i)^2 / e_i, where n_i is the observed and e_i the expected count |
|
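A small sketch computing the chi-square statistic for a made-up 2x2 contingency table, with the expected counts derived under the independence assumption.

```python
# Sketch: chi-square statistic for observed vs. expected counts (made-up 2x2 table).
observed = [[250, 200], [50, 1000]]

row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n   # expected count under independence
        chi2 += (o - e) ** 2 / e
print(chi2)
```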
data transformations? |
- reduction strategies -- numerosity reduction --- parametric (fit a model) --- non-parametric (sampling) -- dimensionality reduction --- feature selection (stepwise forward selection / backward elimination) --- statistical (PCA) -- discretization (see the sketch below) --- equi-width (sensitive to outliers) --- equi-depth (equi-height) --- clustering-based --- entropy-based --- interval merging with chi-square
|
|
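A small sketch contrasting equi-width and equi-depth discretization with numpy; the values (including the outlier 100) and the number of bins are invented.

```python
# Sketch: equi-width vs. equi-depth (equi-height) discretization with numpy (made-up values).
import numpy as np

values = np.array([1, 2, 2, 3, 4, 5, 8, 9, 20, 100])  # note the outlier at 100
k = 4

# Equi-width: bins of equal value range -- sensitive to outliers
# (here almost all points fall into the first bin).
width_edges = np.linspace(values.min(), values.max(), k + 1)
print(np.digitize(values, width_edges[1:-1]))

# Equi-depth: bins with (roughly) equal numbers of points.
depth_edges = np.quantile(values, np.linspace(0, 1, k + 1))
print(np.digitize(values, depth_edges[1:-1]))
```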
pros / cons of k-means |
+ fast and easy - k has to be known in advance - problems with clusters of varying density - cannot find non-convex cluster shapes - result and runtime depend on the initialization -- good initialization methods exist for that |
|
How does k-means work? |
- k centers are chosen randomly - each point is assigned to its nearest center - the new center (mean) of the assigned points is computed - repeat until the assignment no longer changes (see the sketch below) |
|
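A minimal k-means sketch in numpy on random toy data; it follows the four steps above and is not an optimized or production implementation.

```python
# Minimal k-means sketch with numpy (random toy data; illustrative only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))   # toy data
k = 3

# 1) choose k centers randomly from the data points
centers = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(20):
    # 2) assign every point to its nearest center
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # 3) recompute each center as the mean of its assigned points
    new_centers = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(k)
    ])
    # 4) repeat until the centers no longer move
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print(centers)
```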
silhouette coefficient calculation? |
a(o): average distance of o to the points of its own cluster; b(o): average distance of o to the points of the nearest other cluster; s(o) = (b(o) - a(o)) / max(a(o), b(o)); s_C = average of s(o) over all points of all clusters (see the sketch below) |
|
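A small numpy sketch of the formulas above; the toy points and labels stand in for the output of a clustering step and are made up here (singleton clusters are not handled).

```python
# Sketch: silhouette coefficient computed by hand with numpy.
import numpy as np

def silhouette(X, labels):
    scores = []
    for i, x in enumerate(X):
        d = np.linalg.norm(X - x, axis=1)
        same = labels == labels[i]
        # a(o): mean distance to the other points of o's own cluster
        a = d[same & (np.arange(len(X)) != i)].mean()
        # b(o): smallest mean distance to any other cluster
        b = min(d[labels == c].mean() for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return np.mean(scores)   # silhouette coefficient of the whole clustering

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(silhouette(X, labels))   # close to 1: well separated clusters
```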
silhouette coefficient interpretation |
s(o) = -1 --> bad, on average closer to the second-best cluster B than to its own cluster A; s(o) = 0 --> equally far from A and B; s(o) = 1 --> very good assignment
s_C > 0.7 --> strong structure s_C > 0.5 --> medium structure s_C > 0.25 --> weak structure below --> no structure |
|
how to determine the parameters of DBSCAN |
- minPts: a good value is often 2d - 1 (d = dimensionality) - eps: plot the sorted k-distance curve and look for a kink (see the sketch below) |
|
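A sketch of the k-distance ("kink") plot for choosing eps, assuming scikit-learn and matplotlib are available; the toy data and the minPts value are illustrative.

```python
# Sketch: k-distance plot to pick eps for DBSCAN (toy data, illustrative minPts).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(3, 0.3, (100, 2))])
min_pts = 4   # heuristic value for 2-dimensional data

# Distance of every point to its (minPts-1)-th nearest neighbor, sorted descending.
dists, _ = NearestNeighbors(n_neighbors=min_pts).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])[::-1]

plt.plot(k_dist)
plt.ylabel(f"{min_pts}-NN distance")
plt.xlabel("points sorted by distance")
plt.show()   # choose eps at the "kink" (elbow) of this curve
```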
where is dbscan bad? |
Hierarchical structures and clusters of strongly varying density: the k-distance plot then shows no clear kink, so no single eps fits. |
|
how can overfitting be avoided |
- remove noisy training data - suitable amount of training data (not too small, not too large) - good samples that represent the problem domain |
|
No Free Lunch Theorem? |
- Without knowledge of the problem there is no reason to favor one classifier over another - i.e. there are always situations where another classifier is better --> motivates ensemble classifiers |
|
Bagging vs Boosting |
- Bagging (bootstrap aggregating): draw m bootstrap samples (random samples with replacement) and train one classifier per sample; combine them by voting - Boosting: go through the training data sequentially; each new classifier concentrates on the samples the previous classifiers misclassified (see the sketch below) |
|
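A sketch contrasting the two schemes with scikit-learn's BaggingClassifier and AdaBoostClassifier (one boosting implementation); the data set and parameters are illustrative and assume a recent scikit-learn version (the estimator keyword).

```python
# Sketch: bagging vs. boosting with scikit-learn (toy data; parameters are illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
base = DecisionTreeClassifier(max_depth=3)

# Bagging: m trees, each trained on an independent bootstrap sample.
bagging = BaggingClassifier(estimator=base, n_estimators=50, random_state=0)

# Boosting: trees trained sequentially, each focusing on previously misclassified samples.
boosting = AdaBoostClassifier(estimator=base, n_estimators=50, random_state=0)

print(cross_val_score(bagging, X, y).mean())
print(cross_val_score(boosting, X, y).mean())
```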
limitations of statistical outlier detection |
- most tests are for single attributes only - In many cases, data distribution may not be known - statistical measures might be sensitive to outliers themselves - not all outliers are detected --> distance based approaches |
|
problem of distance based outlier detection |
Outliers are determined globally; with clusters of different densities this can give bad results -- a local view is better --> LOF |
|
Primary vs. secondary index |
- Primary index: determines the physical order of the data - Secondary index: stored separately from the data to allow fast access via additional attributes |
|
Index structures: organization |
- space-organized (e.g. hashing) - data-organized (e.g. search trees) |
|
Index structures: structure |
- sequential scan (e.g. bitmap index) - hierarchical (pruning in trees) - hashing (scatter storage) |
|
Requirements for index structures |
Basic - efficient search - dynamic (insert, delete, update) Additional - concurrency - order preservation - adaptation to different data distributions |
|
Examples of index structures |
- inverted list - composite index (e.g. first by last name, then by first name) - R-tree |