38 Cards in this Set
Idea behind FP-Tree? |
Compress the transaction database into a compact prefix tree (items sorted by frequency, shared prefixes merged); frequent patterns are then mined recursively from conditional FP-trees without candidate generation. |
|
Problem of Apriori? |
A huge number of candidate itemsets has to be generated and tested, and the database is scanned repeatedly (once per level of the breadth-first search). |
|
Why is frequent pattern growth fast? |
- No candidate generation, no candidate tests (the Apriori algorithm has to proceed breadth-first) - uses a compact data structure (the FP-tree) - eliminates repeated database scans - the basic operations are counting and FP-tree building (see the sketch below) |
|
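A minimal usage sketch of FP-growth, assuming the mlxtend library is available; the toy transactions and the minimum support of 0.5 are invented for illustration.

```python
# Minimal FP-growth example (assumes the mlxtend library is installed;
# the toy transactions below are made up for illustration).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "bread"],
    ["milk", "butter"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
one_hot = pd.DataFrame(te.fit(transactions).transform(transactions),
                       columns=te.columns_)

# Mine frequent itemsets without candidate generation (FP-tree based).
print(fpgrowth(one_hot, min_support=0.5, use_colnames=True))
```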
Why summarize itemsets? Name three representations! |
- The search space of frequent itemsets is large and grows exponentially - low minimum support values result in an extremely high number of frequent itemsets - summarization reduces computational and storage demands - representations: maximal, closed and non-derivable itemsets, with M ⊆ C ⊆ F |
|
Maximal frequent itemset? |
- F set of frequent itemsets - X ∈ F is maximal if there is no Y ∈ F of which X is a proper subset - The set of maximal itemsets does not contain the complete support information but is compact |
|
Closed frequent itemset? |
c(X) = i(t(X)), where t(X) is the set of transactions containing X and i(T) the set of items shared by all transactions in T; X is closed iff c(X) = X - An itemset X is closed in a data set D if there exists no proper super-itemset Y such that support(X) = support(Y) in D. - The set of closed frequent itemsets contains complete support information for all frequent itemsets. |
|
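A small sketch of the closed/maximal distinction using plain Python set operations; the item names and support values (minsup = 2) are invented for illustration.

```python
# Sketch: classify frequent itemsets as closed / maximal, given their supports.
# The support values below are hypothetical (minsup = 2).
supports = {
    frozenset("A"): 4, frozenset("B"): 5, frozenset("C"): 3,
    frozenset("AB"): 4, frozenset("AC"): 2, frozenset("BC"): 3,
    frozenset("ABC"): 2,
}

def is_closed(x):
    # Closed: no proper superset with the same support.
    return not any(x < y and supports[y] == supports[x] for y in supports)

def is_maximal(x):
    # Maximal: no proper superset that is frequent at all.
    return not any(x < y for y in supports)

for x in sorted(supports, key=len):
    print(set(x), supports[x],
          "closed" if is_closed(x) else "",
          "maximal" if is_maximal(x) else "")
```

In this toy table the maximal itemsets are a subset of the closed ones, matching M ⊆ C ⊆ F from the card above.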
GenMax? |
Algorithm for mining maximal frequent itemsets (M): depth-first backtracking search with pruning against the set of maximal itemsets found so far. |
|
Charm? |
Algorithm for mining closed frequent itemsets (C): searches over itemset-tidset pairs depth-first and uses tidset comparisons to skip itemsets that cannot be closed. |
|
support, confidence: how are they computed? |
sup(X): number of transactions containing X divided by the total number of transactions; conf(X -> Y) = sup(X ∪ Y) / sup(X) (see the example below) |
|
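A minimal sketch of the two formulas on a made-up transaction database; the transactions and itemsets are only illustrative.

```python
# Sketch: support and confidence over a toy transaction database (made-up data).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk", "butter"},
]

def support(itemset):
    # Fraction of transactions that contain the whole itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    # conf(X -> Y) = sup(X ∪ Y) / sup(X)
    return support(x | y) / support(x)

print(support({"milk", "bread"}))          # 0.5
print(confidence({"milk"}, {"bread"}))     # 2/3
```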
- KDD Process |
- Knowledge Discovery in Databases
1) data cleaning and integration -- about 60% of the effort -- remove noise and inconsistencies -- estimate missing values
2) focusing on task-relevant data -- selection (e.g., sales data for the year 2001) -- projection (select the relevant attributes/columns) -- transformation (normalization, discretization)
3) data mining: -- clustering -- classification -- frequent pattern mining -- outlier mining
4) evaluation and knowledge presentation -- visualization |
|
What is data warehouse? |
- A decision support database that is maintained separately from the organization’s operational database(s) - source for data mining
-- subject-oriented -- integrated (data from many sources) -- time-variant (covers a long time period) -- nonvolatile (only two operations: load & analyze) |
|
Properties of cube measures? |
- distributive -- the global value is obtained by combining per-partition results, e.g. count, sum, min, max: sum(A ∪ B) = sum(A) + sum(B) - algebraic -- computed by an algebraic function of a bounded number of distributive measures, e.g. avg = sum / count - holistic -- no constant-size summary of the partitions suffices, e.g. median, mode, rank |
|
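A small sketch of the distributive/algebraic/holistic distinction on invented, partitioned data: sum and count combine from per-partition results, avg is derived from them, while the median needs all raw values.

```python
# Sketch: distributive vs. algebraic vs. holistic measures on partitioned data (made-up values).
import statistics

partitions = [[3, 5, 2], [7, 1], [4, 4, 4]]

# Distributive: the global value is obtained by combining per-partition results.
partial_sums   = [sum(p) for p in partitions]
partial_counts = [len(p) for p in partitions]
total_sum, total_count = sum(partial_sums), sum(partial_counts)

# Algebraic: computable from a fixed number of distributive measures (avg = sum / count).
print(total_sum / total_count)

# Holistic (e.g. the median) cannot be combined from such constant-size summaries;
# it needs the full data of all partitions.
print(statistics.median([x for p in partitions for x in p]))
```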
Models (schemas) in a data warehouse? |
- Star - Snowflake - Galaxy |
|
OLAP Operations? |
- Drill down - roll up - slice and dice - pivot
|
|
curse of dimensionality |
Complexity grows exponentially with the number of dimensions. |
|
options for materialization? |
- no materialization (compute aggregates on the fly) - full materialization (precompute all cuboids) - partial materialization (precompute a selected subset of cuboids according to some strategy)
|
|
Bitmap Index? |
For a chosen column, build a table with all possible values and, for every row, record as a binary flag whether the row has that value. |
|
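A minimal sketch of a bitmap index in plain Python; the "city" column and its rows are made up.

```python
# Sketch of a bitmap index on one column (hypothetical "city" column, made-up rows).
rows = ["Berlin", "Munich", "Berlin", "Hamburg", "Munich", "Berlin"]

# One bit vector per distinct value: bit i is 1 iff row i has that value.
bitmap = {value: [int(r == value) for r in rows] for value in set(rows)}
print(bitmap["Berlin"])   # [1, 0, 1, 0, 0, 1]

# Queries combine bit vectors with cheap bitwise operations, e.g. city = Berlin OR Munich.
combined = [a | b for a, b in zip(bitmap["Berlin"], bitmap["Munich"])]
print(combined)           # [1, 1, 1, 0, 1, 1]
```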
discovery strategies? |
- hypothesis-driven (the user specifies what to look for) - discovery-driven -- exception measures are precomputed -- SelfExp, InExp, PathExp -- limited because no assumptions about the data distribution are made, e.g. trends are not captured |
|
categorization of data |
- values -- categorical --- nominal (no order) --- ordinal (ordered) -- numerical --- discrete --- continuous
|
|
dimensions? |
- 1-dimensional, e.g. age: 12 - multidimensional, e.g. (x, y) - high-dimensional - non-dimensional (only metric data, i.e. pairwise distances) |
|
chi-square? |
chi^2 = sum_i (n_i - e_i)^2 / e_i, where n_i is the observed and e_i the expected count |
|
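A small sketch computing the chi-square statistic for a made-up 2x2 contingency table, with the expected counts derived under the independence assumption.

```python
# Sketch: chi-square statistic for observed vs. expected counts (made-up 2x2 table).
observed = [[250, 200], [50, 1000]]

row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n   # expected count under independence
        chi2 += (o - e) ** 2 / e
print(chi2)
```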
data transformations? |
- reduction strategies -- numerosity reduction --- parametric (fit a model) --- non-parametric (sampling) -- dimensionality reduction --- feature selection (stepwise forward selection / backward elimination) --- statistical (PCA) -- discretization (see the sketch below) --- equi-width (sensitive to outliers) --- equi-depth (equi-height) --- clustering-based --- entropy-based --- interval merging with chi-square
|
|
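A small sketch contrasting equi-width and equi-depth discretization with numpy; the values (including the outlier 100) and the number of bins are invented.

```python
# Sketch: equi-width vs. equi-depth (equi-height) discretization with numpy (made-up values).
import numpy as np

values = np.array([1, 2, 2, 3, 4, 5, 8, 9, 20, 100])  # note the outlier at 100
k = 4

# Equi-width: bins of equal value range -- sensitive to outliers
# (here almost all points fall into the first bin).
width_edges = np.linspace(values.min(), values.max(), k + 1)
print(np.digitize(values, width_edges[1:-1]))

# Equi-depth: bins with (roughly) equal numbers of points.
depth_edges = np.quantile(values, np.linspace(0, 1, k + 1))
print(np.digitize(values, depth_edges[1:-1]))
```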
pros / cons of k-means |
+ fast and easy - k has to be known in advance - problems with clusters of varying density - cannot find non-convex cluster shapes - result and runtime depend on the initialization -- good initialization methods exist for that |
|
How does k-means work? |
- k centers are chosen randomly - each point is assigned to its nearest center - the new center (mean) of the assigned points is computed - repeat until the assignment no longer changes (see the sketch below) |
|
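A minimal k-means sketch in numpy on random toy data; it follows the four steps above and is not an optimized or production implementation.

```python
# Minimal k-means sketch with numpy (random toy data; illustrative only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))   # toy data
k = 3

# 1) choose k centers randomly from the data points
centers = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(20):
    # 2) assign every point to its nearest center
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # 3) recompute each center as the mean of its assigned points
    new_centers = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(k)
    ])
    # 4) repeat until the centers no longer move
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print(centers)
```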
silhouette coefficient calculation? |
a(o): average distance of o to the points of its own cluster; b(o): average distance of o to the points of the nearest other cluster; s(o) = (b(o) - a(o)) / max(a(o), b(o)); s_C = average of s(o) over all points of all clusters (see the sketch below) |
|
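A small numpy sketch of the formulas above; the toy points and labels stand in for the output of a clustering step and are made up here (singleton clusters are not handled).

```python
# Sketch: silhouette coefficient computed by hand with numpy.
import numpy as np

def silhouette(X, labels):
    scores = []
    for i, x in enumerate(X):
        d = np.linalg.norm(X - x, axis=1)
        same = labels == labels[i]
        # a(o): mean distance to the other points of o's own cluster
        a = d[same & (np.arange(len(X)) != i)].mean()
        # b(o): smallest mean distance to any other cluster
        b = min(d[labels == c].mean() for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return np.mean(scores)   # silhouette coefficient of the whole clustering

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(silhouette(X, labels))   # close to 1: well separated clusters
```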
silhouette coefficient interpretation |
s(o) = -1 --> bad, on average closer to the second-best cluster B than to its own cluster A; s(o) = 0 --> equally far from A and B; s(o) = 1 --> very good assignment
s_C > 0.7 --> strong structure s_C > 0.5 --> medium structure s_C > 0.25 --> weak structure below --> no structure |
|
how to determine the parameters of DBSCAN |
- minPts: a good value is often 2d - 1 (d = dimensionality) - eps: plot the sorted k-distance curve and look for a kink (see the sketch below) |
|
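A sketch of the k-distance ("kink") plot for choosing eps, assuming scikit-learn and matplotlib are available; the toy data and the minPts value are illustrative.

```python
# Sketch: k-distance plot to pick eps for DBSCAN (toy data, illustrative minPts).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(3, 0.3, (100, 2))])
min_pts = 4   # heuristic value for 2-dimensional data

# Distance of every point to its (minPts-1)-th nearest neighbor, sorted descending.
dists, _ = NearestNeighbors(n_neighbors=min_pts).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])[::-1]

plt.plot(k_dist)
plt.ylabel(f"{min_pts}-NN distance")
plt.xlabel("points sorted by distance")
plt.show()   # choose eps at the "kink" (elbow) of this curve
```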
where is dbscan bad? |
Hierarchical structures and clusters of strongly varying density: the k-distance plot then shows no clear kink, so no single eps fits. |
|
how can overfitting be avoided |
- remove noisy training data - suitable amount of training data (not too small, not too large) - good samples that represent the problem domain |
|
No Free Lunch Theorem? |
- Without knowledge of the problem there is no reason to favor one classifier over another - i.e. there are always situations where another classifier is better --> motivates ensemble classifiers |
|
Bagging vs Boosting |
- Bagging (bootstrap aggregating): draw m bootstrap samples (random samples with replacement) and train one classifier per sample; combine them by voting - Boosting: go through the training data sequentially; each new classifier concentrates on the samples the previous classifiers misclassified (see the sketch below) |
|
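A sketch contrasting the two schemes with scikit-learn's BaggingClassifier and AdaBoostClassifier (one boosting implementation); the data set and parameters are illustrative and assume a recent scikit-learn version (the estimator keyword).

```python
# Sketch: bagging vs. boosting with scikit-learn (toy data; parameters are illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
base = DecisionTreeClassifier(max_depth=3)

# Bagging: m trees, each trained on an independent bootstrap sample.
bagging = BaggingClassifier(estimator=base, n_estimators=50, random_state=0)

# Boosting: trees trained sequentially, each focusing on previously misclassified samples.
boosting = AdaBoostClassifier(estimator=base, n_estimators=50, random_state=0)

print(cross_val_score(bagging, X, y).mean())
print(cross_val_score(boosting, X, y).mean())
```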
limitations of statistical outlier detection |
- most tests are for single attributes only - In many cases, data distribution may not be known - statistical measures might be sensitive to outliers themselves - not all outliers are detected --> distance based approaches |
|
problem of distance based outlier detection |
Outliers are determined globally; with clusters of different densities this can give bad results -- a local view is better --> LOF |
|
Primary vs. secondary index |
- Primary index: determines the physical order of the data - Secondary index: stored separately from the data to allow fast access via additional attributes |
|
Index structures: organization |
- space-organized (e.g. hashing) - data-organized (e.g. search trees) |
|
Index structures: structure |
- sequential scan (e.g. bitmap index) - hierarchical (pruning in trees) - hashing (scatter storage) |
|
Requirements for index structures |
Basic - efficient search - dynamic (insert, delete, update) Additional - concurrency - order preservation - adaptation to different data distributions |
|
Examples of index structures |
- inverted list - composite index (e.g. first by last name, then by first name) - R-tree |