20 Cards in this Set

  • Front
  • Back

Winsorization

Outlier replacement: extreme values are replaced with the maximum and minimum values of the remaining (non-outlier) data set
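A minimal sketch of winsorization in pure Python (the 5%/95% percentile cutoffs are an assumed, illustrative choice):

```python
def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Cap each value at the lower/upper percentile of the data set."""
    xs = sorted(values)
    n = len(xs)
    lo = xs[int(lower_pct * (n - 1))]   # replacement for low outliers
    hi = xs[int(upper_pct * (n - 1))]   # replacement for high outliers
    return [min(max(v, lo), hi) for v in values]
```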


Wrangling vs cleansing

Cleansing - basically removing all kinds of errors from the raw data


Wrangling = removing or dealing with outliers; "extraction" (remember: not selection) of features; filtration (rows) and selection (columns); conversion (prefixes and suffixes must be stripped out)

Feature extraction vs feature engineering

Extraction is the process of creating, that is, extracting, new variables from existing ones in the data. E.g.: the ratio of two existing features.



Engineering is the process of creating new features by applying a transformation to an existing one
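A tiny illustration of the distinction, using hypothetical "debt" and "equity" features:

```python
import math

# Hypothetical raw features
row = {"debt": 50.0, "equity": 100.0}

# Feature extraction: a new variable derived from existing ones (a ratio)
row["debt_to_equity"] = row["debt"] / row["equity"]

# Feature engineering: a new feature from transforming an existing one
row["log_equity"] = math.log(row["equity"])
```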

Text processing

Unstructured to structured - through cleansing and preprocessing (wrangling)


HTML removal, number removal, punctuation removal, white space removal
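These four removal steps can be sketched with regular expressions (the patterns below are an illustrative minimal pass, not a production cleaner):

```python
import re

def clean_text(raw):
    """Minimal cleansing pass: HTML tags, numbers, punctuation, whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)       # HTML tag removal
    text = re.sub(r"\d+", " ", text)          # number removal
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation removal
    return re.sub(r"\s+", " ", text).strip()  # white space normalization
```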

Regex

Regular Expression - basically a search function, used to search for patterns: series of characters that occur in a certain order. E.g. finding whitespace, numbers, punctuation.

Number removal

When numbers are present in the text, they should be removed or substituted with an annotation such as /number/

Text wrangling (preprocessing)

Token is equivalent to word.


Tokenization - the process of splitting a given text into separate tokens. In other words, the text is considered to be a collection of tokens (words)



-Lowercasing


-Stopwords (the, a, is)


-Stemming (analysing, analysed -> analys) / lemmatization (analysing, analysed -> analyse). Stemming is more common
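A toy sketch of this wrangling pipeline; the stopword list and the suffix-stripping rule are simplified stand-ins (a real stemmer such as Porter's is more careful):

```python
STOPWORDS = {"the", "a", "is", "to", "of"}

def wrangle(text):
    tokens = text.lower().split()                       # tokenization + lowercasing
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    stems = []
    for t in tokens:
        # crude stemming: strip a common suffix if the stem stays long enough
        for suffix in ("ing", "ed", "es", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stems.append(t)
    return stems
```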

N gram vs bow

N-gram = a sequence of n words that occur in that order


Bow = unigram.


Advantage of n-grams is that they can be used in the same way as unigrams to build a BOW. E.g.: "Man_went_to_mall" is an n-gram that can also be treated as a single token (unigram) in a BOW.
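A minimal n-gram builder showing how joined n-grams become single tokens:

```python
def ngrams(tokens, n):
    """All n-token sequences, joined with underscores so each n-gram is one token."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```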

Text cleansing when to add annotation?

In punctuation removal, sometimes currency signs and % signs are important. So, to not lose their meaning, add them back as annotations, e.g. /%/

When some words appear very infrequently in a textual database, a technique that can address the risk of training highly complex models is:

Stemming

Text cleansing vs text preprocessing (wrangling)

Cleansing - remove punctuation, remove numbers, remove white spaces, remove HTML tags


Wrangling - lowercasing, stopword removal, stemming, lemmatization

One hot encoding

How to combine two categorical features? Through one-hot encoding - it converts categorical variables into binary 0 or 1 columns. One-hot encoding is an example of feature engineering
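A dependency-free sketch of one-hot encoding (real pipelines often use e.g. pandas.get_dummies instead):

```python
def one_hot(values):
    """Convert a categorical column into 0/1 indicator columns."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]
```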

Feature selection (unstructured)

Frequency - can be used for vocabulary pruning to remove noise features, by filtering the tokens with very high and very low frequency. Document frequency is defined as the number of documents (texts) that contain the respective token divided by the total number of documents (texts)



Chi-square test - it tests the independence of two events: occurrence of the token and occurrence of the class. The test ranks the tokens by their usefulness to each class in a text classification problem. Tokens with the highest chi-square values occur more frequently in texts associated with a particular class and therefore can be selected as features for ML



Mutual info - measures how much information is contributed by a token to a class of text. If the value is 0, the token contributes roughly the same meaning in all classes; if the value approaches 1, the token contributes more meaning to a particular class
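The document-frequency definition above can be computed directly (whitespace tokenization is an assumed simplification):

```python
def document_frequency(documents):
    """DF = (# of documents containing the token) / (total # of documents)."""
    total = len(documents)
    counts = {}
    for doc in documents:
        for token in set(doc.lower().split()):  # set(): count each document once
            counts[token] = counts.get(token, 0) + 1
    return {tok: c / total for tok, c in counts.items()}
```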

Class imbalance

One class is significantly larger than the other class. This can be addressed through oversampling of the minority class or undersampling of the majority class

Precision

Useful when the cost of a Type I error is high (rejecting a true null hypothesis). E.g.: rejecting the quality of a car when it is perfectly fine.



Ratio of correctly predicted positive classes by total predicted positive classes.



High precision indicates that a class labeled as positive is indeed positive (small FP)

Recall

Ratio of correctly predicted positive classes to actual positive classes.


Useful in situations when the cost of a Type II error is high. E.g.: a car passes the quality check but is actually faulty.



High recall indicates that the class is correctly recognized (small FN)

Accuracy and F1 score

Trading off precision and recall is a business model choice. Thus two overall performance metrics are used:



Accuracy: percentage of correctly predicted classes out of total predictions. Limitation: it assumes equal cost/damage for both kinds of error.


F1: harmonic mean of precision and recall.



F1 is more useful than accuracy when an unequal class distribution is present in the dataset and it is necessary to measure the equilibrium of precision and recall. The F1 value is always closer to the lower of recall and precision.
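The four metrics from the last three cards, computed from assumed confusion-matrix counts (TP, FP, FN, TN):

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 (harmonic mean), and accuracy from a confusion matrix."""
    precision = tp / (tp + fp)                        # correct positives / predicted positives
    recall = tp / (tp + fn)                           # correct positives / actual positives
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)        # correct predictions / all predictions
    return precision, recall, f1, accuracy
```

Note that F1 lands between precision and recall but closer to the lower of the two, as the card states.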

Receiver Operating Characteristics (ROC)

The higher the area under the curve (AUC), the better the model's performance. The more convex the curve, the better the model (in the original card's figure, the more convex Test A was the better test).
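A rough sketch: AUC can be approximated with the trapezoid rule over ROC points, so a more convex curve yields a larger area (the example curves below are illustrative):

```python
def auc(fpr, tpr):
    """Area under an ROC curve by the trapezoid rule; points sorted by FPR."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
    return area
```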


Grid search

Systematically training a machine learning model using various combinations of hyperparameter values, cross-validating each model, and determining which combination of hyperparameters results in the best model
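A bare-bones grid search over an assumed parameter grid; in practice the score function would be a cross-validated model score:

```python
import itertools

def grid_search(param_grid, score_fn):
    """Try every hyperparameter combination; return the best one and its score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)  # stand-in for a cross-validated model score
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```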

Cross validation

Technique for estimating out-of-sample error directly, by determining the error in a validation sample
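A sketch of how k-fold cross-validation partitions sample indices, with each fold serving once as the validation sample:

```python
def k_fold_splits(n_samples, k):
    """Index splits for k-fold cross-validation."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    splits = []
    for i in range(k):
        start = i * fold_size
        stop = start + fold_size if i < k - 1 else n_samples  # last fold takes the remainder
        val = indices[start:stop]                 # validation fold
        train = indices[:start] + indices[stop:]  # everything else trains the model
        splits.append((train, val))
    return splits
```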