20 Cards in this Set

  • Front
  • Back

Winsorization

Outlier replacement: extreme values are replaced with the maximum and minimum values of the remaining (non-outlier) data set
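A minimal sketch of winsorization in pure Python (the 5%/95% percentile cutoffs are an assumed, illustrative choice):

```python
def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Cap each value at the lower/upper percentile of the data set."""
    xs = sorted(values)
    n = len(xs)
    lo = xs[int(lower_pct * (n - 1))]   # replacement for low outliers
    hi = xs[int(upper_pct * (n - 1))]   # replacement for high outliers
    return [min(max(v, lo), hi) for v in values]
```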


Wrangling vs cleansing

Cleansing - basically removing all kinds of errors from the raw data


Wrangling = removing or dealing with outliers; "extraction" (remember: not selection) of features; filtration (rows) and selection (columns); conversion (prefixes and suffixes must be stripped out)

Feature extraction vs feature engineering

Extraction is the process of creating, that is, extracting, new variables from existing ones in the data. E.g.: the ratio of two existing features.



Engineering is the process of creating new features by applying a transformation to an existing one
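A tiny illustration of the distinction, using hypothetical "debt" and "equity" features:

```python
import math

# Hypothetical raw features
row = {"debt": 50.0, "equity": 100.0}

# Feature extraction: a new variable derived from existing ones (a ratio)
row["debt_to_equity"] = row["debt"] / row["equity"]

# Feature engineering: a new feature from transforming an existing one
row["log_equity"] = math.log(row["equity"])
```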

Text processing

Unstructured to structured - through cleansing and preprocessing (wrangling)


HTML removal, number removal, punctuation removal, white space removal
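These four removal steps can be sketched with regular expressions (the patterns below are an illustrative minimal pass, not a production cleaner):

```python
import re

def clean_text(raw):
    """Minimal cleansing pass: HTML tags, numbers, punctuation, whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)       # HTML tag removal
    text = re.sub(r"\d+", " ", text)          # number removal
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation removal
    return re.sub(r"\s+", " ", text).strip()  # white space normalization
```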

Regex

Regular Expression - basically a search function, used to search for patterns: series of characters that occur in a certain order. E.g. finding whitespace, numbers, punctuation.

Number removal

When numbers are present in the text, they should be removed or substituted with an annotation such as /number/

Text wrangling (preprocessing)

Token is equivalent to word.


Tokenization - the process of splitting a given text into separate tokens. In other words, the text is considered to be a collection of tokens (words)



-Lowercasing


-Stopwords (the, a, is)


-Stemming (analysing, analysed -> analys) / lemmatization (analysing, analysed -> analyse). Stemming is more common
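A toy sketch of this wrangling pipeline; the stopword list and the suffix-stripping rule are simplified stand-ins (a real stemmer such as Porter's is more careful):

```python
STOPWORDS = {"the", "a", "is", "to", "of"}

def wrangle(text):
    tokens = text.lower().split()                       # tokenization + lowercasing
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    stems = []
    for t in tokens:
        # crude stemming: strip a common suffix if the stem stays long enough
        for suffix in ("ing", "ed", "es", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stems.append(t)
    return stems
```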

N gram vs bow

N-gram = a sequence of n words that occur in that order


Bow = unigram.


Advantage of n-grams is that they can be used in the same way as unigrams to build a BOW. E.g.: "Man_went_to_mall" is an n-gram that can also be treated as a single token (unigram) in a BOW.
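A minimal n-gram builder showing how joined n-grams become single tokens:

```python
def ngrams(tokens, n):
    """All n-token sequences, joined with underscores so each n-gram is one token."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```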

Text cleansing when to add annotation?

In punctuation removal, sometimes currency signs and % signs are important. So, to not lose their meaning, add them back as annotations, e.g. /%/

When some words appear very infrequently in a textual database, a technique that can address the risk of training highly complex models is:

Stemming

Text cleansing vs text preprocessing (wrangling)

Cleansing - remove punctuation, remove numbers, remove white spaces, remove HTML tags


Wrangling - lowercasing, stopword removal, stemming, lemmatization

One hot encoding

How to combine two categorical features? Through one-hot encoding - it converts categorical variables into binary 0 or 1 columns. One-hot encoding is an example of feature engineering
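A dependency-free sketch of one-hot encoding (real pipelines often use e.g. pandas.get_dummies instead):

```python
def one_hot(values):
    """Convert a categorical column into 0/1 indicator columns."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]
```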

Feature selection (unstructured)

Frequency - can be used for vocabulary pruning to remove noise features, by filtering the tokens with very high and very low frequency. Document frequency is defined as the number of documents (texts) that contain the respective token divided by the total number of documents (texts)



Chi-square test - it tests the independence of two events: occurrence of the token and occurrence of the class. The test ranks the tokens by their usefulness to each class in a text classification problem. Tokens with the highest chi-square values occur more frequently in texts associated with a particular class and therefore can be selected as features for ML



Mutual info - measures how much information is contributed by a token to a class of text. If the value is 0, the token contributes roughly the same meaning in all classes; if the value approaches 1, the token contributes more meaning to a particular class
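The document-frequency definition above can be computed directly (whitespace tokenization is an assumed simplification):

```python
def document_frequency(documents):
    """DF = (# of documents containing the token) / (total # of documents)."""
    total = len(documents)
    counts = {}
    for doc in documents:
        for token in set(doc.lower().split()):  # set(): count each document once
            counts[token] = counts.get(token, 0) + 1
    return {tok: c / total for tok, c in counts.items()}
```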

Class imbalance

One class is significantly larger than the other class. This can be addressed through oversampling of the minority class or undersampling of the majority class

Precision

Useful when the cost of a Type I error is high (rejecting a true null hypothesis). E.g.: rejecting the quality of a car when it is perfectly fine.



Ratio of correctly predicted positive classes by total predicted positive classes.



High precision indicates that a class labeled as positive is indeed positive (small FP)

Recall

Ratio of correctly predicted positive classes to actual positive classes.


Useful in situations when the cost of a Type II error is high. E.g.: a car passes the quality check but is actually faulty.



High recall indicates that the class is correctly recognized (small FN)

Accuracy and F1 score

Trading off precision and recall is a business model choice. Thus two overall performance metrics are used:



Accuracy: percentage of correctly predicted classes out of total predictions. Limitation: it assumes equal cost/damage for both kinds of error.


F1: harmonic mean of precision and recall.



F1 is more useful than accuracy when an unequal class distribution is present in the dataset and it is necessary to measure the equilibrium of precision and recall. The F1 value is always closer to the lower of recall and precision.
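The four metrics from the last three cards, computed from assumed confusion-matrix counts (TP, FP, FN, TN):

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 (harmonic mean), and accuracy from a confusion matrix."""
    precision = tp / (tp + fp)                        # correct positives / predicted positives
    recall = tp / (tp + fn)                           # correct positives / actual positives
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)        # correct predictions / all predictions
    return precision, recall, f1, accuracy
```

Note that F1 lands between precision and recall but closer to the lower of the two, as the card states.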

Receiver Operating Characteristics (ROC)

The higher the area under the curve (AUC), the better the model's performance. The more convex the curve, the better the model (in the original card's figure, the more convex Test A was the better test).
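A rough sketch: AUC can be approximated with the trapezoid rule over ROC points, so a more convex curve yields a larger area (the example curves below are illustrative):

```python
def auc(fpr, tpr):
    """Area under an ROC curve by the trapezoid rule; points sorted by FPR."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
    return area
```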


Grid search

Systematically training a machine learning model using various combinations of hyperparameter values, cross-validating each model, and determining which combination of hyperparameters results in the best model
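A bare-bones grid search over an assumed parameter grid; in practice the score function would be a cross-validated model score:

```python
import itertools

def grid_search(param_grid, score_fn):
    """Try every hyperparameter combination; return the best one and its score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)  # stand-in for a cross-validated model score
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```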

Cross validation

Technique for estimating out-of-sample error directly, by determining the error in a validation sample
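A sketch of how k-fold cross-validation partitions sample indices, with each fold serving once as the validation sample:

```python
def k_fold_splits(n_samples, k):
    """Index splits for k-fold cross-validation."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    splits = []
    for i in range(k):
        start = i * fold_size
        stop = start + fold_size if i < k - 1 else n_samples  # last fold takes the remainder
        val = indices[start:stop]                 # validation fold
        train = indices[:start] + indices[stop:]  # everything else trains the model
        splits.append((train, val))
    return splits
```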