Machine Learning: Text Preprocessing and Vectorization


-------------------------------
Topics covered in this text preprocessing and vectorization lecture:
1) Vectorization is the process of converting text (unstructured data) into structured, numeric data.
2) Three types of vectorization:
2a) bag of words model
2b) count vectorizer
2c) tf-idf vectorizer

from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)  # load the training split

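from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
text = ["the quick brown fox jumped over lazy dog"]
vector = vectorizer.fit_transform(text)  # learn the vocabulary and encode the text as word counts
print(type(vector))  # a SciPy sparse matrix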

5) Bag of words: represents each document by how often each vocabulary word occurs in it, ignoring word order. The steps are:
a) Collect data
b) Design the vocabulary
c) Create document vectors
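
The three steps a) to c) can be sketched in plain Python. This is only an illustration: the two-sentence corpus and the helper name `to_vector` are made up for the example, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# a) Collect data: a tiny hypothetical corpus of two documents.
documents = [
    "the quick brown fox jumped over the lazy dog",
    "the dog barked",
]

# b) Design the vocabulary: every distinct word, sorted for a stable column order.
tokenized = [doc.lower().split() for doc in documents]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# c) Create document vectors: one count per vocabulary word.
def to_vector(tokens):
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [to_vector(tokens) for tokens in tokenized]

print(vocabulary)
print(vectors)
```

Each vector has one position per vocabulary word, so every document gets the same length regardless of how many words it contains.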

6) TF-IDF, short for term frequency-inverse document frequency, is often preferred over the other two methods because of their limitations. Term frequency is calculated *within* a single document; inverse document frequency measures how common a term is across *all* the documents and downscales terms that appear in many of them.
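from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(twenty_train.data)  # learn the vocabulary and IDF weights, then encode the documents
print(x[0])  # display the TF-IDF weights (a meaningful numeric representation of the words) for document 0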
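
To make the term frequency and inverse document frequency idea concrete, here is a rough hand calculation in plain Python. It assumes the textbook formula tf × log(N / df) on a made-up three-document corpus; scikit-learn's TfidfVectorizer uses a smoothed IDF and normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical corpus, tokenized by whitespace for simplicity.
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [doc.split() for doc in documents]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    counts = Counter(tokens)
    return {
        # term frequency within this document * inverse document frequency
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

weights = tf_idf(tokenized[0])
for word, weight in sorted(weights.items()):
    print(word, round(weight, 3))
```

A word such as "the" that occurs in every document gets weight 0, while "mat", which occurs only in the first document, gets the highest weight there.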

Feed the resulting structured data (text converted to numbers) to ML models to classify text.
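
As a minimal sketch of that last step, the vectorizer and a classifier can be chained in a scikit-learn pipeline. The lecture notes do not say which model is used; MultinomialNB is an assumption here, and the four training sentences and their labels are invented so the example runs without downloading the 20 newsgroups data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set: two sports sentences, two tech sentences.
train_texts = [
    "the team won the football match",
    "the player scored a goal in the game",
    "the new laptop has a fast processor",
    "the software update fixed the bug",
]
train_labels = ["sports", "sports", "tech", "tech"]

# The pipeline turns text into TF-IDF vectors, then fits the classifier on them.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# New text goes through the same vectorizer before prediction.
predictions = model.predict([
    "the striker scored a late goal",
    "the laptop software has a bug",
])
print(list(predictions))
```

With the real dataset, `twenty_train.data` and `twenty_train.target` would take the place of the toy lists.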