Text Mining Basics in Python

Welcome to another module of the Practical Data Science course. In this module, we will cover the basics of text mining. After completing it, you should be comfortable with the core concepts of text mining. The module starts with the definition of text mining. After that, you will learn the text mining process. Then we will focus on applications of text mining, followed by its advantages and challenges. Finally, we will implement the basic concepts of text mining in Python.

Text mining is the process of exploring and analyzing large amounts of unstructured text data with the help of software that can find concepts, patterns, topics, keywords, and other attributes in the data.
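
To make the idea of finding keywords in unstructured text concrete, here is a minimal sketch that uses only the Python standard library. The sample text, the tiny stop-word set, and the function name top_keywords are illustrative assumptions and are not part of the course material.

```python
import re
from collections import Counter

# Illustrative stop-word set (an assumption; real lists are much longer)
STOP_WORDS = {"the", "is", "and", "a", "of", "to", "in", "was"}

def top_keywords(text, n=2):
    """Return the n most frequent non-stop-word tokens in text."""
    # Lowercase the text and treat runs of letters/apostrophes as naive word tokens
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

# Hypothetical customer feedback used only for this example
reviews = (
    "The delivery was late. The delivery box was damaged and the "
    "support team was slow. Support did refund the delivery fee."
)
print(top_keywords(reviews))
```

Even this naive count surfaces "delivery" and "support" as the recurring themes, which is the kind of signal text mining tools extract at much larger scale.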

It is also called text analytics, though some consider the two terms distinct: in their view, text analytics is the application that sorts through data sets using text mining techniques. You may also hear ‘text data mining’ or ‘document mining’ used instead of text mining. Whichever name is used, they all refer to the same thing: the process of exploring unstructured text data to discover useful information.
Text mining has become more useful for data scientists and other users since big data platforms and deep learning algorithms that can analyze large amounts of unstructured data have become available.

Mining and analyzing text help businesses find potentially valuable business insights in corporate documents, customer emails, call center logs, verbatim survey comments, social network posts, medical records, and other text-based data sources. Text mining is also increasingly used in AI chatbots and virtual agents that companies use to respond to customers automatically as part of their marketing, sales, and customer service operations.

Here is the code used in this tutorial:

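import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time downloads; resource names can differ between NLTK versions
# (newer releases may also need "punkt_tab")
for resource in ("punkt", "stopwords", "wordnet"):
    nltk.download(resource)

text = '''Hello Mr. Jones, how are you doing today? The weather is great, and city is awesome.
The sky is bright-blue. You shouldn't call for meeting today'''

# Sentence tokenization
tokenized_text = sent_tokenize(text)
print(tokenized_text)

# Word tokenization
tokenized_word = word_tokenize(text)
print(tokenized_word)

# Frequency distribution of the word tokens
frequency = FreqDist(tokenized_word)
print(frequency)

# English stop words
stop_words = set(stopwords.words("english"))
print(stop_words)

# Remove stop words; iterate over word tokens, not over sentences
filtered_sent = []
for w in tokenized_word:
    if w not in stop_words:
        filtered_sent.append(w)
print("Tokenized Sentence: ", tokenized_word)
print("Filtered Sentence: ", filtered_sent)

# Stemming: reduce each remaining word to its stem
ps = PorterStemmer()
stemmed_words = []
for w in filtered_sent:
    stemmed_words.append(ps.stem(w))
print("Filtered Sentence:", filtered_sent)
print("Stemmed Sentence:", stemmed_words)

# Lemmatization versus stemming on single words
lem = WordNetLemmatizer()
stem = PorterStemmer()

word = "Working"
print("Lemmatized Word: ", lem.lemmatize(word, "v"))
print("Stemmed Word: ", stem.stem(word))

word = "Flying"
print("Lemmatized Word: ", lem.lemmatize(word, "v"))
print("Stemmed Word: ", stem.stem(word))

# Word-tokenize a single sentence
sentence = "Albert Einstein was born in Ulm, Germany in 1879"
tokens = word_tokenize(sentence)
print(tokens)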
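
One detail worth checking in the stop-word step: the filter only removes anything when the loop runs over individual word tokens. If it runs over whole sentences, no sentence ever equals a stop word, so the "filtered" result is identical to the input. The sketch below shows the difference without NLTK; the small stop-word set and the whitespace split are illustrative stand-ins (assumptions), not what the video uses.

```python
# Illustrative stop-word subset (assumption; NLTK's English list is far longer)
stop_words = {"how", "are", "you", "doing", "the", "is", "and"}

sentences = ["how are you doing today", "the weather is great"]

# Looping over whole sentences: a full sentence never equals a stop word,
# so nothing is filtered out
filtered_sentences = [s for s in sentences if s not in stop_words]

# Looping over word tokens: stop words are matched and removed
words = [w for s in sentences for w in s.split()]
filtered_words = [w for w in words if w not in stop_words]

print(filtered_sentences)
print(filtered_words)
```

Note that the comparison is case-sensitive, so a capitalized token such as "The" would pass through unless the tokens are lowercased first.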

Comments

This is amazing, well structured and right to the point in the explanation, thanks. I am really interested in Text mining and Text analytics, please I would love to see more about it.

Gorzkun

Thank you for this video! I have a question: after setting the stopwords and looking at the filtered sentence (19:53), why is the filtered sentence equal to the tokenized sentence when the stopword list includes, e.g., "doing"? Shouldn't it be deleted from the filtered sentence? An explanation would help me a lot. Thank you!

mentalresilience

I've been trying to work out why I get this type of error. Hope you can help.

TypeError Traceback (most recent call last)
in <cell line: 8>()
6
7 word = "Working"
----> 8 print("Lemmatized Word: ", lem.lemmatize(word, "v"))
9 print("Stemmed Word: ", stem.stem(word))
10

TypeError: 'tuple' object is not callable

cristopherespiritu