Data Pipelines | Introduction to Text Analytics with R Part 3

In this installment of Introduction to Text Analytics with R, Data Pipelines, we cover:
– Exploration of textual data for pre-processing “gotchas”
– Using the quanteda package for text analytics
– Creation of a prototypical text analytics pre-processing pipeline, including (but not limited to): tokenization, lower casing, stop word removal, and stemming.
– Creation of a document-feature matrix (DFM) used to train machine learning models (a minimal sketch of the full pipeline follows below)
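
As a rough outline, the pipeline described above looks something like the following in quanteda. This is a minimal sketch, not the exact code from the video; the sms data frame and its Text column are illustrative names.

library(quanteda)

toks <- tokens(sms$Text, what = "word",
               remove_numbers = TRUE, remove_punct = TRUE,
               remove_symbols = TRUE)                    # tokenization
toks <- tokens_tolower(toks)                             # lower casing
toks <- tokens_remove(toks, stopwords("en"))             # stop word removal
toks <- tokens_wordstem(toks, language = "english")      # stemming
sms.dfm <- dfm(toks)                                     # document-feature matrix for modeling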

Kaggle Dataset:

The data and R code used in this series are available here:

Table of Contents:
0:00 Introduction
0:54 HTML escapes
8:40 Quanteda
9:40 Tokenization
16:17 Stop words
16:53 Quanteda stopwords
20:18 Stem
24:10 DFM


#datapipeline #textanalytics #rprogramming
Comments

This is excellent work! The pace of the videos is a little on the slower side, but I completely appreciate that the author is trying to cater to aspirants with proficiency levels across the spectrum.

amit

Thanks for doing this great tutorial. I have learned more here than from many other docs and tutorials, just by going to the basics.

pablomoreno

I understood everything. Thanks for the good work.

ii

At around 17:50 - I think stopwords were removed from "quanteda" (recently?) - I installed the "tm" package and ran it with quanteda and everything worked great.

davidcurrie
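
For what it's worth, quanteda now sources its stop word lists from the separate stopwords package (pulled in as a quanteda dependency), so installing tm just for this step should not be necessary. A minimal sketch, assuming train.tokens already exists as in the video's code:

library(quanteda)

head(stopwords::stopwords("en"))     # inspect the list before removing anything
train.tokens <- tokens_remove(train.tokens, stopwords::stopwords("en"))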

Really excellent and didactic, to be recommended highly! Thanks.

jean-mariemudry

Great video! Do you have tips on dealing with very large datasets? I have a dataset with 80,000 observations and 40,000 tokens (3.2 billion elements). When I try to convert it to a dataframe (or matrix), I get "Cholmod error 'problem too large' at file" ...and the problem persists even when running additional doSNOW clusters. Any help would be much appreciated!

DualFixMusic
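
A note on this kind of error: an 80,000 x 40,000 dense matrix has roughly 3.2 billion cells, which needs on the order of 25 GB of RAM in dense form, and doSNOW clusters do not reduce the memory required. A common workaround is to keep the dfm sparse and shrink the feature space before (or instead of) any dense conversion. A sketch, where my.dfm is a hypothetical existing quanteda dfm and the thresholds are illustrative:

library(quanteda)

# Drop very rare terms to shrink the feature space before any dense conversion.
my.dfm <- dfm_trim(my.dfm, min_termfreq = 5, min_docfreq = 5)
dim(my.dfm)     # far fewer columns after trimming

# Many quanteda operations (e.g., dfm_tfidf) work directly on the sparse dfm,
# so a dense as.matrix()/as.data.frame() conversion can often be avoided entirely.
my.dfm <- dfm_tfidf(my.dfm)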

Hi Dave - first of all great tutorial.

I just had one doubt: at the step where we find the dim() of the matrix, after following all the code as written in your video, I am getting the number of columns as 5742. Can you think of any reason why this would happen?

Find below the code that I have used; I have written some extra comments for my personal use.


#installing all required packages
install.packages(c("quanteda", "ggplot2", "e1071", "caret", "irlba", "randomForest"))

# setting up the working directory (the path below is incomplete; adjust it to your machine)
# setwd("science study/R study/text analytics with data sceince dojo")

# load up the .csv data and explore in RStudio.
spam.raw <- read.csv("spam.csv", stringsAsFactors = FALSE)
# View(spam.raw)

# Clean up the dataframe
spam.raw <- spam.raw[, 1:2]
names(spam.raw) <- c("Label", "Text")
# View(spam.raw)


# Check the data to see if there are missing values.
# Before starting any data analysis we should know if our data is complete
# or has any missing values that we need to account for.
# Find the number of rows that are not complete.
length(which(!complete.cases(spam.raw)))

# Convert our class label into a factor
spam.raw$Label <- as.factor(spam.raw$Label)


# The next most important step is to explore the data.
# For classification problems, find out if there is any skewness (class imbalance) in the data.
# So, let's take a look at the distribution of the class labels (i.e., ham vs. spam).
prop.table(table(spam.raw$Label))



# Next up, let's get a feel for the distribution of text lengths of the SMS
# messages by adding a new feature for the length of each message.
# We are doing this because we can see in the data that most short messages are ham and,
# on average, most long messages are spam. To test this hypothesis we are engineering a new feature.

spam.raw$TextLength <- nchar(spam.raw$Text)
summary(spam.raw$TextLength)

# As can be seen from the results, there is some skewness in the data:
# the min length is 2 and the max is 910.
# Now let's try to visualize the skewness using a histogram.

library(ggplot2)

ggplot(spam.raw, aes(x = TextLength, fill = Label)) +
  theme_bw() +
  geom_histogram(binwidth = 5) +
  labs(y = "Text Count", x = "Length of Text",
       title = "Distribution of Text Lengths with class Labels")

# Now, as can be seen from the histogram, our hypothesis is right: SMS messages with fewer
# characters are normally ham, and on average messages with more characters are spam.
# It can also be seen that up to a certain length almost all messages are ham, and those at the
# extreme ends are also ham, versus the data in the middle, which is invariably spam.
# This could help us later when engineering new features to help with prediction.

# Lecture 2


# Currently we are splitting our data into two: a training set and a test set.
# In a true project we would want to use a three way split of
# training, validation, and test.

# Also, as we know our data has a non-trivial class imbalance, we'll use
# the mighty caret package to create a random train/test split
# that ensures the correct ham/spam class label proportions,
# i.e., a random stratified split.
library(caret)

# using caret to create 70/30 stratified split
# also setting seed for reproducibility.
set.seed(32984)
indexes <- createDataPartition(spam.raw$Label, times = 1, p = 0.7, list = FALSE)
train <- spam.raw[indexes, ]
test <- spam.raw[-indexes, ]

# verify the class label proportions in both splits

prop.table(table(train$Label))
prop.table(table(test$Label))


# Lecture 3


# Basic Data exploration

# HTML -escaped ampersand character.
train$Text[21]

# [1] "I'm back &amp; we're packing the car now, I'll let you know if there's room"

# As can be seen above, the '&amp;' is just '&' in the actual message, but in the raw data it
# appears as an HTML-escaped mix of symbols. In text analytics, when using the bag-of-words model,
# we have to deal with such instances as well, and depending on the situation we have to decide
# how to handle them.

# The same situation occurs for train$Text[38] and the URL example in train$Text[357].

library(quanteda)

# Tokenize SMS text messages - Tokenisation is the process of breaking a document into individual words or tokens.

train.tokens <- tokens(train$Text, what = "word",
remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_hyphens = TRUE)
# Take a look at a specific SMS message and see how it transforms
train.tokens[[357]]

# Lower case the tokens.
train.tokens <- tokens_tolower(train.tokens)
train.tokens[[357]]

# Use quanteda's built-in stopword list for English.
# NOTE - one should always inspect stopword lists for applicability
# to your problem domain.

train.tokens <- tokens_select(train.tokens, stopwords(),
selection = "remove")
train.tokens[[357]]

# perform stemming on the tokens
train.tokens <- tokens_wordstem(train.tokens, language = "english")
train.tokens[[357]]

# create our first bag of words model.
train.tokens.dfm <- dfm(train.tokens, tolower = FALSE)

# transforming to a matrix and inspecting.
train.tokens.matrix <- as.matrix(train.tokens.dfm)
View(train.tokens.matrix[1:20, 1:100])
dim(train.tokens.matrix)

aashwinsinghal

So well and clearly presented, amazing work. Just wish there was a version with Spark + Scala/Java.

TomerBenDavid

Hi, I would like to know why you use tokenization instead of working with the tm package and a Corpus.

kmocordoba
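
For comparison, a rough tm-based version of the same pre-processing is sketched below. This is illustrative only and not the video's code; the video sticks with quanteda, which exposes tokens as a first-class object instead of going through a Corpus.

library(tm)     # the stemDocument step also needs the SnowballC package installed

corpus <- VCorpus(VectorSource(train$Text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus)     # roughly analogous to quanteda's dfm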

Can someone tell me why, at 16:00, there are two 'your' tokens even after converting the tokens to lowercase?

vijaypalmanit
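
One note that may help here: lowercasing does not deduplicate tokens within a message. If a message contains 'your' twice, the tokens object keeps both occurrences; they are only collapsed into a single feature (with a count of 2) once the dfm is built. A tiny sketch:

library(quanteda)

toks <- tokens_tolower(tokens("Your phone and your SIM card"))
toks[[1]]     # "your" still appears twice in the token list
dfm(toks)     # the dfm has a single "your" feature with a count of 2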

Thank you for the video! 

I have a question: is it possible to define your own stopword list to work with other languages?

ambsharp
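
For what it's worth, tokens_remove()/tokens_select() accept any character vector, so you can pass a custom list or another language's list from the stopwords package. A sketch; the custom words below are purely illustrative:

library(quanteda)

head(stopwords("de"))     # built-in German list
head(stopwords("es"))     # built-in Spanish list

my.stopwords <- c("gonna", "wanna", "u", "ur")
train.tokens <- tokens_remove(train.tokens, c(stopwords("en"), my.stopwords))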

Not able to convert the dfm into a matrix, as my data is a little bit large. Any alternative, please?

shubhamchauhan

An excellent series. Are there any tutorials on how to work with dump files of large text collections, e.g. how to extract all Wikipedia text files?

neguinerezaii

Is it best practice to pre-process text for your training sample and test sample separately? Or, would it be acceptable to pre-process text for your entire data set first, and then split it into training and test samples?

heatherwells
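
A common way to handle this is to fit every data-driven step (the vocabulary, and later things like tf-idf weights) on the training split only, and then project the test split onto the training feature space so that no information leaks from test into train. A sketch with quanteda, reusing the same pre-processing steps as the training code:

library(quanteda)

train.dfm <- dfm(train.tokens)     # built from training tokens only

test.tokens <- tokens(test$Text, what = "word",
                      remove_numbers = TRUE, remove_punct = TRUE,
                      remove_symbols = TRUE)
test.tokens <- tokens_tolower(test.tokens)
test.tokens <- tokens_remove(test.tokens, stopwords("en"))
test.tokens <- tokens_wordstem(test.tokens, language = "english")

# Align the test dfm to the training vocabulary; test-only terms are dropped.
test.dfm <- dfm_match(dfm(test.tokens), featnames(train.dfm))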

I would've expected remove_punct=TRUE to break the URL into four words. Instead the www....com remained together. It's pretty cool that quanteda appears to differentiate between a period and a "dot" based on context.

phunqdaphied

Can someone recommend good literature on this topic? Thx in advance.

andreasmueller

The train.tokens <- tokens( --- ) statement generated the following error:
Error in UseMethod("tokens") : no applicable method for 'tokens' applied to an object of class "factor"
Please help me.

Jawadislamian
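
That error usually means the text column was read in as a factor (the read.csv default in older versions of R when stringsAsFactors = FALSE is not set), while tokens() expects a character vector or a corpus. A sketch of two possible fixes:

library(quanteda)

# Option 1: read the CSV so that the text column stays character.
spam.raw <- read.csv("spam.csv", stringsAsFactors = FALSE)

# Option 2: convert the column before tokenizing.
train$Text <- as.character(train$Text)
train.tokens <- tokens(train$Text, what = "word",
                       remove_numbers = TRUE, remove_punct = TRUE,
                       remove_symbols = TRUE)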

Hi Dave! I'm following this course for a project I'm working on! It seems like the tokens( ) function has changed a bit. It no longer supports the argument "remove_hyphens" like you have shown in this video. Do you have suggestions that I could include in my code that would give me similar results?

Thanks!

mimisjimenez
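
In newer quanteda releases the remove_hyphens argument was, as far as I can tell, replaced by split_hyphens, which gives the equivalent behavior. A sketch against the newer API (older versions should still accept the original call shown in the video):

library(quanteda)

train.tokens <- tokens(train$Text, what = "word",
                       remove_numbers = TRUE, remove_punct = TRUE,
                       remove_symbols = TRUE, split_hyphens = TRUE)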

Hi, can we use a Word doc instead of an Excel sheet as the data file?

kylenash
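
One possibility, if the text lives in Word (or PDF) files rather than a CSV: the readtext package from the quanteda team can read .docx/.pdf/.txt files into a data frame whose text column can then feed the same pipeline. A sketch; the folder name is hypothetical:

library(readtext)
library(quanteda)

docs <- readtext("my_docs/*.docx")     # read all .docx files from a folder
doc.tokens <- tokens(docs$text, what = "word",
                     remove_numbers = TRUE, remove_punct = TRUE)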

Hello, I have so thoroughly enjoyed this series. I am starting to run the scripts myself and I am getting this error:
train.tokens.dfm <- dfm(train.tokens, toLower = FALSE)

Creating a dfm from a tokenizedTexts object ...
... indexing documents: 3,901 documents
... indexing features: 6,262 feature types
Error in checkAtAssignment("dfmSparse", "ngrams", "NULL") :
assignment of an object of class “NULL” is not valid for @‘ngrams’ in an object of class “dfmSparse”; is(value, "integer") is not TRUE

swapnaramesh
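
This error appears to come from a version mismatch: the dfmSparse and tokenizedTexts classes named in the message belong to an older (or inconsistently installed) quanteda. Updating the package and rebuilding the dfm from the tokens object usually clears it. A sketch:

install.packages("quanteda")
library(quanteda)
packageVersion("quanteda")     # confirm which version is loaded

# Current versions spell the argument in lower case: tolower, not toLower.
train.tokens.dfm <- dfm(train.tokens, tolower = FALSE)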