Data Pipelines | Introduction to Text Analytics with R Part 3

In this installment of Introduction to Text Analytics with R, Data Pipelines, we cover:
– Exploration of textual data for pre-processing “gotchas”
– Using the quanteda package for text analytics
– Creation of a prototypical text analytics pre-processing pipeline, including (but not limited to): tokenization, lower casing, stop word removal, and stemming.
– Creation of a document-feature matrix (DFM) used to train machine learning models (a minimal sketch of the full pipeline follows below)
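
As a rough outline, the pipeline described above looks something like the following in quanteda. This is a minimal sketch, not the exact code from the video; the sms data frame and its Text column are illustrative names.

library(quanteda)

toks <- tokens(sms$Text, what = "word",
               remove_numbers = TRUE, remove_punct = TRUE,
               remove_symbols = TRUE)                    # tokenization
toks <- tokens_tolower(toks)                             # lower casing
toks <- tokens_remove(toks, stopwords("en"))             # stop word removal
toks <- tokens_wordstem(toks, language = "english")      # stemming
sms.dfm <- dfm(toks)                                     # document-feature matrix for modeling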

Kaggle Dataset:

The data and R code used in this series are available here:

Table of Contents:
0:00 Introduction
0:54 HTML escapes
8:40 Quanteda
9:40 Tokenization
16:17 Stop words
16:53 Quanteda stopwords
20:18 Stem
24:10 DFM


#datapipeline #textanalytics #rprogramming
Comments

This is excellent work! The pace of the videos is a little on the slower side, but I completely appreciate that the author is trying to cater to aspirants with proficiency levels across the spectrum.

amit

Thanks for doing this great tutorial. I have learned more here than from many other docs and tutorials, just by going to the basics.

pablomoreno

I understood everything. Thanks for the good work.

ii

At around 17:50 - I think stopwords were removed from "quanteda" (recently?) - I installed the "tm" package and ran it with quanteda and everything worked great.

davidcurrie
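
For what it's worth, quanteda now sources its stop word lists from the separate stopwords package (pulled in as a quanteda dependency), so installing tm just for this step should not be necessary. A minimal sketch, assuming train.tokens already exists as in the video's code:

library(quanteda)

head(stopwords::stopwords("en"))     # inspect the list before removing anything
train.tokens <- tokens_remove(train.tokens, stopwords::stopwords("en"))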

Really excellent and didactic, to be recommended highly! Thanks.

jean-mariemudry

Great video! Do you have tips on dealing with very large datasets? I have a dataset with 80,000 observations and 40,000 tokens (3.2 billion elements). When I try to convert it to a dataframe (or matrix), I get "Cholmod error 'problem too large' at file" ...and the problem persists even when running additional doSNOW clusters. Any help would be much appreciated!

DualFixMusic
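
A note on this kind of error: an 80,000 x 40,000 dense matrix has roughly 3.2 billion cells, which needs on the order of 25 GB of RAM in dense form, and doSNOW clusters do not reduce the memory required. A common workaround is to keep the dfm sparse and shrink the feature space before (or instead of) any dense conversion. A sketch, where my.dfm is a hypothetical existing quanteda dfm and the thresholds are illustrative:

library(quanteda)

# Drop very rare terms to shrink the feature space before any dense conversion.
my.dfm <- dfm_trim(my.dfm, min_termfreq = 5, min_docfreq = 5)
dim(my.dfm)     # far fewer columns after trimming

# Many quanteda operations (e.g., dfm_tfidf) work directly on the sparse dfm,
# so a dense as.matrix()/as.data.frame() conversion can often be avoided entirely.
my.dfm <- dfm_tfidf(my.dfm)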

Hi Dave - first of all great tutorial.

I just had one doubt: at the step where we find the dim() of the matrix, after following all the code as written in your video, I am getting the number of columns as 5742. Can you think of any reason why this would happen?

Find below the code that I have used; I have written some extra comments for my personal use.


#installing all required packages
install.packages(c("quanteda", "ggplot2", "e1071", "caret", "irlba", "randomForest"))

# setting up the working directory (the path below is incomplete; adjust it to your machine)
# setwd("science study/R study/text analytics with data sceince dojo")

# load up the .csv data and explore in RStudio.
spam.raw <- read.csv("spam.csv", stringsAsFactors = FALSE)
# View(spam.raw)

# Clean up the dataframe
spam.raw <- spam.raw[, 1:2]
names(spam.raw) <- c("Label", "Text")
# View(spam.raw)


# Check the data to see if there are missing values.
# Before starting any data analysis we should know if our data is complete
# or has any missing values that we need to account for.
# Find the number of rows that are not complete.
length(which(!complete.cases(spam.raw)))

# Convert our class label into a factor
spam.raw$Label <- as.factor(spam.raw$Label)


# The next most important step is to explore the data.
# For classification problems, find out if there is any skewness (class imbalance) in the data.
# So, let's take a look at the distribution of the class labels (i.e., ham vs. spam).
prop.table(table(spam.raw$Label))



# Next up, let's get a feel for the distribution of text lengths of the SMS
# messages by adding a new feature for the length of each message.
# We are doing this because we can see in the data that most short messages are ham and,
# on average, most long messages are spam. To test this hypothesis we are engineering a new feature.

spam.raw$TextLength <- nchar(spam.raw$Text)
summary(spam.raw$TextLength)

# As can be seen from the results, there is some skewness in the data:
# the min length is 2 and the max is 910.
# Now let's try to visualize the skewness using a histogram.

library(ggplot2)

ggplot(spam.raw, aes(x = TextLength, fill = Label)) +
  theme_bw() +
  geom_histogram(binwidth = 5) +
  labs(y = "Text Count", x = "Length of Text",
       title = "Distribution of Text Lengths with class Labels")

# Now, as can be seen from the histogram, our hypothesis is right: SMS messages with fewer
# characters are normally ham, and on average messages with more characters are spam.
# It can also be seen that up to a certain length almost all messages are ham, and those at the
# extreme ends are also ham, versus the data in the middle, which is invariably spam.
# This could help us later when engineering new features to help with prediction.

# Lecture 2


# Currently we are splitting our data into two: a training set and a test set.
# In a true project we would want to use a three way split of
# training, validation, and test.

# Also, as we know our data has a non-trivial class imbalance, we'll use
# the mighty caret package to create a random train/test split
# that ensures the correct ham/spam class label proportions,
# i.e., a random stratified split.
library(caret)

# using caret to create 70/30 stratified split
# also setting seed for reproducibility.
set.seed(32984)
indexes <- createDataPartition(spam.raw$Label, times = 1, p = 0.7, list = FALSE)
train <- spam.raw[indexes, ]
test <- spam.raw[-indexes, ]

# verify the class label proportions in both splits

prop.table(table(train$Label))
prop.table(table(test$Label))


# Lecture 3


# Basic Data exploration

# HTML -escaped ampersand character.
train$Text[21]

# [1] "I'm back &amp; we're packing the car now, I'll let you know if there's room"

# As can be seen above, the '&amp;' is just '&' in the actual message, but in the raw data it
# appears as an HTML-escaped mix of symbols. In text analytics, when using the bag-of-words model,
# we have to deal with such instances as well, and depending on the situation we have to decide
# how to handle them.

# The same situation occurs for train$Text[38] and the URL example in train$Text[357].

library(quanteda)

# Tokenize SMS text messages - Tokenisation is the process of breaking a document into individual words or tokens.

train.tokens <- tokens(train$Text, what = "word",
remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_hyphens = TRUE)
# Take a look at a specific SMS message and see how it transforms
train.tokens[[357]]

# Lower case the tokens.
train.tokens <- tokens_tolower(train.tokens)
train.tokens[[357]]

# Use quanteda's built-in stopword list for English.
# NOTE - one should always inspect stopword lists for applicability
# to your problem domain.

train.tokens <- tokens_select(train.tokens, stopwords(),
selection = "remove")
train.tokens[[357]]

# perform stemming on the tokens
train.tokens <- tokens_wordstem(train.tokens, language = "english")
train.tokens[[357]]

# create our first bag of words model.
train.tokens.dfm <- dfm(train.tokens, tolower = FALSE)

# transforming to a matrix and inspecting.
train.tokens.matrix <- as.matrix(train.tokens.dfm)
View(train.tokens.matrix[1:20, 1:100])
dim(train.tokens.matrix)

aashwinsinghal

So well and clearly presented, amazing work. Just wish there was a version with Spark + Scala/Java.

TomerBenDavid

Hi, I would like to know why you use tokenization instead of working with the tm package and a Corpus.

kmocordoba
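
For comparison, a rough tm-based version of the same pre-processing is sketched below. This is illustrative only and not the video's code; the video sticks with quanteda, which exposes tokens as a first-class object instead of going through a Corpus.

library(tm)     # the stemDocument step also needs the SnowballC package installed

corpus <- VCorpus(VectorSource(train$Text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus)     # roughly analogous to quanteda's dfm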

Can someone tell me why, at 16:00, there are two 'your' tokens even after converting the tokens to lowercase?

vijaypalmanit
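
One note that may help here: lowercasing does not deduplicate tokens within a message. If a message contains 'your' twice, the tokens object keeps both occurrences; they are only collapsed into a single feature (with a count of 2) once the dfm is built. A tiny sketch:

library(quanteda)

toks <- tokens_tolower(tokens("Your phone and your SIM card"))
toks[[1]]     # "your" still appears twice in the token list
dfm(toks)     # the dfm has a single "your" feature with a count of 2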

Thank you for the video! 

I have a question: is it possible to define your own stopword list to work with other languages?

ambsharp
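
For what it's worth, tokens_remove()/tokens_select() accept any character vector, so you can pass a custom list or another language's list from the stopwords package. A sketch; the custom words below are purely illustrative:

library(quanteda)

head(stopwords("de"))     # built-in German list
head(stopwords("es"))     # built-in Spanish list

my.stopwords <- c("gonna", "wanna", "u", "ur")
train.tokens <- tokens_remove(train.tokens, c(stopwords("en"), my.stopwords))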

Not able to convert the dfm into a matrix, as my data is a little bit large. Any alternative, please?

shubhamchauhan

An excellent series. Are there any tutorials on how to work with dump files of large text collections, e.g. how to extract all Wikipedia text files?

neguinerezaii

Is it best practice to pre-process text for your training sample and test sample separately? Or, would it be acceptable to pre-process text for your entire data set first, and then split it into training and test samples?

heatherwells
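
A common way to handle this is to fit every data-driven step (the vocabulary, and later things like tf-idf weights) on the training split only, and then project the test split onto the training feature space so that no information leaks from test into train. A sketch with quanteda, reusing the same pre-processing steps as the training code:

library(quanteda)

train.dfm <- dfm(train.tokens)     # built from training tokens only

test.tokens <- tokens(test$Text, what = "word",
                      remove_numbers = TRUE, remove_punct = TRUE,
                      remove_symbols = TRUE)
test.tokens <- tokens_tolower(test.tokens)
test.tokens <- tokens_remove(test.tokens, stopwords("en"))
test.tokens <- tokens_wordstem(test.tokens, language = "english")

# Align the test dfm to the training vocabulary; test-only terms are dropped.
test.dfm <- dfm_match(dfm(test.tokens), featnames(train.dfm))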

I would've expected remove_punct=TRUE to break the URL into four words. Instead the www....com remained together. It's pretty cool that quanteda appears to differentiate between a period and a "dot" based on context.

phunqdaphied

Can someone recommend good literature on this topic? Thx in advance.

andreasmueller

The train.tokens <- tokens( --- ) statement generated the following error:
Error in UseMethod("tokens") : no applicable method for 'tokens' applied to an object of class "factor"
Please help me.

Jawadislamian
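
That error usually means the text column was read in as a factor (the read.csv default in older versions of R when stringsAsFactors = FALSE is not set), while tokens() expects a character vector or a corpus. A sketch of two possible fixes:

library(quanteda)

# Option 1: read the CSV so that the text column stays character.
spam.raw <- read.csv("spam.csv", stringsAsFactors = FALSE)

# Option 2: convert the column before tokenizing.
train$Text <- as.character(train$Text)
train.tokens <- tokens(train$Text, what = "word",
                       remove_numbers = TRUE, remove_punct = TRUE,
                       remove_symbols = TRUE)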

Hi Dave! I'm following this course for a project I'm working on! It seems like the tokens( ) function has changed a bit. It no longer supports the argument "remove_hyphens" like you have shown in this video. Do you have suggestions that I could include in my code that would give me similar results?

Thanks!

mimisjimenez
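
In newer quanteda releases the remove_hyphens argument was, as far as I can tell, replaced by split_hyphens, which gives the equivalent behavior. A sketch against the newer API (older versions should still accept the original call shown in the video):

library(quanteda)

train.tokens <- tokens(train$Text, what = "word",
                       remove_numbers = TRUE, remove_punct = TRUE,
                       remove_symbols = TRUE, split_hyphens = TRUE)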

Hi, can we use a Word doc instead of an Excel sheet as the data file?

kylenash
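
One possibility, if the text lives in Word (or PDF) files rather than a CSV: the readtext package from the quanteda team can read .docx/.pdf/.txt files into a data frame whose text column can then feed the same pipeline. A sketch; the folder name is hypothetical:

library(readtext)
library(quanteda)

docs <- readtext("my_docs/*.docx")     # read all .docx files from a folder
doc.tokens <- tokens(docs$text, what = "word",
                     remove_numbers = TRUE, remove_punct = TRUE)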

Hello, I have so thoroughly enjoyed this series. I am starting to run the scripts myself and I am getting this error:
train.tokens.dfm <- dfm(train.tokens, toLower = FALSE)

Creating a dfm from a tokenizedTexts object ...
... indexing documents: 3,901 documents
... indexing features: 6,262 feature types
Error in checkAtAssignment("dfmSparse", "ngrams", "NULL") :
assignment of an object of class “NULL” is not valid for @‘ngrams’ in an object of class “dfmSparse”; is(value, "integer") is not TRUE

swapnaramesh
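
This error appears to come from a version mismatch: the dfmSparse and tokenizedTexts classes named in the message belong to an older (or inconsistently installed) quanteda. Updating the package and rebuilding the dfm from the tokens object usually clears it. A sketch:

install.packages("quanteda")
library(quanteda)
packageVersion("quanteda")     # confirm which version is loaded

# Current versions spell the argument in lower case: tolower, not toLower.
train.tokens.dfm <- dfm(train.tokens, tolower = FALSE)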