R Tutorial: Regular expression basics


---
Hello and welcome to this introductory course on natural language processing in R.

My name is Kasey Jones, and I will be helping you along this journey to master the fundamental elements of NLP in R.

So what is natural language processing? Basically, NLP focuses on using computers to analyze and understand text.

In this course, we will be ambitious and cover topics such as classification, topic modeling, named entity recognition, sentiment analysis, and others. Each topic that we cover will prepare you for real analysis of text and help you better understand how you can apply NLP to learn from your data. Let's jump right into our first examples by exploring regular expressions.

Regular expressions are just sequences or patterns of characters used to search text. Analysts use regular expressions for all kinds of tasks, including searching files in a directory using the command line, finding articles that contain a specific pattern of text, replacing specific strings, and several other use cases. The most general way to use regular expressions is to specify the text you want to search and the pattern you want to find.

Let me give two concrete examples in R. You could simply search some words for every mention of a number and return the index for where those words occur, or look for all words that include an apostrophe and return those words. If you are new to regular expressions, seeing \d probably doesn't make a whole lot of sense. So let's look at what \d is, and some other basic regular expressions.
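The two searches just described can be sketched roughly as follows. The vector `words` and its contents are made up for illustration and are not the data from the video. Note that inside an R string literal the backslash must itself be escaped, so \d is typed as "\\d".

```r
# Hypothetical sample vector; the words are invented for this sketch.
words <- c("apple", "route66", "don't", "3rd", "it's", "plain")

# Indices of the elements that contain at least one digit.
# \d is written "\\d" because R string literals escape the backslash.
digit_idx <- grep("\\d", words)
print(digit_idx)         # 2 4

# Elements that contain an apostrophe; value = TRUE returns the
# matching strings themselves instead of their indices.
apostrophe_words <- grep("'", words, value = TRUE)
print(apostrophe_words)  # "don't" "it's"
```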

In order to search text for a pattern, you need to use the correct syntax. This could be a single token, such as \w, representing a simple search, or a large group of characters, representing a complex search. First up, we look for alphanumeric characters (letters, digits, and the underscore) with \w.

Next, we can find any single digit with \d. In order to expand these searches past a single letter or number, we use something called a quantifier (sometimes loosely called a wildcard).

In this case, adding "+" after \w or \d allows us to find a word or a number of any length.

Next, we can look for spaces and other whitespace with \s, allowing us to find breaks in long sequences of characters.

Finally, we can negate any of these expressions by using a capital letter. In this case, \S looks for any non-space character. We can do the same for non-digits with \D, and non-alphanumerics with \W.
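As a rough illustration of the patterns covered so far, the sketch below pulls every match of each pattern out of one sentence. The sample text and the helper `extract_all` are hypothetical names introduced here; the helper simply wraps base R's `gregexpr` and `regmatches`.

```r
# Hypothetical sample sentence for this sketch.
text <- "Room 42 opens at 9 am"

# Hypothetical helper: return every match of a pattern in a single string.
extract_all <- function(pattern, x) regmatches(x, gregexpr(pattern, x))[[1]]

single_digits <- extract_all("\\d", text)    # "4" "2" "9"
numbers       <- extract_all("\\d+", text)   # "42" "9"
tokens        <- extract_all("\\w+", text)   # "Room" "42" "opens" "at" "9" "am"
spaces        <- extract_all("\\s", text)    # five single spaces
non_space     <- extract_all("\\S+", text)   # same six tokens as \w+ here
non_digits    <- extract_all("\\D+", text)   # "Room " " opens at " " am"
```

Without "+", each match is a single character; with "+", consecutive matching characters are grouped into one match.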

In R, writing the expression is only half the battle. We also need to use the right function. Both base R and the stringr package have great functions for searching text.

Using base R, we can use two very common functions: grep, which finds the elements of a vector of strings that match a pattern, and gsub, which replaces all matches of the regular expression in a string or vector. There are many other functions we can use, but if we master these two, we can master the others as well.
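A minimal sketch of both functions on a made-up vector follows. The `reviews` data is hypothetical and only meant to show the shape of each call.

```r
# Hypothetical sample data for this sketch.
reviews <- c("Great product, 5 stars", "Not worth it", "Arrived in 2 days")

# grep returns the positions of the elements that match the pattern.
match_idx <- grep("\\d", reviews)                  # 1 3

# With value = TRUE it returns the matching elements instead.
match_val <- grep("\\d", reviews, value = TRUE)

# gsub replaces every match in every element; elements with no match
# are returned unchanged.
masked <- gsub("\\d", "#", reviews)

# Combining \s with "+" collapses runs of whitespace into one space.
squeezed <- gsub("\\s+", " ", "too    many   spaces")
print(squeezed)                                    # "too many spaces"
```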

There are several great resources out there for practicing regular expressions and learning about the complex patterns that can be created. I have provided a link for one such example here. If you want to learn more about combining expressions, or including certain letters while excluding others, I would suggest exploring this resource.

Let's explore a few examples of using regular expressions.

Comments

The matching characters exercises were very useful. Thanks!

Jaffizy

Would you mind fixing the CC sometime? It is way off. Thanks!

SylvanBat

It is not like the usual regular expressions in Unix. In R we add an extra \ for \d. I am still not used to this change, or to how to change a pattern to make it compatible with R.

bhargavapothakamuri