Python Regular Expressions - Named Groups

In this video I wish to talk about Named Groups.

References:
Mastering Regular Expressions by J Friedl

Script:
In this video I wish to talk about Groups.

So let's start with a simple example.
Let's assume we went grocery shopping earlier in the day and have a list of the things we bought and their prices.
Now we wish to extract these prices.

So we build a pattern assuming that the decimal and numbers following it are optional.
We then use the findall function to find all the costs.
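
As a minimal sketch of this step: the shopping list below is invented for illustration, and the exact pattern used in the video is not shown in the script, so `receipt` and `price_pattern` are assumptions.

```python
import re

# Hypothetical shopping list; items and prices are invented for illustration.
receipt = "bread 2.50, milk 3, eggs 4.75, salt 1"

# One or more digits, then an optional decimal point and optional digits.
price_pattern = r"\d+\.?\d*"

# With no groups in the pattern, findall returns the full matches as strings.
prices = re.findall(price_pattern, receipt)
print(prices)  # ['2.50', '3', '4.75', '1']
```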

Now perhaps it would be nice to separate each cost into its integer and decimal part.
Can we do it in the pattern rather than doing it later?

Yes we can, and groups help us do this.

So all we do is enclose the components we want in parentheses.
So one set of parentheses for the integer part and one for the decimal part.

Let's test again. As you can see, the items are grouped the way we want them to be.
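
A sketch of the grouped version, using the same invented shopping list as an assumption. Once the pattern has more than one group, findall returns a tuple per match instead of a single string.

```python
import re

# Hypothetical shopping list; items and prices are invented for illustration.
receipt = "bread 2.50, milk 3, eggs 4.75, salt 1"

# First group: integer part. Second group: decimal part (may be empty).
grouped_pattern = r"(\d+)\.?(\d*)"

parts = re.findall(grouped_pattern, receipt)
print(parts)  # [('2', '50'), ('3', ''), ('4', '75'), ('1', '')]
```

Prices without a decimal part still match, and their second group is simply an empty string.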

Now let's consider a second example that goes into more detail.

We have a file that contains names of actors.
We have the name, phone number, email address, age, country they come from and their twitter id.

Python allows us to easily read such a file into a structure called a Data Frame.

But let us take a step back.

Tasks generally are not that simple.
In most cases, after reading data from an input source like a database or file, we need to do some preprocessing.

So consider some requirements given by a client for this Actor information.

1. The name should be broken up into a First Name and a Last Name.
2. The telephone number should have a dash between the country code and the number.
3. The Country should be replaced with the two-letter ISO country code.
4. The Twitter handle should be prefixed with the @ sign.

As you can see, our requirements call not just for searching for a pattern but also for search and replace.
Another thing to note is that our solution has additional columns. For example, Name is broken into First Name and Last Name.

So where do we start?

We will start by deciding on the column names we need in our final solution.

The next thing we will do is group each line into the above 7 columns.
So we need a Regex pattern.
But this is going to be a long pattern. So, from a readability point of view, we will distribute the pattern across multiple lines.

I am building a basic pattern here. The idea is to get an MVP, or minimum viable product.
We can always fine-tune later.

So let us quickly test things.
Rather than reading the entire file, let's just copy the first line and test.

Ok, great. We see the content of each group.
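
The quick test could be sketched roughly as follows. The actual file is not shown in the script, so the sample `line`, the comma-separated layout, and the sub-patterns are all assumptions made for illustration. The re.VERBOSE flag is what lets the pattern span multiple lines with comments.

```python
import re

# Hypothetical first line of the actor file; layout and values are assumed.
line = "Jane Doe,15551234567,jane.doe@example.com,42,Canada,janedoe"

# A basic multi-line pattern with seven unnamed groups.
# In verbose mode, whitespace and '#' comments inside the pattern are ignored.
pattern = r"""
    (\w+)\s+(\w+),   # first name and last name
    (\d+),           # phone number
    ([^,]+),         # email address
    (\d+),           # age
    ([^,]+),         # country
    (\w+)            # twitter id
"""

result = re.findall(pattern, line, re.VERBOSE)
print(result)
# [('Jane', 'Doe', '15551234567', 'jane.doe@example.com', '42', 'Canada', 'janedoe')]
```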
But it would be better if we knew which group the data belonged to - Enter the concept of named groups.

Rather than using these unnamed groups, we will create named groups, like so.

Please note the syntax for the named group.
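
For reference, Python's named group syntax is `(?P<name>...)`. Here is a small sketch on the earlier price example; the group names `integer` and `decimal` are chosen here for illustration.

```python
import re

# (?P<name>...) behaves like an ordinary group but can also be looked up by name.
price_match = re.search(r"(?P<integer>\d+)\.?(?P<decimal>\d*)", "eggs 4.75")

print(price_match.group("integer"))  # 4
print(price_match.group("decimal"))  # 75
print(price_match.group(1))          # 4 (named groups are still numbered)
```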

Now I will do things a bit differently in order to see the effect of named groups.
First, let us use the match function to match the pattern.

The match and search functions return a match object (or None when nothing matches) that has some useful methods which we can invoke.
The findall function, on the other hand, returns a list of strings, or a list of tuples when the pattern contains more than one group.

Right, now that we have our match object, we can invoke its methods.
So we can use the groups() method to see all the subgroups in the match.

More importantly though, in this case, we can see all the named groups using the groupdict() method which, no surprises, creates a dictionary object.
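
A sketch of what this looks like in practice. The sample `line`, the comma-separated layout, and the group names are assumptions, since the script does not show the actual data or pattern.

```python
import re

# Hypothetical first line of the actor file; layout and values are assumed.
line = "Jane Doe,15551234567,jane.doe@example.com,42,Canada,janedoe"

pattern = r"""
    (?P<first_name>\w+)\s+(?P<last_name>\w+),
    (?P<phone>\d+),
    (?P<email>[^,]+),
    (?P<age>\d+),
    (?P<country>[^,]+),
    (?P<twitter>\w+)
"""

# match() anchors at the start of the string and returns None on failure.
m = re.match(pattern, line, re.VERBOSE)
if m is not None:
    print(m.groups())     # tuple of all seven subgroups, in order
    print(m.groupdict())  # dict mapping each group name to its text
```

Every captured value is a string, so a field such as age would still need converting if a number is wanted.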

Now it will certainly help if we could run our regex for all items in the file and not just the first line.

So let us do a few things.

Firstly, since we may need to use this regex pattern multiple times, let us create a compiled object.
We will use the compile function for this. This also helps us bundle any important arguments into the compiled object.

Notice how we have chained multiple flags in the function.
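
As a hedged sketch of combining flags: the script does not say which flags were used, so re.VERBOSE and re.MULTILINE here are assumptions, as are the shortened pattern and the sample text. Flags are combined with the `|` operator.

```python
import re

# Shortened, hypothetical pattern: a name and an age on one line.
actor_re = re.compile(
    r"""
    ^(?P<name>[^,\n]+),   # name, anchored at the start of a line
    (?P<age>\d+)$         # age, anchored at the end of that line
    """,
    re.VERBOSE | re.MULTILINE,  # both flags are stored in the compiled object
)

# With MULTILINE, ^ and $ also match at line boundaries, so the second line is found.
text = "header\nJane Doe,42"
found = actor_re.search(text)
print(found.group("name"))  # Jane Doe
```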

We will now read the contents of our file.

Now let us iterate over each line and match our pattern.
The finditer() function achieves this for us.

But we want to see the named groups, so let's call the groupdict() method.
Let us store all this information in a list.
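
Putting these steps together might look like the sketch below. An in-memory string stands in for the file contents, and the sample rows, layout, group names, and flags are all assumptions for illustration.

```python
import re

# Stand-in for the text read from the actor file; both rows are invented.
file_text = (
    "Jane Doe,15551234567,jane.doe@example.com,42,Canada,janedoe\n"
    "John Roe,442071234567,john.roe@example.com,37,UK,johnroe\n"
)

actor_re = re.compile(
    r"""
    ^(?P<first_name>\w+)\s+(?P<last_name>\w+),
    (?P<phone>\d+),
    (?P<email>[^,\n]+),
    (?P<age>\d+),
    (?P<country>[^,\n]+),
    (?P<twitter>\w+)$
    """,
    re.VERBOSE | re.MULTILINE,
)

# finditer yields one match object per matching line; keep each as a dict.
records = [m.groupdict() for m in actor_re.finditer(file_text)]
print(len(records))           # 2
print(records[1]["country"])  # UK
```

A list of dictionaries that share the same keys is a convenient shape for building a table afterwards.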

And to conclude, we will store this information into a Data Frame.

Our work is not complete yet.
We still need to do the data transformations described earlier in the video.
But we will do so in the next video.