Unicode Normalization for NLP in Python

โ„•๐• -๐• ๐•Ÿ๐•– ๐•š๐•Ÿ ๐•ฅ๐•™๐•–๐•š๐•ฃ ๐•ฃ๐•š๐•˜๐•™๐•ฅ ๐•ž๐•š๐•Ÿ๐•• ๐•จ๐• ๐•ฆ๐•๐•• ๐•–๐•ง๐•–๐•ฃ ๐•ฆ๐•ค๐•– ๐•ฅ๐•™๐•–๐•ค๐•– ๐•’๐•Ÿ๐•Ÿ๐• ๐•ช๐•š๐•Ÿ๐•˜ ๐•—๐• ๐•Ÿ๐•ฅ ๐•ง๐•’๐•ฃ๐•š๐•’๐•Ÿ๐•ฅ๐•ค. ๐•‹๐•™๐•– ๐•จ๐• ๐•ฃ๐•ค๐•ฅ ๐•ฅ๐•™๐•š๐•Ÿ๐•˜, ๐•š๐•ค ๐•š๐•— ๐•ช๐• ๐•ฆ ๐••๐•  ๐•’๐•Ÿ๐•ช ๐•—๐• ๐•ฃ๐•ž ๐• ๐•— โ„•๐•ƒโ„™ ๐•’๐•Ÿ๐•• ๐•ช๐• ๐•ฆ ๐•™๐•’๐•ง๐•– ๐•”๐•™๐•’๐•ฃ๐•’๐•”๐•ฅ๐•–๐•ฃ๐•ค ๐•๐•š๐•œ๐•– ๐•ฅ๐•™๐•š๐•ค ๐•š๐•Ÿ ๐•ช๐• ๐•ฆ๐•ฃ ๐•š๐•Ÿ๐•ก๐•ฆ๐•ฅ, ๐•ช๐• ๐•ฆ๐•ฃ ๐•ฅ๐•–๐•ฉ๐•ฅ ๐•“๐•–๐•”๐• ๐•ž๐•–๐•ค ๐•”๐• ๐•ž๐•ก๐•๐•–๐•ฅ๐•–๐•๐•ช ๐•ฆ๐•Ÿ๐•ฃ๐•–๐•’๐••๐•’๐•“๐•๐•–.

We also find that text like this is incredibly common, particularly on social media.

Another pain-point comes from diacritics (the little glyphs in Ç, é, Å) that you'll find in almost every European language.

These characters have a hidden property that can trip up any NLP model. Take a look at the Unicode code points for two versions of Ç:

Latin capital letter C with cedilla: \u00C7

Latin capital letter C + combining cedilla: \u0043\u0327

The two are different code point sequences, so they compare as unequal strings despite rendering as the same character.
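
The mismatch between the two forms of Ç can be checked directly with Python's standard `unicodedata` module. This is only a minimal sketch; the variable names are illustrative and not taken from the video.

```python
import unicodedata

composed = "\u00C7"          # Latin capital letter C with cedilla (one code point)
decomposed = "\u0043\u0327"  # Latin capital letter C + combining cedilla (two code points)

# The raw strings differ even though they render identically.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# NFC composes to the single code point; NFD decomposes to base + combining mark.
nfc_equal = unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)
nfd_equal = unicodedata.normalize("NFD", composed) == unicodedata.normalize("NFD", decomposed)
print(nfc_equal, nfd_equal)            # True True
```

Either form works for comparison as long as the same one is applied to every string before it reaches the model or tokenizer.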

To deal with all of these text variants we need to use Unicode normalization - which we will cover in this video.
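
As a rough illustration of what normalization does to font variants like the double-struck text at the top, here is a minimal sketch using `unicodedata`. It assumes the compatibility form NFKC is the one applied; the description itself does not name a form, so treat that choice as an assumption.

```python
import unicodedata

# "NLP text" written with double-struck letters, given as escapes so the
# example does not depend on how this page is encoded.
fancy = "\u2115\U0001D543\u2119 \U0001D565\U0001D556\U0001D569\U0001D565"

# Canonical normalization (NFC) leaves font variants untouched.
nfc = unicodedata.normalize("NFC", fancy)
print(nfc == fancy)  # True

# Compatibility normalization (NFKC) maps them to plain letters.
nfkc = unicodedata.normalize("NFKC", fancy)
print(nfkc)          # NLP text
```

NFKC discards the stylistic distinction, which is usually what an NLP pipeline wants but is lossy if the styling itself carries meaning.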

Comments

That's great bro, clean and simple explanation, loved it a lot!

SuperMaker.M

Thank you very much, you were a great help.

mayankmaurya

What method do you use to normalize punctuation? For example, “ vs ". I attempted to use Unicode normalization with NFKC, but it didn't normalize these two quotation marks to be equal (==). In addition to quotation marks, there are many other punctuation marks that are nearly equivalent but are not normalized using NFKC. Any recommendations or thoughts about normalizing them?

dshefman
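
On the punctuation question above: curly quotes such as U+201C have no compatibility decomposition, so NFKC leaves them unchanged. One possible workaround, sketched here on the assumption that a small hand-written mapping is acceptable, is to apply `str.translate` after normalizing. `PUNCT_MAP` and `normalize_text` are hypothetical names, not part of any library, and the table covers only a few characters.

```python
import unicodedata

# Hypothetical, hand-written mapping; extend it for the punctuation in your data.
PUNCT_MAP = str.maketrans({
    "\u201C": '"',  # left double quotation mark
    "\u201D": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})

def normalize_text(text: str) -> str:
    """Apply NFKC, then fold the mapped punctuation to ASCII."""
    return unicodedata.normalize("NFKC", text).translate(PUNCT_MAP)

curly = "\u201Chello\u201D"
print(unicodedata.normalize("NFKC", curly) == '"hello"')  # False
print(normalize_text(curly) == '"hello"')                 # True
```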