Unicode Normalization for NLP in Python


ℕ𝕠-𝕠𝕟𝕖 𝕚𝕟 𝕥𝕙𝕖𝕚𝕣 𝕣𝕚𝕘𝕙𝕥 𝕞𝕚𝕟𝕕 𝕨𝕠𝕦𝕝𝕕 𝕖𝕧𝕖𝕣 𝕦𝕤𝕖 𝕥𝕙𝕖𝕤𝕖 𝕒𝕟𝕟𝕠𝕪𝕚𝕟𝕘 𝕗𝕠𝕟𝕥 𝕧𝕒𝕣𝕚𝕒𝕟𝕥𝕤. 𝕋𝕙𝕖 𝕨𝕠𝕣𝕤𝕥 𝕥𝕙𝕚𝕟𝕘, 𝕚𝕤 𝕚𝕗 𝕪𝕠𝕦 𝕕𝕠 𝕒𝕟𝕪 𝕗𝕠𝕣𝕞 𝕠𝕗 ℕ𝕃ℙ 𝕒𝕟𝕕 𝕪𝕠𝕦 𝕙𝕒𝕧𝕖 𝕔𝕙𝕒𝕣𝕒𝕔𝕥𝕖𝕣𝕤 𝕝𝕚𝕜𝕖 𝕥𝕙𝕚𝕤 𝕚𝕟 𝕪𝕠𝕦𝕣 𝕚𝕟𝕡𝕦𝕥, 𝕪𝕠𝕦𝕣 𝕥𝕖𝕩𝕥 𝕓𝕖𝕔𝕠𝕞𝕖𝕤 𝕔𝕠𝕞𝕡𝕝𝕖𝕥𝕖𝕝𝕪 𝕦𝕟𝕣𝕖𝕒𝕕𝕒𝕓𝕝𝕖.

We also find that text like this is incredibly common - particularly on social media.

Another pain-point comes from diacritics (the little glyphs in Ç, é, Å) that you'll find in almost every European language.

These characters have a hidden property that can trip up any NLP model - take a look at the Unicode code points for two versions of Ç:

Latin capital letter C with cedilla: \u00C7

Latin capital letter C + combining cedilla: \u0043\u0327

The two are completely different code point sequences, despite rendering as the same character.
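A minimal sketch of that mismatch, using only Python's standard `unicodedata` module. The variable names are illustrative and not taken from the video:

```python
import unicodedata

composed = "\u00C7"          # Latin capital letter C with cedilla (one code point)
decomposed = "\u0043\u0327"  # Latin capital letter C + combining cedilla (two code points)

# The strings render identically but do not compare equal.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# NFC composes to the single code point; NFD decomposes into base + combining mark.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed)    # True
print(nfd == decomposed)  # True
```

Normalizing both sides to the same form before comparing or tokenizing is what makes the two spellings match.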

To deal with all of these text variants we need to use Unicode normalization - which we will cover in this video.
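As a rough illustration of how this applies to the font variants at the top of this description, the sketch below runs a short double-struck string through Python's `unicodedata.normalize`. It assumes the compatibility form NFKC is the one wanted, since the canonical forms (NFC, NFD) leave these characters untouched:

```python
import unicodedata

fancy = "ℕ𝕃ℙ"  # double-struck letters, as in the opening paragraph

# Canonical normalization does not change font-variant characters.
nfc = unicodedata.normalize("NFC", fancy)
print(nfc == fancy)  # True

# Compatibility normalization maps them to their plain equivalents.
nfkc = unicodedata.normalize("NFKC", fancy)
print(nfkc)  # NLP
```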



Comments

That's great bro, clean and simple explanation loved it a lot !

SuperMaker.M

Thank you very much, you were a great help.

mayankmaurya

What method do you use to normalize punctuation? For example, “ vs ". I attempted to use unicode normalization with NFKC, but it didn't normalize these two quotation marks to be equal (==). In addition to quotation marks, there are many other punctuation marks that are nearly equivalent but are not normalized using NFKC. Any recommendations or thoughts about normalizing them?

dshefman
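On the punctuation question above: NFKC leaves curly quotes alone because they have no compatibility mapping to the ASCII quote characters. One possible workaround, sketched here as an assumption rather than anything shown in the video, is to apply a hand-built translation table after NFKC. `PUNCT_MAP` and `normalize_text` are hypothetical names, and the table covers only a few characters chosen for illustration:

```python
import unicodedata

# Hypothetical, hand-picked mapping; extend it for the punctuation in your data.
PUNCT_MAP = str.maketrans({
    "\u201C": '"',  # left double quotation mark
    "\u201D": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
})

def normalize_text(text: str) -> str:
    """Apply NFKC, then map the remaining punctuation variants to ASCII."""
    return unicodedata.normalize("NFKC", text).translate(PUNCT_MAP)

# NFKC alone does not turn a curly quote into a straight one.
print(unicodedata.normalize("NFKC", "\u201C") == '"')  # False

# With the extra table, the two quote styles compare equal.
print(normalize_text("\u201Chello\u201D") == '"hello"')  # True
```

Whether collapsing these characters is appropriate depends on the task. Some models are trained on text where curly and straight quotes are kept distinct.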
