NLP Demystified 2: Text Tokenization


The usual first step in NLP is to chop our documents into smaller pieces, a process called tokenization. We'll look at the challenges involved and how to get it done.

Timestamps:
00:00 Tokenization
00:12 Text as unstructured data
00:39 What is tokenization?
01:09 The challenges of tokenization
03:09 DEMO: tokenizing text with spaCy
07:55 Preprocessing as a pipeline

This video is part of Natural Language Processing Demystified, a free, accessible course on NLP.
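
As a quick preview of the spaCy demo mentioned in the timestamps, here is a minimal sketch (an illustrative snippet, not the course notebook; it assumes the small English pipeline en_core_web_sm is installed):

    # Minimal tokenization sketch with spaCy (assumes the small English pipeline
    # is installed: python -m spacy download en_core_web_sm).
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Dr. Smith isn't here, but the U.K. office opens at 9 a.m.")

    # Contractions, abbreviations, and punctuation are handled by spaCy's
    # rule-based tokenizer rather than by a naive split on whitespace.
    print([token.text for token in doc])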

Comments

This guy posted a mind-blowing series and then left. Thank you, you're a legend!

pictzone

I'm from France and I just came across this superb playlist, which for me is the most complete one on YouTube! Thank you, a huge thank you! It's hard to find training material of such quality.

anissahli-glud

Thanks for posting this series buddy!!

MrNeelthehulk

Thank you so much for offering such high quality content 🎉

alpalp

Great to learn more about NLP concepts. Some of these concepts aren't mentioned in the Hugging Face tutorials; I guess they may be a little outdated in the era of transformers.

FrankCai-er

Hello. Thank you for such a detailed course. I have a question about using pre-trained language models. My language (Azerbaijani) is not yet available in the library. Do you cover this topic later, or is it not worth spending time on without such a model?

CC-nzoc

Hi, fantastic course! I'm wondering if by any chance there are solutions available for the exercises in the notebooks? I checked the GitHub repo and the Colab notebooks but was unable to find them.

BadEnoughDudeRescues

These are very helpful videos, thank you! There are still a few concepts that are unclear. You mentioned that documents are segmented into a list of sentences, and each sentence is segmented into a list of tokens. This implies that the list of tokens is empty to begin with, and after tokenization we end up with a list of tokens (a token vocabulary?) specific to the corpus we provide. But later, when you start tokenizing with spaCy, you load some database. What is that doing? Shouldn't spaCy just be a program/tool with some "advanced rules" for tokenizing a document we provide, creating a new token vocabulary from scratch, rather than using its own list built from some unknown corpus as a starting point?

And finally, why tokenize one sentence at a time? Because a document can be large? Could it instead read a fixed number of words at a time, say 100 words, and then tokenize them? A "sentence" should have no meaning for the tokenizer, is that right? Actually, how does a tokenizer even "know" when a sentence starts or ends? Thanks for any clarifications!

SatyaRao-fhny
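
For readers puzzled by the same questions, a small illustrative sketch may help (again not from the course notebooks, and assuming en_core_web_sm is installed). spacy.load pulls in a pretrained pipeline (tagger, parser, and so on), not a token vocabulary built from your corpus; the tokenizer itself is rule-based and produces tokens directly from whatever text you pass in, while sentence boundaries are predicted by the loaded pipeline rather than by the tokenizer:

    # Illustrative sketch only; assumes the pretrained pipeline is installed:
    #   python -m spacy download en_core_web_sm
    import spacy

    # Loads a pretrained pipeline (tagger, parser, lemmatizer, ...),
    # not a token list harvested from your own corpus.
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Tokenization needs no training data. Sentence splitting does use the model.")

    # The rule-based tokenizer builds tokens straight from the input text.
    print([token.text for token in doc])

    # Sentence boundaries come from the pipeline's parser, not the tokenizer,
    # which is one reason a pretrained model is loaded at all.
    print([sent.text for sent in doc.sents])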

Since Hugging Face and OpenAI provide APIs for this, could we skip spaCy and NLTK, these relatively old libraries?

FrankCai-er

Is it still possible to connect to a local runtime? I can't see an obvious connect button. May delete this if I solve it, thanks for any help!

oluOnline

Interesting to see AI developers reword phraseology concepts and language morphemes into "token" corporate keywords.
English majors and language doctorates are laughing 😆🤣 and asking why 🤔?

michaelcharlesthearchangel