NLP Demystified 2: Text Tokenization


The usual first step in NLP is to chop our documents into smaller pieces, a process called tokenization. We'll look at the challenges involved and how to get it done.

Timestamps:
00:00 Tokenization
00:12 Text as unstructured data
00:39 What is tokenization?
01:09 The challenges of tokenization
03:09 DEMO: tokenizing text with spaCy
07:55 Preprocessing as a pipeline

This video is part of Natural Language Processing Demystified, a free, accessible course on NLP.
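
As a quick preview of the spaCy demo mentioned in the timestamps, here is a minimal sketch (an illustrative snippet, not the course notebook; it assumes the small English pipeline en_core_web_sm is installed):

    # Minimal tokenization sketch with spaCy (assumes the small English pipeline
    # is installed: python -m spacy download en_core_web_sm).
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Dr. Smith isn't here, but the U.K. office opens at 9 a.m.")

    # Contractions, abbreviations, and punctuation are handled by spaCy's
    # rule-based tokenizer rather than by a naive split on whitespace.
    print([token.text for token in doc])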

Comments

This guy posted a mind-blowing series and then left. Thank you, you're a legend!

pictzone

I'm from France and I just came across this superb playlist, which for me is the most complete one on YouTube! Thank you, a huge thank you! It's hard to find training material of such quality.

anissahli-glud

Thanks for posting this series buddy!!

MrNeelthehulk

Thank you so much for offering such high quality content 🎉

alpalp

Great to learn more about NLP concepts. Some of these concepts aren't mentioned in the Hugging Face tutorials; I guess they may be a little outdated in the era of transformers.

FrankCai-er

Hello. Thank you for such a detailed course. I have a question about using pre-trained language models. My language (Azerbaijani) is not yet available in the library. Do you cover this topic later, or is it not worth spending time on without such a model?

CC-nzoc

Hi, fantastic course! I'm wondering if by any chance there are solutions available for the exercises in the notebooks? I checked the GitHub repo and the Colab notebooks but was unable to find them.

BadEnoughDudeRescues

These are very helpful videos, thank you! There are still a few concepts that are unclear. You mentioned that documents are segmented into a list of sentences, and each sentence is segmented into a list of tokens. This implies that the list of tokens is empty to begin with, and after tokenization we end up with a list of tokens (a token vocabulary?) specific to the corpus we provide. But later, when you start tokenizing with spaCy, you load some database. What is that doing? Shouldn't spaCy just be a program/tool with some "advanced rules" for tokenizing a document we provide, creating a new token vocabulary from scratch, rather than using its own list built from some unknown corpus as a starting point?

And finally, why tokenize one sentence at a time? Because a document can be large? Could it instead read a fixed number of words at a time, say 100 words, and then tokenize them? A "sentence" should have no meaning for the tokenizer, is that right? Actually, how does a tokenizer even "know" when a sentence starts or ends? Thanks for any clarifications!

SatyaRao-fhny
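
For readers puzzled by the same questions, a small illustrative sketch may help (again not from the course notebooks, and assuming en_core_web_sm is installed). spacy.load pulls in a pretrained pipeline (tagger, parser, and so on), not a token vocabulary built from your corpus; the tokenizer itself is rule-based and produces tokens directly from whatever text you pass in, while sentence boundaries are predicted by the loaded pipeline rather than by the tokenizer:

    # Illustrative sketch only; assumes the pretrained pipeline is installed:
    #   python -m spacy download en_core_web_sm
    import spacy

    # Loads a pretrained pipeline (tagger, parser, lemmatizer, ...),
    # not a token list harvested from your own corpus.
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Tokenization needs no training data. Sentence splitting does use the model.")

    # The rule-based tokenizer builds tokens straight from the input text.
    print([token.text for token in doc])

    # Sentence boundaries come from the pipeline's parser, not the tokenizer,
    # which is one reason a pretrained model is loaded at all.
    print([sent.text for sent in doc.sents])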

Since Hugging Face and OpenAI provide APIs for this, could we skip spaCy and NLTK, these relatively old libraries?

FrankCai-er

Is it still possible to connect to a local runtime? I can't see an obvious connect button. May delete this if I solve it, thanks for any help!

oluOnline

Interesting to see AI developers reword phraseology concepts and language morphemes into "token" corporate keywords.
English majors and language doctorates are laughing 😆🤣 and asking why 🤔?

michaelcharlesthearchangel