Zachariah Miller: The necessity of pipelines in Natural Language Processing | PyData Miami 2019

Full Title: "The necessity of pipelines in Natural Language Processing and how we should think about them"

Natural Language Processing (NLP) is an inherently messy, iterative process. A well-designed data flow can make all the difference between a scalable NLP project and a project that makes everyone involved weep. During this talk, we'll investigate one way of integrating pre-made tools into a self-contained pipeline, using entry-level tools like scikit-learn and NLTK.

We'll begin by introducing NLP and discussing the inherently iterative nature of NLP projects, focusing on the messy combinations that come out of those iterations. After deciding that NLP is a nightmare for anyone who prefers organization to chaos, we'll discuss the benefits of pipelines. Then we'll discuss the modern toolset for doing NLP, with things like spaCy, Gensim, and other "commercial grade" open source software that in many ways isn't as beginner-friendly as building your own toolset. After deciding we need to build our own, we'll talk about how we should design a reproducible, efficient, and saveable pipeline. Finally, we'll go through some code that's designed to make all of this happen, talk about some of the choices being made, and show how we can use this as a base pipeline for more complex tools like document classification, topic modeling, and article recommendation engines.
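
To make the idea of a reproducible, saveable pipeline concrete, here is a minimal sketch in plain Python. It is not the code from the talk: the class names (Tokenizer, BagOfWords, Pipeline), the get_state/set_state methods, and the JSON save format are all assumptions made for illustration. Each step exposes fit and transform, and the pipeline chains them so the same preprocessing runs at training time and later.

```python
import json
import re
from collections import Counter


class Tokenizer:
    """Stateless step (hypothetical): lowercase and split into alphabetic tokens."""

    def fit(self, docs):
        return self  # nothing to learn

    def transform(self, docs):
        return [re.findall(r"[a-z]+", doc.lower()) for doc in docs]

    def get_state(self):
        return {}

    def set_state(self, state):
        pass


class BagOfWords:
    """Stateful step (hypothetical): learns a vocabulary, then counts tokens."""

    def __init__(self):
        self.vocabulary = {}

    def fit(self, token_lists):
        vocab = sorted({tok for toks in token_lists for tok in toks})
        self.vocabulary = {tok: i for i, tok in enumerate(vocab)}
        return self

    def transform(self, token_lists):
        rows = []
        for toks in token_lists:
            row = [0] * len(self.vocabulary)
            for tok, n in Counter(toks).items():
                if tok in self.vocabulary:  # words unseen during fit are ignored
                    row[self.vocabulary[tok]] = n
            rows.append(row)
        return rows

    def get_state(self):
        return {"vocabulary": self.vocabulary}

    def set_state(self, state):
        self.vocabulary = state["vocabulary"]


class Pipeline:
    """Runs named steps in order; each step consumes the previous step's output."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, step) pairs

    def fit(self, docs):
        data = docs
        for _, step in self.steps:
            data = step.fit(data).transform(data)
        return self

    def transform(self, docs):
        data = docs
        for _, step in self.steps:
            data = step.transform(data)
        return data

    def dumps(self):
        # Serialize only the learned state, keyed by step name.
        return json.dumps({name: step.get_state() for name, step in self.steps})

    def loads(self, text):
        state = json.loads(text)
        for name, step in self.steps:
            step.set_state(state[name])
        return self


def build_pipeline():
    return Pipeline([("tokenize", Tokenizer()), ("bow", BagOfWords())])


train = ["NLP is messy.", "Pipelines keep NLP tidy!"]
fitted = build_pipeline().fit(train)

# Save the fitted state and restore it into a freshly built pipeline.
restored = build_pipeline().loads(fitted.dumps())
print(restored.transform(["Messy NLP, tidy NLP"]))  # [[0, 0, 1, 2, 0, 1]]
```

Saving only the learned state, rather than the whole object, keeps the saved file readable and forces the pipeline definition to live in code, which is one way to keep runs reproducible. scikit-learn's own Pipeline class follows the same fit/transform pattern and would be the natural next step.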

PyData is an educational program of NumFOCUS, a 501(c)(3) non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.

PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.

00:00 Welcome!
00:10 Help us add time stamps or captions to this video! See the description for details.
