Data Science in 30 Minutes #5: Exploring Wikipedia with Apache Spark

Spark is one of the most popular distributed computation engines for processing and analyzing big data, and its recent 2.0 release was a significant update. The Spark execution engine works efficiently on complex tasks, and its core ecosystem (graph analysis, machine learning, SQL, and streaming) shares a common interface. We'll go through a brief demo using the Python API, covering how to ingest Wikipedia data into Spark, perform typical ETL tasks, and answer questions about user behavior and usage patterns. By the end of the session, participants will have been introduced to the fundamental principles of Spark execution, how to interact with and query distributed data, and where to go to learn more about Spark and improve their skills further.

The Data Incubator is a data science education company based in NYC, DC, and SF with both corporate training and recruiting services. For data science corporate training, we offer customized, in-house training solutions in data and analytics. For data science hiring, we run an 8-week fellowship that trains PhDs to become data scientists. The fellowship selects 2% of its 2000+ quarterly applicants and is free for Fellows. Hiring companies (including eBay, Capital One, and Pfizer) pay a recruiting fee only if they successfully hire. For more information, visit our website:

About the speakers:

Ariel M'ndange-Pfupfu studied physics at Stanford and got an engineering PhD from Northwestern. Since joining The Data Incubator as a Data Scientist in Residence, he's worked on a variety of data science and software engineering projects, as well as curriculum development and instruction.

Michael Li founded The Data Incubator, a New York-based training program that turns talented PhDs from academia into workplace-ready data scientists and quants. The program is free to Fellows; employers engage with the Incubator as hiring partners.
Previously, he worked as a data scientist (Foursquare), Wall Street quant (D.E. Shaw, J.P. Morgan), and a rocket scientist (NASA). He completed his PhD at Princeton as a Hertz fellow and read Part III Maths at Cambridge as a Marshall Scholar. At Foursquare, Michael discovered that his favorite part of the job was teaching and mentoring smart people about data science. He decided to build a startup to focus on what he really loves.
Michael lives in New York, where he enjoys the opera, rock climbing, and attending geeky data science events.