Data Science in 30 Minutes #5: Exploring Wikipedia with Apache Spark

Spark is one of the most popular distributed computation engines for processing and analyzing big data, and it recently received a significant update with version 2.0. The Spark execution engine works efficiently on complex tasks, and its core ecosystem (which includes graph analysis, machine learning, SQL, and streaming) shares a common interface. We'll go through a brief demo using the Python API, covering how to ingest Wikipedia data into Spark, perform typical ETL tasks, and answer questions about user behavior and usage patterns. By the end of the session, participants will have been introduced to the fundamental principles of Spark execution, how to interact with and query distributed data, and where to go to learn more about Spark and improve their skills further.
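
To give a flavor of the ingest, ETL, and query steps mentioned above, here is a minimal sketch in the Python API. It is not the code from the session. It assumes, purely for illustration, that the input is in the space-separated Wikimedia pagecounts layout (project, page title, view count, bytes), and the names `parse_line` and `top_pages` are hypothetical.

```python
from typing import Optional, Tuple


def parse_line(line: str) -> Optional[Tuple[str, str, int]]:
    """Parse one 'project title views bytes' record (assumed layout).

    Returns (project, title, views), or None for a malformed line.
    """
    parts = line.strip().split(" ")
    if len(parts) != 4:
        return None
    project, title, views, _size = parts
    try:
        return (project, title, int(views))
    except ValueError:
        return None


def top_pages(path: str, project: str = "en", n: int = 10):
    """Return a DataFrame of the n most viewed titles for one project."""
    # Imported here so parse_line stays usable without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("wikipedia-sketch").getOrCreate()

    # Ingest: read raw text lines, then drop anything that fails to parse.
    records = (
        spark.sparkContext.textFile(path)
        .map(parse_line)
        .filter(lambda record: record is not None)
    )

    # ETL: give the cleaned tuples column names so they can be queried.
    df = spark.createDataFrame(records, ["project", "title", "views"])

    # Query: total views per title within the chosen project.
    return (
        df.filter(F.col("project") == project)
        .groupBy("title")
        .agg(F.sum("views").alias("total_views"))
        .orderBy(F.desc("total_views"))
        .limit(n)
    )
```

Calling `top_pages("/path/to/pagecounts").show()` would print the result. Nothing is computed until that point, because the transformations are lazy and only an action such as `show()` triggers execution.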

The Data Incubator is a data science education company based in NYC, DC, and SF that offers both corporate training and recruiting services. For data science corporate training, we provide customized, in-house solutions in data and analytics. For data science hiring, we run an 8-week fellowship that trains PhDs to become data scientists. The fellowship selects 2% of its 2000+ quarterly applicants and is free for Fellows. Hiring companies (including eBay, Capital One, and Pfizer) pay a recruiting fee only if they successfully hire. For more information, visit our website:

About the speakers:

Ariel M'ndange-Pfupfu studied physics at Stanford and got an engineering PhD from Northwestern. Since joining The Data Incubator as a Data Scientist in Residence, he's worked on a variety of data science and software engineering projects, as well as curriculum development and instruction.

Michael Li founded The Data Incubator, a New York-based training program that turns talented PhDs from academia into workplace-ready data scientists and quants. The program is free to Fellows; employers engage with the Incubator as hiring partners.
Previously, he worked as a data scientist (Foursquare), Wall Street quant (D.E. Shaw, J.P. Morgan), and a rocket scientist (NASA). He completed his PhD at Princeton as a Hertz fellow and read Part III Maths at Cambridge as a Marshall Scholar. At Foursquare, Michael discovered that his favorite part of the job was teaching and mentoring smart people about data science. He decided to build a startup to focus on what he really loves.
Michael lives in New York, where he enjoys the opera, rock climbing, and attending geeky data science events.
