Scale R to Big Data with Hadoop & Spark

In this talk, we will show you Microsoft R Server running on a Hadoop or Spark cluster: R is installed on every node and is equipped with distributed processing libraries that put each node to work in parallel. We’ll show you how to run your normal native R code via SSH, and how to get an RStudio server up and running on the cluster.

R is currently one of the most popular data science languages in the world. However, it has always had constraints around scaling out to big data. What happened when you expanded beyond a couple of gigabytes of data? You packed up your data and used something else: Python, Java, or Mahout, to name a few. Now it’s possible to stick with R throughout your production analysis all the way to deployment, regardless of the data size.

Organizations like Apache, Revolution Analytics, Microsoft, and H2O showed us this year that distributed computing in R is possible. Today we’ll take a look at what the Microsoft stack is doing in terms of scaling R up to big data.

We’ll show you how to wrangle data out of HDFS and build machine-learning models from your large dataset. Then we’ll show you how to pack up that model and deploy it to an elastically scaled web service so that anyone can call it for predictions and insights.

Outline:
· Set up a Spark cluster with R installed (R Server)
· Wrangle data that is inside HDFS using R
· Build and deploy a machine learning model using R

Code and Prep Work (if you want to follow along):

Table of Contents:
0:00 Overview
1:20 Machine learning scaling
4:13 Popularity for data science
4:47 R as a movement
8:38 R limits
19:55 Spark
21:40 R Server
26:56 R Server on HDInsight
45:52 IDE
50:43 RStudio
1:02:44 Processing times


#hadoop #spark #rprogramming #bigdata
Comments
Author

Now I understand the basics. Thank you very much.

jensharbers
Author

Is ScaleR still available? Or are there newer solutions to deal with the memory problem? Can the package be used if you are not using a server? Very nice talk by the way :)

suzannevangestel
Автор

Do we have a function in SparkR or sparklyr to read NetCDF files?

rajanikumar