Build Large-Scale Data Analytics and AI Pipeline Using RayDP

A large-scale end-to-end data analytics and AI pipeline usually involves a data processing framework such as Apache Spark for massive data preprocessing, plus ML/DL frameworks for distributed training on the preprocessed data. A conventional approach is to use two separate clusters and glue multiple jobs together. Other solutions include running deep learning frameworks inside an Apache Spark cluster, or using a workflow orchestrator such as Kubeflow to stitch distributed programs together. All of these options have their own limitations. We introduce Ray as a single substrate for distributed data processing and machine learning. We also introduce RayDP, which lets you start an Apache Spark job on Ray from your Python program and use Ray's in-memory object store to efficiently exchange data between Apache Spark and other libraries. We will demonstrate how this makes building an end-to-end data analytics and AI pipeline simpler and more efficient.

Comments

I cannot get this to run on Databricks. It keeps saying: "Java gateway process exited before sending its port number".
Is there any configuration I need to set?

DZitLee