Best Practices for Engineering Production-Ready Software with Apache Spark

Notebooks are a great tool for Big Data. They have drastically changed the way scientists and engineers develop and share ideas. However, most world-class Spark products cannot be easily engineered, tested and deployed just by modifying or combining notebooks. Taking a prototype to production with high quality typically requires proper software engineering. The code we develop on such larger-scale projects must be modular, robust, readable, testable, reusable and performant. At Montevideo Labs we have many years of experience helping our clients architect large Spark systems capable of processing data at petabyte scale. In previous Spark Summits, we described how we productionized an unattended Machine Learning system in Spark that trains thousands of ML models daily, which are then deployed for real-time serving at extremely low latency. This time, we will share lessons learned from taking other Spark products to production at top US tech companies.

Throughout the session we will address the following questions along with the relevant best practices: How do you make your Spark code readable, debuggable, reusable and testable? How do you architect Spark components for different processing schemes, such as batch ETL, low-latency services and model serving? How do you package and deploy Spark applications to the cloud? In particular, we will do a deep dive into how to take advantage of Spark's laziness (and DAG generation) to structure our code according to software engineering best practices without being constrained by efficiency concerns. Instead of focusing only on code efficiency when structuring our Spark code, we can leverage this 'laziness' to follow the best software patterns and principles and write elegant, testable and highly maintainable code. Moreover, we can encapsulate Spark-specific code in classes and utilities and keep our business rules cleaner. We will support this presentation with live demos that illustrate the concepts introduced.
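The point about laziness can be illustrated with a minimal, dependency-free sketch. It does not use Spark and is not taken from the talk: `LazyTable`, `add_total`, `keep_large_orders` and `large_order_report` are hypothetical names, written in plain Python so the example runs anywhere. The toy class mimics one property of Spark DataFrames, namely that transformations only record work and an action triggers it. This is why business rules can be split into small, individually testable functions without adding extra passes over the data.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List, Tuple

Row = Dict[str, Any]
Step = Callable[[Iterable[Row]], Iterable[Row]]


@dataclass(frozen=True)
class LazyTable:
    """Toy stand-in for a lazy DataFrame (hypothetical, NOT the Spark API).

    Transformations return a new LazyTable with one more recorded step.
    Rows are only read when collect() is called.
    """

    source: Callable[[], Iterable[Row]]
    steps: Tuple[Step, ...] = ()

    def _then(self, step: Step) -> "LazyTable":
        return LazyTable(self.source, self.steps + (step,))

    def with_column(self, name: str, fn: Callable[[Row], Any]) -> "LazyTable":
        # Generator expression: nothing is computed until rows are pulled.
        return self._then(lambda rows: ({**r, name: fn(r)} for r in rows))

    def where(self, predicate: Callable[[Row], bool]) -> "LazyTable":
        return self._then(lambda rows: (r for r in rows if predicate(r)))

    def transform(self, fn: Callable[["LazyTable"], "LazyTable"]) -> "LazyTable":
        # Lets business rules be chained as plain functions.
        return fn(self)

    def collect(self) -> List[Row]:
        # The "action": read the source once and stream rows through all steps.
        rows = self.source()
        for step in self.steps:
            rows = step(rows)
        return list(rows)


# Business rules: small, named, and testable in isolation.
def add_total(orders: LazyTable) -> LazyTable:
    return orders.with_column("total", lambda r: r["price"] * r["quantity"])


def keep_large_orders(orders: LazyTable, minimum: float = 100.0) -> LazyTable:
    return orders.where(lambda r: r["total"] >= minimum)


def large_order_report(orders: LazyTable) -> LazyTable:
    return orders.transform(add_total).transform(keep_large_orders)


scans: List[str] = []  # records every time the source is actually read


def read_orders() -> Iterable[Row]:
    scans.append("scan")
    return iter([
        {"price": 20.0, "quantity": 2},
        {"price": 60.0, "quantity": 3},
    ])


report = large_order_report(LazyTable(read_orders))
scans_before_action = len(scans)  # building the pipeline reads nothing
result = report.collect()         # the single action triggers one scan
```

In real Spark code the same shape would typically use DataFrame-to-DataFrame functions, with Spark's planner deciding how the recorded steps are executed. The toy only shows why splitting logic into many small functions need not mean many passes over the data.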

About:
Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.

Comments

Thank you for sharing the information.

howzattt

Great example! Does anyone know of something like this, but using Python DataFrames and Delta tables, and maybe more focused on a data engineer's perspective?

julsgranados