A Journey from Scikit-learn to Spark (Stanimir Dragiev and Patrick Baier)

Zalando is Europe’s leading online fashion retailer and is on its way to becoming the platform for all fashion-related business, from designers to large-scale logistics solutions. The new platform architecture challenges the company’s current in-house solutions to become more scalable, dependable and versatile. This talk describes our journey of rewriting an in-production classification system from scratch using Scala and Spark to run on AWS. Along the way, we look at the drawbacks inherent to our old Python/Scikit-learn-based solution running on a static cluster, most prominently: difficult maintenance (technical debt), data bottlenecks, and overly coarse-grained parallelisation. Next, we present our new Spark-based solution and demonstrate how we mitigated these pain points by leveraging the features that Scala and Spark bring into play, in particular strong typing, data parallelisation and easy scale-out. To quantify the gain, we provide an in-depth comparison of both solutions, based on measurements that highlight the performance gains we experienced with Spark, including learning and prediction times. The talk concludes with the lessons we learned.