Josh Wills | Strata Data Conference 2013

“I am Forrest Gump, I have a toothbrush, I have a lot of data and I scrub,” is how Josh Wills, Data Scientist at Cloudera, whimsically described his profession to John Furrier and Dave Vellante inside theCube, live from Strata 2013. He added that while he thinks of himself mostly as a mathematician, a data scientist is a lot like a data janitor.

John Furrier pointed out that data is now part of the developer community and wanted to know which tools are best for scrubbing data. Wills explained that when it comes to developer tools, the conversation could be described as a “religious debate,” as the choice depends a lot on each developer’s preference. Python, R, and SAS are all good scripting languages, his personal choice being the first two. “Some kind of scripting language” is a basic tool for a data scientist, but there isn’t a generally adopted best tool.

Talking about unstructured data that requires custom code, the need to analyze multiple data sets, and the available solutions, Josh Wills expressed a preference for in-memory tools such as Spark and SAS, which provide a great way of exploring data. As for sampling data sets, he stated that larger samples are preferable to smaller ones, especially when preparing data sets for other people to analyze.

John Furrier asked about existing collaborative tools for data science and how they support teamwork, whether through the cloud or other vehicles. While such tools would be a great idea, Josh Wills pointed out that nothing worth mentioning exists in this direction. He explained that although an inter-office, global collaboration solution is out of the question at this point, a lightweight tool allowing people in the same office to collaborate would be very useful for data scientists. A tool that lets data scientists in one location share data analysis and data set preparation would be a great starting point.

One of the defining qualities of a data scientist is being relentless, Wills said. “If the tool does not answer my question, I google another tool.” A question without an answer is unacceptable to a data scientist.

Asked about the projects he works on at Cloudera and is excited about, Wills said he is currently involved in simplifying data science and making everything easy to use, so that machine learning techniques become available to a general audience: a programmer or a statistician could then easily use data science in their daily activities.