Towards increasing a data scientist’s productivity by 1000x

We have spent the past few years exploring a vital question: What would it take for machine learning to reach its full potential, and to become integrated into every aspect of our day-to-day lives? Like others, we investigated various possibilities, each focusing on generating better and more accurate models. We considered enabling data scientists to easily explore different forms of models, including latent variable models, discriminative models, and deep learning models. Then again, we thought, as the sheer volume of data increases, perhaps the priority should be to scale up machine learning algorithms so that they can use all the data available. Or maybe the key lies in automatically tuning the models.

To investigate these innovations' ability to move data science further into the public sphere, we applied them to vast data stores associated with different organizations. Machine learning experts at these organizations responded by pointing out that their main problem, which had gone chronically unaddressed, was entirely different—that, as they often said, “the data is a mess.” None of the solutions above were designed to mitigate this, and while the themes of those investigations were and are powerful in their own right, we found ourselves in the midst of redefining what mattered. Here is what we found:

  • - Scaling doesn't matter - With the advent of big data, we thought that perhaps we would need to enable well-known classification algorithms, such as support vector machines, neural networks, and logistic regression, to run with millions or billions of training examples. The data in a full database is granular and large, on the order of terabytes: for example, every interaction a customer has with an online platform is recorded, or a physiological signal is recorded 125 times per second. But once a predictive problem is defined to form relevant training examples, this data goes through a number of filters and transformations. For example, when we want to predict the onset of a certain condition for patients in a medical database, we end up finding only a small percentage of patients as cases and select an equal number as controls. The training data fed into machine learning algorithms eventually scales with the number of features generated, but is limited by the number of training examples. For everything but recommender systems, by the time this data reached the machine learning algorithm, it was manageable and did not require extensive scaling.

  • - Diverse modeling doesn't matter - Next, we investigated whether it would be important to give data scientists an easy way to explore multiple modeling techniques: random forests, neural networks, support vector machines, or even modeling techniques for longitudinal data, like dynamic Bayesian networks. (We also thought it might be necessary to train models simultaneously on the cloud.) While we tried a number of different modeling techniques, none of them proved more effective than the parallel strategy of embedding more intelligence into features and generating new ones. For example, a model built on a feature that measured the average time a customer spent looking at products of a certain category over the past week was more accurate than one that used the best classifier possible but relied on a feature that merely counted the customer's total number of events.

  • - Tuning doesn't matter - Next, we thought that tuning the hyperparameters of a single classifier might prove impactful. Again, we found that better features do more to improve model performance than tuning. Some claim that one of the chief benefits of automatically tuning machine learning models, as pursued within the subfield "AutoML," is that it allows humans to focus on feature engineering, a much more important and human-driven part of the process.
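To make the feature comparison in the second point concrete, here is a minimal sketch in plain Python. The event log, its field layout, the dates, and both function names are hypothetical and chosen only for illustration; this is not the code used in our experiments. It contrasts a feature that merely counts a customer's events with one that averages the time spent on a given category over the past week.

```python
from datetime import datetime, timedelta

# Hypothetical event log: (customer_id, category, timestamp, seconds_spent).
EVENTS = [
    ("c1", "shoes", datetime(2016, 5, 30, 9, 0), 40.0),
    ("c1", "shoes", datetime(2016, 5, 31, 9, 0), 80.0),
    ("c1", "books", datetime(2016, 5, 31, 10, 0), 15.0),
    ("c1", "shoes", datetime(2016, 5, 1, 9, 0), 300.0),  # older than one week
    ("c2", "books", datetime(2016, 5, 29, 9, 0), 20.0),
]


def total_event_count(events, customer_id):
    """Simple feature: how many events the customer has in total."""
    return sum(1 for cid, _, _, _ in events if cid == customer_id)


def avg_time_in_category_past_week(events, customer_id, category, as_of):
    """Richer feature: mean seconds spent on one category in the 7 days before as_of."""
    window_start = as_of - timedelta(days=7)
    durations = [
        secs
        for cid, cat, ts, secs in events
        if cid == customer_id and cat == category and window_start <= ts < as_of
    ]
    # A customer with no matching events gets 0.0 rather than raising an error.
    return sum(durations) / len(durations) if durations else 0.0


AS_OF = datetime(2016, 6, 1)  # assumed prediction time
count_c1 = total_event_count(EVENTS, "c1")
avg_c1 = avg_time_in_category_past_week(EVENTS, "c1", "shoes", AS_OF)
print(count_c1, avg_c1)
```

The second feature encodes a category, a time window, and an aggregation, which is the kind of added intelligence the comparison above refers to; either value could then be fed to any classifier.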

So what does matter?

As we dug deeper, we found that working with fine-grained, detailed and intricately connected data on the order of terabytes overwhelmed many machine learning experts like us, as we were more experienced and comfortable finding mathematical structure in data represented at a slightly higher level: that of features. We also found that this highly granular data prevented us from identifying and generating multiple predictive problem definitions, each of which required several steps before the relevant training examples could be isolated from the data. Recognizing this, we realized that to further our goal of applying machine learning to all societal problems, we needed to expand the number of predictive models that can be built from data. We had to develop methods that allow us to easily define prediction problems, take all the steps to identify and isolate training examples, and generate features in a machine learning-ready format. Over the past two years, we have developed several methods for this: Deep Feature Synthesis, the Trane language, and the Label-Segment-Featurize framework, which defines the concept of prediction engineering and thus enables users to build several thousands of predictive models from a set of data. Below, we present the three papers that resulted from this research.
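Before the papers, here is a rough sketch of what isolating training examples can involve, using the medical case mentioned earlier: patients who develop the condition become cases, and an equal number of the remaining patients are sampled as controls. The patient records, the function name, and the fixed random seed are assumptions made for illustration; this is not the procedure from any of the papers below.

```python
import random

# Hypothetical records: patient_id -> day of condition onset (None if never observed).
ONSET_DAY = {
    "p1": None, "p2": 120, "p3": None, "p4": None,
    "p5": 45, "p6": None, "p7": None, "p8": None,
}


def build_training_examples(onset_day, seed=0):
    """Return (patient_id, label) pairs with as many controls as cases."""
    cases = sorted(pid for pid, day in onset_day.items() if day is not None)
    non_cases = sorted(pid for pid, day in onset_day.items() if day is None)
    rng = random.Random(seed)  # seeded so the sampled controls are reproducible
    controls = rng.sample(non_cases, k=min(len(cases), len(non_cases)))
    return [(pid, 1) for pid in cases] + [(pid, 0) for pid in controls]


examples = build_training_examples(ONSET_DAY)
print(examples)
```

Even with eight patients in the raw data, only four training examples remain, which illustrates why the data that reaches the learning algorithm is usually far smaller than the database it came from.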



Deep Feature Synthesis: Towards automating data science endeavors

Authors: James Max Kanter, Kalyan Veeramachaneni
Published in: IEEE International Conference on Data Science and Advanced Analytics 2015

We develop the Data Science Machine, which is able to derive predictive models from raw data automatically. To achieve this automation...[ pdf ]

What would a data scientist ask? Automatically formulating and solving prediction problems

Authors: Benjamin Schreck, Kalyan Veeramachaneni
Published in: IEEE International Conference on Data Science and Advanced Analytics 2016 [ pdf ]

Label, Segment, Featurize: a cross domain framework for prediction engineering

Authors: James Max Kanter, Owen Gillespie, Kalyan Veeramachaneni
Published in: IEEE International Conference on Data Science and Advanced Analytics 2016

We introduce "prediction engineering" as a formal step in the predictive modeling process. We define a generalizable three-part...[ pdf ]