Do Machine Learning automation and AutoML tools pose a threat to Data Scientists? It's a question that's on the minds of more and more Data Science professionals, as well as aspiring Data Scientists worried about their future careers. For the time being, however, the complete automation of Data Science seems unlikely.
To cope with the shortage of Data Scientists and other Machine Learning engineers, a number of “AutoML” tools have emerged in recent years. These Machine Learning automation tools were originally designed to eliminate the most tedious tasks in Machine Learning model development, or even to compensate for the absence of professionals.
However, over the years, the various AutoML frameworks have evolved and improved. Today, they are so powerful that they can even outperform human experts in some cases. This is the finding of a study conducted by researchers at the German Fraunhofer Institute.
For their investigation, the researchers used 12 popular datasets from the OpenML platform. Six of these datasets are supervised classification tasks, while the other six are supervised regression tasks — the two most common types of Machine Learning task.
The team also used the open-source AutoML Benchmark tool, which offers full OpenML dataset integration for numerous AutoML frameworks as well as automated benchmarking features. The benchmarks were run with the default parameters defined in config.yaml in the AutoML Benchmark project.
Four AutoML frameworks were reviewed: TPOT, H2O, Auto-sklearn and AutoGluon. Some are among the most recent, others among the most popular. There are frameworks dedicated solely to Deep Learning, and others based on scikit-learn.
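To make concrete what these frameworks automate, here is a minimal sketch of the kind of model and hyperparameter search that AutoML tools perform — shown with plain scikit-learn on a synthetic toy dataset, which is an illustration of the general technique, not the study's actual setup. Real AutoML frameworks search over many model families and preprocessing pipelines at once, under a time budget.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data (stand-in for an OpenML dataset).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# AutoML frameworks automate this loop: try candidate configurations,
# cross-validate each one, and keep the best — but across many model
# families and full pipelines, not just one estimator's hyperparameters.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [50, 100, 200],
        "max_depth": [3, 5, None],
    },
    n_iter=5,
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Frameworks like auto-sklearn essentially generalize this search to the whole scikit-learn pipeline space, while AutoGluon leans on ensembling and Deep Learning models.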
The runtime per fold was set to one hour. For supervised classification, the best of the four frameworks was additionally given a runtime of five hours per fold, so that its results could be compared with those of humans.
For classification tasks, the ROC AUC (auc) and accuracy evaluation metrics were used. For supervised regression tasks, the root mean square error (rmse) and mean absolute error (mae) metrics were chosen.
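The four metrics above are all standard and available in scikit-learn. The short sketch below computes each of them on tiny hand-made examples (the data values are illustrative, not from the study):

```python
import numpy as np
from sklearn.metrics import (
    roc_auc_score, accuracy_score,
    mean_squared_error, mean_absolute_error,
)

# Classification metrics: AUC scores ranked probabilities,
# accuracy scores hard class predictions.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]   # predicted probabilities
auc = roc_auc_score(y_true, y_score)        # 0.75
acc = accuracy_score(y_true, [0, 0, 0, 1])  # 0.75

# Regression metrics: RMSE penalizes large errors more heavily
# than MAE because errors are squared before averaging.
r_true = [3.0, -0.5, 2.0, 7.0]
r_pred = [2.5, 0.0, 2.0, 8.0]
rmse = np.sqrt(mean_squared_error(r_true, r_pred))  # ~0.612
mae = mean_absolute_error(r_true, r_pred)           # 0.5
```

Note that a higher value is better for auc and accuracy, while a lower value is better for rmse and mae — which matters when reading the benchmark results below.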
In terms of hardware, the researchers used a server equipped with two Intel Xeon Silver 4114 CPUs at 2.20 GHz for a total of 20 cores, four 64 GB DDR4-2666 DIMM memory modules, and two NVIDIA GeForce GTX 1080 Ti graphics cards with a combined 22 GB of VRAM.
AutoML equals or surpasses Data Scientists in many situations
At the end of the test, the researchers were astonished to discover that AutoML performed as well as or better than humans on the primary metrics in 7 out of 12 cases. These seven cases are “easy” classification or regression tasks. On the secondary metrics, the differences were not significant.
Thus, the study concludes that most of the results obtained by AutoML are only slightly better or slightly worse than those obtained by humans. The best supervised classification framework, H2O, achieves an AUC score of 0.7892 in 5 hours per fold, compared with 0.799 in one hour per fold.
In the future, researchers predict that the gap between human data scientists and AutoML will narrow. However, given that Machine Learning applications are mainly used in interdisciplinary cases, AutoML tools cannot act as stand-alone solutions. They must therefore be seen as a complement to the skills of Data Scientists.
Why automation won't "kill" Data Scientists
Despite AutoML’s performance, it is unlikely that automation will make the Data Science profession disappear. In areas such as data processing or Data Visualization, it will certainly make it easier for business leaders to reap the benefits of Big Data without the intervention of a human Data Scientist.
According to Gartner, around 40% of Data Science tasks will be automated by the end of 2020. However, automation is unlikely to eradicate this profession, for three main reasons.
Firstly, automation is ultimately just a means of speeding up processes. As Alexander Gray, Vice President of AI at IBM Research, explains, “Data Scientists embrace automation tools because they allow them to save time and think rather than engage in tedious tasks”.
As automation tools become more powerful and intelligent, they will increasingly support data scientists and change the way they work. This will enable them to do more, and increase the impact of their work within their companies. Nevertheless, they will remain tools.
The second reason is that automated tools cannot “realize” that they are making mistakes. These tools may make it possible to do things faster and better, but they can also propagate human errors very quickly if they are based on the wrong foundations.
According to Alexander Gray, even teams of researchers from the world’s top universities can make mistakes in statistical nuances, resulting in poor-quality data models.
Consequently, Data Scientists will remain indispensable for detecting errors and understanding the underlying principles of the tools. All the more so because, as artificial intelligence becomes increasingly embedded in our daily lives, the slightest error could have a major impact.
The third reason, and perhaps the most important, is that only humans can truly understand the problems an organization needs to solve. The challenge of Data Science is not always purely technical, and any professional can testify to this.
The Data Scientist must be able to interpret a problem correctly, to select the right data source or even to interpret the results appropriately. For example, he or she will need to define a time frame for data analysis, or choose the right control groups for a precise comparison. Human judgment remains essential to data science.
For these three reasons, Big Data jobs cannot be fully automated. Ironically, by lowering the cost of access to Data Science, automation could even make it affordable for more companies, and increase the demand for Data Scientists…