In the age of Big Data, a number of jobs have emerged, including that of Data Scientist. If you've never heard of them, then I recommend you read this article first, but for those of you who already know what a Data Scientist does, we're going to look at the range of tools they use.
Let’s take this diagram as a starting point, to see the different stages that data goes through. The Data Scientist will mainly be involved in the last stage. We are going to talk about the tools used in these stages, but they may differ from one company to another.
The first step is to collect the data from its sources. Python, the flagship language of Data Science, is commonly used for this. You can also use web scraping to retrieve data from web pages, for example with Selenium.
You can also query company data using SQL.
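As a minimal sketch of querying data with SQL from Python, the example below uses the standard-library sqlite3 module and an in-memory database. The `sales` table, its columns and its rows are hypothetical and only stand in for real company data; an actual company database would usually be reached through a different driver.

```python
import sqlite3

# Hypothetical in-memory database standing in for a company data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("North", 120.0), ("South", 80.0), ("North", 60.0), ("South", 40.0)],
)


def total_sales_by_region(connection):
    """Return (region, total) rows, sorted by region name."""
    query = """
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
        ORDER BY region
    """
    return connection.execute(query).fetchall()


rows = total_sales_by_region(conn)
for region, total in rows:
    print(region, total)
```

The aggregation is done by the database itself, so only the summarised rows come back to Python.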
What is Visualisation? One of the tools of the data scientist
Data visualisation allows you to uncover information hidden in your data and discover trends within your dataset. Matplotlib and Seaborn are everyday tools for data scientists. Visualisation allows you to make sense of your data at a glance. It’s a fast way to obtain information through visual exploration, reliable reports and information sharing.
All categories of users can make sense of the growing amount of data in your business. Visualisation enables the brain to process, absorb and interpret large quantities of information.
Data analysis / Preprocessing
Data processing is generally carried out by a data scientist (or a team of data scientists). It is important that this is done correctly so as not to have a negative impact on the following stages.
When working with raw data, the data scientist converts it into a more readable form, giving it the necessary format and context so that it can be interpreted and used by Machine Learning or Deep Learning models.
Although we might naively think that a large amount of data is all we need for a high-performance algorithm, most of the time the data we have is unsuitable as it stands and needs to be processed before it can be used: this is the pre-processing stage.
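To make the pre-processing stage more concrete, here is a small sketch in plain Python. It assumes raw records arrive as strings with some missing values, fills the gaps with the column mean and rescales each column to the range 0 to 1. The records, the column names and the choice of mean imputation with min-max scaling are illustrative assumptions; in practice the right treatment depends on the data and the model.

```python
from statistics import mean

# Hypothetical raw records: values arrive as strings, some are missing.
raw_rows = [
    {"age": "34", "income": "52000"},
    {"age": "", "income": "61000"},
    {"age": "45", "income": "n/a"},
    {"age": "29", "income": "38000"},
]


def to_float(value):
    """Convert a raw string to float, or None when it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def preprocess(rows, columns):
    """Parse values, fill gaps with the column mean, then min-max scale."""
    cleaned = [{c: to_float(r.get(c)) for c in columns} for r in rows]
    for c in columns:
        known = [r[c] for r in cleaned if r[c] is not None]
        fill = mean(known)
        for r in cleaned:
            if r[c] is None:
                r[c] = fill
        low = min(r[c] for r in cleaned)
        high = max(r[c] for r in cleaned)
        span = (high - low) or 1.0  # avoid dividing by zero on constant columns
        for r in cleaned:
            r[c] = (r[c] - low) / span
    return cleaned


features = preprocess(raw_rows, ["age", "income"])
for row in features:
    print(row)
```

After this step every value is numeric and on a comparable scale, which is the kind of format many Machine Learning models expect.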
Modelling is a way of representing phenomena in order to make strategic decisions.
Modelling means representing the behaviour of a phenomenon in order to help solve a specific business problem.
In machine learning, the algorithm builds an “internal representation” that allows it to perform the task it is asked to do (prediction, identification, etc.).
To do this, it first needs to be given a set of example data on which it can train and improve, hence the word learning. This set of data is called the training set. An entry in the data set can be called an instance or an observation.
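The idea of learning from a training set can be sketched with one of the simplest possible models, a straight line fitted by ordinary least squares. The training set below is made up for illustration, and the function names `fit_line` and `predict` are hypothetical helpers, not part of any library. Here the "internal representation" is just two numbers, a slope and an intercept.

```python
# Hypothetical training set: each instance is a (feature, target) pair.
training_set = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]


def fit_line(instances):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(instances)
    mean_x = sum(x for x, _ in instances) / n
    mean_y = sum(y for _, y in instances) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in instances)
    variance = sum((x - mean_x) ** 2 for x, _ in instances)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned slope and intercept to a new observation."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line(training_set)
print(model)
print(predict(model, 5.0))
```

Real projects would normally rely on an established library rather than hand-written formulas, but the principle is the same: the model's parameters are estimated from the instances in the training set and then reused on new observations.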
Modelling can therefore serve two purposes:
- To analyse and explain
- To predict
These two dimensions can be present in varying proportions: it’s not just one or the other. But there is a tension between them: the most predictive models are generally not the most explanatory, and vice versa.
MLOps stands for Machine Learning Operations. It is a set of practices and tools that fall within the Data domain, and a specialisation of the Data Scientist profession. The name combines:
- ML for Machine Learning
- Ops for Operations
The development of MLOps methods responds to the growing needs of companies to carry out data projects, by adopting efficient methods for the development, deployment and control of a Machine Learning system.
Machine Learning Operations tools and practices are primarily designed to increase business productivity by making as many data projects as possible usable in production. MLOps optimises each production launch, easing the transition from the concept stage to a real project. It also involves continuously monitoring the process and updating it in the light of new data. This is known as a “data-driven” strategy.
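As an illustration of what monitoring in the light of new data might look like, the sketch below compares a batch of incoming values with the data seen at training time and flags when a retraining could be worth considering. The function name `needs_retraining`, the threshold of two standard deviations and the sample batches are all assumptions made for this example; it is a simple heuristic, not a standard MLOps method, and real monitoring setups are usually far richer.

```python
from statistics import mean, stdev


def needs_retraining(reference, incoming, threshold=2.0):
    """Flag drift when the incoming mean moves more than `threshold`
    reference standard deviations away from the reference mean."""
    shift = abs(mean(incoming) - mean(reference))
    return shift > threshold * stdev(reference)


# Hypothetical feature values: one batch seen at training time, two new ones.
reference_batch = [10.0, 11.0, 9.0, 10.5, 9.5]
stable_batch = [10.2, 9.8, 10.1, 9.9]
shifted_batch = [14.0, 15.0, 13.5, 14.5]

print(needs_retraining(reference_batch, stable_batch))
print(needs_retraining(reference_batch, shifted_batch))
```

A check of this kind could run automatically each time new data arrives, which fits the idea of automating actions throughout a model's lifecycle.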
Above all, MLOps is a culture to be developed. A culture that capitalises on the ability to automate and act throughout a model’s lifecycle.
If you want to learn how to use all the tools you’ve just read about, check out the details of the Data Scientist training course at DataScientest.