In recent years many new professions have emerged, some of which you may barely know. These professions will play an important role in the years to come, and one of them is the Data Scientist. This article looks in more detail at what a Data Scientist does, what skills the role requires and what activities it involves.
The Data Scientist
First, it is worth clarifying that the title Data Scientist is applied to a range of jobs that often differ considerably from one another. This is mainly because the field is very recent and brings together many interdisciplinary activities, so it has yet to take a clear and precise form.
A Data Scientist is typically defined as a person who works to extract knowledge and information from large volumes of data, regardless of the form the data takes.
Big Data is one of the fields involved. Managing huge amounts of data, storing them and then analyzing them are becoming increasingly laborious and complex activities, supported by a growing set of new technologies and tools. This is leading to demand for increasingly specialized skills.
The Data Scientist therefore needs a good knowledge of computer science, including the programming languages required to build the application tools the work depends on (Software Engineer). The Data Scientist also needs to be able to use the libraries and applications that perform machine learning. Finally, extracting information from raw data requires an understanding of the mathematical concepts behind statistical analysis (Data Engineer).
What makes the Data Scientist different from similar roles, such as the data analyst or the data engineer?
First, other roles that also work with data focus on interpreting data from observations that have already occurred and were recorded in the past. The Data Scientist, by contrast, should focus primarily on using current and historical data to build models that anticipate the data that will be generated in the near future.
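The difference between describing past data and anticipating future data can be illustrated with a minimal sketch. It assumes a simple straight-line trend fitted by ordinary least squares; the sales figures and the function names (describe, fit_line, predict) are invented for illustration, and real work would normally involve far richer models.

```python
# Hypothetical monthly sales figures as (month, value) pairs, invented for illustration.
history = [(1, 100.0), (2, 110.0), (3, 120.0), (4, 130.0)]


def describe(points):
    """Descriptive view: summarize observations that have already occurred."""
    values = [y for _, y in points]
    return {"mean": sum(values) / len(values), "max": max(values)}


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(points, x_new):
    """Predictive view: extend the fitted trend to a point not yet observed."""
    slope, intercept = fit_line(points)
    return slope * x_new + intercept


summary = describe(history)     # what happened in months 1 to 4
forecast = predict(history, 5)  # what the trend suggests for month 5
print(summary, forecast)
```

The first function only looks backwards, while the last one uses the same historical data to estimate a value that has not been recorded yet.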
Looking at the term itself, ‘scientist’ denotes a professional who applies systematic study, while ‘data’ indicates that the objects of study are quantitative and qualitative variables that carry information. Taken literally, then, a data scientist is a person who systematically studies the organization and properties of information.
The skills required
As you have just seen, working in a multidisciplinary environment means the Data Scientist must understand many concepts from areas that are very different from one another.
In addition, success in this role depends on how well one knows the techniques for extracting, managing and manipulating data. These techniques require a combination of skills covering both computational and statistical aspects.
The image below shows all the necessary skills, with the size of each circle indicating its relevance.
In general, the figure above can be summarized as follows.
The Data Scientist should be able to approach data with a mathematical mindset, interpreting and representing data in mathematical terms. This means gaining experience in the following areas:
- machine learning
- data mining
- data analysis
The Data Scientist should be able to use a programming language to access, explore and model the data. Knowledge of at least one such language is essential in order to work hands-on with the data; Python and R are the usual choices, together with SQL for querying databases.
The Data Scientist will also need prior experience in computing, especially in software development with languages such as Java and C++, and must be familiar with many aspects of computation and software engineering. Knowledge of Hadoop is also fundamental.
- Java or C++
- Software engineering
The Data Scientist's tools
Now let's look at some of the tools needed to do this work well.
For data analysis, the best tools of the trade are programming languages. As mentioned earlier, Python and R are used for programming, and SQL for extracting data from databases.
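As a rough illustration of how SQL and a programming language work together, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The orders table, its columns and the sample rows are hypothetical and exist only for this example; a real project would connect to an existing database instead.

```python
import sqlite3
import statistics

# In-memory database with a hypothetical "orders" table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.0), ("alice", 45.0), ("carol", 10.0)],
)

# SQL does the extraction: one aggregated row per customer.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

# Python takes over for the analysis of the extracted rows.
totals = dict(rows)
average_total = statistics.mean(list(totals.values()))
print(totals, average_total)
```

The division of labour is the point here: the query reduces the raw table to the figures of interest, and the programming language then computes on that smaller result.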
For data warehousing, a Data Scientist works with stored data and so must be thoroughly familiar with databases. MySQL and PostgreSQL are two excellent databases. In the Big Data world, systems such as Hive and Redshift are well-established solutions.
In addition, a Data Scientist must be able to present data professionally, using technologies that make producing charts as easy as possible. D3.js and Tableau are excellent tools for data visualization.
Finally, a Data Scientist must be able to apply the algorithms and the most modern techniques of Machine Learning. This can be done with libraries such as scikit-learn, which runs on Python. There is also MLlib, the machine learning library of Apache Spark, which can also run on Hadoop clusters.
This article has introduced the professional figure of the Data Scientist: what the role consists of, what skills it requires and what tools it relies on. Future articles will explore the use and functionality of many of these tools, with descriptions of varying detail and tutorials on how to install and use them.