Today, we would like to share with you how useful it could be for a data scientist at Damavis to have a good understanding about data engineering. For those of you who have a background in statistics and mathematics and are not completely sure what it means to have a “good understanding about data engineering”, let us give you some guidelines.
Data Scientist at Damavis
Let’s start by briefly defining what a pure data scientist does at Damavis. From our point of view, this kind of profile must have a deep understanding of mathematics and statistics. Knowing how to use specific statistical tools to fit machine learning models and produce forecasts is not enough.
We like to say “a good statistician is not the one who gives you the forecast, but the one who gives you an accurate confidence interval”. To provide these confidence intervals, the data scientist must understand the statistical assumptions behind every model well enough to assess whether the stated level of confidence is valid or not.
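As a small illustration of that point, here is a minimal Python sketch of a confidence interval for a mean. It is not a Damavis implementation: the function name `mean_confidence_interval` and the sample values are hypothetical, and the interval is only meaningful if the observations are independent and the sample mean is approximately normal. With a sample this small, a Student's t interval would usually be the safer choice, which is exactly the kind of assumption the paragraph above asks you to check.

```python
import math
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the mean.

    Assumes i.i.d. observations and an approximately normal sample mean.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(sample)
    # Standard error of the mean, using the sample standard deviation.
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err


# Hypothetical forecast errors, made up for illustration only.
errors = [1.2, -0.4, 0.3, 0.8, -1.1, 0.5, 0.0, -0.2, 0.9, -0.6]
low, high = mean_confidence_interval(errors)
print(f"95% interval for the mean error: [{low:.2f}, {high:.2f}]")
```

If the errors were autocorrelated, as is common in time series, the standard error above would be too small and the interval too narrow, even though the code runs without complaint.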
What about causal inference? Those of you who have tried causal inference before probably know that fitting a model with statistical software is usually not enough to arrive at valid inferences. Good knowledge of econometrics is essential in these cases.
Let’s now dive into when data scientists become potential rock stars at Damavis:
Damavis is a Big Data consultancy firm based in Mallorca, a beautiful island in the Balearic Islands (Spain). We are not a large company (meet our team), so a very useful skill at Damavis is flexibility: the ability to get involved in several kinds of projects, for example jumping from a forecasting project to a big data application development project. This matters because our clients’ needs might require us to, all of a sudden, devote our full attention to a different type of project. Having this ability allows Damavis to adapt easily and quickly to our customers’ needs.
The next question to answer is, what do we mean by a “good understanding about data engineering”? Here is a list of some of the skills we value from data scientists:
Skills we value from Data Scientists
- Strong programming skills. It is important to keep in mind that programming in a notebook or using a statistical programming language such as R, Stata, Matlab, SAS… might not be enough to guarantee a good understanding of programming principles. This is why we value a good understanding of languages such as Python, Scala or Java, among many others, where programming concepts such as classes, methods, traits, extensions, inheritance, and so on, are used.
From our point of view, having strong programming skills implies having a good understanding of the SOLID principles, a set of five object-oriented design principles that make code easier to maintain and extend.
- Being familiar with the Hadoop Ecosystem. Understanding tools such as HDFS, YARN, Spark and Sqoop, among others from the Hadoop Ecosystem, can be very handy, given that we constantly interact with technologies that let our customers scale horizontally and adapt quickly to big data contexts.
- Familiarity with orchestration and visualization tools. Most of the projects we develop share two common denominators: a process that has to be executed regularly in an automated way, and a need for dashboards that communicate to our clients the reasoning behind a machine learning model. So having a good understanding of these two types of tools is definitely useful.
At Damavis we are especially in love with open source technologies, so we like to work with the orchestrator Apache Airflow and the visualization tool Apache Superset. That being said, on a daily basis we also work with proprietary software, so familiarity with proprietary tools would definitely be a plus as well.
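To make the first item of the list more concrete, here is a minimal Python sketch of two of the SOLID ideas: single responsibility and dependency inversion. It is an illustration only, not Damavis code, and every name in it (`Forecaster`, `MeanForecaster`, `LastValueForecaster`, `ForecastReport`) is hypothetical.

```python
from typing import Protocol, Sequence


class Forecaster(Protocol):
    """Abstraction that the report depends on (dependency inversion)."""

    def predict(self, history: Sequence[float]) -> float: ...


class MeanForecaster:
    """Predicts the next value as the mean of the history."""

    def predict(self, history: Sequence[float]) -> float:
        return sum(history) / len(history)


class LastValueForecaster:
    """Predicts that the next value repeats the last observed one."""

    def predict(self, history: Sequence[float]) -> float:
        return history[-1]


class ForecastReport:
    """Only formats output (single responsibility); it does no modelling."""

    def __init__(self, forecaster: Forecaster) -> None:
        self._forecaster = forecaster

    def render(self, history: Sequence[float]) -> str:
        value = self._forecaster.predict(history)
        return f"Next value forecast: {value:.2f}"


history = [1.0, 2.0, 3.0]
print(ForecastReport(MeanForecaster()).render(history))
print(ForecastReport(LastValueForecaster()).render(history))
```

Because `ForecastReport` only knows about the `Forecaster` interface, a new model can be plugged in without touching the reporting code, which is hard to achieve when everything lives in a single notebook cell.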
Of course, many could argue that data scientists do not have to learn these concepts from the world of data engineering. However, as explained above, flexibility is very handy when working at Damavis. In the following picture, you can see the whole team explaining their achievements and the challenges they face in their projects during our daily meetings.
To lead by example, here are a few articles that our data scientist Daniel Bestard wrote about Apache Hadoop and Apache YARN, a series that is particularly useful for data engineer profiles:
- Introduction to Apache Hadoop: How to configure and run one of the most common open source tools used in big data contexts.
- Introduction to Apache YARN: How to configure Apache YARN to execute parallel jobs.
- First steps with Apache YARN customization: Introduction to a more advanced configuration setting to improve YARN’s performance.
Do you also think it is cool that data scientists are able to understand the language of data engineers?