Feb 5, 2022 · 4 min read

State of Data Science 2021: Interesting findings from Anaconda's Survey

Updated: Feb 7, 2022

Anaconda, a leading open-source data science platform that hosts the Python and R languages, conducts an annual data science survey. The survey evaluates the growth of data science and its adoption in commercial and academic environments, and offers insights on preparing for the future of the field.

The Anaconda report can be found here. A quick registration is required to access the full report. In this post, I summarise the interesting findings and how they relate to my work as a data analyst.

Sample Population

The respondents came from social media, Anaconda's email database, and Anaconda.org. The survey does not provide further information on selection criteria. Of the 4,299 participants from 140 countries, 74% are between 18 and 40, with Millennials accounting for 50% of the sample. Unsurprisingly, the issue of gender diversity persists in this male-dominated STEM field: women make up only 23% of respondents.
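To make these shares more concrete, the short sketch below converts them into approximate headcounts. It assumes each percentage is calculated against the full 4,299 respondents, which the post does not confirm, and the helper name approx_count is only illustrative.

```python
# Total number of survey participants quoted above.
TOTAL_RESPONDENTS = 4299

# Percentages quoted above. Assumption: each one uses the full sample as its base.
shares = {
    "aged 18 to 40": 74,
    "Millennials": 50,
    "women": 23,
}


def approx_count(total, percent):
    """Return percent% of total, rounded half up, using integer arithmetic."""
    return (total * percent + 50) // 100


for label, percent in shares.items():
    count = approx_count(TOTAL_RESPONDENTS, percent)
    print(f"{label}: {percent}% is roughly {count} respondents")
```

Under that assumption, 23% corresponds to fewer than 1,000 women among the respondents.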

Experience level

Entry-level roles make up 15% of current job levels. There appears to be a competency imbalance in data science teams, as senior, managerial and administrative roles make up most of the other 85%. In recent conversations, many entry-level employees said they have experienced career burnout due to the high expectations that come from working with teammates far more advanced in their careers.

Time Allocation

In many areas, data specialists believe data preparation and cleansing take up 80% of data tasks. Surprisingly, in this survey, only 39% of respondents allocated time to both processes. Data preparation can be a tedious and time-consuming task. Interestingly, automation is not considered a preferred solution. Instead, having a skilled human in the mix, with experience in data quality review, ensures more accurate results and provides context for the data.
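As a rough illustration of keeping a human in the mix, the sketch below flags incomplete records for manual review instead of fixing them automatically. The function name, field names and sample records are all made up for this example and do not come from the survey.

```python
def flag_for_review(records, required_fields):
    """Split records into clean ones and ones that need a human reviewer.

    A record needs review when any required field is missing or empty.
    Nothing is corrected automatically; the reviewer decides what to do.
    """
    clean, needs_review = [], []
    for record in records:
        problems = [
            field for field in required_fields
            if record.get(field) in (None, "")
        ]
        if problems:
            needs_review.append((record, problems))
        else:
            clean.append(record)
    return clean, needs_review


# Hypothetical records, invented purely for illustration.
sample = [
    {"id": 1, "country": "NG", "age": 29},
    {"id": 2, "country": "", "age": 41},
    {"id": 3, "country": "UK", "age": None},
]

clean, needs_review = flag_for_review(sample, ["country", "age"])

print(f"{len(clean)} clean record(s)")
for record, problems in needs_review:
    print(f"Record {record['id']} needs review: missing {', '.join(problems)}")
```

The automated step only finds the problem rows. What a blank value should become is left to someone who understands the context of the data.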

Aside from the low time allocation to data quality, only 11% of respondents deploy their data models. Deployment remains an issue in many data science teams. The percentage of deployments suggests that only a few data science tasks are tested in real-world scenarios.

Reasons for not deploying

Of the three reasons highlighted for not deploying machine learning models, respondents cite selecting the right tool for a data task as the main reason for delayed project delivery times and outcomes. With many organisations adopting various tools for a vast array of reasons, a question to the data science community would be: what are the must-haves for getting tasks from production to deployment in such a fast-paced industry?

Task Blockers by roles

Interestingly, the major blocker is not "we are not deploying the model to production". In a different post, we will examine each blocker separately. The focus in this image is that data scientists highlighted "a skill gap in my organisation" as a primary blocker. Recent studies have proposed ways to bridge the gap between STEM employment and undergraduate education.

As seen, most undergraduate students learn data visualisation and Python, while the areas that STEM employment lacks seem to be heavily neglected.

Programming Language of Choice

Many employers encourage open-source software. Python and SQL remain the most popular.

Involvement in Decision Making

Data-driven decision making is crucial to any organisation. 35% of data specialists reported that business decisions are based on their involvement in interpreting insights as a team. However, data literacy is a missing skill at the managerial level: 52% of managers require additional data education. This is a significant issue, as data understanding is crucial to extracting and deploying models.

ML/AI Bias

As machine learning applications and research become increasingly popular, concerns about data bias, overfitting and underfitting have caused machine learning algorithms to gain more scrutiny.

In this survey, 31% of data specialists strongly believe the main problem is bias from social impacts that distorts data and models.

There is an urgent need for teams and organisations to plan steps towards ensuring fairness and mitigating bias. Unfortunately, only 10% have begun implementing at least one action.

Additionally, only 10% of data specialists reported having already implemented at least one of the crucial steps to improve model explainability and interpretability.

Applying eXplainable AI to data models will improve their acceptance in real-world scenarios, since many scenarios currently rely on manual processes despite the high accuracy of machine learning algorithms.

Regardless of the concerns about bias, 55% of the survey participants hope to see more automation or AutoML in data science. In a future survey, it would be beneficial to know which scenarios data specialists would apply AutoML to, since most believe AutoML is not a solution for data preparation and cleaning.

Conclusion

Many data specialists and data teams require properly defined processes to deploy data models successfully. There is a skill gap, as most data specialists are in entry-level or administrative roles. Undergraduate education could bridge the skill gap in STEM jobs, thereby creating more intermediate positions. Furthermore, knowledge of Python, SQL, and R remains valuable to data specialists. Similarly, business understanding is a required skill for further improving data tasks. Data preparation and cleaning require a mix of human and automated effort to ensure data quality, rather than relying on automated solutions only. Finally, although data bias is a persistent issue with data models, many data specialists are optimistic about seeing more automated and AutoML processes applied to real-world problems.