Organizations around the globe are seeking to unlock the value that data can provide. To that end, they hire data scientists in large numbers, hoping to drive results immediately. It turns out, however, that many businesses fail to make the best use of their data scientists because they are unable to provide them with the right environment and raw material. In this article, we examine the main elements hindering data scientists' productivity, and we explore the solutions available.
What is a data scientist?
Officially, a data scientist's job consists of building predictive models using advanced mathematics, statistics, and various programming tools. In practice, however, there are misconceptions about the role. In most organizations, data scientists' tasks include retrieving data, cleaning data, building models, and presenting their findings in business terms. Data scientists encounter key challenges at each step of their working process, drastically slowing down their progress and leading to frustration in data teams. Although there are many more than five challenges in a data scientist's life, the biggest pain points we have identified are: finding the right data, getting access to it, understanding tables and their purpose, cleaning the data, and explaining in layperson's terms how their work links to the organization's performance. We explain these challenges and propose solutions to clear the rocks from their path.
1) Finding the data
The first step of any data science project is unsurprisingly to find the data assets needed to start working. The surprising part is that the availability of the "right" data is still the most common challenge of data scientists, directly impacting their ability to build strong models. But why is data so hard to find?
The first issue is that most companies collect tremendous volumes of data without first determining whether it is really going to be consumed, and by whom. This is driven by a fear of missing out on key insights that could be derived from it, and by the availability of cheap storage. The dark side of this data-collection frenzy is that organizations end up gathering useless data, taking the focus away from actionability. This makes it harder for data users to find the data assets that are truly relevant to the business strategy. Businesses need to ensure they collect relevant data that is going to be used. For that, it is key to understand exactly what needs to be measured in order to drive decision-making, and this varies from one organization to another.
Secondly, data is scattered in multiple sources, making it difficult for data scientists to find the right asset. Part of the solution is to consolidate the information in a single place. That's why so many companies use a data warehouse, in which they store the data from all their various sources.
However, having a single source of truth for your data assets is not enough without data documentation. What use can you make of a huge data repository if you don't know what's in it? The key for data scientists to find the tables relevant to their work is to maintain a neatly organized inventory of data assets. That is, each table should be enriched with context about what it contains, who imported it into the company, which dashboards and KPIs it is related to, and any other information that can help data scientists locate it. This inventory can be maintained manually, in an Excel spreadsheet shared with your company's employees. If that's what you need at the moment, we've got a template in store here, and we explain how to use it effectively. If your organization is too large for manual documentation, the alternative solution is to use a data cataloging tool to bring visibility to your data assets. If you prefer this option, make sure you choose a tool that suits your company's needs. We've listed the various options here.
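To make the idea of an inventory concrete, here is a minimal sketch of what one documented entry could look like and how a keyword search over it might work. The `TableEntry` class, its field names, the `search_inventory` helper, and the sample tables are all hypothetical and chosen for illustration; a spreadsheet or a cataloging tool would hold the same kind of context.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TableEntry:
    """One row of a hypothetical data-asset inventory."""
    name: str
    description: str                 # what the table contains
    owner: str                       # who imported it into the company
    dashboards: List[str] = field(default_factory=list)
    kpis: List[str] = field(default_factory=list)


def search_inventory(inventory: List[TableEntry], keyword: str) -> List[str]:
    """Return the names of tables whose documented context mentions the keyword."""
    needle = keyword.lower()
    matches = []
    for entry in inventory:
        # Search the name, description, dashboards, and KPIs (not the owner).
        haystack = " ".join(
            [entry.name, entry.description, *entry.dashboards, *entry.kpis]
        ).lower()
        if needle in haystack:
            matches.append(entry.name)
    return matches


# Illustrative sample data, not taken from any real warehouse.
inventory = [
    TableEntry(
        name="orders_daily",
        description="One row per order, refreshed nightly",
        owner="alice",
        dashboards=["Sales overview"],
        kpis=["revenue"],
    ),
    TableEntry(
        name="web_sessions",
        description="Raw website sessions from the tracking tool",
        owner="bob",
        dashboards=["Traffic"],
        kpis=["conversion rate"],
    ),
]

print(search_inventory(inventory, "revenue"))
```

The point of the sketch is that a table becomes findable through its context (description, dashboards, KPIs), not only through its name.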
2) Getting access to the data
Once data scientists locate the right table, the next challenge is accessing it. Security and compliance issues are making it harder for data scientists to access datasets. As organizations transition to cloud data management, cyberattacks have become quite common. This has led to two major issues:
- Confidential data is becoming vulnerable to these attacks
- The response to cyberattacks has been to tighten regulatory requirements for businesses. As a result, data scientists are struggling to get consent to use the data, which drastically slows down their work. Worse, they are sometimes refused access to a dataset altogether.
Organizations thus face the challenge of keeping data secure and ensuring strict adherence to data protection norms such as GDPR, while allowing the relevant parties to access the data they need. Failing at one of these two objectives will lead either to expensive fines and time-consuming audits, or to the impossibility of leveraging data efficiently.
Again, the solution lies in cataloging tools. Data catalogs simplify regulatory compliance while making sure the right people can access the data they need. This is mainly achieved through access management features, whereby you can grant or restrict access to tables in one click based on employees' statuses. This way, data scientists can seamlessly access the datasets they need. You will find further information here about how data catalogs can be used as regulatory compliance tools.
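As a rough illustration of status-based access management, the sketch below keeps one set of readable tables per employee status and exposes grant, restrict, and check operations. The `ACCESS_RULES` mapping, the status names, and the function names are assumptions made for this example; they do not describe how any particular catalog implements the feature.

```python
# Hypothetical mapping from employee status to the tables that status may read.
ACCESS_RULES = {
    "data_scientist": {"orders_daily", "web_sessions"},
    "marketing_analyst": {"web_sessions"},
    "intern": set(),
}


def can_access(status: str, table: str) -> bool:
    """Return True if employees with this status may read the table."""
    return table in ACCESS_RULES.get(status, set())


def grant(status: str, table: str) -> None:
    """Allow every employee with this status to read the table."""
    ACCESS_RULES.setdefault(status, set()).add(table)


def restrict(status: str, table: str) -> None:
    """Remove read access to the table for this status (no-op if absent)."""
    ACCESS_RULES.get(status, set()).discard(table)


print(can_access("data_scientist", "orders_daily"))
print(can_access("intern", "orders_daily"))
```

Because the rule is attached to the status and not to each individual, one change covers every employee who holds that status, and unknown statuses are denied by default.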
3) Understanding the data
You would think that once data scientists find and obtain access to a specific table, they can finally work their magic and build powerful predictive models. Sadly, not yet. They usually sit scratching their heads for ridiculous amounts of time with questions such as:
- What does the column name 'FRPT33' even mean?
- Who can I ask about this?
- Why are there so many missing values?
Although these questions are simple, getting an answer isn't. Datasets often have no clear owner in organizations, and finding the person who knows the meaning of the column name you are enquiring about is like trying to find a needle in a haystack.
The solution to prevent data scientists in your organization from spending too much time on these basic questions is again to ... document data assets. As simple as that. If you can have a written definition for every column in every table of your data warehouse, you will see the productivity of your data scientists skyrocket. Does that seem tedious? We assure you, it takes less time than letting undocumented assets roam around your business with unproductive data scientists spending 80% of their time trying to figure them out. Also, modern data documentation solutions have automation features, meaning that when you define a single column in a table, this definition is propagated to all other columns bearing a similar name in other tables.
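The propagation of definitions can be sketched in a few lines. This is only an illustration under stated assumptions: "similar name" is interpreted here as an exact match after trimming whitespace and lowercasing, and the function names, table names, and definition text are hypothetical. Real documentation tools may use looser or smarter matching.

```python
from typing import Dict, List, Tuple


def normalize(column_name: str) -> str:
    """Assumed matching rule: ignore case and surrounding whitespace."""
    return column_name.strip().lower()


def propagate_definitions(
    tables: Dict[str, List[str]],
    definitions: Dict[str, str],
) -> Dict[Tuple[str, str], str]:
    """Attach each known definition to every column with a matching name."""
    known = {normalize(name): text for name, text in definitions.items()}
    documented = {}
    for table, columns in tables.items():
        for column in columns:
            text = known.get(normalize(column))
            if text is not None:
                documented[(table, column)] = text
    return documented


# Illustrative warehouse layout: the column is defined once, used twice.
tables = {
    "orders": ["order_id", "customer_id", "FRPT33"],
    "refunds": ["refund_id", "Customer_ID"],
}
definitions = {"customer_id": "Unique identifier of the customer"}

documented = propagate_definitions(tables, definitions)
for (table, column), text in documented.items():
    print(f"{table}.{column}: {text}")
```

Columns such as `FRPT33` stay undocumented until someone writes a definition for them, which makes the remaining gaps easy to list.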
4) Data cleaning
Unfortunately, real-life data is nothing like hackathon data or Kaggle data. It is much messier. The result? Data scientists spend most of their time pre-processing data to make it consistent before analyzing it, instead of building meaningful models. This tedious task involves cleaning the data, removing outliers, encoding variables, and so on. Although data pre-processing is often considered the worst part of a data scientist's job, it is crucial that models are built on clean, high-quality data. Otherwise, machine learning models learn the wrong patterns, ultimately leading to wrong predictions. How then can data scientists spend less time pre-processing data while ensuring only high quality data is used for training machine learning models?
One solution lies in using augmented analytics, that is, the use of technologies such as machine learning and AI to assist data scientists with data preparation. This makes it possible to automate certain aspects of data cleansing, which can save data scientists significant amounts of time without lowering the quality of their work.
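For readers less familiar with pre-processing, the sketch below shows three of the manual steps mentioned in this section: dropping missing values, removing outliers, and encoding a categorical variable. It is a hand-written illustration using only the Python standard library, not an augmented analytics tool. The 1.5 × IQR outlier rule, the function names, and the sample values are assumptions chosen for the example.

```python
import statistics
from typing import Dict, List, Optional


def drop_missing(values: List[Optional[float]]) -> List[float]:
    """Remove missing entries (represented here as None)."""
    return [v for v in values if v is not None]


def remove_outliers(values: List[float], k: float = 1.5) -> List[float]:
    """Keep values within k * IQR of the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def one_hot(values: List[str]) -> List[Dict[str, int]]:
    """Encode each categorical value as a dict of 0/1 indicator columns."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]


# Illustrative raw column with a missing value and an obvious outlier.
raw = [10, 12, None, 11, 13, 12, 11, 500]
clean = remove_outliers(drop_missing(raw))
print(clean)

print(one_hot(["red", "blue", "red"]))
```

Each of these steps encodes a judgment call (what counts as an outlier, how to treat missing values), which is part of why automating them safely is hard.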
5) Communicating the results to non-technical stakeholders
Data scientists' work is meant to be perfectly aligned with business strategy, as the ultimate goal of data science is to guide and improve decision-making in organizations. Hence, one of their biggest challenges is to communicate their results to business executives. Managers and other stakeholders are often unfamiliar with the tools and the work behind models, so they have to base their decisions on data scientists' explanations. If data scientists can't explain how their model will affect the performance of the organization, their solution is unlikely to be implemented. Two things make this communication to non-technical stakeholders a challenge:
- First, data scientists often have a technical background, making it difficult for them to translate their data findings into clear business insights. But this is something that can be practiced. They can adopt concepts such as "data storytelling" to provide a powerful narrative to their analysis and visualizations.
- Second, business terms and KPIs are poorly defined in most companies. For example, everyone knows roughly what ROI is made of in a company, but there is rarely a common understanding across all departments of how exactly it is computed. There end up being as many ROI definitions as there are employees calculating it. And it's the same story for other KPIs and business terms. This makes it even harder for data scientists to understand and explain the impact of their work on specific KPIs. How on earth are they then expected to convince business executives to implement their solutions? The solution is simple. Define your KPIs and make sure everyone has a common understanding of each metric. Properly defined business KPIs will allow you to measure exactly the business impact generated by data scientists' analyses. A good way of building a single source of truth for your KPIs and business terms is to use a data catalog. This solution ensures everyone is aligned regarding key definitions for your business.
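One way to picture a shared KPI definition is a single documented function that every team calls, so that nobody recomputes the metric their own way. The sketch below assumes ROI is defined as net gain divided by cost, which is one common convention; your organization may define it differently, and the function name and figures are purely illustrative.

```python
def roi(gain: float, cost: float) -> float:
    """Return on investment, assumed here to be (gain - cost) / cost.

    Keeping one documented definition avoids each department
    computing the metric in a slightly different way.
    """
    if cost == 0:
        raise ValueError("ROI is undefined when cost is zero")
    return (gain - cost) / cost


# Illustrative figures: 1,500 gained from a 1,000 investment.
print(roi(gain=1500.0, cost=1000.0))  # prints 0.5
```

Whatever formula you settle on, the value lies in writing it down once, with its inputs clearly named, where every department can find it.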
Data scientists' productivity, and your data team's productivity in general, is greatly impacted by factors that could easily be avoided. Collecting relevant data, centralizing data assets, documenting your tables, clearly defining business terms and KPIs: these good practices are easy to put in place, and will radically improve the productivity of your data team while bringing frustration levels down.
We write about all the processes involved when leveraging data assets: from the modern data stack to data teams composition, to data governance. Our blog covers the technical and the less technical aspects of creating tangible value from data.
At Castor, we are building a data documentation tool for the Notion, Figma, Slack generation. We designed our catalog software to be easy to use, delightful and friendly.
Want to check it out? Reach out to us and get a free 14-day demo.