The 5 things every data analyst should know

And why it is neither Python nor SQL

May 9, 2023

By Arnaud de Turckheim

#1 If your analysis doesn’t have any bias, then look again

Problem Definition

A bias is an inclination for or against an idea. Most of the time it is entirely unconscious, and it mainly creeps in when our results are exactly what we expected. We are all human: if we have expectations about something and, after digging into the data a bit, our first results match them, we tend to stop right there. When the results are not what we expected, we keep digging until they are.

How to avoid that?

Think about what could make your analysis results wrong. I see two main drivers of such bias.

The scope of your analysis

Changing the date range, or even the data used, may give you different results. The classic challenges are seasonality and mix effects. Be mindful of cohort effects, too.
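To make the mix effect concrete, here is a minimal sketch with made-up numbers. The segment names, periods, and figures are purely illustrative assumptions. The conversion rate improves in every segment, yet the overall rate drops because the traffic mix shifted toward the weaker segment.

```python
# Hypothetical figures: (visitors, conversions) per segment and period.
data = {
    "Q1": {"desktop": (800, 80), "mobile": (200, 4)},
    "Q2": {"desktop": (300, 33), "mobile": (700, 21)},
}


def rate(visitors, conversions):
    """Conversion rate for a single segment."""
    return conversions / visitors


def overall_rate(period):
    """Conversion rate for a period, all segments pooled together."""
    visitors = sum(v for v, _ in data[period].values())
    conversions = sum(c for _, c in data[period].values())
    return conversions / visitors


# Compare each segment on its own, then the pooled figure.
for segment in ("desktop", "mobile"):
    q1 = rate(*data["Q1"][segment])
    q2 = rate(*data["Q2"][segment])
    print(f"{segment}: {q1:.1%} -> {q2:.1%}")

print(f"overall: {overall_rate('Q1'):.1%} -> {overall_rate('Q2'):.1%}")
```

Looking only at the pooled figure would suggest performance got worse, while looking only at the segments would suggest it got better. Checking both is a cheap way to challenge the scope of your analysis.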

The methodology of your analysis

This one flirts with statistics 101. Now that you’ve got the right scope of time and data points, think carefully about how you aggregate them to get results. Consider outliers as well as the aggregation metric. Always check the mean versus the median.
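As a quick illustration of the mean versus median check, here is a small sketch using Python's standard `statistics` module. The order values are invented for the example, and the last one is a deliberate outlier.

```python
from statistics import mean, median

# Hypothetical order values; the last one is an outlier.
order_values = [20, 22, 25, 27, 30, 31, 35, 2000]

avg = mean(order_values)
med = median(order_values)

print(f"mean:   {avg}")
print(f"median: {med}")

# A large gap between the two is a hint that a few extreme
# values are driving the average.
print(f"mean / median ratio: {avg / med:.1f}")
```

Here a single extreme order pulls the mean far away from what a typical order looks like, while the median barely moves. Neither number is "the right one"; the gap between them is the signal worth investigating.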

#2 Most first drafts can be done in Excel

That title is a bit provocative. Yes, Python is powerful and lets you save and repeat your data processing. But there is a cost to that. First, it takes time, especially if you’re not a Python hotshot. Second, collaboration is tougher with non-tech users. If you need non-code-savvy people to work with you on your data app, then Python will slow them down.

As a data player, you’ll want to do projects in Python, simply to ramp up. But choose them carefully. If you have a super tight schedule and Excel does the job, then go for Excel. You can migrate to Python later, as it is always easier to learn one thing at a time. It’s hard to build a brand new data app in a language you’re not comfortable with. First do the analysis with a tool you know well, then migrate it to the new language.

#3 Get yourself a tool that keeps your query history

Ever got a data request similar to one you had 3 months ago? It happens too many times per year, leaving you wishing you had a nice history of all the queries you ran in the last 365 days…

Check out Castor to do so.

#4 Don’t fix the data, fix the process that creates it

Let’s start with a real-life example.

One of the data pipelines at one of my previous companies kept breaking because of a uniqueness issue: a table field was supposed to be a primary key, but it contained duplicates. That field was client_id, and a client was supposed to be in one and only one country.

So whenever we had this issue we had to find the client linked to several countries and fix it. We would also remind the sales team of the “one country rule”.
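The manual lookup described here can be sketched in a few lines of Python. The rows, the `(client_id, country)` layout, and the function name are assumptions made for illustration, not the actual pipeline. Note that this only locates the offending clients; it does not address why they exist.

```python
from collections import defaultdict

# Hypothetical extract: one (client_id, country) pair per row.
rows = [
    ("C001", "FR"),
    ("C002", "US"),
    ("C001", "BE"),  # same client, second country: breaks the rule
    ("C003", "DE"),
    ("C002", "US"),  # exact repeat, still a single country
]


def clients_with_several_countries(rows):
    """Return {client_id: [countries]} for clients breaking the one country rule."""
    countries_by_client = defaultdict(set)
    for client_id, country in rows:
        countries_by_client[client_id].add(country)
    return {
        client_id: sorted(countries)
        for client_id, countries in countries_by_client.items()
        if len(countries) > 1
    }


offenders = clients_with_several_countries(rows)
print(offenders)
```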

Should we build a dedicated alerting system for this specific matter? Should we add a transformation layer on top? Should we remove that “unique” check? None of these. We must (and haven’t yet) simply enforce that rule when the data is created at the source, that is, in Salesforce, by salespeople.

As much as possible, get to the root cause of your data issues, and make people understand that good data requires processes that are designed for it. Processes are indeed made first to improve the business, but for the sake of having good data, they must factor in the data dependencies.

#5 Share your analysis as widely as possible

Too many data players wait for their data app to be perfect before sharing it. Share it now (with a “WIP” disclaimer at the beginning if you want). Do not spend more than a few days without having a peer review of your work. It will give you perspective.

Conclusion

Yes, hard skills (Python, SQL, R…) are key to getting started with your analysis, but personally, I pay more attention to soft skills (good communication, ability to see the big picture, straight-to-the-point, hacky).

Happy to have a constructive debate in the comments.

About us

We write about all the processes involved when leveraging data assets: from the modern data stack to data teams composition, to data governance. Our blog covers the technical and the less technical aspects of creating tangible value from data.

At Castor, we are building a data documentation tool for the Notion, Figma, Slack generation. We designed our catalog software to be easy to use, delightful and friendly.

Want to check it out? Reach out to us and we will show you a demo.
