The relativity of raw data

July 26, 2016

Data scientists often say that they want access to the ‘raw data’ – but what does that term mean?

An important characteristic of raw data [is that it] is relative to your reference frame.

The raw data is raw to you if you have done no processing, manipulation, coding, or analysis of the data. In other words, the file you received from the person before you is untouched. But it may not be the rawest version of the data. The person who gave you the raw data may have done some computations. They have a different “raw data set”.

The implication for reproducibility and replicability is that we need a “chain of custody” just like with evidence collected by the police. As long as each person keeps a copy and record of the “raw data” to them you can trace the provenance of the data back to the original source.

(via Simply Statistics)

If you’re interested in how to maintain this “chain of custody”, check out Recipy, an “effortless method to record provenance in Python”. You add a single line, import recipy, at the top of your Python script, and Recipy keeps track of exactly how your output files were created.
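To make the “copy and record” idea concrete, here is a minimal sketch that uses only the Python standard library. It does not use Recipy and is not how Recipy works internally. The function names (sha256_of, record_receipt, is_untouched), the JSON Lines log format, and the field names are all assumptions made for illustration. The idea is that each person fingerprints the file exactly as they received it, so anyone downstream can later check that their copy matches the recorded “raw” version.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_receipt(data_file: Path, received_from: str, log_file: Path) -> dict:
    """Append one custody record for a file exactly as it was received.

    The record layout is hypothetical; adapt the fields to your own needs.
    """
    entry = {
        "file": data_file.name,
        "sha256": sha256_of(data_file),
        "received_from": received_from,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    # One JSON object per line keeps the log append-only and easy to diff.
    with log_file.open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


def is_untouched(data_file: Path, entry: dict) -> bool:
    """Check that a file still matches the digest recorded on receipt."""
    return sha256_of(data_file) == entry["sha256"]
```

A typical use would be to call record_receipt once, before any processing, and to call is_untouched whenever you need to confirm that your “raw” copy has not changed. If every person in the chain keeps such a log, the digests link each version back to the one before it.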


Francis T. O'Donovan