Clean data is important for data scientists, and achieving it often means dealing with missing values. Naively, it might seem intuitive to simply drop the affected samples, but this is often not the right answer (especially when training data is scarce).
On the topic of data cleaning, we’ll be answering the following question:
You’re preparing a 3000 sample training dataset with 40 features. It has 6% of the values missing (assume these are randomly distributed). If you remove all samples that have missing data, what’s the probability that a given sample will be removed?
Try to solve it :)
Don’t read on until you’ve got an answer.
If it was a breeze, then you might enjoy my homework question at the bottom of this post as well.
In practice we probably wouldn’t need to worry about solving this analytically, thanks to computational methods. For example, with pandas we could just call
df.isnull().any(axis=1).sum() / len(df)
to get the fraction of rows that would be removed.
With this in mind, let’s solve the question computationally using a Monte Carlo simulation.
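The simulation function itself isn’t shown above, but here’s a minimal sketch of what monte_carlo_sim might look like (assuming NumPy and pandas, with the dataset size from the question; the prints mirror the log output below):

```python
import numpy as np
import pandas as pd

def monte_carlo_sim(seed, n_samples=3000, n_features=40, p_missing=0.06):
    """Build a random dataset, knock out ~6% of the values at random,
    and return the fraction of rows with at least one missing value."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(rng.random((n_samples, n_features)))
    # Each cell goes missing independently with probability p_missing
    mask = rng.random(df.shape) < p_missing
    df = df.mask(mask)
    print(f"generated df with {mask.mean():.1%} missing")
    frac_removed = df.isnull().any(axis=1).mean()
    print(f"killing {frac_removed:.1%} of rows")
    return frac_removed
```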
Running 1000 simulations…
>>> results = [monte_carlo_sim(seed) for seed in range(1000)]
generated df with 6.0% missing
killing 91.5% of rows
generated df with 6.1% missing
killing 92.0% of rows
generated df with 6.1% missing
killing 91.8% of rows
We get the following distribution, with an average result of 91.6%.
So we can see that a sample (row) has ~91.6% chance of being removed. That’s pretty high, considering that only 6% of the values are missing!
Monte Carlo is fun but analytic solutions are more interesting if you ask me.
One way to approach the problem is to figure out the probability of a row being OK (no missing data), and then subtracting that probability from one.
The chance of a row being OK is given by the product of the individual probabilities of each feature being present. Since the missing values are randomly distributed, each of these is 1 − 0.06 = 0.94, giving 0.94⁴⁰ ≈ 8.42%.
Subtracting this from 1, we find that the probability of a bad row is 1 − 0.94⁴⁰ ≈ 91.58%.
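As a quick sanity check, this is a one-liner in Python:

```python
# Probability a row survives: all 40 features present, each with prob 0.94
p_ok = 0.94 ** 40
p_removed = 1 - p_ok
print(f"{p_removed:.2%}")  # 91.58%
```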
That was fun, but we can do even better and directly compute the probability of a bad row using the binomial distribution’s probability mass function:
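For reference, the standard binomial PMF for n independent trials with success probability p is:

```latex
P(X = k) = \binom{n}{k}\, p^{k} \,(1 - p)^{\,n - k}
```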
This gives us the probability of seeing k “successes” from n independent trials, where the probability of success is p. Because there are many different ways of getting a bad row (1/40 values missing, 2/40 values missing … etc.), we need to sum over all these situations. In other words, we need to sum over all these different values for k.
Let’s evaluate this with Python:
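The snippet wasn’t shown above; one way to write it, using the standard library’s math.comb to sum the PMF over k = 1 … 40, would be:

```python
from math import comb

n, p = 40, 0.06  # number of features, probability a value is missing
# Sum P(X = k) over every k >= 1 (i.e. every way a row can be "bad")
prob_bad = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(1, n + 1))
print(f"{prob_bad:.2%} chance of row being removed")
```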
91.58% chance of row being removed
As you can see, we get the same probability as before :)
Now for something a bit more difficult. Let me know your answer in the comments.
You’re preparing a 2000 sample training dataset with 75 features. It has 8% of the values missing (assume these are randomly distributed). If you remove all samples that have greater than 5 missing features, what’s the probability that a given sample will be removed?
Good luck, and thanks for reading!