Most of the time, data analytics is based on what is called “secondary data”: data that was gathered previously for reasons other than the analytics exercise at hand. Most data that businesses track as a routine part of their operations is therefore secondary data.
Today we see many examples in business of the benefits of using secondary data for analytics, and those benefits are indeed huge. This presentation, however, uses examples that make you think about some of the potential limitations of such data. A good data scientist needs these perspectives as well.
Does smoking cause cancer? How do you conclusively answer this question?
Carefully controlled experiments are the gold standard for answering such questions. An experimenter might randomly select a group of laboratory rats, for example, and have a random half of them “smoke” while the other half does not.
If all else remains similar between the rats, and the rats that smoked later develop cancer at a higher rate than those that did not, then a statement can be made about likely causality.
There are many textbooks devoted to the topic of conducting such experiments, and analyzing the results to make inferences regarding potential causality.
If experiments were not possible, what sort of secondary data might we need to answer this question? Let’s look at some scenarios.
Consider Scenario A. You have complete information on every single person in the world, including genetic and lifestyle information, and you manage to track everything they did over their lifetimes.
You also track whether they smoked, and whether they developed cancer. In that case it might be possible to analyze the data to see whether, after controlling for everything else, the variable capturing smoking habits still strongly predicts the likelihood of cancer. This scenario, of course, is purely hypothetical, because such data is impossible to obtain.
Consider Scenario B. You are given complete data on patients who are diagnosed with cancer, including whether they smoked. Can you analyze this data to answer the question?
Nothing can be shown conclusively here, of course, since the data covers only patients with cancer. There is no data on people without cancer; what if many of them smoked as well?
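To make the limitation of Scenario B concrete, here is a minimal Python simulation. All the numbers in it are invented for illustration (a 30% smoking rate and 15% vs. 5% cancer risks are assumptions, not real epidemiology). With the full population we can compare cancer rates between smokers and non-smokers; once the data is restricted to cancer patients only, all we can compute is the smoking rate among patients, and there is no baseline to compare it against.

```python
import random

random.seed(0)

# Hypothetical population: smoking raises cancer risk.
# All probabilities below are made up for illustration only.
population = []
for _ in range(100_000):
    smokes = random.random() < 0.3
    p_cancer = 0.15 if smokes else 0.05
    population.append((smokes, random.random() < p_cancer))

def rate(rows):
    """Fraction of rows in which cancer occurred."""
    return sum(1 for _, cancer in rows if cancer) / len(rows)

# With full data (Scenario A-like), we can compare the two groups.
smokers = [r for r in population if r[0]]
non_smokers = [r for r in population if not r[0]]
print(f"P(cancer | smoker)     = {rate(smokers):.3f}")
print(f"P(cancer | non-smoker) = {rate(non_smokers):.3f}")

# Scenario B: only cancer patients are recorded. From this subset we can
# compute P(smoker | cancer), but without the cancer-free group there is
# no comparison to make, so the risk difference is unrecoverable.
patients = [r for r in population if r[1]]
print(f"P(smoker | cancer)     = {sum(1 for s, _ in patients if s) / len(patients):.3f}")
```

The point of the sketch is that the first two quantities require data on people without cancer, which Scenario B simply does not contain.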
Consider Scenario C. This third scenario sometimes happens in real life. There might be two groups of people who are otherwise similar, except for one thing. Perhaps a random group from one country migrates to a different place. Assume the group that migrated started smoking to fit into the lifestyle of the new place, and that this group also had a greater incidence of cancer over time. Then perhaps smoking is the reason, since the groups were otherwise similar.
Another example that highlights some of the problems with secondary data relates to the Challenger space shuttle tragedy in 1986.
One of the causes of the explosion was the failure of components in the shuttle called O-rings. The launch also took place on an unusually cold day, with temperatures near freezing, and there had been concerns that low temperatures could be a problem.
If you take the data from prior launches in which there was an O-ring failure and plot the number of failures against the launch temperature, the data seems to suggest that failures do not depend on launch temperature.
However, if you plot data from all launches, including the ones with zero O-ring failures, you see that most of the failure-free launches took place at much higher temperatures. From this graph, the likelihood of O-ring failure at 30°F appears very high.
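The effect of dropping the zero-failure launches can be sketched in a few lines of Python. The launch records below are invented to mimic the pattern just described; they are not the actual NASA figures. Comparing the average launch temperature of the two groups shows the contrast that the failure-only subset hides.

```python
# Illustrative (made-up) launch records: (temperature_F, o_ring_failures).
# These mimic the pattern described in the text, not the real NASA data.
launches = [
    (53, 2), (57, 1), (63, 1), (70, 1), (70, 1), (75, 2),   # launches with failures
    (66, 0), (67, 0), (67, 0), (68, 0), (69, 0), (72, 0),
    (73, 0), (76, 0), (78, 0), (79, 0), (80, 0), (81, 0),   # failure-free launches
]

def mean(xs):
    return sum(xs) / len(xs)

# The incomplete picture: only launches that had failures.
failure_temps = [t for t, f in launches if f > 0]
print(f"mean temp, failure launches only: {mean(failure_temps):.1f} F")

# The full picture: the zero-failure launches cluster at higher temperatures.
ok_temps = [t for t, f in launches if f == 0]
print(f"mean temp, failure-free launches: {mean(ok_temps):.1f} F")
```

Within the failure-only subset, temperature shows no clear pattern; only when the failure-free launches are included does the association between cold weather and failures emerge.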
Secondary data can sometimes simply be incomplete, or biased. It is important to think about these issues carefully before diving into any deep analyses.
Image Credit:
Associated Press Photo
NASA
When doing any analytics exercise, first ask yourself what data you would ideally need to draw correct conclusions.
Then look at what data you have available, and see if there are any issues with the way the secondary data was gathered.
In general, randomness is good. Having data from a random group helps you generalize findings to the population at large.
Having biased or incomplete data is usually bad. Of course, your results might still hold for similar scenarios: if your data was biased in that it was mostly about men, for instance, your findings might still generalize to all men.
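A small simulation can illustrate why random sampling generalizes and biased sampling does not. All numbers here are hypothetical: a population is split into two subgroups with different averages, and we compare the estimate from a random sample against the estimate from a sample drawn from one subgroup only.

```python
import random

random.seed(1)

# Hypothetical population of 100,000 measurements, where the average
# differs between two subgroups (all parameters invented for illustration).
population = [("men", random.gauss(170, 7)) for _ in range(50_000)] + \
             [("women", random.gauss(158, 7)) for _ in range(50_000)]

true_mean = sum(v for _, v in population) / len(population)

# A random sample estimates the population mean well.
random_sample = random.sample(population, 2_000)
random_est = sum(v for _, v in random_sample) / len(random_sample)

# A biased sample (one subgroup only) estimates only that subgroup.
biased_sample = [v for g, v in population if g == "men"][:2_000]
biased_est = sum(biased_sample) / len(biased_sample)

print(f"population mean:        {true_mean:.1f}")
print(f"random-sample estimate: {random_est:.1f}")
print(f"men-only estimate:      {biased_est:.1f}")
```

The biased estimate is still a perfectly good estimate of the men-only subgroup, which is exactly the earlier point: biased data may generalize to the population it resembles, but not beyond it.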
Your responsibility is to ensure that the data is correct, complete, and representative of the data on which any learned model will be applied. If you feel your data is not representative of the population you intend to use the model for, you have two options.
First, admit defeat. Often it is better to do this than to analyze bad data. Of course, in such cases, your best bet is to go and find the actual data you think you need!
Second, do the analysis, but very carefully explain the limitations posed by the data problems. The recipient of your analysis will appreciate the honesty, and possibly save millions of dollars for the business in the process.