In the era of big data, one piece of analysis that is frequently overlooked is the problem of finding patterns when there are actually no apparent patterns. In statistics this is referred to as Type I error. As scientists, we are always on the lookout for a new or interesting breakthrough that could explain a phenomenon. We hope to see a pattern in our data that explains something or that can give us an answer. The primary goal of hypothesis testing is to limit Type I error. This is accomplished by using small α values. For example, a α value of 0.05 states that there is a 1 in 20 chance that the test will show that there is something significant when in actuality there isn’t. This problem compounds when testing multiple hypotheses. When running multiple hypothesis tests, we are likely to encounter Type I error. As more data becomes available for analysis, Type I error needs to be controlled. One of my projects required testing the difference between the means of two microarray data samples. Microarray data contains thousands of measurements but is limited in the number of observations. A common analysis approach is to measure the same genes under different conditions. If there is a significant enough difference in the amount of gene expression between the two samples, we can say that the gene is correlated with a particular phenotype. One way to do this is to take the mean of each phenotype for a particular
gene and formulate a hypothesis to test whether there is a significant difference between the means. Given that we were running thousands of these tests at α = 0.05, we found several differences that were significant. The problem was that some of these could be caused by random chance.
Many corrections exist to control for false indications of significance. The Bonferroni correction is one of the most conservative. This calculation lowers the level below which you will reject the null hypothesis (your p value). The formula is
alpha/n, where n equals the
number of hypothesis tests that you are running. Thus, if you were to run 1,000 tests of significance at α = 0.05, your
p value should be less than
0.00005 (0.05/1,000) to reject the null hypothesis. This is obviously a much more stringent value. A large number of the previously significant values were no longer significant, revealing the true relationships within the data. The corrected significance gave us confidence that the observed expression levels were due to differences in the cellular gene expression rather than noise. We were able to use this information to begin investigating what proteins and pathways were active in the genes expressing the phenotype of interest. By solidifying our understanding of the causal relationships, we focused our research on the areas that could lead to new discoveries about gene function and, ultimately to improved medical treatments.
Reason and common sense are foundational to Data Science. Without it, data is simply a collection of bits. Context, inferences and models are created by humans and carry with them biases and assumptions. Blindly trusting your analyses is a dangerous thing that can lead to erroneous conclusions. When you approach an analytic challenge, you should always pause to ask yourself the following questions:
› What problem are we trying to solve? Articulate the answer
as a sentence, especially when communicating with the end- user. Make sure that it sounds like an answer. For example, “Given a fixed amount of human capital, deploying people with these priorities will generate the best return on their time.”
› Does the approach make sense?
Write out your analytic plan. Embrace the discipline of writing, as it brings structure to your thinking. Back of the envelope calculations are an existence proof of your approach. Without this kind of preparation, computers are power tools that can produce lots of bad answers really fast.
› Does the answer make sense?
Can you explain the answer? Computers, unlike children, do what they are told. Make sure you spoke to it clearly by validating that the instructions you provided are the ones you intended. Document your assumptions and make sure they have not introduced bias in your work.
› Is it a finding or a mistake?
Be skeptical of surprise findings. Experience says that it if seems wrong, it probably is wrong. Before you accept that conclusion, however, make sure you understand
› Does the analysis address the original intent? Make sure
that you are not aligning the answer with the expectations of the client. Always speak the truth, but remember that answers of “your baby is ugly” require more, not less, analysis.
› Is the story complete? The goal
of your analysis is to tell an actionable story. You cannot rely on the audience to stitch the pieces together. Identify potential holes in your story and fill them to avoid surprises. Grammar, spelling and graphics matter; your audience will lose confidence in your analysis if your results look sloppy.
› Where would we head next?
No analysis is every finished, you just run out of resources. Understand and explain what additional measures could be taken if more resources are found.
» Tips From the Pros
Better a short pencil than a long memory. End every day by documenting where you are; you may learn something along the way. Document what you learned and why you changed your plan.
» Tips From the Pros
Test your answers with a friendly audience to make sure your findings hold water. Red teams save careers.