I cannot recommend this paper enough. It should be essential reading for anyone involved in data analysis.
Shmueli distinguishes two different aims when analysing data: prediction and explanation. She describes in detail how the modelling and analysis process should differ depending on which of the two you are pursuing. She even shows a concrete example where the model that works best for prediction is different to the model that works best for explanation. This was a key insight for me. Previously I had assumed the intuitively appealing idea that the best model for one will also be the best for the other. I’m glad to have this corrected. I see that assumption advanced all the time, and now I know for sure that it’s false.
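A toy simulation makes this plausible. The sketch below is not Shmueli’s example; it is a minimal illustration under assumed settings. The data come from a linear model with two highly correlated predictors, the second having only a weak effect. The function names `simulate` and `expected_mse` and every parameter value (sample size, coefficients, correlation, noise level) are my own choices for illustration. It compares the correctly specified model with an underspecified one that drops the weak predictor, scoring each by its expected squared error on a fresh observation.

```python
import random


def expected_mse(c1, c2, b1, b2, rho, sigma):
    """Expected squared error on a fresh observation for fitted coefficients (c1, c2).

    Uses Var(x1) = Var(x2) = 1 and Cov(x1, x2) = rho, which hold by construction.
    """
    d1, d2 = c1 - b1, c2 - b2
    return sigma**2 + d1**2 + 2 * rho * d1 * d2 + d2**2


def simulate(n_reps=2000, n=20, b1=1.0, b2=0.1, rho=0.9, sigma=1.0, seed=0):
    """Average out-of-sample error of the true and the underspecified model.

    All default values are illustrative assumptions, not taken from any paper.
    """
    rng = random.Random(seed)
    s = (1 - rho**2) ** 0.5
    full_total = reduced_total = 0.0
    for _ in range(n_reps):
        # Small training set: x2 is strongly correlated with x1, weak effect b2.
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        x2 = [rho * a + s * rng.gauss(0, 1) for a in x1]
        y = [b1 * a + b2 * b + rng.gauss(0, sigma) for a, b in zip(x1, x2)]

        s11 = sum(a * a for a in x1)
        s22 = sum(b * b for b in x2)
        s12 = sum(a * b for a, b in zip(x1, x2))
        s1y = sum(a * c for a, c in zip(x1, y))
        s2y = sum(b * c for b, c in zip(x2, y))

        # True model (x1 and x2): least squares via the 2x2 normal equations.
        det = s11 * s22 - s12 * s12
        f1 = (s22 * s1y - s12 * s2y) / det
        f2 = (s11 * s2y - s12 * s1y) / det

        # Underspecified model (x1 only): the coefficient on x2 is fixed at 0.
        r1 = s1y / s11

        full_total += expected_mse(f1, f2, b1, b2, rho, sigma)
        reduced_total += expected_mse(r1, 0.0, b1, b2, rho, sigma)
    return full_total / n_reps, reduced_total / n_reps


if __name__ == "__main__":
    full_mse, reduced_mse = simulate()
    print(f"true model (x1, x2): {full_mse:.3f}")
    print(f"underspecified (x1): {reduced_mse:.3f}")
```

Under these assumptions the underspecified model should come out ahead on prediction: dropping the weak predictor adds a little bias but removes more estimation variance. For explanation the ranking flips, because the underspecified model’s coefficient on `x1` absorbs part of the effect of `x2` and so is a biased estimate of `b1`. With a larger sample or a stronger second effect, the true model would be expected to win on both counts.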
Another key message from Shmueli is that even though our primary aim will be either prediction or explanation, we should, if possible, assess our models on both criteria. We would expect good models to perform reasonably well in either setting, and it will usually be insightful to assess both.
Bin Yu gave a talk earlier this week on ‘mind-reading’, showcasing her group’s work on reconstructing movies from brain signal measurements. In one step of their modelling process, they make an explicit trade-off between ‘explainability’ and ‘predictability’. Specifically, they chose a model that was easier to interpret at the expense of a bit of predictive performance. This is the first time I’ve seen anyone do this explicitly. It reminds me of the bias-variance trade-off and speaks directly to the ideas in Shmueli’s paper.