Howard Bondell
In today’s world where technology runs rampant, as statisticians we are well aware of the influx of potential variables that we are able to measure. Due to this data explosion, methods for variable selection have received a great deal of attention in the past few decades. While variable selection is not a new topic, it differs greatly in scale from what was formerly the norm. Traditionally, there were a few, or perhaps, if luck would have it, dozens of variables to sort through. Now we have hundreds, thousands, or even more. Although a vast number of new methods have been developed to handle this larger scale, the way that we as statisticians evaluate these methods often reverts to the more traditional.
Of course, upon proposing a new method, we would like to compare it to established approaches via simulation, for which the true data generating mechanism is known, unlike in reality. Such simulations are necessary to examine the properties of the proposal. Data are generated from a model, usually with only a handful of predictors being truly relevant and the remainder simply noise. A large number of datasets are generated, and the methods are then compared on how often each is able to identify the exact set of predictors that appears in the true generating model. This is the traditional method of comparison.
However, in the high-dimensional world, does anyone really expect any method to recover the exact correct subset of predictors, with no mistakes? In reality, it is not even clear what this means. I would argue that this traditional measure for comparing selection methods is no longer very useful. Unless the designed simulation is excessively simple, it seems unlikely in the high-dimensional case that any method would select the exact subset of correct predictors even once. In any somewhat realistic setting, every method would score exactly zero on proportion correct!
So how should we compare methods?
In many cases, the desired output from a selection method is not the presentation of what is thought to be the “correct model,” but an ordering of the variables that can be presented for further investigation. Often, the size of the desired subset is dictated not by some statistical stopping rule, but by the resources that are available for this further investigation.
In line with this goal, comparison of variable lists is potentially more relevant, and surely more informative. Consider, for simplicity, the sequence of variable subsets generated by a method such as forward selection. For forward selection, these are nested subsets, but they need not be; we can simply order the subsets by increasing complexity. A variable selection method can typically be viewed in two stages. First, the approach provides this sequence of variable subsets, and second, it provides a stopping rule to pick out the subset to report.
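As a minimal sketch of the first stage, the following Python snippet simulates a sparse linear model and runs a simple greedy forward selection to produce an ordering of the variables. The data dimensions, signal strength, and the `forward_order` helper are all illustrative assumptions, not part of any particular published method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative simulation: 5 truly relevant predictors out of 100
n, p, k = 200, 100, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k] = 3.0                      # strong signal on the first k variables
y = X @ beta + rng.standard_normal(n)

def forward_order(X, y):
    """Greedy forward selection: order variables by correlation with the
    current residual, refitting least squares after each addition."""
    remaining = list(range(X.shape[1]))
    order, resid = [], y.copy()
    while remaining:
        scores = [abs(X[:, j] @ resid) for j in remaining]
        j = remaining.pop(int(np.argmax(scores)))
        order.append(j)
        Xs = X[:, order]
        coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        resid = y - Xs @ coef
    return order

order = forward_order(X, y)
# The nested subsets are simply the prefixes order[:1], order[:2], ...
```

The full sequence of nested subsets, rather than any single stopping point, is the object the first stage delivers.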
Traditional evaluation combines these two stages and evaluates only the final model. This is a shame, as it entangles the two different parts. Instead, we should evaluate the first stage on its own. This can be accomplished directly via Receiver Operating Characteristic curves or, perhaps more relevant to the high-dimensional setting, Precision-Recall (or False Discovery) curves.
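To make this concrete, here is a hedged sketch of how two methods might be compared by their orderings rather than by a single final model: at each prefix of an ordering we record precision (fraction of selected variables that are truly relevant) and recall (fraction of truly relevant variables selected). The orderings `method_a` and `method_b`, the true support, and the `pr_curve` helper are all hypothetical, for illustration only.

```python
def pr_curve(order, true_support):
    """Precision and recall after each prefix of a variable ordering."""
    selected = set()
    precision, recall = [], []
    for m, j in enumerate(order, start=1):
        selected.add(j)
        tp = len(selected & true_support)
        precision.append(tp / m)
        recall.append(tp / len(true_support))
    return precision, recall

truth = {0, 1, 2}                 # indices of the truly relevant variables
method_a = [0, 1, 5, 2, 7, 9]     # hypothetical ordering from method A
method_b = [5, 0, 7, 1, 9, 2]     # hypothetical ordering from method B

prec_a, rec_a = pr_curve(method_a, truth)
prec_b, rec_b = pr_curve(method_b, truth)
# Method A reaches full recall after 4 variables; method B needs all 6,
# so A's precision-recall curve dominates even though neither method
# ever reports the exact correct subset.
```

Plotting precision against recall along each path gives the curves described above; a method whose curve sits higher finds the relevant variables earlier in its list.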
Comments/Discussion
This article makes a great deal of sense and I'm interested in getting more details. Can you give some references where I can read more about Precision-Recall and False Discovery curves?
All the best,
Butch Tsiatis
Here is a reference that discusses some of the differences between the Precision-Recall Curve and the ROC Curve.
Davis, J. and Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd International Conference on Machine Learning, ICML '06, 233-240.
There is a version available on the author's website http://pages.cs.wisc.edu/~jdavis/davisgoadrichcamera2.pdf.