We take a moment to comment that evaluating a variable set for predictivity, what we have called here VSA, is different from evaluating a given classifier, which is the prediction stage, usually following or in conjunction with VS. The latter considers evaluating $f(X)$, a specific function applied to a particular set of explanatory variables $X$, for a given outcome variable $Y$, whereas the former considers the potential predictivity of the set of explanatory variables $X$ for that outcome $Y$ over all possible $f$. Our work here focuses simply on VSA. Variable sets assessed as highly predictive in our framework can then be used flexibly in various models for prediction purposes, as the researcher pleases.
We are now in an odd situation: we have identified variable sets that could not have been found using conventional approaches, and yet we wish to evaluate the predictivity of these variable sets against those conventional approaches. Nevertheless, we endeavor to do so. A couple of options arise for comparison: the training prediction rate and the out-of-sample testing prediction rate. We will show that the I-score-based measure provides a useful and meaningful estimated lower bound to the correct prediction rate and correlates well with the out-of-sample test rate, whereas the training rate statistic, the sample analog of $\theta_c$, does not. As such, our approach has an important benefit for prediction research: Compared with methods such as cross-validation of error rates, the I-score is efficient in its use of sample data, in the sense that it uses all observations instead of separating the data into testing and training sets.
Simulations.
We offer simulations to illustrate how (i) the I-score can serve as a lower bound to the true predictivity of a given variable set even as noisy variables are adjoined, (ii) it can thereby serve as a screening mechanism, and (iii) finding the maximum I-score when conducting a BDA leads to finding the variable set with the highest corresponding level of predictivity. BDA reduces a variable set one variable at a time, eliminating the weakest element until the I-score reaches a peak.
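To make the backward dropping step concrete, the following is a minimal sketch in Python. It assumes an `i_score(X, y, variables)` function (one possible form is sketched later in this section) that scores a candidate variable subset on discretized predictors `X` and a binary case/control outcome `y`; it illustrates the idea of dropping the weakest element until the score peaks, rather than reproducing the exact implementation used in the paper.

```python
def backward_dropping(X, y, start_vars, i_score):
    """Sketch of a backward dropping run: repeatedly remove the single
    variable whose removal raises the I-score the most, stopping once
    no single removal improves the score (i.e., I has peaked)."""
    current = list(start_vars)
    best_score = i_score(X, y, current)
    path = [(tuple(current), best_score)]
    while len(current) > 1:
        # Score every subset obtained by removing one variable.
        candidates = [(i_score(X, y, [v for v in current if v != drop]), drop)
                      for drop in current]
        score, drop = max(candidates, key=lambda t: t[0])
        if score <= best_score:   # no removal improves I: we are at the peak
            break
        current.remove(drop)      # eliminate the weakest element
        best_score = score
        path.append((tuple(current), best_score))
    return current, best_score, path
```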
We consider a module of three important variables ($X_1$, $X_2$, and $X_3$; see Fig. 2 for the disease model used) among six unimportant variables ($X_4, \ldots, X_9$), using sample sizes of 250 cases/250 controls, 500 cases/500 controls, and 1,000 cases/1,000 controls. (See Simulation Details for more detailed model settings and simulation details.) We demonstrated that the I-score estimates a quantity related to an asymptotic lower bound (Eq. 6) for the correct prediction rate $\theta_c$ as the sample size $n \to \infty$. It would be helpful to see how the I-score performs at fixed, reasonable sample sizes. To illustrate this, we compare the I-score-derived predictivity lower bound against the Bayes' theoretical prediction rate in our simulations. The out-of-sample correct prediction rate is presented in the simulations as a further benchmark against which the I-score can be compared when data are limited, as is the case in real-world applications. The out-of-sample correct prediction rate is derived from the most optimistic context achievable in the real world, whereby future testing data are infinite. In all of the simulations, the I-score of a set of influential variables drops when a noisy variable is added. This drop is subsequently seen in the I-score-derived bound for the correct prediction rate. The I-score can thus screen out noisy variables, which makes it useful in practical data applications.
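For readers who wish to experiment, a minimal sketch of one commonly used partition-based form of the I-score follows: cells are the observed joint values of the chosen variables, and cells holding more observations carry a squared cell-count weight. The normalization below is one plausible choice and may differ in detail from the score as defined in the paper.

```python
import numpy as np

def i_score(X, y, variables):
    """Partition-based I-score sketch for a binary outcome y (0/1).
    Observations are grouped into cells by their joint values on
    `variables`; each cell contributes its squared size times the squared
    deviation of its outcome mean from the overall mean."""
    X = np.asarray(X)
    y = np.asarray(y, dtype=float)
    y_bar = y.mean()
    cells = {}
    for i in range(len(y)):
        cells.setdefault(tuple(X[i, variables]), []).append(i)
    numerator = sum(len(idx) ** 2 * (y[idx].mean() - y_bar) ** 2
                    for idx in cells.values())
    denominator = np.sum((y - y_bar) ** 2)   # total variation of the outcome
    return numerator / denominator
```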
To illustrate how these statistics fare in accurately capturing the level of predictivity of each variable set under consideration, we consider their performance given that $X_1$ and $X_2$ have already been found to be important. We then add $X_3$, which should ideally correspond with an increase in the statistic. We continue adding the remaining noisy variables one at a time to this "good" set of variables and observe how the statistics evaluate the new, larger set of variables for predictivity. In Fig. 3, violin plots show distributions of the training rate, the I-score lower bound, and the ideal out-of-sample prediction rate under each setting across the simulations. The theoretical Bayes' rate is also plotted as a reference; it remains flat when noisy variables are added because the Bayes' rate is defined purely by the partition formed from the informative variables and does not change when noisy variables ($X_4, \ldots, X_9$) are adjoined and finer partitions are created.
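The nested evaluation just described can be mimicked with a short loop: start from the two influential variables already found, add the third, then adjoin the noisy variables one at a time and score each successively larger set. The column indices below are hypothetical stand-ins for $X_1, \ldots, X_9$, not the paper's actual simulated data.

```python
def score_path(X, y, i_score, influential=(0, 1, 2), noisy=(3, 4, 5, 6, 7, 8)):
    """Score the nested sets {X1,X2}, {X1,X2,X3}, {X1,X2,X3,X4}, ... and
    return the sets together with their I-scores."""
    current = list(influential[:2])              # X1 and X2 already found
    sets, scores = [tuple(current)], [i_score(X, y, current)]
    for extra in list(influential[2:]) + list(noisy):
        current.append(extra)                    # add X3, then noisy variables
        sets.append(tuple(current))
        scores.append(i_score(X, y, current))
    return sets, scores
```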
Several patterns emerge in these simulations. First, and most importantly, the I-score-derived prediction rate appears to be a reasonable lower bound to the Bayes' rate. This holds even at moderate sample sizes.
The second pattern is that the estimated I-score lower bound peaks at the variable set that includes all influential variables ($X_1$, $X_2$, and $X_3$) and no additional noisy variables. This is a characteristic of the out-of-sample correct prediction rate as well. For instance, consider the top row of Fig. 3 and start from the right of the x axis in each of the three plots, with the largest set of variables inclusive of both influential and noisy variables ($X_1, \ldots, X_9$); continual removal of the noisy variables (sliding to the left of the x axis) until we reach the variable set ($X_1$, $X_2$, $X_3$) results in higher predictivity as measured by the I-score lower bound. We can note that the I-score lower bounds drop upon further removing the influential $X_3$ variable from the set ($X_1$, $X_2$, $X_3$). Thus, the variable set with the maximum I-score-derived lower bound here both identifies the largest possible set of influential variables with no noisy variables and also provides a conservative lower bound on the correct prediction rate for that variable set. We note that once we have found the variable sets with the highest I-scores and calculated the corresponding lower bound of the correct prediction rate, we can adjust this lower bound for its bias to derive an improved estimate of the correct prediction rate.
A third pattern is that the training rate suffers from overfitting when noisy variables are adjoined, even when the variable set includes a truly influential subset of variables. If the variable set is irreducible, however, the training rate estimator reflects the Bayes' correct prediction rate well; thus, it can perform reasonably well conditional on having already identified ($X_1$, $X_2$, $X_3$). It cannot be used to screen down to that variable set in the first place, however.
Finally, and as we might expect, the training set rate explodes due to overfitting in high dimensions as noisy variables are adjoined to the partition formed by the informative variables ($X_1$, $X_2$, $X_3$). Although the training set prediction rate seems to improve as the sample size increases, it cannot be used to screen out noisy variables and is therefore difficult to use as a statistic for selecting highly predictive variable sets. The predictivity rates found through this statistic also depart dramatically from the out-of-sample testing rate: it evaluates variable sets ever more optimistically for future prediction even as noisy variables are added, in stark contrast to the out-of-sample prediction rate, which decreases with the addition of useless variables. We also notice that the I-score prediction rate does not remain flat: it increases when a noisy variable is removed and the set is reduced to only influential variables. This indicates an additional advantage of the I-score as a lower bound; the I-score prefers a simpler model even when the Bayes' rate remains the same, selecting more parsimonious partitions that attain the Bayes' rate and thereby reflecting the out-of-sample prediction rate more closely.
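To see mechanically why the training rate behaves this way, consider one plausible partition-based rule, a majority vote within each cell; the sketch below is illustrative and not necessarily the exact rule used in the paper. On training data every cell is populated by construction, so refining the partition can only raise the training rate, whereas on held-out data previously unseen cells force an uninformed guess.

```python
import numpy as np

def training_rate(X, y, variables):
    """Training correct prediction rate of the within-cell majority rule
    (a sample analog of the correct prediction rate).  Splitting a cell can
    never misclassify more training points, so finer partitions only help."""
    X, y = np.asarray(X), np.asarray(y)
    cells = {}
    for i in range(len(y)):
        cells.setdefault(tuple(X[i, variables]), []).append(int(y[i]))
    correct = sum(max(sum(v), len(v) - sum(v)) for v in cells.values())
    return correct / len(y)

def out_of_sample_rate(X_tr, y_tr, X_te, y_te, variables):
    """Rate of the same rule on held-out data; cells never seen in training
    fall back to a coin flip (expected accuracy 0.5), so adjoining noisy
    variables, which multiply the number of cells, lowers this rate."""
    X_tr, y_tr = np.asarray(X_tr), np.asarray(y_tr)
    X_te, y_te = np.asarray(X_te), np.asarray(y_te)
    votes = {}
    for i in range(len(y_tr)):
        votes.setdefault(tuple(X_tr[i, variables]), []).append(int(y_tr[i]))
    rule = {k: int(sum(v) * 2 >= len(v)) for k, v in votes.items()}
    hits = 0.0
    for i in range(len(y_te)):
        key = tuple(X_te[i, variables])
        hits += float(rule[key] == int(y_te[i])) if key in rule else 0.5
    return hits / len(y_te)
```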
Recall that the correct prediction rate is based on an absolute difference of probabilities summed over all partition cells $j$. Suppose we start with influential variables only, with correct prediction rate $\theta_c$, the highest we can attain out of all possible variable sets. Adding noisy variables to this set, variables that add no signal but simply create a finer partition, still returns $\theta_c$. When estimating the correct prediction rate from sample data, though, the training estimate of this value generally keeps increasing as noisy variables are added; the researcher does not know when to stop the search for influential variables, which makes selecting highly predictive variables difficult. Ideally, we would like to "punish" the addition of such noisy variables to our variable set, so a measure that balances favoring coarser partitions against recognizing genuinely new variables with strong enough signals (non-noisy variables) is important. The I-score appears to achieve this balance, preferring coarser partitions unless an additional variable (and therefore a finer partition) provides enough signal in the data to justify keeping it.
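Concretely, assuming equal numbers of cases ($D$) and controls ($U$), as in the simulations here, the correct prediction rate attainable from a partition with cells $j$ can be written
\[
\theta_c \;=\; \sum_j \max\Big\{\tfrac{1}{2}\,P(j \mid D),\ \tfrac{1}{2}\,P(j \mid U)\Big\}
\;=\; \tfrac{1}{2} \;+\; \tfrac{1}{4}\sum_j \bigl|\,P(j \mid D) - P(j \mid U)\,\bigr|.
\]
If a noisy variable, independent of disease status within each cell, splits a cell $j$ into sub-cells $j'$, then $P(j' \mid D)$ and $P(j' \mid U)$ shrink by the same factor, so $\sum_{j' \subset j} |P(j' \mid D) - P(j' \mid U)| = |P(j \mid D) - P(j \mid U)|$ and $\theta_c$ is unchanged; only the sample-based training estimate inflates.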
Noisy variables in sample data may be truly noisy variables or influential variables whose signals are too weak to detect at the given sample size. Thus, there are cases in which the I-score may fail to recognize such variables because their signals would require unrealistically large sample sizes to be found through the measure. An example would be a good predictor that is highly complex (perhaps a combination of very many variables), with observations that are sparse across the resulting partition. Because the I-score places greater weight where the data tend to appear (note the squared cell-count weight in the score), a partition in which most cells contain no observations, or at most one, can often look like noise.
The main draw of the I-score is its ability to screen for influential variable sets. Variable sets consisting of the three influential variables ($X_1$, $X_2$, and $X_3$) alone display the highest I-scores. Searching for variable sets with the highest I-scores thus tends to return only highly influential variables. Using the training prediction rate as a guiding measure for screening, however, would lead one to seek ever-larger variable sets, regardless of whether they include noisy variables.