Framework for making better predictions by directly estimating variables’ predictivity

Contributed by Herman Chernoff, October 13, 2016 (sent for review June 4, 2016; reviewed by David L. Banks and Ming Yuan)
November 29, 2016
113 (50) 14277-14282

Significance

Good prediction, especially in the context of big data, is important. Common approaches to prediction include using a significance-based criterion for evaluating variables to use in models and evaluating variables and models simultaneously for prediction using cross-validation or independent test data. The first approach can lead to choosing less-predictive variables, because significance does not imply predictivity. The second approach can be improved through considering a variable’s predictivity as a parameter to be estimated. The literature currently lacks measures that do this. We suggest a measure that evaluates variables’ abilities to predict, the I-score. The I-score is effective in differentiating between noisy and predictive variables in big data and can be related to a lower bound for the correct prediction rate.

Abstract

We propose approaching prediction from a framework grounded in the theoretical correct prediction rate of a variable set as a parameter of interest. This framework allows us to define a measure of predictivity that enables assessing variable sets for, preferably high, predictivity. We first define the prediction rate for a variable set and consider, and ultimately reject, the naive estimator, a statistic based on the observed sample data, due to its inflated bias for moderate sample sizes and its sensitivity to noisy, useless variables. We demonstrate that the I-score of the partition retention (PR) method of variable selection (VS) yields a relatively unbiased estimate of a parameter that is not sensitive to noisy variables and is a lower bound to the parameter of interest. Thus, the PR method using the I-score provides an effective approach to selecting highly predictive variables. We offer simulations and an application of the I-score on real data to demonstrate the statistic's predictive performance on sample data. We conjecture that using partition retention and the I-score can aid in finding variable sets with promising prediction rates; however, further research into sample-based measures of predictivity is much desired.
Prediction is a highly important goal for many scientists and has become increasingly difficult as the quantity and complexity of available data have grown. Complex and high-dimensional data particularly demand attention. However, the literature on prediction does not yet have a clear theoretical framework that allows for characterizing a variable’s predictivity directly [see A Brief Literature Review on VS for a brief review on the literature of variable selection (VS)]. Rather, VS for variable sets in the context of prediction is currently conducted in two common ways. The first is VS through identification of variables correlated with the outcome, measured through tests of statistical significance—such as the chi-square test. The second is through VS of variables that seem to do well in an independent set of test data, as measured through testing sample error rates. The first approach is still very much in use for predicting health outcomes (see ref. 1, among others) but its prediction performance has been disappointing (e.g., refs. 1 and 2). We show in our related work (3) how and why the popular filter approach of VS through statistical significance does not serve the purpose of prediction well. For an intuitive illustration of the relationship between predictive and significant sets of variables, see Fig. 1. Under a significance-test-based search setting, the set of variables found to be significant expands as the sample size grows (Fig. 1, widening orange dotted ovals). However, the set of predictive variables (Fig. 1, blue circle) is not susceptible to sample-size changes in the same way—because predictivity is a population parameter—and overlaps, but is not perfectly aligned with, significant sets. It is easy to see that in this scenario targeting significant sets may miss the goal of prediction entirely. Instead, we suggest that emphasis must be placed on designing measures that directly evaluate variable sets’ predictivity.
Fig. 1.
Illustration of the relationship between predictive and significant variable sets. The rectangular space denotes all candidate variable sets. Significant sets are identified through traditional significance tests.
We show in ref. 3 that the first approach suffers from the problem that significant variables are not necessarily predictive, and vice versa, so targeting significant variables might miss the goal of VS for higher predictivity. This problem is prevalent in simple as well as complex data. The second approach sets aside testing (or validation) data to see how well selected predictors might do on "new data." However, as in the case of genome-wide association study (GWAS) data, researchers frequently lack sample sizes large enough for this approach to be efficient. Reuse of training data in the form of cross-validation is often adopted in practice.
We suggest that an alternative, and perhaps logical, approach to prediction should start with defining the theoretical prediction rate of a set of variables as a parameter of interest. It would be productive then to create measures designed to directly measure such a parameter, rather than relying on the estimated prediction rate by cross-validation. We call such an approach "variable set assessment," or VSA. We hope that designing measures that directly estimate a variable set's true ability to predict may prove to be both fruitful and efficient in the use of sample data for good prediction. Here, we propose such a prediction-based framework. Grounded in statistical theory, we highlight an avenue of research toward creating sensible measures that target highly predictive variable sets through assessing their predictivity directly. We emphasize genetic data, although we will show that the methods proposed are easily tailored to other high-dimensional data in the natural and social sciences.

A Brief Literature Review on VS

A related and extremely important literature is that of VS or feature selection, which refers to the practice of selecting a subset of an original group of variables that is later used to construct a model. Often VS is used on data of large dimensionality with modest sample sizes (7). In the context of high-dimensional data, such as GWAS, this dimensionality reduction can be a crucial step. VS approaches are commonly proposed to efficiently search for the best variable sets according to a specified criterion. Most performance measures are developed to maximize the probability of selecting the truly important variables but are not direct measures of predictivity. Therefore, popular VS approaches do not return reliable assessment of the predictivity of variable sets. In contrast, we will propose considering VSA through a reliable, model-free measure used to assess the potential predictivity of a variable set. Unlike projection- or compression-based approaches (such as principal component analysis or use of information theory), VSA methods do not change the variables themselves.
The types of approaches and tools developed for feature selection are both diverse and varying in degrees of complexity. However, there is general agreement that three broad categories of feature selection methods exist: filter, wrapper, and embedded methods. Filter approaches tend to select variables through ranking them by various measures (correlation coefficients, entropy, information gains, chi-square, etc.). Wrapper methods use “black box” learning machines to ascertain the predictivity of groups of variables; because wrapper methods often involve retraining prediction models for different variable sets considered, they can be computationally intensive. Embedded techniques search for optimal sets of variables via a built-in classifier construction. A popular example of an embedded approach is the LASSO method for constructing a linear model, which penalizes the regression coefficients, shrinking many to zero. Often cross-validation is used to evaluate the prediction rates.
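As a brief illustration of the embedded approach just mentioned, the following is a minimal sketch (not from the paper) using scikit-learn's LassoCV on synthetic data; the variable counts, noise level, and which columns carry signal are arbitrary choices made only for the example.

```python
# Minimal sketch: LASSO as an embedded variable-selection method, with
# cross-validation used to pick the penalty strength (illustrative data only).
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 200, 50                      # modest sample, many candidate variables
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=n)  # only columns 0 and 3 matter

model = LassoCV(cv=5).fit(X, y)     # penalty chosen by 5-fold cross-validation
selected = np.flatnonzero(model.coef_ != 0)
print("variables with nonzero coefficients:", selected)
```

The shrinkage penalty, not a direct estimate of predictivity, decides which variables survive; cross-validation then evaluates the fitted models.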
Often, though not always, the goal of these approaches is statistical inference. When this is the case, the researcher might be interested in understanding the mechanism relating the explanatory variables to a response. Although inference is clearly important, prediction is an important objective as well. In this case, the goal of these VS approaches is to infer the membership of variables in the "important set." Various numerical criteria have been proposed to identify such variables [e.g., Akaike information criterion (AIC) and Bayesian information criterion (BIC), among others; see chapter 7 in ref. 8 for a review], which are associated with predictive performance under the model assumptions made in deriving these criteria. However, these criteria were not designed to specifically correlate with predictivity. Indeed, we are unaware of a measure that directly attempts to evaluate a variable set's theoretical level of predictivity. This paper proposes a model-free parameter for predictivity and its sample estimate. For a more comprehensive survey of the feature/VS literature see, among others, refs. 7, 9, 10, and 11.
Although a spectrum of VS approaches exists, many scientists have taken the approach of tackling prediction through the use of important and hard-to-discover influential variables found to be statistically significant in previous studies. When these efforts are in the context of high-dimensional data and alongside work investigating variables known to be influential, it might seem reasonable to hope that variables found to be significant can prove useful for predictive purposes as well. This approach is in some ways most similar to a univariate filter method, because it is independent of the classifier and has no cross-validation or prediction step for VS. We show in our related work (3) how and why the popular filter approach of VS through statistical significance does not serve the purpose of prediction well. For an intuitive illustration of the relationship between predictive and significant sets of variables, see Fig. 1. In a significance-test-based search for variable sets, the set of variables found to be significant expands as the sample size grows (Fig. 1, widening orange dotted ovals). However, the set of predictive variables (Fig. 1, blue circle) is not susceptible to sample-size changes in the same way, because predictivity is a population parameter, and overlaps, but is not perfectly aligned with, significant sets. It is easy to see that in this scenario targeting significant sets may miss the goal of prediction entirely. Instead, we suggest that emphasis must be placed on designing measures that directly evaluate variable sets' predictivity.
Many methods also use out-of-sample testing error rates or cross-validation to ascertain whether prediction is done well. This approach was not designed to specifically find a theoretically correct prediction rate for a given variable set; rather, it is simply a performance evaluation of future predictions from a pattern recognition technique on selected variable sets (trained on training data). Sometimes the variable sets in the training data are selected through statistics such as the adjusted R squared, AIC, or BIC. However, when $p$ is comparable to $n$ (or even in instances where $p > n$), a standard situation in big data, these statistics can fail to be useful. Again, these criteria were not designed to be directly correlated with a given variable set's predictivity. Using out-of-sample testing and/or cross-validation additionally requires either setting aside valuable sample data (to make sure the variable sets selected on the training set are indeed highly predictive and not just overfitting the data) or is often computationally burdensome. It becomes important, then, to have a good screening mechanism when conducting VSA for removing noisy variables (and thus finding predictive ones), even with constrained amounts of sample data. We show in our simulations how poorly VSA for prediction can do when based on training-set prediction rates compared with out-of-sample testing prediction rates (with "infinite" future testing data, a mostly unattainable but ideal scenario). An ideal measure for predictivity (or a good VSA measure) reflects a variable set's predictivity. In doing so, it would also guide VSA by screening out noisy variables and should correlate well with the out-of-sample correct prediction rate. We present a potential candidate measure, the I-score, for evaluating the predictivity of a given variable set in this paper.

Toy Example

To highlight some of our key issues, consider a small artificial example. Suppose an observed variable $Y$ is defined as
$$Y = \begin{cases} X_1 + X_2 \ (\text{modulo } 2) & \text{with prob. } 1/2,\\ X_2 + X_3 + X_4 \ (\text{modulo } 2) & \text{with prob. } 1/2,\end{cases}$$
[1]
where $X_1, X_2, X_3$, and $X_4$ are 4 of 50 observed and potentially influential variables $\{X_i;\ 1 \le i \le 50\}$. Each $X_i$ can take values 0 and 1. A collection of discrete variables $S$ may be regarded as a discrete variable that takes on a finite number of values. Each value defined by $S$ constitutes a cell. The collection of all cells forms a partition, $\Pi_S$, based on the discrete variables in $S$. We also assume that the $X_i$ were selected independently to be 1 with probability 0.5, again the simplest case without affecting the general results. Clearly, none of the individual $X_i$ has a marginal effect on $Y$.
Scenario I. A statistician knows the model and wishes to compute which variable sets are predictive of $Y$, and how predictive, when $\mathbf{X} = (X_1, X_2, \ldots, X_{50})$ is given. Because $Y$ depends only on the first four $X$ variables, it is obvious there are two clusters of variable sets, $S_1 = \{X_1, X_2\}$ and $S_2 = \{X_2, X_3, X_4\}$, that are potentially useful in his prediction. We treat the highest correct prediction rate possible for a given variable set as an important parameter and call this predictivity ($\theta_c$). Using the knowledge of the model, we can compute the predictivity for $S_1$ as $\theta_c(S_1) = 0.75$. The predictivity for $S_2$ is $\theta_c(S_2) = 0.75$ also. Incidentally, the predictivity of the union of $S_1$ and $S_2$, $\theta_c(S_1 \cup S_2)$, is also 0.75.
The statistician realizes that using variable sets $S_1$ and $S_2$ he can predict $Y$ correctly 75% of the time. This is indeed the case because, for instance, upon observing $\mathbf{X} = (X_1, \ldots, X_{50})$ the statistician predicts
$$\hat{Y} = X_1 + X_2 \ (\text{modulo } 2).$$
It is easy to verify that the strategy of predicting with $S_1$ returns a 75% prediction accuracy in expectation. This is also the highest percent accuracy $S_1$ can theoretically achieve. We discuss this in depth shortly. This result extends to $S_2$ as well.
Scenario II. In practice, the statistician rarely has knowledge of the model and instead observes only the data. We suggest that the statistician use the partition retention (PR) approach and its corresponding I-score (which we present formally in Alternative Measure: I-Score; see ref. 4 for the original presentation of the approach or see Eq. S2 for $I_{\Pi_\mathbf{X}}$) to identify the influential variable sets. Suppose with 400 observations the researcher wishes to identify variable sets with high predictivity and to infer their abilities to predict. Using the PR approach he can use the I-score to screen for variable sets with high potential predictivity. In this example, $S_1$ and $S_2$ are consistently returned with the highest I-scores (23.71 and 12.79) in simulations. Using the inequality in Eq. 7, which we derive in the following section, the lower bounds for the predictivities $\theta_c(S_1)$ and $\theta_c(S_2)$ are calculated to be 67% and 62%, respectively. Eq. 7 does not require knowledge of the true model as defined in Eq. 1.
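The following is a small simulation sketch of Scenario I (our illustration, not the authors' code): it draws data from the model in Eq. 1 and checks empirically that predicting with $S_1$ is correct about 75% of the time.

```python
# Simulate the toy model in Eq. 1 and verify theta_c(S1) = 0.75 empirically.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
X = rng.integers(0, 2, size=(n, 50))            # 50 binary variables, P(X=1) = 0.5
coin = rng.random(n) < 0.5                      # which branch of Eq. 1 generates Y
y = np.where(coin, (X[:, 0] + X[:, 1]) % 2,
                   (X[:, 1] + X[:, 2] + X[:, 3]) % 2)

y_hat = (X[:, 0] + X[:, 1]) % 2                 # predict with S1 = {X1, X2}
print("empirical correct prediction rate:", (y_hat == y).mean())  # approx. 0.75
```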

Theoretical Prediction Rates

We contribute to the prediction literature by introducing the prediction rate as a parameter to be directly estimated. We show that the PR method’s I-score, a sample-based statistic, can be used to construct an asymptotically consistent lower bound for the prediction rate.
We deal here with the special case of case-control studies where the explanatory variables are discrete and the outcome variable takes only two values, case or control. These results are easily generalized to classification problems, where the dependent variable can take on a finite number of possible values. Consider GWAS data of the usual type, with cases and controls. Assume that there are $n_d$ cases and $n_u$ controls. Using the traditional Bayesian binary classification setting, we ideally have a prior probability, $\pi(w = d)$, that the state of the next individual, $w$, is a disease case, $d$, and $\pi(w = u) = 1 - \pi(w = d)$ that the next individual is a control, $u$. In the following we shall assume that both $d$ and $u$ are equally likely and that the cost of an incorrect classification is the same for both possibilities. We generalize to different cost functions and priors for $d$ and $u$ in Generalization to Arbitrary Priors and Generalization to Different Loss and Cost Functions. Let the joint distribution of the feature value $\mathbf{X}$ and $w$ be $P(\mathbf{x}, w)$. The joint distribution can be expressed as $P(w, \mathbf{x}) = \pi(w|\mathbf{x})P(\mathbf{x}) = P(\mathbf{x}|w)\pi(w)$, where $\pi(w|\mathbf{x})$ is the posterior distribution and $\pi(w)$ is the prior. It is easy to see that the best classification rule can be derived by Bayes' decision rule for minimizing the posterior probability of error: $d$ if $\pi(d|\mathbf{x}) > \pi(u|\mathbf{x})$, otherwise $u$. Here the variable set is $\mathbf{X} = (X_1, X_2, \ldots, X_m)$, with each $X_i$ taking one of the values in $\{0, 1, 2\}$, corresponding to the three possible genotypes for each SNP. In this way, $\mathbf{X}$ forms a partition, denoted by $\Pi_\mathbf{X}$, with $3^m = m_1$ elements: $\Pi_\mathbf{X} = \{\mathbf{X} = \mathbf{x}_j,\ j = 1, \ldots, m_1 : \mathbf{x}_j = (x_{j1}, x_{j2}, \ldots, x_{jm}),\ x_{jk} \in \{0, 1, 2\},\ 1 \le k \le m\}$.
Assuming equal priors, that is, $\pi(d) = \pi(u) = 1/2$, the correct prediction rate $\theta_c$ on $\mathbf{X}$ using the full Bayes' decision rule can be calculated as
$$\theta_c(\mathbf{X}) = \theta_c[p_\mathbf{X}^d, p_\mathbf{X}^u] = \frac{1}{2}\sum_{\mathbf{x} \in \Pi_\mathbf{X}} \max\{p_\mathbf{X}^d(\mathbf{x}),\ p_\mathbf{X}^u(\mathbf{x})\},$$
where $p_\mathbf{X}^d(\mathbf{x})$ and $p_\mathbf{X}^u(\mathbf{x})$ stand for $P(\mathbf{x}|w = d)$ and $P(\mathbf{x}|w = u)$, respectively. We can easily derive (see Technical Notes, Technical Note 1)
$$\theta_c[p_\mathbf{X}^d, p_\mathbf{X}^u] = \frac{1}{2} + \frac{1}{4}\sum_{j \in \Pi_\mathbf{X}} |P(j|d) - P(j|u)|.$$
[2]
This suggests that we can achieve better prediction rates by choosing variable sets corresponding to the probability pairs that lead to large values of $\sum_{j \in \Pi_\mathbf{X}} |P(j|d) - P(j|u)|$. In this theoretical setting, it is easy to show that $\theta_c$ increases or stays the same when another variable is added to the current variable set. This means adding many noisy variables leads to maintaining the same $\theta_c$. Therefore, when sample size is no constraint, we are never hurt in our search for highly predictive variables by simply adding explanatory variables to our current set. However, in the realistic world of sample size constraints, a direct search for a variable set with a larger sample estimate of $\theta_c$ will fail; we offer a heuristic explanation as to why in the following section. We refer to this direct search of $\theta_c$ with sample data as the sample analog throughout.
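As a quick worked illustration of Eq. 2 with hypothetical numbers (not from the paper): suppose a variable set forms only two cells, with $P(j|d) = (0.8, 0.2)$ and $P(j|u) = (0.3, 0.7)$. Then $\sum_j |P(j|d) - P(j|u)| = 0.5 + 0.5 = 1.0$, so Eq. 2 gives $\theta_c = 1/2 + (1/4)(1.0) = 0.75$; equivalently, the Bayes' rule labels the first cell as case and the second as control and, under equal priors, is correct $\frac{1}{2}(0.8 + 0.7) = 0.75$ of the time.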

Problems with the Sample Analog.

The value of $\theta_c$ is unknown and must be estimated. We may naturally turn to the naive sample estimate of its true theoretical value, which is sometimes referred to as the training rate. However, this estimated value of $\theta_c$ (where the cell probabilities are replaced by the observed proportions) is nondecreasing with the addition of more variables to a given variable set under evaluation. As the partition becomes increasingly finer, we reach a point where there is at most a single observation within each partition cell and a 100% correct sample prediction rate is attained. This is true regardless of the true prediction rate. The final estimated prediction rate is then 100%, rendering it useless as a method for finding predictive variable sets and screening out noisy ones. This is a direct result of a sparsity problem that does not occur in our theoretical world but certainly plagues the sample-size-constrained real world. (See Technical Notes, Technical Note 2 for a more detailed explanation.) We need instead a sample-based measure that can discern adding noisy versus influential variables and identify variable sets with large prediction rates for a given moderate sample size.
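A small sketch (illustrative only, not from the paper) of the sparsity problem described above: with an outcome carrying no signal at all, the training rate, the sample analog of $\theta_c$, still climbs toward 100% as noise variables refine the partition.

```python
# The sample analog of theta_c keeps growing as pure-noise binary variables
# refine the partition, reaching 100% once cells hold at most one observation.
import numpy as np

rng = np.random.default_rng(2)
n = 200
y = rng.integers(0, 2, size=n)                    # outcome carries no signal
X = rng.integers(0, 2, size=(n, 12))              # 12 pure-noise variables

def training_rate(cells, y):
    """Sample analog of theta_c: classify each cell by its majority class."""
    correct = 0
    for c in np.unique(cells):
        labels = y[cells == c]
        correct += max((labels == 1).sum(), (labels == 0).sum())
    return correct / len(y)

for m in range(1, 13):                            # adjoin noise variables one by one
    # encode the first m columns as a single cell label of the partition
    cells = np.array([''.join(map(str, row)) for row in X[:, :m]])
    print(m, round(training_rate(cells, y), 3))   # rises toward 1.0 despite no signal
```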

Alternative Measure: I-Score.

We consider this obstacle and suggest an alternative measure, a lower bound to $\theta_c$, which we estimate using the I-score of the PR method (4) in sample data. The I-score converges asymptotically to a constant multiple of
$$\theta_I(\Pi_\mathbf{X}) = \sum_{j \in \Pi_\mathbf{X}} [P(j|d) - P(j|u)]^2.$$
[3]
To relate $\theta_I$ to $\theta_c$ defined in Eq. 2, we first examine the following Lemma 1, which is derived in Technical Notes, Technical Note 3.
Lemma 1. For $K$ real values $\{z_j;\ 1 \le j \le K\}$ with $\sum_{j=1}^K z_j = a$ and $\sum_{j=1}^K |z_j| = b$, we have
$$\sum_{j=1}^K z_j^2 \le \frac{a^2 + b^2}{2}.$$
[4]
In the case of $z_j = P(j|d) - P(j|u)$ for $j \in \Pi_\mathbf{X}$, we have $a = 0$. It then follows that
$$\sqrt{2\sum_{j \in \Pi_\mathbf{X}} [P(j|d) - P(j|u)]^2} \le \sum_{j \in \Pi_\mathbf{X}} |P(j|d) - P(j|u)|.$$
This suggests that a strategy seeking variable sets with larger values of θIcan have the parallel effect of encouraging selection of variable sets with larger values of θc, yielding better predictors. In the following, we present Theorem 1 and Corollary 2 (see Technical Notes, Technical Note 5 and Technical Note 6 for proofs).
Theorem 1. Under the assumptions that $n_d/n \to \lambda$, a value strictly between 0 and 1, and $\pi(d) = \pi(u) = 1/2$, then
$$\lim_{n \to \infty} \frac{s_n^2 I_{\Pi_\mathbf{X}}}{n} \overset{\mathcal{P}}{=} \lambda^2(1-\lambda)^2 \sum_{j \in \Pi_\mathbf{X}} [P(j|d) - P(j|u)]^2,$$
[5]
where $\overset{\mathcal{P}}{=}$ indicates that the left-hand side converges in probability to the right-hand side and $s_n^2 = n_d n_u / n^2$ (see Technical Notes, Technical Note 5 for more detail).
We now show that $\theta_I$ defined in Eq. 3 is a parameter relevant to $\theta_c(\mathbf{X})$. Together with Lemma 1, we can use the I-score to derive a useful asymptotic lower bound to the prediction rate of a variable set $\mathbf{X}$, $\theta_c(\mathbf{X})$, as presented in Corollary 2.
Corollary 2. Under the assumptions in Theorem 1, the following is an asymptotic lower bound for the correct prediction rate:
$$\theta_c(\mathbf{X}) \overset{\mathcal{P}}{\ge} \frac{1}{2} + \frac{1}{4}\sqrt{2\lim_{n \to \infty} \frac{I_{\Pi_\mathbf{X}}}{n\lambda(1-\lambda)}}.$$
[6]
Using sample data, the estimated lower bound for $\theta_c$ is then
$$\frac{1}{2} + \frac{1}{4}\sqrt{\frac{2 I_{\Pi_\mathbf{X}}}{n\lambda(1-\lambda)}}.$$
[7]
The lower bounds presented in the toy example were obtained using the above Eq. 7.
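A minimal sketch of the estimated lower bound in Eq. 7; the function and the numbers below are ours, chosen to reproduce the toy example (equal priors, $n = 400$, and $\lambda = 0.5$ assumed).

```python
# Estimated lower bound of Eq. 7: 1/2 + (1/4) * sqrt(2 I / (n lambda (1 - lambda))).
import math

def iscore_lower_bound(i_score, n, lam):
    return 0.5 + 0.25 * math.sqrt(2 * i_score / (n * lam * (1 - lam)))

print(iscore_lower_bound(23.71, n=400, lam=0.5))  # ~0.67 for S1
print(iscore_lower_bound(12.79, n=400, lam=0.5))  # ~0.63 for S2 (the text reports 62%)
```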
We extend to an arbitrary prior in Corollary 3 (see Generalization to Arbitrary Priors for discussion and proof).
Corollary 3. Under the assumptions of an arbitrary prior $\pi(d)$ and $n_d/n \to \lambda$ as $n \to \infty$, the correct prediction rate is
$$\theta_c[p_\mathbf{X}^d, p_\mathbf{X}^u] = \frac{1}{2} + \frac{1}{2}\sum_{j \in \Pi_\mathbf{X}} |P(j|d)\pi(d) - P(j|u)\pi(u)|.$$
[8]
The last generalization of the proposed framework accounts for incurring different costs (or losses) when making incorrect predictions (see Generalization to Different Loss and Cost Functions for discussion). Note that searching for $\mathbf{X}$ with larger I-scores is asymptotically equivalent to searching for larger values of the lower bound in Eq. 6, which is closely related to the correct predictivity of a given variable set $\mathbf{X}$, $\theta_c(\mathbf{X})$. For example, if a variable set $\mathbf{X}$ has a large I-score (substantially larger than 1; see ref. 4), it is a strong indication that $\mathbf{X}$ itself could be a variable set with high predictivity. This stands in contrast to many current approaches to prediction [e.g., random forest and least absolute shrinkage and selection operator (LASSO)] that are evaluated for predictivity via cross-validation, which is computer-intensive.

Desirable Properties of the I-Score.

We note that the I-score is one possible approach to approximating the prediction rate in the sample analog form, and that the search for other potential scores is desirable and needed. Nevertheless, several properties of $I$ are particularly appealing.
First, $I$ requires no specification of a model for the joint effect of $\{X_1, X_2, \ldots, X_m\}$ on $Y$ because it is designed to capture the discrepancy between the conditional means of $Y$ on $\{X_1, X_2, \ldots, X_m\}$ and the mean of $Y$. Second, as mentioned earlier, the I-score does not monotonically increase with the addition of any and all variables as would the sample analog form of $\theta_c$. Rather, given a variable set of size $m$ with $m - 1$ truly influential variables, the I-score is typically higher under the influential $m - 1$ variables than under all $m$ variables. If the $m - 1$ variables are influential in the sense that any smaller subset of variables is less influential, then removal of a variable to size $m - 2$ will decrease the I-score in expectation. This natural tendency of the I-score to "peak" at variable set(s) that lead to high predictivity in the face of noisy variables under the current sample size is crucial.
Most important to note, we showed that the I-score can help find variables with high $\theta_c$ by identifying variables that have high values of $\theta_I$ (recall $\theta_I = \sum_{j \in \Pi_\mathbf{X}} [P(j|d) - P(j|u)]^2$), which is related to the lower bound of $\theta_c$. An important step to finding these highly predictive variable sets and discarding noisy ones through finding high I-scores is using the backward dropping algorithm (BDA) developed in ref. 4. The algorithm requires drawing many starting sets of variables and recursively dropping variables while calculating I-scores. For more information, see ref. 4 or BDA.

Generalization to Arbitrary Priors

A problem that emerges when dealing with case-control data such as GWAS is that prior information on observing the next person as a disease case is unknown and not easily estimated from empirical data. Priors are defined by circumstances and contexts within which the case-control data are sampled—each dataset requires its own unique and unknown prior at that point in time.
Corollary 3. Under the assumptions of an arbitrary prior $\pi(d)$ and $n_d/n \to \lambda$ as $n \to \infty$, the correct prediction rate can be easily seen as
$$\theta_c[p_\mathbf{X}^d, p_\mathbf{X}^u] = \frac{1}{2} + \frac{1}{2}\sum_{j \in \Pi_\mathbf{X}} |P(j|d)\pi(d) - P(j|u)\pi(u)|.$$
Let the modified score $I_{\Pi}^n$ be defined as
$$n s_n^2 I_{\Pi}^n = \frac{1}{4}\sum_{j \in \Pi_\mathbf{X}} n_j^2\left[\bar{y}_j\,\frac{\pi(d)}{\lambda} - (1 - \bar{y}_j)\,\frac{\pi(u)}{1-\lambda}\right]^2.$$
Then we have
$$\lim_{n \to \infty} \frac{s_n^2 I_{\Pi}^n}{n} \overset{\mathcal{P}}{=} \frac{1}{4}\sum_{j \in \Pi_\mathbf{X}} \left[P(j|d)\pi(d) - P(j|u)\pi(u)\right]^2.$$
[S5]
A lower bound similar to Corollary 2 can then be derived as
$$\theta_c[p_\mathbf{X}^d, p_\mathbf{X}^u] = \frac{1}{2} + \frac{1}{2}\sum_{j \in \Pi_\mathbf{X}} |P(j|d)\pi(d) - P(j|u)\pi(u)| \ \ge\ \frac{1}{2} + \frac{1}{2}\sqrt{2\sum_{j \in \Pi_\mathbf{X}} \left[P(j|d)\pi(d) - P(j|u)\pi(u)\right]^2 - a^2} \ \overset{\mathcal{P}}{=}\ \frac{1}{2} + \frac{1}{2}\sqrt{8\lambda(1-\lambda)\lim_{n \to \infty} \frac{I_{\Pi}^n}{n} - a^2},$$
[S6]
where $a = \sum_{j \in \Pi_\mathbf{X}} \left(P(j|d)\pi(d) - P(j|u)\pi(u)\right) = \pi(d) - \pi(u)$.
Similar to Corollary 2, Eq. S6 is a direct consequence of Eq. S5 and Lemma 1 (with $z_j$ replaced by $P(j|d)\pi(d) - P(j|u)\pi(u)$).

Generalization to Different Loss and Cost Functions

Thus far we have used a 0–1 loss on the binary classification problem. The 0–1 loss treats false negatives and false positives equally. In real applications, the scientist may wish to weigh the costs of different incorrect predictions differently. For instance, failing to detect a cancer patient may be deemed a more costly mistake than misclassifying a healthy patient, because ameliorating the former mistake later on can be more difficult. Different costs in making a loan decision are another example. The cost of lending to a defaulter may be seen as greater than the loss-of-business cost of declining a loan to a nondefaulter due to some positive level of risk aversion. Let the loss function $L$ be defined as
$$L(d, u) = l_d, \qquad L(u, d) = l_u$$
[S7]
and
$$L(d, d) = L(u, u) = 0,$$
[S8]
where $l_d$ and $l_u$ are the prices paid (or losses incurred) for misclassifying a diseased individual to the healthy class or a healthy person to the diseased class, respectively. We can derive the optimum Bayes' solution by minimizing the expected predicted loss, that is, by assigning future observations to the class with the smaller expected loss, given the cell $j$. We simply assign a test sample with partition (predictor) cell $j$ to $d$ if
$$P(j|d)\pi(d)L(d, u) > P(j|u)\pi(u)L(u, d),$$
otherwise we assign it to $u$. Equivalently, choose $d$ if
$$P(j|d)\pi(d)l_d > P(j|u)\pi(u)l_u,$$
otherwise $u$. In this way, the expected loss of adopting this rule is
$$e_l = \frac{1}{2}\sum_{j \in \Pi_\mathbf{X}} \min\{a_j, b_j\},$$
where $a_j = P(j|d)\pi(d)l_d$ and $b_j = P(j|u)\pi(u)l_u$. The random rule of classifying an individual to the healthy class or the disease class has an expected loss of
$$\gamma = \frac{1}{2}\sum_{j \in \Pi_\mathbf{X}} (a_j + b_j) = \frac{1}{2}\left(\pi(d)l_d + \pi(u)l_u\right),$$
a constant independent of the partition $\Pi_\mathbf{X}$. The "gain" $\theta_{cl}$ (interpreted as $\gamma$ less the expected loss of the Bayes' rule) can be defined as
$$\theta_{cl} = \frac{1}{2}\sum_{j \in \Pi_\mathbf{X}} \max\{a_j, b_j\} = \frac{1}{2}\sum_{j \in \Pi_\mathbf{X}} (a_j + b_j) - e_l = \gamma - e_l.$$
Because $\gamma$ is independent of $\mathbf{X}$ and $\Pi_\mathbf{X}$, it is desirable to search for $\mathbf{X}$ with larger $\theta_{cl}$ to achieve better "gains." Again we have
$$\theta_{cl} = \frac{\gamma}{2} + \frac{\theta_{cl} - e_l}{2} = \frac{\gamma}{2} + \frac{1}{4}\sum_{j \in \Pi_\mathbf{X}} |a_j - b_j|.$$
After standardizing by $\gamma$, we obtain the improved prediction rate as
$$\theta_c = \frac{\theta_{cl}}{\gamma} = \frac{1}{2} + \frac{1}{4\gamma}\sum_{j \in \Pi_\mathbf{X}} |a_j - b_j|.$$
Collecting the above discussion together, let the cost-based I-score $I_{\Pi_\mathbf{X}}^c$ be defined as
$$n s_n^2 I_{\Pi_\mathbf{X}}^c = \frac{1}{4\gamma}\sum_{j \in \Pi_\mathbf{X}} n_j^2\left[\bar{y}_j\,\frac{\pi(d)}{\lambda}\,l_d - (1 - \bar{y}_j)\,\frac{\pi(u)}{1-\lambda}\,l_u\right]^2 \approx \frac{n^2}{4\gamma}\sum_{j \in \Pi_\mathbf{X}} \left[P(j|d)\pi(d)l_d - P(j|u)\pi(u)l_u\right]^2.$$
[S9]
We present the following lower bound in Corollary 4. Let
$$\sum_{j \in \Pi_\mathbf{X}} \left(P(j|d)\pi(d)l_d - P(j|u)\pi(u)l_u\right) = \pi(d)l_d - \pi(u)l_u = a.$$
Corollary 4. Under the assumptions of Corollary 2 and using the loss function $L$ described in Eqs. S7 and S8, then
$$\lim_{n \to \infty} \frac{s_n^2 I_{\Pi_\mathbf{X}}^c}{n} \overset{\mathcal{P}}{=} \frac{1}{4\gamma}\sum_{j \in \Pi_\mathbf{X}} \left[P(j|d)\pi(d)l_d - P(j|u)\pi(u)l_u\right]^2.$$
[S10]
Furthermore, one can derive a similar lower bound for the correct prediction rate $\theta_c$ as
$$\theta_c = \frac{1}{2} + \frac{1}{4\gamma}\sum_{j \in \Pi_\mathbf{X}} |a_j - b_j| \ \overset{\mathcal{P}}{\ge}\ \lim_{n \to \infty}\left(\frac{1}{2} + \frac{1}{4\gamma}\sqrt{8\gamma\lambda(1-\lambda)\,\frac{I_{\Pi_\mathbf{X}}^c}{n} - a^2}\right) = \frac{1}{2} + \frac{1}{4\gamma}\sqrt{8\gamma\lim_{n \to \infty}\lambda(1-\lambda)\,\frac{I_{\Pi_\mathbf{X}}^c}{n} - a^2}.$$
[S11]
The proofs for Eqs. S10 and S11 are quite similar to that for Corollary 3 given above; we shall omit them.

Technical Notes

Technical Note 1: Alternative Formulation of the Theoretical Prediction Rate.

Recall that the expected error of adopting the above Bayes' decision rule (under a 0–1 loss) is
$$\theta_e[p_\mathbf{X}^d, p_\mathbf{X}^u] = \frac{1}{2}\sum_{\mathbf{x} \in \Pi_\mathbf{X}} \min\{p_\mathbf{X}^d(\mathbf{x}),\ p_\mathbf{X}^u(\mathbf{x})\}.$$
The correct prediction rate $\theta_c$ on $\mathbf{X}$ is defined as
$$\theta_c(\mathbf{X}) = \theta_c[p_\mathbf{X}^d, p_\mathbf{X}^u] = 1 - \theta_e[p_\mathbf{X}^d, p_\mathbf{X}^u] = \frac{1}{2}\sum_{\mathbf{x} \in \Pi_\mathbf{X}} \max\{p_\mathbf{X}^d(\mathbf{x}),\ p_\mathbf{X}^u(\mathbf{x})\},$$
where $\theta_e$ is the error rate. For simplicity of presentation, we can represent the above as
$$\theta_c = \frac{1}{2}\sum_{j \in \Pi_\mathbf{X}} \max\{P(j|d),\ P(j|u)\},$$
where $j$ is short for $\mathbf{x}_j$, a cell in the partition $\Pi_\mathbf{X}$ formed by the variables $\mathbf{X}$.
It is easy to show that
$$\frac{1}{2}\left\{\theta_c[p_\mathbf{X}^d, p_\mathbf{X}^u] - \theta_e[p_\mathbf{X}^d, p_\mathbf{X}^u]\right\} = \theta_c[p_\mathbf{X}^d, p_\mathbf{X}^u] - \frac{1}{2} = \frac{1}{4}\sum_{j \in \Pi_\mathbf{X}} |P(j|d) - P(j|u)|.$$
Therefore,
$$\theta_c[p_\mathbf{X}^d, p_\mathbf{X}^u] = \frac{1}{2} + \frac{1}{4}\sum_{j \in \Pi_\mathbf{X}} |P(j|d) - P(j|u)|.$$

Technical Note 2: Issue with Sample Analog of θc.

Suppose $\mathbf{X}_m = \{X_1, \ldots, X_m\}$ and $\mathbf{X}_{m+1} = \{X_1, \ldots, X_m, X_{m+1}\}$. The partition formed by $\mathbf{X}_m$ is
$$\Pi_{\mathbf{X}_m} = \{A_1, \ldots, A_{m_1}\},$$
whereas the partition formed by $\mathbf{X}_{m+1}$ is
$$\Pi_{\mathbf{X}_{m+1}} = \{A_1 \cap B, \ldots, A_{m_1} \cap B,\ A_1 \cap B^c, \ldots, A_{m_1} \cap B^c\} = \{\Pi_{\mathbf{X}_m} \cap B,\ \Pi_{\mathbf{X}_m} \cap B^c\},$$
where $B = \{X_{m+1} = 1\}$. Let
$$\Pi_{\mathbf{X}_m}^1 = \Pi_{\mathbf{X}_m} \cap \{X_{m+1} = 1\} \quad \text{and} \quad \Pi_{\mathbf{X}_m}^0 = \Pi_{\mathbf{X}_m} \cap \{X_{m+1} = 0\},$$
where $\Pi_{\mathbf{X}_m}^1$ and $\Pi_{\mathbf{X}_m}^0$ form two subpartitions of $\Pi_{\mathbf{X}_{m+1}}$, i.e., $\Pi_{\mathbf{X}_{m+1}} = \Pi_{\mathbf{X}_m}^0 \cup \Pi_{\mathbf{X}_m}^1$. Then
$$\left|\hat{p}_{\Pi_{\mathbf{X}_m}}(d) - \hat{p}_{\Pi_{\mathbf{X}_m}}(u)\right| \le \left|\hat{p}_{\Pi_{\mathbf{X}_m}^0}(d) - \hat{p}_{\Pi_{\mathbf{X}_m}^0}(u)\right| + \left|\hat{p}_{\Pi_{\mathbf{X}_m}^1}(d) - \hat{p}_{\Pi_{\mathbf{X}_m}^1}(u)\right|,$$
where $\hat{p}(\cdot)$ is the sample estimator. We see that the sample analog inherently favors an increase in the number of partition cells (i.e., adding more variables).

Technical Note 3: Proof of Lemma 1.

It is obvious that $|a| \le b$. Let $S_1$ be the sum of the positive values of $z_j$ and $S_2$ the sum of the negative values. Let $T_1$ be the sum of the squares of the positive values and $T_2$ the sum of the squares of the negative values. It follows that $S_1 + S_2 = a$ and $S_1 - S_2 = b$, and thus $S_1 = (a + b)/2$ and $S_2 = (a - b)/2$. Then clearly $T_1 \le S_1^2$ and $T_2 \le S_2^2$. Consequently,
$$\sum_{j=1}^K z_j^2 = T_1 + T_2 \le S_1^2 + S_2^2 = \frac{a^2 + b^2}{2},$$
[S1]
which is equivalent to the inequality in Eq. 4; equality is attained when there is at most one positive and one negative component if $|a| < b$.

Technical Note 4: Technical Details on I-Score.

The influential score (I-score) is a statistic derived from the PR method. Several forms and variations were associated with the PR method before it was finally coined with this name in 2009 (4). We introduce the PR method and the I-score briefly here.
Consider a set of $n$ observations of a disease phenotype $Y$ (dichotomous or continuous) and a large number $S$ of SNPs, $X_1, X_2, \ldots, X_S$. Randomly select a small group, $m$, of the SNPs. Following the same notation as in previous sections, we call this small group $\mathbf{X} = \{X_k,\ k = 1, \ldots, m\}$. Recall that $X_k$ takes values 0, 1, and 2 (corresponding to three genotypes for a SNP locus: AA, A/B, and B/B). There are then $m_1 = 3^m$ possible values for $\mathbf{X}$. The $n$ observations are partitioned into $m_1$ cells according to the values of the $m$ SNPs ($X_k$'s in $\mathbf{X}$), with $n_j$ observations in the $j$th cell. We refer to this partition as $\Pi_\mathbf{X}$. The proposed I-score (denoted by $I_{\Pi_\mathbf{X}}$) is designed to place greater weight on cells that hold more observations:
$$I_{\Pi_\mathbf{X}} = \sum_{j=1}^{m_1} \frac{\frac{n_j}{n}\left(\bar{Y}_j - \bar{Y}\right)^2}{s_n^2/n_j} = \frac{\sum_{j=1}^{m_1} n_j^2 \left(\bar{Y}_j - \bar{Y}\right)^2}{\sum_{i=1}^{n} \left(Y_i - \bar{Y}\right)^2},$$
[S2]
where $s_n^2 = \frac{1}{n}\sum_{i=1}^n (Y_i - \bar{Y})^2$. We note that the I-score is designed to capture the discrepancy between the conditional means of $Y$ on $\{X_1, X_2, \ldots, X_m\}$ and the mean of $Y$.
In this paper, we consider the special problem of a case-control experiment where there are $n_d$ cases and $n_u$ controls and the variable $Y$ is 1 for a case and 0 for a control. Then $s_n^2 = (n_d n_u)/n^2$, where $n = n_d + n_u$.
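A minimal computational sketch of Eq. S2 (ours, not the authors' implementation); cells of $\Pi_\mathbf{X}$ are taken to be the observed value combinations of the selected variables, and the toy data at the end are arbitrary.

```python
# I-score of Eq. S2: sum_j n_j^2 (Ybar_j - Ybar)^2 / sum_i (Y_i - Ybar)^2.
from collections import defaultdict
import numpy as np

def i_score(X_sub, y):
    y = np.asarray(y, dtype=float)
    ybar = y.mean()
    denom = ((y - ybar) ** 2).sum()            # equals n * s_n^2
    groups = defaultdict(list)
    for row, yi in zip(np.asarray(X_sub), y):
        groups[tuple(row)].append(yi)          # partition cell = value combination
    num = sum(len(v) ** 2 * (np.mean(v) - ybar) ** 2 for v in groups.values())
    return num / denom

# tiny usage example with genotype-coded variables in {0, 1, 2}
rng = np.random.default_rng(3)
X = rng.integers(0, 3, size=(500, 2))
y = (X[:, 0] + X[:, 1]) % 2                    # toy joint signal
flip = rng.random(500) < 0.2                   # add some label noise
y = np.where(flip, 1 - y, y)
print(i_score(X, y))
```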

Technical Note 5: Proof of Theorem 1.

We prove that the I-score approaches a constant multiple of $\theta_I$ asymptotically.
Under the null hypothesis of no association between $\mathbf{X} = \{X_k,\ k = 1, \ldots, m\}$ and $Y$, $I_{\Pi_\mathbf{X}}$ can be asymptotically expressed as $\sum_{j=1}^{m_1} \lambda_j \chi_j^2$ (a weighted average), where $\lambda_j$ is between 0 and 1 and $\sum_{j=1}^{m_1} \lambda_j$ is equal to $1 - \sum_{j=1}^{m_1} p_j^2$, where $p_j$ is cell $j$'s probability. $\{\chi_j^2\}$ are $m_1$ chi-squares, each with degree of freedom $\mathrm{df} = 1$ (see ref. 4).
Furthermore, the above formulation and properties of $I_{\Pi_\mathbf{X}}$ apply to the specified $Y$ model with a case-control study (where $Y = 1$ designates case and $Y = 0$ designates control) as demonstrated in ref. 4. More specifically, in a case-control study with $n_d$ cases and $n_u$ controls (letting $n = n_d + n_u$), $n s_n^2 I_{\Pi_\mathbf{X}}$ can be expressed as the following:
$$n s_n^2 I_{\Pi_\mathbf{X}} = \sum_{j \in \Pi_\mathbf{X}} n_j^2 \left(\bar{Y}_j - \bar{Y}\right)^2 = \sum_{j \in \Pi_\mathbf{X}} \left(n_{d,j}^m + n_{u,j}^m\right)^2 \left(\frac{n_{d,j}^m}{n_{d,j}^m + n_{u,j}^m} - \frac{n_d}{n_d + n_u}\right)^2 = \left(\frac{n_d n_u}{n_d + n_u}\right)^2 \sum_{j \in \Pi_\mathbf{X}} \left(\frac{n_{d,j}^m}{n_d} - \frac{n_{u,j}^m}{n_u}\right)^2,$$
where $n_{d,j}^m$ and $n_{u,j}^m$ denote the numbers of cases and controls falling in the $j$th cell, and $\Pi_\mathbf{X}$ stands for the partition formed by the $m$ variables in $\mathbf{X}$. Since the PR method seeks the partition that yields larger I-scores, one can decompose the following:
$$n s_n^2 I_{\Pi_\mathbf{X}} = \sum_{j \in \Pi_\mathbf{X}} n_j^2 \left(\bar{Y}_j - \bar{Y}\right)^2 = A_n + B_n + C_n,$$
where $A_n = \sum_{j \in \Pi_\mathbf{X}} n_j^2 (\bar{Y}_j - \mu_j)^2$, $B_n = \sum_{j \in \Pi_\mathbf{X}} n_j^2 (\bar{Y} - \mu_j)^2$, and $C_n = \sum_{j \in \Pi_\mathbf{X}} 2 n_j^2 (\bar{Y}_j - \mu_j)(\mu_j - \bar{Y})$. Here, $\mu_j$ and $\mu$ are the local and grand means of $Y$, that is, $E(\bar{Y}_j) = \mu_j$; $\bar{Y} = \mu = \frac{n_d}{n_d + n_u}$ for fixed $n$. It is easy to see that both terms $A_n$ and $C_n$, when divided by $n^2$, converge to 0 in probability as $n \to \infty$. We turn to the final term, $B_n$. Note that
$$\lim_{n \to \infty} \frac{B_n}{n^2} \overset{\mathcal{P}}{=} \lim_{n \to \infty} \sum_{j \in \Pi_\mathbf{X}} \left(\frac{n_j^2}{n^2}\right)(\mu_j - \mu)^2.$$
In a case-control study, we have
$$\mu_j = \frac{n_d P(j|d)}{n_d P(j|d) + n_u P(j|u)}$$
and
$$\mu = \frac{n_d}{n_d + n_u}.$$
Because for every $j$, $n_j/n$ converges (in probability) to $p_j = \lambda P(j|d) + (1 - \lambda)P(j|u)$ as $n \to \infty$, if $\lim_{n \to \infty} n_d/n = \lambda$, a fixed constant between 0 and 1, it follows that
$$\frac{B_n}{n^2} = \sum_{j \in \Pi_\mathbf{X}} \left(\frac{n_j^2}{n^2}\right)(\mu_j - \mu)^2 \ \overset{\mathcal{P}}{\to}\ \sum_{j \in \Pi_\mathbf{X}} p_j^2 \left(\frac{\lambda P(j|d)}{\lambda P(j|d) + (1 - \lambda)P(j|u)} - \lambda\right)^2 \quad \text{as } n \to \infty$$
$$= \sum_{j \in \Pi_\mathbf{X}} \left\{\lambda P(j|d) - \lambda\left[\lambda P(j|d) + (1 - \lambda)P(j|u)\right]\right\}^2 = \sum_{j \in \Pi_\mathbf{X}} \left\{\lambda(1 - \lambda)P(j|d) - \lambda(1 - \lambda)P(j|u)\right\}^2 = \lambda^2(1 - \lambda)^2 \sum_{j \in \Pi_\mathbf{X}} \left[P(j|d) - P(j|u)\right]^2.$$
Thus, ignoring the constant term in the above equation, the I-score can guide a search for $\mathbf{X}$ partitions, which will lead to finding larger values of the summation term $\sum_{j \in \Pi_\mathbf{X}} [P(j|d) - P(j|u)]^2$. We have proven Theorem 1.
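A small numerical check of Theorem 1 (our sketch, with hypothetical cell probabilities): simulate a large case-control sample from assumed $P(j|d)$ and $P(j|u)$, compute $s_n^2 I_{\Pi_\mathbf{X}}/n$, and compare it with the theoretical limit.

```python
# Numerical check: s_n^2 * I / n should approach
# lambda^2 (1 - lambda)^2 * sum_j (P(j|d) - P(j|u))^2.
import numpy as np

rng = np.random.default_rng(4)
p_d = np.array([0.5, 0.3, 0.1, 0.1])     # hypothetical P(j | d) over 4 cells
p_u = np.array([0.2, 0.3, 0.3, 0.2])     # hypothetical P(j | u)
lam, n = 0.5, 200_000
n_d = int(lam * n); n_u = n - n_d

cells_d = rng.choice(4, size=n_d, p=p_d)  # cases
cells_u = rng.choice(4, size=n_u, p=p_u)  # controls
y = np.concatenate([np.ones(n_d), np.zeros(n_u)])
cells = np.concatenate([cells_d, cells_u])

ybar = y.mean()
s2 = ybar * (1 - ybar)                    # = n_d * n_u / n^2
I = sum((y[cells == j].size ** 2) * (y[cells == j].mean() - ybar) ** 2
        for j in range(4)) / (n * s2)

print(s2 * I / n)                                            # empirical left side
print(lam ** 2 * (1 - lam) ** 2 * ((p_d - p_u) ** 2).sum())  # theoretical limit
```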

Technical Note 6: Proof of Corollary 2.

Under the assumptions in Theorem 1, the following is an asymptotic lower bound for the correct prediction rate:
$$\theta_c(\mathbf{X}) \overset{\mathcal{P}}{\ge} \frac{1}{2} + \frac{1}{4}\sqrt{2\lim_{n \to \infty} \frac{I_{\Pi_\mathbf{X}}}{n\lambda(1-\lambda)}}.$$
[S3]
Proof: From Eq. 2,
$$\theta_c(\mathbf{X}) = \frac{1}{2} + \frac{1}{4}\sum_{j \in \Pi_\mathbf{X}} |P(j|d) - P(j|u)| \ \ge\ \frac{1}{2} + \frac{1}{4}\sqrt{2\sum_{j \in \Pi_\mathbf{X}} \left(P(j|d) - P(j|u)\right)^2} \quad \text{(Lemma 1)}$$
$$= \frac{1}{2} + \frac{1}{4}\sqrt{2\,\theta_I(\Pi_\mathbf{X})} \ \overset{\mathcal{P}}{=}\ \frac{1}{2} + \frac{1}{4}\sqrt{2\lim_{n \to \infty} \frac{s_n^2 I_{\Pi_\mathbf{X}}}{n\,\lambda^2(1-\lambda)^2}} \quad \text{(Theorem 1)} \ \overset{\mathcal{P}}{=}\ \frac{1}{2} + \frac{1}{4}\sqrt{2\lim_{n \to \infty} \frac{I_{\Pi_\mathbf{X}}}{n\,\lambda(1-\lambda)}}.$$
[S4]
This asymptotic lower bound is a simple consequence of Lemma 1 and Theorem 1. In theory, the above corollary allows us to apply a useful lower bound for identifying good variable sets with large I-scores. In practice, however, once the variable sets are found (through their large I-scores), the true prediction rates can be greater than the identified lower bounds. Theorem 1 provides a simple asymptotic behavior of the I-score under some strict assumptions. We offer similar derivations below following two levels of relaxation of the constraints.
We remark that with additional work one can show that the convergence given above can be extended to hold uniformly over all partitions $\{\Pi\}$ with a bounded number of cells and for all $\lambda$ that stay away from 0 and 1.

BDA

The BDA§ is a greedy algorithm that searches for the variable subset maximizing the I-score through stepwise elimination of variables from an initial subset sampled in some way from the variable space. The details are as follows.
i)
Training set: Consider a training set $\{(y_1, x_1), \ldots, (y_n, x_n)\}$ of $n$ observations, where $x_i = (x_{1i}, \ldots, x_{pi})$ is a $p$-dimensional vector of explanatory variables. Typically $p$ is very large. All explanatory variables are discrete.
ii)
Sampling from variable space: Select an initial subset of $k$ explanatory variables $\mathbf{X}_b = \{X_{b_1}, \ldots, X_{b_k}\}$, $b = 1, \ldots, B$.
iii)
Compute I-score: $I(\mathbf{X}_b) = \sum_{j \in \Pi_{\mathbf{X}_b}} n_j^2 (\bar{Y}_j - \bar{Y})^2$.
iv)
Drop variables: Tentatively drop each variable in $\mathbf{X}_b$ and recalculate the I-score with one variable less. Then drop the one that gives the highest I-score. Call this new subset $\mathbf{X}_b'$, which has one variable less than $\mathbf{X}_b$.
v)
Return set: Continue the next round of dropping on $\mathbf{X}_b'$ until only one variable is left. Keep the subset that yields the highest I-score in the whole dropping process. Refer to this subset as the return set $\mathbf{R}_b$. Keep it for future use.
If no variable in the initial subset has influence on $Y$, then the values of $I$ will not change much in the dropping process. However, when influential variables are included in the subset, the I-score will increase (decrease) rapidly before (after) reaching the maximum.
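A compact sketch of steps ii)–v) above (ours, not the reference implementation of ref. 4); it assumes an i_score(X_sub, y) function such as the one sketched for Eq. S2, and k and B are illustrative parameter choices.

```python
# Backward dropping algorithm (BDA) sketch: repeated random starts, each
# followed by greedy one-at-a-time variable elimination guided by the I-score.
import numpy as np

def bda(X, y, i_score, k=8, B=50, rng=None):
    rng = rng or np.random.default_rng()
    n, p = X.shape
    return_sets = []
    for _ in range(B):
        current = list(rng.choice(p, size=k, replace=False))   # ii) random initial subset
        best_set, best_score = list(current), i_score(X[:, current], y)  # iii)
        while len(current) > 1:
            # iv) tentatively drop each variable; keep the drop giving the highest I-score
            scores = [(i_score(X[:, [v for v in current if v != drop]], y), drop)
                      for drop in current]
            top_score, drop = max(scores)
            current.remove(drop)
            if top_score > best_score:
                best_score, best_set = top_score, list(current)
        return_sets.append((best_score, tuple(sorted(best_set))))  # v) return set R_b
    return sorted(return_sets, reverse=True)

# usage: results = bda(X, y, i_score); results[0] holds the best (score, variable set).
```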

Using the I-Score in Sample-Constrained Settings

We have shown that $I/n$ asymptotically approaches a constant multiple of $\theta_I$ (which is related to a lower bound of $\theta_c$) and has several desirable properties. We take this opportunity to explore and illustrate an application of the I-score measuring predictivity with sample data. To provide additional evidence of the I-score's ability to measure true predictivity, we consider a set of simulations for which we know the "true" levels of predictivity for all variable sets. We also provide a real data application on breast cancer for which the I-score approach has done very well in predicting.
We take a moment to comment that evaluating a variable set for predictivity, what we have called here VSA, is different from evaluating a given classifier, which is the prediction stage, usually following or in conjunction with VS. The latter considers evaluating $f(\mathbf{x})$, a specific function $f(\cdot)$ applied to a particular set of explanatory variables $\mathbf{x}$, for a given outcome variable $y$, whereas the former considers the potential predictivity of the set of explanatory variables $\mathbf{x}$ for that outcome $y$ over all possible $f(\cdot)$. Our work here focuses simply on VSA. Variable sets assessed as highly predictive in our framework can then be flexibly used in various models for prediction purposes as pleases the researcher.
We are now in an odd situation where we have identified variable sets that could not have been found using conventional approaches and yet we wish to evaluate the predictivity of our identified variable sets against these conventional approaches. Nevertheless, we endeavor to do so. A couple of options arise as approaches to compare against: the training prediction rate and the out-of-sample testing prediction rate. We will show that the I-score-based measure provides a useful and meaningful estimated lower bound to the correct prediction rate and correlates well with the out-of-sample test rate, whereas the training rate statistic, the sample analog of $\theta_c$, does not. As such, our approach has an important benefit for prediction research: Compared with methods such as cross-validation of error rates, the I-score is efficient in the use of sample data, in the sense that it uses all observations instead of separating data into testing and training sets.

Simulations.

We offer simulations to illustrate how (i) the I-score can serve as a lower bound to the true predictivity of a given variable set even as noisy variables are adjoined, (ii) it can thereby serve as a screening mechanism, and (iii) finding the maximum I-score when conducting a BDA leads to finding the variable set with the highest corresponding level of predictivity. BDA reduces a variable set one variable at a time, by eliminating the weakest element until $I$ reaches a peak.
We consider a module of three important variables $\{X_1, X_2, X_3\}$ (see Fig. 2 for the disease model used) among six unimportant variables $\{X_7, \ldots, X_{12}\}$, using sample sizes of 250 cases/250 controls, 500 cases/500 controls, and 1,000 cases/1,000 controls. (See Simulation Details for more detailed model settings and simulation details.) We demonstrated that $\frac{I}{n\lambda(1-\lambda)}$ estimates* $\theta_I$, which is related to an asymptotic lower bound (Eq. 6) for $\theta_c$, as $n \to \infty$. It would be helpful to see how $I$ performs at fixed, reasonable sample sizes. We compare the I-score-derived predictivity lower bound against the Bayes' theoretical prediction rate in our simulations to illustrate this. The out-of-sample correct prediction rate is presented in the simulations here as a further benchmark against which the I-score can be compared when data are limited, as is the case in real-world applications. The out-of-sample correct prediction rate is derived from the most optimistic context achievable in the real world, whereby future testing data are infinite. In all of the simulations, the I-score of a set of influential variables drops when a noisy variable is added. This drop is subsequently seen in the I-score-derived bound for the correct prediction rate. The I-score can screen out noisy variables, which makes it useful in practical data applications.
Fig. 2.
A three-SNP disease model.
To illustrate how these statistics fare in accurately capturing the level of predictivity of each variable set under consideration, we consider their performance given that $X_2$ and $X_3$ have already been found to be important. We then add $X_1$, which should ideally correspond with an increase in the statistic. We continue adding the remaining noisy variables one at a time to this "good" set of variables and observe how the statistics evaluate the new, larger set of variables for predictivity. In Fig. 3, violin plots show distributions of the training rate, the I-score lower bound, and the ideal out-of-sample prediction rate under each setting across the simulations. The theoretical Bayes' rate is also plotted as a reference; it remains flat when noisy variables are added. This is because the Bayes' rate is defined purely by the partition formed from the informative variables and does not change when adjoining noisy variables ($X_7, \ldots, X_{12}$) and creating finer partitions.
Fig. 3.
Variable set size 3: Comparison of the training rate and the lower bound based on the I-score against the out-of-sample prediction rate. We compare two statistics, I-score lower bound and the training set prediction rate against the out-of-sample prediction rate. Lower bound from the I-score is provided in red, training set prediction rate in blue, and the out-of-sample prediction rate is in light blue. The thick black line in all six graphs is the true Bayes’ rate. All x axes correspond to variable sets (described in red for important variables and black for noisy ones) and all y axes correspond to (correct) prediction rate. There are three important variables in this example, X1, X2, and X3. The top row of graphs compares the (red) I-score statistics against the (light blue) out-of-sample prediction rate. The lower row of graphs compares the (dark blue) training set prediction rate against the (light blue) out-of-sample prediction rate. From left to right the graphs increase in sample size from 250 cases and 250 controls, to 500 cases and 500 controls, to 1,000 cases and 1,000 controls.
Several patterns emerge in these simulations. First, and most importantly, the I-score-derived prediction rate seems to be a reasonable lower bound to the Bayes’ rate. This holds even in moderate sample sizes.
The second pattern is that the estimated I-score lower bound peaks at the variable set that includes all influential variables ($X_1$, $X_2$, and $X_3$) and no additional noisy variables. This is a characteristic of the out-of-sample correct prediction rate as well. For instance, if we consider the top row of Fig. 3 and start from the right of the x axes in each of the three plots with the largest set of variables, inclusive of both influential and noisy variables ($X_1, X_2, X_3, X_7, \ldots, X_{12}$), continual removal of the noisy variables (sliding to the left of the x axis) until we reach the variable set ($X_1, X_2, X_3$) results in higher predictivity as measured by the I-score lower bound. We note that the I-score lower bound drops upon further removing the influential $X_1$ variable from the set ($X_1, X_2, X_3$). Thus, the variable set with the maximum I-score-derived lower bound here both identifies the largest possible variable set of influential variables with no noisy variables and is also reflective of a conservative lower bound of the correct prediction rate for that variable set. We note that once we have found the variable sets with the highest I-scores and calculated the corresponding lower bound of the correct prediction rate, we can adjust this lower bound for its bias to derive an improved estimate of the correct prediction rate.
A third pattern is that the training rate suffers from overfitting when adjoining noisy variables, even when the variable set includes a truly influential subset of variables. If the variable set is irreducible, however, the training rate estimator reflects the Bayes' correct prediction rate well; thus, the training rate estimator can perform reasonably well conditional on already having identified ($X_1, X_2, X_3$). The training rate estimator cannot be used to screen down to that variable set first, however.
Finally, and as we might expect, the training set rate explodes due to overfitting in high dimensions as noisy variables are adjoined to the partition formed by the informative variables ($X_1$, $X_2$, $X_3$). Although the training set prediction rate seems to improve as the sample size increases, it cannot be used to screen out noisy variables and is therefore difficult to use as a statistic to select highly predictive variable sets. The predictivity rates found through this statistic also dramatically depart from the out-of-sample testing rate. It tends to ever-optimistically evaluate variable sets for their future predictions even when noisy variables are added. This stands in stark contrast to the out-of-sample prediction rate, which decreases with the addition of useless variables. We also notice that the I-score-derived prediction rate does not remain flat. The score increases when removing a noisy variable and reducing to a variable set of only influential variables, indicating an additional advantage of the I-score as a lower bound; the I-score prefers a simpler model even when the Bayes' rate remains the same, selecting for more parsimonious partitions that attain the Bayes' rate, which is simultaneously a closer reflection of the out-of-sample prediction rate.
Recall that the correct prediction rate is based on an absolute difference of probabilities summed over the cells of the partition formed by $\mathbf{X}$. Suppose we start with influential variables only, with correct prediction rate $\theta_c$, the highest we can attain out of all possible variable sets. Adding noisy variables to this set, variables that add no signal but simply create a finer partition, still returns $\theta_c$. When estimating the correct prediction rate using sample data, though, the training estimate of $\theta_c$ generally keeps increasing as noisy variables are added; the researcher does not know when to stop the search for influential variables, making selection of highly predictive variables difficult. Ideally, we would like to "punish" adding such noisy variables to our variable set, so having a measure that balances between favoring coarser partitions but still recognizing actual new variables with strong enough signals (non-noisy variables) is important. The I-score seems to support such an effect, preferring coarser partitions unless an additional variable (and therefore a finer partition) provides enough signal in the data to justify keeping it.
Noisy variables in sample data may be indicative of actually noisy variables or of influential variables with weak signals given the sample size. Thus, we note there are cases where the I-score might not recognize these variables, when their signals would require unrealistic sample sizes to be found through the measure. An example of this would be if a good predictor is highly complex (perhaps a combination of very many variables) and the observations are sparse in the partition. Because the I-score places greater weight on where the data tend to appear (note the $n_j^2$ term in the score), when most of the partition cells contain no observations or at most one observation, this can often look like noise.
The main draw of the I-score is its ability to screen for influential variable sets. The variable set including only the three influential variables ($X_1$, $X_2$, and $X_3$) displays the highest I-scores. Searching for variable sets with the highest I-scores thus tends to return highly influential variables only. Using the training prediction rate as a guiding measure for screening, however, would continually seek ever-larger variable sets, regardless of whether they include noisy variables or not.

Real Data Application: van’t Veer Breast Cancer Data.

To reinforce the previous sections, we briefly analyze real disease data. As noted before, part of this research team has found that applying the PR approach to real disease data has not only been quite successful in finding variable sets (thus encompassing higher-order interactions, traditionally rather tricky in big data), but has also resulted in finding very predictive variable sets that do not necessarily show up as significant through traditional significance testing. We present one discovered variable set (a total of 18 variable sets were found in ref. 5) that is highly predictive for a breast cancer dataset yet is not highly significant using a chi-square test. In Table 1 we investigate the top five-variable set (in this case five genes) found to be predictive in ref. 5 through both a top I-score and performance in prediction in cross-validation and an independent testing set. To gauge how significant these variables are, we calculate the individual, marginal association of each variable, reported as the marginal P value. Given the familywise P value threshold of $6.98 \times 10^{-5}$, none of these variables appears statistically significant. Measuring the joint influence of all five variables is not significant either. Using the variable sets (all 18 in ref. 5) with the highest I-scores to predict on this dataset resulted in an out-of-sample testing error rate of 8%, in direct comparison with the literature's best error rates of 30%. Using only the variable set displayed in Table 1 and the lower bound in Eq. 6, we can calculate the asymptotic lower bound of the correct prediction rate for this variable set as 59%. Thus, using this variable set alone, we can achieve at least a 59% correct classification rate. For details on the final predictors, see ref. 5.
Table 1.
Real data example: van’t Veer breast cancer data (6)

Table reused from ref. 3 [Lo et al.]. The data can be downloaded from ccb.nki.nl/data/.


Simulation Details

The simulation is based on a six-SNP disease model. The six SNPs are organized into two three-SNP modules, ($X_1$, $X_2$, $X_3$) and ($X_4$, $X_5$, $X_6$). Six additional variables ($X_7, \ldots, X_{12}$) are simulated to be noisy and unrelated to the disease. The minor allele frequencies of the SNPs are all 0.5. The risk of the disease for an individual depends on the two three-SNP genotypes of these two modules. Each module defines two sets of genotypes, high-risk genotypes and low-risk genotypes, identically depicted in Fig. 2. If an individual has two low-risk genotypes, he has odds of 1/60 of having the disease. Here, the odds are the ratio of the probability of an event occurring (disease) over the probability of the event not occurring (no disease). For an individual with one of the low-risk genotypes and one of the high-risk genotypes, the odds increase to 1/10. If an individual has high-risk genotypes for both modules, the odds become 1. In this section, we present results for the first module ($X_1$, $X_2$, and $X_3$). In Fig. S1 we present results for both modules, or all six SNPs, together.
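A sketch of the risk structure just described (ours, not the authors' simulation code); the exact high-risk genotype sets come from Fig. 2, so the is_high_risk function below is only a placeholder assumption.

```python
# Odds-based disease model for the two three-SNP modules described above.
import numpy as np

ODDS = {0: 1 / 60, 1: 1 / 10, 2: 1.0}      # odds by number of high-risk modules

def is_high_risk(genotypes_3snp):
    """Placeholder: should return True for the Fig. 2 high-risk genotypes."""
    return genotypes_3snp.sum() >= 4        # assumption for illustration only

def disease_probability(g6):
    """g6: genotypes (0/1/2) for X1..X6, grouped into two three-SNP modules."""
    k = int(is_high_risk(g6[:3])) + int(is_high_risk(g6[3:]))
    odds = ODDS[k]
    return odds / (1 + odds)                # convert odds to a probability

rng = np.random.default_rng(5)
g = rng.binomial(2, 0.5, size=6)            # minor-allele frequency 0.5 per SNP
print(disease_probability(g))
```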
Fig. S1.
Variable set size 6: Comparison of the training rate and I-score against the out-of-sample prediction rate. Again we compare two statistics, I-score lower bound and the training set prediction rate against the out-of-sample prediction rate. Lower bound from the I-score is provided in red, training set prediction rate in blue, and the out-of-sample prediction rate is in light blue. The thick black line in all six graphs is the true Bayes' rate. All x axes correspond to variable sets (described in red for important variables and black for noisy ones) and all y axes correspond to (correct) prediction rate. There are six important variables in this example, X1, X2, X3, X4, X5, and X6. The top row of graphs compares the (red) I-score statistics against the (light blue) out-of-sample prediction rate. The lower row of graphs compares the (dark blue) training set prediction rate against the (light blue) out-of-sample prediction rate. From left to right the graphs increase in sample size from 250 cases and 250 controls, to 500 cases and 500 controls, to 1,000 cases and 1,000 controls.
The data can take on three sample-size levels: 250 cases/250 controls, 500 cases/500 controls, and 1,000 cases/1,000 controls. For each possible variable set we create a partition $\Pi$ and calculate $\hat{p}_i^d$ and $\hat{p}_i^u$ (the estimated probabilities that an individual in cell $i$ is a case or a control, respectively): $n_i^d/n_d$ and $n_i^u/n_u$, where $i = 1, \ldots, m$ and $m = |\Pi|$ is the size of the partition $\Pi$. We conducted 300 simulations and evaluated a set of statistics on each of the variable sets for each simulation: the training prediction rate, Bayes' prediction rate, out-of-sample prediction rate, and the I-score-derived lower bound estimate of the predictivity rate; see Fig. 3. Throughout, we assume prior probabilities of (0.5, 0.5) for case and control. The statistics are detailed below; a small computational sketch follows the list:
i)
Training prediction rate is defined as the following:
$$\frac{1}{2}\sum_{j=1}^{m_1} \max\left(\hat{p}_j^d,\ \hat{p}_j^u\right)$$
ii)
Bayes’ rate: Recall this rate is constant across all variable sets that are inclusive of the truly influential variables, regardless of how many noisy variables are also included. This is the best predictivity one can achieve if knowledge of the influential variables is available. It is defined as
$$\frac{1}{2}\sum_{j=1}^{m_1} \max\left(p_j^d,\ p_j^u\right)$$
iii)
Out-of-sample prediction rate: This is conducted on the “infinite” future data to find pjdand pjufor the rate. The “infinite” future data are often unrealistic with real data but we present it for the purposes of this simulation and to clearly provide a gold standard against which to compare. It is defined as
$$\frac{1}{2}\sum_{j=1}^{m_1} \left[p_j^d \hat{Y}_j + p_j^u \left(1 - \hat{Y}_j\right)\right],$$
where $\hat{Y}_j \in \{0, 1\}$ is the class assigned to cell $j$ by the rule fit on the training data.
iv)
I-score lower bound predictivity rate as defined from Eq. 7.
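A small computational sketch of statistics i)–iii) for a single variable set (ours, with hypothetical cell probabilities), assuming equal priors as stated above.

```python
# Training rate, Bayes' rate, and out-of-sample rate from estimated (p_hat)
# and true (p) conditional cell probabilities, under equal priors.
import numpy as np

def training_rate(p_hat_d, p_hat_u):
    return 0.5 * np.maximum(p_hat_d, p_hat_u).sum()            # statistic i)

def bayes_rate(p_d, p_u):
    return 0.5 * np.maximum(p_d, p_u).sum()                    # statistic ii)

def out_of_sample_rate(p_hat_d, p_hat_u, p_d, p_u):
    y_hat = (p_hat_d > p_hat_u).astype(float)                  # rule fit on training data
    return 0.5 * (p_d * y_hat + p_u * (1 - y_hat)).sum()       # statistic iii)

p_d = np.array([0.5, 0.3, 0.1, 0.1])          # hypothetical true cell probabilities
p_u = np.array([0.2, 0.3, 0.3, 0.2])
p_hat_d = np.array([0.48, 0.33, 0.09, 0.10])  # hypothetical training estimates
p_hat_u = np.array([0.22, 0.28, 0.31, 0.19])
print(training_rate(p_hat_d, p_hat_u), bayes_rate(p_d, p_u),
      out_of_sample_rate(p_hat_d, p_hat_u, p_d, p_u))
```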

Concluding Remarks

Prediction has become more important in recent decades and, with it, the need for tools appropriate for good prediction. A first step can be to assess variable sets for predictivity, which we call VSA. We show in other work that assessing variables for prediction using a statistical significance criterion is not ideal (3). A currently popular alternative is to select variables via sample-based, out-of-sample testing error rates. This approach is ad hoc in nature, sample-based, and does not measure a theoretical underlying level of predictivity for a given variable set. Often, validation of selected candidate variable sets requires setting aside valuable sample data for out-of-sample testing or cross-validation. Sometimes the sample size may not suffice for validating variable set sizes larger than one or two variables, as is often the case in big data like GWAS. Cross-validation avoids setting aside sample data as an independent test set but is computationally difficult in big data. As such, prediction research would benefit from a theoretical framework that directly defines a variable set's predictivity as a parameter of interest to estimate. We believe our work here is a preliminary and important effort in that direction, by considering what theoretically highly predictive variable sets are and how we might try to find them. In fact, using measures such as the I-score could be an important new direction in the prediction literature because it neither uses the training sample prediction rate nor requires an artificial or ad hoc regularization choice.
We identify the equation for the theoretical correct predictivity of variable sets ($\theta_c$) in Eq. 2 and then demonstrate that, unfortunately, the training estimate of it is quite useless. As such, we offer an alternative measure. We show that $I/n$ asymptotically approaches a constant multiple of $\theta_I$, which provides a lower bound to the $\theta_c$ of Eq. 2, and is thus correlated with the correct predictivity rate of a given variable set. Importantly, we show that the I-score has a natural tendency to discard noisy variables, keep influential ones, and asymptotically approach this lower bound to $\theta_c$. The I-score does well in identifying predictive variable sets in both our complex simulations and our real data application.
We note that other measures with such desirable properties may also exist, and we encourage rigorous research in this direction. As a new field of inquiry, the search for measures that maximize predictivity may do much in the way of living up to the hopes of advancing the prediction of outcomes of interest, such as disease status. In some ways, this work is motivated by a practical consideration of finite samples. As noted in the setup of our framework, in a theoretical world of limitless data we can in fact find the variable sets with the highest values of $\theta_c$. However, our real world of finite sample sizes requires other sample-appropriate measures that may approximate but not achieve the maximum $\theta_c$. In other words, based on the available sample size, the I-score, and any other such measure, detects not necessarily the maximum $\theta_c$ but some $\theta_{c,n}^H$, the largest $\theta_c$ correct prediction rate for which the corresponding $\mathbf{X}$ variables can be selected given $n$. Consider a situation where the true set of variables $\mathbf{X}$ that provides the theoretical maximum $\theta_c$ is very large. Suppose we have a sample of data that is quite modest. Selecting all variables in $\mathbf{X}$ is not possible given the sample size $n$ (too many of the cell frequencies are small or zero), and so a measure such as the I-score retrieves a variable set that provides potentially the largest $\theta_c$ achievable given the sample constraint. This in some ways mirrors the common issue of not detecting true effects when the sample size is too small in statistical significance testing.
The important question of how to combine identified predictive variable sets into final prediction models is outside the scope of this paper.

Simulation Results for Important Variable Set of Size 6

Here we present, in Fig. S1, simulation results for the six-SNP model described in the main text. All other simulation parameters were the same as in the three-SNP example.
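As a rough illustration of what these figures compare, the following toy sketch (not the authors' three- or six-SNP disease model; the generating model, sample sizes, and function names are assumptions made only for illustration) fits a simple cell-majority classification rule on a training sample and contrasts its training correct-prediction rate with an out-of-sample rate. The training rate is optimistically inflated relative to the out-of-sample rate, which is the gap the figures display for the training estimator.

```python
# Toy sketch (NOT the authors' three- or six-SNP disease model): contrast the
# training correct-prediction rate of a simple cell-majority rule with its
# out-of-sample rate. Model, sizes, and names are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

def simulate(n, n_noise=3):
    """Binary outcome driven by the first two discrete variables; the rest are noise."""
    X = rng.integers(0, 3, size=(n, 2 + n_noise))
    p = np.where((X[:, 0] + X[:, 1]) % 3 == 0, 0.8, 0.3)  # assumed toy model
    y = rng.binomial(1, p)
    return X, y

def cell_rule(X_tr, y_tr):
    """Majority vote of the outcome within each observed cell of the chosen variables."""
    cells, inv = np.unique(X_tr, axis=0, return_inverse=True)
    inv = inv.ravel()
    votes = {tuple(c): int(round(y_tr[inv == j].mean())) for j, c in enumerate(cells)}
    default = int(round(y_tr.mean()))  # overall majority for cells unseen in training
    return lambda X: np.array([votes.get(tuple(r), default) for r in X])

X_tr, y_tr = simulate(500)
X_te, y_te = simulate(5000)
cols = [0, 1, 2]                       # a chosen variable set (includes one noisy variable)
rule = cell_rule(X_tr[:, cols], y_tr)
print("training rate:     ", (rule(X_tr[:, cols]) == y_tr).mean())
print("out-of-sample rate:", (rule(X_te[:, cols]) == y_te).mean())
```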

Acknowledgments

This research is supported by National Science Foundation Grant DMS-1513408.

Supporting Information

Supporting Information (PDF)

References

1
K Gransbo, et al., Chromosome 9p21 genetic variation explains 13% of cardiovascular disease incidence but does not improve risk prediction. J Intern Med 274, 233–240 (2013).
2
SL Zheng, et al., Cumulative association of five genetic variants with prostate cancer. N Engl J Med 358, 910–919 (2008).
6
LJ van’t Veer, et al., Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536 (2002).
7
Y Saeys, I Inza, P Larrañaga, A review of feature selection techniques in bioinformatics. Bioinformatics 23, 2507–2517 (2007).
8
T Hastie, R Tibshirani, J Friedman The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, 2nd Ed, New York, 2009).
9
I Guyon, A Elisseeff, An introduction to variable and feature selection. J Mach Learn Res 3, 1157–1182 (2003).
10
J Hua, WD Tembe, ER Dougherty, Performance of feature-selection methods in the classification of high-dimension data. Pattern Recogn 42, 409–424 (2009).
11
V Bolón-Canedo, N Sánchez-Maroño, A Alonso-Betanzos, A review of feature selection methods on synthetic data. Knowl Inform Syst 34, 483–519 (2013).
12
G James, D Witten, T Hastie, R Tibshirani An Introduction to Statistical Learning: With Applications in R (Springer, New York, 2014).

Information & Authors

Information

Published in

Proceedings of the National Academy of Sciences
Vol. 113 | No. 50
December 13, 2016
PubMed: 27911830


Submission history

Published online: November 29, 2016
Published in issue: December 13, 2016

Keywords

  1. prediction
  2. variable selection
  3. high-dimensional data
  4. predictivity


Notes

*This assumes that s_n²λ(1 − λ) → ∞ as n → ∞.
Here “predictive” refers to both high in I-score as well as having high correct prediction rates in k-fold cross-validation testing rates.
We note an inherent difficulty to presenting the reverse situation, that of finding the most significant variable sets in the breast cancer data and determining their predictivity rates. This is precisely because the PR approach allows for higher-order interaction searches, which is more difficult using current common approaches. Although it is possible to use common approaches to discover marginally significant variables, or possibly two-way interactions, and then determine their predictivity rates, capturing up to five-way (as shown in our presentation here using the PR approach) interactions is not yet feasible as of the date of this writing with current common approaches.
*“Unfortunately, the Cp, AIC, and BIC approaches are not appropriate in the high-dimensional setting, because estimating σ^2 (variance) is problematic. Similarly, problems arise in the application of the adjusted R^2 in the high-dimensional setting, because one can easily obtain a model with an adjusted R^2 value of 1” (12).
We use GWAS data to motivate our presentation of the I-score and PR method, but the approach applies to any data with discrete explanatory variables.
The PR method encompasses a BDA that is introduced in ref. 5; we directly cite and present the BDA in Supporting Information.
§The presentation of the BDA is taken directly from section 2.2 of ref. 5. For further details, see ref. 5.

Authors

Notes

1
To whom correspondence may be addressed. Email: slo@stat.columbia.edu, chernoff@stat.harvard.edu, or tz33@columbia.edu.
Author contributions: S.-H.L. initiated and oversaw the project; A.L., H.C., T.Z., and S.-H.L. designed research; A.L., H.C., T.Z., and S.-H.L. performed research; A.L., T.Z., and S.-H.L. analyzed data; and A.L., H.C., T.Z., and S.-H.L. wrote the paper.
Reviewers: D.L.B., Duke University; and M.Y., University of Wisconsin–Madison.

Competing Interests

The authors declare no conflict of interest.


    Figures

    Fig. 1.
    Illustration of the relationship between predictive and significant sets of variables. The rectangular space denotes all candidate variable sets. Significant sets are identified through traditional significance tests.
    Fig. 2.
    A three-SNP disease model.
    Fig. 3.
    Variable set size 3: Comparison of the training rate and the lower bound based on the I-score against the out-of-sample prediction rate. We compare two statistics, I-score lower bound and the training set prediction rate against the out-of-sample prediction rate. Lower bound from the I-score is provided in red, training set prediction rate in blue, and the out-of-sample prediction rate is in light blue. The thick black line in all six graphs is the true Bayes’ rate. All x axes correspond to variable sets (described in red for important variables and black for noisy ones) and all y axes correspond to (correct) prediction rate. There are three important variables in this example, X1, X2, and X3. The top row of graphs compares the (red) I-score statistics against the (light blue) out-of-sample prediction rate. The lower row of graphs compares the (dark blue) training set prediction rate against the (light blue) out-of-sample prediction rate. From left to right the graphs increase in sample size from 250 cases and 250 controls, to 500 cases and 500 controls, to 1,000 cases and 1,000 controls.
    Fig. S1.
    Variable set size 6: Comparison of the training rate and I-score against the out-of-sample prediction rate. Again we compare two statistics, I-score lower bound and the training set prediction rate against the out-of-sample prediction rate. Lower bound from the I-score is provided in red, training set prediction rate in blue, and the out-of-sample prediction rate is in light blue. The thick black line in all six graphs is the true Bayes’ rate. All x axes correspond to variable sets (described in red for important variables and black for noisy ones) and all y axes correspond to (correct) prediction rate. There are six important variables in this example, X1, X2, X3, X4, X5, and X6. The top row of graphs compares the (red) I-score statistics against the (light blue) out-of-sample prediction rate. The lower row of graphs compares the (dark blue) training set prediction rate against the (light blue) out-of-sample prediction rate. From left to right the graphs increase in sample size from 250 cases and 250 controls, to 500 cases and 500 controls, to 1,000 cases and 1,000 controls.

    Tables

    Table 1.
    Real data example: van’t Veer breast cancer data (6)
