Framework for making better predictions by directly estimating variables’ predictivity

Lo, Adeline; Chernoff, Herman; Zheng, Tian; Lo, Shaw-Hwa

doi:10.1073/pnas.1616647113

Framework for making better predictions by directly estimating variables’ predictivity

Adeline Lo https://orcid.org/0000-0001-5791-5541, Herman Chernoff slo@stat.columbia.edu, Tian Zheng https://orcid.org/0000-0003-4889-0391 slo@stat.columbia.edu, and Shaw-Hwa Lo slo@stat.columbia.eduAuthors Info & Affiliations

Contributed by Herman Chernoff, October 13, 2016 (sent for review June 4, 2016; reviewed by David L. Banks and Ming Yuan)

November 29, 2016

113 (50) 14277-14282

https://doi.org/10.1073/pnas.1616647113

PDF/EPUB

Significance

Good prediction, especially in the context of big data, is important. Common approaches to prediction include using a significance-based criterion for evaluating variables to use in models and evaluating variables and models simultaneously for prediction using cross-validation or independent test data. The first approach can lead to choosing less-predictive variables, because significance does not imply predictivity. The second approach can be improved through considering a variable’s predictivity as a parameter to be estimated. The literature currently lacks measures that do this. We suggest a measure that evaluates variables’ abilities to predict, the

I

-score. The

I

-score is effective in differentiating between noisy and predictive variables in big data and can be related to a lower bound for the correct prediction rate.

Abstract

We propose approaching prediction from a framework grounded in the theoretical correct prediction rate of a variable set as a parameter of interest. This framework allows us to define a measure of predictivity that enables assessing variable sets for, preferably high, predictivity. We first define the prediction rate for a variable set and consider, and ultimately reject, the naive estimator, a statistic based on the observed sample data, due to its inflated bias for moderate sample size and its sensitivity to noisy useless variables. We demonstrate that the

I

-score of the PR method of VS yields a relatively unbiased estimate of a parameter that is not sensitive to noisy variables and is a lower bound to the parameter of interest. Thus, the PR method using the

I

-score provides an effective approach to selecting highly predictive variables. We offer simulations and an application of the

I

-score on real data to demonstrate the statistic’s predictive performance on sample data. We conjecture that using the partition retention and

I

-score can aid in finding variable sets with promising prediction rates; however, further research in the avenue of sample-based measures of predictivity is much desired.

Prediction is a highly important goal for many scientists and has become increasingly difficult as the quantity and complexity of available data have grown. Complex and high-dimensional data particularly demand attention. However, the literature on prediction does not yet have a clear theoretical framework that allows for characterizing a variable’s predictivity directly [see A Brief Literature Review on VS for a brief review on the literature of variable selection (VS)]. Rather, VS for variable sets in the context of prediction is currently conducted in two common ways. The first is VS through identification of variables correlated with the outcome, measured through tests of statistical significance—such as the chi-square test. The second is through VS of variables that seem to do well in an independent set of test data, as measured through testing sample error rates. The first approach is still very much in use for predicting health outcomes (see ref. 1, among others) but its prediction performance has been disappointing (e.g., refs. 1 and 2). We show in our related work (3) how and why the popular filter approach of VS through statistical significance does not serve the purpose of prediction well. For an intuitive illustration of the relationship between predictive and significant sets of variables, see Fig. 1. Under a significance-test-based search setting, the set of variables found to be significant expands as the sample size grows (Fig. 1, widening orange dotted ovals). However, the set of predictive variables (Fig. 1, blue circle) is not susceptible to sample-size changes in the same way—because predictivity is a population parameter—and overlaps, but is not perfectly aligned with, significant sets. It is easy to see that in this scenario targeting significant sets may miss the goal of prediction entirely. Instead, we suggest that emphasis must be placed on designing measures that directly evaluate variable sets’ predictivity.

Fig. 1.

Illustration of the relationship between predictive and significant sets of variable sets. Rectangular space denotes all candidate variable sets. Significant sets are identified through traditional significance-tests.

Open in viewer

We show in ref. 3 that the first approach suffers from the problem that significant variables are not necessarily predictive, and vice versa, so targeting significant variables might miss the goal of VS for higher predictivity. This problem is prevalent in simple as well as complex data. The second way for VS sets aside testing (or validation) data to see how well selected predictors might do on “new data.” However, as is in the case of genome-wide association study (GWAS) data, researchers frequently lack large enough sample sizes for this approach to be efficient. Reuse of training data in the form of cross-validation is often adopted in practice.

We suggest that an alternative, and perhaps logical, approach to prediction should start with defining the theoretical prediction rates of a set of variables as a parameter of interest. It would be productive then to create measures designed to directly measure such a parameter, rather than relying on the estimated prediction rate by cross-validation. We call such an approach “variable set assessment,” or VSA. We hope that designing measures that directly estimate a variable set’s true ability to predict may prove to be both fruitful and efficient in the use of sample data for good prediction. Here, we propose such a prediction-based framework. Grounded in statistical theory, we highlight an avenue of research toward creating sensible measures that target highly predictive variable sets through assessing their predictivity directly. We emphasize genetic data, although we will show that the methods proposed are easily tailored to other high-dimensional data in the natural and social sciences.

A Brief Literature Review on VS

A related and extremely important literature is that of VS or feature selection, which refers to the practice of selecting a subset of an original group of variables that is later used to construct a model. Often VS is used on data of large dimensionality with modest sample sizes (7). In the context of high-dimensional data, such as GWAS, this dimensionality reduction can be a crucial step. VS approaches are commonly proposed to efficiently search for the best variable sets according to a specified criterion. Most performance measures are developed to maximize the probability of selecting the truly important variables but are not direct measures of predictivity. Therefore, popular VS approaches do not return reliable assessment of the predictivity of variable sets. In contrast, we will propose considering VSA through a reliable, model-free measure used to assess the potential predictivity of a variable set. Unlike projection- or compression-based approaches (such as principal component analysis or use of information theory), VSA methods do not change the variables themselves.

The types of approaches and tools developed for feature selection are both diverse and varying in degrees of complexity. However, there is general agreement that three broad categories of feature selection methods exist: filter, wrapper, and embedded methods. Filter approaches tend to select variables through ranking them by various measures (correlation coefficients, entropy, information gains, chi-square, etc.). Wrapper methods use “black box” learning machines to ascertain the predictivity of groups of variables; because wrapper methods often involve retraining prediction models for different variable sets considered, they can be computationally intensive. Embedded techniques search for optimal sets of variables via a built-in classifier construction. A popular example of an embedded approach is the LASSO method for constructing a linear model, which penalizes the regression coefficients, shrinking many to zero. Often cross-validation is used to evaluate the prediction rates.

Often, though not always, the goal of these approaches is statistical inference. When this is the case, the researcher might be interested in understanding the mechanism relating the explanatory variables with a response. Although inference is clearly important, prediction is an important objective as well. In this case, the goal of these VS approaches is in inferring the membership of variables in the “important set.” Various numerical criteria have been proposed to identify such variables [e.g., Akaike information criterion (AIC) and Bayesian information criterion (BIC), among others; see chapter 7 in ref. 8 for a review], which are associated with predictive performance under model assumptions made for the derivation of these criteria. However, these criteria were not designed to specifically correlate with predictivity. Indeed, we are unaware of a measure that directly attempts to evaluate a variable set’s theoretical level of predictivity. This paper proposes a model-free parameter for predictivity and its sample estimate. For a more comprehensive survey of the feature/VS literature see, among others, refs. 7, 9, 10, and 11.

Although a spectrum of VS approaches exists, many scientists have taken the approach of tackling prediction through the use of important and hard-to-discover influential variables found to be statistically significant in previous studies. When these efforts are in the context of high-dimensional data and alongside work investigating variables known to be influential, it might seem reasonable to hope that variables found to be significant can prove useful for predictive purposes as well. This approach is in some ways most similar to a univariate filter method, because it is independent of the classifier and has no cross-validation or prediction step for VS. We show in our related work (3) how and why the popular filter approach of VS through statistical significance does not serve the purpose of prediction well. For an intuitive illustration of the relationship between predictive and significant sets of variables, see Fig. 1. Under the context of a significance-test based search for variable sets, the set of variables found to be significant expands as the sample size grows (Fig. 1, widening orange dotted ovals). However, the set of predictive variables (Fig. 1, blue circle) are not susceptible to sample-size changes in the same way—because predictivity is a population parameter—and overlaps, but is not perfectly aligned with, significant sets. It is easy to see that in this scenario targeting significant sets may miss the goal of prediction entirely. Instead, we suggest that emphasis must be placed on designing measures that directly evaluate variable sets’ predictivity.

Many methods also use out-of-sample testing error rates or cross-validation to ascertain whether prediction is done well. This approach was not designed to specifically find a theoretically correct prediction rate for a given variable set; rather, it is simply a performance evaluation of future predictions from a pattern recognition technique on selected variable sets (trained on training data). Sometimes the variable sets in the training data are selected through statistics such as the adjusted R squared, AIC, or BIC. When

p ≫ n

(or even in instances where

p > n

), a standard in big data, however, these statistics can fail to be useful.^∗ Again, these criteria were not designed to be directly correlated with a given variable set’s predicitivity. Using out-of-sampling testing and/or cross-validation techniques additionally requires either setting aside valuable sample data to make sure the variable sets selected under the training set are indeed highly predictive and not just overfitting the data or is often computationally burdensome. It becomes important then that we have a good screening mechanism when conducting VSA for removing noisy variables (and thus finding predictive ones), even with constrained amounts of sample data. We show in our simulations how poorly we can do in VSA for prediction through training set compared with out-of-sample testing prediction rates (with “infinite” future testing data—a mostly unattainable, but ideal, scenario). An ideal measure for predictivity (or a good VSA measure) reflects a variable set’s predictivity. In doing so, it would also guide VSA through screening out noisy variables and should correlate well with the out-of-sample correct prediction rate. We present a potential candidate measure, the

I

-score, for evaluating the predictivity of a given variable set in this paper.

Toy Example

To highlight some of our key issues, consider a small artificial example. Suppose an observed variable

Y

is defined as

Y = {\begin{matrix} X_{1} + X_{2} (modulo 2) & with prob. 1 / 2, \\ X_{2} + X_{3} + X_{4} (modulo 2) & with prob. 1 / 2, \end{matrix}

[1]

where

X_{1}, X_{2}, X_{3}

and

X_{4}

are 4 of 50 observed and potentially influential variables

{X_{i}; 1 \leq i \leq 50}

. Each

X_{i}

can take values 0 and 1. A collection of discrete variables

S

may be regarded as a discrete variable that takes on a finite number of values. Each value defined by

S

constitutes a cell. The collection of all cells forms a partition,

Π_{S}

, based on the discrete variables in

S

. We also assume that the

X_{i}

were selected independently to be 1 with probability 0.5, again the simplest case without affecting the general results. Clearly, none of the individual

X_{i}

has a marginal effect on

Y

.

Scenario I. A statistician knows the model and wishes to compute which variable sets are predictive of

Y

, and how predictive, when

𝐗 = (X_{1}, X_{2}, \dots, X_{50})

is given. Because

Y

depends only on the first four

X

variables, it is obvious there are two clusters of variable sets

S_{1} = {X_{1}, X_{2}}

and

S_{2} = {X_{2}, X_{3}, X_{4}}

that are potentially useful in his prediction. We treat the highest correct prediction rate possible for a given variable set as an important parameter and call this predictivity (

θ_{c}

). Using the knowledge of the model, we can compute the predictivity for

S_{1}

as

θ_{c} (S_{1}) = 0.75

. The predictivity for

S_{2}

is

θ_{c} (S_{2}) = 0.75

also. Incidentally, the predictivity of the union of

S_{1}

and

S_{2}

,

θ_{c} (S_{1} \cup S_{2})

, is also 0.75.

The statistician realizes that using variable sets

S_{1}

and

S_{2}

he can predict

Y

correctly 75% of the time. This is indeed the case because, for instance, upon observing

𝐗 = (X_{1}, \dots, X_{50})

the statistician predicts

\hat{Y} = X_{1} + X_{2} (modulo 2) .

It is easy to verify that the strategy of predicting with

S_{1}

returns a 75% prediction accuracy in expectation. This is also the highest percent accuracy

S_{1}

can theoretically achieve. We discuss this in depth shortly. This result extends to

S_{2}

as well.

Scenario II. In practice, the statistician rarely has knowledge of the model and instead observes only the data. We suggest that the statistician use the partition retention (PR) approach and its corresponding

I

-score (which we present formally in Alternative Measure: I-Score; see ref. 4 for the original presentation of the approach or see Eq. S2 for

I_{Π_{𝐗}}

) to identify the influential variable sets. Suppose with 400 observations the researcher wishes to identify variable sets with high predictivity and to infer their abilities to predict. Using the PR approach he can use the

I

-score to screen for variable sets with high potential predictivity. In this example,

S_{1}

and

S_{2}

are consistently returned with the highest

I

-scores (23.71 and 12.79) in simulations. Using the inequality in Eq. 7, which we derive in the following section, the lower bounds for the predictivity of

θ_{c} (S_{1})

and

θ_{c} (S_{2})

are calculated to be 67 and 62%, respectively. Eq. 7 does not require knowledge of the true model as defined in Eq. 1.

Theoretical Prediction Rates

We contribute to the prediction literature by introducing the prediction rate as a parameter to be directly estimated. We show that the PR method’s

I

-score, a sample-based statistic, can be used to construct an asymptotically consistent lower bound for the prediction rate.

We deal here with the special case of case control studies where the explanatory variables are discrete, and the outcome variable takes only two values, case or control. These results are easily generalized for classification problems, where the dependent variable can take on a finite number of possible values. Consider GWAS data of the usual type, with cases and controls. Assume that there are

n_{d}

cases and

n_{u}

controls. Using the traditional Bayesian binary classification setting, we ideally have a prior probability,

π (w = d)

, that the state of the next individual,

w

, is a disease case,

d

, and

π (w = u) = 1 - π (w = d)

that the next individual is a control,

u

. In the following we shall assume that both

d

and

u

are equally likely and that the cost of an incorrect classification is the same for both possibilities. We generalize to different cost functions and priors for

d

and

u

in Generalization to Arbitrary Priors and Generalization to Different Loss and Cost Functions. Let the joint distribution of the feature value

𝐗

and

w

be

P (𝐱, w)

. The joint distribution can be expressed as

P (w, 𝐱) = π (w | 𝐱) \cdot P (𝐱) = P (𝐱 | w) \cdot π (w)

, where

π (w | 𝐱)

is the posterior distribution and

π (w)

is the prior. It is easy to see that the best classification rule can be derived by Bayes’ decision rule for minimizing the posterior probability of error:

d

if

π (d | 𝐱) > π (u | 𝐱)

, otherwise

u

. Here the variable set

𝐗 = (X_{1}, X_{2}, \dots, X_{m})

, with each

X_{i}

taking one of the values in

{0,1,2}

, corresponding to the three possible genotypes for each SNP. In this way,

𝐗

forms a partition, denoted by

Π_{X}

, with

3^{m} = m_{1}

elements:

Π_{𝐗} = {𝐗 = 𝐱_{j}, j = 1, \dots, m_{1} : 𝐱_{j} = (x_{j 1}, x_{j 2}, \dots, x_{j m}), x_{j k} \in {0,1,2}, 1 \leq k \leq m}

.

Assuming equal priors, that is,

π (d) = π (u) = \frac{1}{2}

, the correct prediction rate

θ_{c}

on

𝐗

using the full Bayes’ decision rule can be calculated as

θ_{c} (𝐗) = θ_{c} [p_{𝐗_{d}}, p_{𝐗_{u}}] = \frac{1}{2} \sum_{𝐱 \in Π_{𝐱}} \max {p_{𝐗_{d}} (𝐱), p_{𝐗_{u}} (𝐱)},

where

p_{𝐗_{d}} (𝐱)

and

p_{𝐗_{u}} (𝐱)

stand for

P (𝐱 | w = d)

and

P (𝐱 | w = u)

, respectively. We can easily derive (see Technical Notes, Technical Note 1)

θ_{c} [p_{𝐗_{d}}, p_{𝐗_{u}}] = \frac{1}{2} + \frac{1}{4} \sum_{j \in Π_{𝐗}} | P (j | d) - P (j | u) | .

[2]

This suggests that we can achieve better prediction rates by choosing variable sets corresponding to the probability pairs that lead to large values of

\sum_{j \in Π_{𝐗}} | P (j | d) - P (j | u) |

. In this theoretical setting, it is easy to show that

θ_{c}

increases or stays the same when another variable is added to the current variable set. This means adding many noisy variables leads to maintaining the same

θ_{c}

. Therefore, when sample size is no constraint, we are never hurt in our search for highly predictive variables by simply adding explanatory variables to our current set. However, in the realistic world of sample size constraints, a direct search for a variable set with a larger sample estimate of

θ_{c}

will fail; we offer a heuristic explanation as to why in the following section. We refer to this direct search of

θ_{c}

with sample data as the sample analog throughout.

Problems with the Sample Analog.

The value of

θ_{c}

is unknown and must be estimated. We may naturally turn to the naive sample estimate of its true theoretical values, which is sometimes referred to as the training rate. However, this estimated value of

θ_{c}

(where the cell probabilities are replaced by the observed proportions) is nondecreasing with the addition of more variables to a given variable set under evaluation. As the partition becomes increasingly finer, we reach a point where there is at maximum a single observation within each partition cell and 100% correct sample prediction rate is attained. This is true regardless of the true prediction rate. Then, the final estimated prediction rate is equivalent to 100%, rendering it useless as a method for finding predictive variable sets and screening out noisy ones. This is a direct result of a sparsity problem that does not occur in our theoretical world but certainly plagues the sample-size-constrained real world. (See Technical Notes, Technical Note 2 for a more detailed explanation.)We need instead a sample-based measure that can discern adding noisy versus influential variables and identify variable sets with large prediction rates for a given moderate sample size.

Alternative Measure: I-Score.

We consider this obstacle and suggest an alternative measure, a lower bound to

θ_{c}

, which we estimate using the

I

-score of the PR method (4) in sample data. The

I

-score converges asymptotically to a constant multiple of

θ_{I} (Π_{𝐗}) = \sum_{j \in Π_{𝐗}} {[P (j | d) - P (j | u)]}^{2} .

[3]

To relate

θ_{I}

to

θ_{c}

defined in Eq. 2, we first examine the following Lemma 1, which is derived in Technical Notes, Technical Note 3.

Lemma 1. For

K

real values

{z_{j}; 1 \leq i \leq K}

,

\sum_{j = 1}^{K} z_{j} = a

and

\sum_{j = 1}^{K} | z_{j} | = b

, we have

\sum_{j = 1}^{K} z_{j}^{2} \leq \frac{a^{2} + b^{2}}{2} .

[4]

In the case of

z_{j} = (P (j | d) - P (j | u))

for

j \in Π_{𝐗}

, we have

a = 0

. It then follows that

\sqrt{2 \sum_{i = 1}^{k} {[P (j | d) - P (j | u)]}^{2}} \leq \sum_{i = 1}^{k} | P (j | d) - P (j | u) | .

This suggests that a strategy seeking variable sets with larger values of

θ_{I}

can have the parallel effect of encouraging selection of variable sets with larger values of

θ_{c}

, yielding better predictors. In the following, we present Theorem 1 and Corollary 2 (see Technical Notes, Technical Note 5 and Technical Note 6 for proofs).

Theorem 1. Under the assumptions that

\frac{n_{d}}{n} \to λ

, a value strictly between 0 and 1, and

π (d) = π (u) = 1 / 2

, then

lim_{n \to \infty} \frac{s_{n}^{2} I_{Π_{𝐗}}}{n} \overset{𝒫}{=} λ^{2} {(1 - λ)}^{2} \sum_{j \in Π_{𝐗}} {[P (j | d) - P (j | u)]}^{2}

[5]

where

\overset{𝒫}{=}

indicates that the left-hand side converges in probability to the right-hand side and

s_{n}^{2} = n_{d} n_{u} / n^{2}

(see Technical Notes, Technical Note 5 for more detail).

We now show that

θ_{I}

defined in Eq. 3 is a parameter relevant to

θ_{c} (𝐗)

. Together with Lemma 1, we can use the

I

-score to derive a useful asymptotic lower bound to the prediction rate of a variable set

𝐗

,

θ_{c} (𝐗)

, as presented in Corollary 2.

Corollary 2. Under the assumptions in Theorem 1, the following is an asymptotic lower bound for the correct prediction rate:

θ_{c} (𝐗) \overset{𝒫}{\geq} \frac{1}{2} + \frac{1}{4} \sqrt{2 lim_{n \to \infty} \frac{I_{Π_{𝐗}}}{n λ (1 - λ)}} .

[6]

Using sample data, the estimated lower bound for

θ_{c}

is then

\frac{1}{2} + \frac{1}{4} \sqrt{\frac{2 I_{Π_{𝐗}}}{n λ (1 - λ)}} .

[7]

The lower bounds presented in the toy example were obtained using the above Eq. 7.

We extend to an arbitrary prior in Corollary 3 (see Generalization to Arbitrary Priors for discussion and proof).

Corollary 3. Under the assumptions of an arbitrary prior

π (d)

and

\frac{n_{d}}{n} \to λ

as

n \to \infty

, the correct prediction rate is

θ_{c}^{*} [p_{𝐗_{d}}, p_{𝐗_{u}}] = \frac{1}{2} + \frac{1}{2} \sum_{j \in Π_{𝐗}} | P (j | d) π (d) - P (j | u) π (u) | .

[8]

The last generalization of the proposed framework accounts for incurring different costs (or losses) when making incorrect predictions (see Generalization to Different Loss and Cost Functions for discussion). Note that searching for

𝐗

with larger

I

-scores is asymptotically equivalent to searching for larger values of the lower bound in Eq. 6 which is closely related to the correct predictivity of a given variable set

𝐗

,

θ_{c} (𝐗)

. For example, if a variable set

𝐗

has a large

I

-score (substantially larger than

1

; see ref. 4), it is a strong indication that

𝐗

itself could be a variable set with high predictivity. This stands in contrast to many current approaches to prediction [e.g., random forest and least absolute shrinkage and selection operator (LASSO)] that are evaluated for predictivity via cross-validation, which is computer-intensive.

Desirable Properties of the I-Score.

We note that the

I

-score is one possible approach to approximating the prediction rate in the sample analog form, and that the search for other potential scores is desirable and needed. Nevertheless, several properties of

I

are particularly appealing.

First,

I

requires no specification of a model for the joint effect of

{X_{1}, X_{2}, \dots, X_{m}}

on

Y

because it is designed to capture the discrepancy between the conditional means of

Y

on

{X_{1}, X_{2}, \dots, X_{m}}

and the mean of

Y

. Second, as mentioned earlier, the

I

-score does not monotonically increase with the addition of any and all variables as would the sample analog form of

θ_{c}

. Rather, given a variable set of size

m

with

m - 1

truly influential variables, the

I

-score is typically higher under the influential

m - 1

variables than under all

m

variables. If

m - 1

variables are influential in the sense that any smaller subset of variables is less influential, then removal of a variable to size

m - 2

will decrease the

I

-score in expectation. This natural tendency of the

I

-score to “peak” at variable set(s) that lead to high predictivity in the face of noisy variables under the current sample size is crucial.

Most important to note, we showed that the

I

-score can help find variables with high

θ_{c}

by identifying variables that have high values of

θ_{I}

(recall

θ_{I} = \sum_{j \in Π_{𝐗}} {[P (j | d) - P (j | u)]}^{2}

), which is related to the lower bound of

θ_{c}

. An important step to finding these highly predictive variable sets and discarding noisy ones through finding high

I

-scores is using the backward dropping algorithm (BDA) developed in ref. 4. The algorithm requires drawing many starting sets of variables and recursively dropping random variables and calculating

I

-scores. For more information, see ref. 4 or BDA.

Generalization to Arbitrary Priors

A problem that emerges when dealing with case-control data such as GWAS is that prior information on observing the next person as a disease case is unknown and not easily estimated from empirical data. Priors are defined by circumstances and contexts within which the case-control data are sampled—each dataset requires its own unique and unknown prior at that point in time.

Corollary 3. Under the assumptions of an arbitrary prior

π (d)

and

\frac{n_{d}}{n} \to λ

as

n \to \infty

, the correct prediction rate can be easily seen as

θ_{c}^{*} [p_{X_{d}}, p_{X_{u}}] = \frac{1}{2} + \frac{1}{2} \sum_{j \in Π_{X}} | P (j | d) π (d) - P (j | u) π (u) | .

Let the modified score

I_{Π_{n}}^{*}

be defined as

n s_{n}^{2} I_{Π_{n}}^{*} = \frac{1}{4} \sum_{j \in Π_{X}} n_{j}^{2} {[{\bar{y}}_{j} (\frac{π (d)}{λ}) - (1 - {\bar{y}}_{j}) (\frac{π (u)}{1 - λ})]}^{2} .

Then we have

lim_{n \to \infty} \frac{s_{n}^{2} I_{Π_{n}}^{*}}{n} \overset{𝒫}{=} \frac{1}{4} \sum_{j \in Π_{X}} {[P (j | d) π (d) - P (j | u) π (u)]}^{2} .

[S5]

Similar lower bounds to Corollary 2 can then be derived as

θ_{c}^{*} [p_{X_{d}}, p_{X_{u}}] = \frac{1}{2} + \frac{1}{2} \sum_{j \in Π_{X}} | P (j | d) π (d) - P (j | u) π (u) | \geq \frac{1}{2} + \frac{1}{2} \sqrt{lim_{n \to \infty} \frac{λ (1 - λ) I_{Π_{n}}^{*}}{2 n} - a^{2}}

[S6]

where

a = \sum_{j \in Π_{𝐗}} (P (j | d) π (d) - P (j | u) π (u)) = π (d) - π (u)

.

Similar to Corollary 1, Eq. S5 is a direct consequence of Eq. S6 and Lemma 1 (but with

z_{j}

replaced by

| P (j | d) π (d) - P (j | u) π (u) |

).

Generalization to Different Loss and Cost Functions

Thus far we have used a 0–1 loss on the binary classification problem. The 0–1 loss treats false negatives and false positives equally. In real applications, the scientist may wish to weigh the costs of different incorrect predictions differently. For instance, failing to detect a cancer patient may be deemed a more costly mistake to make than that of misclassifying a healthy patient because ameliorating the former mistake later on can be more difficult. The different cost amounts in making a loan decision is another example. The cost of lending to a defaulter may be seen as greater than that of the loss-of-business cost of declining a loan to a nondefaulter due to some positive level of risk aversion. Let loss function

L

be defined as

L (d, u) = l_{d}, L (u, d) = l_{u}

[S7]

and

L (d, d) = L (u, u) = 0

[S8]

where

l_{d}

and

l_{u}

are the prices paid (or losses incurred) for misclassifying a diseased individual to the healthy class or a healthy person to a diseased class, respectively. We can derive the optimum Bayes’ solution by minimizing the expected predicted loss, that is, to assign future observations to the class with less loss, given its

j

value. We simply assign a test sample with partition (predictor)

j

to

d

if

P (j | d) π (d) L (d, u) < P (j | u) π (u) L (u, d)

otherwise, assign to

u

. Equivalently, choose

d

if

P (j | d) π (d) l_{d} < P (j | u) π (u) l_{u}

otherwise

u .

In this way, the expected loss of adopting this rule is thus:

e^{l} = \frac{1}{2} \sum_{j \in Π_{𝐗}} \min {a_{j}, b_{j}},

where

a_{j} = P (j | d) π (d) l_{d}

and

b_{j} = P (j | u) π (u) l_{u}

. The random rule of classifying an individual to the healthy class or disease class has an expected loss of

γ = \frac{1}{2} \sum (a_{j} + b_{j}) = \frac{1}{2} (π (d) l_{d} + π (u) l_{u}),

a constant independent of the partition

Π_{x}

. The “gain” in

θ_{c}^{l}

(interpreted as less the expected loss of Bayes’ rule) can be defined as

θ_{c}^{l} = \frac{1}{2} \sum_{j \in Π_{𝐗}} \max {a_{j}, b_{j}} = \frac{1}{2} \sum_{j \in Π_{𝐗}} (a_{j} + b_{j}) - e^{l} = γ - e^{l} .

Because

γ

is independent of

X

and

Π_{X}

, it is desirable to search for

X

with larger

θ_{c}^{l}

to achieve better “gains.” Again we have

\begin{matrix} θ_{c}^{l} = \frac{γ}{2} + \frac{θ_{c}^{l} - e^{l}}{2} \\ = \frac{γ}{2} + \frac{1}{4} \sum_{j \in Π_{𝐗}} | a_{j} - b_{j} | \end{matrix}

After standardizing by

γ

, we obtain the improved prediction rate as

\begin{matrix} θ_{c} = \frac{θ_{c}^{l}}{γ} \\ = \frac{1}{2} + \frac{1}{4 γ} \sum_{j \in Π_{𝐗}} | a_{j} - b_{j} | \end{matrix}

Collecting the above discussion together, let the cost-based

I

-score

I_{Π_{𝐗}}^{c}

be defined as

n s_{n}^{2} I_{Π_{𝐗}}^{c} = \frac{1}{4 γ} \sum_{j \in Π_{𝐗}} n_{j}^{2} {[{\bar{y}}_{j} (\frac{π (d)}{λ}) l_{d} - (1 - {\bar{y}}_{j}) (\frac{π (u)}{1 - λ}) l_{u}]}^{2} \approx \frac{n^{2}}{4 γ} \sum_{j \in Π_{𝐗}} {[P (j | d) π (d) l_{d} - P (j | u) π (u) l_{u}]}^{2} .

[S9]

We present the following lower bound in Corollary 4. Let

\sum_{j \in Π_{𝐗}} (P (j | d) π (d) l_{2} - P (j | u) π (u) l_{1}) = π (d) l_{2} - π (u) l_{1} = a .

Corollary 4. Under the assumptions of Corollary 2 and using the loss function

L

described in Eqs. S7 and S8, then

lim_{n \to \infty} \frac{s_{n}^{2} I_{Π_{𝐗}}^{c}}{n} \overset{𝒫}{=} \frac{1}{4 γ} \sum_{j \in Π_{𝐗}} {[P (j | d) π (d) l_{d} - P (j | u) π (u) l_{u}]}^{2} .

[S10]

Furthermore, one can derive a similar lower bound for the correct prediction rate

θ_{c}

as

θ_{c} = \frac{1}{2} + \frac{1}{4 γ} \sum_{j \in Π_{𝐗}} | a_{j} - b_{j} | \overset{𝒫}{\geq} lim_{n \to \infty} (\frac{1}{2} + \frac{1}{4 γ} \sqrt{\frac{λ (1 - λ) I_{Π_{𝐗}}^{c}}{n} - a^{2}}) = \frac{1}{2} + \frac{1}{4 γ} \sqrt{lim_{n \to \infty} \frac{λ (1 - λ) I_{Π_{𝐗}}^{c}}{n} - a^{2}}

[S11]

The proofs for Eqs. S10 and S11 are quite similar to that for Corollary 3 given above; we shall omit them.

Technical Notes

Technical Note 1: Alternative Formulation of the Theoretical Prediction Rate.

Recall that the expected error of adopting the above Bayes’ decision rule (under a 0/1 loss) is

θ_{e} [p_{𝐗_{d}}, p_{𝐗_{u}}] = \frac{1}{2} \sum_{𝐱 \in Π_{𝐱}} \min {p_{𝐗_{d}} (𝐱), p_{𝐗_{u}} (𝐱)} .

The correct prediction rate

θ_{c}

on

𝐗

is defined as

θ_{c} (𝐗) = θ_{c} [p_{𝐗_{d}}, p_{𝐗_{u}}] = 1 - θ_{e} [p_{𝐗_{d}}, p_{𝐗_{u}}] = \frac{1}{2} \sum_{𝐱 \in Π_{𝐱}} \max {p_{𝐗_{d}} (𝐱), p_{𝐗_{u}} (𝐱)}

where

θ_{e}

is the error rate. For simplicity of presentation, we can represent the above as

θ_{c} = \frac{1}{2} \sum_{j \in Π_{𝐱}} \max {P (j | d), P (j | u)}

where

j

is short for

𝐱_{j}

, a cell in the partition

Π_{𝐗}

formed by the variables

𝐗

.

It is easy to show that

\frac{1}{2} {θ_{c} [p_{𝐗_{d}}, p_{𝐗_{u}}] - θ_{e} [p_{𝐗_{d}}, p_{𝐗_{u}}]} = θ_{c} [p_{𝐗_{d}}, p_{𝐗_{u}}] - \frac{1}{2} = \frac{1}{4} \sum_{j \in Π_{𝐗}} | P (j | d) - P (j | u) | .

Therefore,

θ_{c} [p_{𝐗_{d}}, p_{𝐗_{u}}] = \frac{1}{2} + \frac{1}{4} \sum_{j \in Π_{𝐗}} | P (j | d) - P (j | u) | .

Technical Note 2: Issue with Sample Analog of θ_c.

Suppose

𝐗_{m} = {X_{1}, \dots, X_{m}}

and

𝐗_{m + 1} = {X_{1}, \dots, X_{m}, X_{m + 1}}

. The partition formed by

𝐗_{m}

is

Π_{𝐗_{m}} = {A_{1}, \dots, A_{m_{1}}},

whereas the partition formed by

𝐗_{m + 1}

is

Π_{𝐗_{m + 1}} = {A_{1} \cap B, \dots, A_{m_{1}} \cap B, A_{1} \cap B^{c}, \dots, A_{m_{1}} \cap B^{c}} = {Π_{𝐗_{m}} \cap B, Π_{𝐗_{m}} \cap B^{c}}

where

B = {𝐗_{m + 1} = 1} .

Let

Π_{𝐗_{m}}^{1} = Π_{𝐗_{m}} \cap {X_{m + 1} = 1} and Π_{𝐗_{m}}^{0} = Π_{𝐗_{m}} \cap {𝐗_{m + 1} = 0},

where

Π_{𝐗_{m}}^{1}

and

Π_{𝐗_{m}}^{0}

form two subpartitions of

Π_{𝐗_{m + 1}}

, i.e.,

Π_{𝐗_{m + 1}} = Π_{𝐗_{m}}^{0} \cup Π_{𝐗_{m}}^{1}

. Then

| {\hat{p}}_{Π_{𝐗_{m}}} (d) - {\hat{p}}_{Π_{𝐗_{m}}} (u) | \leq | {\hat{p}}_{Π_{𝐗_{m}}^{0}} (d) - {\hat{p}}_{Π_{𝐗_{m}}^{0}} (u) | + | {\hat{p}}_{Π_{𝐗_{m}}^{1}} (d) - {\hat{p}}_{Π_{𝐗_{m}}^{1}} (u) |,

where

\hat{p} (\cdot)

is the sample estimator. We see that the sample analog inherently favors an increase in number of partition cells (i.e., adding more variables).

Technical Note 3: Proof of Lemma 1.

It is obvious that

| a | \leq b

. Let

S_{1}

be the sum of the positive values of

z_{j}

and

S_{2}

the sum of the negative values. Let

T_{1}

be the sum of the squares of the positive values and

T_{2}

the sum of the squares of the negative values. It follows that

S_{1} + S_{2} = a

and

S_{1} - S_{2} = b

and thus

S_{1} = (a + b) / 2

and

S_{2} = (a - b) / 2

. Then clearly

T_{1} \leq {S_{1}}^{2}

and

T_{2} \leq {S_{2}}^{2}

. Consequently,

\sum_{j = 1}^{K} {z_{j}}^{2} = T_{1} + T_{2} \leq {S_{1}}^{2} + {S_{2}}^{2} = \frac{a^{2} + b^{2}}{2}

[S1]

which is equivalent to the inequality in Eq. 4 and equality is attained when there are at most one positive and one negative component if

| a | < b

.

Technical Note 4: Technical Details on I-Score.

The influential score (

I

-score) is a statistic derived from the PR method. Several forms and variations were associated with the PR method before it was finally coined with this name in 2009 (4). We introduce the PR method and the

I

-score briefly here.^†

Consider a set of

n

observations of a disease phenotype

Y

(dichotomous or continuous) and a large number

S

of SNPs,

X_{1}, X_{2}, \dots, X_{S} .

Randomly select a small group,

m

, of the SNPs. Following the same notation as in previous sections, we call this small group

𝐗 = {X_{k}, k = 1, \dots, m}

. Recall that

X_{k}

takes values

0, 1,

and

2

(corresponding to three genotypes for a SNP locus: AA, A/B, and B/B). There are then

m_{1} = 3^{m}

possible values for

𝐗

’s. The

n

observations are partitioned into

m_{1}

cells according to the values of the

m

SNPs (

X_{k}

’s in

𝐗

), with

n_{j}

observations in the

j

th cell. We refer to this partition as

Π_{𝐗}

. The proposed

I

-score (denoted by

I_{Π_{𝐗}}

) is designed to place greater weight on cells that hold more observations:

I_{Π_{𝐗}} = \sum_{j = 1}^{m_{1}} \frac{n_{j}}{n} \cdot \frac{{({\bar{Y}}_{j} - \bar{Y})}^{2}}{s_{n}^{2} / n_{j}} = \frac{\sum_{j = 1}^{m_{1}} n_{j}^{2} {({\bar{Y}}_{j} - \bar{Y})}^{2}}{\sum_{i = 1}^{n} {(Y_{i} - \bar{Y})}^{2}}

[S2]

where

s_{n}^{2} = \frac{1}{n} \sum_{i = 1}^{n} {(Y_{i} - \bar{Y})}^{2}

. We note that the

I

-score is designed to capture the discrepancy between the conditional means of

Y

on

{X_{1}, X_{2}, \dots, X_{m}}

and the mean of

Y

.

In this paper, we consider the special problem of a case-control experiment where there are

n_{d}

cases and

n_{u}

controls and the variable

Y

is 1 for a case and 0 for a control. Then

s_{n}^{2} = (n_{d} n_{u}) / n^{2}

where

n = n_{d} + n_{u}

.

Technical Note 5: Proof of Theorem 1.

We prove that the

I

-score approaches a constant multiple of

θ_{I}

asymptotically.

Under the null hypothesis of no association between

𝐗 = {X_{k}, k = 1, \dots, m}

and

Y,

I_{Π_{𝐗}}

can be asymptotically expressed as

\sum_{j = 1}^{m_{1}} λ_{j} χ_{j}^{2}

(a weighted average), where

λ_{j}

is between 0 and 1 and

\sum_{j = 1}^{m_{1}} λ_{j}

is equal to

1 - \sum_{j = 1}^{m_{1}} {p_{j}}^{2}

, where

p_{j}

is the cell

j

’s probability.

{χ_{j}^{2}}

are

m_{1}

chi-squares, each with degree of freedom, df

= 1

(see ref. 4).

Furthermore, the above formulation and properties of

I_{Π_{𝐗}}

apply to the specified

Y

model with case-control study (where

Y = 1

designates case and

Y = 0

designates control) as demonstrated in ref. 4. More specifically, in a case-control study with

n_{d}

cases and

n_{u}

controls (letting

n = n_{d} + n_{u}

),

n s_{n}^{2} I_{Π_{𝐗}}

can be expressed as the following:

n s_{n}^{2} I_{Π_{𝐗}} = \sum_{j \in Π_{𝐗}} n_{j}^{2} {({\bar{Y}}_{j} - \bar{Y})}^{2} = \sum_{j \in Π_{𝐗}} {(n_{d, j}^{m} + n_{u, j}^{m})}^{2} {(\frac{n_{d, j}^{m}}{n_{d, j}^{m} + n_{u, j}^{m}} - \frac{n_{d}}{n_{d} + n_{u}})}^{2} = {(\frac{n_{d} n_{u}}{n_{d} + n_{u}})}^{2} \sum_{j \in Π_{𝐗}} {(\frac{n_{d, j}^{m}}{n_{d}} - \frac{n_{u, j}^{m}}{n_{u}})}^{2}

where

n_{d, j}^{m}

and

n_{u, j}^{m}

denote the numbers of cases and controls falling in

j

th cell, and

Π_{𝐗}

stands for the partition formed by

m

variables in

𝐗

. Since the PR method^‡ seeks the partition that yields larger

I

-scores, one can decompose the following:

n s_{n}^{2} I_{Π_{𝐗}} = \sum_{j \in Π_{𝐗}} n_{j}^{2} {({\bar{Y}}_{j} - \bar{Y})}^{2} = A_{n} + B_{n} + C_{n}

where

A_{n} = \sum_{j \in Π_{𝐗}} n_{j}^{2} {({\bar{Y}}_{j} - μ_{j})}^{2}

,

B_{n} = \sum_{j \in Π_{𝐗}} n_{j}^{2} {(\bar{Y} - μ_{j})}^{2}

, and

C_{n} = \sum_{j \in Π_{𝐗}} - 2 n_{j}^{2} ({\bar{Y}}_{j} - μ_{j}) (\bar{Y} - μ_{j})

. Here,

μ_{j}

and

μ

are the local and grand means of

Y

, that is,

E ({\bar{Y}}_{j}) = μ_{j}; \bar{Y} = μ = \frac{n_{d}}{n_{d} + n_{u}}

for fixed

n

. It is easy to see that both terms

A_{n}

and

C_{n}

, when divided by

n^{2}

converge to

0

in probability as

n \to \infty

. We turn to the final term,

B_{n}

. Note that

lim_{n} \frac{B_{n}}{n^{2}} \overset{𝒫}{=} lim_{n} \sum_{j \in Π_{𝐗}} (\frac{n_{j}^{2}}{n^{2}}) {(μ_{j} - μ)}^{2}

In a case-control study, we have

μ_{j} = \frac{n_{d} P (j | d)}{n_{d} P (j | d) + n_{u} P (j | u)}

and

μ = \frac{n_{d}}{n_{d} + n_{u}}

Because for every

j

,

\frac{n_{j}}{n}

converges (in probability) to

p_{j} = λ P (j | d) + (1 - λ) P (j | u)

as

n \to \infty

, if

{lim}_{n} \frac{n_{d}}{n} = λ

, a fixed constant between

0

and

1

, it follows that

\frac{B_{n}}{n^{2}} = \sum_{j \in Π_{𝐗}} (\frac{n_{j}^{2}}{n^{2}}) {(μ_{j} - μ)}^{2} \overset{𝒫}{\to} \sum_{j \in Π_{𝐗}} p_{j}^{2} {(\frac{λ P (j | d)}{λ P (j | d) + (1 - λ) P (j | u)} - λ)}^{2} as n \to \infty = \sum_{j \in Π_{𝐗}} {λ P (j | d) - λ [λ P (j | d) + (1 - λ) P (j | u)]}^{2} = \sum_{j \in Π_{𝐗}} {λ (1 - λ) P (j | d) - [λ (1 - λ) P (j | u)]}^{2} = λ^{2} {(1 - λ)}^{2} \sum_{j \in Π_{𝐗}} {[P (j | d) - P (j | u)]}^{2}

Thus, ignoring the constant term in the above equation, the

I

-score can guide a search for

X

partitions, which will lead to finding larger values of the summation term

\sum_{j \in Π_{𝐗}} {[P (j | d) - P (j | u)]}^{2}

. We have proven Theorem 1.

Technical Note 6: Proof of Corollary 2.

Under the assumptions in Theorem 1, the following is an asymptotic lower bound for the correct predictive rate:

θ_{c} (𝐗) \overset{𝒫}{\geq} \frac{1}{2} + \frac{1}{4} \sqrt{2 lim_{n \to \infty} \frac{I_{Π_{𝐗}}}{n λ (1 - λ)}} .

[S3]

Proof: From Eq. 2,

\begin{matrix} θ_{c} (𝐗) = \frac{1}{2} + \frac{1}{4} \sum_{j \in Π_{𝐗}} | P (j | d) - P (j | u) | \\ (Lemma 1) \geq \frac{1}{2} + \frac{1}{4} \sqrt{2 \sum_{j \in Π_{𝐗}} {(P (j | d) - P (j | u))}^{2}} = \frac{1}{2} + \frac{1}{4} \sqrt{2 θ_{I} (Π_{𝐗}}) \\ (Theorem 1) \overset{𝒫}{=} \frac{1}{2} + \frac{1}{4} \sqrt{2 lim_{n \to \infty} \frac{s_{n}^{2} I_{Π_{𝐗}}}{n λ^{2} {(1 - λ)}^{2}}} \overset{𝒫}{=} \frac{1}{2} + \frac{1}{4} \sqrt{2 lim_{n \to \infty} \frac{I_{Π_{𝐗}}}{n λ (1 - λ)}} . \end{matrix}

[S4]

■

The asymptotic lower bound of Eq. 2 is a simple consequence of Lemma 1 and Theorem 1. In theory, the above corollary allows us to apply a useful lower bound for identifying good variable sets with large

I

-scores. In practice, however, once the variable sets are found (through their large

I

-scores), the true prediction rates can be greater than the identified lower bounds. Theorem 1 provides a simple asymptotic behavior of the

I

-score under some strict assumptions. We offer similar derivations below following two levels of relaxations of the constraints.

We remark that with additional work one can show that the convergence given in Eq. 2 can be extended to be uniformly over all partitions

{Π}

with bounded number of cells and for all

λ

that stay away from 0 to 1.

BDA

The BDA^§ is a greedy algorithm to search for the variable subset that maximizes the

I

-score through stepwise elimination of variables from an initial subset sampled in some way from the variable space. The details are as follows.

i)

Training set: Consider a training set

{(y_{1}, x_{1}), \dots, (y_{n}, x_{n})}

of

n

observations, where

x_{i} = (x_{1 i}, \dots, x_{p i})

is a

p

-dimensional vector of explanatory variables. Typically

p

is very large. All explanatory variables are discrete.

ii)

Sampling from variable space: Select an initial subset of

k

explanatory variables

𝐗_{b} = {X_{b_{1}}, \dots, X_{b_{k}}}, b = 1, \dots, B

.

iii)

Compute I-score:

I (𝐗_{b}) = \sum_{j \in Π_{𝐗_{b}}} n_{j}^{2} {({\bar{Y}}_{j} - \bar{Y})}^{2}

.

iv)

Drop variables: Tentatively drop each variable in

𝐗_{b}

and recalculate the

I

-score with one variable less. Then drop the one that gives the highest

I

-score. Call this new subset

{𝐗'}_{b}

, which has one variable less than

𝐗_{b}

.

v)

Return set: Continue the next round of dropping on

{𝐗'}_{b}

until only one variable is left. Keep the subset that yields the highest

I

-score in the whole dropping process. Refer to this subset as the return set

𝐑_{b}

. Keep it for future use.

If no variable in the initial subset has influence on

Y

, then the values of

I

will not change much in the dropping process. However, when influential variables are included in the subset then the

I

-score will increase (decrease) rapidly before (after) reaching the maximum.

Using the I-Score in Sample-Constrained Settings

We have shown that

I / n

asymptotically approaches a constant multiple of

θ_{I}

(which is related to a lower bound of

θ_{c}

) and has several desirable properties. We take this opportunity to explore and illustrate an application of the

I

-score measuring predictivity with sample data. To provide additional evidence of the

I

-score’s ability to measure true predictivity, we consider a set of simulations for which we know the “true” levels of predictivity for all variable sets. We also provide a real data application on breast cancer for which the

I

-score approach has done very well in predicting.

We take a moment to comment that evaluating a variable set for predictivity, what we have called here VSA, is different from evaluating a given classifier, which is the prediction stage, usually following or in conjunction with VS. The latter considers evaluating

f (𝐱)

, a special function

f (\cdot)

applied to a particular set of explanatory variables

𝐱

, for a given outcome variable

y

, whereas the former considers the potential predictivity of the set of explanatory variables

𝐱

for that outcome

y

for all possible

f (\cdot)

. Our work here focuses simply on VSA. Variable sets assessed as highly predictive in our framework can then be flexibly used in various models for prediction purposes as pleases the researcher.

We are now in an odd situation where we have identified variable sets that could not have been found using conventional approaches and yet we wish to evaluate the predictivity of our identified variable sets against these conventional approaches. Nevertheless, we endeavor to do so. A couple options arise for approaches to compare against: training prediction rate and out-of-sample testing prediction rate. We will show that the

I

-score-based measure provides a useful and meaningful estimated lower bound to the correct prediction rate and correlates well with the out-of-sample test rate, whereas the training rate statistic, the sample analog of

θ_{c}

, does not. As such, our approach has an important benefit to prediction research: Compared with methods such as cross-validation of error rates, the

I

-score is efficient in the use of sample data, in the sense that it uses all observations instead of separating data into testing and training.

Simulations.

We offer simulations to illustrate how (i) the

I

-score can serve as a lower bound to the true predictivity of a given variable set even as noisy variables are adjoined, (ii) thereby serving as a screening mechanism, and (iii) finding the maximum

I

-score when conducting a BDA leads to finding the variable set with the highest corresponding level of predictivity. BDA reduces a variable set one variable at a time, by eliminating the weakest element until

I

reaches a peak.

We consider a module of three important variables

{X_{1}, X_{2}, X_{3}}

(see Fig. 2 for the disease model used) among six unimportant variables

{X_{7}, \dots, X_{12}}

using sample sizes of 250 cases/250 controls, 500 cases/500 controls, and 1,000 cases/1,000 controls. (See Simulation Details for more detailed model setting and simulation details.) We demonstrated that the

\frac{I}{n λ (1 - λ)}

estimates*

θ_{I}

, which is related to an asymptotic lower bound (Eq. 6) for

θ_{c}

, as

n \to \infty

. It would be helpful to see how

I

performs at fixed, reasonable sample sizes. We compare the

I

-score derived predictivity lower bound against the Bayes’ theoretical prediction rate in our simulations to illustrate this. The out-of-sample correct prediction rate is presented in the simulations here as a further benchmark against which the

I

-score can be compared when data are limited, as is the case in real-world applications. The out-of-sample correct prediction rate is derived from the most optimistic context achievable in the real world, whereby future testing data are infinite. In all of the simulations, the

I

-score of a set of influential variables drops when a noisy variable is added. This drop is subsequently seen in the

I

-score derived bound for the correct prediction rate. The

I

-score can screen out noisy variables, which makes it useful in practical data applications.

Fig. 2.

Open in viewer

To illustrate how these statistics fare in accurately capturing the level of predictivity of each variable set under consideration, we consider their performance given already having found

X_{2}

and

X_{3}

as important. We then add

X_{1}

, which should ideally correspond with an increase in the statistic. We continue adding the remaining noisy variables one at a time to this “good” set of variables and observe how the statistics evaluate the new, larger set of variables for predictivity. In Fig. 3, violin plots show distributions of training rate, the

I

-score lower bound, and the ideal out-of-sample prediction rate under each setting across the simulations. Theoretical Bayes’ rate is also plotted as a reference, which remains flat when noisy variables are added. This is because the Bayes’ rate is defined purely by the partition formed from the informative variables and does not change when adjoining noisy variables (

X_{7}, \dots, X_{12}

) and creating finer partitions.

Fig. 3.

Variable set size 3: Comparison of the training rate and the lower bound based on the I-score against the out-of-sample prediction rate. We compare two statistics,

I

-score lower bound and the training set prediction rate against the out-of-sample prediction rate. Lower bound from the

I

-score is provided in red, training set prediction rate in blue, and the out-of-sample prediction rate is in light blue. The thick black line in all six graphs is the true Bayes’ rate. All x axes correspond to variable sets (described in red for important variables and black for noisy ones) and all y axes correspond to (correct) prediction rate. There are three important variables in this example,

X_{1}

,

X_{2}

, and

X_{3}

. The top row of graphs compares the (red)

I

-score statistics against the (light blue) out-of-sample prediction rate. The lower row of graphs compares the (dark blue) training set prediction rate against the (light blue) out-of-sample prediction rate. From left to right the graphs increase in sample size from 250 cases and 250 controls, to 500 cases and 500 controls, to 1,000 cases and 1,000 controls.

Open in viewer

Several patterns emerge in these simulations. First, and most importantly, the

I

-score-derived prediction rate seems to be a reasonable lower bound to the Bayes’ rate. This holds even in moderate sample sizes.

The second pattern is that the estimated

I

-score lower bound peaks at the variable set that is inclusive of all influential variables (

X_{1}

,

X_{2}

, and

X_{3}

) and no additional noisy variables. This is a characteristic of the out-of-sample correct prediction rate as well. For instance, if we consider the top row of Fig. 3 and start from the right of the x axes in each of the three plots with the largest set of variables inclusive of both influential and noisy variables (

X_{1}, X_{2}, X_{3}, X_{7}, \dots, X_{12}

), continual removal of the noisy variables (sliding to the left of the x axis) until we reach the variable set (

X_{1}

,

X_{2}

,

X_{3}

) results in higher predictivity as measured by the

I

-score lower bound. We can note that the

I

-score lower bounds drop upon further removing the influential

X_{1}

variable from the set (

X_{1}, X_{2}, X_{3}

). Thus, the variable set that appears with the maximum

I

-score derived lower bound here both identifies the largest possible variable set of influential variables with no noisy variables and is also reflective of a conservative lower bound of the correct prediction rate for that variable set. We note that once we have found the variable sets with the highest

I

-scores and calculated the corresponding lower bound of the correct prediction rate, we can adjust this lower bound rate for its bias to derive an improved estimate of the correct prediction rate.

A third pattern is that the training rate suffers from overfitting when adjoining noisy variables even when the variable set includes a true influential subset of variables. If the variable set is irreducible, however, the training rate estimator reflects the Bayes’ correct prediction rate well; thus, the training rate estimator can perform reasonably well conditional on already identifying (

X_{1}, X_{2}, X_{3}

). The training rate estimator cannot be used to screen to that variable set first, however.

Finally, and as we might expect, the training set rate explodes due to overfitting in high dimensions as noisy variables are adjoined to the partition formed by the informative variables (

X_{1}

,

X_{2}

,

X_{3}

). Although the training set prediction rate seems to improve as the sample size increases, it cannot be used to screen out noisy variables, and is therefore difficult to use as a statistic to select highly predictive variable sets. The predictivity rates found through this statistic also dramatically depart from the out-of-sample testing rate. It tends to ever-optimistically evaluate variable sets for their future predictions even when noisy variables are added. This stands in stark contrast to the out-of-sample prediction rate because it lowers in prediction rate with the addition of useless variables. We notice that there is a trend that the

I

-score prediction rate does not remain flat. The score increases when removing a noisy variable and reducing to a variable set of only influential variables, indicating the additional advantage of the

I

-score as a lower bound; the

I

-score prefers a simpler model even when the Bayes’ rate remains the same, selecting for more parsimonious partitions that attain the Bayes’ rate, simultaneously a closer reflection of the out-of-sample prediction rate.

Recall the correct prediction rate is based on an absolute difference of probabilities summed over all

X

s. Suppose we start with influential variables only, with

θ_{c}^{*}

correct prediction rate, the highest we can attain out of all possible variable sets. Adding noisy variables to this set, variables that add no signal but simply create a finer partition, still returns

θ_{c}^{*}

. When estimating the correct prediction rate using sample data, though, the training estimate of

θ_{c}

value generally keeps increasing if noisy variables are added; the researcher does not know when to stop the search for influential variables, making selecting for highly predictive variables difficult. Ideally, we would like to “punish” adding such noisy variables to our variable set, so having a measure that balances between favoring coarser partitions but still recognizing actual new variables with strong enough signals (non noisy variables) is important. The

I

-score seems to support such an effect—preferring coarser partitions unless an additional variable (and therefore finer partition) provides enough signal in the data to justify keeping it.

Noisy variables in sample data may be indicative of actually noisy variables or influential variables with weak signals due to the sample size. Thus, we note there are cases where the

I

-score might not recognize these variables when their signals would require unrealistic sample sizes to be found through the measure. An example of this would be if a good predictor is highly complex (perhaps a combination of very many variables) and the observations are sparse in the partition. Because the

I

-score places greater weight on where the data tend to appear (note the

n_{i}^{2}

term in the score), when most of the partition cells contain no observations or at most one observation, this can often look like noise.

The main draw of the

I

-score is its ability to screen for influential variable sets. The variable sets inclusive of the three influential variables (

X_{1}

,

X_{2}

, and

X_{3}

) alone display the highest

I

-scores. Searching for variable sets with the highest

I

-scores thus tends to return highly influential variables only. Using the training prediction rate as a guiding measure for screening, however, would continually seek for ever-larger variable sets, regardless of whether they include noisy variables or not.

Real Data Application: van’t Veer Breast Cancer Data.

To reinforce the previous sections, we briefly analyze real disease data. As noted before, part of this research team has discovered that applying the PR approach to real disease data has not only been quite successful in finding variable sets (thus encompassing higher-order interactions, traditionally rather tricky in big data), but has also resulted in finding variable sets that are very predictive^† that do not necessarily show up as significant through traditional significance testing. We present one discovered variable set (a total of 18 variable sets were found in ref. 5) found to be highly predictive for a breast cancer dataset that is not highly significant using a chi-square test.^‡ In Table 1 we investigate the top, five-variable set (in this case five genes) found to be predictive through both top

I

-score and performance in prediction in cross-validation and an independent testing set in ref. 5. To find how significant these variables are, we calculate the individual, marginal association of each variable in the marginal P value. Given the familywise P value threshold of

6.98

×

10^{- 5}

, none of these variables seems statistically significant. Measuring the joint influence of all five variables is not significant either. Using the variable sets (all 18 in ref. 5) that seemed to have the highest

I

-scores to predict on this dataset resulted in an out-of-sample testing error rate of 8%, in direct comparison with the literature’s best error rates of 30%. Using only the variable set displayed in Table 1 and the lower bound in Eq. 6 we can calculate the asymptotic lower bound of the correct prediction rate for this variable set as 59%. Thus, using only this variable set alone, we can achieve at least a 59% correct classification rate at minimum. For details on the final predictors, see ref. 5.

Table 1.

Real data example: van’t Veer breast cancer data (6)

Gene number	Systematic name	Gene name	Marginal P value
1	Contig45347_RC	KIAA1683	0.008
2	NM_005145	GNG7	0.54
3	Z34893	ICAP-1A	0.15
4	NM_006121	KRT1	0.9
5	NM_004701	CCNB2	0.003
	Joint $I$ -score: 2.89	Joint P value: 0.005	Familywise threshold: $6.98$ × $10^{- 5}$

Table reused from ref. 3 [Lo et al.]. The data can be downloaded from ccb.nki.nl/data/.

Open in viewer

Simulation Details

The simulation is based on a six-SNP disease model. The six SNPs are organized into two three-SNP modules (

X_{1}

,

X_{2}

,

X_{3}

), and (

X_{4}

,

X_{5}

,

X_{6}

). Six additional variables (

X_{7}

, …,

X_{12}

) are simulated to be noisy and unrelated to the disease. The frequencies for the minor allele of each SNP are all 0.5. The risk of the disease for an individual depends on the two three-SNP genotypes of these two modules. Each module defines two sets of genotypes, high risk genotypes and low risk genotypes, identically depicted in Fig. 2. If an individual has two low risk genotypes, he has odds of

1 / 60

for having the disease. Here, odds is the ratio of the probability of an event occurring (disease) over the probability of the event not occurring (no disease). For an individual with one of the low-risk genotypes and one of the high-risk genotypes, the odds are increased to

1 / 10

. If an individual has high-risk genotypes for both modules, the odds become 1. In this section, we present results for the first module (

X_{1}

,

X_{2}

and

X_{3}

). In Fig. S1 we present results for both modules, or all six SNPs, together.

Fig. S1.

Variable set size 6: Comparison of the training rate and I-score against the out-of-sample prediction rate. Again we compare two statistics,

I

-score lower bound and the training set prediction rate against the out-of-sample prediction rate. Lower bound from the

I

-score is provided in red, training set prediction rate in blue, and the out-of-sample prediction rate is in light blue. The thick black line in all six graphs is the true Bayes’ rate. All x axes correspond to variable sets (described in red for important variables and black for noisy ones) and all y axes correspond to (correct) prediction rate. There are six important variables in this example,

X_{1}

,

X_{2}

,

X_{3}

,

X_{4}

,

X_{5}

, and

X_{6}

. The top row of graphs compares the (red)

I

score statistics against the (light blue) out-of-sample prediction rate. The lower row of graphs compares the (dark blue) training set prediction rate against the (light blue) out-of-sample prediction rate. From left to right the graphs increase in sample size from 250 cases and 250 controls, to 500 cases and 500 controls, to 1,000 cases and 1,000 controls.

Open in viewer

The data can take on three sample size levels: 250 cases/250 controls, 500 cases/500 controls, and 1,000 cases/1,000 controls. For each possible variable set we create a partition

Π

and calculate the

{\hat{p}}_{i d}

and

{\hat{p}}_{i u}

(the estimated probability that an individual in cell

j

is a case or a control), respectively:

\frac{n_{i d}}{n_{d}}

and

\frac{n_{i u}}{n_{u}}

where

i = 1... m

and

m = | Π |

where

| Π |

is the size of the partition

Π

. We conducted 300 simulations and evaluated a set of statistics on each of the variable sets for each simulation: the training prediction rate, Bayes’ prediction rate, out-of-sample prediction rate, and the

I

-score-derived lower bound estimate of the predictivity rate; see Fig. 3. Throughout, we assume prior probability of (0.5, 0.5) for case and control. The statistics are detailed below:

i)

Training prediction rate is defined as the following:

\frac{1}{2} \sum_{j = 1}^{m_{1}} \max ({\hat{p}}_{j d}, {\hat{p}}_{j u})

ii)

Bayes’ rate: Recall this rate is constant across all variable sets that are inclusive of the truly influential variables, regardless of how many noisy variables are also included. This is the best predictivity one can achieve if knowledge of the influential variables is available. It is defined as

\frac{1}{2} \sum_{j = 1}^{m_{1}} \max (p_{j d}, p_{j u})

iii)

Out-of-sample prediction rate: This is conducted on the “infinite” future data to find

p_{j d}

and

p_{j u}

for the rate. The “infinite” future data are often unrealistic with real data but we present it for the purposes of this simulation and to clearly provide a gold standard against which to compare. It is defined as

\sum_{j = 1}^{m} p_{j d} \cdot {\hat{Y}}_{j} + p_{j u} \cdot (1 - {\hat{Y}}_{j})

iv)

I

-score lower bound predictivity rate as defined from Eq. 7.

Concluding Remarks

Prediction has become more important in recent decades and, with it, the need for tools appropriate for good prediction. A first step can be to assess variable sets for predictivity, which we call VSA. We show in other work that assessment of variables from a statistical significance criterion to predict is not ideal (3). A currently popular alternative solution is to select variables via sample-based, out-of-sample testing error rates. This approach is ad hoc in nature, sample-based, and is not measuring some theoretical underlying level of predictivity for a given variable set. Often validation of selected candidate variable sets requiressetting aside valuable sample data in out-of-sample testing or cross-validation. Sometimes the sample size may not suffice for validating variable set sizes larger than one or two variables, as is often the case in big data like GWAS. Cross-validation avoids setting aside sample data as independent test sets but is computationally difficult in big data without using independent samples. As such, prediction research would benefit from a theoretical framework that directly defines a variable set’s predictivity as a parameter of interest to estimate. We believe our work here is a preliminary and important effort in that direction, by considering what theoretically highly predictive variable sets are, and how we might try to find them. In fact, using measures such as the

I

-score could be an important new direction in the prediction literature because it neither uses the training sample prediction rate nor does it require an artificial or ad hoc regularization choice.

We identify the equation for the theoretical correct predictivity of variable sets (

θ_{c}

) in Eq. 2 and then demonstrate that, unfortunately, the training estimate for it is quite useless. As such, we offer an alternative measure. We show that the

\frac{I}{n}

asymptotically approaches a lower bound to

θ_{I}

of Eq. 2 and is thus correlated with the correct predictivity rate of a given variable set. Importantly, we show that the

I

-score has a natural tendency to discard noisy variables, keep influential ones, and asymptotically approach this lower bound to

θ_{c}

. The

I

-score does well in identifying predictive variable sets in both our complex simulations as well as real data application.

We note that other measures with such desirable properties may also exist, and we encourage rigorous research in this direction. As a new field of inquiry, the search for measures that maximize predictivity may do much in the way of living up to the hopes of advancing predicting outcomes of interest, such as disease status. In some ways, this work is motivated by a practical consideration of finite samples. As noted in the setup of our framework, in a theoretical world of limitless data we can in fact find the variable sets with highest values of

θ_{c}

. However, our real world of finite sample sizes requires other sample-appropriate measures that may approximate but not achieve the

θ_{c}

. In other words, based on available sample size, the

I

-score, and any other such measure, detects not necessarily the maximum

θ_{c}

but some

θ_{c, n}^{H}

, the largest

θ_{c}

correct prediction rate for which the corresponding

X

variables can be selected given

n

. Consider a situation where the true set of variables

𝐗^{*}

that provide the theoretical maximum

θ_{c}^{*}

is very large. Suppose we have a sample of data that is quite modest. Selecting all variables

𝐗^{*}

is not possible given the sample size

n

(too many of the cell frequencies are small or zero) and so a measure such as the

I

-score retrieves a set

𝐗'

that provides potentially the largest

θ_{c}

achievable given the sample constraint. This in some ways mirrors the common issue of not detecting true effects when the sample size is too small in statistical significance testing.

We leave the important discussion of how to combine identified predictive variable sets in different final prediction models outside the scope of this paper.

Simulation Results for Important Variable Set of Size 6

Here we present simulation results in Fig. S1 for the six SNPs according to the model described in the main text. All other simulation parameters were the same as the three-SNP example.

Acknowledgments

This research is supported by National Science Foundation Grant DMS-1513408.

Supporting Information

Supporting Information (PDF)

Supporting Information

Download
552.99 KB

References

1

K Gransbo, et al., Chromosome 9p21 genetic variation explains 13% of cardiovascular disease incidence but does not improve risk prediction. J Intern Med 274, 233–240 (2013).

Crossref

PubMed

Google Scholar

2

SL Zheng, et al., Cumulative association of five genetic variants with prostate cancer. N Engl J Med 358, 910–919 (2008).

3

A Lo, H Chernoff, T Zheng, SH Lo, Why significant variables aren’t automatically good predictors. Proc Natl Acad Sci USA 112, 13892–13897 (2015).

Crossref

PubMed

Google Scholar

4

H Chernoff, SH Lo, T Zheng, Discovering influential variables: A method of partitions. Ann Appl Stat 3, 1335–1369 (2009).

Crossref

Google Scholar

5

H Wang, SH Lo, T Zheng, I Hu, Interaction-based feature selection and classification for high-dimensional biological data. Bioinformatics 28, 2834–2842 (2012).

Crossref

PubMed

Google Scholar

6

LJ van’t Veer, et al., Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536 (2002).

7

Y Saeys, I Inza, P Larrañaga, A review of feature selection techniques in bioinformatics. Bioinformatics 23, 2507–2517 (2007).

Crossref

PubMed

Google Scholar

8

T Hastie, R Tibshirani, J Friedman The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, 2nd Ed, New York, 2009).

Go to reference

Crossref

Google Scholar

9

I Guyon, A Elisseeff, An introduction to variable and feature selection. J Mach Learn Res 3, 1157–1182 (2003).

Go to reference

Google Scholar

10

J Hua, WD Tembe, ER Dougherty, Performance of feature-selection meth-ods in the classification of high-dimension data. Pattern Recogn 42, 409–424 (2009).

Go to reference

Crossref

Google Scholar

11

V Bolón-Canedo, N Sánchez-Maroño, A Alonso-Betanzos, A review of feature selection methods on synthetic data. Knowl Inform Syst 34, 483–519 (2013).

Go to reference

Crossref

Google Scholar

12

G James, D Witten, T Hastie, R Tibshirani An Introduction to Statistical Learning: With Applications in R (Springer, New York, 2014).

Go to reference

Google Scholar

Information & Authors

Information

Published in

Proceedings of the National Academy of Sciences

Vol. 113 | No. 50
December 13, 2016

PubMed: 27911830

Classifications

Submission history

Published online: November 29, 2016

Published in issue: December 13, 2016

Keywords

Acknowledgments

This research is supported by National Science Foundation Grant DMS-1513408.

Notes

*This assumes that

s_{n}^{2} ⟶ λ (1 - λ) as n ⟶ \infty .

†

Here “predictive” refers to both high in

I

-score as well as having high correct prediction rates in k-fold cross-validation testing rates.

‡

We note an inherent difficulty to presenting the reverse situation, that of finding the most significant variable sets in the breast cancer data and determining their predictivity rates. This is precisely because the PR approach allows for higher-order interaction searches, which is more difficult using current common approaches. Although it is possible to use common approaches to discover marginally significant variables, or possibly two-way interactions, and then determine their predictivity rates, capturing up to five-way (as shown in our presentation here using the PR approach) interactions is not yet feasible as of the date of this writing with current common approaches.

∗

*“Unfortunately, the

C_{p}

, AIC, and BIC approaches are not appropriate in the high-dimensional setting, because estimating

{\hat{σ}}^{2}

(variance) is problematic. Similarly, problems arise in the application of the adjusted

R^{2}

in the high-dimensional setting, because one can easily obtain a model with an adjusted

R^{2}

value of 1” (12).

†

We use GWAS data to motivate our presentation of the I-score and PR method, but the approach applies to any data with discrete explanatory variables.

‡

The PR method encompasses a BDA that is introduced in ref. 5; we directly cite and present the BDA in Supporting Information.

§

The presentation of the BDA is taken directly from section 2.2 of ref. 5. For further details, see ref. 5.

Authors

Affiliations

Adeline Lo https://orcid.org/0000-0001-5791-5541

Herman Chernoff¹ slo@stat.columbia.edu

Tian Zheng¹ https://orcid.org/0000-0003-4889-0391 slo@stat.columbia.edu

Shaw-Hwa Lo¹ slo@stat.columbia.edu

Notes

1

To whom correspondence may be addressed. Email: slo@stat.columbia.edu, chernoff@stat.harvard.edu, or tz33@columbia.edu.

Author contributions: S.-H.L initiated and oversaw the project; A.L., H.C., T.Z., and S.-H.L. designed research; A.L., H.C., T.Z., and S.-H.L. performed research; A.L., T.Z., and S.-H.L. analyzed data; and A.L., H.C., T.Z., and S.-H.L. wrote the paper.

Reviewers: D.L.B., Duke University; and M.Y., University of Wisconsin–Madison.

Competing Interests

The authors declare no conflict of interest.

Metrics & Citations

Metrics

Note: The article usage is presented with a three- to four-day delay and will update daily once available. Due to ths delay, usage data will not appear immediately following publication. Citation information is sourced from Crossref Cited-by service.

Citation statements

22

0

37

0

Smart Citations

22

0

37

0

Citing PublicationsSupportingMentioningContrasting

View Citations

See how this article has been cited at scite.ai

scite shows how a scientific paper has been cited by providing the context of the citation, a classification describing whether it supports, mentions, or contrasts the cited claim, and a label indicating in which section the citation was made.

Altmetrics

See more details

Citations

If you have the appropriate software installed, you can download article citation data to the citation manager of your choice. Simply select your manager software from the list below and click Download.

Cited by

View Options

View options

PDF format

Download this article as a PDF file

DOWNLOAD PDF

Media

Figures

Fig. 1.

Go to Figure Open in Viewer

Fig. 2.

Go to Figure Open in Viewer

Fig. 3.

Variable set size 3: Comparison of the training rate and the lower bound based on the I-score against the out-of-sample prediction rate. We compare two statistics,

I

-score lower bound and the training set prediction rate against the out-of-sample prediction rate. Lower bound from the

I

-score is provided in red, training set prediction rate in blue, and the out-of-sample prediction rate is in light blue. The thick black line in all six graphs is the true Bayes’ rate. All x axes correspond to variable sets (described in red for important variables and black for noisy ones) and all y axes correspond to (correct) prediction rate. There are three important variables in this example,

X_{1}

,

X_{2}

, and

X_{3}

. The top row of graphs compares the (red)

I

-score statistics against the (light blue) out-of-sample prediction rate. The lower row of graphs compares the (dark blue) training set prediction rate against the (light blue) out-of-sample prediction rate. From left to right the graphs increase in sample size from 250 cases and 250 controls, to 500 cases and 500 controls, to 1,000 cases and 1,000 controls.

Go to Figure Open in Viewer

Fig. S1.

Variable set size 6: Comparison of the training rate and I-score against the out-of-sample prediction rate. Again we compare two statistics,

I

-score lower bound and the training set prediction rate against the out-of-sample prediction rate. Lower bound from the

I

-score is provided in red, training set prediction rate in blue, and the out-of-sample prediction rate is in light blue. The thick black line in all six graphs is the true Bayes’ rate. All x axes correspond to variable sets (described in red for important variables and black for noisy ones) and all y axes correspond to (correct) prediction rate. There are six important variables in this example,

X_{1}

,

X_{2}

,

X_{3}

,

X_{4}

,

X_{5}

, and

X_{6}

. The top row of graphs compares the (red)

I

score statistics against the (light blue) out-of-sample prediction rate. The lower row of graphs compares the (dark blue) training set prediction rate against the (light blue) out-of-sample prediction rate. From left to right the graphs increase in sample size from 250 cases and 250 controls, to 500 cases and 500 controls, to 1,000 cases and 1,000 controls.

Go to Figure Open in Viewer

Tables

Table 1.

Real data example: van’t Veer breast cancer data (6)

Go to Table Open in Viewer

Other

References

1

K Gransbo, et al., Chromosome 9p21 genetic variation explains 13% of cardiovascular disease incidence but does not improve risk prediction. J Intern Med 274, 233–240 (2013).

Crossref

PubMed

Google Scholar

2

SL Zheng, et al., Cumulative association of five genetic variants with prostate cancer. N Engl J Med 358, 910–919 (2008).

3

A Lo, H Chernoff, T Zheng, SH Lo, Why significant variables aren’t automatically good predictors. Proc Natl Acad Sci USA 112, 13892–13897 (2015).

Crossref

PubMed

Google Scholar

4

H Chernoff, SH Lo, T Zheng, Discovering influential variables: A method of partitions. Ann Appl Stat 3, 1335–1369 (2009).

Crossref

Google Scholar

5

H Wang, SH Lo, T Zheng, I Hu, Interaction-based feature selection and classification for high-dimensional biological data. Bioinformatics 28, 2834–2842 (2012).

Crossref

PubMed

Google Scholar

6

LJ van’t Veer, et al., Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536 (2002).

7

Y Saeys, I Inza, P Larrañaga, A review of feature selection techniques in bioinformatics. Bioinformatics 23, 2507–2517 (2007).

Crossref

PubMed

Google Scholar

8

T Hastie, R Tibshirani, J Friedman The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, 2nd Ed, New York, 2009).

Go to reference

Crossref

Google Scholar

9

I Guyon, A Elisseeff, An introduction to variable and feature selection. J Mach Learn Res 3, 1157–1182 (2003).

Go to reference

Google Scholar

10

J Hua, WD Tembe, ER Dougherty, Performance of feature-selection meth-ods in the classification of high-dimension data. Pattern Recogn 42, 409–424 (2009).

Go to reference

Crossref

Google Scholar

11

V Bolón-Canedo, N Sánchez-Maroño, A Alonso-Betanzos, A review of feature selection methods on synthetic data. Knowl Inform Syst 34, 483–519 (2013).

Go to reference

Crossref

Google Scholar

12

G James, D Witten, T Hastie, R Tibshirani An Introduction to Statistical Learning: With Applications in R (Springer, New York, 2014).

Go to reference

Google Scholar

Featured Topics

Articles By Topic

Featured Topics

Articles By Topic

Featured Topic

Articles By Topic

Significance

Abstract

Sign up for PNAS alerts.

A Brief Literature Review on VS

Toy Example

Theoretical Prediction Rates

Problems with the Sample Analog.

Alternative Measure: I-Score.

Desirable Properties of the I-Score.

Generalization to Arbitrary Priors

Generalization to Different Loss and Cost Functions

Technical Notes

Technical Note 1: Alternative Formulation of the Theoretical Prediction Rate.

Technical Note 2: Issue with Sample Analog of θc.

Technical Note 3: Proof of Lemma 1.

Technical Note 4: Technical Details on I-Score.

Technical Note 5: Proof of Theorem 1.

Technical Note 6: Proof of Corollary 2.

BDA

Using the I-Score in Sample-Constrained Settings

Simulations.

Real Data Application: van’t Veer Breast Cancer Data.

Simulation Details

Concluding Remarks

Simulation Results for Important Variable Set of Size 6

Acknowledgments

Supporting Information

References

Information

Published in

Classifications

Submission history

Keywords

Acknowledgments

Notes

Authors

AffiliationsExpand All

Adeline Lo https://orcid.org/0000-0001-5791-5541

Herman Chernoff1 slo@stat.columbia.edu

Tian Zheng1 https://orcid.org/0000-0003-4889-0391 slo@stat.columbia.edu

Shaw-Hwa Lo1 slo@stat.columbia.edu

Notes

Competing Interests

Metrics

Citation statements

Altmetrics

Citations

Cited by

View options

PDF format

Figures

Tables

Other

Share

Share article link

Share on social media

References

Further reading in this issue

Sign up for thePNAS Highlights newsletter

Technical Note 2: Issue with Sample Analog of θ_c.

Affiliations

Herman Chernoff¹ slo@stat.columbia.edu

Tian Zheng¹ https://orcid.org/0000-0003-4889-0391 slo@stat.columbia.edu

Shaw-Hwa Lo¹ slo@stat.columbia.edu