Rediscovery rate

STEP 1. Select the type of analysis

You can upload your own data or use the available example (the same data used in the paper). You should also specify whether the outcome is dichotomous or continuous. If you use the example, the data files contain test statistics from a case-control study. See the first few lines in the output box panel.

STEP 1-a. Upload your data

You can upload data from the training set ONLY or from BOTH the training and the validation set. Uploading the validation dataset allows the calculation of the observed RDR and observed vFDR. The validation set should contain the same number of features as the training set. The data should follow one of these two layouts: A. A single column of t-statistics; B. Two columns, one containing coefficient (beta) values and the other containing standard errors. Columns should be space separated.
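As an illustration of the two layouts, here is a minimal Python sketch that turns either layout into a list of t-statistics. The function name `read_t_statistics` is hypothetical, and dividing beta by its standard error is an assumption about how the two-column layout is converted; the sketch also assumes the file has no header row.

```python
def read_t_statistics(text):
    """Return one t-statistic per non-empty line of space-separated input.

    Layout A: one column holding the t-statistic itself.
    Layout B: two columns, beta and its standard error; the statistic is
    taken as beta / se (an assumption made for this sketch).
    """
    t_values = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) == 1:
            t_values.append(float(fields[0]))
        elif len(fields) == 2:
            beta, se = float(fields[0]), float(fields[1])
            t_values.append(beta / se)
        else:
            raise ValueError("expected 1 or 2 columns, got %d" % len(fields))
    return t_values


layout_a = "2.5\n-1.0\n"
layout_b = "0.5 0.25\n-1.0 0.5\n"
print(read_t_statistics(layout_a))
print(read_t_statistics(layout_b))
```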

STEP 1-b. Select data format

Here you can select what your data look like:

STEP 1-c. Determine the number of components for the mixture model

You can manually select the number of components or you can check the box 'Select nq automatically' and the program will calculate the number of components automatically using AIC. See also the histogram on the right for more info.
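The automatic choice can be pictured as follows: fit the mixture with each candidate number of components, compute AIC = 2k - 2 log L for each fit, and keep the candidate with the smallest AIC. The Python sketch below shows only this selection step. The log-likelihoods and parameter counts are made-up illustrative numbers, not output of the program, and the function names are hypothetical.

```python
def aic(log_likelihood, n_parameters):
    """Akaike information criterion: 2k - 2 log L (lower is better)."""
    return 2 * n_parameters - 2 * log_likelihood


def select_n_components(fits):
    """Pick the number of components with the smallest AIC.

    fits maps a candidate number of components to a tuple
    (log_likelihood, n_parameters) from a model fitted elsewhere.
    """
    return min(fits, key=lambda nq: aic(*fits[nq]))


# Hypothetical fits: the values below are invented for illustration only.
fits = {
    1: (-1500.0, 1),
    2: (-1420.0, 3),
    3: (-1419.5, 5),
}
best = select_n_components(fits)
print(best)
```

In this invented example the third component improves the log-likelihood too little to offset its extra parameters, so the two-component fit has the lowest AIC.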

STEP 2. Calculate the RDR or the sample size needed for a given RDR

You can fix the sample size and get the expected RDR or fix the RDR and get the expected sample size.

STEP 2-a. Set the sample sizes

Set the sample size in the training and validation sets.

STEP 2-a. Set the sample size and RDR

Set the sample size in the training set, the targeted value of RDR, and the ratio of cases to controls in the validation set.

STEP 2-a. Set the sample size and RDR

Set the sample size in the training set and the targeted value of RDR.

STEP 3. Decide the critical values

Define the significance thresholds (on the -log10 P-value scale) used to select features for the validation study (c.t.) and to determine which features are validated (c.v.).
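Because the thresholds are entered on the -log10 scale, a critical value c corresponds to the P-value cutoff 10^(-c); for instance, c.t. = 3 means P-value < 0.001. A minimal Python sketch of that conversion and of the selection step follows. The function names and the example P-values are hypothetical.

```python
def threshold_to_p(c):
    """Convert a critical value on the -log10 scale to a P-value cutoff."""
    return 10.0 ** (-c)


def select_features(p_values, c):
    """Return the indices of features whose P-value is below the cutoff."""
    cutoff = threshold_to_p(c)
    return [i for i, p in enumerate(p_values) if p < cutoff]


# Invented training-set P-values, for illustration only.
p_train = [0.0005, 0.02, 0.0009, 0.5]
print(threshold_to_p(3))
print(select_features(p_train, 3))
```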

STEP 4. Plot

Plot the RDR graph and choose the components to visualize. You can also choose whether to visualize the measures as a function of -log10(P-value) in the training or the validation set.

Plot: y-variable

STEP 4. Plot

Plot the RDR graph as a function of sample size and choose the components to visualize.

Plot: y-variable

Plot: x-variable

Rediscovery rate estimation for assessing the validation of significant findings in high-throughput studies

We introduced two measures: the rediscovery rate (RDR) and the false discovery rate in a validation population (vFDR).

- The RDR is the expected proportion of findings validated among those declared significant in the training sample.

- The vFDR is the expected proportion of false validated features among all those taken forward in the validation study.

RDR and vFDR can be estimated using the training sample alone. They can also be calculated using both the training and the validation samples (if available); in this case they are called the observed RDR and observed vFDR.


In the example (in STEP 1 select 'Use example') I select all the features from the training set with a P-value < 0.001 [c.t.=-log10(P-value)=3] to be taken forward in the validation set. In the validation set, I declare significant and validated all the features with a P-value < 0.1 [c.v.=-log10(P-value)=1].

By using these settings we expect 80% (RDR=0.80) of the features taken forward to validation to be validated (i.e. having a P-value < 0.1 in the validation set). The number of false positives among the features taken forward to validation approaches 0 (vFDR=0).

Since we collected a validation set and also tested all the features in the validation set, we can calculate the observed RDR and observed vFDR. They are 0.79 and 0, respectively. These values are similar to those estimated using the training sample alone, indicating that the inference is correct.
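To make the observed RDR concrete, the following Python sketch counts, among the features selected in the training set, the fraction that also pass the validation threshold. It is a simplified illustration, not the program's implementation: it uses P-values only and ignores the direction of the effect, and the function name and the P-values are invented. The observed vFDR is not shown because it cannot be computed from P-values alone in this simplified setting.

```python
def observed_rdr(p_train, p_valid, c_t, c_v):
    """Fraction of training-selected features that are also validated.

    A feature is selected if its training P-value is below 10**(-c_t)
    and validated if its validation P-value is below 10**(-c_v).
    Returns None when no feature is selected.
    """
    selected = [i for i, p in enumerate(p_train) if p < 10.0 ** (-c_t)]
    if not selected:
        return None
    validated = sum(1 for i in selected if p_valid[i] < 10.0 ** (-c_v))
    return validated / len(selected)


# Invented P-values for five features, in the same order in both sets.
p_train = [0.0001, 0.0005, 0.2, 0.0002, 0.00001]
p_valid = [0.01, 0.5, 0.01, 0.05, 0.09]
print(observed_rdr(p_train, p_valid, c_t=3, c_v=1))
```

With these invented numbers, four features are selected and three of them are validated, giving an observed RDR of 0.75.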


The t-mixture approach is NOT well suited for GWAS data or any other data where the proportion of true null hypotheses is close to 1.
Reference: Andrea Ganna, Donghwan Lee, Erik Ingelsson and Yudi Pawitan. Rediscovery rate estimation for assessing the validation of significant findings in high-throughput studies, Briefings in Bioinformatics 2014.

RDR and vFDR


        

RDR plot

Histogram

If the density function closely follows the histogram, then the distribution of t-statistics in the training set is well fitted by the t-mixture distribution.

Estimates of t-Mixture