The CHEK2 challenge

Background

Variants in the ATM and CHEK2 genes are associated with breast cancer. For this experiment, predictors are asked to estimate the probability that an individual carrying a given variant belongs to the case (cancer) cohort rather than the control (healthy) cohort. The available data come from targeted resequencing of two genes (ATM and CHEK2) in approximately 1250 breast cancer cases and 1250 controls. The ATM sequencing results have already been published and will thus serve as an example set (Tavtigian et al., 2009, American Journal of Human Genetics. doi: 10.1016/j.ajhg.2009.08.018).

Dataset

Predictors will be provided with 41 rare missense, nonsense, splicing, and indel variants in CHEK2.

Prediction challenge

Predictors are asked to classify variants as occurring in cases or controls. For each variant, predictors will provide their estimate of the probability that an individual carrying it is in the case set. The control probability is implicitly 1 – P(case). The correctness of each prediction will be weighted according to:

(a) how accurately P(case) was predicted

(b) the confidence measure provided

(c) the number of study participants with the variant.

While the prediction for a single individual may not be meaningful in all cases, the sum across all predictions should give an informative measure of prediction accuracy. In addition, we ask predictors to submit the raw output of their prediction algorithm.

Baseline

Baseline predictions are restricted to single-residue mutations. They are based on a statistical analysis of the correlation between mutation type and disease, computed from the annotation data in the July 2010 release of UniProtKB.

The probability that a mutation X-->Y is found in cases is computed as the ratio between the number of disease-related X-->Y mutations and the total number of X-->Y mutations in that data set.

Standard deviations are evaluated with the binomial approximation.
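
The baseline calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name baseline_case_probability, the (wild_type, mutant, is_disease) record layout, and the toy records are hypothetical and are not taken from UniProtKB, and "binomial approximation" is read here as the standard deviation of a binomial proportion, sqrt(p(1 - p)/n).

```python
import math
from collections import Counter


def baseline_case_probability(records):
    """Estimate P(case) and its standard deviation per substitution type.

    records: iterable of (wild_type, mutant, is_disease) tuples.
    This layout is a hypothetical stand-in for annotated substitutions.
    Returns a dict mapping (wild_type, mutant) to (probability, std_dev).
    """
    total = Counter()    # all X-->Y mutations seen
    disease = Counter()  # X-->Y mutations annotated as disease-related
    for wild_type, mutant, is_disease in records:
        total[(wild_type, mutant)] += 1
        if is_disease:
            disease[(wild_type, mutant)] += 1

    estimates = {}
    for pair, n in total.items():
        p = disease[pair] / n
        # Assumption: standard deviation of a binomial proportion.
        std_dev = math.sqrt(p * (1.0 - p) / n)
        estimates[pair] = (p, std_dev)
    return estimates


# Toy, made-up records for illustration only (not real annotation data).
toy_records = [
    ("R", "W", True), ("R", "W", True), ("R", "W", True), ("R", "W", False),
    ("A", "V", True), ("A", "V", False),
]
estimates = baseline_case_probability(toy_records)
for (wild_type, mutant), (p, std_dev) in sorted(estimates.items()):
    print(f"{wild_type}-->{mutant}: P(case) = {p:.2f} +/- {std_dev:.2f}")
```

With these toy records, R-->W is disease-related in 3 of 4 occurrences, so its estimated probability is 0.75 with a standard deviation of about 0.22. Substitution types with few occurrences get a correspondingly wide uncertainty.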

Data provided by

Sean Tavtigian, University of Utah