PredictDB Data Repository



Welcome

Welcome to PredictDB! The files hosted here can be used with the Im Lab's software PrediXcan and MetaXcan. Navigate below to download one or more tissue models.

Not sure where to start? If you are new to PrediXcan and MetaXcan, we recommend starting your analyses with the DGN-HapMap Whole Blood model or one of the various GTEx-HapMap tissue models. See our FAQ and the description below for more info.

Announcement

November 30, 2016

We have received a number of inquiries from users about how to build their own prediction models. All scripts used to train the models are located in the GitHub repository here. Note that many of these scripts deal with splitting up data and submitting jobs to the University of Chicago's HPC cluster. The script for model training is located here. We hope this proves useful and gives some insight into how the models were built. If you do build a prediction model for your own research, we encourage you to share it with the community following the publication of your main results, or sooner if you feel comfortable. We can then host your model on predictdb.org with proper acknowledgement.


Disclaimer: the models are provided "as is", with the hope that they may be of use, without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the models or the use or other dealings in the models.

PredictDB Data Repository Explorer  


PredictDB Data

Description

The PredictDB Data Repository hosts genetic prediction models of transcriptome levels for use with PrediXcan and MetaXcan. Two sets of models are based on the HapMap SNP set: the models in GTEx-V6p-HapMap-2016-09-08.tar.gz were trained using GTEx v6p data, and the models in DGN-HapMap-2015.tar.gz were trained using DGN whole blood data. Prediction models based on the 1KG SNP set (GTEx-V6p-1KG-2016-11-16.tar.gz) are also available in the PredictDB Data Repository's More folder. A copy of Portal_Analysis_Methods_v6p_08182016.pdf can be downloaded from either GTEx or the PredictDB S3 bucket.

Model Training

Given a matrix X of genotype information, where each row is a sample, each column is a SNP, and each entry is the effect-allele dosage on a 0-2 scale, and a column vector y of normalized expression levels of a given transcript, we fit an ElasticNet model to predict the values of y given X. For a given transcript, we only include SNPs that are located within 1 Mb upstream of the transcription start site and 1 Mb downstream of the transcription end site. In fitting the model, we used 10-fold cross-validation to choose the penalty parameter and specified a mixing parameter of 0.5 (see the glmnet R package).
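For reference, the elastic net objective that glmnet documents for a continuous (Gaussian) response can be written as below. The symbols are not used in the text above and are introduced here only for illustration: N is the number of samples, x_i is row i of X, β_0 is the intercept, β is the vector of SNP weights, λ is the penalty parameter chosen by cross-validation, and α is the mixing parameter.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
\;+\;
\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right]
```

With the mixing parameter α = 0.5, the penalty term becomes λ[(1/4)‖β‖₂² + (1/2)‖β‖₁], an even blend of the ridge and lasso penalties.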

Extracting Files

After downloading the files you would like to use, you must extract them. To do so, either locate the file and double-click on it or, from the command line, run tar -zxvf [file]. This produces a folder containing the uncompressed SQLite databases, which hold the weights for predicting gene expression based on a person's genotype, and compressed text files containing the covariances of the SNPs used in the models.
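The same extraction can also be scripted. The following is a minimal Python sketch using only the standard library; the function name extract_models and the destination folder are hypothetical and chosen for illustration.

```python
import tarfile
from pathlib import Path


def extract_models(archive_path, dest_dir="."):
    """Extract a .tar.gz archive and return the paths of the extracted files."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        # Record regular files only (directories are skipped in the result).
        names = [m.name for m in archive.getmembers() if m.isfile()]
        # Newer Python versions support filter="data", which rejects
        # unsafe members such as absolute paths.
        if hasattr(tarfile, "data_filter"):
            archive.extractall(path=dest, filter="data")
        else:
            archive.extractall(path=dest)
    return sorted(dest / name for name in names)
```

For example, extract_models("DGN-HapMap-2015.tar.gz", "models") would unpack that archive into a models folder, assuming the archive has already been downloaded to the working directory.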

Database Schema

Each database contains the following tables:
  • extra - holds information about each linear model for predicting the transcriptome in the tissue. The column names with descriptions are listed here:
    • gene - the Ensembl ID of the gene
    • genename - the gene's HUGO symbol
    • pred.perf.R2 - the cross-validated R² value found when training the model
    • n.snps.in.model - the number of cis-SNPs used to predict the expression level of the gene
    • pred.perf.pval - the p-value of the correlation between cross-validated prediction and observed expression
    • pred.perf.qval - the q-value obtained when analyzing the initial distribution of p-values. The models in these databases have been filtered to only include results that are significant at an FDR of less than 5%.
  • weights - the weights for the SNPs in the linear models. The column names with descriptions are listed here:
    • rsid - the rsID of the SNP from dbSNP build 142
    • gene - the Ensembl ID of the gene whose expression the SNP weight helps predict
    • weight - the weight value for the SNP in the model
    • ref_allele - the reference allele of the SNP
    • eff_allele - the effect allele of the SNP
  • sample_info - has only one column (n.samples) and one value, which is the number of samples used to train the model.
  • construction - contains information from the training of the models. Primarily included for reproducibility purposes.
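
To illustrate how these tables fit together, here is a minimal Python sketch using the standard sqlite3 module. It assumes the column names are stored exactly as listed above, and that the caller supplies dosages that already count the allele in eff_allele. The function names gene_summary and predict_expression are hypothetical and are not part of PrediXcan or MetaXcan, which also handle steps such as allele matching that this sketch omits.

```python
import sqlite3


def gene_summary(conn, gene):
    """Return (genename, cross-validated R2, number of SNPs) for a gene, or None."""
    # Column names that contain dots must be double-quoted in SQL.
    return conn.execute(
        'SELECT genename, "pred.perf.R2", "n.snps.in.model" '
        "FROM extra WHERE gene = ?",
        (gene,),
    ).fetchone()


def predict_expression(conn, gene, dosages):
    """Weighted sum of effect-allele dosages (0-2 scale) for one gene.

    `dosages` maps rsid -> dosage of the effect allele. SNPs absent from
    the mapping are skipped, which is an assumption of this sketch.
    """
    rows = conn.execute(
        "SELECT rsid, weight FROM weights WHERE gene = ?", (gene,)
    )
    return sum(weight * dosages[rsid] for rsid, weight in rows if rsid in dosages)
```

A connection would be opened with sqlite3.connect on one of the extracted database files. The predicted expression for a gene is then the sum, over the SNPs in its model, of weight times effect-allele dosage.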

Acknowledgements

GTEx

The Genotype-Tissue Expression (GTEx; Sample size) project was supported by the Common Fund of the Office of the Director of the National Institutes of Health. All GTEx data were downloaded from the database of Genotypes and Phenotypes (dbGaP).

DGN

Depression Genes and Networks (DGN; 922 whole-blood samples) data were provided by Dr. Douglas F. Levinson. We gratefully acknowledge that these resources were supported by National Institutes of Health/National Institute of Mental Health grants 5RC2MH089916 (PI: Douglas F. Levinson, M.D.; Co-investigators: Myrna M. Weissman, Ph.D., James B. Potash, M.D., MPH, Daphne Koller, Ph.D., and Alexander E. Urban, Ph.D.) and 3R01MH090941 (Co-investigator: Daphne Koller, Ph.D.).

Funding

This work is supported by R01MH107666 (H.K.I.), K12 CA139160 (H.K.I.), R01 MH101820 (GTEx), P30 DK20595, P60 DK20595 (Diabetes Research and Training Center), P50 DA037844 (Rat Genomics), and P50 MH094267 (Conte).

Mailing List

Please join our Google Group for general discussion, notification of future changes to our tools, feature requests, etc.