Not sure where to start? If you are new to PrediXcan and MetaXcan, we recommend starting your analyses with the DGN-HapMap Whole Blood model or one of the various GTEx-HapMap tissue models. See our FAQ and the description below for more info.
We have received a number of inquiries from users about how to build their own prediction models. All scripts used to train the models are located in the GitHub repository here. Note that many of these scripts deal with splitting up data and submitting jobs to the University of Chicago's HPC cluster. The script for model training itself is located here. We hope this proves useful and gives some insight into how the models were built. If you do build a prediction model for your own research, we encourage you to share it with the community following the publication of your main results, or sooner, if you feel comfortable. We can then host your model on predictdb.org while giving you proper acknowledgement.
PredictDB Data Repository Explorer
The PredictDB Data Repository hosts genetic prediction models of transcriptome levels for use with PrediXcan and MetaXcan. Two archives are based on the HapMap SNP set: the models in GTEx-V6p-HapMap-2016-09-08.tar.gz were trained using GTEx v6p data, and the models in DGN-HapMap-2015.tar.gz were trained using DGN whole blood data. Prediction models based on the 1KG SNP set (GTEx-V6p-1KG-2016-11-16.tar.gz) are also available from the PredictDB Data Repository's More folder. A copy of Portal_Analysis_Methods_v6p_08182016.pdf can be downloaded from either GTEx or the PredictDB S3 bucket.
Given a matrix X of genotype information, where each row is a sample, each column is a SNP, and entries are encoded on a 0-2 scale for the effect allele, and a column vector y of normalized expression levels of a given transcript, we fit an ElasticNet model to predict the values of y given X. For a given transcript, we only include SNPs located within 1 Mb upstream of the transcription start site and 1 Mb downstream of the transcription end site. When fitting the model, we used 10-fold cross-validation to choose the penalty parameter and specified a mixing parameter of 0.5 (see the glmnet R package).
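The procedure above can be illustrated with a minimal pure-Python sketch. This is not the training script used to build the hosted models (those were fit with the glmnet R package); it is a toy coordinate-descent implementation of the same elastic net objective, written only to show the moving parts. The function names (in_cis_window, fit_elastic_net, cv_choose_lambda) are hypothetical, the folds are assigned by interleaving rather than at random, and, unlike glmnet's defaults, the predictors are centered but not standardized and the candidate penalties must be supplied by the caller.

```python
def in_cis_window(snp_pos, tss, tes, window=1_000_000):
    """True if a SNP lies between `window` bp before the gene start and `window` bp after its end."""
    start, end = min(tss, tes), max(tss, tes)  # min/max makes this strand-agnostic
    return start - window <= snp_pos <= end + window


def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become exactly zero."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def fit_elastic_net(X, y, lam, alpha=0.5, n_iter=200, tol=1e-8):
    """Coordinate descent for
    (1/(2n)) * RSS + lam * (alpha * |b|_1 + (1 - alpha) / 2 * |b|_2^2).
    X is a list of rows (samples) of 0-2 dosages; returns (intercept, weights)."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]  # centered dosages
    col_sq = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p
    resid = [yi - y_mean for yi in y]  # residuals for the all-zero starting weights
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # monomorphic SNP carries no information
            # Covariance of SNP j with the partial residual (SNP j's own effect added back)
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    intercept = y_mean - sum(m * b for m, b in zip(col_means, beta))
    return intercept, beta


def cv_choose_lambda(X, y, lambdas, k=10, alpha=0.5):
    """Return the candidate penalty with the lowest k-fold cross-validated squared error."""
    n = len(X)
    folds = [list(range(f, n, k)) for f in range(k)]  # interleaved, deterministic folds
    best_lam, best_mse = None, float("inf")
    for lam in lambdas:
        sq_err = 0.0
        for held in folds:
            held_set = set(held)
            train = [i for i in range(n) if i not in held_set]
            b0, beta = fit_elastic_net([X[i] for i in train], [y[i] for i in train], lam, alpha)
            for i in held:
                pred = b0 + sum(b * x for b, x in zip(beta, X[i]))
                sq_err += (y[i] - pred) ** 2
        if sq_err / n < best_mse:
            best_lam, best_mse = lam, sq_err / n
    return best_lam
```

In use, one would first keep only the columns of X whose SNP positions pass in_cis_window for the transcript, then call cv_choose_lambda with a grid of penalties and alpha=0.5, and finally refit with fit_elastic_net at the chosen penalty. SNPs whose weight is shrunk exactly to zero drop out of the model.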
After downloading the files you would like to use, you must extract them. To do so, either locate the file and double-click it or, from the command line, run tar -zxvf [file]. This produces a folder containing the uncompressed SQLite databases, which hold the weights for predicting gene expression based on a person's genotype, and compressed text files containing the covariances of the SNPs that carry weights.
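For users who prefer to script this step, the extraction can be sketched in Python with the standard library alone. The helper names (extract_models, list_tables) are hypothetical, and the assumption that the database files end in .db should be checked against the archive you downloaded. The second helper only asks SQLite which tables a file contains, so it makes no assumption about the schema.

```python
import sqlite3
import tarfile
from pathlib import Path


def extract_models(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return the paths of any *.db files found.

    Only extract archives from a source you trust.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)
    # Assumption: the SQLite model files use the .db extension
    return sorted(dest.rglob("*.db"))


def list_tables(db_path):
    """Return the names of the tables in a SQLite file, sorted alphabetically."""
    con = sqlite3.connect(str(db_path))
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [row[0] for row in rows]
```

Calling extract_models on a downloaded archive and then list_tables on each returned path is a quick way to confirm that the download is intact before pointing PrediXcan or MetaXcan at the files.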
The Genotype-Tissue Expression (GTEx; Sample size) project was supported by the Common Fund of the Office of the Director of the National Institutes of Health. All GTEx data were downloaded from the database of Genotypes and Phenotypes (dbGaP).
Depression Genes and Networks (DGN; 922 whole-blood samples) data were provided by Dr. Douglas F. Levinson. We gratefully acknowledge that these resources were supported by National Institutes of Health/National Institute of Mental Health grants 5RC2MH089916 (PI: Douglas F. Levinson, M.D.; Coinvestigators: Myrna M. Weissman, Ph.D., James B. Potash, M.D., MPH, Daphne Koller, Ph.D., and Alexander E. Urban, Ph.D.) and 3R01MH090941 (Co-investigator: Daphne Koller, Ph.D.).
This work is supported by R01MH107666 (H.K.I.), K12 CA139160 (H.K.I.), R01 MH101820 (GTEx), P30 DK20595, P60 DK20595 (Diabetes Research and Training Center), P50 DA037844 (Rat Genomics), and P50 MH094267 (Conte).
Please join our Google Group for general discussion, notification of future changes to our tools, feature requests, etc.
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.