Creation of a Database with Homogenized dbGaP Data

Tripathi C, Ellis G, Akhabir L, Daley D

Centre for Heart Lung Innovation, Faculty of Medicine, University of British Columbia, Vancouver, BC

Background: In the “omics” era, it has become increasingly important for researchers to share data and collaborate, with the overarching goal of deciphering complex trait etiologies by genetic, epigenetic and genomic means. The database of Genotypes and Phenotypes (dbGaP) stores and distributes the results of genotype-phenotype association studies. Its holdings include genome-wide association studies together with a diverse array of phenotypic and demographic data. These data must be integrated if their full potential is to be harnessed. Currently, studies submitted to dbGaP vary in their formatting; for instance, genotype data are found in PLINK, Matrix or VCF formats. This heterogeneity makes the use and integration of these data challenging.

Objective: To develop a relational SQL database that integrates dbGaP data from multiple studies and produces user-friendly outputs. The database will contain several kinds of data, including study documentation; phenotypic data at the individual level and in summary form; and genetic data such as individual genotypes, sequencing data and pedigree information.

Methods: The database tables will be written in SQL and compiled in three phases. We are presently in the first phase. Thus far, we have identified the variables to be included and how they will be grouped. We have also planned the tables to be built: 1) one table summarizing all the studies; 2) one table containing the variables common to all studies; and 3) the remaining tables, each containing variables specific to one study. The study summary table will be linked to the study-specific tables and to the table of common variables, and all study-specific tables will be linked to the table of common variables. The challenge of accommodating individual-level sequencing data in the database will be addressed. In the second phase, we will establish relationships between variables and tables and determine the scope of each variable: 1) local to a single table or subgroup of tables, or 2) used in every table. The final phase will be beta-testing, in which end users such as biologists, geneticists, statisticians and bioinformaticians will be invited to test the tables by querying the database content.
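
The planned three-part table layout could be sketched roughly as follows. This is a minimal illustration only, using Python's built-in sqlite3 module as a stand-in for whichever SQL engine is eventually chosen. Every table name, column name and value in it (study_summary, common_variables, study_a_specific, fev1_litres, "STUDY_A" and so on) is a hypothetical placeholder, not a variable or record from the project or from dbGaP.

```python
import sqlite3


def build_schema(conn):
    """Create the three kinds of tables described in the Methods (names are hypothetical)."""
    conn.executescript(
        """
        -- 1) One row per study: the study summary table.
        CREATE TABLE study_summary (
            study_id TEXT PRIMARY KEY,
            title    TEXT NOT NULL
        );

        -- 2) Variables common to all studies, linked to the study summary.
        CREATE TABLE common_variables (
            subject_id TEXT PRIMARY KEY,
            study_id   TEXT NOT NULL REFERENCES study_summary(study_id),
            sex        TEXT,
            age_years  REAL
        );

        -- 3) One table per study for its study-specific variables,
        --    linked to both the study summary and the common variables.
        CREATE TABLE study_a_specific (
            subject_id  TEXT PRIMARY KEY REFERENCES common_variables(subject_id),
            study_id    TEXT NOT NULL REFERENCES study_summary(study_id),
            fev1_litres REAL
        );
        """
    )


def demo():
    """Load one made-up subject and join the three tables back together."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    build_schema(conn)

    # Placeholder rows for illustration; these are not real data.
    conn.execute("INSERT INTO study_summary VALUES (?, ?)",
                 ("STUDY_A", "Placeholder study"))
    conn.execute("INSERT INTO common_variables VALUES (?, ?, ?, ?)",
                 ("S1", "STUDY_A", "F", 42.0))
    conn.execute("INSERT INTO study_a_specific VALUES (?, ?, ?)",
                 ("S1", "STUDY_A", 3.1))

    rows = conn.execute(
        """
        SELECT s.title, c.subject_id, c.sex, c.age_years, a.fev1_litres
        FROM study_summary AS s
        JOIN common_variables AS c ON c.study_id = s.study_id
        JOIN study_a_specific AS a ON a.subject_id = c.subject_id
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in demo():
        print(row)
```

In a layout like this, adding a study would mean adding one row to the summary table and one new study-specific table, while the common-variable table stays the single place to query across studies. Whether the real design keys the common table by subject, by study, or both is an open assumption here.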

Results: Efforts to integrate dbGaP datasets into a single database are ongoing. The project is in its first phase: studies are being requested from dbGaP data access committees, common variables are being identified, and the tables to be constructed are being drafted. The resulting database structure will be shared with collaborators and other researchers who face the same challenges.

Conclusion: We are creating an easy-to-use database that homogenizes the format of dbGaP data and accommodates diverse variables. It will facilitate the extraction and use of these data.