Analysis of DNA Methylation and Gene Expression Data in Cord Blood and Placenta Tissues: An Integrative Approach

Bhatnagar SR1,2, Houde AA4,5, Voisin G2, Bouchard L4,5, Greenwood CMT1,2,3

1. Department of Epidemiology, Biostatistics and Occupational Health, McGill University; 2. Lady Davis Institute, Jewish General Hospital, Montréal, QC; 3. Departments of Oncology and Human Genetics, McGill University; 4. Department of Biochemistry, Université de Sherbrooke, QC; 5. ECOGENE-21 and Lipid Clinic, Chicoutimi Hospital, QC

Background: Recent advances in genomic technologies have made it feasible to measure, on the same individual, multiple types of genomic activity such as genotypes, gene expression, DNA copy number, methylation and microRNA expression. However, to benefit from the increasing amounts of heterogeneous data and to obtain a more complete view of genomic functions, there is a great need for computationally efficient statistical methods that allow us to combine this information in an intelligent way. Challenges with prediction models in this setting arise from the high-dimensional, non-linear nature of the data, the large number of measurements compared to the few samples on which they are collected, and the presence of complex interactions between the different types of data. Methods such as sparse regression, hierarchical clustering and principal component analysis can each address one of these challenges, but cannot address all of them simultaneously. Kernel methods, which use matrices measuring the similarity between pairs of individuals, offer a powerful way of simultaneously addressing these challenges without significantly increasing the computational burden. In this work, we investigate the benefits and challenges that arise from using kernel methods in the context of integrating DNA methylation, gene expression and phenotypic data in a sample of mother-child pairs from a prospective birth cohort. The goal of this study is to identify epigenetic marks observed at birth that help predict childhood obesity.

Methods: DNA methylation and gene expression, in both cord blood and placenta tissues, were measured at birth in a sample of 23 women, 16 of whom had a pregnancy affected by gestational diabetes (GD). In addition, seven anthropometric measurements were taken in the offspring at age 5. We first kernelize these measurements and use hierarchical clustering to assign the children to one of two body-fat groups. We then use a sparse version of canonical correlation analysis, supervised by GD status, to capture both the maximum correlation and the sparse nature of the genomic data. The resulting sparse canonical variables are used to predict the body-fat class labels. The kernel and its bandwidth parameter are chosen to maximize the area under the ROC curve (AUC).
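As an illustration of the first step only, the following is a minimal Python sketch of how anthropometric measurements might be kernelized with a Gaussian kernel and then split into two groups by hierarchical clustering. It is not the implementation used in this study: the function names, the column standardization, the kernel-induced distance, the average linkage and the default bandwidth are all assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform


def gaussian_kernel(X, bandwidth):
    """n x n matrix with K[i, j] = exp(-||x_i - x_j||^2 / (2 * bandwidth^2))."""
    sq_dists = squareform(pdist(X, metric="sqeuclidean"))
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def kernel_distance(K):
    """Distance induced by a kernel: d(i, j)^2 = K[i, i] + K[j, j] - 2 * K[i, j]."""
    diag = np.diag(K)
    sq = diag[:, None] + diag[None, :] - 2.0 * K
    # Clip tiny negative values caused by floating-point rounding.
    return np.sqrt(np.clip(sq, 0.0, None))


def two_group_labels(X, bandwidth=1.0, method="average"):
    """Assign each row of X (individuals x measurements) to one of two clusters.

    Assumptions of this sketch: columns are standardized first, and the
    linkage method and bandwidth are illustrative defaults.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    D = kernel_distance(gaussian_kernel(Z, bandwidth))
    np.fill_diagonal(D, 0.0)
    # linkage expects the condensed (upper-triangular) distance vector.
    tree = linkage(squareform(D, checks=False), method=method)
    # Cut the dendrogram so that at most two clusters remain (labels 1 and 2).
    return fcluster(tree, t=2, criterion="maxclust")
```

Calling `two_group_labels(X, bandwidth=2.0)` on a 23 x 7 array would return one cluster label per child. In practice the bandwidth would be tuned, for example against the AUC criterion described above.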

Results: Applying a Gaussian kernel to the seven anthropometric measurements leads to an AUC of 83.3% (95% CI: 66.3-100) compared to an AUC of 74.4% (95% CI: 50.2-98.7) when using the untransformed data.
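For readers unfamiliar with the metric, the AUC can be computed from class labels and predicted scores through its equivalence with the Mann-Whitney U statistic. The sketch below is a generic illustration with a hypothetical function name. It does not reproduce the figures above, and it does not compute confidence intervals, since the method used for them is not stated here.

```python
import numpy as np
from scipy.stats import rankdata


def auc_from_scores(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    labels: 0/1 class membership (1 = positive class).
    scores: predicted scores, higher meaning more likely positive.
    Ties in the scores receive average ranks and so count as one half.
    """
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    ranks = rankdata(scores)
    u_stat = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)
```

The returned value is the proportion of positive-negative pairs in which the positive individual receives the higher score, so 0.5 corresponds to chance and 1.0 to perfect separation.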

Conclusions: Although kernel-based prediction methods have been shown to perform well in cancer, where there tend to be numerous abnormalities, their performance on normal tissues, such as those in this study, is unknown. This work shows the potential of kernel methods for integrating heterogeneous data from normal tissues and clinical phenotypes.