Methods for Estimating Changes in DNA Methylation in the Presence of Cell Type Heterogeneity

Kevin McGregor

Supervised by: Celia Greenwood, Aurelie Labbe

DNA methylation at a cytosine-guanine (CpG) site can block the binding of proteins such as transcription factors to the DNA, and can therefore influence gene function and regulation. It is thus often valuable to investigate which methylation sites are associated with diseases or other phenotypes of interest. Though a large proportion of CpG sites in mammals are methylated, methylation signatures differ notably between cell types. Consequently, when methylation is measured in whole blood or other tissues containing multiple cell types, it can be difficult to distinguish changes associated with a phenotype of interest from changes that arise because cell type proportions vary among subjects. This is of particular concern when the phenotype itself is associated with changes in cell type proportions, since the effects of interest may then be confounded with the cell type proportions.

Several recently developed methods attempt to correct for this potential confounding, including a method based on an external validation data set (Houseman et al., BMC Bioinformatics 2012), a reference-free method (Houseman et al., Bioinformatics 2014), Surrogate Variable Analysis (Leek and Storey, PLoS Genetics 2007), Independent Surrogate Variable Analysis (Teschendorff, Bioinformatics 2011), the FAST-LMM-EWASher method (Zou, Nature Methods 2014), Deconfounding (Repsilber, BMC Bioinformatics 2010), and CellCDecon (Wagner, PhD Thesis 2014). To compare the performance of these methods, we artificially re-combined methylation measures obtained from cell-separated analysis of whole blood. Specifically, methylation measures were available for monocytes and CD4 T-cells. We randomly chose a subset of the samples to be disease cases and designated a set of CpG sites to be associated with the disease. A new artificial set of methylation measurements was then generated by combining the values from each cell type according to simulated cell type proportions for each subject. We evaluated each adjustment method by comparing its estimated parameters to the values used in the simulation. We found notable differences between the methods in accuracy, in the extent to which the confounding was corrected, and in computational performance.
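The re-combination step described above can be illustrated with a minimal sketch. It is not the code used in this work. It assumes methylation is stored as beta values in [0, 1], that the disease effect is a fixed additive shift applied to cases in both cell types, and that monocyte proportions follow a Beta distribution shifted upward for cases so that case status is confounded with cell composition. The function name, parameter values, and the random placeholder matrices standing in for the cell-separated data are all hypothetical.

```python
import numpy as np


def simulate_mixed_methylation(mono, cd4, n_cases, n_assoc,
                               effect=0.10, prop_shift=0.10, seed=0):
    """Mix two cell-type methylation matrices into artificial 'whole blood'.

    mono, cd4: arrays of shape (n_sites, n_subjects) holding beta values.
    All numeric defaults are illustrative assumptions, not values from the study.
    """
    rng = np.random.default_rng(seed)
    n_sites, n_subjects = mono.shape

    # Randomly choose a subset of the samples to be disease cases.
    case_idx = rng.choice(n_subjects, size=n_cases, replace=False)
    is_case = np.zeros(n_subjects, dtype=bool)
    is_case[case_idx] = True

    # Designate a set of CpG sites to be associated with the disease.
    assoc_sites = rng.choice(n_sites, size=n_assoc, replace=False)

    # Assumption: the effect is an additive shift in both cell types.
    # astype(float) returns a copy, so the inputs are left unchanged.
    mono_sim = mono.astype(float)
    cd4_sim = cd4.astype(float)
    block = np.ix_(assoc_sites, case_idx)
    mono_sim[block] += effect
    cd4_sim[block] += effect
    np.clip(mono_sim, 0.0, 1.0, out=mono_sim)
    np.clip(cd4_sim, 0.0, 1.0, out=cd4_sim)

    # Assumption: monocyte proportion is Beta(2, 5), shifted upward for cases.
    prop_mono = rng.beta(2.0, 5.0, size=n_subjects)
    prop_mono[is_case] += prop_shift
    prop_mono = np.clip(prop_mono, 0.0, 1.0)

    # Per-subject weighted average of the two cell types
    # (prop_mono broadcasts across the site axis).
    mixed = prop_mono * mono_sim + (1.0 - prop_mono) * cd4_sim
    return mixed, is_case, assoc_sites, prop_mono


# Placeholder data standing in for the cell-separated measurements.
demo_rng = np.random.default_rng(1)
mono_demo = demo_rng.uniform(0.0, 1.0, size=(500, 40))
cd4_demo = demo_rng.uniform(0.0, 1.0, size=(500, 40))

mixed, is_case, assoc_sites, prop_mono = simulate_mixed_methylation(
    mono_demo, cd4_demo, n_cases=20, n_assoc=25
)
```

Because the simulated case labels, associated sites, and proportions are returned alongside the mixed matrix, the estimates from any adjustment method can be compared directly with the values used to generate the data.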