Transcription Factor Binding in an Expanded Epigenetic Alphabet
1. Department of Computer Science, University of Toronto, Toronto, ON, Canada; 2. Princess Margaret Cancer Centre, Toronto, ON, Canada; 3. Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia; 4. Department of Genome Sciences, University of Washington, Seattle, WA, USA; 5. Department of Computer Science and Engineering, University of Washington, Seattle, WA, USA; 6. Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
Introduction. Human gene expression programs are controlled through the action of 1400–2000 individual transcription factors. Many transcription factors initiate transcription only in specific sequence contexts, providing the means for sequence specificity of transcriptional control. The four-letter DNA alphabet generally used to describe these sequences, however, only partially describes the possible diversity of nucleobases a transcription factor might encounter. For instance, cytosine is often present in an epigenetically modified form: 5-methylcytosine (5mC). Cellular enzymes oxidize 5mC successively to 5-hydroxymethylcytosine (5hmC), 5-formylcytosine (5fC), and 5-carboxylcytosine (5caC). Just as transcription factors distinguish one unmodified nucleobase from another, they have been shown to distinguish unmodified bases from epigenetically modified bases. Modification-sensitive transcription factors provide a mechanism by which widespread changes in DNA methylation and hydroxymethylation found in many cancers can dramatically shift active gene expression programs. In particular, acute myeloid leukemia and glioblastoma multiforme often display broad changes in epigenetic cytosine state, partly caused by frequent mutations in gene families affecting 5mC oxidation.
Methods. To understand the effect of modified nucleobases on gene regulation, we developed methods to discover transcription factor motifs and identify transcription factor binding sites in DNA with covalent modifications. Our models expand the standard A/C/G/T alphabet, adding m (5mC), h (5hmC), f (5fC), and c (5caC). We adapted the well-established position weight matrix formulation of transcription factor binding affinity to this expanded alphabet.
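To illustrate how a position weight matrix might extend to the expanded alphabet, the following Python sketch builds log-odds scores over the eight symbols named above and scans a sequence for its best-scoring window. The function names, toy counts, uniform background, and pseudocount are assumptions made for illustration only; they are not taken from the tools described here, and the sketch considers a single strand.

```python
import math

# A/C/G/T plus m (5mC), h (5hmC), f (5fC), c (5caC); symbols are case-sensitive.
ALPHABET = "ACGTmhfc"


def log_odds_pwm(counts, background, pseudocount=1.0):
    """Turn per-position symbol counts into log2-odds scores against a background."""
    pwm = []
    for column in counts:
        total = sum(column.get(s, 0) for s in ALPHABET) + pseudocount * len(ALPHABET)
        pwm.append({
            s: math.log2(((column.get(s, 0) + pseudocount) / total) / background[s])
            for s in ALPHABET
        })
    return pwm


def score_window(pwm, window):
    """Sum the per-position scores for a window as wide as the matrix."""
    return sum(column[symbol] for column, symbol in zip(pwm, window))


def best_site(pwm, sequence):
    """Return (score, start) of the best window, or None if the sequence is too short."""
    width = len(pwm)
    hits = [
        (score_window(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(hits) if hits else None


# Hypothetical motif "CAmG": the third position prefers 5mC over unmodified C.
toy_counts = [{"C": 10}, {"A": 10}, {"m": 10}, {"G": 10}]
uniform_background = {s: 1.0 / len(ALPHABET) for s in ALPHABET}
toy_pwm = log_odds_pwm(toy_counts, uniform_background)
toy_hit = best_site(toy_pwm, "TTCAmGTTCACGTT")
```

In this toy setting the methylated window `CAmG` outscores the otherwise identical unmodified window `CACG`, which is the kind of distinction a four-letter matrix cannot express.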
We have engineered several tools to work with expanded-alphabet sequences and position weight matrices. First, we have developed a program, Cytomod, to create an expanded-alphabet sequence using data from both single-base assays (such as those involving bisulfite sequencing) and lower-resolution assays (such as those involving DNA immunoprecipitation or biotin tagging). When the data support more than one modification at a single position, Cytomod chooses among them using a configurable evidence model. Second, we have developed new versions of DREME (Discriminative Regular Expression Motif Elicitation) and CentriMo that enable de novo discovery of modification-sensitive motifs and identification of modification-sensitive binding sites. These versions permit users to specify new alphabets, anticipating future alphabet expansions.
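One way to picture the choice between competing modifications at a single cytosine is the Python sketch below. It is a hypothetical rule, not Cytomod's actual evidence model: the threshold, the priority order used to break ties, the input format (a per-position dictionary of modification fractions), and all function names are assumptions for illustration, and only cytosines on the given strand are considered.

```python
# Symbols for modified cytosines in the expanded alphabet.
MOD_SYMBOLS = {"5mC": "m", "5hmC": "h", "5fC": "f", "5caC": "c"}

# Hypothetical tie-break order: earlier entries win when evidence is equal.
DEFAULT_PRIORITY = ("5caC", "5fC", "5hmC", "5mC")


def call_symbol(base, evidence, threshold=0.5, priority=DEFAULT_PRIORITY):
    """Pick a symbol for one position from its modification evidence.

    `evidence` maps a modification name to an assumed fraction in [0, 1].
    Non-cytosine bases, and cytosines with no evidence at or above the
    threshold, are returned unchanged.
    """
    if base != "C":
        return base
    passing = [m for m in priority if evidence.get(m, 0.0) >= threshold]
    if not passing:
        return base
    # Strongest evidence wins; equal evidence falls back to the priority order.
    best = max(passing, key=lambda m: (evidence[m], -priority.index(m)))
    return MOD_SYMBOLS[best]


def annotate(sequence, evidence_by_position, **options):
    """Rewrite a four-letter sequence into the expanded alphabet."""
    return "".join(
        call_symbol(base, evidence_by_position.get(i, {}), **options)
        for i, base in enumerate(sequence)
    )


# Toy input: position 1 is clearly methylated, position 4 has competing
# evidence, and position 5 falls below the threshold.
toy_evidence = {
    1: {"5mC": 0.9},
    4: {"5mC": 0.6, "5hmC": 0.7},
    5: {"5mC": 0.2},
}
toy_annotated = annotate("ACGTCC", toy_evidence)
```

Making the threshold and priority order parameters mirrors the idea of a configurable evidence model: the same data can yield a stricter or more permissive expanded-alphabet sequence depending on how much support a modification call is required to have.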
Results. We created an expanded-alphabet genome sequence using genome-wide maps of 5mC, 5hmC, and 5fC in mouse embryonic stem cells. Using this sequence and expanded-alphabet position weight matrices, we identified cis-regulatory modules that we believe are active only in the presence of cytosine modifications. With chromatin immunoprecipitation-sequencing (ChIP-seq) data for c-Myc transcription factor binding, we detected the known preference of c-Myc for unmethylated DNA. We found new binding sites for known methylation-sensitive transcription factors, such as Krüppel-like factor 4 (Klf4) and the c-Jun/c-Fos heterodimer. We also located new binding sites for the carboxylation-sensitive heterodimer of transcription factor 3 (Tcf3) and the Achaete-scute homolog 1 (Ascl1) transcription factor. Gene set enrichment analysis revealed biological pathways affected by DNA modification state.