Imputation as a Tool to Fine-map Chromosome 4q31.22 Region Previously Identified to Confer Breast Cancer Susceptibility

Mahalakshmi Kumaran¹, Carol E. Cass², Yutaka Yasui³ and Sambasivarao Damaraju¹

Departments of 1. Laboratory Medicine and Pathology, 2. Oncology and 3. Public Health, University of Alberta, Edmonton, Canada

Introduction: Genome Wide Association Studies (GWAS) have identified several Single Nucleotide Polymorphisms (SNPs) associated with Breast Cancer (BC) susceptibility in diverse ethnic populations. A three-stage GWAS (sample size, n=7250 cases and controls) from our lab identified the Chromosome 4q31.22 locus as highly associated with sporadic BC susceptibility, predominantly in pre-menopausal women [rs1429142, combined p-value 6.2×10^-10 and allelic OR of 1.49 (95% CI: 1.31-1.68) adjusted for body mass index]. Often, GWAS-identified loci are surrogates for causal variants in the flanking regions. Targeted re-sequencing and genotype imputation are the common approaches used to increase the density of markers around the associated loci and to identify causal variants. Imputation is a powerful statistical method increasingly employed on GWAS data sets to infer untyped genetic markers based on Linkage Disequilibrium (LD), and the use of publicly available high-density reference panels such as the 1000 Genomes Project has become standard practice.

Objectives: Our objectives were (i) to impute the 4q31.22 region to increase the marker density and thereby facilitate SNP selection for further genotyping in larger cohorts, as this is cost-effective compared to targeted re-sequencing, and (ii) to conduct a preliminary analysis of the imputed data for association with BC.

Materials: All genotyped subjects are of Caucasian ancestry. Our GWAS data set comprises 348 sporadic BC cases (predominantly pre-menopausal) and 348 apparently healthy controls genotyped on Affymetrix SNP 6.0 arrays. We also imputed an externally accessed BC GWAS data set of 1,142 sporadic post-menopausal BC cases and 1,145 age-matched controls from the Cancer Genetic Markers of Susceptibility (CGEMS) project, with genotyping performed on the Illumina HumanHap550v1 platform. Using two GWAS data sets from independent genotyping platforms facilitates finer comparisons of the imputed data. As an independent validation set, the CGEMS data also served as a control to confirm our original study premise that rs1429142 is not associated with BC in post-menopausal women.

Methods: Study genotypes were phased using the SHAPEIT2 algorithm, and the best-guess method implemented in IMPUTE2 was used to infer the missing genotypes. The 1000 Genomes Project genotype data (http://www.1000genomes.org) were used as the reference panel for imputation. Pre- and post-imputation quality control measures helped refine the data and increase confidence in the imputed genotypes.

Results: We achieved the best imputation results, with overall cross-validation accuracy over 98%, using the cosmopolitan reference panel (samples included Americans, Europeans, East Asians, South Asians, and Africans). The SNP density on chromosome 4 increased from 40,203 to 952,002. Post-imputation filtering with an info score cut-off of ≥0.7, LD (r²) of ≥0.1 with rs1429142, and a minor allele frequency cut-off of >1% identified 251 new bi-allelic SNPs in 500 kb of the flanking region. We tested the imputed SNPs for association with BC in both GWAS data sets. We identified SNPs showing high statistical significance relative to rs1429142 in pre-menopausal women, as we predicted, indicating that this locus may potentially harbor causal variants.
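The post-imputation filtering criteria above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the record layout, the function names, the toy SNP identifiers and positions, and the reading of "500 kb of the flanking region" as a window of 500 kb on either side of the index SNP are all assumptions made for illustration. Only the thresholds (info ≥0.7, r² ≥0.1 with the index SNP, minor allele frequency >1%, bi-allelic) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImputedSnp:
    """Hypothetical per-SNP summary record after imputation."""
    snp_id: str
    position: int          # base-pair position on the chromosome
    info: float            # imputation info score
    maf: float             # minor allele frequency (0..0.5)
    r2_with_index: float   # LD r^2 with the index SNP (rs1429142 in the abstract)
    n_alleles: int         # number of alleles observed at the site


def passes_filters(snp, index_position, info_min=0.7, r2_min=0.1,
                   maf_min=0.01, window_bp=500_000):
    """Apply the thresholds quoted in the abstract.

    Assumption: the flanking region is taken as +/- window_bp around the
    index SNP; the abstract does not specify the exact window definition.
    """
    return (
        snp.n_alleles == 2                      # bi-allelic only
        and snp.info >= info_min                # info score >= 0.7
        and snp.r2_with_index >= r2_min         # r^2 >= 0.1 with index SNP
        and snp.maf > maf_min                   # MAF strictly above 1%
        and abs(snp.position - index_position) <= window_bp
    )


def filter_snps(snps, index_position):
    """Return the SNPs that satisfy every filter, preserving input order."""
    return [s for s in snps if passes_filters(s, index_position)]


# Toy data: identifiers, positions, and values are invented for illustration.
INDEX_POSITION = 1_000_000
TOY_SNPS = [
    ImputedSnp("snpA", 1_100_000, 0.95, 0.20, 0.60, 2),  # passes all filters
    ImputedSnp("snpB", 1_050_000, 0.65, 0.20, 0.60, 2),  # info score too low
    ImputedSnp("snpC", 990_000, 0.90, 0.20, 0.05, 2),    # r^2 too low
    ImputedSnp("snpD", 1_010_000, 0.90, 0.01, 0.40, 2),  # MAF not above 1%
    ImputedSnp("snpE", 1_600_000, 0.90, 0.20, 0.40, 2),  # outside the window
    ImputedSnp("snpF", 1_020_000, 0.90, 0.20, 0.40, 3),  # not bi-allelic
]

if __name__ == "__main__":
    kept = filter_snps(TOY_SNPS, INDEX_POSITION)
    print([s.snp_id for s in kept])  # prints ['snpA']
```

Keeping each threshold as a named parameter makes it straightforward to check how sensitive the retained SNP count is to, for example, a stricter info score cut-off.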

Future direction: We will genotype the newly identified SNPs in a combined sample of 10,000 (cases and controls), with a focus on pre-menopausal women, to improve statistical power for the association study. Further functional studies may provide mechanistic insights into the variants associated with disease susceptibility.