Principal Component-based Analysis of High-dimensional Gene Expression Guided by Clinical Data

Lubieniecka JM, Graham J, Sarmiento A, Brown KL, Ross C, Luqmani R, Foell D, Gill E, Hancock REW, Benseler S, Cabral DA

Anti-neutrophil cytoplasmic antibody (ANCA)-associated vasculitides (AAVs) are a group of rare, systemic, inflammatory disorders affecting small- to medium-sized vessels of major organs. The AAVs comprise granulomatosis with polyangiitis (GPA), microscopic polyangiitis (MPA), and eosinophilic granulomatosis with polyangiitis (EGPA). The AAVs commonly lead to kidney failure or pulmonary hemorrhage and are associated with significant mortality and substantial morbidity in survivors. Because AAV is more common in adults than in children, most knowledge of pediatric AAVs has been extrapolated from adult studies. To enable research on pediatric vasculitis, an international network of investigators has been established through the Pediatric Vasculitis Initiative (PedVas) for the collection of clinical data and biological samples.

Results of recent studies in adult patients suggest that the pathogenesis of AAVs may have a genetic component. One goal of the PedVas initiative is to identify the genetic component of pediatric vasculitis through analysis of the clinical and RNAseq data. A major challenge for statistical analysis is the small number of individuals relative to the number of variables measured. In addition, quantitative measurement of the clinical variables of interest is often impossible or imprecise. For gene expression data, principal component (PC) analysis is used both to reduce the high dimensionality, by capturing a small set of latent variables, and to quantify the variation due to these latent variables. PC analysis has been successfully used to estimate the signatures of latent variables in genomic data; however, only a limited number of methods are available to identify the genetic variables that are the key drivers of the PCs.

Guided by clinical data, we propose to identify systematic variation in gene expression and to apply a PC-based approach that identifies the genes driving this variation. These genes are then prioritized in a subsequent gene set enrichment analysis to highlight biological pathways involved in the disease.
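
As a rough illustration only, the general idea of a clinically guided PC analysis can be sketched in Python with NumPy. The abstract does not specify the actual procedure, so everything below is an assumption made for the example: the function name rank_driver_genes, the use of a single numeric clinical score, the choice of the PC whose scores correlate most strongly with that score, the ranking of genes by absolute loading, and the simulated data.

```python
import numpy as np


def rank_driver_genes(expr, clinical, n_pcs=5, top_k=10):
    """Hypothetical helper, not the authors' method.

    expr     : samples x genes expression matrix (e.g. log-scale values)
    clinical : one numeric clinical score per sample (assumed available)
    Returns the index of the PC most correlated with the clinical score,
    the indices of the top_k genes by absolute loading on that PC, and
    the fraction of variance explained by each of the first n_pcs PCs.
    """
    # Center each gene so the PCs describe variation around the gene mean.
    centered = expr - expr.mean(axis=0)
    # SVD of the centered matrix: rows of vt are the PC loadings over genes.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    n_pcs = min(n_pcs, len(s))
    # PC scores (one column per PC) and variance explained per PC.
    scores = u[:, :n_pcs] * s[:n_pcs]
    explained = (s ** 2 / np.sum(s ** 2))[:n_pcs]
    # "Guidance" step (assumption): pick the PC whose scores track the
    # clinical variable most closely, by absolute Pearson correlation.
    corr = np.array(
        [abs(np.corrcoef(scores[:, k], clinical)[0, 1]) for k in range(n_pcs)]
    )
    best_pc = int(np.argmax(corr))
    # Genes with the largest absolute loadings are treated as the drivers.
    top_genes = np.argsort(-np.abs(vt[best_pc]))[:top_k]
    return best_pc, top_genes, explained


# Simulated data (not real patient data): 20 samples, 200 genes, with
# genes 0-9 made to depend on a made-up clinical score.
rng = np.random.default_rng(0)
n_samples, n_genes = 20, 200
clinical = np.linspace(-2.0, 2.0, n_samples)
effect = np.zeros(n_genes)
effect[:10] = 4.0
expr = rng.normal(size=(n_samples, n_genes)) + np.outer(clinical, effect)

best_pc, top_genes, explained = rank_driver_genes(expr, clinical)
print("PC most associated with the clinical score:", best_pc + 1)
print("Candidate driver genes (column indices):", sorted(top_genes.tolist()))
print("Variance explained by leading PCs:", np.round(explained, 3))
```

In this sketch the ranked gene list would then be passed to a gene set enrichment tool. The simulated sample size is deliberately small relative to the number of genes, mirroring the setting described above.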