Towards a Cloud-ready Cancer Genomics Analysis Pipeline

Marco Albuquerque, Bruno Grande, and Ryan D. Morin

An inherent problem in data-driven biology is the multi-disciplinary skill set required of the scientist to draw meaningful inferences from complex data sets. Over the past decade, this has been exacerbated by new technologies such as next-generation sequencing, which have greatly accelerated the rate at which data are produced. To facilitate efficient handling of these increasingly large data sets, myriad algorithms have been developed to perform various bioinformatic tasks. In the data-intensive field of cancer genomics, these include computational methods for sequence alignment/mapping and variant detection. However, each software tool is typically run from the command line with an often cryptic combination of parameters, creating a tremendous barrier for new users. Graphical user interfaces, which are sorely lacking in the field of bioinformatics, can broaden the usage of these powerful computational methods by effectively eliminating their steep learning curve. This can be achieved using the Galaxy platform, which provides a unified graphical user interface for running software. Still, the small number of tools incorporated into this platform restricts users to simple analyses. To directly address this deficiency, we introduce a cancer genomics toolbox in Galaxy. Currently, our toolbox includes 20 new Galaxy tools spanning several sub-categories, notably variant calling (for single nucleotide variants, copy number variations and structural variations), visualization, and additional helper tools for integrating and summarizing results. These tools have been linked with existing Galaxy tools to form five workflows, which serve as reproducible bioinformatic pipelines. Following Galaxy best practices, every tool supports automatic installation, allowing for seamless setup. All tools and workflows have been developed to ensure optimal parallelism in a cluster environment.
Additionally, our cancer genomics toolbox can easily be deployed onto a cloud-based instance of Galaxy, eliminating the need for permanent access to commodity computing hardware. In summary, our Galaxy toolbox will accelerate cancer research by empowering researchers to perform their own cancer genome analyses with unprecedented accessibility.