Detecting Genomic Elements of Extreme Conservation in Higher Eukaryotes by Integration of Hash Mapping and Cache-oblivious In-memory Computing
Author | : Andi Dhroso |
Total Pages | : 51 |
Release | : 2015 |
ISBN-10 | : OCLC:970211718 |
Book excerpt: Genomics is one of the first life science disciplines to enter the era of Big Data, facing challenges in all three dimensions: volume, variety, and velocity. Yet, in spite of a plethora of sequencing data, we are still far from creating a complete encyclopedia of the functional and structural elements of the genome. An example of this knowledge gap emerged in 2004, when Bejerano and Haussler discovered 481 DNA elements in the syntenic positions of the human, mouse and rat genomes that were 100% identical, called ultra-conserved elements (UCEs). Recently, an advanced alignment-free data-mining approach showed that this phenomenon exists beyond the animal kingdom and outside the regions of synteny (conservation of blocks of order within two sets of chromosomes that are being compared with each other). Our ultimate goal is to provide a comprehensive atlas of the regions of extreme conservation in higher eukaryotes, providing insights into the structural organization, function and evolution of these elements. However, an all-against-all comparison of dozens, if not hundreds, of eukaryotic genomes may not be feasible with current approaches. For instance, the original discovery of syntenic-only UCEs relied on a whole-genome alignment of three mammalian genomes and took one day on a 24-node cluster. A comprehensive alignment-free algorithm that guaranteed finding all syntenic and non-syntenic long identical multi-species elements (LIMEs) took three days on a 48-CPU cluster to compare two assembled genomes. Here, we present a new hybrid approach that integrates the ideas of hash mapping and cache-oblivious in-memory computing.
Our algorithm leverages the concept of help-me-help-you, in which the data structures are tailored to maximize cache hits and minimize cache misses. As a result, our hybrid algorithm is approximately 800 times faster than the current state-of-the-art method and scales to unassembled genomes. The new hybrid approach has been applied to detect the earliest evidence of extreme conservation by including the recently sequenced genomes of coelacanth and lamprey in the large-scale analysis. The integration of efficient software with hardware-optimized approaches has proven to be a promising direction in comparative genomics, allowing scientists to gain even deeper insights into the function and evolution of eukaryotic genomes.
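
The excerpt names hash mapping as one ingredient of the method but does not describe the implementation. The following is only a minimal sketch of the general idea, not the author's algorithm: it indexes the k-mers of one sequence in a hash table, looks up the k-mers of a second sequence, and extends each hit into a maximal identical region. The function names (`index_kmers`, `shared_identical_regions`), the parameters `k` and `min_len`, and the toy sequences are all assumptions made for illustration, and the cache-oblivious data layout mentioned in the excerpt is not modeled here.

```python
from collections import defaultdict


def index_kmers(seq, k):
    """Map every k-mer in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def shared_identical_regions(a, b, k, min_len):
    """Return sorted (start_a, start_b, length) tuples for maximal exact
    matches between a and b that are at least min_len long.

    Hypothetical helper for illustration; assumes min_len >= k.
    """
    index = index_kmers(a, k)
    found = set()
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):
            # Extend only from the left-most seed of a match, so each
            # maximal match is reported once.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            length = k
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length >= min_len:
                found.add((i, j, length))
    return sorted(found)


# Toy example: the two strings share the 10-base region "ACGTACGGAT".
regions = shared_identical_regions(
    "TTTACGTACGGATCCA", "GGGACGTACGGATTTT", k=4, min_len=8
)
print(regions)  # [(3, 3, 10)]
```

A plain Python dictionary like this gives no control over memory layout, so it illustrates only the seeding-and-extension logic; the speedups reported in the excerpt depend on data structures tuned for cache behavior, which this sketch does not attempt.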