Research in genetics has advanced rapidly in recent years with the aid of next-generation sequencing (NGS) instruments. However, these instruments produce very large volumes of data, which leads to storage, compatibility, scalability, and performance issues. Cloud computing and the MapReduce framework appear to be the best solution to these issues. Consequently, they have recently been adopted by many organizations, and the initial results are very promising. However, because these are only the first steps in this direction, the software developed so far does not adequately provide primary functions, such as bisulfite and paired-end mapping, that are offered by on-site software such as RMAP or BS Seeker. In addition, existing MapReduce-based applications were not designed to process the long reads produced by NGS and are therefore inefficient. Finally, these tools are difficult for biologists to use because most were developed on Linux with a command-line interface.
To promote the use of Cloud technologies in genomics and to prepare for the next generation of sequencing, we have built CloudAligner, a Hadoop MapReduce-based application that achieves higher performance, covers most primary features, is more accurate, and offers a user-friendly interface. It was also designed to handle long sequences. The performance gain over Cloud-based counterparts (35 to 80%) comes mainly from omitting the reduce phase. Compared with locally run applications, the gain comes from partitioning the huge reference genome and the reads and processing them in parallel. The CloudAligner source code is available at http://cloudaligner.sourceforge.net/ and a web version is at http://mine.cs.wayne.edu:8080/CloudAligner/.
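The map-only idea described above can be illustrated with a minimal sketch. It is not CloudAligner's code or algorithm. It assumes a brute-force mismatch scan in place of a real seed-based aligner, uses threads as stand-ins for Hadoop map tasks, and partitions only the reads (not the reference). All function names are hypothetical. Each map task emits final alignment records directly, so no shuffle or reduce step is needed.

```python
from concurrent.futures import ThreadPoolExecutor


def hamming_within(a, b, max_mismatches):
    """Return the mismatch count of two equal-length strings,
    or None as soon as the count exceeds max_mismatches."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return None
    return mismatches


def partition(items, n_parts):
    """Split a list into at most n_parts contiguous chunks of similar size."""
    if not items:
        return []
    size = -(-len(items) // n_parts)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def map_align(read_chunk, reference, max_mismatches):
    """Hypothetical map task: align every read in the chunk by brute force
    and return final (read_id, position, mismatches) records."""
    hits = []
    for read_id, seq in read_chunk:
        for pos in range(len(reference) - len(seq) + 1):
            window = reference[pos:pos + len(seq)]
            mm = hamming_within(seq, window, max_mismatches)
            if mm is not None:
                hits.append((read_id, pos, mm))
    return hits


def run_map_only(reads, reference, n_mappers=2, max_mismatches=1):
    """Run the map tasks in parallel and concatenate their outputs.
    There is no shuffle or reduce phase: mapper output is the final output."""
    chunks = partition(reads, n_mappers)
    with ThreadPoolExecutor(max_workers=n_mappers) as pool:
        per_mapper = list(
            pool.map(lambda c: map_align(c, reference, max_mismatches), chunks)
        )
    return [hit for part in per_mapper for hit in part]
```

Because each read is aligned independently, the mapper outputs need no grouping by key, which is the reason a reduce phase can be skipped in a design of this kind.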
Our results show that CloudAligner is faster than CloudBurst, provides more accurate results than RMAP, and supports a variety of input and output formats. In addition, its web-based interface makes it easier to use than its counterparts.