This article is Open Access
The rapid expansion in the quantity and quality of RNA-Seq data requires the development of sophisticated high-performance bioinformatics tools capable of rapidly transforming these data into meaningful information that is easily interpretable by biologists. Currently available analysis tools are often difficult for the general biologist to install, and most lack the inherent parallel processing capabilities widely recognized as an essential feature of next-generation bioinformatics tools. We present here a user-friendly and fully automated RNA-Seq analysis pipeline (R-SAP) with built-in multi-threading capability to analyze and quantitate high-throughput RNA-Seq datasets. R-SAP follows a hierarchical decision-making procedure to accurately characterize various classes of transcripts and achieves a near-linear decrease in data processing time as a result of increased multi-threading. In addition, RNA expression level estimates obtained using R-SAP display high concordance with levels measured by microarrays.
To initiate analyses using R-SAP, the user provides two required inputs for the pipeline: the sequence alignment file and a file of known transcript coordinates. Currently, R-SAP accepts alignment files only in psl format, which are generated by mapping RNA-Seq reads to the reference genome using BLAT (BLAST-like alignment tool) (21) or SSAHA2 (Sequence search and alignment by hashing algorithm) (22). Mapping RNA-Seq reads to the genome may produce alignments scattered across multiple exons separated by introns. We chose psl as the alignment format for the pipeline because the scattered alignments are precisely stitched together and reported as a single large alignment. As a result, for each sequencing read, the most likely alignment and corresponding genomic locus can be readily found in the alignment files. Moreover, the psl format preserves the orientation of alignment blocks originating from contiguous genomic loci, enabling their accurate re-mapping to the annotated exons and determination of associated reference structural variants.
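To illustrate how a psl record carries a spliced read as one alignment made of several blocks, the following Python sketch parses the standard 21-column psl layout and recovers the genomic interval of each block. It is not R-SAP's code: the names `PslAlignment`, `parse_psl_line` and `best_alignment_per_read` are hypothetical, and choosing the record with the most matching bases as the "most likely" alignment is an assumption made only for illustration, since the selection criteria are not described in this section. The sketch also assumes untranslated nucleotide alignments, where the strand field is a single character.

```python
from collections import namedtuple

# Hypothetical container for the fields used in this sketch.
PslAlignment = namedtuple("PslAlignment", "read chrom strand matches blocks")


def parse_psl_line(line):
    """Parse one 21-column psl record into a PslAlignment."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 21:
        raise ValueError("expected 21 tab-separated psl fields, got %d" % len(f))
    # blockSizes (column 19) and tStarts (column 21) are comma-terminated lists.
    sizes = [int(x) for x in f[18].rstrip(",").split(",")]
    t_starts = [int(x) for x in f[20].rstrip(",").split(",")]
    # psl coordinates are 0-based, half-open: each block spans [start, start + size).
    blocks = [(s, s + n) for s, n in zip(t_starts, sizes)]
    return PslAlignment(read=f[9], chrom=f[13], strand=f[8],
                        matches=int(f[0]), blocks=blocks)


def best_alignment_per_read(lines):
    """Keep, for each read, the record with the most matching bases.

    Assumption: 'most matches' stands in for whatever criteria a real
    pipeline would apply when ranking candidate alignments.
    """
    best = {}
    for line in lines:
        # Skip blank lines and the optional psl header (non-numeric first character).
        if not line.strip() or not line[0].isdigit():
            continue
        aln = parse_psl_line(line)
        if aln.read not in best or aln.matches > best[aln.read].matches:
            best[aln.read] = aln
    return best


# Two made-up records for the same read: a spliced two-block hit and a weaker hit.
spliced = "\t".join(["50", "0", "0", "0", "0", "0", "1", "975", "+", "read1", "50",
                     "0", "50", "chr1", "100000", "1000", "2025", "2",
                     "25,25,", "0,25,", "1000,2000,"])
weaker = "\t".join(["40", "0", "0", "0", "0", "0", "0", "0", "+", "read1", "50",
                    "0", "40", "chr2", "100000", "5000", "5040", "1",
                    "40,", "0,", "5000,"])

best = best_alignment_per_read([spliced, weaker])
print(best["read1"].chrom, best["read1"].blocks)
```

In this made-up example the two 25-base blocks on chr1 correspond to two exons separated by a 975-base gap, which is how a read spanning an intron appears in a single psl record.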
R-SAP is also configured to work with two of the currently available transcript assemblers: Cufflinks (23) and Scripture (24). Assembled transcripts can be supplied to R-SAP in either GTF (Gene Transfer Format) or BED (Browser Extensible Data) format. GTF and BED are the default output formats of Cufflinks and Scripture, respectively.
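Because GTF and BED describe transcript structure differently, a pipeline accepting both has to bring them to a common representation. The sketch below shows one way this could be done in Python, converting 12-column BED records and GTF exon records into 0-based, half-open exon intervals per transcript. It is an illustration under stated assumptions, not R-SAP's implementation: the function names `exons_from_bed12` and `exons_from_gtf` are hypothetical, and the example records are invented.

```python
import re
from collections import defaultdict


def exons_from_bed12(line):
    """Return (name, chrom, strand, exons) from one 12-column BED record.

    BED is 0-based, half-open; blockStarts are offsets from chromStart.
    """
    f = line.rstrip("\n").split("\t")
    chrom, start, name, strand = f[0], int(f[1]), f[3], f[5]
    sizes = [int(x) for x in f[10].rstrip(",").split(",")]
    offsets = [int(x) for x in f[11].rstrip(",").split(",")]
    exons = [(start + o, start + o + s) for o, s in zip(offsets, sizes)]
    return name, chrom, strand, exons


def exons_from_gtf(lines):
    """Group GTF 'exon' records by transcript_id.

    GTF is 1-based, inclusive, so start is shifted by one to match BED.
    Returns {(transcript_id, chrom, strand): sorted list of (start, end)}.
    """
    transcripts = defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] != "exon":
            continue
        m = re.search(r'transcript_id "([^"]+)"', f[8])
        if m is None:
            continue
        transcripts[(m.group(1), f[0], f[6])].append((int(f[3]) - 1, int(f[4])))
    return {key: sorted(exons) for key, exons in transcripts.items()}


# Invented records describing the same two-exon transcript in both formats.
bed_line = "\t".join(["chr1", "1000", "2025", "tx1", "0", "+",
                      "1000", "2025", "0", "2", "25,25,", "0,1000,"])
gtf_lines = [
    "\t".join(["chr1", "Cufflinks", "exon", "1001", "1025", ".", "+", ".",
               'gene_id "g1"; transcript_id "tx1";']),
    "\t".join(["chr1", "Cufflinks", "exon", "2001", "2025", ".", "+", ".",
               'gene_id "g1"; transcript_id "tx1";']),
]

bed_tx = exons_from_bed12(bed_line)
gtf_tx = exons_from_gtf(gtf_lines)
print(bed_tx[3] == gtf_tx[("tx1", "chr1", "+")])
```

Normalizing both inputs to the same coordinate convention up front means that later comparisons against alignment blocks do not need to know which assembler produced the transcript.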
- Nucl. Acids Res. (2012) doi: 10.1093/nar/gks047