http://nar.oxfordjournals.org/content/early/2012/01/28/nar.gks047.long
This article is Open Access
The rapid expansion in the quantity and quality of RNA-Seq data requires the development of sophisticated high-performance bioinformatics tools capable of rapidly transforming this data into meaningful information that is easily interpretable by biologists. Currently available analysis tools are often not easily installed by the general biologist, and most of them lack inherent parallel processing capabilities, which are widely recognized as an essential feature of next-generation bioinformatics tools. We present here a user-friendly and fully automated RNA-Seq analysis pipeline (R-SAP) with built-in multi-threading capability to analyze and quantitate high-throughput RNA-Seq datasets. R-SAP follows a hierarchical decision-making procedure to accurately characterize various classes of transcripts and achieves a near-linear decrease in data processing time as a result of increased multi-threading. In addition, RNA expression level estimates obtained using R-SAP display high concordance with levels measured by microarrays.
To initiate analyses using R-SAP, the user provides two required inputs for the pipeline: the sequence alignment file and the coordinate file of known transcripts. Currently, R-SAP accepts alignment files only in psl format, generated by mapping RNA-Seq reads to the reference genome using BLAT (BLAST-like alignment tool) (21) or SSAHA2 (Sequence search and alignment by hashing algorithm) (22). Mapping RNA-Seq reads to the genome may produce alignments scattered across multiple exons separated by introns. We chose psl as the alignment format for the pipeline because the scattered alignments are precisely stitched together and reported as a single large alignment. As a result, for each sequencing read the most likely alignment and the corresponding genomic locus can be readily found in the alignment files. Moreover, the psl format preserves the orientation of alignment blocks originating from contiguous genomic loci, enabling their accurate re-mapping to the annotated exons and the determination of associated reference structural variants.
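To make the "stitched" alignment concrete, here is a minimal Python sketch of reading one psl record and recovering its genomic blocks and the gaps between them. It assumes the standard 21-column, tab-separated psl layout written by BLAT. The names `PslAlignment`, `parse_psl_line`, `gaps_between_blocks` and `best_alignment` are hypothetical, and ranking candidate alignments by the number of matching bases is only an illustrative stand-in, not necessarily the criterion R-SAP itself applies.

```python
from typing import List, NamedTuple, Tuple


class PslAlignment(NamedTuple):
    read_name: str
    strand: str
    chrom: str
    matches: int
    blocks: List[Tuple[int, int]]  # genomic (start, end), 0-based, half-open


def _int_list(field: str) -> List[int]:
    # psl list fields are comma-separated with a trailing comma, e.g. "50,45,"
    return [int(x) for x in field.strip(",").split(",")]


def parse_psl_line(line: str) -> PslAlignment:
    f = line.rstrip("\n").split("\t")
    if len(f) < 21:
        raise ValueError("expected 21 tab-separated psl columns")
    block_sizes = _int_list(f[18])  # column 19: blockSizes
    t_starts = _int_list(f[20])     # column 21: tStarts (genomic block starts)
    blocks = [(s, s + n) for s, n in zip(t_starts, block_sizes)]
    return PslAlignment(
        read_name=f[9],   # qName
        strand=f[8],      # strand
        chrom=f[13],      # tName
        matches=int(f[0]),
        blocks=blocks,
    )


def gaps_between_blocks(aln: PslAlignment) -> List[Tuple[int, int]]:
    # Genomic gaps between consecutive blocks; for a spliced read these are
    # candidate introns.
    return [(a[1], b[0]) for a, b in zip(aln.blocks, aln.blocks[1:])]


def best_alignment(alns: List[PslAlignment]) -> PslAlignment:
    # Illustrative assumption: prefer the alignment with the most matching bases.
    return max(alns, key=lambda a: a.matches)
```

For a read that spans two exons, `parse_psl_line` returns a single record with two blocks, and `gaps_between_blocks` reports the intervening genomic interval. This is the property the paragraph above describes: one alignment line per read placement, with the block structure preserved.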
R-SAP is also configured to work with two of the currently available transcript assemblers: Cufflinks (23) and Scripture (24). Assembled transcripts can be supplied to R-SAP either in GTF (Gene Transfer Format) or in BED (Browser Extensible Data) format, which are the default output formats of Cufflinks and Scripture, respectively.
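As an illustration of what a transcript record in BED carries, the following Python sketch converts one 12-column BED line into absolute exon coordinates. It assumes the standard BED12 layout, in which block starts are given relative to `chromStart`. The function name `bed12_exons` is hypothetical and is not part of R-SAP, Cufflinks or Scripture.

```python
from typing import List, Tuple


def bed12_exons(line: str) -> Tuple[str, str, str, List[Tuple[int, int]]]:
    """Return (chrom, name, strand, exons) for one BED12 transcript line.

    Exons are genomic (start, end) pairs, 0-based and half-open.
    """
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("expected 12 tab-separated BED columns")
    chrom, chrom_start = f[0], int(f[1])
    name, strand = f[3], f[5]
    # Columns 11 and 12: blockSizes and blockStarts, comma-separated,
    # optionally with a trailing comma.
    sizes = [int(x) for x in f[10].strip(",").split(",")]
    rel_starts = [int(x) for x in f[11].strip(",").split(",")]
    exons = [(chrom_start + s, chrom_start + s + n)
             for s, n in zip(rel_starts, sizes)]
    return chrom, name, strand, exons
```

Exon intervals obtained this way use the same 0-based, half-open convention as psl block coordinates, so the two can be compared directly when re-mapping alignment blocks to annotated exons.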
- Nucl. Acids Res. (2012) doi: 10.1093/nar/gks047