Showing posts with label fastq. Show all posts

Wednesday, 2 August 2017

Creating filtered fastq files of ONLY mapped reads from a BAM file

Filtering BAM files for mapped or unmapped reads

To get the unmapped reads from a BAM file, use:
samtools view -f 4 file.bam > unmapped.sam
The output will be in SAM format. To get the output in BAM format instead, add -b:
samtools view -b -f 4 file.bam > unmapped.bam
To get only the mapped reads, use the -F option, which works like grep's -v and skips alignments that have the given flag set:
samtools view -b -F 4 file.bam > mapped.bam

Source: https://www.biostars.org/p/56246/ Sukhdeep Singh
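The -f and -F options above are bit tests against the SAM FLAG field, where bit 0x4 means "segment unmapped". Here is a minimal Python sketch of that logic. The function names keep_with_f and keep_with_F are made up for illustration and are not part of samtools.

```python
# Bit 0x4 of the SAM FLAG field marks a read as unmapped (SAM specification).
UNMAPPED = 0x4


def keep_with_f(flag: int, required: int) -> bool:
    """Mimic `samtools view -f`: keep a read only if all `required` bits are set."""
    return (flag & required) == required


def keep_with_F(flag: int, excluded: int) -> bool:
    """Mimic `samtools view -F`: keep a read only if no `excluded` bit is set."""
    return (flag & excluded) == 0


# A few example FLAG values (hypothetical reads, not taken from a real file).
flags = [0, 4, 16, 77, 99, 141]

unmapped = [f for f in flags if keep_with_f(f, UNMAPPED)]
mapped = [f for f in flags if keep_with_F(f, UNMAPPED)]

print("unmapped:", unmapped)
print("mapped:", mapped)
```

Every read falls into exactly one of the two groups, which is why -f 4 and -F 4 split a BAM file cleanly into unmapped and mapped reads.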


To do this as efficiently as possible, using BBTools:
reformat.sh in=reads.sam out=mapped.fq mappedonly
Also, BBMap has a lot of options designed for filtering, so it can output in fastq format and separate mapped from unmapped reads, preventing the creation of intermediate sam files.  This approach also keeps pairs together, which is not very easy using samtools for filtering.

bbmap.sh ref=reference.fa in=reads.fq outm=mapped.fq outu=unmapped.fq
Source: https://www.biostars.org/p/127992/ Brian Bushnell

Wednesday, 22 February 2012

Amazon S3 for temporary storage of large datasets?

Just did a rough calculation on AWS calculator, the numbers are quite scary!

For a hypothetical 50 TB dataset (I haven't found out the maximum size of a single S3 object yet; I seem to recall it's 1 GB),
it costs $4160.27 to store it for a month!

Transferring it out costs $4807.11!

For 3 years, the cost of storage is about $149,000 (36 x $4160.27), which I guess would pay for an enterprise storage solution, where transfer costs are zero.

At this point in time, I guess one can't really use AWS S3 for sequence archival. I wonder if data deduplication can help reduce cloud storage costs ... I am sure in terms of bytes, BAM files should be quite similar .. no?
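The arithmetic behind these figures can be checked with a short Python sketch. The two dollar amounts come from the post. The 1 TB = 1024 GB conversion is an assumption, and the implied per-GB rate is derived from the post's numbers rather than taken from an AWS price list.

```python
# Figures quoted in the post (USD).
monthly_storage = 4160.27   # storing 50 TB for one month
transfer_out = 4807.11      # transferring 50 TB out once

# Assumption: 1 TB = 1024 GB; the calculator's exact convention is not stated.
dataset_gb = 50 * 1024

months = 36
three_year_storage = monthly_storage * months
implied_rate = monthly_storage / dataset_gb

print(f"3-year storage cost: ${three_year_storage:,.2f}")
print(f"Implied storage rate: ${implied_rate:.4f} per GB-month")
print(f"3-year storage plus one full download: ${three_year_storage + transfer_out:,.2f}")
```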


Thursday, 31 March 2011

Convert SAM / BAM to fasta / fastq

Probably one of the most frequently asked questions.

Latest thread on Biostar:
http://biostar.stackexchange.com/questions/6993/convert-bam-file-to-fasta-file

SamToFastq using Picard
http://picard.sourceforge.net/command-line-overview.shtml#SamToFastq

Samtools and awk to make fasta from sam
samtools view filename.bam | awk '{OFS="\t"; print ">"$1"\n"$10}' - > filename.fasta
 
Biopython and pysam (code contributed by Brad Chapman)
http://biostar.stackexchange.com/questions/6993/convert-bam-file-to-fasta-file/6994#6994 
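The samtools-and-awk one-liner above prints column 1 (read name) and column 10 (sequence) of each SAM record. A rough Python sketch of the same idea, extended to FASTQ by also using column 11 (qualities), might look like this. It is not the Picard or pysam code linked above, and the function names are invented for illustration. Unlike the awk version, it restores the original read orientation for reverse-strand alignments (FLAG bit 0x10), since SAM stores those sequences reverse-complemented.

```python
# Translation table for complementing bases; other characters pass through unchanged.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_line_to_fastq(line: str) -> str:
    """Convert one tab-separated SAM alignment line into a 4-line FASTQ record."""
    fields = line.rstrip("\n").split("\t")
    name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
    if flag & 0x10:
        # Reverse-strand alignment: undo the reverse complement applied by the aligner.
        seq = seq.translate(COMPLEMENT)[::-1]
        qual = qual[::-1]
    return f"@{name}\n{seq}\n+\n{qual}\n"


def sam_to_fastq(lines):
    """Yield FASTQ records for every alignment line, skipping '@' header lines."""
    for line in lines:
        if line.startswith("@"):
            continue
        yield sam_line_to_fastq(line)


# Tiny made-up example: one header line, one forward read, one reverse-strand read.
example = [
    "@HD\tVN:1.6\n",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r2\t16\tchr1\t200\t60\t4M\t*\t0\t0\tAACG\tIIH#\n",
]
records = list(sam_to_fastq(example))
print("".join(records), end="")
```

In practice the input would come from something like samtools view filename.bam piped into the script. Secondary and supplementary alignments are not filtered here, so a read with several alignment lines would be written more than once.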

Sunday, 13 March 2011

Script to filter to unique FASTQ reads using a bloom-filter in front of a python set

from the hackmap blog 

A simple script that filters to unique FASTQ reads using a bloom-filter in front of a python set. Basically, only reads flagged as already appearing in the bloom-filter are added to the set. This trades speed (it iterates over the file 3 times) for memory. The amount of memory is tunable via the specified error rate. It's not pretty, but it should be simple enough to demonstrate what's going on. It only reads from stdin and writes to stdout, with some information about the total number of reads and the number of false positives in the bloom-filter sent to stderr.
Usage looks like:

python fastq-unique.py < in.fastq > out.unique.fastq
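To illustrate the idea of a bloom-filter in front of a python set, here is a minimal sketch. It is not the hackmap script: it makes two passes over an in-memory list instead of three passes over stdin, and the BloomFilter class, its default sizes, and the unique_reads function are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Small bloom filter; n_bits and n_hashes are illustrative defaults."""

    def __init__(self, n_bits: int = 1 << 20, n_hashes: int = 4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive n_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.n_bits

    def add(self, item: str) -> bool:
        """Insert item; return True if it was possibly present already."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


def unique_reads(records):
    """Return the first record for each distinct sequence.

    records is a list of (name, sequence, quality) tuples.
    """
    bloom = BloomFilter()
    candidates = set()
    # Pass 1: only sequences the bloom filter reports as already seen go into
    # the set. These are the true duplicates plus a few false positives.
    for _, seq, _ in records:
        if bloom.add(seq):
            candidates.add(seq)
    # Pass 2: sequences outside the candidate set are certainly unique; for
    # candidates, keep only the first occurrence.
    emitted = set()
    unique = []
    for rec in records:
        seq = rec[1]
        if seq in candidates:
            if seq in emitted:
                continue
            emitted.add(seq)
        unique.append(rec)
    return unique


reads = [
    ("a", "ACGT", "IIII"),
    ("b", "TTTT", "IIII"),
    ("c", "ACGT", "HHHH"),
]
for name, seq, qual in unique_reads(reads):
    print(f"@{name}\n{seq}\n+\n{qual}")
```

The memory saving comes from the set holding only sequences that occur more than once (plus bloom-filter false positives) rather than every sequence in the file. A larger bit array lowers the false-positive rate at the cost of more memory.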

Saturday, 12 March 2011

Quality control and preprocessing of metagenomic datasets



Summary: Here, we present PRINSEQ for easy and rapid quality control and data preprocessing of genomic and metagenomic datasets. Summary statistics of FASTA (and QUAL) or FASTQ files are generated in tabular and graphical form and sequences can be filtered, reformatted and trimmed by a variety of options to improve downstream analysis.
Availability and Implementation: This open-source application was implemented in Perl and can be used as a stand alone version or accessed online through a user-friendly web interface. The source code, user help and additional information are available at http://prinseq.sourceforge.net/
