Saturday, 31 March 2012
Recorded Webinar: Dissecting the Apache Hadoop Stack
Friday, 30 March 2012
HPCwire: BGI Starts Bioinformatics and Computing Lab in Tianjin
Professor Jian Wang, President of BGI, said, "In the past, it took a year to conduct a project on the genomics association study of 500 human samples, but now with "Tianhe", 3 hours is enough. We believe this will broaden the applications of Tianhe-1A in life science and greatly accelerate the development of frontier of science and technology."
Wednesday, 28 March 2012
BMC Bioinformatics | Analysis of high-depth sequence data for studying viral diversity: a comparison of next generation sequencing platforms using Segminator II.
Abstract
BACKGROUND:
Next generation sequencing provides detailed insight into the variation present within viral populations, introducing the possibility of treatment strategies that are both reactive and predictive. Current software tools, however, need to be scaled up to accommodate for high-depth viral data sets, which are often temporally or spatially linked. In addition, due to the development of novel sequencing platforms and chemistries, each with implicit strengths and weaknesses, it will be helpful for researchers to be able to routinely compare and combine data sets from different platforms/chemistries. In particular, error associated with a specific sequencing process must be quantified so that true biological variation may be identified.
RESULTS:
Segminator II was developed to allow for the efficient comparison of data sets derived from different sources. We demonstrate its usage by comparing large data sets from 12 influenza H1N1 samples sequenced on both the 454 Life Sciences and Illumina platforms, permitting quantification of platform error. For mismatches median error rates at 0.10 and 0.12%, respectively, suggested that both platforms performed similarly. For insertions and deletions median error rates within the 454 data (at 0.3 and 0.2%, respectively) were significantly higher than those within the Illumina data (0.004 and 0.006%, respectively). In agreement with previous observations these higher rates were strongly associated with homopolymeric stretches on the 454 platform. Outside of such regions both platforms had similar indel error profiles. Additionally, we apply our software to the identification of low frequency variants.
CONCLUSION:
We have demonstrated, using Segminator II, that it is possible to distinguish platform specific error from biological variation using data derived from two different platforms. We have used this approach to quantify the amount of error present within the 454 and Illumina platforms in relation to genomic location as well as location on the read. Given that next generation data is increasingly important in the analysis of drug-resistance and vaccine trials, this software will be useful to the pathogen research community. A zip file containing the source code and jar file is freely available for download from http://www.bioinf.manchester.ac.uk/segminator/.
PMID: 22443413 [PubMed - as supplied by publisher]
BMC Bioinformatics | Mapsembler, targeted and micro assembly of large NGS datasets on a desktop computer
Abstract
BACKGROUND:
The analysis of next-generation sequencing data from large genomes is a timely research topic. Sequencers are producing billions of short sequence fragments from newly sequenced organisms. Computational methods for reconstructing whole genomes/transcriptomes (de novo assemblers) are typically employed to process such data. However, these methods require large memory resources and computation time. Many basic biological questions could be answered targeting specific information in the reads, thus avoiding complete assembly.
RESULTS:
We present Mapsembler, an iterative micro and targeted assembler which processes large datasets of reads on commodity hardware. Mapsembler checks for the presence of given regions of interest that can be constructed from reads and builds a short assembly around it, either as a plain sequence or as a graph, showing contextual structure. We introduce new algorithms to retrieve approximate occurrences of a sequence from reads and construct an extension graph. Among other results presented in this paper, Mapsembler enabled to retrieve previously described human breast cancer candidate fusion genes, and to detect new ones not previously known.
CONCLUSIONS:
Mapsembler is the first software that enables de novo discovery around a region of interest of repeats, SNPs, exon skipping, gene fusion, as well as other structural events, directly from raw sequencing reads. As indexing is localized, the memory footprint of Mapsembler is negligible. Mapsembler is released under the CeCILL license and can be freely downloaded from http://alcovna.genouest.org/mapsembler/.
PMID: 22443449 [PubMed - as supplied by publisher]
Tuesday, 27 March 2012
[Velvet-users] A strategy for scaling genome and transcriptome contig assembly (digital normalization)
---------- Forwarded message ----------
From: "C. Titus Brown"
Date: Mar 25, 2012 12:48 PM
Subject: [Velvet-users] A strategy for scaling genome and transcriptome contig assembly (digital normalization)
To: "velvet-users"
Hi all,
last week I posted a preprint of a paper discussing a strategy for coverage normalization, data reduction, and error elimination:
http://ivory.idyll.org/blog/mar-12/diginorm-paper-posted.html
we call this strategy 'digital normalization' and it can yield good to spectacular reductions in data size and memory usage for assembly. In the paper we test it with Velvet, Oases, and Trinity on a variety of data sets.
---
On the paper site,
http://ged.msu.edu/papers/2012-diginorm/
I just posted a tutorial for running it on microbial genomes prior to Velvet assembly, and on the Trinity paper's yeast mRNAseq data set prior to Oases or Trinity assembly.
http://ged.msu.edu/angus/diginorm-2012/tutorial.html
The tutorial uses an Amazon EC2 instance for reproducibility, but with a bit of hopefully obvious tweaking the commands should work on any Linux system. Note, you'll need about 15 gb of RAM for the yeast Oases & Trinity assemblies.
Let me know if you have any questions (but please ask just on the relevant mailing list -- I'm sending this to the velvet, oases, and trinity lists).
cheers,
--titus
_______________________________________________
Velvet-users mailing list
http://listserver.ebi.ac.uk/mailman/listinfo/velvet-users
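The email does not spell out how digital normalization works. As I understand the preprint, a read is kept only if the median count of its k-mers, among the reads kept so far, is still below a coverage cutoff. Below is a minimal sketch of that idea. The function names, the exact in-memory Counter (a stand-in for a memory-efficient probabilistic counter) and the default values of k and cutoff are assumptions for illustration, not the authors' code.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return every overlapping k-mer in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalize(reads, k=20, cutoff=20):
    """Keep a read only while the median count of its k-mers is below cutoff."""
    counts = Counter()  # assumption: exact counts, fine for small inputs only
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read is shorter than k, nothing to count
        if median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)  # only kept reads contribute to coverage
    return kept


# Redundant copies of a read are dropped once its k-mers reach the cutoff.
reads = ["ACGTTGCA"] * 10 + ["GGGGCCCC"]
kept = digital_normalize(reads, k=4, cutoff=3)
print(len(reads), "reads in,", len(kept), "reads kept")
```

Because reads from already well-covered regions are discarded before assembly, both the input size and the number of distinct k-mers the assembler has to hold shrink.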
Saturday, 24 March 2012
Flow cytometric chromosome sorting in plants: The next generation.
Source
Centre of the Region Haná for Biotechnological and Agricultural Research, Institute of Experimental Botany, Sokolovská 6, CZ-77200 Olomouc, Czech Republic.
Abstract
Genome analysis in many plant species is hampered by large genome size and by sequence redundancy due to the presence of repetitive DNA and polyploidy. One solution is to reduce the sample complexity by dissecting the genomes to single chromosomes. This can be realized by flow cytometric sorting, which enables purification of chromosomes in large numbers. Coupling the chromosome sorting technology with next generation sequencing provides a targeted and cost effective way to tackle complex genomes. The methods outlined in this article describe a procedure for preparation of chromosomal DNA suitable for next-generation sequencing.
ARIEL and AMELIA: Testing for an Accumulation of Rare Variants Using Next-Generation Sequencing Data.
Source
Wellcome Trust Sanger Institute, Hinxton, UK.
Abstract
Objectives: There is increasing evidence that rare variants play a role in some complex traits, but their analysis is not straightforward. Locus-based tests become necessary due to low power in rare variant single-point association analyses. In addition, variant quality scores are available for sequencing data, but are rarely taken into account. Here, we propose two locus-based methods that incorporate variant quality scores: a regression-based collapsing approach and an allele-matching method. Methods: Using simulated sequencing data we compare 4 locus-based tests of trait association under different scenarios of data quality. We test two collapsing-based approaches and two allele-matching-based approaches, taking into account variant quality scores and ignoring variant quality scores. We implement the collapsing and allele-matching approaches accounting for variant quality in the freely available ARIEL and AMELIA software. Results: The incorporation of variant quality scores in locus-based association tests has power advantages over weighting each variant equally. The allele-matching methods are robust to the presence of both protective and risk variants in a locus, while collapsing methods exhibit a dramatic loss of power in this scenario. Conclusions: The incorporation of variant quality scores should be a standard protocol when performing locus-based association analysis on sequencing data. The ARIEL and AMELIA software implement collapsing and allele-matching locus association analysis methods, respectively, that allow the incorporation of variant quality scores.
Copyright © 2012 S. Karger AG, Basel.
Thursday, 22 March 2012
Multi-threaded BAM compression and sorting
From: Heng Li
Date: Thu, Mar 22, 2012 at 11:32 AM
Subject: [Samtools-help] Multi-threaded BAM compression and sorting
To convert coordinate-sorted BAM back to FASTQ, the recommended way is to sort the BAM by name and then convert the name-sorted BAM to FASTQ. This is important because some mappers, such as BWA, assume the input is random. They may have some trouble if we directly convert a coordinate-sorted BAM with Picard's bam2fastq. Novosort, however, does not sort by name. As I need to do BAM=>FASTQ for some huge BAMs, I added multi-threading to "sort", "merge" and "view".
This is not a full parallelization in that not all the steps are parallelized. Thus the efficiency does not scale linearly with the number of threads. It is not recommended to use more than 8 threads. With 4 threads, sorting time is reduced to 40% according to limited testing. It may save you half a day if you have a huge BAM.
All my changes are naive and simple. It is possible to speed up sorting and compression further, but so far as I can see, this needs quite a lot of code restructuring and development time. For coordinate sorting, novosort scales much better with the number of threads (though I do not know if multi-threaded novosort is free to use beyond 15 days). Nils' multi-threaded bgzip should also do better on compression. These are sophisticated implementations. Mine is not.
The multi-threaded sort/merge/view is available at the "mt" branch:
https://github.com/samtools/samtools/tree/mt
The samtools/bgzf APIs stay the same except a few new functions to enable threading. In addition to multithreading, there are a few other improvements to sorting (some are based on Nils'):
1) @HD-SO tag is properly set (finally).
2) The compression level can be changed on the command line (-l).
3) Coordinate sorting considers strand as part of the key.
4) Improved alpha-numeric comparison between query names. The previous version was slower and did not work when there is a large integer.
5) Supporting "K/M/G" with option "-m". The maximum memory is estimated a little more accurately.
6) I kept claiming samtools sort was stable (i.e. the relative order of two records having the same coordinate is retained), but this was not true. The new sort is truly stable. This also means that under the same compression level, sort always produces exactly the same output. For end users, stable sorting is largely irrelevant. This just makes me feel more comfortable, "in theory".
Heng
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
https://lists.sourceforge.net/lists/listinfo/samtools-help
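The one timing quoted in the email (sorting time down to 40% with 4 threads) can be turned into a rough estimate for other thread counts. The sketch below assumes the run follows Amdahl's law, which the email does not claim; the function names are mine.

```python
def parallel_fraction(observed, threads):
    """Solve (1 - p) + p / threads = observed for the parallel fraction p."""
    return (1.0 - observed) / (1.0 - 1.0 / threads)


def relative_time(p, threads):
    """Amdahl's law: run time relative to a single thread."""
    return (1.0 - p) + p / threads


# Assumption: the 40% figure at 4 threads is representative.
p = parallel_fraction(0.40, 4)
for n in (1, 2, 4, 8, 16):
    print(n, "threads ->", round(relative_time(p, n), 3))
```

Under that assumption about 80% of the work is parallel, so 8 threads would bring the time to roughly 30% of the single-threaded run and 16 threads only to about 25%, which is consistent with the advice not to go beyond 8 threads.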
UPDATE
contributed by iceman (see comments)
relevant links to an alternative implementation:
See: http://seqanswers.com/forums/showthread.php?p=66683
and: https://github.com/nh13/samtools/tree/pbgzip
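Points 4 and 6 of the email (alphanumeric comparison of query names, and stable sorting) are easy to illustrate outside samtools. The sketch below is Python, not the samtools implementation; natural_key is a hypothetical helper and the records are made up.

```python
import re


def natural_key(name):
    """Split a name into text and digit runs so digit runs compare as integers."""
    parts = re.split(r"(\d+)", name)
    # With one capture group, odd positions always hold the digit runs.
    return [int(part) if i % 2 else part for i, part in enumerate(parts)]


# Plain string sorting would put read10 before read2; large integers are fine.
names = ["read10", "read2", "read1", "read100000000000000000000"]
by_name = sorted(names, key=natural_key)
print(by_name)

# Python's sorted() is stable: records with an equal key keep their input order.
records = [("chr1", 100, "a"), ("chr1", 50, "b"), ("chr1", 100, "c")]
by_coord = sorted(records, key=lambda rec: (rec[0], rec[1]))
print(by_coord)
```

With a stable sort, the two records at chr1:100 come out in the same relative order on every run, which is why identical input and compression level give identical output.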
CloVR: A virtual machine for automated and portable sequence analysis from the desktop using cloud computing
CloVR is a VM that executes on a desktop (or laptop) computer, providing the ability to run analysis pipelines on local resources (Figure 1). CloVR is invoked using one of two supported VM players, VMware [33] and VirtualBox [34], at least one of which is freely available on all major desktop platforms: Windows, Unix/Linux, and Mac OS. On a local computer, CloVR utilizes local disk storage and compute resources, as supported by the VM player, including multi-core CPUs if available. To access data stored on the local computer, users can copy files into a "shared folder" that is accessible on both the VM and the local desktop and uses available hard drive space on the computer. Once inside the shared folder, CloVR can read this data for processing. Similarly, CloVR writes output data to this shared folder, making the pipeline output available on the desktop. This shared folder feature is supported by both VMware and VirtualBox.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3228541/?tool=pubmed
BMC Bioinformatics | Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community
Abstract
Background
A steep drop in the cost of next-generation sequencing during recent years has made the technology affordable to the majority of researchers, but downstream bioinformatic analysis still poses a resource bottleneck for smaller laboratories and institutes that do not have access to substantial computational resources. Sequencing instruments are typically bundled with only the minimal processing and storage capacity required for data capture during sequencing runs. Given the scale of sequence datasets, scientific value cannot be obtained from acquiring a sequencer unless it is accompanied by an equal investment in informatics infrastructure.
Results
Cloud BioLinux is a publicly accessible Virtual Machine (VM) that enables scientists to quickly provision on-demand infrastructures for high-performance bioinformatics computing using cloud platforms. Users have instant access to a range of pre-configured command line and graphical software applications, including a full-featured desktop interface, documentation and over 135 bioinformatics packages for applications including sequence alignment, clustering, assembly, display, editing, and phylogeny. Each tool's functionality is fully described in the documentation directly accessible from the graphical interface of the VM. Besides the Amazon EC2 cloud, we have started instances of Cloud BioLinux on a private Eucalyptus cloud installed at the J. Craig Venter Institute, and demonstrated access to the bioinformatic tools interface through a remote connection to EC2 instances from a local desktop computer. Documentation for using Cloud BioLinux on EC2 is available from our project website, while a Eucalyptus cloud image and VirtualBox Appliance is also publicly available for download and use by researchers with access to private clouds.
Conclusions
CloudBioLinux provides a platform for developing bioinformatics infrastructures on the cloud. An automated and configurable process builds Virtual Machines, allowing the development of highly customized versions from a shared code base. This shared community toolkit enables application specific analysis platforms on the cloud by minimizing the effort required to prepare and maintain them.
only hg19 resources available for PLINK/SEQ
http://atgu.mgh.harvard.edu/plinkseq/tutorial.shtml
you would have to first download the data
Downloading the data for the tutorial
We assume you have a working copy of PLINK/SEQ already installed. The data for this tutorial are in two archives you need to download:
- pseq-tut1.tar.gz [ 1.1M ] : VCFs and a few auxiliary data files
- resources-hg18-0.02.tar.gz [ 1.1G ] : a (relatively large) bundle of resource databases (RefSeq genes, dbSNP variants, hg18 sequence)
UPDATE: a better page describing the resources can be found here, although only hg19 is available:
http://atgu.mgh.harvard.edu/plinkseq/resources.shtml
Flash on 64 bit Ubuntu is 1 good reason to install Google Chrome
Wednesday, 21 March 2012
DeconSeq @ SourceForge.net: automatically detect and efficiently remove sequence contaminations from genomic and metagenomic datasets.
Introduction
Sequences obtained from impure nucleic acid preparations may contain DNA from sources other than the sample. Those sequence contaminations are a serious concern to the quality of the data used for downstream analysis, possibly causing misassembly of sequence contigs and erroneous conclusions. Therefore, the removal of sequence contaminants presents a necessary step for all metagenomic projects.
DeconSeq is distributed under the GNU Public License (GPL). All its source codes are freely available to both academic and commercial users. The latest version can be downloaded at the SourceForge download page.
Web version
The interactive web interface of DeconSeq can be used to automatically detect and efficiently remove sequence contaminations from genomic and metagenomic datasets.
Necessary resources
Hardware
- Computer connected to the Internet
Software
- Up-to-date Web browser (Firefox, Safari, Chrome, Internet Explorer, ...)
Files
- FASTA file with sequence data
- FASTQ file (as alternative format to trim sequence and quality data)
Upload data to the DeconSeq web version
To upload a new dataset in FASTA or FASTQ format to DeconSeq, follow these steps:
1. Go to http://deconseq.sourceforge.net
2. Click on "Use DeconSeq" in the top menu on the right (the latest DeconSeq web version should load)
3. Select your FASTA or FASTQ file
4. Select the retain and remove (optional) database(s)
5. Click "Submit"
Tuesday, 20 March 2012
gigasync - Tool that enables rsync to mirror enormous directory trees.
rsync (http://samba.org/rsync/) has a couple of issues with mirroring large (> 100K) directory trees.
rsync's memory usage is directly proportional to the number of files in a tree. Large directories take a large amount of RAM.
rsync can recover from previous failures, but always determines the files to transfer up-front. If the connection fails before that determination can be made, no forward progress in the mirror can occur.
The solution? Chop up the workload by using perl to recurse the directory tree, building smallish lists of files to transfer with rsync. Most of the time these small lists of files transfer over fine, but if they fail, this script can look for that specific failure and retry that set a couple times before giving up.
http://matthew.mceachen.us/geek/gigasync/gigasync.pod.html
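The batching idea described above can be sketched in a few lines. This is not gigasync itself (which is a Perl script). It is a minimal Python illustration in which the function names, the batch size and the retry count are my assumptions. The only rsync options used are -a and --files-from.

```python
import os
import subprocess
import tempfile


def list_files(root):
    """Yield file paths relative to root, walking the tree lazily."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            yield os.path.relpath(os.path.join(dirpath, name), root)


def batches(items, size):
    """Group an iterable into lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def sync_batch(paths, src, dest, retries=3):
    """Transfer one small list of files, retrying a few times on failure."""
    with tempfile.NamedTemporaryFile("w", suffix=".lst", delete=False) as fh:
        fh.write("\n".join(paths) + "\n")
    try:
        cmd = ["rsync", "-a", "--files-from=" + fh.name, src, dest]
        for _attempt in range(retries):
            if subprocess.run(cmd).returncode == 0:
                return True
        return False
    finally:
        os.unlink(fh.name)


def mirror(src, dest, batch_size=1000):
    """Mirror src to dest in small batches; return the files that never synced."""
    failed = []
    for batch in batches(list_files(src), batch_size):
        if not sync_batch(batch, src, dest):
            failed.extend(batch)
    return failed
```

Calling mirror("/data/tree/", "backup:/data/tree/") (hypothetical paths) would then hand rsync only a thousand files at a time, so its memory use stays bounded and a dropped connection costs at most one batch.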
Monday, 19 March 2012
WSEmboss has been deprecated for a year ...
http://www.ebi.ac.uk/Tools/webservices/services/emboss
Description
EMBOSS (European Molecular Biology Open Software Suite) is a free Open Source software analysis package specially developed for the needs of the molecular biology user community. The software automatically copes with data in a variety of formats and allows transparent retrieval of sequence data from the web. Extensive libraries are also provided with the package, giving scientists a platform to develop and release software in the true open source spirit. EMBOSS also integrates a range of currently available packages and tools for sequence analysis into a seamless whole.
Saturday, 17 March 2012
Article: Broad's Heng Li Wins 2012 Benjamin Franklin Award - Bio-IT World
http://www.bio-itworld.com/2012/03/14/broads-heng-li-wins-2012-benjamin-franklin-award.html
Using Excel for Bioinformatics Data: Five Issues, Five Solutions
http://info.5amsolutions.com/bid/120220/Using-Excel-for-Bioinformatics-Data-Five-Issues-Five-Solutions
Nice tips !
I had been wondering about the "Mistaken SYLK Files" issue for the longest time! I forget how I eventually solved it. Excel on Mac OS also appears to have different default behavior. On Windows I used to be able to open .csv files and get a dialog box asking which field separator the file uses (:, ;, |, tab, space).
Now I just run everything through sed to change the tabs to commas so that it works (the $'...' quoting lets bash pass a real tab to sed):
sed $'s/\t/,/g' file > file.csv
open file.csv
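One caveat with the sed approach is that a field which already contains a comma will be split into two columns. A small sketch using Python's standard csv module avoids that by quoting such fields. The function name and the sample data are made up for illustration.

```python
import csv
import io


def tsv_to_csv(tsv_text):
    """Convert tab-separated text to CSV, quoting fields that contain commas."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


sample = "gene\tnote\nBRCA1\tDNA repair, tumor suppressor\n"
converted = tsv_to_csv(sample)
print(converted)
```

The note column keeps its comma because the writer wraps that field in double quotes, which Excel understands when it opens the resulting file.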
Thursday, 15 March 2012
theBioBucket*: R-Function to Read Data from Google Docs Spreadsheets
Understanding the “improved” in VIM - Super User Blog
Wednesday, 14 March 2012
Detecting selective sweeps from pooled - PubMed Mobile
http://www.ncbi.nlm.nih.gov/m/pubmed/22411855/
Abstract Due to its cost effectiveness, next generation sequencing of pools of individuals (Pool-Seq) is becoming a popular strategy for characterizing variation in population samples. Since Pool-Seq provides genome-wide SNP frequency data, it is possible to use them for demographic inference and/or the identification of selective sweeps. Here, we introduce a statistical method that is designed to detect selective sweeps from pooled data by accounting for statistical challenges associated with Pool-Seq, namely sequencing errors and random sampling among chromosomes. This allows for an efficient use of the information : all base calls are included in the analysis, but the higher credibility of regions with higher coverage and base calls with better quality scores is accounted for.
The Pistoia Alliance Sequence Squeeze competition
Amazon vouchers
The first 40 entries received will each receive a US$20 Amazon Web Services voucher.
(Only one voucher per person.)
The volume of next-generation sequencing data is a big problem. Data volumes are growing rapidly as sequencing technology improves. Individual runs are providing many more reads than before, and decreasing run times mean that more data can today be generated by a single machine in one day than a single machine could have produced in the whole of 2005.
Storing millions of reads and their quality scores uncompressed is impractical, yet current compression technologies are becoming inadequate. There is a need for a new and novel method of compressing sequence reads and their quality scores in a way that preserves 100% of the information whilst achieving much-improved linear (or, even better, non-linear) compression ratios.
The Pistoia Alliance, in the interests of promoting pre-competitive collaboration, is putting forward a prize fund of US$15,000 to the best novel open-source NGS compression algorithm submitted before the closing date of 15 March 2012.
Follow us on Twitter - @SeqSqueeze
LKVenia: Using OS X Terminal keys on a Macbook Pro
http://lkraider.eipper.com.br/blog/2009/04/using-os-x-terminal-keys-on-macbook-pro.html
Find these keys and edit them (note that you can enter \033 by pressing the Esc key):
End - send string to shell: \033[4~
Home - send string to shell: \033[1~
Page down - send string to shell: \033[6~
Page up - send string to shell: \033[5~
Shift page down - scroll to next page in buffer
Shift page up - scroll to previous page in buffer
Easyfig | Free software downloads at SourceForge.net
Easyfig enables the creation of linear comparison figures showing BLAST matches between multiple genomic loci or prokaryote genomes. Easyfig has an easy-to-use graphical user interface and is able to launch BLAST searches interactively.
http://sourceforge.net/projects/easyfig/
WITHDRAWN: Evaluation of next-generation sequenc... [J Hum Genet. 2011] - PubMed - NCBI
WITHDRAWN: Evaluation of next-generation sequencing software in mapping and assembly.
Source
Department of Biochemistry, Center for Reproduction, Development and Growth, The University of Hong Kong, Hong Kong.
Retraction in
Abstract
Next-generation high-throughput DNA sequencing technologies have advanced progressively in sequence-based genomic research and novel biological applications with the promise of sequencing DNA at unprecedented speed. These new non-Sanger-based technologies feature several advantages, when compared with traditional sequencing methods in terms of higher sequencing speed, lower per run cost and higher accuracy. However, reads from next-generation sequencing (NGS) platforms, such as 454/Roche, ABI/SOLiD and Illumina/Solexa, are usually short, thereby restricting the applications of NGS platforms in genome assembly and annotation. We presented an overview of the challenges that these novel technologies meet and particularly illustrated various bioinformatics attempts on mapping and assembly for problem solving. We then compared the performance of several programs in these two fields and further provided advices on selecting suitable tools for specific biological applications.
Journal of Human Genetics advance online publication, 16 June 2011; doi:10.1038/jhg.2011.62.