Thursday, 30 June 2011

Ion Torrent Releases Data for 316 Chip from the genome of the E. coli DH10B laboratory strain.

As Life Tech's Ion Torrent nears an early July launch for the Ion 316 chip for its Personal Genome Machine, the company last week released a dataset for an Escherichia coli genome on its website that it generated internally on the new chip. In addition, several early-access customers — including the Human Genome Sequencing Center at Baylor College of Medicine and the Broad Institute — have tested the 316 chip, beating Ion Torrent's own R&D teams on throughput in some runs.

The E. coli dataset, approximately 150 megabases from a single run on the 316 chip and posted in the "Torrent Dev" section of the company's "Ion Community" website, comes from the genome of the E. coli DH10B laboratory strain. According to Ion Torrent, the data contains no more than 1 error at read lengths of 100 bases, and there are 69 errors in the entire genome.

The 316 chip will represent a 10-fold increase in throughput over the 314 chip that the company currently markets for the PGM. "We made enough progress on the 316 that … we feel it's a good time to put data out so folks can look at it and analyze it in different ways," said Maneesh Jain, Ion Torrent's vice president of marketing and business development.

Full article

Wednesday, 29 June 2011

Sequencing Devils to Save a Species

USA Today | Scientists from Penn State and the J. Craig Venter Institute have sequenced the genomes of Cedric and Spirit, a pair of Tasmanian Devils, in the hopes of saving the marsupial species that is threatened with extinction due to the rapid spread of an infectious facial cancer.

Tuesday, 28 June 2011

Next generation sequencing--implications for clini... [Br Med Bull. 2011] - PubMed result

http://www.ncbi.nlm.nih.gov/pubmed/21705347

Abstract

Background: Genetic testing in inherited disease has traditionally relied upon recognition of the presenting clinical syndrome and targeted analysis of genes known to be linked to that syndrome. Consequently, many patients with genetic syndromes remain without a specific diagnosis.

Areas of agreement: New 'next-generation' sequencing (NGS) techniques permit simultaneous sequencing of enormous amounts of DNA. A slew of research publications have recently demonstrated the tremendous power of these technologies in increasing understanding of human genetic disease.

Areas of controversy: These approaches are likely to be increasingly employed in routine diagnostic practice, but the scale of the genetic information yielded about individuals means that caution must be exercised to avoid net harm in this setting.

Areas timely for developing research: Use of NGS in a research setting will increasingly have a major but indirect beneficial impact on clinical practice. However, important technical, ethical and social challenges need to be addressed through informed professional and public dialogue before it finds its mature niche as a direct tool in the clinical diagnostic armoury.

PMID: 21705347 [PubMed - as supplied by publisher]

Enriching Targeted Sequencing Experiments for Rare... [Bioinformatics. 2011] - PubMed result

http://www.ncbi.nlm.nih.gov/pubmed/21700677

Abstract Next-generation targeted resequencing of GWAS-associated genomic regions is a common approach for follow-up of indirect association of common alleles. However, it is prohibitively expensive to sequence all the samples from a well-powered GWAS study with sufficient depth of coverage to accurately call rare genotypes. As a result, many studies may use next-generation sequencing for SNP discovery in a smaller number of samples, with the intent to genotype candidate SNPs with rare alleles captured by resequencing. This approach is reasonable, but may be inefficient for rare alleles if samples are not carefully selected for the resequencing experiment. We have developed a probability-based approach, SampleSeq, to select samples for a targeted resequencing experiment that increases the yield of rare disease alleles substantially over random sampling of cases or controls or sampling based on genotypes at associated SNPs from GWAS data. This technique allows for smaller sample sizes for resequencing experiments, or allows the capture of rarer risk alleles. When following up multiple regions, SampleSeq selects subjects with an even representation of all the regions. SampleSeq also can be used to calculate the sample size needed for the resequencing to increase the chance of successful capture of rare alleles of desired frequencies. Software: http://biostat.mc.vanderbilt.edu/SampleSeq chun.li@vanderbilt.edu

Exome Sequencing Reveals Comprehensive Genomic Alt... [PLoS One. 2011] - PubMed result

http://www.ncbi.nlm.nih.gov/pubmed/21701589

Abstract It is well established that genomic alterations play an essential role in oncogenesis, disease progression, and response of tumors to therapeutic intervention. The advances of next-generation sequencing technologies (NGS) provide unprecedented capabilities to scan genomes for changes such as mutations, deletions, and alterations of chromosomal copy number. However, the cost of full-genome sequencing still prevents the routine application of NGS in many areas. Capturing and sequencing the coding exons of genes (the "exome") can be a cost-effective approach for identifying changes that result in alteration of protein sequences. We applied an exome-sequencing technology (Roche Nimblegen capture paired with 454 sequencing) to identify sequence variation and mutations in eight commonly used cancer cell lines from a variety of tissue origins (A2780, A549, Colo205, GTL16, NCI-H661, MDA-MB468, PC3, and RD). We showed that this technology can accurately identify sequence variation, providing ∼95% concordance with Affymetrix SNP Array 6.0 performed on the same cell lines. Furthermore, we detected 19 of the 21 mutations reported in Sanger COSMIC database for these cell lines. We identified an average of 2,779 potential novel sequence variations/mutations per cell line, of which 1,904 were non-synonymous. Many non-synonymous changes were identified in kinases and known cancer-related genes. In addition we confirmed that the read-depth of exome sequence data can be used to estimate high-level gene amplifications and identify homologous deletions. In summary, we demonstrate that exome sequencing can be a reliable and cost-effective way for identifying alterations in cancer genomes, and we have generated a comprehensive catalogue of genomic alterations in coding regions of eight cancer cell lines. These findings could provide important insights into cancer pathways and mechanisms of resistance to anti-cancer therapy.

Monday, 27 June 2011

Note to self: CentOS Yum cache

To save bandwidth when doing multiple installs from netinstall.iso, remember to change this line in /etc/yum.conf:

keepcache=0

to

keepcache=1

This keeps all downloaded packages in the yum cache instead of deleting them after a successful install.
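
If you want to script the change across several machines, here is a minimal Python sketch. The function name `enable_keepcache` and the sample file contents are my own placeholders, and the fallback that adds the setting right under `[main]` is an assumption about how the file is laid out. It only rewrites text; writing the result back to /etc/yum.conf (as root) is left to you.

```python
import re

# Matches a whole "keepcache = <value>" line, capturing everything up to the value.
KEEPCACHE_LINE = re.compile(r"^([ \t]*keepcache[ \t]*=[ \t]*)\S+[ \t]*$", re.MULTILINE)


def enable_keepcache(conf_text):
    """Return the yum.conf text with keepcache set to 1."""
    new_text, count = KEEPCACHE_LINE.subn(r"\g<1>1", conf_text)
    if count == 0:
        # Assumption: no keepcache line exists, so add one right after [main].
        new_text = conf_text.replace("[main]", "[main]\nkeepcache=1", 1)
    return new_text


# Made-up sample contents, only for illustration.
sample = "[main]\ncachedir=/var/cache/yum\nkeepcache=0\ndebuglevel=2\n"
print(enable_keepcache(sample))
```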

Thursday, 23 June 2011

Ewan's Blog; bioinformatician at large: Five statistical things I wished I had been taught 20 years ago

link to full article

Five statistical things I wished I had been taught 20 years ago

I came through the English educational system, which meant that although I was mathematically minded, because I had chosen biochemistry for my undergraduate, my maths teaching rapidly stopped - in university I took the more challenging "Maths for Chemists" option in my first year, though in retrospect that was probably a mistake because it was all about partial differentiation, and not enough stats. Probably the maths for biologists was a better course, but even that I think spent too much time on things like t-test and ANOVA, and not enough on what you need. To my subsequent regret, no one took me aside and said "listen mate, you're going to be doing a lot of statistics, so just get the major statistical tools under your belt now".

Sunday, 19 June 2011

Short read Illumina data for the de novo assembly ... [BMC Genomics. 2011] - PubMed result

http://www.ncbi.nlm.nih.gov/pubmed/21679424

Abstract: Until recently, read lengths on the Solexa/Illumina system were too short to reliably assemble transcriptomes without a reference sequence, especially for non-model organisms. However, with read lengths up to 100 nucleotides available in the current version, an assembly without reference genome should be possible. For this study we created an EST data set for the common pond snail Radix balthica by Illumina sequencing of a normalized transcriptome. Performance of three different short read assemblers was compared with respect to: the number of contigs, their length, depth of coverage, their quality in various BLAST searches and the alignment to mitochondrial genes. A single sequencing run of a normalized RNA pool resulted in 16,923,850 paired end reads with median read length of 61 bases. The assemblies generated by VELVET, OASES, and SeqMan NGEN differed in the total number of contigs, contig length, the number and quality of gene hits obtained by BLAST searches against various databases, and contig performance in the mt genome comparison. While VELVET produced the highest overall number of contigs, a large fraction of these were of small size (< 200bp), and gave redundant hits in BLAST searches and the mt genome alignment. The best overall contig performance resulted from the NGEN assembly. It produced the second largest number of contigs, which on average were comparable to the OASES contigs but gave the highest number of gene hits in two out of four BLAST searches against different reference databases. A subsequent meta-assembly of the four contig sets resulted in larger contigs, less redundancy and a higher number of BLAST hits. Our results document the first de novo transcriptome assembly of a non-model species using Illumina sequencing data. We show that de novo transcriptome assembly using this approach yields results useful for downstream applications, in particular if a meta-assembly of contig sets is used to increase contig quality. These results highlight the ongoing need for improvements in assembly methodology.

PMID: 21679424 [PubMed - as supplied by publisher]

Rise of the machines - recommendations for ecologi... [Mol Ecol Resour. 2011] - PubMed result

http://www.ncbi.nlm.nih.gov/pubmed/21679314

Next generation sequencing is revolutionizing molecular ecology by simplifying the development of molecular genetic markers, including microsatellites. Here, we summarize the results of the large-scale development of microsatellites for 54 nonmodel species using next generation sequencing and show that there are clear differences amongst plants, invertebrates and vertebrates for the number and proportion of motif types recovered that are able to be utilized as markers. We highlight that the heterogeneity within each group is very large. Despite this variation, we provide an indication of what number of sequences and consequent proportion of a 454 run are required for the development of 40 designable, unique microsatellite loci for a typical molecular ecological study. Finally, to address the challenges of choosing loci from the vast array of microsatellite loci typically available from partial genome runs (average for this study, 2341 loci), we provide a microsatellite development flowchart as a procedural guide for application once the results of a partial genome run are obtained.

Wednesday, 15 June 2011

Rothberg Expects 400-Base Reads on Ion Torrent by Year's End; $1K Human Genome by 2013

Posted on GenomeWeb

Excerpted

Ion Torrent has been improving the Personal Genome Machine's throughput, read length, accuracy, and sample prep and believes it will achieve 400-base reads later this year and be able to sequence a human genome for $1,000 by early 2013, according to the company's founder, Jonathan Rothberg.

At the Consumer Genetics conference in Boston last week, Rothberg said that one of Ion Torrent's customers, the Broad Institute, recently beat the Life Technologies subsidiary's own throughput record for the Ion 316 chip by about 20 percent by generating almost 290 megabases of data in a single run. The 316 chip, which is currently in early-access testing and due to be released at the end of this month, has 6.1 million accessible sensors and is officially supposed to produce only about 100 megabases per run, or a million 100-base reads (IS 3/1/2011).
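
The numbers in that paragraph are easy to sanity-check. A small Python sketch of the arithmetic, using only the figures quoted above; the "implied previous record" is my own back-calculation from "about 20 percent" and "almost 290 megabases", not a number from the article.

```python
# Official spec: a million 100-base reads per run.
reads = 1_000_000
read_length = 100
spec_megabases = reads * read_length / 1e6

# Broad's run beat the in-house record by about 20 percent with almost 290 Mb.
broad_run_mb = 290
improvement = 0.20
implied_previous_record_mb = broad_run_mb / (1 + improvement)  # back-calculated, approximate

# Share of the 6.1 million accessible sensors that would need to yield a read at spec.
sensors = 6_100_000
fraction_of_sensors = reads / sensors

print(f"Spec throughput: {spec_megabases:.0f} Mb per run")
print(f"Implied previous record: about {implied_previous_record_mb:.0f} Mb")
print(f"Reads per accessible sensor at spec: {fraction_of_sensors:.2f}")
```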

Saturday, 11 June 2011

microRNA sequencing for Bird song response (not kidding)

A microRNA response to birdsong
Claudio V Mello and Peter V Lovell
The male zebra finch is one of the few animals other than humans that communicate through complex vocal signals. Claudio Mello and Peter Lovell comment on genomic approaches to unravelling the neural basis of song-learning, including a paper in BMC Genomics that profiles the microRNA response to conspecific birdsong.

Friday, 10 June 2011

BWA to support multiple hits as separate lines in SAM with addon pl script

This is the reason why I love open source communities and software.
After a brief discussion and a request for BWA to also report multiple hits as separate entries in SAM/BAM files, the author of BWA (Li Heng) promptly released an add-on Perl script that provides this feature.

Commercial providers: try to beat that for speed of a new feature release!

Anyway, if you are interested in the usage:

A new script xa2multi.pl is added to convert XA:Z tag to multiple lines.

bwa samse ref.fa reads.sai reads.fq.gz | xa2multi.pl > out.sam
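
To show what the conversion involves: BWA writes alternative hits into the XA:Z tag as chr,pos,CIGAR,NM entries separated by semicolons, with the sign in front of pos giving the strand. The Python sketch below is not a port of xa2multi.pl. It only lists the primary hit and each XA hit as tuples rather than writing full SAM records, and the function name and the example read are made up for illustration.

```python
def expand_xa(sam_line):
    """Return (rname, strand, pos, cigar, nm) tuples: the primary hit, then each XA hit."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    strand = "-" if flag & 0x10 else "+"  # 0x10 = read mapped to the reverse strand

    nm = None
    xa = None
    for tag in fields[11:]:  # optional tags start at column 12
        if tag.startswith("NM:i:"):
            nm = int(tag[5:])
        elif tag.startswith("XA:Z:"):
            xa = tag[5:]

    hits = [(fields[2], strand, int(fields[3]), fields[5], nm)]
    if xa:
        for entry in xa.split(";"):
            if not entry:  # the tag ends with a trailing ';'
                continue
            chrom, pos, cigar, alt_nm = entry.split(",")
            hits.append((chrom, pos[0], int(pos[1:]), cigar, int(alt_nm)))
    return hits


# Hypothetical SAM record with two alternative hits.
example = "read1\t0\tchr1\t100\t0\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\tXA:Z:chr2,-200,4M,1;chr3,+300,4M,2;"
for hit in expand_xa(example):
    print(hit)
```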

A related question was also posted on Biostars:

Question: How to force 'bwa samse' to output multiple hits in .sam format?
http://www.biostars.org/p/45430/

"Introduction to Ion Torrent Training Webinar Series" Recordings Now Available

K: What's inside the Ion Community?

Thank you for registering for the "Introduction to Ion Torrent Training Webinar Series". The recorded versions of these webinars are now available on the Ion Community.

Below is the list of training webinars available and a link to the Ion Community for viewing them.

Webinar Title                                                    Date
Introduction to Semiconductor Sequencing                         27-April
Introduction to Ion Torrent Informatics                          4-May
Technical Performance of Ion Torrent Semiconductor Sequencing    11-May

Also, here is the data that made headlines recently:

Download the E. coli O104:H4 strain data.

E_coli_O104H4_FASTQ.zip

E_coli_O104H4_SFF.zip

Thursday, 9 June 2011

BlueSEQ- an excellent place to start understanding NGS

The folks at BlueSEQ have nicely summarized which applications each platform is best suited for in an easy-to-understand table. (Your views may vary, especially if you are aligned with one of the platforms.)

There is also a glossary that is helpful for explaining terms to newbies (read: I will point students to this link instead of having to explain them myself).

Alerted to this resource by this post

Goldmine of unbiased expert knowledge on next generation sequencing

Ion Torrent, de novo assembly and a nasty bug

Nick Loman's blog post looking at de novo assembly with Ion Torrent data got over 2,000 views within a short time, attesting to the interest that everyone has in de novo assembly or Ion Torrent data.

His post

Ion Torrent data blog post; a week is a long time in genomics

also touches on N50 values, a common 'metric' for de novo assembly
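
For anyone new to the term: N50 is the contig length at which the contigs of that length or longer together cover at least half of the total assembly. A minimal Python sketch, with made-up contig lengths and a function name of my own choosing:

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half of the assembly is covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("need at least one contig")


# Made-up assembly: total 290 bp, so half is 145 bp; 80 + 70 = 150 reaches it.
contigs = [80, 70, 50, 40, 30, 20]
print(n50(contigs))
```

A high N50 can hide a lot (a few long contigs plus a pile of tiny ones), which is why it is best read alongside the total assembly size and the number of contigs.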

CLC Bio wrote a press release based on his blog post.

However, he also pointed out MIRA as an excellent free assembler.

Notable Tweets from Applied Bioinformatics & Public Health 2011

credits to http://pathogenomics.bham.ac.uk/blog/2011/06/all-the-tweets-from-abph11/

pathogenomenick: You will struggle to identify many species by 16S – particularly Streptococcus – even with full-length sequences #ABpH11

pathogenomenick: William Wade has a very nice database www.homd.org – 619 bact "species" represented, 66% of those cultured. 113 un-named. #ABPH11

pathogenomenick: Actinobacteria are underrepresented in 16S clone libraries: why – lysis? high GC? primers turned out to be the major problem #ABpH11

aunderwo: William Wade: Both bacterial culture and PCR amplification of 16S rRNA gene introduce their own biases when examining microbiomes #ABPH11

aunderwo: William Wade: By 16S rDNA sequencing found 50% of oral microbiome is unculturable #ABPH11

aunderwo: Whole transcriptome RNA sequencing will be possible with Ion Torrent 'soon' according to their rep #ABPH11

pathogenomenick: The Broad have done some early experiments with mate-pair sequencing on Ion Torrent using insert lengths of around 1.5kb #ABPH11
aunderwo: Mate paired libraries with 1.5kb inserts have been achieved with Ion Torrent PGM #ABPH11

aunderwo: At volume it is now possible to sequence a human genome for $4k using Illumina HiSeq #ABPH11
pathogenomenick: HiSeq 2000 – 8 human genomes per Tb or 8000 bacterial genomes per run. I know what I'd prefer! #ABPH11

avilella: Illumina HiSeq now: 600Gb per run. Latest R&D number: more than 1Tb per run #ABPH11

pathogenomenick: Super deep sequencing of KRAS allows detection of 1.1% variant frequency using MiSeq. This is going to take over cancer screening. #ABPH11

aunderwo: Possible to find SNPs involved in drug resistance in TB strains using MiSeq sequencing #ABPH11
jennifergardy: I'm switching my Christmas wish from a pony to an Illumina MiSeq. If they could throw in a pony w/ the machine, that would be great #ABPH11

jacarrico: Very nice examples of use for microbiology using miseq: TB, pseudomonas, ecoli sequencing #ABPH11
pathogenomenick: Presenting CF sputum metagenomics using HiSeq., PA LESB58 came out at 636x depth, plus phage & 7 other genomes 50-86% covered #ABPH11

pathogenomenick: After depleting human DNA, CF sputum DNA is >70% bacterial #ABPH11
jacarrico: Hiseq for metagenomics / Miseq to characterize individual isolates #ABPH11
fionabrinkman: MiSeq can seq P. aeruginosa LES genome accurately vs ref. which is good since large, high G+C. Using HiSeq for metagenomics though #ABPH11

avilella: The MiSeq pipeline will run the latest version of Velvet assembler as you can find at dzerbino's website. Nothing closed and canned. #ABPH11

fionabrinkman: @pathogenomenick yes! Next Star Trek movie must show shots of PacBio sequencing #ABPH11

pathogenomenick: OK, going to talk about Haiti cholera outbreak now. 12-fold genome coverage achieved in 90 minutes. Wow, those bugs are in log phase #ABPH11

aunderwo: FLX+ has a modal read length of 700bp – approaching read lengths of Sanger sequencing. Base accuracy 99.99%+ with 15x coverage #ABPH11
avilella: 454 FLX Plus modal 700bp, 85% above 500bp, total 700MB per run, 23 hours, accuracy a couple of 10^-5 extra pc points: 99,997% #ABPH11
jacarrico: 454 GS Flex + has 80% of reads greater than 500 bp and up to 1kbp. #ABPH11

lexnederbragt: RT @pathogenomenick: PacBio gives more even coverage of genome compared to Illumina – this is due to amplification bias. Models Poisson very well. #ABPH11

lexnederbragt: We too! MT @pathogenomenick: 454 8kb PE data can produce single scaffolds for S. pneumoniae, E. coli, (it's true, we've done it too) #ABPH11

aunderwo: João André Carriço: An ontology and REST API for microbial typing
Paper: http://bit.ly/jm7XDT Ontology: http://bit.ly/jzApXB #ABPH11
pathogenomenick: RT @aunderwo: João André Carriço: An ontology and REST API for microbial typing
pathogenomenick: Has developed a RESTful MLST web interface. This is great. We just need it for next-gen now. #ABPH11

#ABPH11 http://rest.phyloviz.net/webui/

pathogenomenick: Developed data visualisation software called Phyloviz: http://bit.ly/isNJgj handles ST data, SNP data, looks pretty #ABPH11
pathogenomenick: .@jacarrico makes a compelling case for open data in molecular typing. What a shame it is not embraced by wider community #ABPH11
marina_manrique: Nick Loman @pathogenomenick starts the pipeline session. xBASE-NG A web interface for rapid analysis of bacterial genomes #ABPH11
aunderwo: Nick Loman: Web interface for WGS analysis http://ng.xbase.ac.uk/my/ #ABPH11

aunderwo: Nick Loman: Use Illumina sequence to correct homopolymeric tracts in 454 scaffolds #ABPH11
jacarrico: Nick Loman – illumina corrected 133 putative erros in 454 assembly #ABPH11
marina_manrique: Once more: the importance of hybrid assemblies (in this case #454 & #illumina) for correcting seq errors @pathogenomenick #ABPH11 #ngs

aunderwo: Marina Manrique: An annotation pipeline for NGS genome data http://www.era7bioinformatics.com/en/prokaryote_genome_annotation.html #ABPH11
jacarrico: www.ohnosequences.com – great name for a sequence assembler based on protein similarity #ABPH11

pathogenomenick: Mossong: Was alarmed when he got 1Gb files per each MRSA strain sequenced, compared with 7 bytes for MLST! #ABPH11
marina_manrique: Great! some info about the kind of technology used in Joël Mossong talk at #ABPH11 Illumina 2x100bp 80x for MSRA genomes
jacarrico: Interesting spa type vs SNP typing max parsimony tree comparison #ABPH11
aunderwo: Joel Mossong: using 85x coverage illumina data could extract MLST profiles from 36/40 strains. Puzzled about missing 4? #ABPH11
aunderwo: RT @pathogenomenick: Mossong: Was alarmed when he got 1Gb files per each MRSA strain sequenced, compared with 7 bytes for MLST! #ABPH11
marina_manrique: Another idea I've liked at Joël Mossong talk: WGS data should not be limited to SNP analysis, Mobile elements also play a role! #ABPH11

aunderwo: Marcus Claesson: comparing 454 and Illumina data for classifying bacteria using 16S. 454 outperforms Illumina #ABPH11
aunderwo: Marcus Claesson: Metagenomics – long reads of 454 give better classification , more data from Illumina => more OTUs detected #ABPH11
fionabrinkman: Claesson: 454 better vs Illumina for 16S seq (see http://goo.gl/u4CDD) but >60bp Illumina reads really helps & primer choice key #ABPH11

Wednesday, 8 June 2011

CloudAligner: A fast and full-featured MapReduce based tool for sequence mapping.

Abstract

BACKGROUND:

Research in genetics has developed rapidly recently due to the aid of next generation sequencers (NGS). However, they produce much data, which leads to storage, compatibility, scalability, and performance issues. The Cloud Computing and MapReduce framework appears to be the best solution for these issues. Consequently, it has been adopted by many organizations recently, and the initial results are very promising. However, since these are only initial steps toward this trend, the developed software does not provide adequate primary functions like bisulfite, pair-end mapping, etc. in on-site software such as RMAP or BS Seeker. In addition, existing MapReduce-based applications were not designed to process the long reads produced by the NGS and, therefore, are inefficient. Last, it is difficult for biologists to use these tools because most were developed on Linux with a command line interface.

RESULTS:

To advocate the trend of using Cloud technologies in genomics and prepare for the next generation of sequencing, we have built a Hadoop MapReduce-based application, CloudAligner, which achieves higher performance, covers most primary features, and is more accurate with a friendly interface. It was also designed to be able to deal with long sequences. The performance gain over Cloud-based counterparts (35 to 80%) mainly comes from the omission of the reduce phase. In comparison to local-based applications, the performance gain is from the partition and parallel processing of the huge reference genome as well as the reads. CloudAligner source code is available at http://cloudaligner.sourceforge.net/ and a web version of CloudAligner is at http://mine.cs.wayne.edu:8080/CloudAligner/.

CONCLUSIONS:

Our results show that CloudAligner is faster than CloudBurst, provides more accurate results than RMAP, and supports various input as well as output formats. In addition, with the web-based interface, it is easier to use than its counterparts.

PMID: 21645377 [PubMed - as supplied by publisher]

PileLineGUI: a desktop environment for handling genome position files in next-generation sequencing studies.

Abstract

Next-generation sequencing (NGS) technologies are making sequence data available on an unprecedented scale. In this context, new catalogs of Single Nucleotide Polymorphism and mutations generated by resequencing studies are usually stored in genome position files (e.g. Variant Call Format, SAMTools pileup, BED, GFF) comprising large lists of genomic positions, which are difficult for researchers to handle. Here, we present PileLineGUI, a novel desktop application primarily designed for manipulating, browsing and analysing genome position files (GPF), with specific support to somatic mutation finding studies. The developed tool also integrates a new genome browser module specially designed for inspecting GPFs. PileLineGUI is free, multiplatform and designed to be intuitively used by biomedical researchers. PileLineGUI is available at: http://sing.ei.uvigo.es/pileline/pilelinegui.html.

PileLine GUI is a front-end of the PileLine toolkit, plus a genome browser. With this intuitive graphical desktop application you can run the following tasks:

Run processing commands on GP files, such as seek, join, annotate and filter.
Perform 2-sample and n-sample point somatic mutation calling (via the PileLine 2smc and nsmc commands).
Browse GP files in an interactive local genome browser.

You can download PileLine GUI from Downloads.

General scheme of the PileLine GUI software.

PileLine GUI's interactive genome browser.

PileLine GUI showing an instantly-navigable .pileup file.

Assemblathon 2 Challenges Informatics Experts with Vertebrate Genome Data, Two Sequencing Platforms

Excerpted from GenomeWeb

Organizers of the Assemblathon genome assembly competition last week launched the second round of the effort, posting sequence data for three vertebrate genomes generated on two next-gen sequencing platforms. Results from this round of the competition, called Assemblathon 2, are expected in early November.

Full article

Monday, 6 June 2011

Technical variability is too high to ignore | RNA-Seq Blog

OK, this is disturbing.
Will research more given time.
Technical variability is too high to ignore. Technical variability results in inconsistent detection of exons at low levels of coverage. Further, the estimate of the relative abundance of a transcript can substantially disagree, even when coverage levels are high. This may be due to the low sampling fraction and if so, it will persist as an issue needing to be addressed in experimental design even as the next wave of technology produces larger numbers of reads. Practical recommendations for dealing with the technical variability, without dramatic cost increases are provided. McIntyre, LM et al. (2011) RNA-seq: technical variability and sampling. BMC Genomics [Epub ahead of print]. [article]
http://rna-seqblog.com/publications/technical-variability-is-too-high-to-ignore/

RNA-seq : technical variability and sampling.

BMC Genomics. 2011 Jun 6;12(1):293. [Epub ahead of print]

McIntyre LM, Lopiano KK, Morse AM, Amin V, Oberg AL, Young LJ, Nuzhdin SV.

Abstract

BACKGROUND:

RNA-seq is revolutionizing the way we study transcriptomes. mRNA can be surveyed without prior knowledge of gene transcripts. Alternative splicing of transcript isoforms and the identification of previously unknown exons are being reported. Initial reports of differences in exon usage, and splicing between samples as well as quantitative differences among samples are beginning to surface. Biological variation has been reported to be larger than technical variation. In addition, technical variation has been reported to be in line with expectations due to random sampling. However, strategies for dealing with technical variation will differ depending on the magnitude. The size of technical variance, and the role of sampling are examined in this manuscript.

RESULTS:

In this study three independent Solexa/Illumina experiments containing technical replicates are analyzed. When coverage is low, large disagreements between technical replicates are apparent. Exon detection between technical replicates is highly variable when the coverage is less than 5 reads per nucleotide and estimates of gene expression are more likely to disagree when coverage is low, although large disagreements in the estimates of expression are observed at all levels of coverage.

CONCLUSIONS:

Technical variability is too high to ignore. Technical variability results in inconsistent detection of exons at low levels of coverage. Further, the estimate of the relative abundance of a transcript can substantially disagree, even when coverage levels are high. This may be due to the low sampling fraction and if so, it will persist as an issue needing to be addressed in experimental design even as the next wave of technology produces larger numbers of reads. We provide practical recommendations for dealing with the technical variability, without dramatic cost increases.

PMID: 21645359 [PubMed - as supplied by publisher]

Saturday, 4 June 2011

Important footnotes in Ion Fragment Library Preparation Kit

The service provider conferred with Ion and came back with the fact my library was too big; the protocols are designed for inserts smaller than 150 bp, and my amplicons were carefully designed to be 150-205 in size.
.....
Ion Fragment Library Preparation kit. It even has a very helpful appendix on Amplicon Sequencing. The only real hint about the size limitation is the statement that "Target regions from 75 to 150 nucleotides in length must be sequenced bidirectionally". Clearly this is insufficiently emphatic! In the fragment library preparation information, the more serious warning does appear: "Libraries with a mean size >~220 bp yield results of reduced sequencing quality" (and that size is after adding adapters).

Kevin: I just love how small details like these screw up your experiments. Be Forewarned!

from

Paying a Painful 75% Secrecy Tax

In a post a while back, I mentioned that my Ion Torrent sequencing project was stalled because my service provider couldn't get some of the key kits, despite an Ion representative posting that no such shortages existed. I've been remiss in updating that; last Tuesday the kits showed up and Monday I got my data -- and a bit of a shock.

Posting of Ion Torrent protocols online is a violation of Terms and Conditions

http://seqanswers.com/forums/showthread.php?t=10400

I just learned the rather disturbing fact that the SEQanswers admins were asked (nicely) to take down protocols for the Ion Torrent that had been posted online.

I wished to post the adaptor sequences for RNA multiplex libraries online before as a help for bioinformaticians that might have gotten their data from a service provider or have problems getting a prompt response from the ever friendly FAS. I mean if it's online, I need not bother them yeah?

Now I wonder if I might be violating terms and conditions somewhere out there.

I would argue for posting protocols online.
Lab protocols are meant to be optimised in every lab.
Case in point? Posting them promotes active discussion of the product, and once you have that, you have an active support community that beats a whole army of FAS with trained responses to problems in protocols.
See this imaginary conversation:

Researcher A: making that incubation step longer for 10 secs improves your yield? good for you! but it didn't work for me, any advice on where else I can do it?
Researcher B: yeah sure, you see page 15 step 8A? don't overdo that step as it affects yield, but be warned it might affect the quality of the final output. let's solve one problem at a time. I tried that last week!

Agilent grants for systems biology software development

RE: Agilent grants for systems biology software development

Dear Kevin,
I am writing to you on behalf of Leo Bonilla, Director of Marketing for Integrated Biology, Agilent Technologies, Inc. Leo and the Integrated Biology team at Agilent have been reading your blog, My Weblog on Bioinformatics, Genome Science, Next Generation Sequencing, and thought you may be interested in sharing a funding opportunity with your readers. Agilent is fostering integrated, whole-systems approaches to biological research through two $75,000 US grants (application deadline August 12, 2011). Funds will support academic or nonprofit research projects covering the development of open source software tools for integrating data from different omics platforms—genomics, transcriptomics, proteomics, and metabolomics. For full details on eligibility, submission, and review process, please visit www.Agilent.com/lifesciences/emerginginsights.
If you have any questions or would like to interview Leo about the grant program, I’d be happy to set up a phone call. Just reply to my email and I’ll connect you with Leo.

Readers, if you have any questions, post them in the comments and I shall pass them on :)

Integrated Biology - eMerging Insights Grants

Fostering integrated, whole-systems approaches to biological research with two $75,000US grants for open source data-integration tool development
The different omics platforms (genomics, transcriptomics, proteomics and metabolomics) are generating new insights into how biological systems work at a molecular level. Although each individual omics approach provides a global view of a specific cellular process, this view is limited to only one aspect of the biological system. In order to gain a comprehensive understanding of the system as a whole, researchers are faced with the challenge of merging these very different data sets.
Agilent is supporting scientists who are taking on this challenge through our eMerging Insights Grant Program. We currently have two open initiatives for academic and non-profit researchers developing and/or improving open source, Agilent-compatible software tools to integrate multi-omics data. Each initiative will provide $75,000US to a single academic or non-profit research lab in fiscal year 2011. A proof-of-concept prototype or working solution must be demonstrated at the end of one year, using either existing data sets from the investigator’s own lab or institution, or from new or existing datasets produced at Agilent.
One of the most important outcomes of our eMerging Insights Grant Program is the development of open source* solutions for the analytical life science community. Any tools developed with this funding will be freely available, open source tools for the research community.
The submission deadline for these two initiatives is August 12, 2011.
Awards will be announced September 30, 2011.
*All free or open source licenses are acceptable except "any license requiring, as a condition of use, modification and/or distribution of the software subject to the license, that the software or other software combined and/or distributed with it be (i) disclosed or distributed in source code form; (ii) licensed for the purpose of making derivative works; or (iii) redistributable at no charge. Excluded licenses include, but are not limited to, the GPLv3 License."

Download Application

Friday, 3 June 2011

Ion Torrent in the limelight for sequencing E coli strain of outbreak in Germany

June 2, 2011 | Researchers at BGI (formerly the Beijing Genomics Institute) in Shenzhen, China, have sequenced the strain of Escherichia coli bacterium responsible for the deadly outbreak in Germany this week that has claimed at least 17 lives and infected more than 1,000 people across Europe, with symptoms including kidney failure and bloody diarrhea.

BGI completed the sequencing of the E. coli samples within three days, using the relatively new Ion Torrent platform (owned by Life Technologies). This third-generation sequencing platform has the advantage of speed of sequencing and relatively long read lengths, which is useful for sequencing and identifying novel bacterial strains.

Bioinformatics analysis showed that the strain at the center of the latest outbreak is highly infectious and toxic. According to the results of the draft assembly (available at ftp://ftp.genomics.org.cn/pub/Ecoli_TY-2482), the estimated genome size of this new E. coli strain is about 5.2 megabases (Mb). The bacterium is an EHEC (enterohemorrhagic) serotype O104 E. coli strain, but a new serotype that has not been previously associated with any E. coli outbreaks.

Comparative sequence analysis showed that this bacterium has 93% similarity with the EAEC (enteroaggregative) 55989 E. coli strain, previously isolated in central Africa and linked to cases of serious diarrhea. The new European strain of E. coli also features DNA sequences related to those involved in the pathogenicity of hemorrhagic colitis and hemolytic-uremic syndrome, potentially acquired through horizontal gene transfer. The genome also carries a number of antibiotic resistance genes, including resistance to aminoglycosides, macrolides and beta-lactam antibiotics.

BGI and its collaborators are studying the bacterial virulence genes, expression profiles, drug resistance, and gene transfer mechanisms. It also hopes to develop diagnostic kits. The sequences of this new E. coli strain have been uploaded to NCBI (SRA No: SRA037315.1).

link

Ion Torrent Ships New RNA Sequencing Application as Fast, Easy, Affordable Alternative to Microarrays | Life Technologies

http://www.lifetechnologies.com/news-gallery/press-releases/2011/ion-torrent-ships-new-rna-sequencing-application-as-fast.html

The Ion RNA-Seq is making RNA sequencing accessible to every scientist in world, with a fast, sim can buy and use: The workflow is fast: Single-day workflow and sequencing that takes about an hour, comp some next-gen sequencers. The price is affordable: The Ion PGM sequencer costs just $49,500, less than a microarray generation platform, and Ion chips start at $250. Ion RNA-Seq delivers a low per-sample pool hundreds of samples. The workflow is easy: Any molecular biology lab can quickly master the workflow, and dat minutes with intuitive tools like Partek, which are familiar to microarray users. The platform is scalable: Change the number of reads by just changing the chip. The read lengths are long: At least 100 bp today, 200 bp by the end of 2011 and 400 bp in splice junctions

Sent from an Android.

Thursday, 2 June 2011

[BioRuby] Parsing large Blast xml files - a new bioruby plugin

Benchmarked and Featured on "Bioruby Mailing List"

Hi Rob,

https://github.com/pjotrp/blastxmlparser

_________________________________________
http://lists.open-bio.org/mailman/listinfo/bioruby
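For readers who are not working in Ruby, the general problem in the post title (reading a BLAST XML file too large to load at once) can be approached by streaming the file one `Iteration` at a time. The sketch below uses Python's standard library and is an illustration of that general technique only; it is not the blastxmlparser plugin's API or implementation, and the function name is made up. It assumes the classic NCBI BLAST XML layout (`Iteration`, `Iteration_query-def`, `Hit`, `Hit_id`, `Hsp_evalue`).

```python
import xml.etree.ElementTree as ET


def iter_blast_hits(source):
    """Yield (query_def, hit_id, best_evalue) for every hit in a BLAST XML file.

    source is a filename or a binary file object. The file is parsed
    incrementally, so only one Iteration is held in full at a time.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "Iteration":
            continue
        query_def = elem.findtext("Iteration_query-def")
        for hit in elem.iterfind("Iteration_hits/Hit"):
            evalues = [
                float(node.text)
                for node in hit.iterfind("Hit_hsps/Hsp/Hsp_evalue")
            ]
            best = min(evalues) if evalues else None
            yield query_def, hit.findtext("Hit_id"), best
        # Drop the children of the finished Iteration to keep memory use low.
        elem.clear()
```

A typical loop would be `for query, hit_id, evalue in iter_blast_hits("results.xml"): ...`, where `results.xml` is a placeholder filename.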

Revealing impaired pathways in the an11 mutant by ... [Plant J. 2011] - PubMed result

http://www.ncbi.nlm.nih.gov/pubmed/21623977

Rapid screening of complex DNA samples by single-m... [PLoS One. 2011] - PubMed result

http://www.ncbi.nlm.nih.gov/pubmed/21625543

Abstract Microbial cloning makes Sanger sequencing of complex DNA sample intensive. We present a simple, rapid and robust method that enable special equipment to perform single-molecule amplicon sequencing, throughput manner, from sub-picogram quantities of DNA. The meth quick quality control of next-generation sequencing libraries, as was metagenomic sample. PMID: 21625543 [PubMed -in process]

SVA: Software for Annotating and Visualizing Seque... [Bioinformatics. 2011] - PubMed result

http://www.ncbi.nlm.nih.gov/pubmed/21624899

Abstract Here we present SVA, a software tool that assigns a predicted biological function to variants identified in next-generation sequencing studies and provides a browser to visualize the variants in their genomic contexts. SVA also provides for flexible interaction with software implementing variant association tests allowing users to consider both the bioinformatic annotation of identified variants and the strength of their associations with studied traits. We illustrate the annotation features of SVA using two simple examples of sequenced genomes that harbor Mendelian mutations. Availability and Implementation: Freely available on the web at http://www.svaproject.org. (For direct reviewer access, please visit: http://www.svaproject.org/directaccess.php) d.ge@duke.edu Available at the journal's website. PMID: 21624899 [PubMed -as supplied by publisher]

Evaluating the fidelity of de novo short read meta... [PLoS One. 2011] - PubMed result

http://www.ncbi.nlm.nih.gov/pubmed/21625384

in metagenomic data analysis comprises the assembly of the sequenced assembly tools have been published in the last years targeting data coming from on sequencing (NGS) technologies but these assemblers have not been designed n multi-genome scenarios that characterize metagenomic studies. Here we cal assessment of current de novo short reads assembly tools in multi-genome ng complex simulated metagenomic data. With this approach we tested the erent assemblers in metagenomic studies demonstrating that even under the positions the number of chimeric contigs involving different species is e further showed that the assembly process reduces the accuracy of the ssification of the metagenomic data and that these errors can be overcome verage of the studied metagenome. The results presented here highlight the iculties that de novo genome assemblers face in multi-genome scenarios g that these difficulties, that often compromise the functional classification of data, can be overcome with a high sequencing effort.

Wednesday, 1 June 2011

Genetics tests 'no better than flipping a coin'

Genetics tests flawed and inaccurate, say Dutch scientists

Investigation found they gave wildly different results and arrived at predictions that were no better than flipping a coin

Excerpted from guardian.co.uk,

Personalised health tests that screen thousands of genes for versions that influence disease are inaccurate and offer little, if any, benefit to consumers, scientists claimed on Monday.

An investigation into the services found they gave wildly different results, and in some cases arrived at medical predictions that were no better than flipping a coin.

The findings of the Dutch study will bolster calls for tighter regulations around personalised genetics tests which can cost more than £500. Critics claim the tests are a waste of money that could mislead people about their future health.

According to the group at Erasmus University medical centre in Rotterdam, tests from rival companies predicted conflicting risks for some diseases, often because they disagreed on how common the conditions were in the general population.

Another flaw was that the tests looked only at genetic factors, whereas many diseases are governed more by lifestyle and other environmental factors.

In the study, researchers used a computer to simulate genetic information for 100,000 typical people. They then used formulas from two of the largest genetic testing companies, deCODEme and 23andMe, to predict the risk of eight medical conditions, including heart attack, prostate cancer, coeliac disease, an eye disease known as age-related macular degeneration and diabetes.
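The point about companies disagreeing on how common a condition is can be made concrete with a toy calculation. The Python sketch below is a deliberately simplified model, not the formulas used by deCODEme, 23andMe, or the Erasmus study: it assumes a predicted risk is just an assumed population-average risk multiplied by a genotype's relative risk, and every number in the usage note is hypothetical.

```python
def absolute_risk(average_risk, relative_risk):
    """Toy model: population-average risk scaled by a genotype's relative risk.

    Both inputs are assumptions for illustration; the result is capped at 1.0
    because a probability cannot exceed certainty.
    """
    return min(1.0, average_risk * relative_risk)


def compare_predictions(relative_risk, average_risk_a, average_risk_b):
    """Return the two predictions for the same person under two baselines."""
    return (
        absolute_risk(average_risk_a, relative_risk),
        absolute_risk(average_risk_b, relative_risk),
    )
```

With the same hypothetical relative risk of 1.5, `compare_predictions(1.5, 0.20, 0.30)` gives roughly 0.30 under one baseline and roughly 0.45 under the other, so identical genetic data can lead to noticeably different reported risks.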

Datanami, Woe be me