Wednesday, 28 September 2011

[bedtools-discuss] pybedtools: a flexible Python library for manipulating genomic datasets and annotations

Hi all -

Version 0.5 of pybedtools is released.  Pybedtools is an interface to BEDTools using the Python programming language.  In addition to wrapping all the BEDTools programs (including the latest multiBamCov, tagBam, and nucBed programs) and making them accessible from within Python, it extends BEDTools by allowing feature-by-feature manipulation of BED/GFF/GTF/BAM/SAM/VCF files.

There's lots more that pybedtools provides . . . as a brief example, here's the complete code that identifies genes that are <5kb from intergenic SNPs, given a file of genes and a file of SNPs:

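from pybedtools import BedTool
# Load the SNPs and the gene annotations
snps = BedTool('snps.bed.gz')
genes = BedTool('hg19.gff')
# SNPs that do not overlap any gene (a subtractBed call)
intergenic_snps = (snps - genes)
# For each gene, report the closest intergenic SNP; d=True appends the distance
nearby = genes.closest(intergenic_snps, d=True, stream=True)
for gene in nearby:
    # The distance is the last field of each result line
    if int(gene[-1]) < 5000:
        print(gene.name)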

Note the (snps - genes) line, which does a subtractBed call, and the feature-level access to results from closest(), which wraps BEDTools' closestBed program.  How this compares to Bash and BEDTools programs alone is left as an exercise to the reader . . . or you can just check

You can get a brief overview of pybedtools in:

Pybedtools: a flexible Python library for manipulating genomic datasets and annotations
Ryan K. Dale, Brent S. Pedersen, and Aaron R. Quinlan
Bioinformatics (2011) first published online September 23, 2011

You can get more details, including installation instructions, in the documentation at

The latest source can always be found on github:

Comments, bug reports, bug fixes, and suggestions are always welcome -- either through the github interface or via email.

happy intersecting, 


Tuesday, 27 September 2011

How to run SGE jobs (comparison to PBS scripts)

If you didn't know already, the Torrent Server uses SGE. For those of us who are more used to PBS or LSF, the guides below might help you walk through some of the commands if you need to, say, create your own reference index or run tmap on the TS.


Array jobs for clusters running SGE

Converting between PBS and SGE scripts
The above URL has a fantastic conversion table that lists options that are new to me as well!
For a more concise, quick-and-dirty guide, see here

Monday, 26 September 2011

Early MiSeq Users Say Data Quality Matches HiSeq; Cite Speed as Advantage for Range of Applications | In Sequence | Sequencing | GenomeWeb

The Broad Institute is cited as using it daily.
Sequence quality nearly matches that of the HiSeq; it costs more per base, though it produces data faster.
I don't have any first-hand accounts of MiSeq data, even though a friend has asked if I know anyone who might be able to serve as a consultant for a sequencing provider company in China using MiSeq. I also wonder whether BGI has early access to MiSeq as well.

According to Nusbaum, the instrument is "pretty easy" to use, runs fast, and provides high-quality data, although at a greater cost per base than the HiSeq. It has been running according to Illumina's specifications, he said, and so far, there have been no serious problems with the machine.
According to Illumina, MiSeq produces more than 120 megabases of data with 35-base reads in four hours, and more than 1 gigabase of data with paired 150-base reads in 27 hours, including amplification and sequencing, and the number of unpaired reads exceeds 3.4 million.
The base accuracy of the data "is similar to what we see for the HiSeq," Nusbaum said. Toward the ends of the reads, the quality is even slightly higher than for HiSeq, probably because the sample spends less time on the machine.
Initially, the Broad plans to use the platform for "any kind of urgent project where turnaround time trumps cost of the data," Nusbaum said. This includes, for example, R&D projects, because "you get your answer in a day rather than in a week and a half."
In addition, projects that "fit nicely onto a small platform" will be run on MiSeq in the future at the Broad; these could include, for example, viral and microbial sequencing projects.
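As a quick sanity check on the Illumina figures quoted above, the read counts follow from dividing the total yield by the bases per read (or per read pair). This is only back-of-the-envelope arithmetic; the function name is made up for illustration, and "120 megabases" and "1 gigabase" are taken as exact round numbers.

```python
def reads_needed(total_bases, read_length, paired=False):
    """Number of reads (or read pairs) needed to produce total_bases."""
    bases_per_fragment = read_length * (2 if paired else 1)
    return total_bases / bases_per_fragment


# 120 megabases of unpaired 35-base reads
single = reads_needed(120e6, 35)
# 1 gigabase of paired 150-base reads
pairs = reads_needed(1e9, 150, paired=True)

print(f"{single / 1e6:.2f} million 35-base reads")
print(f"{pairs / 1e6:.2f} million 150-base read pairs")
```

The first number lands just above the "3.4 million" unpaired reads in the quote, so the two figures are at least consistent with each other.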

Ion Torrent vs MiSeq & GS FLX+ - SEQanswers


For Ion Torrent vs MiSeq, read this in-depth independent analysis

For Ion Torrent vs 454


Omics! Omics!: Ion Throws A Long Punch At MiSeq




Thursday, 22 September 2011

Phased whole-genome genetic risk in a family quartet using a major allele reference sequence.

1. Phased whole-genome genetic risk in a family quartet using a major allele reference sequence.
Dewey FE, Chen R, Cordero SP, Ormond KE, Caleshu C, Karczewski KJ, Whirl-Carrillo M, Wheeler MT, Dudley JT, Byrnes JK, Cornejo OE, Knowles JW, Woon M, Sangkuhl K, Gong L, Thorn CF, Hebert JM, Capriotti E, David SP, Pavlovic A, West A, Thakuria JV, Ball MP, Zaranek AW, Rehm HL, Church GM, West JS, Bustamante CD, Snyder M, Altman RB, Klein TE, Butte AJ, Ashley EA.
PLoS Genet. 2011 Sep;7(9):e1002280. Epub 2011 Sep 15.
PMID: 21935354 [PubMed - in process]
2. ChIP-Seq: technical considerations for obtaining high-quality data.
Kidder BL, Hu G, Zhao K.
Nat Immunol. 2011 Sep 20;12(10):918-22. doi: 10.1038/ni.2117.
PMID: 21934668 [PubMed - in process]
3. Next-Generation Sequencing Reveals HIV-1-Mediated Suppression of T Cell Activation and RNA Processing and Regulation of Noncoding RNA Expression in a CD4+ T Cell Line.
Chang ST, Sova P, Peng X, Weiss J, Law GL, Palermo RE, Katze MG.
MBio. 2011 Sep 20;2(5). pii: e00134-11. doi: 10.1128/mBio.00134-11. Print 2011.
PMID: 21933919 [PubMed - in process]

Wednesday, 21 September 2011

ContEst: estimating cross-contamination of human samples in next-generation sequencing data

Summary: Here, we present ContEst, a tool for estimating the level of cross-individual contamination in next-generation sequencing data. We demonstrate the accuracy of ContEst across a range of contamination levels, sources and read depths using sequencing data mixed in silico at known concentrations. We applied our tool to published cancer sequencing datasets and report their estimated contamination levels.

Availability and Implementation: ContEst is a GATK module, and distributed under a BSD style license at


Supplementary information: Supplementary data is available at Bioinformatics online.

Emerging Hallmarks of Cancer

Very good read!

via MassGenomics by Dan Koboldt on 12/09/11

In 2000, Hanahan and Weinberg published a landmark article in which they described the "hallmarks of cancer" – six biological capabilities acquired during the multi-step development of human tumors. It went on to become the most-cited Cell article of all time. In a follow-up article this year, the authors revisit their conceptual framework for cancer biology, incorporating the remarkable progress in cancer research that was made over the last decade.
The authors conclude that their six hallmarks – sustained proliferative signaling, evading growth suppression, resisting cell death, replicative immortality, induction of angiogenesis, and invasion/metastasis – continue to provide a useful conceptual framework for understanding the biology of cancer. Further, they present two new hallmarks – reprogramming of energy metabolism and evasion of immune destruction – that have emerged as critical capabilities of cancer cells.

In coming years, thousands of tumors will be characterized by ever-more high-throughput technologies, such as massively parallel sequencing. Collecting the data is no longer the obstacle; instead, the true challenges lie in analysis and interpretation. Hanahan and Weinberg humbly describe their hallmarks as "organizing principles" for thinking about why cancer cells do what they do. Conceivably, fitting new catalogues of genetic alterations to this model of acquired capabilities will help us better understand the relationship between genotype (genetic susceptibility and somatic mutation) and phenotype (tumor development, growth, and metastasis).
Hanahan D, & Weinberg RA (2011). Hallmarks of cancer: the next generation. Cell, 144 (5), 646-74 PMID: 21376230

NIH announces 79 awards to encourage creative ideas in science

The National Institutes of Health announced that it is awarding $143.8 million to challenge the status quo with innovative ideas that have the potential to propel fields forward and speed the translation of research into improved health for the American public.

Tuesday, 20 September 2011

Partek(R) Wins Illumina(R) iDEA Award

press release

Sept. 19, 2011, 10:29 a.m. EDT

Partek(R) Wins Prestigious Illumina(R) iDEA Award

Partek software and algorithms show promising ability to substantially improve the scientific utility of next generation sequencing data

ST. LOUIS, Sep 19, 2011 (BUSINESS WIRE) -- Partek Incorporated, a global leader in bioinformatics software, announced today their receipt of the Most Creative Algorithm award, Commercial category, in the Illumina Data Excellence Award (iDEA) challenge for innovation in genomic data visualization and algorithmic analysis.

According to the judges, Partek was awarded the prestigious award for their entire, comprehensive start-to-finish data analysis tool set--Partek(R) Flow(TM), Partek(R) Genomics Suite(TM), and Partek(R) Pathway(TM)--as well as a number of useful novel algorithms. The most revolutionary of the algorithms being Partek's Gene-Specific Model. The model works on the assertion that one single statistical test does not optimally fit all genes, due to the fact that each gene may have a different distribution and be influenced by different biological factors. Therefore the Gene-Specific Model evaluates many models and distributions for each gene and selects the model that best fits that gene individually. This method results in two important advantages: first, more statistical power and more reliable findings as a result of a better model fit; and secondly, more information about which genes are influenced by which biological factors. This allows researchers to ascertain exactly how genes are affected by specific factors, in turn yielding a more statistically accurate analysis.

Tom Downey, President of Partek Incorporated had this to say, "People have been debating what is the proper distribution and statistical test for next generation sequencing data for years. We've pointed out the real elephant in the room on this debate, which is that there is not one single distribution and single statistical test that fits all genes or transcripts. For example, some genes are gender-specific, and others are not. Some genes follow a Poisson distribution and others do not."
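To make the "one test does not fit all genes" idea concrete, here is a minimal sketch of per-gene model selection. It is emphatically not Partek's Gene-Specific Model, whose details the press release does not give. As an assumption for illustration, it compares just two candidates for each gene's counts, a Poisson and a negative binomial fitted by the method of moments, and keeps whichever has the lower AIC. All function names are made up.

```python
import math


def poisson_loglik(counts, lam):
    """Log-likelihood of the counts under a Poisson with mean lam."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)


def negbin_loglik(counts, mean, size):
    """Log-likelihood under a negative binomial with the given mean and size."""
    p = size / (size + mean)
    return sum(
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(p) + k * math.log(1 - p)
        for k in counts
    )


def best_model(counts):
    """Pick the candidate distribution with the lowest AIC for one gene."""
    n = len(counts)
    mean = sum(counts) / n
    if mean == 0:
        return "poisson"  # all-zero gene: nothing to distinguish
    var = sum((k - mean) ** 2 for k in counts) / (n - 1)
    aic = {"poisson": 2 * 1 - 2 * poisson_loglik(counts, mean)}
    if var > mean:
        # Method-of-moments estimate, so the AIC is only approximate
        size = mean ** 2 / (var - mean)
        aic["negative_binomial"] = 2 * 2 - 2 * negbin_loglik(counts, mean, size)
    return min(aic, key=aic.get)


# Hypothetical counts for two genes across six samples
genes = {
    "steady_gene": [10, 11, 9, 10, 12, 8],
    "bursty_gene": [0, 50, 2, 100, 1, 30],
}
for name, counts in genes.items():
    print(name, best_model(counts))
```

A real implementation would consider more distributions and covariates (the quote mentions gender-specific genes, for instance); the point here is only that the choice is made gene by gene.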

To learn more about Partek's award winning data analysis, register at to view the webinar.

About Partek

Partek Incorporated develops and globally markets quality software for life sciences research. Its flagship product, Partek(R) Genomics Suite(TM), provides innovative solutions for integrated genomics. Partek Genomics Suite is unique in supporting all major microarray and next-generation sequencing platforms. Workflows offer streamlined analysis for: Gene Expression, miRNA Expression, Exon, Copy Number, Allele-Specific Copy Number, LOH, Association, Trio analysis, Tiling, ChIP-Seq, RNA-Seq, DNA-Seq, DNA Methylation and qPCR. Since 1993, Partek, headquartered in St. Louis, Missouri USA, has been turning data into discovery(R).

Partek and Genomics Suite are trademarks of Partek Incorporated. The names of other companies or products mentioned herein may be the trademarks of their respective owners.

SOURCE: Partek Incorporated

Let's make the SEQwiki even more awesome

Help make the Wiki awesome! The SEQanswers wiki (or "SEQwiki" for short) is a great help for users of this forum: It's a catalog of high-throughput sequencing tools. There is currently an effort to get the SEQanswers forum and the wiki published in the next NAR database issue. This is a good opportunity to get the wiki in shape: Some of the tool descriptions are just "stubs", but in order for those pages to be really helpful, we need just .

We invite everyone to pick one (or of course more!) tools that she or he uses and improve its description in the wiki. If you are the author of the tool, it is the best time to advertise there by writing to help people choose the tools. This will help others tremendously when they decide which tool to use. It's also the perfect opportunity to write down the sometimes things that are often not mentioned on the homepages of those tools.

Here are some suggestions what you could write about your tool:

  • Which tasks is the tool good at? (something like: "really fast read mapping")
  • What are the limitations? In which situations would the tool not be appropriate? ("cannot find indels", "needs at least 32 GB RAM")
  • Are there any ? Any idiosyncrasies when using the tool? ("be careful: if the input file is corrupt, there's no error message -- and the results will even look meaningful")
  • Hidden requirements? e.g. tools that are said to be memory efficient but require 32 GB or more for a reasonably small dataset
  • Which tools exist? ("if you still use MAQ, you should switch to BWA")
  • Are there related tools? (on the BWA page: "Seal uses the BWA algorithm, but runs on a Hadoop cluster")

Since the wiki is there to help us decide which tool to use for a specific purpose, please also try to . Just try to make it fit into the "Browse software" page!

While you are editing and using the wiki, please also let us know about any problems you have with the wiki software itself -- things perhaps only an administrator can change. Do any templates look ugly? Is the navigation too confusing? Don't like the fonts? We cannot promise to take care of everything, but !

Writing by M Martin. Thanks! The SeqWiki Publishing team wants to thank you for the help.

SEED: efficient clustering of next-generation sequences

Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads.

Results: Here, we introduce SEED—an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with a linear time and memory performance. When using SEED as a preprocessing tool on genome/transcriptome assembly data, it was able to reduce the time and memory requirements of the Velvet/Oasis assembler for the datasets used in this study by 60–85% and 21–41%, respectively. In addition, the assemblies contained longer contigs than non-preprocessed data as indicated by 12–27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms.

Availability: The SEED software can be downloaded for free from this site:


Supplementary information: Supplementary data are available at Bioinformatics online
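To get a feel for what "clusters that can differ by up to three mismatches" from a center means, here is a deliberately naive sketch. It is not the SEED algorithm: there are no block spaced seeds or hash tables, overhanging residues are ignored, and it only compares reads of equal length, so it is quadratic in the worst case and would not scale to millions of reads. The function names are made up for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def greedy_cluster(reads, max_mismatches=3):
    """Assign each read to the first center within max_mismatches, else start a new cluster."""
    clusters = {}  # center sequence -> list of member reads
    for read in reads:
        for center in clusters:
            if len(center) == len(read) and hamming(center, read) <= max_mismatches:
                clusters[center].append(read)
                break
        else:
            # No existing center is close enough: this read becomes a new center
            clusters[read] = [read]
    return clusters


reads = ["ACGTACGT", "ACGTACGA", "TTTTTTTT", "ACGAACGA", "TTTTTTTA"]
for center, members in greedy_cluster(reads).items():
    print(center, members)
```

The abstract's point is precisely that SEED avoids this kind of all-against-all comparison, which is where its linear time and memory behaviour comes from.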

online gamers playing on Foldit have deciphered a protein folding puzzle that had bamboozled scientists and automated computers

Gamers take 3 weeks to solve puzzle that stumped scientists for over a decade

Ever heard of protein folding? Neither had we. But the online game Foldit lets you do just that, so you can have fun while you contribute to the progress of science. Yes, the science that actual, professional scientists spend time working with!

The Sydney Morning Herald points us to an interesting article published in the Nature Structural & Molecular Biology journal [PDF], which describes how online gamers playing on Foldit have deciphered a puzzle that had bamboozled scientists and automated computers working on the problem for over a decade.

They figured out the protein structure of a monomeric protease enzyme, which is "a cutting agent in the complex molecular tailoring of retroviruses, a family that includes HIV". The understanding of this structure is an important step towards discovering the causes of many diseases related to this enzyme and coming up with treatments for them.

Sorting Petabytes with MapReduce - The Next Episode

Posted by Grzegorz Czajkowski, Marián Dvorský, Jerry Zhao, and Michael Conley, Systems Infrastructure

Almost three years ago we announced results of the first ever "petasort" (sorting a petabyte-worth of 100-byte records, following the Sort Benchmark rules). It completed in just over six hours on 4000 computers. Recently we repeated the experiment using 8000 computers. The execution time was 33 minutes, an order of magnitude improvement.

Our sorting code is based on MapReduce, which is a key framework for running multiple processes simultaneously at Google. Thousands of applications, supporting most services offered by Google, have been expressed in MapReduce. While not many MapReduce applications operate at a petabyte scale, some do. Their scale is likely to continue growing quickly. The need to help such applications scale motivated us to experiment with data sets larger than one petabyte. In particular, sorting a ten petabyte input set took 6 hours and 27 minutes to complete on 8000 computers. We are not aware of any other sorting experiment successfully completed at this scale.
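For a sense of scale, the figures in the post can be turned into rough throughput numbers. This is plain arithmetic on the quoted values, under two assumptions of mine: a petabyte is taken as 10^15 bytes, and "just over six hours" is rounded to exactly six hours for the speed-up.

```python
PETABYTE = 10 ** 15  # assumption: decimal petabytes


def throughput(total_bytes, seconds, machines):
    """Return (aggregate bytes per second, bytes per second per machine)."""
    aggregate = total_bytes / seconds
    return aggregate, aggregate / machines


one_pb = throughput(1 * PETABYTE, 33 * 60, 8000)
ten_pb = throughput(10 * PETABYTE, (6 * 60 + 27) * 60, 8000)
# Assumption: the earlier run is treated as exactly 6 hours
speedup = (6 * 60) / 33

print(f"1 PB run:  {one_pb[0] / 1e9:.0f} GB/s total, {one_pb[1] / 1e6:.0f} MB/s per machine")
print(f"10 PB run: {ten_pb[0] / 1e9:.0f} GB/s total, {ten_pb[1] / 1e6:.0f} MB/s per machine")
print(f"1 PB speed-up over the earlier run: about {speedup:.1f}x")
```

Note that the speed-up comes with twice as many machines, so the per-machine improvement is roughly half of the headline figure.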

We are excited by these results. While internal improvements to the MapReduce framework contributed significantly, a large part of the credit goes to numerous advances in Google's hardware, cluster management system, and storage stack.

What would it take to scale MapReduce by further orders of magnitude and make processing of such large data sets efficient and easy? One way to find out is to join Google's systems infrastructure team. If you have a passion for distributed computing, are an expert or plan to become one, and feel excited about the challenges of exascale then definitely consider applying for a software engineering position with Google.

Stanford-Led Team Demonstrates Utility of Ethnicity-Specific Reference for Interpreting Genome Data

excerpted from Genomeweb

NEW YORK (GenomeWeb News) – In a study appearing online last night in PLoS Genetics, a Stanford University-led team described the "ethnicity-specific" reference genome approach it used to analyze whole genome sequences from four members of a single family.

By incorporating estimated allele frequency data from the 1000 Genomes Project into the existing human reference genome, the researchers came up with three synthetic human genome references containing the major alleles identified in European, African, or East Asian populations — a strategy that's intended to more accurately represent the genetic variation present in each of the major HapMap populations.
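The idea of a major allele reference can be sketched in a few lines: wherever the alternate allele is the more common one in a population, swap it into the reference sequence. This is a toy illustration under my own assumptions (SNVs only, 0-based positions, a simple "alternate frequency above 0.5" rule, made-up names), not the authors' pipeline.

```python
def major_allele_reference(reference, variants):
    """Build a reference carrying the population's major allele at each SNV.

    variants: iterable of (position, ref_base, alt_base, alt_frequency),
    with 0-based positions into the reference string.
    """
    seq = list(reference)
    for pos, ref, alt, alt_freq in variants:
        if seq[pos] != ref:
            raise ValueError(f"reference mismatch at position {pos}")
        if alt_freq > 0.5:
            # The alternate allele is the major allele in this population
            seq[pos] = alt
    return "".join(seq)


# Hypothetical frequencies for one population
variants = [(1, "C", "T", 0.8), (3, "T", "G", 0.2)]
print(major_allele_reference("ACGTAC", variants))
```

Running the same function with frequencies from different populations would give one synthetic reference per population, which is the shape of the three references described in the excerpt.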

Whole-genome sequencing and clinical annotation

  • Construction of and alignment to an ethnicity-specific major allele reference sequence yielded improved alignment and more accurate genotyping, especially at disease-associated loci.
  • Mendelian inheritance state analysis in the family structure enabled identification and removal of >90% of variants arising from sequencing errors.
  • Per-trio phasing, inheritance state of adjacent variants, and population-level linkage disequilibrium data were integrated to provide long-range phased haplotypes.
  • By fine-mapping recombination events to sub-kilobase resolution, the authors were able to perform sequence-based human lymphocyte antigen (HLA) typing.
  • A curated database of genotype-phenotype correlations made it possible to construct comprehensive genetic risk profiles, including multigenic risk of inherited thrombophilia, common disease susceptibility, and pharmacogenomics.
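The family-based error filtering mentioned in the list above rests on a simple observation: a child's genotype must be explainable by one allele from each parent. The sketch below checks only that per-site rule for a single child. The paper's inheritance state analysis across the quartet is more powerful, and nothing here is taken from the authors' code; the function name and the genotype representation are assumptions.

```python
from itertools import product


def mendelian_consistent(child, mother, father):
    """True if the child's genotype can be formed from one maternal and one paternal allele.

    Genotypes are 2-tuples of alleles, e.g. ("A", "G").
    """
    return any(
        sorted((m, f)) == sorted(child)
        for m, f in product(mother, father)
    )


# Hypothetical genotype calls at two sites
print(mendelian_consistent(("A", "G"), ("A", "A"), ("G", "G")))  # consistent
print(mendelian_consistent(("G", "G"), ("A", "A"), ("A", "G")))  # likely a calling error
```

Sites that fail the check are candidates for sequencing or genotyping errors, although a small number will be genuine de novo mutations.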

Are BIG RAM servers popular?

I know for sure

BGI has them :(

ucdavis has one :(

Titus Brown recommends 512 GB or even 1 TB (shudder)

Jerm makes a case for owning one here

Nick Loman is already doing market research on buying one,

More importantly, the SEQanswers wiki suggests that you should own one for de novo assembly ;)

Do you own one? How often does it get used?
More MPI or memory-efficient de Bruijn assemblers are being pushed out now ... is throwing more RAM at the problem really still required?

Hmmm, I don't have access to one, but my limited experience with a 256 GB RAM machine for a de novo assembly of a fish transcriptome didn't give me the contigs that I wanted (it ran out of memory midway :( ).

Monday, 19 September 2011

Assemblathon 1: A competitive assessment of de novo short read assembly methods


    Low cost short read sequencing technology has revolutionised genomics, though it is only just
    becoming practical for the high quality de novo assembly of a novel large genome. We describe
    the Assemblathon 1 competition, which aimed to comprehensively assess the state of the art in
    de novo assembly methods when applied to current sequencing technologies. In a collaborative
    effort teams were asked to assemble a simulated Illumina HiSeq dataset of an unknown,
    simulated diploid genome. A total of 41 assemblies from 17 different groups were received. Novel haplotype aware assessments of coverage, contiguity, structure, base calling and copy number
    were made. We establish that within this benchmark (1) it is possible to assemble the genome to
    a high level of coverage and accuracy, and that (2) large differences exist between the
    assemblies, suggesting room for further improvements in current methods. The simulated
    benchmark, including the correct answer, the assemblies and the code that was used to evaluate
    the assemblies is now public and freely available from

    excerpted from Introduction

    As the field of sequencing has changed so has the field of sequence assembly, for a recent
    review see Miller et al. (2010). In brief, using Sanger sequencing, contigs were initially built using
    overlap or string graphs (Myers 2005) (or data structures closely related to them), in tools such
    as Phrap, GigAssembler (Kent and Haussler, 2001), Celera (Myers et al.
    2000) (Venter et al. 2001), ARACHNE (Batzoglou et al. 2002), and Phusion (Mullikin and Ning
    2003), which were used for numerous high quality assemblies such as human (Lander et al.
    2001) and mouse (Mouse Genome Sequencing Consortium et al. 2002). However, these
    programs were not generally efficient enough to handle the volume of sequences produced by the
    assembly software.

    While some maintained the overlap graph approach, e.g. Edena (Hernandez et al. 2008) and
    Newbler, others used word look-up tables to greedily extend reads, e.g.
    SSAKE (Warren et al. 2007), SHARCGS (Dohm et al. 2007), VCAKE (Jeck et al. 2007) and
    OligoZip. These word look-up tables were
    then extended into de Bruijn graphs to allow for global analyses (Pevzner et al. 2001), e.g. Euler
    (Chaisson and Pevzner 2008), AllPaths (Butler et al. 2008) and Velvet (Zerbino and Birney 2008).
    As projects grew in scale further engineering was required to fit large whole genome datasets into
    memory ((ABySS (Simpson et al. 2009), Meraculous (in submission)), (SOAPdenovo (Li et al.
    2010), Cortex (in submission)). Now, as improvements in sequencer technology are extending the
    length of "short reads", the overlap graph approach is being revisited, albeit with optimized
    programming techniques, e.g. SGA (Simpson and Durbin 2010), as are greedy contig extension
    In general, most sequence assembly programs are multi stage pipelines, dealing with correcting
    measurement errors within the reads, constructing contigs, resolving repeats (i.e. disambiguating
    false positive alignments between reads) and scaffolding contigs in separate phases. Since a
    number of solutions are available for each task, several projects have been initiated to explore the
    parameter space of the assembly problem, in particular in the context of short read sequencing
    ((Phillippy et al. 2008), (Hubis et al. 2011), (Alkan et al. 2011), (Narzisi and Mishra 2011), (Zhang et al. 2011) and (Lin et al. 2011)).
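The de Bruijn graph idea mentioned in the excerpt can be sketched briefly: break every read into k-mers, and connect the (k-1)-mer prefix of each k-mer to its (k-1)-mer suffix. This is a bare-bones illustration with a made-up function name. It does no error correction, ignores reverse complements, and is nothing like the memory-engineered structures in ABySS, SOAPdenovo or Velvet.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix; repeats are kept as multi-edges
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


graph = de_bruijn_graph(["ACGT", "CGTA"], k=3)
for node, successors in graph.items():
    print(node, "->", successors)
```

Contigs then correspond to unambiguous paths through this graph, and the repeated edge from the overlapping reads hints at how coverage shows up as edge multiplicity.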

    Saturday, 17 September 2011

    High-throughput sequencing confers a deep view of seasonal community dynamics in pelagic marine environments

    If you sequence the bacteria in the English Channel, you can work out which month it is with perfect accuracy HT @dr_bik

    Gilbert et al. (2011, 2010) show that even in bacterial communities, there are definite seasonal patterns and peaks in community diversity.  Figuring out what causes these patterns is sometimes surprisingly easy – it looks like shifting day length accounts for 65% of the changes in bacterial diversity (I'm sure the authors' jaws dropped when they saw this result…).  Even more ridiculous (in a good way), the specific bacterial assemblage—the 'fingerprint' of species present in the community—could predict the month with 100% accuracy.   And no surprise, only 2% of the 100 most abundant taxa they observed could be identified down to species level.  (Previously undiscovered diversity is so old hat these days.  But still cool).

    Gilbert JA, Steele JA, Caporaso JG, Steinbrück L, Reeder J, Temperton B, Huse S, McHardy AC, Knight R, Joint I, Somerfield P, Fuhrman JA, & Field D (2011a). Defining seasonal marine microbial community dynamics. The ISME journal PMID: 21850055
    Gilbert, J., Field, D., Swift, P., Thomas, S., Cummings, D., Temperton, B., Weynberg, K., Huse, S., Hughes, M., Joint, I., Somerfield, P., & Mühling, M. (2010). The Taxonomic and Functional Diversity of Microbes at a Temperate Coastal Site: A ‘Multi-Omic’ Study of Seasonal and Diel Temporal Variation PLoS ONE, 5 (11) DOI: 10.1371/journal.pone.0015545
    Fuhrman JA, Hewson I, Schwalbach MS, Steele JA, Brown MV, & Naeem S (2006). Annually reoccurring bacterial communities are predictable from ocean conditions. Proceedings of the National Academy of Sciences of the United States of America, 103 (35), 13104-9 PMID: 16938845

    FAQ - Howto do RNA-seq Bioinformatics analysis on Galaxy

    One of the top questions posted in the Galaxy User mailing list.
    I have reposted the summary links here for convenience.

    Tutorial covering RNA-seq analysis (tool under "NGS: RNA Analysis")

    FAQ to help with troubleshooting (if needed):

    For visualization, an update that allows the use of a user-specified
    fasta reference genome is coming out very soon. For now, you can view
    annotation by creating a custom genome build, but the actual reference
    will not be included. Use "Visualization -> New Track Browser" and
    follow the instructions for "Is the build not listed here? Add a Custom

    Help for using the tool is available here:

    Currently, RNA-seq analysis for SOLiD data is available only on Galaxy test server:

    Please note that there are quotas associated with the test server:

    [Credit : Jennifer Jackson ]

    Another helpful resource (non-Galaxy related though) is written by Matthew Young
    and the discussion on this wiki @ seqanswers

    As well as this review paper in Genome Biology RNA-seq Review

    Stephen mentions this tutorial as well in this blog

    Dr David Matthews has posted a starter thread in the mailing list to discuss an RNA-seq analysis workflow for paired-end sequencing with Tophat on Galaxy.

    RNA seq analysis workflow on Galaxy (Bristol workflow)

    His post and the discussion thread is here. 

    (kevin: waiting for the next common question to come: is there Ion Torrent support on Galaxy?)

    What's new for 'next generation sequencing' in PubMed

    1. Next-generation human genetics.
    Shendure J.
    ABSTRACT: The field of human genetics is being reshaped by exome and genome sequencing. Several lessons are evident from observing the rapid development of this area over the past 2 years, and these may be instructive with respect to what we should expect from 'next-generation human genetics' in the next few years.
    Genome Biol. 2011 Sep 14;12(9):408. [Epub ahead of print]
    PMID: 21920048 [PubMed - as supplied by publisher]
    2. Next-generation diagnostics for inherited skin disorders.
    Lai-Cheong JE, McGrath JA.
    J Invest Dermatol. 2011 Oct;131(10):1971-3. doi: 10.1038/jid.2011.253.
    PMID: 21918571 [PubMed - in process] Free Article
    Identifying genes and mutations in the monogenic inherited skin diseases is a challenging task. Discoveries are cherished but often gene-hunting efforts have gone unrewarded because technology has failed to keep pace with investigators' enthusiasm and clinical resources. But times are changing. The recent arrival of next-generation sequencing has transformed what can now be achieved.
    3. Whole cancer genome sequencing by next-generation methods.
    Ross JS, Cronin M.


    Traditional approaches to sequence analysis are widely used to guide therapy for patients with lung and colorectal cancer and for patients with melanoma, sarcomas (eg, gastrointestinal stromal tumor), and subtypes of leukemia and lymphoma. The next-generation sequencing (NGS) approach holds a number of potential advantages over traditional methods, including the ability to fully sequence large numbers of genes (hundreds to thousands) in a single test and simultaneously detect deletions, insertions, copy number alterations, translocations, and exome-wide base substitutions (including known "hot-spot mutations") in all known cancer-related genes. Adoption of clinical NGS testing will place significant demands on laboratory infrastructure and will require extensive computational expertise and a deep knowledge of cancer medicine and biology to generate truly useful "clinically actionable" reports. It is anticipated that continuing advances in NGS technology will lower the overall cost, speed the turnaround time, increase the breadth of genome sequencing, detect epigenetic markers and other important genomic parameters, and become applicable to smaller and smaller specimens, including circulating tumor cells and circulating free DNA in plasma.
    Am J Clin Pathol. 2011 Oct;136(4):527-39.
    PMID: 21917674 [PubMed - in process]
    4. A novel application of pattern recognition for accurate SNP and indel discovery from high-throughput data: Targeted resequencing of the glucocorticoid receptor co-chaperone FKBP5 in a Caucasian population.
    Pelleymounter LL, Moon I, Johnson JA, Laederach A, Halvorsen M, Eckloff B, Abo R, Rossetti S.
    Mol Genet Metab. 2011 Aug 24. [Epub ahead of print]
    The detection of single nucleotide polymorphisms (SNPs) and insertion/deletions (indels) with precision from high-throughput data remains a significant bioinformatics challenge. Accurate detection is necessary before next-generation sequencing can routinely be used in the clinic. In research, scientific advances are inhibited by gaps in data, exemplified by the underrepresented discovery of rare variants, variants in non-coding regions and indels. The continued presence of false positives and false negatives prevents full automation and requires additional manual verification steps. Our methodology presents applications of both pattern recognition and sensitivity analysis to eliminate false positives and aid in the detection of SNP/indel loci and genotypes from high-throughput data. We chose FK506-binding protein 51(FKBP5) (6p21.31) for our clinical target because of its role in modulating pharmacological responses to physiological and synthetic glucocorticoids and because of the complexity of the genomic region. We detected genetic variation across a 160kb region encompassing FKBP5. 613 SNPs and 57 indels, including a 3.3kb deletion were discovered. We validated our method using three independent data sets and, with Sanger sequencing and Affymetrix and Illumina microarrays, achieved 99% concordance. Furthermore we were able to detect 267 novel rare variants and assess linkage disequilibrium. Our results showed both a sensitivity and specificity of 98%, indicating near perfect classification between true and false variants. The process is scalable and amenable to automation, with the downstream filters taking only 1.5h to analyze 96 individuals simultaneously. We provide examples of how our level of precision uncovered the interactions of multiple loci, their predicted influences on mRNA stability, perturbations of the hsp90 binding site, and individual variation in FKBP5 expression. 
    Finally we show how our discovery of rare variants may change current conceptions of evolution at this locus.
    PMID: 21917492 [PubMed - as supplied by publisher]

    Bitcasa lets you have 'infinite' storage in the cloud - not a joke

    The web is abuzz about this new company presenting at the TechCrunch Disrupt conference.
    Essentially, they promise to store all of your hard drive content in the cloud in encrypted form.
    Nothing new? Well, they are only going to charge you USD $10 per month for it.

    How are they going to achieve that?
    The company has proprietary data de-duplication algorithms that can reduce most users' file storage footprint to 25 GB of data each (assuming many of us store the same files, such as MP3s).

    Hmmm, imagine the potential for storing NGS data in the cloud for cheap! (Well, we won't exactly be bankrupting them if most people are storing human genome sequences, which will be very, very similar, right?)
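
    Bitcasa's algorithms are proprietary, so the following is only a minimal sketch of the general idea of de-duplication, assuming whole-file, content-addressed storage keyed by a SHA-256 digest. The class and method names (DedupStore, put, get, stored_bytes) are hypothetical and purely illustrative.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical content is kept only once.

    Illustrative only; this is NOT Bitcasa's (proprietary) algorithm.
    """

    def __init__(self):
        self.blobs = {}  # digest -> bytes (each unique content stored once)
        self.files = {}  # (user, filename) -> digest

    def put(self, user, filename, data):
        """Register a file for a user; store the bytes only if unseen."""
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)
        self.files[(user, filename)] = digest
        return digest

    def get(self, user, filename):
        """Return the bytes a user stored under a filename."""
        return self.blobs[self.files[(user, filename)]]

    def stored_bytes(self):
        """Total bytes physically held, after de-duplication."""
        return sum(len(blob) for blob in self.blobs.values())
```

    If two users upload the same reference genome, stored_bytes() grows only once, which is the effect the NGS comment above is hoping for. Real sequencing runs differ read by read, so whole-file hashing like this would help far less for raw data than for shared reference files.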

    [From CNET]
    The company is aggressive about data de-duplication, and furthermore, most users have less than 25GB of data. With cheap bandwidth and cheap storage, it works. The 8-person company has raised $1.3 million and counts Andreessen Horowitz and the CrunchFund as its backers.

    Interested? Sign up for a trial using this url to help push me up to the front of the queue for the beta :)

    How to Make Your Hard Drive Infinite - Technology Review

    Why 17.59 terabytes? Because that's the maximum amount of data OS X can address.
    (Credit: Bitcasa)

    BioLektur - a “longitudinal study” of the improvements made on the Ion Torrent.

    Good post!

    RT @phylogenomics Video "UCLA: 12 file sharing myths in two minutes" mostly makes me think about how openness makes life so much easier

    Friday, 16 September 2011

    Ion Torrent PGM Technology updates

    Attended a PGM technology update talk by Micheal Rhodes of Life Technologies today.
    The PGM does seem to be the most promising platform, with room to grow.
    I am curious, though, how many more wells they can squeeze onto the chip without having to upgrade the machine or resort to 'dual core' tricks to double throughput.
    As I understand it, though, they cannot load all of the wells with beads, because the software uses the empty wells as the noise filter at the processing stage.

    Interesting snippets.

    They have been getting in-house throughput of:
    50.3 Mbp  (~600k reads) on the 314 Chip
    330 Mbp on the 316 Chip
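
    As a rough sanity check, the 314 Chip figures above imply a mean read length. This is a small sketch assuming 1 Mbp = 1,000,000 bases and taking "~600k reads" as exactly 600,000; the function name is just for illustration.

```python
def mean_read_length(total_bases, n_reads):
    """Average read length in bases, given total throughput and read count."""
    return total_bases / n_reads


# 314 Chip: 50.3 Mbp from roughly 600k reads (figures quoted in the talk)
chip_314 = mean_read_length(50.3e6, 600_000)
print(f"314 chip: ~{chip_314:.0f} bp per read")  # 314 chip: ~84 bp per read
```

    No read count was given for the 316 Chip, so the same calculation cannot be done for it from these notes.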

    The longest read that they officially have without errors is 341 bp (though I guess achieving longer reads is partly a matter of chance, depending on how well the sequence matches the 'samba' random cycle).

    One can also do miRNA sequencing with 5 ng of miRNA, although the number of reads might be a tad limiting, depending on the transcriptome complexity of your organism.

    Would be interesting to see what numbers are coming out from Broad and BGI though. Please post in comments if you have them.

    Will update if I remember more stuff.

    What is interesting is that they have been pushing the throughput envelope but they are more careful about pushing new protocols without extensive testing.
    I like the direction they are going ahead with releasing public data and allowing fair comparisons and I hope that other vendors take up the same direction.
    I do understand why they wish to keep all the (uncensored) discussions within their Ion Community, to make it a vibrant, supportive community. However, I don't really like that they restricted the Torrent Users section to those with a PGM serial number.
    This makes life hard for labs that sequence through providers or core labs.

    BGI Bemoans Absurdist Data Transfer | Informatics Iron | GenomeWeb

    The IT leaders of 1000 Genomes Project describe how they must "distressingly often resort to shipping hard disks around to transfer data between centers, rather than use the internet, or even via Aspera which is faster than ftp [file transfer protocol]." The issue is so dire that BGI has established an open access journal, Giga Science, to deal with the problem of data dissemination and organization.

    Thursday, 15 September 2011

    Full-length transcriptome assembly from RNA-Seq data without a reference genome.

    Nat Biotechnol. 2011 May 15;29(7):644-52. doi: 10.1038/nbt.1883.

    Full-length transcriptome assembly from RNA-Seq data without a reference genome.


    Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, Massachusetts, USA.


    Massively parallel sequencing of cDNA has enabled deep and efficient probing of transcriptomes. Current approaches for transcript reconstruction from such data often rely on aligning reads to a reference genome, and are thus unsuitable for samples with a partial or missing reference genome. Here we present the Trinity method for de novo assembly of full-length transcripts and evaluate it on samples from fission yeast, mouse and whitefly, whose reference genome is not yet available. By efficiently constructing and analyzing sets of de Bruijn graphs, Trinity fully reconstructs a large fraction of transcripts, including alternatively spliced isoforms and transcripts from recently duplicated genes. Compared with other de novo transcriptome assemblers, Trinity recovers more full-length transcripts across a broad range of expression levels, with a sensitivity similar to methods that rely on genome alignments. Our approach provides a unified solution for transcriptome reconstruction in any sample, especially in the absence of a reference genome.

    [PubMed - in process]

    Wednesday, 14 September 2011

    Adding custom reference genome to Torrent Server manually - My Experience

    Apologies! After digging in the Ion Community a little more, I think this is the updated link for V1.4 TS 

    Adding a New Genome Index 

    Created on: Jul 7, 2011 4:29 PM by ghartsell - Last Modified:  Jul 11, 2011 1:48 PM by ghartsell

    But the manually created reference index doesn't appear in the final drop-down menu when I try to do a realignment (it does appear in the reference tab).

    I don't really understand this line:
    "As of release 1.1.0, only the "tmap-f1" index_type is supported."
    The index I created had an info.txt with tmap-f2.

    In any case, if you don't mind fiddling with the web browser, and you were met with 'file deleted' or the job started but you still do not have your index, you can restart ionJobServer:
    $ sudo /etc/init.d/ionJobServer restart

    Adapted from the original doc here 

    Adding a New Genome Index

    As part of the standard analysis process reads are aligned to a genomic reference and the alignments and some summary statistics based on the alignments are included in the analysis report page.  This HOWTO describes the process to add a new reference genome, something that will be necessary when a user starts to work with a new genome sequence.

    The aligner used is named tmap and it comes pre-installed on the Torrent Server.


    Before we begin, you will need your reference sequence in a single file in FASTA format, and you will need command-line access to the Torrent Server. Please note that the file must have Unix line endings, not Windows line endings. (The file can be in .zip compressed format, but I didn't test this.)
    You will need admin rights to scp the file over to /results/referenceLibrary/tmap-f2/
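
    If you are unsure whether your FASTA file has Windows line endings, something like the following can convert it before you copy it over. This is a minimal sketch, not part of the Torrent Server tooling; the function name and the file names in the usage comment are hypothetical.

```python
def to_unix_line_endings(src, dst):
    """Copy src to dst, turning Windows (CRLF) line endings into Unix (LF).

    Streams line by line so that large genomes need not fit in memory.
    Returns the number of lines that were converted.
    """
    converted = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for line in fin:
            if line.endswith(b"\r\n"):
                line = line[:-2] + b"\n"
                converted += 1
            fout.write(line)
    return converted


# Example usage (hypothetical file names):
# n = to_unix_line_endings("genome_windows.fasta", "genome.fasta")
# print(n, "lines converted")
```

    A return value of 0 means the file already had Unix line endings and the copy is byte-identical to the input.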


     Select a Short Form of Genome Name 

    The short form of the genome name is the name under which you would like the reference to appear when initiating a run on the PGM™ instrument. There are some rules for defining the short form of the genome name:
    1. It should not match any of the existing references installed under the standard reference location.
    2. It should consist solely of alphanumeric characters, underscores ("_") and periods (".").
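
    The two rules above can be checked before running the index creation. This is a rough sketch, assuming "alphanumeric" means ASCII letters and digits and that each installed reference lives in its own subdirectory of the reference location (as A_flavithermus/ does in the next section); both function names are hypothetical.

```python
import os
import re

# Rule 2: only letters, digits, underscore and period (ASCII assumed)
_SHORT_NAME_RE = re.compile(r"[A-Za-z0-9_.]+")


def installed_short_names(ref_dir):
    """Names of the subdirectories under ref_dir (assumed: one per reference)."""
    return {
        entry for entry in os.listdir(ref_dir)
        if os.path.isdir(os.path.join(ref_dir, entry))
    }


def is_valid_short_name(name, existing=()):
    """True if name uses only allowed characters and is not already taken."""
    return bool(_SHORT_NAME_RE.fullmatch(name)) and name not in existing


# Example usage on the Torrent Server:
# taken = installed_short_names("/results/referenceLibrary/tmap-f2/")
# print(is_valid_short_name("A_flavithermus", taken))
```

    For instance, "A_flavithermus" passes rule 2, while "e coli" (contains a space) does not.
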
     Index Creation 

    The alignment package (ion-alignment) comes with a wrapper script that automates the TMAP index creation process. It requires four inputs:
    • single FASTA file
    • short form of the genome name (see previous section)
    • long form of the genome name (see next section for description)
    • genome version (see next section for description)

    The steps to create the index:
    1. move or copy the FASTA file to the standard reference location 
    $ cd /results/referenceLibrary/tmap-f2/
    $ --fasta A_flavithermus.fasta -s A_flavithermus \
        -v "gi|212637849|ref|NC_011567.1" \
        -l "Anoxybacillus flavithermus WK1 chromosome complete genome"
    Copying A_flavithermus.fasta to A_flavithermus/A_flavithermus.fasta...
      ...copy complete
    Making tmap index...
      ...tmap index complete
    Making samtools index...
      ...samtools index complete

    There should now be 10 files in the directory, including the original FASTA file. The size of the files varies by genome - for the human genome (3,000,000,000 bases in length) the combined size of all index files, including the original FASTA file itself, is just under 8 GB. For E. coli (4,600,000 bases in length) it is about 0.4 GB.
    You might want to delete the FASTA file you copied over, to keep things tidy (the script has already copied it into the A_flavithermus directory):
    $ rm A_flavithermus.fasta
    $ ls -1 /results/referenceLibrary/tmap-f1/e_coli/

     Adding the Genome to the PGM Drop-down Menu 

    For additional convenience it is also recommended (though not required) to add the genome to the list that is made available on the PGM as a drop-down menu - this can be very helpful in avoiding typos on the PGM.

    updateref will crawl through the directory, grab the genome_shortname fields from all installed reference libraries of the version specified, and overwrite reference_list.txt. When updateref is called without any command-line arguments, it assumes the default settings; for example, /results/PGM_config is the default location of the PGM configuration. The location is crucial because it needs to be under the same root directory to which the PGMs transfer data. For example, if the PGMs transfer data to a file server that is mounted as /mnt/PGM_Data on the Torrent Server, the option -p /mnt/PGM_Data/PGM_config needs to be specified. updateref --help lists more options.

    Default settings. PGM data are stored in
    $ sudo updateref
    List of library
    -> ampl_valid
    -> vibrio_fisch
    -> e_coli_k12
    -> e_coli_dh10b
    -> rhodopalu

    Customized environment. PGM data are stored in
    $ sudo updateref -p /mnt/PGM_Data/PGM_config
    You may also edit the text file manually (I did this as I couldn't find updateref):
    $ sudo vim /results/PGM_config/reference_list.txt
    Insert the short name into the text file.
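
    The manual edit can also be scripted. The sketch below assumes that reference_list.txt simply holds one short name per line; that layout is an assumption on my part, not something the original doc confirms, so check the file before overwriting it. Both function names are hypothetical.

```python
def add_short_name(lines, short_name):
    """Return the existing names (blank lines dropped) plus short_name if absent.

    ASSUMPTION: the list file holds one genome short name per line.
    """
    names = [line.strip() for line in lines if line.strip()]
    if short_name not in names:
        names.append(short_name)
    return names


def update_reference_list(path, short_name):
    """Rewrite the list file at path so that it includes short_name."""
    with open(path) as fh:
        names = add_short_name(fh, short_name)
    with open(path, "w") as fh:
        fh.write("\n".join(names) + "\n")
    return names


# Example usage (needs write permission on the file, e.g. run via sudo):
# update_reference_list("/results/PGM_config/reference_list.txt", "A_flavithermus")
```

    Adding a name that is already listed leaves the list unchanged, so the script is safe to run more than once.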

    Update: manually editing the text file doesn't make the genome appear in the libraries for the realignment plugin. Curiously, after adding the reference genome via the web browser, the genome name doesn't appear here.

    Datanami, Woe be me