Wednesday, 28 September 2011
[bedtools-discuss] pybedtools: a flexible Python library for manipulating genomic datasets and annotations
Tuesday, 27 September 2011
How to run SGE jobs (comparison to PBS scripts)
If you didn't know already, the Torrent Server uses SGE. For those of us more used to PBS or LSF, the guides below might help you walk through the commands if you need to, say, create your own reference index or run tmap on the TS yourself.
Basics
https://ac.seas.harvard.edu/display/USERDOCS/How+To+Run+Parallel+MPI+Jobs+Using+SGE
Array jobs for clusters running SGE
http://wiki.gridengine.info/wiki/index.php/Simple-Job-Array-Howto
Converting between PBS and SGE scripts
http://wiki.ibest.uidaho.edu/index.php/Tutorials:_SGE_PBS_Converting
The above URL has a fantastic conversion table that lists options that are new to me as well!
For a more concise, quick-and-dirty guide, see here:
http://www.ucl.ac.uk/isd/common/research-computing/services/legion-upgrade/userguide/pbstosge
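To make the mapping concrete, here is a minimal sketch of an SGE array-job script with rough PBS equivalents in the comments (the job name, output files and task count are made up for illustration):

#!/bin/bash
#$ -N myjob              # job name (PBS: #PBS -N myjob)
#$ -cwd                  # run from the submission directory (PBS: cd $PBS_O_WORKDIR in the body)
#$ -o myjob.out          # stdout file (PBS: #PBS -o myjob.out)
#$ -e myjob.err          # stderr file (PBS: #PBS -e myjob.err)
#$ -t 1-10               # array job with 10 tasks (Torque: #PBS -t 1-10)
# SGE puts the array index in $SGE_TASK_ID (Torque uses $PBS_ARRAYID)
echo "task $SGE_TASK_ID running on $(hostname)"

Submit with qsub and monitor with qstat, both of which exist under the same names in PBS/Torque.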
Monday, 26 September 2011
Early MiSeq Users Say Data Quality Matches HiSeq; Cite Speed as Advantage for Range of Applications | In Sequence | Sequencing | GenomeWeb
Sequence quality nearly matches that of the HiSeq, but it costs more per base, though it produces data faster.
I don't have any first-hand accounts of MiSeq data, even though a friend has asked if I know anyone who might serve as a consultant for a sequencing provider in China that is using MiSeq. I also wonder whether BGI has early access to MiSeq as well.
http://www.genomeweb.com/node/979585/?hq_e=el&hq_m=1096205&hq_l=6&hq_v=4f37903830
According to Nusbaum, the instrument is "pretty easy" to use, runs fast, and provides high-quality data, although at a greater cost per base than the HiSeq. It has been running according to Illumina's specifications, he said, and so far, there have been no serious problems with the machine.
According to Illumina, MiSeq produces more than 120 megabases of data with 35-base reads in four hours, and more than 1 gigabase of data with paired 150-base reads in 27 hours, including amplification and sequencing, and the number of unpaired reads exceeds 3.4 million.
The base accuracy of the data "is similar to what we see for the HiSeq," Nusbaum said. Toward the ends of the reads, the quality is even slightly higher than for HiSeq, probably because the sample spends less time on the machine.
Initially, the Broad plans to use the platform for "any kind of urgent project where turnaround time trumps cost of the data," Nusbaum said. This includes, for example, R&D projects, because "you get your answer in a day rather than in a week and a half."
In addition, projects that "fit nicely onto a small platform" will be run on MiSeq in the future at the Broad; these could include, for example, viral and microbial sequencing projects.
Ion Torrent vs MiSeq & GS FLX+ - SEQanswers
For Ion Torrent vs MiSeq, read this in-depth independent analysis:
http://www.edgebio.com/blog/?p=241
For Ion Torrent vs 454
http://flxlexblog.wordpress.com/2011...-homopolymers/
Omics! Omics!: Ion Throws A Long Punch At MiSeq
Thursday, 22 September 2011
Phased whole-genome genetic risk in a family quartet using a major allele reference sequence.
1. Phased whole-genome genetic risk in a family quartet using a major allele reference sequence.
Dewey FE, Chen R, Cordero SP, Ormond KE, Caleshu C, Karczewski KJ, Whirl-Carrillo M, Wheeler MT, Dudley JT, Byrnes JK, Cornejo OE, Knowles JW, Woon M, Sangkuhl K, Gong L, Thorn CF, Hebert JM, Capriotti E, David SP, Pavlovic A, West A, Thakuria JV, Ball MP, Zaranek AW, Rehm HL, Church GM, West JS, Bustamante CD, Snyder M, Altman RB, Klein TE, Butte AJ, Ashley EA.
PLoS Genet. 2011 Sep;7(9):e1002280. Epub 2011 Sep 15.
PMID: 21935354 [PubMed - in process]
2. ChIP-Seq: technical considerations for obtaining high-quality data.
Kidder BL, Hu G, Zhao K.
Nat Immunol. 2011 Sep 20;12(10):918-22. doi: 10.1038/ni.2117.
PMID: 21934668 [PubMed - in process]
3. Next-Generation Sequencing Reveals HIV-1-Mediated Suppression of T Cell Activation and RNA Processing and Regulation of Noncoding RNA Expression in a CD4+ T Cell Line.
Chang ST, Sova P, Peng X, Weiss J, Law GL, Palermo RE, Katze MG.
MBio. 2011 Sep 20;2(5). pii: e00134-11. doi: 10.1128/mBio.00134-11. Print 2011.
PMID: 21933919 [PubMed - in process]
Wednesday, 21 September 2011
ContEst: estimating cross-contamination of human samples in next-generation sequencing data
Summary: Here, we present ContEst, a tool for estimating the level of cross-individual contamination in next-generation sequencing data. We demonstrate the accuracy of ContEst across a range of contamination levels, sources and read depths using sequencing data mixed in silico at known concentrations. We applied our tool to published cancer sequencing datasets and report their estimated contamination levels.
Availability and Implementation: ContEst is a GATK module, and distributed under a BSD style license at http://www.broadinstitute.org/cancer/cga/contest
Contact: kcibul@broadinstitute.org; gadgetz@broadinstitute.org
Supplementary information: Supplementary data are available at Bioinformatics online.
Emerging Hallmarks of Cancer
In 2000, Hanahan and Weinberg published a landmark article in which they described the "hallmarks of cancer" – six biological capabilities acquired during the multi-step development of human tumors. It went on to become the most-cited Cell article of all time. In a follow-up article this year, the authors revisit their conceptual framework for cancer biology, incorporating the remarkable progress in cancer research that was made over the last decade.
The authors conclude that their six hallmarks – sustained proliferative signaling, evading growth suppression, resisting cell death, replicative immortality, induction of angiogenesis, and invasion/metastasis – continue to provide a useful conceptual framework for understanding the biology of cancer. Further, they present two new hallmarks – reprogramming of energy metabolism and evasion of immune destruction – that have emerged as critical capabilities of cancer cells.
In coming years, thousands of tumors will be characterized by ever-more high-throughput technologies, such as massively parallel sequencing. Collecting the data is no longer the obstacle; instead, the true challenges lie in analysis and interpretation. Hanahan and Weinberg humbly describe their hallmarks as "organizing principles" for thinking about why cancer cells do what they do. Conceivably, fitting new catalogues of genetic alterations to this model of acquired capabilities will help us better understand the relationship between genotype (genetic susceptibility and somatic mutation) and phenotype (tumor development, growth, and metastasis).
References
Hanahan D, & Weinberg RA (2011). Hallmarks of cancer: the next generation. Cell, 144 (5), 646-74 PMID: 21376230
NIH announces 79 awards to encourage creative ideas in science
Tuesday, 20 September 2011
Partek(R) Wins Illumina(R) iDEA Award
press release
Sept. 19, 2011, 10:29 a.m. EDT
Partek(R) Wins Prestigious Illumina(R) iDEA Award
Partek software and algorithms show promising ability to substantially improve the scientific utility of next generation sequencing data
ST. LOUIS, Sep 19, 2011 (BUSINESS WIRE) -- Partek Incorporated, a global leader in bioinformatics software, announced today their receipt of the Most Creative Algorithm award, Commercial category, in the Illumina Data Excellence Award (iDEA) challenge for innovation in genomic data visualization and algorithmic analysis.
According to the judges, Partek received the prestigious award for their entire, comprehensive start-to-finish data analysis tool set--Partek(R) Flow(TM), Partek(R) Genomics Suite(TM), and Partek(R) Pathway(TM)--as well as a number of useful novel algorithms, the most revolutionary being Partek's Gene-Specific Model. The model works on the assertion that a single statistical test does not optimally fit all genes, because each gene may have a different distribution and be influenced by different biological factors. The Gene-Specific Model therefore evaluates many models and distributions for each gene and selects the model that best fits that gene individually. This method yields two important advantages: first, more statistical power and more reliable findings as a result of a better model fit; and second, more information about which genes are influenced by which biological factors. This allows researchers to ascertain exactly how genes are affected by specific factors, in turn yielding a more statistically accurate analysis.
Tom Downey, President of Partek Incorporated had this to say, "People have been debating what is the proper distribution and statistical test for next generation sequencing data for years. We've pointed out the real elephant in the room on this debate, which is that there is not one single distribution and single statistical test that fits all genes or transcripts. For example, some genes are gender-specific, and others are not. Some genes follow a Poisson distribution and others do not."
To learn more about Partek's award winning data analysis, register at www.partek.com to view the webinar.
About Partek
Partek Incorporated ( www.partek.com ) develops and globally markets quality software for life sciences research. Its flagship product, Partek(R) Genomics Suite(TM), provides innovative solutions for integrated genomics. Partek Genomics Suite is unique in supporting all major microarray and next-generation sequencing platforms. Workflows offer streamlined analysis for: Gene Expression, miRNA Expression, Exon, Copy Number, Allele-Specific Copy Number, LOH, Association, Trio analysis, Tiling, ChIP-Seq, RNA-Seq, DNA-Seq, DNA Methylation and qPCR. Since 1993, Partek, headquartered in St. Louis, Missouri USA, has been turning data into discovery(R).
Partek and Genomics Suite are trademarks of Partek Incorporated. The names of other companies or products mentioned herein may be the trademarks of their respective owners.
SOURCE: Partek Incorporated
Let's make the SEQwiki even more awesome
SEED: efficient clustering of next-generation sequences
Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads.
Results: Here, we introduce SEED—an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with a linear time and memory performance. When using SEED as a preprocessing tool on genome/transcriptome assembly data, it was able to reduce the time and memory requirements of the Velvet/Oasis assembler for the datasets used in this study by 60–85% and 21–41%, respectively. In addition, the assemblies contained longer contigs than non-preprocessed data as indicated by 12–27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms.
Availability: The SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/seed.
Contact: thomas.girke@ucr.edu
Supplementary information: Supplementary data are available at Bioinformatics online
online gamers playing on Foldit have deciphered a protein folding puzzle that had bamboozled scientists and automated computers
Gamers take 3 weeks to solve puzzle that stumped scientists for over a decade
Ever heard of protein folding? Neither had we. But the online game Foldit lets you do just that, so you can have fun while you contribute to the progress of science. Yes, the science that actual, professional scientists spend time working with!
The Sydney Morning Herald points us to an interesting article published in the Nature Structural & Molecular Biology journal [PDF], which describes how online gamers playing on Foldit have deciphered a puzzle that had bamboozled scientists and automated computers working on the problem for over a decade.
They figured out the protein structure of a monomeric protease enzyme, which is "a cutting agent in the complex molecular tailoring of retroviruses, a family that includes HIV". The understanding of this structure is an important step towards discovering the causes of many diseases related to this enzyme and coming up with treatments for them.
Sorting Petabytes with MapReduce - The Next Episode
Posted by Grzegorz Czajkowski, Marián Dvorský, Jerry Zhao, and Michael Conley, Systems Infrastructure
Almost three years ago we announced results of the first ever "petasort" (sorting a petabyte-worth of 100-byte records, following the Sort Benchmark rules). It completed in just over six hours on 4000 computers. Recently we repeated the experiment using 8000 computers. The execution time was 33 minutes, an order of magnitude improvement.
Our sorting code is based on MapReduce, which is a key framework for running multiple processes simultaneously at Google. Thousands of applications, supporting most services offered by Google, have been expressed in MapReduce. While not many MapReduce applications operate at a petabyte scale, some do. Their scale is likely to continue growing quickly. The need to help such applications scale motivated us to experiment with data sets larger than one petabyte. In particular, sorting a ten petabyte input set took 6 hours and 27 minutes to complete on 8000 computers. We are not aware of any other sorting experiment successfully completed at this scale.
We are excited by these results. While internal improvements to the MapReduce framework contributed significantly, a large part of the credit goes to numerous advances in Google's hardware, cluster management system, and storage stack.
What would it take to scale MapReduce by further orders of magnitude and make processing of such large data sets efficient and easy? One way to find out is to join Google's systems infrastructure team. If you have a passion for distributed computing, are an expert or plan to become one, and feel excited about the challenges of exascale then definitely consider applying for a software engineering position with Google.
Stanford-Led Team Demonstrates Utility of Ethnicity-Specific Reference for Interpreting Genome Data
NEW YORK (GenomeWeb News) – In a study appearing online last night in PLoS Genetics, a Stanford University-led team described the "ethnicity-specific" reference genome approach it used to analyze whole genome sequences from four members of a single family.
By incorporating estimated allele frequency data from the 1000 Genomes Project into the existing human reference genome, the researchers came up with three synthetic human genome references containing the major alleles identified in European, African, or East Asian populations — a strategy that's intended to more accurately represent the genetic variation present in each of the major HapMap populations.
Whole-genome sequencing and clinical annotation
- Construction of and alignment to an ethnicity-specific major allele reference sequence yielded improved alignment and more accurate genotyping, especially at disease-associated loci.
- Mendelian inheritance state analysis in the family structure enabled identification and removal of >90% of variants arising from sequencing errors.
- Per-trio phasing, inheritance state of adjacent variants, and population-level linkage disequilibrium data were integrated to provide long-range phased haplotypes.
- By fine-mapping recombination events to sub-kilobase resolution, the authors were able to perform sequence-based human lymphocyte antigen (HLA) typing.
- A curated database of genotype-phenotype correlations made it possible to construct comprehensive genetic risk profiles, including multigenic risk of inherited thrombophilia, common disease susceptibility, and pharmacogenomics.
Are BIG RAM servers popular?
BGI has them :(
ucdavis has one :(
Titus Brown recommends 512 Gb or even 1 Tb (shudder)
Jerm makes a case for owning one here http://jermdemo.blogspot.com/2011/06/big-ass-servers-and-myths-of-clusters.html
Nick Loman is already doing market research on buying one.
More importantly, the seqanswers wiki suggests that you should own one for de novo assembly ;)
Do you own one? How often does it get used?
More MPI-based or memory-efficient de Bruijn assemblers are being pushed out now ... is throwing more RAM at the problem really still required?
Hmmm, I don't have access to one, but my limited experience with a 256 Gb RAM machine for a de novo assembly of a fish transcriptome didn't give me the contigs that I wanted (it ran out of memory midway :( )
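For a sense of scale, here is a back-of-envelope sketch of the k-mer table such an assembly has to hold; the read count, read length, k and the 16 bytes-per-entry figure are my own assumptions, not measurements from any particular assembler:

$ awk 'BEGIN { reads = 100e6; len = 100; k = 31          # 100M x 100 bp reads, k=31
               kmers = reads * (len - k + 1)             # k-mer occurrences before error correction
               printf "%.0f billion k-mers, ~%.0f Gb at 16 bytes each\n", kmers/1e9, kmers*16/2^30 }'
7 billion k-mers, ~104 Gb at 16 bytes each

Sequencing errors inflate the number of distinct k-mers well beyond this, which is exactly how a 256 Gb machine can run out of memory midway.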
Monday, 19 September 2011
Assemblathon 1: A competitive assessment of de novo short read assembly methods
ABSTRACT
Low cost short read sequencing technology has revolutionised genomics, though it is only just becoming practical for the high quality de novo assembly of a novel large genome. We describe the Assemblathon 1 competition, which aimed to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies. In a collaborative effort teams were asked to assemble a simulated Illumina HiSeq dataset of an unknown, simulated diploid genome. A total of 41 assemblies from 17 different groups were received. Novel haplotype aware assessments of coverage, contiguity, structure, base calling and copy number were made. We establish that within this benchmark (1) it is possible to assemble the genome to a high level of coverage and accuracy, and that (2) large differences exist between the assemblies, suggesting room for further improvements in current methods. The simulated benchmark, including the correct answer, the assemblies and the code that was used to evaluate the assemblies is now public and freely available from http://www.assemblathon.org/.
Excerpted from the Introduction:
As the field of sequencing has changed so has the field of sequence assembly; for a recent review see Miller et al. (2010). In brief, using Sanger sequencing, contigs were initially built using overlap or string graphs (Myers 2005) (or data structures closely related to them), in tools such as Phrap (http://www.phrap.org/), GigAssembler (Kent and Haussler, 2001), Celera (Myers et al. 2000) (Venter et al. 2001), ARACHNE (Batzoglou et al. 2002), and Phusion (Mullikin and Ning 2003), which were used for numerous high quality assemblies such as human (Lander et al. 2001) and mouse (Mouse Genome Sequencing Consortium et al. 2002). However, these programs were not generally efficient enough to handle the volume of sequences produced by the new generation of sequencing machines.
While some maintained the overlap graph approach, e.g. Edena (Hernandez et al. 2008) and Newbler (http://www.454.com/), others used word look-up tables to greedily extend reads, e.g. SSAKE (Warren et al. 2007), SHARCGS (Dohm et al. 2007), VCAKE (Jeck et al. 2007) and OligoZip (http://linux1.softberry.com/berry.phtml?topic=OligoZip). These word look-up tables were then extended into de Bruijn graphs to allow for global analyses (Pevzner et al. 2001), e.g. Euler (Chaisson and Pevzner 2008), AllPaths (Butler et al. 2008) and Velvet (Zerbino and Birney 2008). As projects grew in scale further engineering was required to fit large whole genome datasets into memory (ABySS (Simpson et al. 2009), Meraculous (in submission), SOAPdenovo (Li et al. 2010), Cortex (in submission)). Now, as improvements in sequencer technology are extending the length of "short reads", the overlap graph approach is being revisited, albeit with optimized programming techniques, e.g. SGA (Simpson and Durbin 2010), as are greedy contig extension approaches, e.g. PRICE (http://derisilab.ucsf.edu/software/price/index.html) and Monument.
In general, most sequence assembly programs are multi stage pipelines, dealing with correcting measurement errors within the reads, constructing contigs, resolving repeats (i.e. disambiguating false positive alignments between reads) and scaffolding contigs in separate phases. Since a number of solutions are available for each task, several projects have been initiated to explore the parameter space of the assembly problem, in particular in the context of short read sequencing ((Phillippy et al. 2008), (Hubis et al. 2011), (Alkan et al. 2011), (Narzisi and Mishra 2011), (Zhang et al. 2011) and (Lin et al. 2011)).
Saturday, 17 September 2011
High-throughput sequencing confers a deep view of seasonal community dynamics in pelagic marine environments
Gilbert et al. (2011, 2010) show that even in bacterial communities, there are definite seasonal patterns and peaks in community diversity. Figuring out what causes these patterns is sometimes surprisingly easy – it looks like shifting day length accounts for 65% of the changes in bacterial diversity (I'm sure the authors' jaws dropped when they saw this result…). Even more ridiculous (in a good way), the specific bacterial assemblage—the 'fingerprint' of species present in the community—could predict the month with 100% accuracy. And no surprise, only 2% of the 100 most abundant taxa they observed could be identified down to species level. (Previously undiscovered diversity is so old hat these days. But still cool).
References:
Gilbert JA, Steele JA, Caporaso JG, Steinbrück L, Reeder J, Temperton B, Huse S, McHardy AC, Knight R, Joint I, Somerfield P, Fuhrman JA, & Field D (2011a). Defining seasonal marine microbial community dynamics. The ISME journal PMID: 21850055
Gilbert, J., Field, D., Swift, P., Thomas, S., Cummings, D., Temperton, B., Weynberg, K., Huse, S., Hughes, M., Joint, I., Somerfield, P., & Mühling, M. (2010). The Taxonomic and Functional Diversity of Microbes at a Temperate Coastal Site: A ‘Multi-Omic’ Study of Seasonal and Diel Temporal Variation PLoS ONE, 5 (11) DOI: 10.1371/journal.pone.0015545
Fuhrman JA, Hewson I, Schwalbach MS, Steele JA, Brown MV, & Naeem S (2006). Annually reoccurring bacterial communities are predictable from ocean conditions. Proceedings of the National Academy of Sciences of the United States of America, 103 (35), 13104-9 PMID: 16938845
FAQ - Howto do RNA-seq Bioinformatics analysis on Galaxy
Reposted the summary links here for convenience.
Tutorial covering RNA-seq analysis (tool under "NGS: RNA Analysis")
http://usegalaxy.org/u/jeremy/
FAQ to help with troubleshooting (if needed):
http://usegalaxy.org/u/jeremy/
For visualization, an update that allows the use of a user-specified fasta reference genome is coming out very soon. For now, you can view annotation by creating a custom genome build, but the actual reference will not be included. Use "Visualization -> New Track Browser" and follow the instructions for "Is the build not listed here? Add a Custom Build".
Help for using the tool is available here:
http://galaxyproject.org/
Currently, RNA-seq analysis for SOLiD data is available only on Galaxy test server:
http://test.g2.bx.psu.edu/
Please note that there are quotas associated with the test server:
http://galaxyproject.org/wiki/
[Credit : Jennifer Jackson ]
http://usegalaxy.org
http://galaxyproject.org/Suppo
Another helpful resource (non-Galaxy related though) is
http://seqanswers.com/wiki/How-to/RNASeq_analysis written by Matthew Young
and the discussion on this wiki @ seqanswers
http://seqanswers.com/forums/showthread.php?t=7068
As well as this review paper in Genome Biology: RNA-seq Review
Stephen mentions this tutorial as well in his blog:
RNA-seq analysis workflow on Galaxy (Bristol workflow)
His post and the discussion thread are here.
http://gmod.827538.n3.nabble.com/Replicates-tt2397672.html#a2560404
(kevin: waiting for the next common question to come: is there Ion Torrent support on Galaxy?)
What's new for 'next generation sequencing' in PubMed
1. Next-generation human genetics.
Shendure J.
Genome Biol. 2011 Sep 14;12(9):408. [Epub ahead of print]
PMID: 21920048 [PubMed - as supplied by publisher]
Abstract: The field of human genetics is being reshaped by exome and genome sequencing. Several lessons are evident from observing the rapid development of this area over the past 2 years, and these may be instructive with respect to what we should expect from 'next-generation human genetics' in the next few years.
2. Next-generation diagnostics for inherited skin disorders.
Lai-Cheong JE, McGrath JA.
J Invest Dermatol. 2011 Oct;131(10):1971-3. doi: 10.1038/jid.2011.253.
PMID: 21918571 [PubMed - in process] Free Article
Abstract: Identifying genes and mutations in the monogenic inherited skin diseases is a challenging task. Discoveries are cherished but often gene-hunting efforts have gone unrewarded because technology has failed to keep pace with investigators' enthusiasm and clinical resources. But times are changing. The recent arrival of next-generation sequencing has transformed what can now be achieved.
3. Whole cancer genome sequencing by next-generation methods.
Ross JS, Cronin M.
Am J Clin Pathol. 2011 Oct;136(4):527-39.
PMID: 21917674 [PubMed - in process]
Abstract: Traditional approaches to sequence analysis are widely used to guide therapy for patients with lung and colorectal cancer and for patients with melanoma, sarcomas (eg, gastrointestinal stromal tumor), and subtypes of leukemia and lymphoma. The next-generation sequencing (NGS) approach holds a number of potential advantages over traditional methods, including the ability to fully sequence large numbers of genes (hundreds to thousands) in a single test and simultaneously detect deletions, insertions, copy number alterations, translocations, and exome-wide base substitutions (including known "hot-spot mutations") in all known cancer-related genes. Adoption of clinical NGS testing will place significant demands on laboratory infrastructure and will require extensive computational expertise and a deep knowledge of cancer medicine and biology to generate truly useful "clinically actionable" reports. It is anticipated that continuing advances in NGS technology will lower the overall cost, speed the turnaround time, increase the breadth of genome sequencing, detect epigenetic markers and other important genomic parameters, and become applicable to smaller and smaller specimens, including circulating tumor cells and circulating free DNA in plasma.
4. A novel application of pattern recognition for accurate SNP and indel discovery from high-throughput data: Targeted resequencing of the glucocorticoid receptor co-chaperone FKBP5 in a Caucasian population.
Pelleymounter LL, Moon I, Johnson JA, Laederach A, Halvorsen M, Eckloff B, Abo R, Rossetti S.
Mol Genet Metab. 2011 Aug 24. [Epub ahead of print]
PMID: 21917492 [PubMed - as supplied by publisher]
Abstract: The detection of single nucleotide polymorphisms (SNPs) and insertion/deletions (indels) with precision from high-throughput data remains a significant bioinformatics challenge. Accurate detection is necessary before next-generation sequencing can routinely be used in the clinic. In research, scientific advances are inhibited by gaps in data, exemplified by the underrepresented discovery of rare variants, variants in non-coding regions and indels. The continued presence of false positives and false negatives prevents full automation and requires additional manual verification steps. Our methodology presents applications of both pattern recognition and sensitivity analysis to eliminate false positives and aid in the detection of SNP/indel loci and genotypes from high-throughput data. We chose FK506-binding protein 51 (FKBP5) (6p21.31) for our clinical target because of its role in modulating pharmacological responses to physiological and synthetic glucocorticoids and because of the complexity of the genomic region. We detected genetic variation across a 160kb region encompassing FKBP5. 613 SNPs and 57 indels, including a 3.3kb deletion, were discovered. We validated our method using three independent data sets and, with Sanger sequencing and Affymetrix and Illumina microarrays, achieved 99% concordance. Furthermore we were able to detect 267 novel rare variants and assess linkage disequilibrium. Our results showed both a sensitivity and specificity of 98%, indicating near perfect classification between true and false variants. The process is scalable and amenable to automation, with the downstream filters taking only 1.5h to analyze 96 individuals simultaneously. We provide examples of how our level of precision uncovered the interactions of multiple loci, their predicted influences on mRNA stability, perturbations of the hsp90 binding site, and individual variation in FKBP5 expression. Finally we show how our discovery of rare variants may change current conceptions of evolution at this locus.
Bitcasa lets you have 'infinite' storage on cloud- Not a joke
Essentially they promise to store all of your hdd content in encrypted format in the cloud.
Nothing new? Well they are only going to charge you USD$10 / month for it.
How are they going to achieve that?
The company has proprietary data de-duplication algorithms that can reduce most users' file storage footprint to about 25 Gb of unique data each (assuming we share similar files, like mp3s and the like).
Hmmm, imagine the potential for storing NGS data on the cloud for cheap! (Well, we won't exactly be bankrupting them if most people are storing human genome sequences, which will be very, very similar, right?)
[From CNET]
The company is aggressive about data de-duplication, and furthermore, most users have less than 25GB of data. With cheap bandwidth and cheap storage, it works. The 8-person company has raised $1.3 million and counts Andreessen Horowitz and the CrunchFund as its backers.
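The trick is easier to see in miniature: with content-addressed storage, two users uploading byte-identical files cost the provider one stored copy plus a hash lookup. A toy version of the duplicate-detection step (filenames hypothetical, GNU coreutils assumed):

$ # files whose md5 (the first 32 chars of each line) matches are byte-identical
$ md5sum *.fastq | sort | uniq -w32 --all-repeated=separate

Bitcasa presumably does this at block rather than file granularity, but the principle (hash first, store once) is the same.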
http://www.bitcasa.com/beta-signup/?share=594278487
How to Make Your Hard Drive Infinite - Technology Review
(Credit: Bitcasa)
RT @phylogenomics Video "UCLA: 12 file sharing myths in two minutes" mostly makes me think about how openness makes life so much easier http://t.co/wkUUlmrP
Friday, 16 September 2011
Ion Torrent PGM Technology updates
PGM does seem to be the most promising platform with room to grow
I am curious though how many more wells they can squeeze into the chip at this size without having to upgrade the machine or do 'dual core' tricks to double throughput.
But as I understand it, they cannot load all of the wells with beads, as the software actually uses the empty wells as a noise filter at the processing stage.
Interesting snippets.
They have been getting in-house throughput of:
50.3 Mbp (~600k reads) on the 314 Chip
330 Mbp on the 316 Chip
The longest read that they officially have without errors is 341 bp (though I guess it's a matter of chance, when the sequence matches the 'samba' random cycle, that one can achieve longer reads).
One can also do miRNA sequencing with 5 ng of miRNA, although the number of reads might be a tad limiting depending on the transcriptome complexity of your organism.
Would be interesting to see what numbers are coming out from Broad and BGI though. Please post in comments if you have them.
Will update if I remember more stuff.
What is interesting is that they have been pushing the throughput envelope but they are more careful about pushing new protocols without extensive testing.
I like the direction they are going ahead with releasing public data and allowing fair comparisons and I hope that other vendors take up the same direction.
I do understand why they wish to keep all the discussions (uncensored) within their Ion Community to make it a vibrant, supportive community. But I don't really like the idea that they made the Torrent Users section available only to someone with a PGM serial number.
This makes life hard for labs sequencing with providers or core labs.
BGI Bemoans Absurdist Data Transfer | Informatics Iron | GenomeWeb
http://www.genomeweb.com//node/978745?hq_e=el&hq_m=1089348&hq_l=9&hq_v=4f37903830
The IT leaders of the 1000 Genomes Project describe how they must "distressingly often resort to shipping hard disks around to transfer data between centers, rather than use the internet, or even via Aspera which is faster than ftp [file transfer protocol]." The issue is so dire that BGI has established an open access journal, GigaScience, to deal with the problem of data dissemination and organization.
Thursday, 15 September 2011
Full-length transcriptome assembly from RNA-Seq data without a reference genome.
Source
Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, Massachusetts, USA.
Abstract
Massively parallel sequencing of cDNA has enabled deep and efficient probing of transcriptomes. Current approaches for transcript reconstruction from such data often rely on aligning reads to a reference genome, and are thus unsuitable for samples with a partial or missing reference genome. Here we present the Trinity method for de novo assembly of full-length transcripts and evaluate it on samples from fission yeast, mouse and whitefly, whose reference genome is not yet available. By efficiently constructing and analyzing sets of de Bruijn graphs, Trinity fully reconstructs a large fraction of transcripts, including alternatively spliced isoforms and transcripts from recently duplicated genes. Compared with other de novo transcriptome assemblers, Trinity recovers more full-length transcripts across a broad range of expression levels, with a sensitivity similar to methods that rely on genome alignments. Our approach provides a unified solution for transcriptome reconstruction in any sample, especially in the absence of a reference genome.
PMID: 21572440 [PubMed - in process]
Wednesday, 14 September 2011
Adding custom reference genome to Torrent Server manually - My Experience
Apologies! After digging in the Ion Community a little more, I think this is the updated link for V1.4 TS
Adding a New Genome Index
Created on: Jul 7, 2011 4:29 PM by ghartsell - Last Modified: Jul 11, 2011 1:48 PM by ghartsell
But the manually created reference index doesn't appear in the final dropdown menu when I try to do realignment (it does appear in the reference tab)
Don't really understand this line: "As of release 1.1.0, only the "tmap-f1" index_type is supported."
as the index I created had its info.txt with tmap-f2.
In any case, if you don't mind fiddling with the web browser and you are met with 'file deleted', or the job started and you still do not have your index, you can
restart ionJobServer:
sudo /etc/init.d/ionJobServer restart
Adapted from the original doc here:
Adding a New Genome Index
As part of the standard analysis process, reads are aligned to a genomic reference, and the alignments and some summary statistics based on them are included in the analysis report page. This HOWTO describes the process to add a new reference genome, something that will be necessary when a user starts to work with a new genome sequence.
The aligner used is named tmap and it comes pre-installed on the Torrent Server.
Prerequisites
Before we begin, you will need your reference sequence in a single file in fasta format and you will need command-line access to the Torrent Server. Please note that it must have Unix line endings and not Windows line endings. (It can be in .zip compressed format, but I didn't test this.)
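A quick way to check for and strip Windows line endings before building the index; my_ref.fasta is a placeholder name, and dos2unix works just as well as the sed call:

$ grep -c $'\r' my_ref.fasta     # non-zero output means CRLF (Windows) line endings
$ sed -i 's/\r$//' my_ref.fasta  # strip the trailing carriage returns in place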
Procedure
Select a Short Form of Genome Name
The short form of the genome name is the name under which you would like the reference option to appear when initiating a run on the PGM™ instrument. There are some rules on how to define the short form of the genome name:
- it should not match any of the existing references installed under the standard reference location
- it should be comprised solely of alphanumeric characters, underscore ("_") and period (".")
Index Creation
The alignment package (ion-alignment) comes with a wrapper script, build_genome_index.pl, that automates the TMAP index creation process. It requires four inputs:
- a single FASTA file
- the short form of the genome name (see previous section)
- the long form of the genome name (see next section for description)
- the genome version (see next section for description)
The steps to create the index:
- move or copy the FASTA file to the standard reference location
- run the script under the standard reference location
$ cd /results/referenceLibrary/tmap-f2/
$ build_genome_index.pl --fasta A_flavithermus.fasta -s A_flavithermus \
    -v "gi|212637849|ref|NC_011567.1" \
    -l "Anoxybacillus flavithermus WK1 chromosome complete genome"
Copying A_flavithermus.fasta to A_flavithermus/A_flavithermus.fasta...
...copy complete
Making tmap index...
...tmap index complete
Making samtools index...
...samtools index complete
There should now be 10 files in the directory, including the original fasta file. The size of the files varies by genome - for the human genome (3,000,000,000 bases in length) the combined size of all index files, including the original fasta file itself, is just under 8Gb. For E. coli (4,600,000 bases in length) it is about 0.4Gb.
$ ls -1 /results/referenceLibrary/tmap-f2/A_flavithermus/
A_flavithermus.fasta
A_flavithermus.fasta.fai
A_flavithermus.fasta.md5
A_flavithermus.fasta.tmap.anno
A_flavithermus.fasta.tmap.bwt
A_flavithermus.fasta.tmap.pac
A_flavithermus.fasta.tmap.rbwt
A_flavithermus.fasta.tmap.rpac
A_flavithermus.fasta.tmap.rsa
A_flavithermus.fasta.tmap.sa
A_flavithermus.info.txt
samtools.log
tmap.log
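To sanity-check those size figures on your own server, plain du on the reference directory is enough (the path assumes the A_flavithermus example above):

$ du -sh /results/referenceLibrary/tmap-f2/A_flavithermus/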
Adding the Genome to the PGM Drop-down Menu
For additional convenience it is also recommended (though not required) to add the genome to the list that is made available on the PGM as a drop-down menu - this can be very helpful in avoiding typos on the PGM.
updateref will crawl through the directory, grab the genome_shortname fields from all installed reference libraries of the version specified, and overwrite reference_list.txt. When updateref is called without any command line argument, it assumes the default settings, e.g. /results/PGM_config as the location of the PGM configuration. The location is crucial because it needs to be under the same root directory to which the PGMs transfer their data. For example, if the PGMs transfer data to a file server mounted as /mnt/PGM_Data on the Torrent Server, the option -p /mnt/PGM_Data/PGM_config needs to be specified. updateref --help will list more options.
Default settings. PGM data are stored in /results.
$ sudo updateref
List of library
-> ampl_valid
-> vibrio_fisch
-> e_coli_k12
-> e_coli_dh10b
-> rhodopalu
Customized environment. PGM data are stored in /mnt/PGM_Data.
$ sudo updateref -p /mnt/PGM_Data/PGM_config