You have to read My Data Management Plan – a satire
If you are not laughing at the end of it, then
a) you are not a bioinformatician / computational scientist, nor do you work with one,
b) you have no sense of humour, or
c) you actually think it makes for a good plan.
If you are c), I shall pray for you. Perhaps you might need this software tool as well as your usual BioPerl.
Monday, 31 May 2010
Sunday, 30 May 2010
Cofactor Genomics on the different NGS platforms
Original post here
They are a commercial company that offers NGS on the ABI and Illumina platforms, and since this is on their company page, I guess it's their official stand on what rocks on each platform.
Excerpted.
Applied Biosystems SOLiD 3
The Applied Biosystems SOLiD 3 has the shortest but also the highest quantity of reads. The SOLiD produces up to 240 million 50bp reads per slide per end. As with the Illumina, Mate-Pairs produce double the output by duplicating the read length on each end, and the SOLiD supports a variety of insert lengths like the 454. The SOLiD can also run 2 slides at once to again double the output. SOLiD has the lowest *raw* base qualities but the highest processed base qualities when using a reference due to its 2-base encoding. Because of the number of reads and more advanced library types, we recommend the SOLiD for all RNA and bisulfite sequencing projects.
Solexa/Illumina
The Solexa/Illumina generates shorter reads at 36-75bp but produces up to 160 million reads per run. All reads are of similar length. The Illumina has the highest *raw* quality scores and its errors are mostly base substitutions. Paired-end reads with ~200 bp inserts are possible with high efficiency and double the output of the machine by duplicating the read length on each end. Paired-end Illumina reads are suitable for de novo assemblies, especially in combination with 454. The large number of reads makes the Illumina appropriate for de novo transcriptome studies with simultaneous discovery and quantification of RNAs at qRT-PCR accuracy.
Roche/454 FLX
The Roche/454 FLX with Titanium chemistry generates the longest reads (350-500bp) and the most contiguous assemblies, can phase SNPs or other features into blocks, and has the shortest run times. However, 454 also produces the fewest total reads (~1 million) at the highest cost per base. Read lengths are variable. Errors occur mostly at the ends of long same-nucleotide stretches. Libraries can be constructed with many insert sizes (8kb - 20kb) but at half of the read length for each end and with low efficiency.
Labels:
454,
ABI,
comparison,
Illumina,
Next Generation Sequencing,
pyrosequencing,
Solexa,
SOLiD
GenoCon: First-ever Contest in Rational Genome Design Based on Semantic-web Technology
Fascinating!
Original post here and here
Tokyo, Japan (PRWeb UK) May 24, 2010 — The Bioinformatics And Systems Engineering (BASE) division of RIKEN, Japan’s flagship research institute, is holding its first-ever International Rational Genome Design Contest (GenoCon) on the semantic web. The contest makes use of an information infrastructure for life science research known as the RIKEN Scientists’ Networking System (SciNeS*2) and will take place between May 25 and September 30.
GenoCon: An international science and technology competition supporting future specialists in rational genome design for Synthetic Biology
First-ever contest in rational genome design based on semantic-web technology
Summary:
- A challenge for green innovation: rational genome design of a plant with an environmental detoxification function.
- Collection and sharing of genome-design theories and programs from researchers around the world.
- Web-based contest aimed at supporting a future generation of scientists – including a category for high-school students.
Built upon semantic web technology, GenoCon is the first contest of its kind, offering contestants the chance to compete in technologies for rational genome design. To succeed, contestants must make effective use of genomic and protein data contained in SciNeS database clusters to design DNA sequences that improve plant physiology. In the first GenoCon, contestants are asked to design a DNA sequence conferring to the model organism Arabidopsis thaliana the functionality to effectively eliminate and detoxify airborne Formaldehyde.
GenoCon also offers, in addition to categories for Japanese and international researchers and university students, a category specifically for high-school students. Just as ROBOCON (Robot Contest), GenoCon thus provides opportunities for young people to learn about the most cutting-edge science with a sense of pleasure, bringing intellectual excitement to the field of Life Science and supporting a future generation of scientists.
GenoCon will be accepting entries to the contest starting May 25, 2010 at the official GenoCon website: http://genocon.org/sw/wiki/en/cria196s1i/.
For more information, please contact:
Dr. Tetsuro Toyoda
Director, Bioinformatics And Systems Engineering (BASE) Division
RIKEN Yokohama Institute
Tel: +81-(0)45-503-9610 / Fax: +81-(0)45-503-9553
Planning Section
Yokohama Research Promotion Division
RIKEN Yokohama Institute
Tel: +81-(0)45-503-9117 / Fax: +81-(0)45-503-9113
Ms. Tomoko Ikawa (PI officer)
Global Relations Office
RIKEN
Tel: +81-(0)48-462-1225 / Fax: +81-(0)48-462-4715
Mail: koho(at)riken(dot)jp
The State of computational genomics
It's good to know that things are moving ahead at Sean Eddy's lab.
This blog post makes some very good recommendations, and I wonder how much of that is helped (*chuckle*) by the fact that he is "paid by a dead billionaire, not so much by federal tax dollars."
Just kidding! His recommendations are made from his vast experience for sure. So read on!
My fav parts
Plan explicitly for sustainable exponential growth. We keep using metaphors of data “tsunamis” or “explosions”, but these metaphors are misleading. Big data in biology is not an unexpected disastrous event that we have to clean up after. The volume of data will continue to increase exponentially for the foreseeable future. We must make sober plans for sustainable exponential growth (this is not an oxymoron).
Datasets should be “tiered” into different appropriate representations of different volumes. Currently we tend to default to archiving raw data – which not only maximizes storage/communication challenges, but also impedes downstream analyses that require processed datasets (genome assemblies, sequence alignments, or short reads mapped to a reference genome, rather than raw reads).
Data structures for histograms of short reads mapped to a reference genome coordinate system for a particular ChIP-seq or RNA-seq experiment. Many analyses of ChIP-seq and RNA-seq data don’t need the actual reads, only the histogram.
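As a toy illustration of that last point (my own sketch, not from Eddy's post): given a tab-delimited file of mapped read positions with chromosome in column 1 and start coordinate in column 2, a 1 kb binned read-count histogram is a one-liner:
# hypothetical input: chrom<TAB>start per mapped read; output: chrom, bin start, read count
awk 'BEGIN{OFS="\t"} {bin=int($2/1000)*1000; n[$1 OFS bin]++} END{for (k in n) print k, n[k]}' mapped_reads.txt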
Reduce waste on subscaled computing infrastructure. NIH-funded investigators with large computing needs are typically struggling. Many are building inefficient computing clusters in their individual labs. This is the wrong scale of computing, and it is wasting NIH money. Several competing forces are at work:
Clusters have short lives: typically about three years. They are better accounted for as recurring consumables, not a one-time capital equipment expense. There are many stories of institutions being surprised when an expensive cluster needs to be thrown away, and the same large quantity of money spent to buy a new one.
In actual experience, computational biologists simply do not use remote computing facilities, preferring to use local ones. HPC experts in other fields tend to assume that this reflects a lack of education in HPC, but many computational genomicists have extensive experience in HPC. I assert that it is in fact an inherent structural issue, that computing in biology simply has a different workflow in how it manipulates datasets. For example, a strikingly successful design decision in our HHMI Janelia Farm HPC resource was to make the HPC filesystem the same as the desktop filesystem, minimizing data transfer operations internally.
NIH funding mechanisms are good at funding individual investigators, or at one-time capital equipment expenses, or at charging services (there are line items on grant application forms that seem to have originated in the days of mainframe computing!), or at large (regional or national) technology resources. Institutions have generally struggled to find appropriate ways to fund equipment, facilities, and personnel for mid-scale technology cores at the department or institute level, and computing clusters are probably the most dysfunctional example.
Cloud computing deserves development, but is not yet a substitute for local infrastructure. Cloud computing will have an increasing impact. It offers the prospect of offloading the difficult infrastructural critical mass issues of clustered computing to very large facilities — including commercial clouds such as Amazon EC2 and Microsoft Azure, but also academic clouds, perhaps even clouds custom-built for genomics applications.
Make better integrated informatics and analysis plans in NHGRI big science projects. NHGRI planning of big science projects is generally a top-down, committee-driven process that excels at the big picture goal but is less well-suited for arriving at a fully detailed and internally consistent experimental design. This is becoming a weakness now that NHGRI is moving beyond data sets of simple structure and enduring utility (such as the Human Genome Project), and into large science projects that ask more focused, timely questions. Without more detailed planning up front, informatics and analysis is reactive and defensive, rather than proactive and efficient. The result tends to be a default into “store everything, who knows what we’ll need to do!”, which of course exacerbates all our data problems. Three suggestions:
- A large project should have a single “lead project scientist” who takes responsibility for overall direction and planning. This person takes input from the committee-driven planning process, but is responsible for synthesizing a well-defined plan.
- That plan should be written down, at a level of detail comparable to an R01 application. Forcing a consensus written plan will enable better cost/benefit analysis and more advance planning for informatics and analysis.
- That plan should be peer reviewed by outside experts, including informatics and analysis experts, before NHGRI approves funding.
Paper: Comparison of Multiple Genome Sequence Alignment Methods
Comparison of Multiple Genome Sequence Alignment Methods
Chen and Tompa, Nature Biotechnology
Xiaoyu Chen and Martin Tompa at the University of Washington in Seattle present their "comparative assessment of methods for aligning multiple genome sequences." In evaluating the level of agreement among the four ENCODE alignments, the team shows that Pecan "produces the most accurate or nearly most accurate alignment in all species and genomic location categories, while still providing coverage comparable to or better than that of the other alignments in the placental mammals."
Friday, 28 May 2010
ParMap, an algorithm for the identification of small genomic insertions and deletions in nextgen sequencing data
ParMap, an algorithm for the identification of small genomic insertions and deletions in nextgen sequencing data
Hossein Khiabanian, Pieter Van Vlierberghe, Teresa Palomero, Adolfo A Ferrando, Raul Rabadan
BMC Research Notes 2010, 3:147 (27 May 2010)
Q&A on ChIP-seq
http://www.biomedcentral.com/1741-7007/8/56
Flow scheme of the central steps in the ChIP-seq procedure.
Liu et al. BMC Biology 2010 8:56 doi:10.1186/1741-7007-8-56
Illumina: an alternative to 454 in metagenomics?
Check out this BMC Bioinformatics paper entitled "Short clones or long clones? A simulation study on the use of paired reads in metagenomics"
"This paper addresses the problem of taxonomical analysis of paired reads. We describe a new feature of our metagenome analysis software MEGAN that allows one to process sequencing reads in pairs and makes assignments of such reads based on the combined bit scores of their matches to reference sequences. Using this new software in a simulation study, we investigate the use of Illumina paired-sequencing in taxonomical analysis and compare the performance of single reads, short clones and long clones. In addition, we also compare against simulated Roche-454 sequencing runs."
"Our study suggests that a higher percentage of Illumina paired reads than of Roche-454 single reads are correctly assigned to species."
"The gain of long-clone data (75 bp paired reads) over long single-read data (250 bp reads) is still significant at ≈ 4% (not shown)."
of course more importantly
"The authors declare that they have no competing interests."
I am not sure if such a program exists, but I wonder if there is an aligner that takes into account the insert size between mate-pair and paired-end reads. Theoretically it should improve mapping, but by how much is unknown.
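To sketch what I mean (purely hypothetical, not a description of any existing tool): given several candidate placements per read pair, each with a combined alignment score and an implied insert size, one could down-weight placements whose insert deviates from the expected distribution and keep the best placement per pair:
# assumed columns: pair_id, alignment score, implied insert size; mu/sd describe the expected insert-size distribution
awk -v mu=200 -v sd=30 '{z=($3-mu)/sd; s=$2-z*z; if (!($1 in best) || s>best[$1]) {best[$1]=s; keep[$1]=$0}} END{for (p in keep) print keep[p], best[p]}' candidate_pairs.txt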
"This paper addresses the problem of taxonomical analysis of paired reads. We describe a new feature of our metagenome analysis software MEGAN that allows one to process sequencing reads in pairs and makes assignments of such reads based on the combined bit scores of their matches to reference sequences. Using this new software in a simulation study, we investigate the use of Illumina paired-sequencing in taxonomical analysis and compare the performance of single reads, short clones and long clones. In addition, we also compare against simulated Roche-454 sequencing runs."
"Our study suggests that a higher percentage of Illumina paired reads than of Roche-454 single reads are correctly assigned to species."
"The gain of long-clone data (75 bp paired reads) over long single-read data (250 bp reads) is still significant at ≈ 4% (not shown)."
of course more importantly
"The authors declare that they have no competing interests."
I am not sure if such a program exists but I wonder if there is a aligner that takes into account the size between mate pairs and paired ends. Theoratically it should improve mapping. but by how much is unknown
Thursday, 27 May 2010
178 Microbial Reference Genomes Associated with the Human Body
Venter Institute Scientists, Along with Consortium Members of the NIH's Human Microbiome Project, Sequence 178 Microbial Reference Genomes Associated with the Human Body
Researchers from the J. Craig Venter Institute, a not-for-profit genomic research organization, have published, along with other members of the National Institutes of Health (NIH) Human Microbiome Jumpstart Reference Strains Consortium, a catalog of 178 microbial reference genomes isolated from the human body. Other members of the Consortium are: Baylor College of Medicine Human Genome Sequencing Center, the Broad Institute, and the Genome Center at Washington University. The paper is being published in the May 21 issue of the journal Science.
The human body is teeming with a variety of microbial species. This collective community is called the human microbiome. The role these microbes play in human health and disease is still relatively unknown but likely very important. The NIH Human Microbiome Project was launched in 2007, as part of the National Institutes of Health’s (NIH) Common Fund’s Roadmap for Medical Research. It is a $157 million, five-year effort that will implement a series of increasingly complicated studies that reveal the interactive role of the microbiome in human health.
Venter Institute Press Release 20th May
Mary Shelley's Frankenstein! Bacteria, that is...
A new form of life has been created in a laboratory, and the era of synthetic biology is dawning as reported in The Economist.
Craig Venter and Hamilton Smith have published, in the May 20th issue of Science, how they created a living creature. It isn't as impressive as generating life from a primordial soup (which would be really, really big news) but it's cutting edge nonetheless. To take a page from Frankenstein, they used parts of 'dead' organisms, and instead of lightning they used synthesized DNA.
update: Genomeweb's news article
Why Craig Venter Isn't Actually God | The Daily Scan | GenomeWeb
Vote in the Poll as well!
Malaysian Genomics Center Offers Genome Mapping for $4K
Malaysian Genomics Resource Centre (MGRC) has announced a genome mapping pipeline service starting at $4,000 per genome.
“The target market,” said Robert Hercus, managing director of MGRC, “is smaller labs, hospitals, and university researchers who perhaps do not have the facilities or the bioinformaticians but want to do some sequencing.”
“We want to un-complicate the lives of the researchers who are doing wet lab biology and research,” Robert told Bio-IT World in a phone interview. He believes the service will help smaller labs that do not have the time or resources to invest in learning 10-20 open-source packages or pay for software or hardware.
Read full article http://www.bio-itworld.com/2010/05/20/MGRC.html
Wednesday, 26 May 2010
Knome invites researchers to apply for free sequencing and analysis of human exomes
I think there's nothing like the word 'free' that gets researchers to stand up and listen ;p
The 2010 KnomeDISCOVERY Research Awards
Knome invites researchers to apply for free sequencing and analysis of human exomes
Knome announces the launch of the KnomeDISCOVERY Research Awards, designed to spur novel discoveries by researchers working at the nexus of genomics and human health. Given annually, the KnomeDISCOVERY Research Awards will fund innovative projects that help reveal the genetic underpinnings of disease.
This year, Knome will award comprehensive sequencing and discovery-supportive analysis of a total of six (6) human exomes. These services will be distributed among three winning biomedical research groups, each of whom will use the award to study a pair of human exomes of their choice in order to answer a compelling biological question.
A scientific spectator's guide to next-generation sequencing
ROFL
I love the title!
Dr Keith not only looks at next gen sequencing but also the emerging technologies of single molecule sequencing. Interesting read!
My fave parts of the review
"Finally, there is the cost per base, generally expressed in a cost per human genome sequenced at approximately 40X coverage. To show one example of how these trade off, the new PacBio machine has a great cost per sample (~U$100) and per run (you can run just one sample) but a poor cost per human genome – you’d need around 12,000 of those runs to sequence a human genome (~U$120K). In contrast, one can buy a human genome on the open market for U$50K and sub U$10K genomes will probably be generally available this year."
"Length is critical to genome sequencing and RNA-seq experiments, but really short reads in huge numbers are what counts for DGE/SAGE and many of the functional tag sequencing methods. Technologies with really long reads tend not to give as many, and with all of them you can always choose a much shorter run to enable the machine to be turned over to another job sooner – if your application doesn’t need long reads."
Coral Transcriptomics: a budget NGS approach?
Was surprised I didn't blog about this earlier.
Dr Mikhail Matz is a researcher in the field of coral genomics. Here is his approach to doing de novo transcriptomics for an organism whose genome is unavailable.
His compute cluster is basically:
"two Dell PowerEdge 1900 servers joined together with ROCKS clustering software v5.0. Each server had: two Intel Quad Core E5345 (2.33 Ghz, 1333 Mhz FSB, 2x4MB L2 Cache) CPU’s and 16 GB of 667 Mhz DDR2 RAM. The cluster had a combined total of 580 GB disk space."
Tools used are as follows (a rough sketch of the clustering step follows the list):
- Blast executables from NCBI, including blast, blastcl3, and blastclust
- Washington University blast (Wu-blast)
- ESTate sequence clustering software
- Perl
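To give a flavour of the clustering step, a blastclust call might look something like this (the file name and thresholds are made up, not the authors' actual settings):
# cluster nucleotide reads at 95% identity over 90% of their length, using 8 CPUs
blastclust -i 454_reads.fasta -o clusters.txt -p F -S 95 -L 0.9 -a 8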
He admits that the assembled transcriptome might be incomplete (~40,000 contigs with five-fold average sequencing coverage); see Figure 2 for the size distribution of the assembled contigs.
But it is "good enough" to use as a reference transcriptome to align SOLiD reads accurately and to generate the coverage that 454 can't give for the same amount of grant money.
The results are published in BMC Genomics.
Not sure if you have heard of just-in-time inventory, but I think "good enough" science takes a bit of daring to spend that money to ask those what-ifs.
Labels:
454,
assembly,
de novo,
Next Generation Sequencing,
SOLiD,
transcriptome,
transcriptomics
Installing R on CentOS 5.4 64 bit
Create an R.repo file in /etc/yum.repos.d/:
[R-project]
name=R project for Statistical Computing repository
baseurl=http://rm.mirror.garr.it/mirrors/CRAN/bin/linux/redhat/el5/x86_64/
failovermethod=priority
enabled=1
gpgcheck=0
priority=15
yum install R
Propagate the RPMs across the cluster:
/var/cache/yum/R-project/packages/R-2.10.0-2.el5.x86_64.rpm
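To push the cached RPMs out and install them on the rest of the cluster, something like the following should do (node names are hypothetical; assumes passwordless ssh):
for node in compute-0-0 compute-0-1 compute-0-2; do
  scp /var/cache/yum/R-project/packages/*.rpm $node:/tmp/
  ssh $node "yum -y localinstall --nogpgcheck /tmp/R-*.rpm"
done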
Tuesday, 25 May 2010
EMMAX (efficient mixed-model association expedited): new software for GWAS
New Software Promises to Ramp up GWAS
"While genome-wide association studies have certainly proven their worth when it comes to pinpointing which genes play a role in human disease development, they are far from perfect. Sometimes, the genealogy of the individuals included in these large-scale studies can throw a wrench in the works because rarely are pairs of individuals in a study completely unrelated. This pairwise relatedness has occasionally led researchers to believe they have discovered a gene involved in a particular disease when in fact it is an artifact. While most researchers have statistical approaches for dealing with different levels of relatedness that come in the form of population structure or hidden relatedness, a team of scientists from the University of Michigan and the University of California, Los Angeles, has developed a statistical approach for dealing with both forms of relatedness. The method has the added benefit of dramatically speeding up the analysis process from years to just a few hours. "
"While genome-wide association studies have certainly proven their worth when it comes to pinpointing which genes play a role in human disease development, they are far from perfect. Sometimes, the genealogy of the individuals included in these large-scale studies can throw a wrench in the works because rarely are pairs of individuals in a study completely unrelated. This pairwise relatedness has occasionally led researchers to believe they have discovered a gene involved in a particular disease when in fact it is an artifact. While most researchers have statistical approaches for dealing with different levels of relatedness that come in the form of population structure or hidden relatedness, a team of scientists from the University of Michigan and the University of California, Los Angeles, has developed a statistical approach for dealing with both forms of relatedness. The method has the added benefit of dramatically speeding up the analysis process from years to just a few hours. "
Wednesday, 19 May 2010
Has NGS changed the roles of Bioinformaticians?
Writing in Nature, Kelly Rae Chi examines the changing landscape of genome sequencing, and what it means for related careers.
an abstract can be read here
Going forward, bioinformaticians will likely take on "layered roles," she writes, in which they bring to the sequencing center their software engineering, database administration, and mathematics skills, among other things. Because bioinformaticians are critical in the interpretation of data, Jim Mullikin, acting director of the National Institutes of Health's Intramural Sequencing Center, tells Nature that "almost every lab now needs to have a bioinformatician on their team."
With the exception of the last line, I hardly think that anything has changed. Or perhaps it's just my region. And I beg to differ that every lab needs a bioinformatician.
You can
1) collaborate with others
2) outsource the bioinformatics
Hiring a bioinformatician who knows his/her stuff is not easy. My experience is that most times you need a couple of months before a new staff member has settled in enough to churn or munge data without supervision.
MongoDB avail for Ubuntu.
Excerpt from the blog:
"Fans of MongoDB and Ubuntu, rejoice. Installation just got easier, with the appearance of mongodb in the Ubuntu repositories."
Sigh, why am I using CentOS 5.4?
"Fans of MongoDB and Ubuntu, rejoice. Installation just got easier, with the appearance of mongodb in the Ubuntu repositories."
Sigh why am I using CentOS 5.4 ?
Costs of Illumina sequencing
Another blog post detailing the cost of Illumina NGS sequencing. Useful for penny-pinching moments! What's unique is that the author shows "how Single end and Paired end sequencing costs drop as read-length increases".
There are 1001 ways to count costs and to factor in EVERYTHING, like power and human resources. Naturally, vendors dislike us putting it out in the open. (Ahem, I have heard of some companies' lawyers sending emails to "correct" inaccurate figures posted.)
Which is sad, as I do not have a lawyer to send them an email about their inaccuracy, or rather about how they have conveniently left out essential upstream kit prices and other costs. In their defence, everyone is trying to move towards the $1,000 genome, but it does make my life difficult when I have to explain why adverts that say / promise $3,000 genomes are not really fibbing; they just conveniently leave out costs which someone has to pay.
What do you use for citation / bibliography / reference in writing?
Am looking at
http://www.zotero.org/
also exploring
http://www.wizfolio.com/
Found this on the web as well
http://www.easybib.com/
I like that Zotero is well integrated with my browser and has OpenOffice plugins, but keeping a backup of the references and keeping them synced is a problem. I would much rather have my references in the cloud, which makes for easier sharing. Suggestions, anyone?
Not EndNote please; I seldom work on Windows machines.
Tuesday, 18 May 2010
Book review: Programming Collective Intelligence
Programming Collective Intelligence: Building Smart Web 2.0 Applications by Toby Segaran
Permalink: http://amzn.com/0596529325
I have always wanted to explore classification methods and their theory to see how I can apply them to bioinformatics. But so far I had yet to encounter a book or website that explains the topic well with examples you can work through. It's a bonus that the examples are written in Python, a language I know, and one that is highly readable for those who don't.
Although the examples are not from biology, it is easy to see how some classical biological problems can be solved with SVMs.
P.S. This Amazon Associates widget is cool! It will throw up relevant books based on the words in my blog post!
BGI to Sequence and Assemble 100 Vertebrates within Two Years for Genome 10K Project
Got to know of this news via www.genomeweb.com.
BGI to sequence 100 vertebrate species for Genome 10K project. Andrea Anderson. May 14, 2010. GenomeWeb.
The http://www.genome10k.org/ project has lofty aims to capture ".. the genetic diversity of vertebrate species would create an unprecedented resource for the life sciences and for worldwide conservation efforts."
I do wonder however, if the selection will be biased towards life sciences or conservation... Never the twain shall meet? or am I wrong?
Friday, 14 May 2010
Lincoln Stein makes his case for moving genome informatics to the Cloud
Matthew Dublin summarizes Lincoln's paper in "Making the Case for Cloud Computing & Genomics" in GenomeWeb.
excerpt "....
Stein walks the reader through a nice explanation of what exactly cloud computing is, the benefits of using a compute solution that grows and shrinks as needed, and makes an attempt at tackling the question of the cloud's economic viability when compared to purchasing and managing local compute resources.
The take away is that Moore's Law and its effect on sequencing technology will soon force researchers to analyze their mountains of sequencing data in a paradigm where the software comes to the data rather than the current, and opposite, approach. Stein says that this means now more than ever, cloud computing is a viable and attractive option..... "
Yet to read it (my weekend bedtime story); will post comments here.
excerpt "....
Stein walks the reader through an nice explanation of what exactly cloud computing is, the benefits of using a compute solution that grows and shrinks as needed, and makes an attempt at tackling the question of the cloud's economic viability when compared to purchasing and managing local compute resources.
The take away is that Moore's Law and its effect on sequencing technology will soon force researchers to analyze their mountains of sequencing data in a paradigm where the software comes to the data rather than the current, and opposite, approach. Stein says that this means now more than ever, cloud computing is a viable and attractive option..... "
Yet to read it (my weekend bedtime story) will post comments here.
Labels:
cloud computing,
genome,
journal,
Next Generation Sequencing,
review
The elusive Bioscope on cloud service.
This is a follow-up to my last post about cloud-enabled Bioscope: I have found new documentation at Applied Biosystems.
But the service appears to be not yet public.
Oh the suspense!
Labels:
bioinformatics,
bioscope,
Next Generation Sequencing,
SOLiD
Tuesday, 11 May 2010
Solutions for Applying Automation to Next-Generation Sequencing Sample Prep
FYI might be useful for some.
Disclaimer: I am not affiliated with the companies involved.
http://www.genengnews.com/ngs
Solutions for Applying Automation to Next-Generation Sequencing Sample Prep
In recent years, next-generation sequencing (NGS) technologies have rapidly evolved to provide faster, better, cheaper and more reliable mapping of DNA and RNA sequences thus enabling a diverse set of genomic discoveries. This has been largely driven by innovations and improvisations at the technical end. However, challenges still remain with increasing sample throughput, enabling sample preparation, minimizing errors, and with improving data analysis.
This webinar provides the audience with an overview of the developments in NGS technologies, with an emphasis on the challenges that are routinely encountered at various stages in the sequencing and analysis of samples. It offers detailed information on the specific challenges associated with sample preparation and the benefits of using automation to alleviate some of the bottlenecks. The webinar features viewpoints expressed by three experts in the field, who share examples from their laboratories on how to effectively adopt and utilize automation for NGS applications.
What will be covered:
- Overview of NGS technologies and potential challenges in their adoption and use
- Tackling challenges associated specifically with sample prep for NGS
- Effective use of automation for alleviating some of the bottlenecks in sequencing
- Ways to increase and improve the sample throughput in sequencing
- Use and creation of high-diversity sequencing libraries
- Application of automated NGS platforms in areas like cancer genetics and CNS research
- Overview of targeted resequencing applications including effective whole exon sequencing, indexed/barcoded resequencing of small regions, automated targeted resequencing
Who will benefit from attending:
- Scientists involved in pharmaceutical/biotechnology R&D and clinical services
- Scientists keen to learn more about the use and adoption of NGS technologies
- Scientists and clinicians active in biomarker research
- Scientists looking to use sequencing for oncology and CNS research
Panelists include
- Shawn Levy, Ph.D., Faculty Investigator, HudsonAlpha Institute for Biotechnology
- Brian Minie, Ph.D., Broad Institute of MIT and Harvard
- Emily Leproust, Ph.D., Director, Applications and Chemistry R&D, Genomics, Agilent Technology
Monday, 10 May 2010
A plethora of solid2fastq or csfasta-to-fastq converters
I hadn't realised that there's an accumulation of programs/scripts to do the same task. Last count: four of these in my tool closet.
The C binary from bfast
solid2fastq 0.6.4a
Usage: solid2fastq [options]
-c produce no output.
-n INT number of reads per file.
-o STRING output prefix.
-j input files are bzip2 compressed.
-z input files are gzip compressed.
-J output files are bzip2 compressed.
-Z output files are gzip compressed.
-t INT trim INT bases from the 3' end of the reads.
-h print this help message.
send bugs to bfast-help@lists
solid2fastq.pl from bfast-0.6.4a
with notes in the script to refer to the above
# Author: Nils Homer
# Please see the C implementation of this script.
EDIT: THANKS to iceman for his reminder in the comments
"Make sure that you use the BWA's solid2fastq.pl if you are going to use BWA as it "double-encodes" the reads."
solid2fastq.pl from bwa-0.5.7
Usage: solid2fastq.pl <title> <out.prefix>
Note: <title> is the string showed in the `# Title:' line of a
".csfasta" read file. Then <title>F3.csfasta is read sequence
file and <title>F3_QV.qual is the quality file. If
<title>R3.csfasta is present, this script assumes reads are
paired; otherwise reads will be regarded as single-end.
The read name will be <out.prefix>:panel_x_y/[12] with `1' for R3
tag and `2' for F3. Usually you may want to use short <out.prefix>
to save diskspace. Long <out.prefix> also causes troubles to maq.
# Author: lh3
# Note: Ideally, this script should be written in C. It is a bit slow at present.
# Also note that this script is different from the one contained in MAQ.
maq-0.7.1/scripts/solid2fastq.pl
Usage: solid2fastq.pl <title> <out.prefix>
Note: <title> is the string showed in the `# Title:' line of a
".csfasta" read file. Then <title>F3.csfasta is read sequence
file and <title>F3_QV.qual is the quality file. If
<title>R3.csfasta is present, this script assumes reads are
paired; otherwise reads will be regarded as single-end.
The read name will be <out.prefix>:panel_x_y/[12] with `1' for F3
tag and `2' for R3. Usually you may want to use short <out.prefix>
to save diskspace. Long <out.prefix> also causes troubles to maq.
# Author: lh3
# Note: Ideally, this script should be written in C. It is a bit slow at present.
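As a concrete example with made-up file names: for a fragment run whose files are sampleF3.csfasta and sampleF3_QV.qual, the bwa/maq script would be invoked as
solid2fastq.pl sample reads
with "reads" ending up as the prefix in the output read names (keep it short, as the note above says).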
Friday, 7 May 2010
Yet another viewer for NGS data, MagicViewer
MagicViewer: integrated solution for next-generation sequencing data visualization and genetic variation detection and annotation.
Hou H, Zhao F, Zhou L, Zhu E, Teng H, Li X, Bao Q, Wu J, Sun Z.
Nucleic Acids Res. 2010 May 5. [Epub ahead of print]
PMID: 20444865 [PubMed - as supplied by publisher]
Abstract
New sequencing technologies, such as Roche 454, ABI SOLiD and Illumina, have been increasingly developed at an astounding pace with the advantages of high throughput, reduced time and cost. To satisfy the impending need for deciphering the large-scale data generated from next-generation sequencing, an integrated software MagicViewer is developed to easily visualize short read mapping, identify and annotate genetic variation based on the reference genome. MagicViewer provides a user-friendly environment in which large-scale short reads can be displayed in a zoomable interface under user-defined color scheme through an operating system-independent manner. Meanwhile, it also holds a versatile computational pipeline for genetic variation detection, filtration, annotation and visualization, providing details of search option, functional classification, subset selection, sequence association and primer design. In conclusion, MagicViewer is a sophisticated assembly visualization and genetic variation annotation tool for next-generation sequencing data, which can be widely used in a variety of sequencing-based researches, including genome re-sequencing and transcriptome studies. MagicViewer is freely available at http://bioinformatics.zj.cn/magicviewer/.