Friday, 29 July 2011

:) Ion Torrent Server up!

Wednesday, 27 July 2011

Using the Acropora digitifera genome to understand coral responses to environmental change.

Shinzato C, Shoguchi E, Kawashima T, Hamada M, Hisata K, Tanaka M, Fujie M, Fujiwara M, Koyanagi R, Ikuta T, Fujiyama A, Miller DJ, Satoh N.

Nature. 2011 Jul 24. doi: 10.1038/nature10249. [Epub ahead of print]

Despite the enormous ecological and economic importance of coral reefs, the keystone organisms in their establishment, the scleractinian corals, increasingly face a range of anthropogenic challenges including ocean acidification and seawater temperature rise. To understand better the molecular mechanisms underlying coral biology, here we decoded the approximately 420-megabase genome of Acropora digitifera using next-generation sequencing technology. This genome contains approximately 23,700 gene models. Molecular phylogenetics indicate that the coral and the sea anemone Nematostella vectensis diverged approximately 500 million years ago, considerably earlier than the time over which modern corals are represented in the fossil record (∼240 million years ago). Despite the long evolutionary history of the endosymbiosis, no evidence was found for horizontal transfer of genes from symbiont to host. However, unlike several other corals, Acropora seems to lack an enzyme essential for cysteine biosynthesis, implying dependency of this coral on its symbionts for this amino acid. Corals inhabit environments where they are frequently exposed to high levels of solar radiation, and analysis of the Acropora genome data indicates that the coral host can independently carry out de novo synthesis of mycosporine-like amino acids, which are potent ultraviolet-protective compounds. In addition, the coral innate immunity repertoire is notably more complex than that of the sea anemone, indicating that some of these genes may have roles in symbiosis or coloniality. A number of genes with putative roles in calcification were identified, and several of these are restricted to corals. The coral genome provides a platform for understanding the molecular basis of symbiosis and responses to environmental changes.

PMID: 21785439 [PubMed - as supplied by publisher]

Friday, 22 July 2011

Build an app that makes science more open

Wow, just after I blogged about Web 2.0 failing to make scientists more productive, here comes an app challenge!

Hmmm, fresh outta ideas at 2 am... will sleep on this.

PLoS and Mendeley, the popular reference manager and academic social network, have teamed up to create a Binary Battle contest to build the best apps that make science more open using PLoS and/or Mendeley’s APIs (Application Programming Interface). There’s $16,000 in prize money to be won plus other cool gifts and the opportunity to get your entries in front of a panel of influential judges from technology, media and science that include:

Tim O’Reilly - Founder and CEO of O’Reilly Media which is changing the world by spreading the knowledge of innovators. Tim also co-hosts the annual Science Foo Camp with Google and Nature.

Dr. Werner Vogels - CTO of Amazon.com and former research scientist at Cornell University. Werner is one of the world’s top experts on cloud computing and ultra-scalable systems.

Juan Enriquez – Managing Director of Excel Venture Management and CEO of Biotechonomy. Juan is recognized as one of the world’s leading authorities on the economic and political impacts of life sciences.

John Wilbanks - VP for Science at Creative Commons. Seed Magazine named John a “Game Changer” among their Revolutionary Minds of 2008.

James Powell - CTO of Thomson Reuters, the world’s leading information services company. Still a nerd at heart, James is particularly interested in how technology gets applied to solve problems.

We have two APIs for you to mine in this competition. The PLoS Search API allows anyone to build their own applications for the web, desktop or mobile devices using PLoS content. The Mendeley API opens up a database of over 80 million research papers, usage statistics, reader demographics, social tags, and related research recommendations. Since Mendeley got this competition up and running before PLoS joined the party, you can see what some people have already made using their API. It’s also worth knowing that Mendeley are organizing two simultaneous Hackathons at their NY and London offices on Saturday June 11-Sunday June 12, 2011.

Here’s the lowdown on the amazing prizes:

  • Grand prize: $10001 + $1000 Amazon Web Services Credits
  • Second prize: $5000 + $500 Amazon Web Services Credits – an extra $1000 plus a Parrot AR.Drone Quadricopter is available to the best combined PLoS/Mendeley app.
  • Last day to submit your app: September 30th 2011
  • Winner announced on: November 30th 2011

Entries will be judged on criteria such as: activity; popularity/usefulness; whether it increases collaboration and/or transparency and how cool is it (does it make our jaws drop!). Please note, we can not accept entries from PLoS or Mendeley staff or their immediate families, their investors or board members. To get started, developers need to get a key from PLoS, Mendeley or both. Don’t forget, the last day to submit your app is September 30th 2011.


Wednesday, 20 July 2011

Web 2.0 will revolutionize science!

nsaunders laments the sad fact that Web 2.0 hasn't really helped fellow scientists improve their productivity!


I’ve given up trying to educate colleagues in best practices. Clearly, I’m the one with the problem, since this is completely normal, acceptable behaviour for practically everyone that I’ve ever worked with. Instead, I’m just waiting for them to retire (or die). I reckon most senior scientists (and they’re the ones running the show) are currently aged 45-55. So it’s going to be 10-20 years before things improve.
Until then, I’ll just have to keep deleting your emails. Sorry.

Monday, 18 July 2011

Saturday, 16 July 2011

Genomics of Emerging Infectious Disease: A PLoS Collection

This collection of essays, perspectives, and reviews from six PLoS Journals provides insights into how genomics can revolutionize our understanding of emerging infectious disease.

International Crowdsourcing Initiative to Combat the E. Coli Breakout in Germany

Guest Post: International Crowdsourcing Initiative to Combat the E. Coli Breakout in Germany

June 8, 2011 
Editor’s Note: In light of the recent E. coli outbreak in Germany, NGS Leaders invited Joyce Peng from BGI to comment on the organization’s efforts to understand the culprit. Below Joyce describes BGI’s efforts to rally the international community in combating the outbreak. – Eric Glazer
In response to the recent E. coli outbreak in Germany, BGI and its collaborators at the University Medical Centre Hamburg-Eppendorf have released their third version of the assembled genome, which includes new data from this E. coli O104 (ftp://ftp.genomics.org.cn/pub/Ecoli_TY-2482/Escherichia_coli_TY-2482.contig.20110606.fa.gz ). The FTP site contains a file that provides the PCR primer sequences which researchers have used to create diagnostic kits for rapid identification of this highly infectious bacterium.

Friday, 15 July 2011

WGsim and bowtie paired end mapping

I wanted to simulate PE reads with wgsim, but by using the default parameters I didn't realise that I was creating reads with a 500 bp insert size. That doesn't go well with Bowtie's paired-end default maximum insert size of 250 bp and, to cut a long story short, resulted in low pairing rates.

I only found this out by googling BioStar. Of course, there is the obligatory discussion of the merits of using bowtie versus bwa (or vice versa), and there are some nice graphs in there that illustrate the problem at hand.
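
To see why the two defaults clash, here is a rough back-of-the-envelope sketch in Python. It assumes the simulated fragment lengths are roughly normal with a mean of 500 bp and a standard deviation of 50 bp (which, as far as I know, are wgsim's defaults), and that the aligner discards any pair whose fragment exceeds its maximum insert size. The function name and the 650 bp alternative cutoff are purely illustrative.

```python
import math

def fraction_within_max_insert(mean, sd, max_insert):
    """Fraction of fragments no longer than max_insert under a normal model."""
    z = (max_insert - mean) / sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Assumed simulation settings: 500 bp mean fragment, 50 bp standard deviation.
default_cutoff = fraction_within_max_insert(500, 50, 250)  # 250 bp maximum insert
relaxed_cutoff = fraction_within_max_insert(500, 50, 650)  # hypothetical larger cutoff

print(f"pairs fitting a 250 bp cutoff: {default_cutoff:.2e}")
print(f"pairs fitting a 650 bp cutoff: {relaxed_cutoff:.4f}")
```

Under these assumptions almost no simulated pair fits under the 250 bp cutoff, so the fix is either to simulate shorter fragments or to raise the aligner's maximum insert size well above the simulated mean.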

Supercomputing the Human Brain and credit derivative portfolios

JP Morgan Buys Into FPGA Supercomputing

One of the largest financial institutions in the world is using FPGA-based supercomputing for analyzing some of its largest and most complex credit derivative portfolios. JP Morgan, along with Maxeler Technologies, has built and deployed a state-of-the-art HPC system capable of number-crunching the company's collateralized debt obligation (CDO) portfolio in near real-time. Read More...

The Next Step in Human Brain Simulation

Can the human brain devise a system capable of understanding itself? That's been something brain simulation researchers have been working toward for nearly a decade. With recent advances in supercomputing capabilities and modeling techniques, the question may soon be answered. Read More...

How to add a new line to the start and end of a file: sed / Linux goodness

I saw this usage of sed in the forums, posted by ghostdog74:

sed '1 i this is first line' file


sed '$ a this is last line' file
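
For comparison, here is a minimal Python sketch of the same idea, for when sed isn't at hand. The helper name `add_first_and_last_line` is made up for illustration, and unlike the sed commands above (which print the result to standard output) it simply returns the new text as a string.

```python
def add_first_and_last_line(text, first, last):
    """Return text with `first` prepended and `last` appended as whole lines."""
    lines = text.splitlines()
    return "\n".join([first, *lines, last]) + "\n"

original = "line A\nline B\n"
updated = add_first_and_last_line(original, "this is first line", "this is last line")
print(updated, end="")
```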


link

Is a big ass server a want or a need?

"Big-Ass Servers™ and the myths of clusters in bioinformatics"

A topic title like that has to catch your attention...

I think that it is useful to have loads of RAM and loads of cores for one person's use. But when the machine is shared (as on the university's HPC), you have a hard time juggling resources in a fair manner, especially in bioinformatics, where walltimes and RAM requirements are only known after the analysis. An HPC engineer once told me that HPC for biologists means selfish hogging of resources. I can only shrug and concede her point.

I don't know if there's a better way to do the things I do with more RAM and faster disks, but I do know that it will probably cost more in development time. 


That said, cloud computing is having trouble keeping up with I/O-bound work like bioinformatics. Smaller cloud computing services are all trying to show that they have faster interconnects, but you can't really beat a BAS that's on a local network.

Performance of Ray @ Assemblathon 2

Ray is one of the assemblers that I watch closely but sadly lack the time to experiment with. Here's the candid email from Sébastien to the Ray mailing list on Ray's performance on the test data:

For those who follow Assemblathon 2, my last run on my testbed (Illumina data from BGI and from Illumina UK for the Bird/Parrot):

(all mate-pairs failed detection because of the many peaks in each library, I will modify Ray to consider that)



Total number of unfiltered Illumina TruSeq v3 sequences: Total: 3 072 136 294, that is ~3 G sequences !


512 compute cores (64 computers * 8 cores/computer = 512)


Typical communication profile for one compute core:


[1,0]:Rank 0: sent 249841326 messages, received 249840303 messages.


Yes, each core sends an average of 250 M messages during the 18 hours !




Peak memory usage per core: 2.2 GiB


Peak memory usage (distributed in a peer-to-peer fashion): 1100 GiB


The peak occurs around 3 hours and goes down to 1.1 GiB per node immediately because the pool of defragmentation groups for k-mers occurring once is freed.



The compute cluster I use has 3 GiB per compute core. So using 2048 compute cores would give me 6144 GiB of distributed memory.




Number of contigs: 550764

Total length of contigs: 1672750795
Number of contigs >= 500 nt: 501312
Total length of contigs >= 500 nt: 1656776315
Number of scaffolds: 510607
Total length of scaffolds: 1681345451
Number of scaffolds >= 500 nt: 463741
Total length of scaffolds >= 500: 1666464367

k-mer length: 31

Lowest coverage observed: 1
MinimumCoverage: 42
PeakCoverage: 171
RepeatCoverage: 300
Number of k-mers with at least MinimumCoverage: 2453479388 k-mers
Estimated genome length: 1226739694 nucleotides
Percentage of vertices with coverage 1: 83.7771 %
DistributionFile: parrot-Testbed-A2-k31-20110712.CoverageDistribution.txt

[1,0]: Sequence partitioning: 1 hours, 54 minutes, 47 seconds

[1,0]: K-mer counting: 5 hours, 47 minutes, 20 seconds
[1,0]: Coverage distribution analysis: 30 seconds
[1,0]: Graph construction: 2 hours, 52 minutes, 27 seconds
[1,0]: Edge purge: 57 minutes, 55 seconds
[1,0]: Selection of optimal read markers: 1 hours, 38 minutes, 13 seconds
[1,0]: Detection of assembly seeds: 16 minutes, 7 seconds
[1,0]: Estimation of outer distances for paired reads: 6 minutes, 26 seconds
[1,0]: Bidirectional extension of seeds: 3 hours, 18 minutes, 6 seconds
[1,0]: Merging of redundant contigs: 15 minutes, 45 seconds
[1,0]: Generation of contigs: 1 minutes, 41 seconds
[1,0]: Scaffolding of contigs: 54 minutes, 3 seconds
[1,0]: Total: 18 hours, 3 minutes, 50 seconds


10 largest scaffolds:


257646

266905
268737
272828
281502
294105
294106
296978
333171
397201


# average latency in microseconds (10^-6 seconds) when requesting a reply for a message of 4000 bytes

# Message passing interface rank    Name        Latency in microseconds
0                                   r107-n24    138
1                                   r107-n24    140
2                                   r107-n24    140
3                                   r107-n24    140
4                                   r107-n24    141
5                                   r107-n24    141
6                                   r107-n24    140
7                                   r107-n24    140
8                                   r107-n25    140
9                                   r107-n25    139
10                                  r107-n25    138
11                                  r107-n25    139
  Sébastien
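
A few of the figures in the email can be cross-checked with simple arithmetic. The short Python sketch below only re-derives numbers quoted above. The observation that the estimated genome length is exactly half the number of k-mers with at least MinimumCoverage is my own reading of the output (presumably because each k-mer is counted on both strands), not something stated in the email.

```python
cores = 64 * 8                      # 64 computers * 8 cores/computer
peak_total_gib = cores * 2.2        # peak memory per core, summed over all cores
bigger_run_gib = 2048 * 3           # 3 GiB per core on 2048 cores

solid_kmers = 2453479388            # k-mers with at least MinimumCoverage
estimated_genome_length = 1226739694

# Rank 0 sent 249841326 messages over roughly 18 hours.
messages_per_second = 249841326 / (18 * 3600)

print(cores, round(peak_total_gib), bigger_run_gib)
print(solid_kmers // 2 == estimated_genome_length)
print(round(messages_per_second))
```

The summed peak comes out a little above the rounded 1100 GiB quoted in the email, which is consistent with 2.2 GiB per core being a rounded figure too.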

Wednesday, 13 July 2011

How Bright Promise in Cancer Testing Fell Apart - NYTimes.com

How Bright Promise in Cancer Testing Fell Apart
Published: July 7, 2011
A Duke University program to tailor cancer treatments to certain patterns of genes has ended in disaster and lawsuits.
http://www.nytimes.com/2011/07/08/health/research/08genes.html

on a related note

Making Genomics Routine in Cancer Care

Researchers are developing more and more genetic tests for specific cancers, but doctors don't use them as often as they should—a new network of clinics aims to change that.
http://www.technologyreview.com/biomedicine/37993/


RNA-seq on the Ion Torrent PGM

OK, I must admit that with the 314 chip, 500k reads seems to be stretching the limits of usability for RNA-seq. But looking at Life Tech's presentation, "New to RNA-seq; how it compares to microarrays", it does make sense to use the PGM over microarrays for certain reasons and certain samples, e.g. bacterial or viral transcriptomes. Given that the PGM also gives better dynamic range than microarrays, at a price that's not too far from a microarray, it does make sense to beef up one's data with a run or two of Ion Torrent.

At USD $595 for a 316 chip run, "all included" as quoted, they do make it very attractive for microarray users to switch over, granted that you might need a couple of runs to make sense of human samples.
Though I'd be wary of hidden costs not anticipated in their calculations.
What's interesting is that they claim no platform bias between SOLiD and PGM runs. No details are given, but I assume they ran enough PGM runs to match SOLiD throughput and compared the output?

Would you consider the PGM for your RNA-seq?
Post in the comments, please...

Tuesday, 12 July 2011

Exome Sequences Reveal Role for De Novo Mutations in Schizophrenia | GenomeWeb Daily News

Exome Sequences Reveal Role for De Novo Mutations in Schizophrenia

NEW YORK (GenomeWeb News) – Individuals with sporadic schizophrenia tend to carry more new, potentially deleterious, genetic changes than individuals in the general population, according to an exome sequencing study of schizophrenia-affected families that appeared online in Nature Genetics yesterday.

"The occurrence of de novo mutations, as observed in this study, may in part explain the high worldwide incidence of schizophrenia," co-senior author Guy Rouleau, a researcher at the University of Montreal and director of its CHU Sainte-Justine Research Center, said in a statement.



Author: I have wondered if there's any evolutionary significance to keeping schizophrenia genes in the gene pool. But I guess it would be hard to find such evidence now that it appears that de novo mutations contribute to the disease.

You Say Potato, I Say 'Highly Heterozygous Autotetraploid'

An international consortium of researchers has reported results from a study of the potato genome, reports New Scientist's Debora MacKenzie. Most potatoes hold four copies of their genome, each of which is different from the others, "making sequencing a nightmare," MacKenzie says. The Potato Genome Sequencing Consortium, which published its work in Nature, was able to get around this problem by growing a whole plant in culture from one pollen cell, producing potatoes with just one copy of the genome. ...
At the Genotype blog, blogger Playwright in the Cages has a different take on the story — Playwright finds it "comforting" that potatoes have about double the number of active genes that humans do because "it's just another demonstration that the cultural assumption that humans must be the most complex of nature's creations because we're (allegedly) the smartest ... is based on a false paradigm of 'evolution as an advance in complexity' with us at the top of the pyramid."
Daily Scan's sister publication GenomeWeb Daily News has more on this story here.

Author: I just love the title and the fact that we survive with fewer active genes than Mr Potato Head.

Excerpted from GenomeWeb

Sequilab Launches Free Bioinformatics Portal to Link Sequence Analysis, Social Networking

A bioinformatics startup is giving users a free taste of its wares in hopes of eventually realizing commercial success.

The company, Sequilab, recently launched a free web portal of the same name that offers access to publicly available online bioinformatics tools for sequence analysis, as well as social-networking capabilities to improve collaborations between research groups.

Access is currently free for all users, but the Sequilab team is considering ways to "monetize" the website, including a premium version of the software that would include advanced features, as well as an ad-supported model, CEO Dan Melvin told BioInform.

"What we are trying to do is build the community now and then when it reaches a specific size of user base, that's when we would [go] commercial," Melvin said.


Read more

MiSeq data out in the wild as well!

MiSeq Aims for Ion Torrent

Illumina has released a MiSeq dataset for E. coli MG1655 on its website. Accompanying the FASTQ and BAM files is a presentation, the first half of which compares the performance to big brother HiSeq. The second half is an explicit comparison against the available Ion Torrent. Read more »

A memory-efficient data structure representing exact-match overlap graphs with application for next-generation DNA assembly

Motivation: Exact-match overlap graphs have been broadly used in the context of DNA assembly and the shortest super string problem where the number of strings n ranges from thousands to billions. The length ℓ of the strings is from 25 to 1000, depending on the DNA sequencing technologies. However, many DNA assemblers using overlap graphs suffer from the need for too much time and space in constructing the graphs. It is nearly impossible for these DNA assemblers to handle the huge amount of data produced by the next-generation sequencing technologies where the number n of strings could be several billions. If the overlap graph is explicitly stored, it would require Ω(n²) memory, which could be prohibitive in practice when n is greater than a hundred million. In this article, we propose a novel data structure using which the overlap graph can be compactly stored. This data structure requires only linear time to construct and linear memory to store.
Results: For a given set of input strings (also called reads), we can informally define an exact-match overlap graph as follows. Each read is represented as a node in the graph and there is an edge between two nodes if the corresponding reads overlap sufficiently. A formal description follows. The maximal exact-match overlap of two strings x and y, denoted by ovmax(x, y), is the longest string which is a suffix of x and a prefix of y. The exact-match overlap graph of n given strings of length ℓ is an edge-weighted graph in which each vertex is associated with a string and there is an edge (x, y) of weight ω = ℓ − |ovmax(x, y)| if and only if ω ≤ λ, where |ovmax(x, y)| is the length of ovmax(x, y) and λ is a given threshold. In this article, we show that the exact-match overlap graphs can be represented by a compact data structure that can be stored using at most (2λ − 1)(2⌈log n⌉ + ⌈log λ⌉)n bits with a guarantee that the basic operation of accessing an edge takes O(log λ) time. We also propose two algorithms for constructing the data structure for the exact-match overlap graph. The first algorithm runs in O(λℓn log n) worst-case time and requires O(λ) extra memory. The second one runs in O(λℓn) time and requires O(n) extra memory. Our experimental results on a huge amount of simulated data from sequence assembly show that the data structure can be constructed efficiently in time and memory.
Availability: Our DNA sequence assembler that incorporates the data structure is freely available on the web at http://www.engr.uconn.edu/~htd06001/assembler/leap.zip
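
To make the definition of ovmax concrete, here is a naive Python sketch of the maximal exact-match overlap and of an explicitly stored overlap graph. It is emphatically not the compact data structure from the paper: it compares every read against every other read, which is exactly the quadratic cost the authors set out to avoid. The function names and the `min_overlap` parameter (my stand-in for "overlap sufficiently") are assumptions for illustration.

```python
def ovmax(x, y):
    """Longest string that is both a suffix of x and a prefix of y."""
    for k in range(min(len(x), len(y)), 0, -1):
        if x[-k:] == y[:k]:
            return x[-k:]
    return ""

def overlap_graph(reads, min_overlap):
    """Map each ordered pair (x, y) of reads to |ovmax(x, y)| when it is long enough."""
    edges = {}
    for i, x in enumerate(reads):
        for j, y in enumerate(reads):
            if i == j:
                continue  # skip comparing a read with itself
            length = len(ovmax(x, y))
            if length >= min_overlap:
                edges[(x, y)] = length
    return edges

reads = ["ACGTAC", "GTACGG", "ACGGTT"]
graph = overlap_graph(reads, min_overlap=3)
for (x, y), length in graph.items():
    print(x, "->", y, "overlap", length)
```

Storing every edge explicitly like this is what blows up in memory for hundreds of millions of reads, which is the motivation given in the abstract.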

Pros and Cons of RNA-seq as spelt out by RNA-seq Blog

 

The Magic of RNA-Seq

from RNA-Seq Blog 

A 3rd party evaluation of Ion Torrent's 316 chip data

Dan Koboldt (from MassGenomics) has posted what is, as far as I know, the first independent look at the data from Ion Torrent's 316 chip.
Granted, the data was handed to him in a 'shiny report with color images', but he has bravely ignored that to give an honest look at the raw data itself.

The 316 chip gives a throughput that nicely covers whole-genome resequencing experiments for bacteria-sized genomes. "The E. coli reference genome totals about 4.69 Mbp. With 175 Mbp of data, the theoretical coverage is around 37.5-fold across the E. coli genome."
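
The arithmetic behind that estimate is simply total sequenced bases divided by genome size. A tiny Python sketch using the two figures from the quote (the function name is mine, for illustration only):

```python
def theoretical_coverage(total_bases, genome_size):
    """Average fold coverage if the reads were spread evenly over the genome."""
    return total_bases / genome_size

# 175 Mbp of data over a 4.69 Mbp genome, as quoted above.
coverage = theoretical_coverage(175e6, 4.69e6)
print(f"{coverage:.1f}x")
```

With these rounded inputs the result lands at roughly 37-fold, in line with the quoted figure. Real coverage will of course be uneven across the genome.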

For those wary of dry reviews, fear not, easily comprehensible graphs are posted within!

read the full post here

TimeTree2: species divergence times on the iPhone

Any takers to port this to Android?


Abstract

Summary: Scientists, educators and the general public often need to know times of divergence between species. But they rarely can locate that information because it is buried in the scientific literature, usually in a format that is inaccessible to text search engines. We have developed a public knowledgebase that enables data-driven access to the collection of peer-reviewed publications in molecular evolution and phylogenetics that have reported estimates of time of divergence between species. Users can query the TimeTree resource by providing two names of organisms (common or scientific) that can correspond to species or groups of species. The current TimeTree web resource (TimeTree2) contains timetrees reported from molecular clock analyses in 910 published studies and 17 341 species that span the diversity of life. TimeTree2 interprets complex and hierarchical data from these studies for each user query, which can be launched using an iPhone application, in addition to the website. Published time estimates are now readily accessible to the scientific community, K–12 and college educators, and the general public, without requiring knowledge of evolutionary nomenclature.
Availability: TimeTree2 is accessible from the URL http://www.timetree.org, with an iPhone app available from iTunes (http://itunes.apple.com/us/app/timetree/id372842500?mt=8) and a YouTube tutorial (http://www.youtube.com/watch?v=CxmshZQciwo). 

 Link

Thursday, 7 July 2011

New public MySQL server | Ensembl Blog

http://www.ensembl.info/blog/2011/07/06/ensembl-now-has-a-second-public-mysql-server/

OpGen Touts Technology's Ability to Improve De Novo Assembly, Correct Errors in Finished Genomes


By Monica Heger
When paired with next-gen sequencing technology, OpGen's Argus optical mapping technology can correct errors in assembled genomes and help close gaps, a company official said last week at a presentation during a one-day conference of BGI users in Rockville, Md.
Trevor Wagner, a senior scientist at OpGen, presented data on how the company has used the Argus platform to find errors in microbial assemblies from the Human Microbiome Project, as well as in finished genomes, and to close introduced gaps in sequenced human genomes.
While the platform has mostly been used for smaller genomes like bacteria and microbes, Wagner said that the company is now also moving into mammalian and plant genomes, using a "hybrid approach" that combines next-gen sequencing with single-molecule restriction maps.
Full article

BGI Announces Cloud Genome Assembly Service

I am very excited about cloud solutions for de novo assembly, as it is quite computationally intensive and, with parameter tweaking, you have a massively parallel problem that just begs for compute cores. I do wonder if there's a need for a cloud solution for resequencing pipelines, especially when it involves BWA, which can be run rather efficiently on a desktop or an in-house cluster. Only whole-genome resequencing might require more compute hours, but I would think that any center that does WGS regularly would, at the very minimum, have a resequencing-capable cluster, if only to store the data before it is analyzed.

Anyway let's see if BGI will change the computational cloud scene ...


By Allison Proffitt 
July 6, 2011 | SHENZHEN, CHINA—At the BGI Bioinformatics Software Release Conference today, researchers announced two new Cloud-based software-as-a-service offerings for next-gen data analysis. Hecate and Gaea (named for Greek gods) are “flexible computing” solutions for de novo assembly and genome resequencing.
These are “cloud-based services for genetic researchers” so that researchers don’t need to “purchase your own cloud clusters,” said Evan Xiang, part of the flexible computing group at BGI Shenzhen. Hecate will do de novo assembly, and Gaea will run the SOAP2, BWA, Samtools, DIndel, and BGI’s realSFS algorithms. Xiang expects an updated version of Gaea to be released later this year with more algorithms available. ... full article

Monday, 4 July 2011

Compiling BEDTools on Ubuntu 10.04 LTS - the Lucid Lynx

You will need these libraries

sudo apt-get install g++
sudo apt-get install zlib1g-dev



After that, it is a simple
$ make clean
$ make all
[ -d obj ] || mkdir -p obj
[ -d bin ] || mkdir -p bin
Building BEDTools:
=========================================================
- Building in src/utils/lineFileUtilities
  * compiling lineFileUtilities.cpp

- Building in src/utils/bedFile
  * compiling bedFile.cpp

- Building in src/utils/bedGraphFile
  * compiling bedGraphFile.cpp

- Building in src/utils/tabFile
  * compiling tabFile.cpp

- Building in src/utils/genomeFile
  * compiling genomeFile.cpp

- Building in src/utils/gzstream
g++ -Wall -O2 -c -o ../../../obj//gzstream.o gzstream.C -I.

...





genomeCoverageBed to look at coverage of your WGS

BEDTools is a very useful set of programs for looking at your NGS results.


One that I use regularly is:


    $ genomeCoverageBed -d -ibam sortedBamFile.bam -g genome.csv > coverage.csv


So you only need your BAM file and a genome file (a tab-delimited file with "contig_name contig_size" information for all contigs in the reference genome).
This file can be generated automatically with

   $  samtools faidx reference_genome.fasta

which generates a .fai file that contains the required info. (I very nearly went and wrote a BioPerl script to generate this; luckily I remembered the contents of the .fai file.)
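
The first two tab-separated columns of a .fai index are the sequence name and its length, which is exactly the "contig_name contig_size" information the genome file needs. If you want a standalone two-column file, here is a minimal Python sketch; the function name is a placeholder and the example index lines are made up.

```python
def fai_to_genome_lines(fai_lines):
    """Keep only the name and length columns of .fai index lines."""
    genome_lines = []
    for line in fai_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        name, length = line.split("\t")[:2]
        genome_lines.append(f"{name}\t{int(length)}")
    return genome_lines

# Made-up index lines: name, length, offset, bases per line, bytes per line.
example_fai = ["contig_1\t5000\t10\t60\t61\n", "contig_2\t1200\t5105\t60\t61\n"]
genome_lines = fai_to_genome_lines(example_fai)
print("\n".join(genome_lines))
```

To use it on a real index, read the lines from reference_genome.fasta.fai and write the result to the file you pass to -g.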




Update: 
Do read the helpful comments by Micheal on using samtools idxstats on the BAM file, and on the shortcomings of looking at coverage this way.
He has a helpful recent post here:

Accurate genome-wide read depth calculation (how-to)



I must say I hadn't thought of the points he raised, but I was generating these to check for evenness of coverage for a bacterial reseq project. I like the coverage plot in IGV but am not sure if there are open-source tools to do the same.
Any tips to share?
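
As a crude first pass at checking evenness without a plot, the per-base output can be summarised per contig. The Python sketch below assumes the three tab-separated columns that genomeCoverageBed -d writes (contig name, position, depth). The function name, the low-depth threshold of 5 and the sample lines are assumptions for illustration, and it does nothing about the shortcomings raised in the comments.

```python
from collections import defaultdict

def summarise_depth(lines, low_depth=5):
    """Per contig: (mean depth, fraction of positions with depth below low_depth)."""
    totals = defaultdict(lambda: [0, 0, 0])  # positions, summed depth, low positions
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        contig, depth = fields[0], int(fields[2])
        entry = totals[contig]
        entry[0] += 1
        entry[1] += depth
        if depth < low_depth:
            entry[2] += 1
    return {c: (s / n, low / n) for c, (n, s, low) in totals.items()}

# Made-up per-base depth lines in "contig<TAB>position<TAB>depth" form.
example = ["contig_1\t1\t10", "contig_1\t2\t0", "contig_1\t3\t20", "contig_2\t1\t6"]
summary = summarise_depth(example)
for contig, (mean_depth, low_fraction) in summary.items():
    print(contig, f"mean={mean_depth:.1f}", f"low={low_fraction:.2f}")
```

A contig with a decent mean depth but a large low-depth fraction is a hint that coverage is patchy rather than even.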


Other references
Discussion at BioStar

BWA SOLiD Paired Ends mapping

The short answer seems to be 'no': BWA can't do it yet.
(I went through steps like the ones here to find that out in the end; the only difference is that I used a modified solid2fastq.pl to process the F5 reads correctly.)
This is because
   bwa sampe
expects the orientation to be the same as for SOLiD mate pairs;
see http://biostar.stackexchange.com/questions/9086/paired-end-mapping-what-is-bwa-solid-paired-end-default-direction-bwa-sampe
While reverse complementing the F5 reads might work, that itself is problematic due to the colorspace nature of SOLiD reads.

Your options?
Bioscope :(
or bowtie (if you don't need indels) see http://bowtie-bio.sourceforge.net/manual.shtml#paired-end-colorspace-alignment

Saturday, 2 July 2011

CANGS DB: a stand-alone web-based database tool fo... [BMC Res Notes. 2011] - PubMed result

Next generation sequencing (NGS) is widely used in metagenomic and transcriptomic analyses in biodiversity. The ease of data generation provided by NGS platforms has allowed researchers to perform these analyses on their particular study systems. In particular the 454 platform has become the preferred choice for PCR amplicon based biodiversity surveys because it generates the longest sequence reads. Nevertheless, the handling and organization of massive amounts of sequencing data poses a major problem for the research community, particularly when multiple researchers are involved in data acquisition and analysis. An integrated and user-friendly tool, which performs quality control, read trimming, PCR primer removal, and data organization is desperately needed, therefore, to make data interpretation fast and manageable. We developed CANGS DB (Cleaning and Analyzing Next Generation Sequences DataBase) a flexible, stand alone and user-friendly integrated database tool. CANGS DB is specifically designed to organize and manage the massive amount of sequencing data arising from various NGS projects. CANGS DB also provides an intuitive user interface for sequence trimming and quality control, taxonomy analysis and rarefaction analysis. Our database tool can be easily adapted to handle multiple sequencing projects in parallel with different sample information, amplicon sizes, primer sequences, and quality thresholds, which makes this software especially useful for non-bioinformaticians. Furthermore, CANGS DB is especially suited for projects where multiple users need to access the data. CANGS DB is available at http://code.google.com/p/cangsdb/. CANGS DB provides a simple and user-friendly solution to process, store and analyze 454 sequencing data. Being a local database that is accessible through a user-friendly interface, CANGS DB provides the perfect tool for collaborative amplicon based biodiversity surveys without requiring prior bioinformatics skills.
PMID: 21718534 [PubMed - as supplied by publisher]
http://www.ncbi.nlm.nih.gov/pubmed/21718534

Linear amplification for deep sequencing. [Nat Protoc. 2011] - PubMed result

Abstract
Linear amplification for deep sequencing (LADS) is an amplification method that produces representative libraries for Illumina next-generation sequencing within 2 d. The method relies on attaching two different sequencing adapters to blunt-end repaired and A-tailed DNA fragments, wherein one of the adapters is extended with the sequence for the T7 RNA polymerase promoter. Ligated and size-selected DNA fragments are transcribed in vitro with high RNA yields. Subsequent cDNA synthesis is initiated from a primer complementary to the first adapter, ensuring that the library will only contain full-length fragments with two distinct adapters. Contrary to the severely biased representation of AT- or GC-rich fragments in standard PCR-amplified libraries, the sequence coverage in T7-amplified libraries is indistinguishable from that of nonamplified libraries. Moreover, in contrast to amplification-free methods, LADS can generate sequencing libraries from a few nanograms of DNA, which is essential for all applications in which the starting material is limited.
http://www.ncbi.nlm.nih.gov/pubmed/21720315

Datanami, Woe be me