Tuesday, 26 February 2013
Monday, 25 February 2013
Michael Schatz: Assembling Crop Genomes With SMS
http://schatzlab.cshl.edu/presentations/2013-02-20.AGBT.Assembling%20Crop%20Genomes.pdf
If you need an intro:
"In a talk during the evening session, Mike Schatz, an assistant professor at Cold Spring Harbor Laboratory, spoke about “Assembling Crop Genomes with Single Molecule Sequencing.” Crops are important to sequence (15 crops represent 90% of the world’s food, Schatz said) but are notoriously difficult to study because of their large genome size, high repeat content, and higher ploidy. Along with Sergey Koren and Adam Phillippy, he has built a pipeline to create hybrid genome assemblies using PacBio long reads combined with shorter-read sequence: either CCS reads from PacBio or data from another sequencing platform. In an example he offered of a rice strain, an attempted genome assembly using just Illumina reads yielded an N50 contig of 16Kb, but adding PacBio long reads to that boosted the N50 contig to 25Kb. Ultimately, Schatz said, he expects that as PacBio's read length improves, this kind of approach could routinely generate megabase-size contigs or even pull plant chromosomes into single contigs."
$ perl -e 'print ">random\n"; @D=split //,"ACGT"; for (1...100000000){print $D[int(rand(4))];} print "\n"' | fold > random.fa
$ wgsim -r 0 -e 0 -N 50000000 -1 100 -2 1 \
    random.fa random.reads.fq /dev/null
$ SOAPdenovo-63mer all -s random.cfg -K 63 -o random.63
$ getlengths random.63.contig
1 99999990
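For readers unfamiliar with the N50 figures quoted above, here is a minimal Python sketch of the usual contig N50 definition: the length L such that contigs of length at least L contain at least half of all assembled bases. The function name and the toy lengths are my own illustration, not something taken from the slides.

```python
def n50(lengths):
    """Return the contig N50: the largest length L such that contigs of
    length >= L together contain at least half of all assembled bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Made-up contig lengths, purely for illustration.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))
```

An assembly that comes out as one contig, like the single 99999990 bp contig reported for the random genome, has an N50 equal to that contig's length.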
[Bio-bwa-help] A new alignment algorithm merged into master (beta phase)
Date: Mon, Feb 25, 2013 at 1:34 PM
Subject: [Bio-bwa-help] A new alignment algorithm merged into master (beta phase)
Bwa-mem (mem for maximal exact match) is a new algorithm that essentially seeds alignments with the fastmap/fermi-exact algorithm and then extends seeds with SW. It combines some key features from both bwa-backtrack, the first algorithm, and bwa-sw and aims to replace them for high-quality 100bp-100kbp sequences. I made this move because bwa-backtrack will fail to deliver satisfactory results for 150bp+ reads (which a few centers have observed), while bwa-sw is relatively slow without achieving the accuracy that I think is possible given longer reads. I would recommend the current bwa-backtrack users to keep an eye on the progress of bwa-mem. You will have to change the mapper when hiseq reads reach 150bp+.
At present, bwa-mem has the basic elements of a typical aligner. On a couple of simulated data sets, it shows a similar speed to bowtie2 and bwa-backtrack, twice as fast as bwa-sw, and is more accurate. It can also achieve the same accuracy as bowtie2/bwa-sw at halved computing time. There are, though, still a few important things on the TODO list: fine tuning the algorithm for better performance; testing on more data sets; testing for BAC-sized long sequences; bug fixes. As I have merged bwa-mem to the master branch, I need your feedbacks to push it forward. This is also a good time to request features in bwa-mem when I am actively working on it.
Thank you,
Heng
PS: I plan not to add BAM support to bwa-mem, but because bwa-mem reads fastq only once, this is actually less a concern. You can:
samtools bam2fq reads.bam | bwa mem -p ref.fa -
to map paired-end reads. Here '-p' indicates the inputs are interleaved fasta/q. You can also put bam optional tags (e.g. barcodes) in the fasta/q comment. With "-C", mem/bwasw will copy these tags to the final SAM output. Bwa-mem also supports more advanced piping such as:
bwa mem ref.fa '<bzip2 -dc read1.bz2' '<bzip2 -dc read2.bz2'
which is equivalent to
bwa mem ref.fa <(bzip2 -dc read1.bz2) <(bzip2 -dc read2.bz2)
but without the bash support. The former is still working when you launch bwa in tcsh or outside a shell.
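As a rough illustration of the interleaved layout that '-p' expects, the Python sketch below merges two FASTQ streams so that each read1 record is immediately followed by its mate. The function names are hypothetical and this is not part of bwa; it assumes plain four-line FASTQ records with both inputs in the same order.

```python
import io
import itertools


def fastq_records(handle):
    """Yield FASTQ records as lists of four lines (assumes 4-line records)."""
    while True:
        record = list(itertools.islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(handle1, handle2, out):
    """Write read1/read2 records alternately to `out`."""
    pairs = itertools.zip_longest(fastq_records(handle1), fastq_records(handle2))
    for rec1, rec2 in pairs:
        if rec1 is None or rec2 is None:
            raise ValueError("read1 and read2 inputs differ in length")
        out.writelines(rec1)
        out.writelines(rec2)


# Tiny in-memory example with made-up reads.
r1 = io.StringIO("@a/1\nAC\n+\nII\n@b/1\nGT\n+\nII\n")
r2 = io.StringIO("@a/2\nTT\n+\nII\n@b/2\nCC\n+\nII\n")
out = io.StringIO()
interleave(r1, r2, out)
print(out.getvalue(), end="")
```

In practice the two handles would be opened files (or decompressing streams) and `out` would be sys.stdout piped into the aligner.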
C. Jimmy Lin talks about crowdfunded genomics research
Direct-to-consumer treatments and research for rare genetic disorders make sense if you think about it. Getting funding for diabetes is likely easier and makes more economic sense, since the dollars benefit a larger population.
If you have a rare genetic disorder that might be limited to your family or a few families, you are kinda out of luck.
Your options are:
- doing the research yourself
- convincing your physician to take it up as a research project
- offering your patient sample for others to research (not the best example, but TED Fellow Salvatore Iaconesi open-sourced his brain cancer for a cure or art)
- or, now, going to raregenomics
Citizen science is the latest trending topic/tag. With ecology projects like http://iseahorse.org/ and http://www.birds.cornell.edu/citsci/, sequencing projects like uBiome, and 'bioinformatics as a game' sites like Phylo, it will be interesting to see how this plays out as a trend for research and science education. IMHO, rare genetic disorder research can only get started as a non-profit for now, since sequencing isn't cheap enough yet (see 'A $1000 genome by 2013?').
Related Links
'la cura, the cure' http://artisopensource.net/cure/
'Interview with C. Jimmy Lin'
http://blog.ted.com/2013/01/18/rare-gifts-fellows-friday-with-c-jimmy-lin/
http://tedfellows.posterous.com/rare-gifts-fellows-friday-with-c-jimmy-lin
Sunday, 24 February 2013
Article: SPA: a short peptide assembler for metagenomic data
SPA: a short peptide assembler for metagenomic data
http://nar.oxfordjournals.org/content/early/2013/02/22/nar.gkt118.short?rss=1
Sent via Flipboard
Sent from myPhone
Article: Scientists attacked over claim that 'junk DNA' is vital to life
Scientists attacked over claim that 'junk DNA' is vital to life
http://www.guardian.co.uk/science/2013/feb/24/scientists-attacked-over-junk-dna-claim
Article: Thoughts on the Assemblathon 2 paper
Thoughts on the Assemblathon 2 paper
http://ivory.idyll.org/blog/thoughts-on-assemblathon-2.html
Friday, 22 February 2013
A Simple Method for Obtaining Original Data from Published Graphs and Plots
How To Add Expiration Date To Shared Google Drive Folders http://www.hongkiat.com/blog/expiration-date-google-drive-folders/
Thursday, 21 February 2013
Rare variant detection using family-based sequencing analysis
Abstract
Next-generation sequencing is revolutionizing genomic analysis, but this analysis can be compromised by high rates of missing true variants. To develop a robust statistical method capable of identifying variants that would otherwise not be called, we conducted sequence data simulations and both whole-genome and targeted sequencing data analysis of 28 families. Our method (Family-Based Sequencing Program, FamSeq) integrates Mendelian transmission information and raw sequencing reads. Sequence analysis using FamSeq reduced the number of false negative variants by 14–33% as assessed by HapMap sample genotype confirmation. In a large family affected with Wilms tumor, 84% of variants uniquely identified by FamSeq were confirmed by Sanger sequencing. In children with early-onset neurodevelopmental disorders from 26 families, de novo variant calls in disease candidate genes were corrected by FamSeq as Mendelian variants, and the number of uniquely identified variants in affected individuals increased proportionally as additional family members were included in the analysis. To gain insight into maximizing variant detection, we studied factors impacting actual improvements of family-based calling, including pedigree structure, allele frequency (common vs. rare variants), prior settings of minor allele frequency, sequence signal-to-noise ratio, and coverage depth (∼20× to >200×). These data will help guide the design, analysis, and interpretation of family-based sequencing studies to improve the ability to identify new disease-associated genes.
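The abstract does not spell out FamSeq's model, so the following is only a toy Python sketch of the general idea of combining Mendelian transmission with read evidence: a child's read-based genotype likelihoods are weighted by the transmission probabilities implied by known parental genotypes. All names and numbers are hypothetical, and it ignores things a real caller must handle, such as uncertain parental genotypes, larger pedigrees, and de novo mutation.

```python
from itertools import product

# Genotypes are coded as the number of alternate alleles: 0, 1 or 2.


def gamete_probs(genotype):
    """Return (P(transmit ref), P(transmit alt)) for a parent genotype."""
    p_alt = genotype / 2
    return (1 - p_alt, p_alt)


def transmission_prior(mother, father):
    """P(child genotype | parental genotypes) under Mendelian inheritance,
    ignoring de novo mutation."""
    prior = [0.0, 0.0, 0.0]
    m, f = gamete_probs(mother), gamete_probs(father)
    for a, b in product((0, 1), repeat=2):
        prior[a + b] += m[a] * f[b]
    return prior


def child_posterior(likelihoods, mother, father):
    """Combine the child's genotype likelihoods with the Mendelian prior."""
    prior = transmission_prior(mother, father)
    joint = [lik * p for lik, p in zip(likelihoods, prior)]
    total = sum(joint)
    if total == 0:
        raise ValueError("likelihoods are incompatible with the pedigree")
    return [j / total for j in joint]


# Made-up case: low coverage slightly favours hom-ref in the child, but the
# parents are hom-ref and hom-alt, so the child must be heterozygous.
print(child_posterior([0.6, 0.4, 0.0], mother=0, father=2))
```

This is the kind of situation where family information can rescue a variant that a single-sample caller would miss.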
Tuesday, 19 February 2013
[R-bloggers] 10 R packages every data scientist should know about (and 6 more aRticles)
The yhat blog lists 10 R packages they wish they'd known about earlier. Drew Conway calls them "10 reasons to always start your analysis in R". They're all very useful R packages that every data scientist should be aware of. They are:
- sqldf (for selecting from data frames using SQL)
- forecast (for easy forecasting of time series)
- plyr (data aggregation)
- stringr (string manipulation)
- Database connection packages RPostgreSQL, RMySQL, RMongo, RODBC, RSQLite
- lubridate (time and date manipulation)
- ggplot2 (data visualization)
- qcc (statistical quality control and QC charts)
- reshape2 (data restructuring)
- randomForest (random forest predictive models)
You can find links to all of these packages and tips on how to use them at the link below.
yhat blog: 10 R packages I wish I knew about earlier
Thursday, 14 February 2013
Comparison of Sequencing Platforms for Single Nucleotide Variant Calls in a Human Sample
Comparison of sequencing platforms for single nucleotide variant calls in a human sample.
Source
Center for Comparative Genomics and Bioinformatics, Pennsylvania State University, University Park, Pennsylvania, United States of America.
Abstract
Next-generation sequencing platforms coupled with advanced bioinformatic tools enable re-sequencing of the human genome at high speed and large cost savings. We compare sequencing platforms from Roche/454 (GS FLX), Illumina/HiSeq (HiSeq 2000), and Life Technologies/SOLiD (SOLiD 3 ECC) for their ability to identify single nucleotide substitutions in whole genome sequences from the same human sample. We report on significant GC-related bias observed in the data sequenced on Illumina and SOLiD platforms. The differences in the variant calls were investigated with regard to coverage and sequencing error. Some of the variants called by only one or two of the platforms were experimentally tested using mass spectrometry, a method that is independent of DNA sequencing. We establish several causes why variants remained unreported, specific to each platform. We report the indels called using the three sequencing technologies, and from the obtained results we conclude that sequencing human genomes with more than a single platform and multiple libraries is beneficial when a high level of accuracy is required.
PMID: 23405114 [PubMed - in process]
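Sorting variants into those called by one, two, or all three platforms, as the abstract describes, comes down to simple set bookkeeping. A minimal Python sketch is below; the variant keys are made up and the function names are hypothetical, and it says nothing about how the paper's authors actually did the comparison.

```python
from collections import Counter


def platform_support(calls):
    """Map each variant to the sorted tuple of platforms that called it.

    `calls` maps platform name -> set of variants, each variant being a
    hashable key such as (chrom, pos, ref, alt).
    """
    support = {}
    for platform, variants in calls.items():
        for variant in variants:
            support.setdefault(variant, []).append(platform)
    return {v: tuple(sorted(p)) for v, p in support.items()}


def support_histogram(calls):
    """Count variants by how many platforms reported them."""
    return Counter(len(p) for p in platform_support(calls).values())


# Made-up toy call sets, purely for illustration.
toy_calls = {
    "454": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
    "HiSeq": {("chr1", 100, "A", "G"), ("chr2", 50, "G", "A")},
    "SOLiD": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
}
print(support_histogram(toy_calls))
```

Variants supported by only one or two platforms are the ones the authors followed up experimentally.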
Wednesday, 6 February 2013
Article: FedEx's file-transfer capacity versus the Internet
FedEx's file-transfer capacity versus the Internet
http://boingboing.net/2013/02/05/fedexs-file-transfer-capacit.html
Article: Source Code for Biology and Medicine | Abstract | CrypticIBDcheck: an R package for checking cryptic relatedness in nominally unrelated individuals
Source Code for Biology and Medicine | Abstract | CrypticIBDcheck: an R package for checking cryptic relatedness in nominally unrelated individuals
http://www.scfbm.org/content/8/1/5/abstract
Fwd: Save 15% on a New MiSeq System – Limited-Time Offer
Begin forwarded message:
From: "Illumina" <Community@illumina.com>
Date: 6 February, 2013 5:35:59 AM GMT+08:00
To:
Subject: Save 15% on a New MiSeq System – Limited-Time Offer
Reply-To: Community@illumina.com
Bundle and Save with MiSeq
Get It Now www.illumina.com
Dear Valued Researcher,
If you are ready to experience the power of MiSeq, the most accurate and easiest-to-use benchtop sequencer, why not save a little money on the way?
Now through March 31, 2013, purchase at least 15 MiSeq Sequencing Reagent Kits at list price and receive 15% off your MiSeq System.
Next-generation sequencing doesn't get any easier than this. Take advantage of this offer and get started today.
*Offer not valid with other discounts and promotions. Eligible kits are MiSeq Reagent Kits v2 (50-, 300- and 500-cycles), catalog numbers: MS-102-2001, MS-102-2002, and MS-102-2003.
Handling R packages Feb 2013 issue Linux Journal
Other topics in this issue include:
- Manage Your Virtual Deployment with ConVirt
- Use Fabric for Sysadmin Tasks on Remote Machines
- Spin up Linux VMs on Azure
- Make Your Android Device Play with Your Linux Box
- Create a Colocated Server with Raspberry Pi
You can check out a preview of the contents here
February 2013 Issue of Linux Journal: System Administration
Tuesday, 5 February 2013
Fwd: [BioRuby] Genomer: a ruby project to simplify genome finishing
From: Michael Barton <mail AT michaelbarton.me.uk>
Date: Sun, Feb 3, 2013 at 5:34 AM
Subject: [BioRuby] Genomer: a ruby project to simplify genome finishing
To: BioRuby Mailing List
Hi Everyone,
I've been working on sequencing microbial genomes during my current post doc
position. I've combined many of the ruby scripts I was using into a single tool
called "genomer" which might be of interest to other bioinformaticians working
in the same area.
I used this tool to simplify the smaller, mundane tasks associated with a
genome. For instance moving contigs and associated annotations around,
generating the required files to submit to GenBank, and generating summaries of
the genome scaffold.
I created a small screencast for anyone who is interested in finding out more:
http://youtu.be/HfsdJOELFjs?hd=1
I wrote this tool to satisfy my own needs and use genomer extensively for the
microbial genome projects in our lab. My GNU Makefile (http://bit.ly/WMKlCZ)
from a P. fluorescens project illustrates how genomer combined with GenBank's
tbl2asn can be used to build all the files required for genome submission.
Hopefully genomer may be useful to other bioinformaticians and simplify the
steps required to finish and submit a genome.
Thanks
Michael Barton
concrete example of using genomer to generate the files required to submit a
genome project to GenBank.
http://www.youtube.com/watch?v=jVn62pMnIRA&hd=1
Thanks
Mike