Showing posts with label review.

Friday, 5 July 2013

Windows 8.1 Preview ISOs are available for download

Ah! I didn't know there was a Windows 8.1.
The ISOs should be helpful if you wish to 'futureproof' your spanking new application on the latest Windows, or to test existing apps to see if they might break in the new Win 8.1.

and *cough*usingtheisosasVMsinyourpreferredLinuxenvbutyoukindaneedawindozemachinetodothosetasksthatyoucan'tdoinlinuxcosotherprogrammershaven'theardofbuildingformultiplatformmachines*cough*

Well, another good reason to use it: I am pretty sure this ain't happening on Mac or Linux.

Microsoft is adding native support for 3D printing as part of the Windows 8.1 update, making it possible to print directly from an app to a 3D printer. The company is announcing the new feature this morning, working with partners including MakerBot Industries, 3D Systems, Afinia, AutoDesk, Netfabb and others.
http://www.geekwire.com/2013/dimension-windows-microsoft-adds-3d-printing-support/


:)

Go http://msdn.microsoft.com/en-us/windows/apps/bg182409 now!
loving the 1.5 Mb/s download here

Wednesday, 6 February 2013

Handling R packages Feb 2013 issue Linux Journal

The kind folks at http://www.linuxjournal.com/ have provided me with a Feb 2013 issue. I can't tell you how much Linux I have picked up from there, with its easy prose and graphical howtos. In the Feb 2013 issue, they focus on the theme of system administration. Definitely useful things inside for the starting bioinformatician who wishes to dabble with working directly off a *nix machine :)

Other topics in the February 2013 issue include:
  • Manage Your Virtual Deployment with ConVirt
  • Use Fabric for Sysadmin Tasks on Remote Machines
  • Spin up Linux VMs on Azure
  • Make Your Android Device Play with Your Linux Box
  • Create a Colocated Server with Raspberry Pi


You can check out a preview of the contents here

February 2013 Issue of Linux Journal: System Administration

Sunday, 11 September 2011

Differential expression in RNA-seq: a matter of depth [Genome Res. 2011] - PubMed - NCBI


http://www.ncbi.nlm.nih.gov/pubmed/21903743
Abstract
Next Generation Sequencing (NGS) technologies are revolutionizing genome research and in particular, their application to transcriptomics (RNA-seq) is increasingly being used for gene expression profiling as a replacement for microarrays. However, the properties of RNA-seq data have not been yet fully established and additional research is needed for understanding how these data respond to differential expression analysis. In this work we set out to gain insights into the characteristics of RNA-seq data analysis by studying an important parameter of this technology: the sequencing depth. We have analyzed how sequencing depth affects the detection of transcripts and their identification as differentially expressed, looking at aspects such as transcript biotype, length, expression level and fold-change. We have evaluated different algorithms available for the analysis of RNA-seq and proposed a novel approach - NOISeq - that differs from existing methods in that it is data-adaptive and non-parametric. Our results reveal that most existing methodologies suffer from a strong dependency on sequencing depth for their differential expression calls and that this results in a considerable number of false positives that increases as the number of reads grows. In contrast, our proposed method models the noise distribution from the actual data, can therefore better adapt to the size of the dataset and is more effective in controlling the rate of false discoveries. This work discusses the true potential of RNA-seq for studying regulation at low expression ranges, the noise within RNA-seq data and the issue of replication.

PMID: 21903743 [PubMed - as supplied by publisher]

Friday, 12 August 2011

Are 12 million 90 bp transcriptome reads enough for transcriptome assembly?

I posted a PubMed link recently; the authors "report the use of next-generation massively parallel sequencing technologies and de novo transcriptome assembly to gain a comprehensive overview of the H. brasiliensis transcriptome. The sequencing output generated more than 12 million reads with an average length of 90 nt. In total 48,768 unigenes (mean size = 436 bp, median size = 328 bp) were assembled through de novo transcriptome assembly."

Do you think such an assembly is truly useful for research? Or would higher coverage have been better?

Monday, 8 August 2011

Braintrust: What Neuroscience Tells Us about Morality.

What can science tell us about morality?
While the slippery and subjective nature of morality makes it a troubling specimen, it remains a crucial part of our lives—and therefore a topic ripe for scientific research.
However, scientists are skilled at describing what is—the circumstances under which people are more likely to lie, for instance—which is not the same as describing how we ought to live our lives, like when it’s OK to lie. So it’s not entirely clear what scientists can offer here without overstepping their bounds.
Yet in Braintrust: What Neuroscience Tells Us about Morality, Patricia Churchland carefully leads the reader through scientific findings with implications for morality and ethics, well aware of the pitfalls and rewards she may encounter along the way. Churchland, a professor of philosophy at the University of California, San Diego, quickly informs the reader that science cannot tell us what we ought to do to be moral, but that a review of findings from psychology and biology may explain how or why we do it. Her goal is to draw on these findings to build an objective framework in which to understand human morality. 


I think the challenge is to actually link human genomics with neurochemistry, although I am not sure anyone is prepared to face the ramifications of such studies.
Full review article here

Tuesday, 12 July 2011

A 3rd party evaluation of Ion Torrent's 316 chip data

Dan Koboldt (from massgenomics) has posted what I know to be the first independent look at the data from Ion Torrent's 316 chip.
Granted, the data was handed to him in a 'shiny report with color images', but he has bravely ignored that to give an honest look at the raw data itself.

The 316 chip gives a throughput that nicely covers WGS reseq experiments for bacterial-sized genomes. "The E. coli reference genome totals about 4.69 Mbp. With 175 Mbp of data, the theoretical coverage is around 37.5-fold across the E. coli genome."
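That back-of-the-envelope is easy to verify; it is just total sequenced bases over genome size:

  # theoretical fold-coverage = sequenced bases / genome size
  echo "scale=1; 175/4.69" | bc    # ~37.3x, matching the quoted ~37.5-fold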

For those wary of dry reviews, fear not, easily comprehensible graphs are posted within!

read the full post here

Saturday, 30 April 2011

Evaluation of next-generation sequencing software in mapping and assembly.


J Hum Genet. 2011 Apr 28;

Authors: Bao S, Jiang R, Kwan W, Wang B, Ma X, Song YQ

Next-generation high-throughput DNA sequencing technologies have advanced progressively in sequence-based genomic research and novel biological applications with the promise of sequencing DNA at unprecedented speed. These new non-Sanger-based technologies feature several advantages when compared with traditional sequencing methods in terms of higher sequencing speed, lower per run cost and higher accuracy. However, reads from next-generation sequencing (NGS) platforms, such as 454/Roche, ABI/SOLiD and Illumina/Solexa, are usually short, thereby restricting the applications of NGS platforms in genome assembly and annotation. We presented an overview of the challenges that these novel technologies meet and particularly illustrated various bioinformatics attempts on mapping and assembly for problem solving. We then compared the performance of several programs in these two fields, and further provided advices on selecting suitable tools for specific biological applications. Journal of Human Genetics advance online publication, 28 April 2011; doi:10.1038/jhg.2011.43.

PMID: 21525877 [PubMed - as supplied by publisher]




Friday, 3 December 2010

When playing games is working (if you are a biologist, that is)

Check out this flash game Phylo
If you are thinking it's related to phylogenetics, then bingo. Kudos for the excellent idea, graphics and interface, but I wish they had a better name and a less verbose introduction for laymen.
Waiting eagerly for the iPhone/iPod version to come out...

from http://phylo.cs.mcgill.ca/eng/about.html


What's Phylo all about?
Though it may appear to be just a game, Phylo is actually a framework for harnessing the computing power of mankind to solve a common problem: multiple sequence alignments.

What is a Multiple Sequence Alignment?
A sequence alignment is a way of arranging the sequences of DNA, RNA or protein to identify regions of similarity. These similarities may be consequences of functional, structural, or evolutionary relationships between the sequences.
From such an alignment, biologists may infer shared evolutionary origins, identify functionally important sites, and illustrate mutation events. More importantly, biologists can trace the source of certain genetic diseases.


The Problem
Traditionally, multiple sequence alignment algorithms use computationally complex heuristics to align the sequences.
Unfortunately, the use of heuristics does not guarantee global optimization, as it would be prohibitively computationally expensive to achieve an optimal alignment. This is due in part to the sheer size of the genome, which consists of roughly three billion base pairs, and to the increasing computational complexity resulting from each additional sequence in an alignment.


Our Approach
Humans have evolved to recognize patterns and solve visual problems efficiently.
By abstracting multiple sequence alignment to manipulating patterns consisting of coloured shapes, we have adapted the problem to benefit from human capabilities.
By taking data which has already been aligned by a heuristic algorithm, we allow the user to optimize where the algorithm may have failed.


The Data
All alignments were generously made available through the UCSC Genome Browser.
In fact, all alignments contain sections of human DNA which have been speculated to be linked to various genetic disorders, such as breast cancer.
Every alignment is received, analyzed, and stored in a database, where it will eventually be re-introduced back into the global alignment as an optimization.          

Tuesday, 30 November 2010

Why can't Bioscope / mapreads write to BAM natively?

Spotted this small fact in Bioscope 1.3.1 release notes.

  There is significant disk space required for converting ma to BAM when the option output.filter=none is used, which roughly needs 2TB peak disk space for converting a 500 million reads ma file. Other options do not need such large peak disk space. The disk space required per node is smaller if more jobs are dispatched to more nodes.



I would love to see the calculation of how they arrived at the figure of 2 TB. I am glad that they moved to BAM in the Bioscope workflow, but I am not entirely sure what the reason is for keeping the .ma file format when they are the only ones using it.
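For what it's worth, my own back-of-envelope: 2 TB spread over 500 million reads works out to roughly 4 KB of peak intermediate data per read, which only seems plausible if several uncompressed copies (the .ma file, intermediate SAM, unsorted BAM) sit side by side during conversion. Pure speculation on my part:

  # peak intermediate bytes per read implied by the release note
  echo "2 * 1024^4 / 500000000" | bc    # ~4398 bytes per read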

Monday, 8 November 2010

Trimming adaptor seq in colorspace (SOLiD)

Needed to do research on small RNA-seq using SOLiD.
Wasn't clear on the adaptor trimming procedure (it's dead easy with base-space FASTQ files, but oh well, SOLiD has directionality and read lengths don't really matter for small RNA).

Novoalign suggests the use of cutadapt as a colorspace adaptor trimming tool; I was going to script one in Python if it didn't exist.
Check their wiki page.
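From what I can make of the docs, a minimal colorspace run should look roughly like the sketch below. The -c flag and the csfasta/qual argument pair are my reading of the cutadapt 1.x usage (untested), and the adapter sequence is only a placeholder for whatever your library prep actually uses:

  # sketch: trim a 3' adapter from SOLiD colorspace reads (cutadapt 1.x usage)
  # the adapter below is a placeholder - substitute your kit's real P2 adapter
  cutadapt -c -a CGCCTTGGCCGTACAGCAG reads.csfasta reads.qual > trimmed.fastq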

Sadly, on CentOS I most probably will get this:

If you get this error:
   File "./cutadapt", line 62
    print("# There are %7d sequences in this data set." % stats.n, file=outfile)
                                                                       ^
SyntaxError: invalid syntax
Then your Python is too old. At least Python 2.6 is needed for cutadapt.

Have to dig up how to have two versions of Python on a CentOS box...
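The usual trick, as far as I know, is to build a newer Python from source next to the system one; make altinstall will not clobber the /usr/bin/python that yum depends on. A sketch (the version number is just an example):

  # build and install a second Python alongside the system interpreter
  wget http://www.python.org/ftp/python/2.7/Python-2.7.tgz
  tar xzf Python-2.7.tgz && cd Python-2.7
  ./configure --prefix=/usr/local
  make
  sudo make altinstall           # installs /usr/local/bin/python2.7, leaves /usr/bin/python alone
  python2.7 ./cutadapt --help    # run cutadapt under the newer interpreter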

Wednesday, 27 October 2010

Tophat adds support for strand-specific RNA-Seq alignment and colorspace

Hooray!
Testing TopHat 1.1.2 now.

On an 8 GB RAM CentOS box it managed to align 1 million reads to hg18 in 33 mins and 2 million reads in 59 mins, using 4 threads.
Nice scalability! But it was slower than I was used to with bowtie. I kept killing my full set of 90 million reads thinking something was wrong. Guess I need to be more patient and wait out the ~45 hours (90 million reads at roughly 30 mins per million).

I do wonder if the process can be spread across separate nodes to speed it up.
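TopHat has nothing built in for clusters as far as I can tell, but since reads are aligned independently, a crude scatter/gather ought to work: split the FASTQ, run one TopHat job per chunk on each node, and merge the resulting BAMs. A rough sketch (the index name is made up, and on a real cluster you would submit each job to the scheduler instead of backgrounding it):

  # scatter: split into 10-million-read chunks (4 lines per FASTQ record)
  split -l 40000000 reads.fastq chunk_
  for c in chunk_*; do
      tophat -p 4 -o tophat_$c hg18_index "$c" &
  done
  wait
  # gather: merge the per-chunk alignments into one BAM
  samtools merge all_accepted_hits.bam tophat_chunk_*/accepted_hits.bam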

Monday, 6 September 2010

Evaluation of next generation sequencing platforms for population targeted sequencing studies

I came across this paper earlier but didn't have time to blog much about it.
Papers that compare the sequencing platforms are getting rarer as the hype for NGS dies down and people become more interested in the next-next-gen seq machines (usually termed single-molecule seq).
Targeted reseq is a popular use of NGS, as the price of human whole-genome reseq is still not within reach for most (see Exome sequencing: the sweet spot before whole genomes).

There are inherent biases that people should be aware of before they jump right into it.

1) The NGS technologies generate a large amount of sequence but, for the platforms that produce short-sequence reads, greater than half of this sequence is not usable.
  • On average, 55% of the Illumina GA reads pass quality filters, of which approximately 77% align to the reference sequence.
  • For ABI SOLiD, approximately 35% of the reads pass quality filters, and subsequently 96% of the filtered reads align to the reference sequence.
  • In contrast to the platforms generating short read lengths, approximately 95% of the Roche 454 reads uniquely align to the target sequence.

Admittedly, the numbers have changed now that Illumina has longer read lengths (the paper tested 36 bp vs 35 bp).
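Chaining the two filters gives the effective usable fraction per platform, which makes the "greater than half is not usable" claim concrete:

  # usable fraction = (pass quality filters) x (align to reference)
  echo "0.55 * 0.77" | bc -l    # Illumina GA: ~0.42 of raw reads end up usable
  echo "0.35 * 0.96" | bc -l    # ABI SOLiD: ~0.34 of raw reads end up usable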

2) For PCR-based targeted sequencing, they observed that the mapped sequences corresponding to the 50 bp at the ends and the overlapping intervals of the amplicons have extremely high coverage.
  • These regions, representing about 2.3% (approximately 6 kb) of the targeted intervals, account for up to 56% of the sequenced base pairs for Illumina GA technology.
  • For the ABI SOLiD platform an amplicon end depletion protocol was employed to remove the overrepresented amplicon ends; this was partially successful and resulted in the ends accounting for up to 11% of the sequenced base pairs.  
  • For the Roche 454 technology, overrepresentation of amplicon ends versus internal bases is substantially less, with the ends composing only 5% of the total sequenced bases; this is likely due to library preparation process differences between Roche 454 and the short-read length platforms.
The overrepresentation of amplicon end sequences is not only wasteful for the sequencing yield but also decreases the expected average coverage depth across the targeted intervals. Therefore, to accurately assess the consequences of sequence coverage on data quality, we removed the 50 bp at the ends of the amplicons from subsequent analyses. 

I am not sure if this has changed since.

Note: Will update thoughts when I have more time.

Other Interesting papers
WGS vs exome seq
Whole-exome sequencing identifies recessive WDR62 mutations in severe brain malformations.
Identification by whole-genome resequencing of gene defect responsible for severe hypercholesterolemia.

Exome sequencing: the sweet spot before whole genomes.
Whole human exome capture for high-throughput sequencing.
Screening the human exome: a comparison of whole genome and whole transcriptome sequencing.
Novel multi-nucleotide polymorphisms in the human genome characterized by whole genome and exome sequencing. 

Family-based analysis and exome seq
Molecular basis of a linkage peak: exome sequencing and family-based analysis identify a rare genetic variant in the ADIPOQ gene in the IRAS Family Study.

Wednesday, 18 August 2010

Playing with NFS & GlusterFS on Amazon cc1.4xlarge EC2 instance types

I wish I had time to do stuff like what they do at bioteam:
Benchmarking the Amazon cc1.4xlarge EC2 instance.

These are the questions they aimed to answer:

We are asking very broad questions and testing assumptions along the lines of:
  • Does the hot new 10 Gigabit non-blocking networking fabric backing the new instance types really mean that “legacy” compute farm and HPC cluster architectures which make heavy use of network filesharing are possible?
  • How does filesharing between nodes look and feel on the new network and instance types?
  • Are the speedy ephemeral disks on the new instance types suitable for bundling into NFS shares or aggregating into parallel or clustered distributed filesystems?
  • Can we use the replication features in GlusterFS to mitigate some of the risks of using ephemeral disk for storage?
  • Should the shared storage built from ephemeral disk be assigned to “/scratch” or other non-critical duties due to the risks involved? What can we do to mitigate the risks?
  • At what scale is NFS the easiest and most suitable sharing option? What are the best NFS server and client tuning parameters to use?
  • When using parallel or cluster filesystems like GlusterFS, what rough metrics can we use to figure out how many data servers to dedicate to a particular cluster size or workflow profile?
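On the GlusterFS replication question above, the basic shape of a two-way replicated scratch volume over ephemeral disk would be something like this (hostnames and brick paths invented for illustration; syntax as per the gluster 3.x CLI):

  # on each instance: put a brick directory on the ephemeral disk
  mkdir -p /mnt/ephemeral/brick
  # create and start a volume that keeps two copies of every file
  gluster volume create scratch replica 2 node1:/mnt/ephemeral/brick node2:/mnt/ephemeral/brick
  gluster volume start scratch
  # mount on clients; losing one node's ephemeral disk no longer loses the data
  mount -t glusterfs node1:/scratch /scratch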

Tuesday, 10 August 2010

PyroNoise: Accurate determination of microbial diversity from 454 pyrosequencing data

Using 454 to do microbial ecology / metagenomics of environmental / soil samples?
Then I think you should take a look at this paper.

Quince, C., Lanzén, A., Curtis, T., Davenport, R., Hall, N., Head, I., Read, L., & Sloan, W. (2009). Accurate determination of microbial diversity from 454 pyrosequencing data Nature Methods, 6 (9), 639-641 DOI: 10.1038/nmeth.1361

The Pathogens blog has a good summary post on it.

Wednesday, 14 July 2010

the nuts and bolts behind ABI's SAET

I really do not like to use tools when I have no idea what they are trying to do.
ABI's SOLiD™ Accuracy Enhancer Tool (SAET) is one example that had extremely brief documentation beyond what it promised to do:

  • The SOLiD™ Accuracy Enhancer Tool (SAET) uses raw data generated by SOLiD™ Analyzer to correct miscalls within reads prior to mapping or contig assembly.  
  • Use of SAET, on various datasets of whole or sub-genomes of < 200 Mbp in size and of varying complexities, read lengths, and sequence coverages, has demonstrated improvements in mapping, SNP calling, and de novo assembly results.
  • For de novo applications, the tool reduces the miscall rate substantially.

Recently I attended an ABI talk and finally someone explained it with a nice diagram. It is akin to SoftGenetics' condensation tool (I made the link myself). Basically, it groups reads by similarity, and where they find a mismatch that is not supported by high-quality reads, they correct the low-quality read to reach a 'consensus'. I see it as a batch correction of sequencing errors which one can typically do by eye (for small regions). This correction isn't without its flaws; I now understand why such error correction isn't implemented on the instrument and is instead presented as a user choice. My rough experience with this tool is that it increases mapping by ~10%; how this 10% affects your results is debatable.

Wednesday, 26 May 2010

A scientific spectator's guide to next-generation sequencing

ROFL
I love the title!

A scientific spectator's guide to next-generation sequencing

Dr Keith not only looks at next gen sequencing but also the emerging technologies of single molecule sequencing. Interesting read!

My fave parts of the review
"Finally, there is the cost per base, generally expressed in a cost per human genome sequenced at approximately 40X coverage. To show one example of how these trade off, the new PacBio machine has a great cost per sample (~U$100) and per run (you can run just one sample) but a poor cost per human genome – you’d need around 12,000 of those runs to sequence a human genome (~U$120K). In contrast, one can buy a human genome on the open market for U$50K and sub U$10K genomes will probably be generally available this year."


"Length is critical to genome sequencing and RNA-seq experiments, but really short reads in huge numbers are what counts for DGE/SAGE and many of the functional tag sequencing methods. Technologies with really long reads tend not to give as many, and with all of them you can always choose a much shorter run to enable the machine to be turned over to another job sooner – if your application doesn’t need long reads."

Wednesday, 19 May 2010

What do you use for citation / bibliography / reference in writing?

Am looking at
http://www.zotero.org/
also exploring
http://www.wizfolio.com/

Found this on the web as well
http://www.easybib.com/

While I like that Zotero is well integrated with my browser and has OpenOffice plugins, keeping a backup of the references and keeping them synced is a problem. I would much rather have my references in the cloud, which makes for easier sharing. Suggestions, anyone?
Not EndNote please... I seldom work on Windows machines.

Tuesday, 18 May 2010

Book review: Programming Collective Intelligence

Programming Collective Intelligence: Building Smart Web 2.0 Applications by Toby Segaran

Friday, 14 May 2010

Lincoln Stein makes his case for moving genome informatics to the Cloud

Matthew Dublin summarizes Lincoln's paper in Making the Case for Cloud Computing & Genomics in genomeweb

excerpt "....
Stein walks the reader through a nice explanation of what exactly cloud computing is, the benefits of using a compute solution that grows and shrinks as needed, and makes an attempt at tackling the question of the cloud's economic viability when compared to purchasing and managing local compute resources.
The take away is that Moore's Law and its effect on sequencing technology will soon force researchers to analyze their mountains of sequencing data in a paradigm where the software comes to the data rather than the current, and opposite, approach. Stein says that this means now more than ever, cloud computing is a viable and attractive option..... "

Yet to read it (my weekend bedtime story); will post comments here.
