PLoS ONE: myKaryoView: A Light-Weight Client for Visualization of Genomic Data
http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0026345
VIB - Ghent University;
With the arrival of low-cost, next-generation sequencing a multitude of new plant genomes is being publicly released, providing unseen opportunities and challenges for comparative genomics studies. Here, we present PLAZA 2.5, a user-friendly online research environment to explore genomic information from different plants. This new release features updates to previous genome annotations and a substantial number of newly available plant genomes, as well as various new interactive tools and visualizations. Currently, PLAZA hosts 25 organisms covering a broad taxonomic range, including 13 eudicots, five monocots, one Lycopod, one moss, and five algae. The available data consist of structural and functional gene annotations, homologous gene families, multiple sequence alignments, phylogenetic trees, and colinear regions within and between species. A new Integrative Orthology Viewer, combining information from different orthology prediction methodologies, was developed to efficiently investigate complex orthology relationships. Cross-species expression analysis revealed that the integration of complementary data types extended the scope of complex orthology relationships, especially between more distantly related species. Finally, based on phylogenetic profiling, we propose a set of core gene families within the green plant lineage that will be instrumental to assess the gene space of draft or newly sequenced plant genomes during the assembly or annotation phase.
Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709, USA.
ART is a set of simulation tools that generate synthetic next-generation sequencing reads. This functionality is essential for testing and benchmarking tools for next-generation sequencing data analysis, including read alignment, de novo assembly, and genetic variation discovery. ART generates simulated sequencing reads by emulating the sequencing process with built-in, technology-specific read error models and base quality value profiles parameterized empirically in large sequencing datasets. We currently support all three major commercial next-generation sequencing platforms: Roche's 454, Illumina's Solexa, and Applied Biosystems' SOLiD. ART also allows the flexibility to use customized read error model parameters and quality profiles.
Both source and binary software packages are available at http://www.niehs.nih.gov/research/resources/software/art.
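ART's actual error models are empirical, but the basic idea of emulating a sequencing run, sampling reads from a reference and injecting substitution errors according to a per-cycle quality profile, can be sketched in a few lines of Python. The profile and error model below are fabricated for illustration only and are not ART's.

import random

def simulate_reads(reference, n_reads=5, read_len=10, seed=0):
    """Toy read simulator: sample substrings of a reference and introduce
    substitution errors with a per-cycle error probability derived from a
    (fabricated) quality profile; real simulators such as ART estimate
    these profiles empirically from large sequencing datasets."""
    random.seed(seed)
    quality_profile = [30] * (read_len // 2) + [10] * (read_len - read_len // 2)
    bases = "ACGT"
    reads = []
    for _ in range(n_reads):
        start = random.randrange(len(reference) - read_len + 1)
        read = list(reference[start:start + read_len])
        for i, q in enumerate(quality_profile):
            p_error = 10 ** (-q / 10)          # Phred scale: Q = -10*log10(p)
            if random.random() < p_error:
                read[i] = random.choice([b for b in bases if b != read[i]])
        reads.append(("".join(read), quality_profile))
    return reads

if __name__ == "__main__":
    ref = "ACGTACGTTAGCCGATCGATCGGATCCAAGT"
    for seq, quals in simulate_reads(ref):
        print(seq, quals)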
RNASEQR was written in Python 2.7 and runs on 64-bit Linux systems. It employs a Burrows–Wheeler transform (BWT)-based and a hash-based indexing algorithm. Briefly, there are three sequential processing steps: the first step is to align RNA-Seq sequences to a transcriptomic reference; the second step is to detect novel exons; the third step is to identify novel splice junctions using an anchor-and-align strategy.
http://m.nar.oxfordjournals.org/content/early/2011/12/22/nar.gkr1248.full
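The "anchor-and-align" idea behind the third step can be illustrated with a toy example: anchor the 5' end of an unmapped read on the genome by exact match, then search downstream for the remainder of the read to infer a candidate splice junction. This is only a schematic sketch (exact matching, no mismatches, scoring or splice-site checks), not RNASEQR's implementation.

def find_junction(read, genome, anchor_len=8, max_intron=1000):
    """Toy anchor-and-align: exact-match the read's first anchor_len bases to
    the genome, then look for the rest of the read within max_intron bases
    downstream. Returns (donor_end, acceptor_start) coordinates or None."""
    anchor, rest = read[:anchor_len], read[anchor_len:]
    a = genome.find(anchor)
    if a < 0 or not rest:
        return None
    donor_end = a + anchor_len                  # genomic end of the anchored piece
    window = genome[donor_end:donor_end + max_intron]
    b = window.find(rest)
    if b <= 0:                                  # b == 0 means no gap, i.e. no junction
        return None
    acceptor_start = donor_end + b              # where the read resumes after the "intron"
    return donor_end, acceptor_start

if __name__ == "__main__":
    genome = "TTT" + "ACGTACGT" + "G" * 18 + "CCAATTCC" + "AAA"
    read = "ACGTACGT" + "CCAATTCC"              # spliced read spanning the junction
    print(find_junction(read, genome))          # -> (11, 29)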
Reason 10 is kinda unexpected but it's a good effort by Life Tech
Designed to minimize environmental impact—The 5500 Series Genetic Analyzers separate the truly hazardous waste from the nonhazardous waste, reducing the amount of waste classified as hazardous by as much as 38%. All bottles and strip tubes are recyclable and clearly marked with either a (2) or a (5) recycling symbol. All product documentation, including user manuals, quick start guides, and product inserts are delivered on an Apple iPad® device. (We like trees.)
Dear Amazon S3 Customer,
Today we're excited to announce Object Expiration, a new feature to help you efficiently manage data stored in Amazon S3. Object Expiration enables you to schedule the removal of objects after a defined time period.
You can define Object Expiration rules for a set of objects in your bucket. Each expiration rule allows you to specify a prefix and an expiration period in days. The prefix field (e.g. "logs/") identifies the object(s) subject to the expiration rule, and the expiration period specifies the number of days from creation date (i.e. age) after which object(s) should be removed. You may create multiple expiration rules for different prefixes. After an Object Expiration rule is added, the rule is applied to objects with the matching prefix that already exist in the bucket as well as new objects added to the bucket. Once the objects are past their expiration date, they will be queued for deletion. You will not be charged for storage for objects on or after their expiration date. Amazon S3 doesn't charge you for using Object Expiration. You can use Object Expiration rules on objects stored in both Standard and Reduced Redundancy storage. Using Object Expiration rules to schedule periodic removal of objects eliminates the need to build processes to identify objects for deletion and submit delete requests to Amazon S3.
You can start using Object Expiration today using the AWS Management Console or the Amazon S3 API.
To use Object Expiration from the AWS Management console:
Under the Amazon S3 tab, select the bucket on which you want to apply Object Expiration rules.
Select the "Properties" action on that bucket.
Select the "Lifecycle" Tab.
Under the "Lifecycle" tab, add an Object Expiration rule by specifying a prefix (e.g. "logs/") that matches the object(s) you want to expire. Also specify the number of days from creation after which object(s) matching the prefix should be expired.
You can optionally specify a name for the rule for better organization.
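The same rule can also be created programmatically against the S3 API. Below is a minimal sketch using the boto3 Python client (which postdates this announcement); the bucket name, prefix, rule ID and 30-day period are placeholder example values.

import boto3

s3 = boto3.client("s3")

# Expire objects under the "logs/" prefix 30 days after their creation date.
# Bucket name, prefix, rule ID and the 30-day period are example values.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-logs",          # optional rule name
                "Filter": {"Prefix": "logs/"},    # objects the rule applies to
                "Status": "Enabled",
                "Expiration": {"Days": 30},       # days from creation (object age)
            }
        ]
    },
)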
For more information on using Object Expiration, please see the Object Expiration topic in the Amazon S3 Developer Guide.
Sincerely,
The Amazon S3 Team
http://m.wired.com/wiredenterprise/2011/12/nonexistent-supercomputer/
Cycle Computing set up a virtual supercomputer for an unnamed pharmaceutical giant that spans 30,000 processor cores, and it cost $1,279 an hour. Stowe, who has spent more than two decades in the supercomputing game, working with supercomputers at Carnegie Mellon University and Cornell, says there's still a need for dedicated supercomputers you install in your own data center, but things are changing.
Department of Microbiology & Immunology, University of Michigan, Ann Arbor, Michigan, United States of America.
The advent of next generation sequencing has coincided with a growth in interest in using these approaches to better understand the role of the structure and function of the microbial communities in human, animal, and environmental health. Yet, use of next generation sequencing to perform 16S rRNA gene sequence surveys has resulted in considerable controversy surrounding the effects of sequencing errors on downstream analyses. We analyzed 2.7×10^6 reads distributed among 90 identical mock community samples, which were collections of genomic DNA from 21 different species with known 16S rRNA gene sequences; we observed an average error rate of 0.0060. To improve this error rate, we evaluated numerous methods of identifying bad sequence reads, identifying regions within reads of poor quality, and correcting base calls, and we were able to reduce the overall error rate to 0.0002. Implementation of the PyroNoise algorithm provided the best combination of error rate, sequence length, and number of sequences. Perhaps more problematic than sequencing errors was the presence of chimeras generated during PCR. Because we knew the true sequences within the mock community and the chimeras they could form, we identified 8% of the raw sequence reads as chimeric. After quality filtering the raw sequences and using the UCHIME chimera detection program, the overall chimera rate decreased to 1%. The chimeras that could not be detected were largely responsible for the identification of spurious operational taxonomic units (OTUs) and genus-level phylotypes. The number of spurious OTUs and phylotypes increased with sequencing effort, indicating that comparisons of communities should be made using an equal number of sequences. Finally, we applied our improved quality-filtering pipeline to several benchmarking studies and observed that, even with our stringent data curation pipeline, biases in the data generation pipeline and batch effects were observed that could potentially confound the interpretation of microbial community data.
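The error rate quoted above is essentially the fraction of bases that disagree when each read is compared with the known reference sequence of the mock community member it derives from. A toy version of that calculation, assuming reads have already been aligned gap-free to their true references, might look like this (substitutions only; the actual pipeline also handles indels and chimeras):

def error_rate(pairs):
    """pairs: iterable of (observed_read, true_reference_segment) of equal length.
    Returns the fraction of bases that disagree (substitution errors only)."""
    mismatches = total = 0
    for read, truth in pairs:
        assert len(read) == len(truth)
        mismatches += sum(1 for a, b in zip(read, truth) if a != b)
        total += len(read)
    return mismatches / total if total else 0.0

if __name__ == "__main__":
    mock = [("ACGTTGCA", "ACGTTGCA"),   # perfect read
            ("ACGTAGCA", "ACGTTGCA")]   # one substitution in eight bases
    print(error_rate(mock))             # 1/16 = 0.0625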
Division of DNA Repair and Genome Stability, Department of Radiation Oncology;
Despite widespread interest in next-generation sequencing (NGS), the adoption of personalized clinical genomics and mutation profiling of cancer specimens is lagging, in part because of technical limitations. Tumors are genetically heterogeneous and often contain normal/stromal cells, features that lead to low-abundance somatic mutations that generate ambiguous results or reside below NGS detection limits, thus hindering the clinical sensitivity/specificity standards of mutation calling. We applied COLD-PCR (coamplification at lower denaturation temperature PCR), a PCR methodology that selectively enriches variants, to improve the detection of unknown mutations before NGS-based amplicon resequencing.
We used both COLD-PCR and conventional PCR (for comparison) to amplify serially diluted mutation-containing cell-line DNA diluted into wild-type DNA, as well as DNA from lung adenocarcinoma and colorectal cancer samples. After amplification of TP53 (tumor protein p53), KRAS (v-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog), IDH1 [isocitrate dehydrogenase 1 (NADP(+)), soluble], and EGFR (epidermal growth factor receptor) gene regions, PCR products were pooled for library preparation, bar-coded, and sequenced on the Illumina HiSeq 2000.
In agreement with recent findings, sequencing errors by conventional targeted-amplicon approaches dictated a mutation-detection limit of approximately 1%-2%. Conversely, COLD-PCR amplicons enriched mutations above the error-related noise, enabling reliable identification of mutation abundances of approximately 0.04%. Sequencing depth was not a large factor in the identification of COLD-PCR-enriched mutations. For the clinical samples, several missense mutations were not called with conventional amplicons, yet they were clearly detectable with COLD-PCR amplicons. Tumor heterogeneity for the TP53 gene was apparent.
As cancer care shifts toward personalized intervention based on each patient's unique genetic abnormalities and tumor genome, we anticipate that COLD-PCR combined with NGS will elucidate the role of mutations in tumor progression, enabling NGS-based analysis of diverse clinical specimens within clinical practice.
Inherited Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, Baltimore, Md., USA.
Linkage analysis was developed to detect excess co-segregation of the putative alleles underlying a phenotype with the alleles at a marker locus in family data. Many different variations of this analysis and corresponding study design have been developed to detect this co-segregation. Linkage studies have been shown to have high power to detect loci that have alleles (or variants) with a large effect size, i.e. alleles that make large contributions to the risk of a disease or to the variation of a quantitative trait. However, alleles with a large effect size tend to be rare in the population. In contrast, association studies are designed to have high power to detect common alleles which tend to have a small effect size for most diseases or traits. Although genome-wide association studies have been successful in detecting many new loci with common alleles of small effect for many complex traits, these common variants often do not explain a large proportion of disease risk or variation of the trait. In the past, linkage studies were successful in detecting regions of the genome that were likely to harbor rare variants with large effect for many simple Mendelian diseases and for many complex traits. However, identifying the actual sequence variant(s) responsible for these linkage signals was challenging because of difficulties in sequencing the large regions implicated by each linkage peak. Current 'next-generation' DNA sequencing techniques have made it economically feasible to sequence all exons or the whole genomes of a reasonably large number of individuals. Studies have shown that rare variants are quite common in the general population, and it is now possible to combine these new DNA sequencing methods with linkage studies to identify rare causal variants with a large effect size. A brief review of linkage methods is presented here with examples of their relevance and usefulness for the interpretation of whole-exome and whole-genome sequence data.
A collection of various perl scripts that utilize BioPerl modules for use in bioinformatics analysis. Tools are included for processing microarray data, next generation sequencing data, data file format conversion, querying datasets, and general high level analysis of datasets.
This tool box of programs heavily relies on storing genome annotation, microarray, and next generation sequencing data in bioperl databases, allowing for data retrieval relative to any annotated feature in the database.
Also included are programs for converting and importing data from UCSC gene tables and ensEMBL, as well as a variety of other formats, into a GFF3 file that can be loaded into a bioperl database.
This set of tools is designed to complement the Generic Genome Browser (GBrowse). If you view your model organism's genome annotation and microarray or next generation sequencing data using GBrowse, then these tools will assist you in fully analyzing your data.
Even if you don't use GBrowse, these programs may still be useful. Please check out the list of programs to see if they meet your needs.
This is a list of programs.
This is an example of preparing and loading data into the database.
This is a list of supported data formats. In short, the tools work with GFF, BED, wig, bigWig, bigBed, and Bam data formats and annotations. Most bioinformatic data can be represented in one or more of these formats, or at the very least converted (see the toy conversion sketch after this list).
This is an example of how to collect data.
This is an example of working with Next Generation Sequencing data.
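As a toy illustration of the kind of format conversion mentioned above (this is not one of the toolbox scripts), the snippet below turns minimal BED lines into GFF3-style lines, handling the 0-based half-open vs. 1-based inclusive coordinate conventions:

def bed_to_gff3(bed_lines, source="toy_converter"):
    """Convert minimal BED lines (chrom, start, end[, name, score, strand]) into
    GFF3 lines. BED coordinates are 0-based half-open; GFF3 is 1-based inclusive."""
    out = []
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        f = line.rstrip("\n").split("\t")
        chrom, start, end = f[0], int(f[1]), int(f[2])
        name = f[3] if len(f) > 3 else "."
        score = f[4] if len(f) > 4 else "."
        strand = f[5] if len(f) > 5 else "."
        out.append("\t".join([chrom, source, "region", str(start + 1), str(end),
                              score, strand, ".", "ID=" + name]))
    return out

if __name__ == "__main__":
    print("\n".join(bed_to_gff3(["chr1\t999\t2000\tmy_feature\t0\t+"])))
    # -> chr1  toy_converter  region  1000  2000  0  +  .  ID=my_feature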
These are command line Perl programs designed for modern Unix-based computers. Most of the analysis programs rely heavily on BioPerl modules, so at a minimum BioPerl should be installed. Additionally, if you want to use Bam, bigWig, bigBed, or wig data files, additional modules will need to be installed. Most (all?) of the programs should fail gracefully if the required modules are not installed. Some programs are quite minimal and may run without even BioPerl installed. A utility is provided to check for any missing or out of date modules.
These programs were initially written to assist me in my own laboratory research. As they expanded in scope, I realized they could be useful to others in the same predicament as me, so I am releasing them for others to use.
http://www.biospectrumasia.com/content/221211OTH17733.asp
The system is generating over 2 Gb of data per run with a high percentage of bases over Q30. The high data yield and superior quality allow researchers to conduct a wide variety of sequencing applications, including multiplexed PCR amplicon sequencing, small genome and de novo sequencing, small RNA sequencing, targeted resequencing and 16S metagenomics. "The addition of numerous Illumina MiSeqs adds another dimension of sequencing for our clients," said Mr. Ardy Arianpour of business development at Ambry Genetics. "Our scientists spent the last couple of months validating the sequencing with amazing results, so we can deliver and work with multiple samples that fit on the MiSeq."
mincemeat.py is a Python implementation of the MapReduce distributed computing framework.
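mincemeat.py distributes user-supplied map and reduce functions to worker processes over the network. Stripped of the networking, the pattern it implements boils down to the following single-process sketch (plain Python for illustration, not the mincemeat API itself):

from collections import defaultdict

def mapreduce(datasource, mapfn, reducefn):
    """Single-process illustration of the MapReduce pattern: map every
    (key, value) pair, group intermediate pairs by key, then reduce each group."""
    intermediate = defaultdict(list)
    for key, value in datasource.items():
        for out_key, out_value in mapfn(key, value):
            intermediate[out_key].append(out_value)
    return {k: reducefn(k, vs) for k, vs in intermediate.items()}

if __name__ == "__main__":
    # Classic word count over a toy "datasource" of documents.
    docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "the fox"}
    counts = mapreduce(docs,
                       mapfn=lambda k, text: ((w, 1) for w in text.split()),
                       reducefn=lambda word, ones: sum(ones))
    print(counts)   # e.g. {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, ...}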
Institute of Liver Studies, Liver Immunopathology, King's College London School of Medicine at King's College Hospital, Denmark Hill Campus, London SE5 9RS, UK.
Twin studies are powerful tools to discriminate whether a complex disease is due to genetic or environmental factors. High concordance rates among monozygotic (MZ) twins support genetic factors being predominantly involved, whilst low rates are suggestive of environmental factors. Twin studies have often been utilised in the study of systemic and organ specific autoimmune diseases. As an example, type I diabetes mellitus has been investigated to establish that the disease is largely affected by genetic factors, compared to rheumatoid arthritis or scleroderma, which have a weaker genetic association. However, large twin studies are scarce or virtually non-existent in other autoimmune diseases, where reports have been limited to a few sets of twins and individual case reports. In addition to the study of the genetic and environmental contributions to disease, it is likely that twin studies will also provide data regarding the clinical course of disease, as well as risk for development in related individuals. More importantly, genome-wide association studies have thus far reported genomic variants that only account for a minority of autoimmunity cases, and cannot explain disease discordance in MZ twins. Future research is therefore encouraged not only in the analysis of twins with autoimmune disease, but also with regard to epigenetic factors or rare variants that may be discovered with next-generation sequencing. This review will examine the literature surrounding twin studies in autoimmune disease including discussions of genetics and gender.
Copyright © 2011 Elsevier Ltd. All rights reserved.
Genome sequencing has been revolutionized by next-generation technologies, which can rapidly produce vast quantities of data at relatively low cost. With data production now no longer being limited, there is a huge challenge to analyse the data flood and interpret biological meaning. Bioinformatics scientists have risen to the challenge and a large number of software tools and databases have been produced and these continue to evolve with this rapidly advancing field. Here, we outline some of the tools and databases commonly used for the analysis of next-generation sequence data with comment on their utility.
Next generation sequencing has enabled systematic discovery of mutational spectra in cancer samples. Here, we used whole genome sequencing to characterize somatic mutations and structural variation in a primary acral melanoma and its lymph node metastasis. Our data show that the somatic mutational rates in this acral melanoma sample pair were more comparable to the rates reported in cancer genomes not associated with mutagenic exposure than in the genome of a melanoma cell line or the transcriptome of melanoma short-term cultures. Despite the perception that acral skin is sun-protected, the dominant mutational signature in these samples is compatible with damage due to ultraviolet light exposure. A nonsense mutation in ERCC5 discovered in both the primary and metastatic tumors could also have contributed to the mutational signature through accumulation of unrepaired dipyrimidine lesions. However, evidence of transcription-coupled repair was suggested by the lower mutational rate in the transcribed regions and expressed genes. The primary and the metastasis are highly similar at the level of global gene copy number alterations, loss of heterozygosity and single nucleotide variation (SNV). Furthermore, the majority of the SNVs in the primary tumor were propagated in the metastasis and one nonsynonymous coding SNV and one splice site mutation appeared to arise de novo in the metastatic lesion.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
>
> source("http://bioconductor.org/biocLite.R")
> biocLite("HilbertVis")
Using R version 2.10.0, biocinstall version 2.5.11.
Installing Bioconductor version 2.5 packages:
[1] "HilbertVis"
Please wait...
Warning in install.packages(pkgs = pkgs, repos = repos, ...) :
argument 'lib' is missing: using '/usr/lib64/R/library'
Warning in install.packages(pkgs = pkgs, repos = repos, ...) :
'lib = "/usr/lib64/R/library"' is not writable
Would you like to create a personal library
'~/R/x86_64-redhat-linux-gnu-library/2.10'
to install packages into? (y/n)
y
ABSTRACT:
Next generation sequencing (NGS) enables a more comprehensive analysis of bacterial diversity from complex environmental samples. NGS data can be analysed using a variety of workflows. We test several simple and complex workflows, including frequently used as well as recently published tools, and report on their respective accuracy and efficiency under various conditions covering different sequence lengths, number of sequences and real world experimental data from rhizobacterial populations of glyphosate-tolerant maize treated or untreated with two different herbicides representative of differential diversity studies.
Alignment and distance calculations affect OTU estimations, and multiple sequence alignment exerts a major impact on the computational time needed. Generally speaking, most of the analyses produced consistent results that may be used to assess differential diversity changes; however, dataset characteristics dictate which workflow should be preferred in each case.
When estimating bacterial diversity, ESPRIT, as well as the web-based RDP pyrosequencing pipeline, produced good results in all circumstances; however, its computational requirements can make method-combination workflows more attractive, depending on sequence variability, number and length.
Department of Computational Biology, Graduate School of Frontier Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa-shi, Chiba-ken, 277-8561, Japan.
The growth of next generation sequencing means that more effective and efficient archiving methods are needed to store the generated data for public dissemination and in anticipation of more mature analytical methods later. This paper examines methods for compressing the quality score component of the data to partly address this problem.
We compare several compression policies for quality scores, in terms of both compression effectiveness and overall efficiency. The policies employ lossy and lossless transformations with one of several coding schemes. Experiments show that both lossy and lossless transformations are useful, and that simple coding methods, which consume fewer computing resources, are highly competitive, especially when random access to reads is needed. Availability and implementation: Our C++ implementation, released under the Lesser General Public License, is available for download at http://www.cb.k.u-tokyo.ac.jp/asailab/members/rwan.
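To make the "lossy transformation plus a simple coding scheme" idea concrete, here is a small sketch that bins Phred quality scores into coarse levels (lossy) and then run-length encodes the result (a simple lossless code). It illustrates the general approach only, not the specific policies evaluated in the paper.

def bin_qualities(quals, bin_size=10):
    """Lossy step: collapse Phred scores into coarse bins (e.g. 30-39 -> 35)."""
    return [(q // bin_size) * bin_size + bin_size // 2 for q in quals]

def run_length_encode(values):
    """Simple lossless code: a list of (value, run length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

if __name__ == "__main__":
    quals = [38, 37, 36, 36, 35, 20, 19, 18, 2, 2, 2]   # typical decay along a read
    binned = bin_qualities(quals)
    print(binned)                       # [35, 35, 35, 35, 35, 25, 15, 15, 5, 5, 5]
    print(run_length_encode(binned))    # [(35, 5), (25, 1), (15, 2), (5, 3)]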
ABSTRACT:
For next generation DNA sequencing, we have developed a rapid and simple approach for preparing DNA libraries of targeted DNA content. Current protocols for preparing DNA for next-generation targeted sequencing are labor-intensive, require large amounts of starting material, and are prone to artifacts that result from necessary PCR amplification of sequencing libraries. Typically, sample preparation for targeted NGS is a two-step process where (1) the desired regions are selectively captured and (2) the ends of the DNA molecules are modified to render them compatible with any given NGS sequencing platform.
In this proof-of-concept study, we present an integrated approach that combines these two separate steps into one. Our method involves circularization of a specific genomic DNA molecule that directly incorporates the necessary components for conducting sequencing in a single assay and requires only one PCR amplification step. We also show that specific regions of the genome can be targeted and sequenced without any PCR amplification.
We anticipate that these rapid targeted libraries will be useful for validation of variants and may have diagnostic application.
We have implemented a temporary block on Google Chrome's access to the Libraries' e-resources for the following reason: the built-in PDF plug-in API in Chrome (which reportedly improves the browser's support for PDF: http://blog.chromium.org/2010/06/bringing-improved-pdf-support-to-google.html) results in the same PDF document being downloaded multiple times. This method of downloading PDF files currently interferes with the method used by e-resources publishers and our library proxy violation prevention system to detect systematic and massive downloading, causing both the publishers and the library proxy to mistakenly treat it as a violation. We encountered a number of false alarms before we activated the block.
We have since explored possible solutions to this problem. We found that the workaround to resolve it would involve significant structural system changes and costs to the Library. While we continue to actively look for other, more cost-effective solutions and alternatives, we advise Chrome users to use other web browsers to access our e-resources in the meantime.
We seek your understanding in this matter and apologize for any inconvenience caused.
'If it ain't broke, don't fix it'
CNAnorm is a Bioconductor package to estimate Copy Number Aberrations (CNA) in cancer samples.
It is described in the paper:
Gusnanto, A., Wood, H.M., Pawitan, Y., Rabbitts, P. and Berri, S. Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next generation sequence data. 2011. Bioinformatics, epub ahead of print.
CNAnorm performs ratio calculation, GC content correction, and normalization of data obtained using very low coverage (one read every 100-10,000 bp) high-throughput sequencing. It performs a "discrete" normalization that looks for the ploidy of the genome, and it also estimates tumour content if at least two ploidy states can be found.
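As a rough illustration of what ratio and GC content correction mean here (a toy Python sketch, not the CNAnorm R code): given per-window read counts for tumour and control plus the GC fraction of each window, compute the tumour/control ratio and divide out the median ratio within each GC bin. CNAnorm itself goes further, fitting the ploidy states.

from collections import defaultdict
from statistics import median

def gc_corrected_ratios(tumour, control, gc, gc_bin=0.1):
    """tumour, control: read counts per genomic window; gc: GC fraction per window.
    Returns tumour/control ratios after dividing out the median ratio of each GC bin
    (a crude stand-in for the smoother GC correction used in real pipelines)."""
    ratios = [t / c if c else float("nan") for t, c in zip(tumour, control)]
    by_bin = defaultdict(list)
    for r, g in zip(ratios, gc):
        if r == r:                                   # skip NaN windows
            by_bin[round(g / gc_bin)].append(r)
    bin_median = {b: median(v) for b, v in by_bin.items()}
    return [r / bin_median[round(g / gc_bin)] if r == r else r
            for r, g in zip(ratios, gc)]

if __name__ == "__main__":
    tumour  = [120, 130, 125, 118, 250, 60]          # toy per-window read counts
    control = [100, 100, 100, 100, 100, 50]
    gc      = [0.38, 0.39, 0.52, 0.53, 0.52, 0.41]
    print(gc_corrected_ratios(tumour, control, gc))  # the fifth window stands out (~2x)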
Get the latest version of CNAnorm and its documentation from Bioconductor. Prerequisites: a Fortran compiler, make, and DNAcopy from Bioconductor.
You can also download the Perl script bam2windows.pl (version 0.3.3) to convert sam/bam files to the text files required by CNAnorm. For documentation on usage, run the script without arguments:
perl bam2windows.pl
For further information on both programs, please contact Stefano Berri
We provide gc1000Base.txt.gz, an example file for GC content (build GRCh37/hg19) to optionally use with bam2windows.pl. It provides average GC content every 1000 bp. The size of the window in the GC content file should be at least an order of magnitude smaller than the window used for CNAnorm to minimise boundary effects. If you require higher resolution, you can download the gc5Base tables from UCSC and/or make your own. The smaller the window size in the GC content file, the larger the file will be, and the longer it will take bam2windows.pl to process it.
We provide the bam files used to produce the dataset included in CNAnorm
LS041_tumour.bam (139 MB)
LS041_control.bam (130 MB)
To produce a text file suitable as input for CNAnorm you can enter the following
perl bam2windows.pl --gc_file gc1000Base.txt.gz LS041_tumour.bam LS041_control.bam > LS041.tab
It will produce this file
You need samtools installed in a directory in your $PATH if your input files are in bam format.
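For readers who prefer Python, the core of what bam2windows.pl does, counting reads per fixed-size window for a tumour/control pair, can be sketched with pysam as below (file names are the example BAMs above; the BAMs must be indexed, and this sketch omits the script's other options such as GC annotation):

import pysam

def window_counts(bam_path, window=10000):
    """Count reads overlapping each fixed-size window of every reference sequence.
    Requires an indexed BAM (.bai) so that pysam's count() can query by region."""
    counts = []
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for chrom, length in zip(bam.references, bam.lengths):
            for start in range(0, length, window):
                end = min(start + window, length)
                counts.append((chrom, start, end, bam.count(chrom, start, end)))
    return counts

if __name__ == "__main__":
    tumour = window_counts("LS041_tumour.bam")
    control = window_counts("LS041_control.bam")
    # Tab-separated table: chromosome, window start, tumour count, control count
    for (chrom, start, _, t), (_, _, _, c) in zip(tumour, control):
        print("%s\t%d\t%d\t%d" % (chrom, start, t, c))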
The data provided is the mapped colorspace output from LifeScope v2.0 using default parameters. This PDF contains a high level summary of the data.
Bashir et al. have concluded that more than 90% of the transcripts in human samples are adequately covered with just one million sequence reads. Wang et al. showed that 8 million reads are sufficient to reach RNA-Seq saturation for most samples.
Experiments whose purpose is to evaluate the similarity between the transcriptional profiles of two polyA+ samples may require only modest depths of sequencing (e.g. 30M paired-end reads of length > 30 nt, of which 20-25M are mappable to the genome or known transcriptome). Experiments whose purpose is discovery of novel transcribed elements and strong quantification of known transcript isoforms require more extensive sequencing. The ability to reliably detect low copy number transcripts/isoforms depends upon the depth of sequencing and on a sufficiently complex library.
The analysis from the current study demonstrated that 30 M (75 bp) reads are sufficient to detect all annotated genes in chicken lungs. Ten million (75 bp) reads could detect about 80% of annotated chicken genes.
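Saturation claims like these are usually checked by subsampling: repeatedly draw a fraction of the reads and count how many genes are still detected; once the curve flattens, extra depth adds little. A toy version of that check, with made-up per-gene read assignments, could look like this:

import random

def detected_genes(read_gene_labels, fraction, min_reads=1, seed=0):
    """Subsample a fraction of reads (each read labelled with the gene it maps to)
    and count genes with at least min_reads reads in the subsample."""
    random.seed(seed)
    subsample = [g for g in read_gene_labels if random.random() < fraction]
    counts = {}
    for g in subsample:
        counts[g] = counts.get(g, 0) + 1
    return sum(1 for n in counts.values() if n >= min_reads)

if __name__ == "__main__":
    # Made-up library: 1,000 genes with widely varying expression levels.
    random.seed(1)
    expression = [random.randint(1, 200) for _ in range(1000)]
    reads = [g for g, n in enumerate(expression) for _ in range(n)]
    for f in (0.01, 0.05, 0.25, 1.0):
        print("%4.0f%% of reads -> %d genes detected" % (f * 100, detected_genes(reads, f)))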