Illumina IGN Webinar Series: Webinar Two - Expanding your Current WGS Knowledge
Saturday, 30 June 2012
Recording Available: Illumina IGN Webinar Series: Webinar Two - Expanding your Current WGS Knowledge
Datanami: Virtualizing the Mighty Elephant
VMware releases a new project that allows Hadoop to be deployed in a virtual environment.
http://www.datanami.com/datanami/2012-06-26/virtualizing_the_mighty_elephant.html
Thursday, 28 June 2012
FAQ: What is genome build 'hg_g1k_v37'?
Here's an explanation lifted from the galaxy-user list
From: Jennifer Jackson
Date: 27 June 2012 23:50
Subject: Re: [galaxy-user] Problem with Depth of Coverage on BAM files (GATK tools)
The genome build 'hg_g1k_v37' is build "b37" in the GATK documentation. Hg19 is also included (as a distinct build). I encourage you to examine these if you are interested in crossing over between genomes or identifying other projects that have data based on the same genome build.
http://www.broadinstitute.org/gsa/wiki/index.php/Introduction_to_the_GATK ->
http://www.broadinstitute.org/gsa/wiki/index.php/GATK_resource_bundle
" GATK resource bundle: A collection of standard files for working with human resequencing data with the GATK.
The standard reference sequence we use in the GATK is the b37 edition from the Human Genome Reference Consortium. All of the key GATK data files are available against this reference sequence. Additionally, we used to use UCSC-style (chr1, not 1) for build hg18, and provide lifted-over files from b37 to hg18 for those still using those files.
b37 resources: the standard data set
* Reference sequence (standard 1000 Genomes fasta) along with fai and dict files
<more, please follow link for details ...>
hg19 resources: lifted over from b37
* Includes the UCSC-style hg19 reference along with all lifted over VCF files."
Hopefully this helps,
Jen
Galaxy team
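For the primary chromosomes, b37 and hg19 differ essentially only in contig naming ('1' vs 'chr1'); the mitochondrial contig is a real exception (b37's 'MT' is the rCRS sequence while hg19's 'chrM' is an older, different sequence), so mitochondrial records need a proper liftover rather than a rename. A minimal Python sketch of the renaming step for a VCF (file names here are made up):

def b37_to_hg19(contig):
    # b37 "MT" nominally maps to hg19 "chrM", but the two sequences
    # differ, so renaming alone is wrong for mitochondrial records
    return "chrM" if contig == "MT" else "chr" + contig

with open("calls.b37.vcf") as src, open("calls.hg19.vcf", "w") as dst:
    for line in src:
        if line.startswith("#"):
            dst.write(line)  # header lines pass through unchanged
        else:
            chrom, rest = line.split("\t", 1)
            dst.write(b37_to_hg19(chrom) + "\t" + rest)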
Wednesday, 27 June 2012
Finding Waldo, a flag on the moon and multiple choice tests, with R - Freakonometrics
http://freakonometrics.blog.free.fr/index.php?post/2012/04/18/foundwaldo
Collapsing Methods for DNA-Sequence Analysis — SNP & Variation Suite v7.6.5 Documentation
http://doc.goldenhelix.com/SVS/latest/collapsing_methods.html
Statistical Tests for Detecting Rare Variants Using Variance-Stabilising Transformations.
Ann Hum Genet. 2012 Jun 25. doi: 10.1111/j.1469-1809.2012.00718.x. [Epub ahead of print]
Statistical Tests for Detecting Rare Variants Using Variance-Stabilising Transformations.
Wang K, Fingert JH.
Source: Department of Biostatistics, College of Public Health, The University of Iowa, Iowa City, IA, USA; Department of Ophthalmology and Visual Sciences, Carver College of Medicine, The University of Iowa, IA, USA.
Abstract: Next generation sequencing holds great promise for detecting rare variants underlying complex human traits. Due to their extremely low allele frequencies, the normality approximation for a proportion no longer works well. The Fisher's exact method appears to be suitable but it is conservative. We investigate the utility of various variance-stabilising transformations in single marker association analysis on rare variants. Unlike a proportion itself, the variance of the transformed proportions no longer depends on the proportion, making application of such transformations to rare variant association analysis extremely appealing. Simulation studies demonstrate that tests based on such transformations are more powerful than the Fisher's exact test while controlling for type I error rate. Based on theoretical considerations and results from simulation studies, we recommend the test based on the Anscombe transformation over tests with other transformations. © 2012 The Authors Annals of Human Genetics © 2012 Blackwell Publishing Ltd/University College London.
PMID: 22724536 [PubMed - as supplied by publisher]
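Out of curiosity, here is a rough sketch of what a single-marker test based on the Anscombe transformation could look like, using the classic arcsine form for a binomial proportion; this is my own illustration, not necessarily the paper's exact statistic.

from math import asin, sqrt
from scipy.stats import norm

def anscombe(x, n):
    # Anscombe arcsine transform; variance is approximately 1/(4n),
    # independent of the underlying proportion
    return asin(sqrt((x + 3.0 / 8.0) / (n + 3.0 / 4.0)))

def anscombe_test(x_case, n_case, x_ctrl, n_ctrl):
    # two-sided z-test comparing transformed minor-allele frequencies;
    # n_* are chromosome counts (2 x the number of individuals)
    z = (anscombe(x_case, n_case) - anscombe(x_ctrl, n_ctrl)) / sqrt(
        1.0 / (4 * n_case) + 1.0 / (4 * n_ctrl))
    return 2 * norm.sf(abs(z))

# e.g. 5 minor alleles on 1000 case chromosomes vs 1 on 1000 controls
print(anscombe_test(5, 1000, 1, 1000))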
Error-correcting properties of the SOLiD Exact Call Chemistry.
BMC Bioinformatics. 2012 Jun 22;13(1):145. [Epub ahead of print]
Error-correcting properties of the SOLiD Exact Call Chemistry.
Massingham T, Goldman N.
Abstract:
BACKGROUND: The Exact Call Chemistry for the SOLiD Next-Generation Sequencing platform augments the two-base-encoding chemistry with an additional round of ligation, using an alternative set of probes, that allows some mistakes made when reading the first set of probes to be corrected. Additionally, the Exact Call Chemistry allows reads produced by the platform to be decoded directly into nucleotide sequence rather than its two-base 'color' encoding.
RESULTS: We apply the theory of linear codes to analyse the new chemistry, showing the types of sequencing mistakes it can correct and identifying those where the presence of an error can only be detected. For isolated mistakes that cannot be unambiguously corrected, we show that the type of substitution can be determined, and its location can be narrowed down to two or three positions, leading to a significant reduction in the number of plausible alternative reads.
CONCLUSIONS: The Exact Call Chemistry increases the accuracy of the SOLiD platform, enabling many potential miscalls to be prevented. However, single miscalls in the color sequence can produce complex but localised patterns of error in the decoded nucleotide sequence. Analysis of similar codes shows that some exist that, if implemented in alternative chemistries, should have superior performance.
PMID: 22726842 [PubMed - as supplied by publisher]
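For background, SOLiD colours encode transitions between adjacent bases, which is why a single colour miscall corrupts every downstream base in the decoded read unless the extra ligation round catches it. A toy sketch of plain (non-error-correcting) colour-space decoding, using the usual XOR representation of the colour code:

_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASES = "ACGT"

def decode(primer, colors):
    # each colour is the XOR of the 2-bit codes of adjacent bases,
    # so decoding walks forward from the known primer base
    seq, prev = [], _BITS[primer]
    for c in colors:
        prev ^= int(c)
        seq.append(_BASES[prev])
    return "".join(seq)

print(decode("T", "320010"))  # toy colour read -> "AGGGTT"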
Sunday, 24 June 2012
Ray 2.0.0 codenamed "Dark Astrocyte of Knowledge" is available for download.
***************
Hello,
Ray 2.0.0 codenamed "Dark Astrocyte of Knowledge" is available for download.
This version ships with RayPlatform 1.0.3 codenamed "Gray Pylon of Wisdom".
Not much changed since v2.0.0-rc8.
Ray 2.0.0 can do de novo assembly of metagenomes and also taxonomic profiling
with k-mers.
To get Ray v2.0.0:
http://denovoassembler.sf.net
Also, there is a new section on the website for
frequently asked questions.
MyGenome for iPad on the iTunes App Store
The MyGenome app features:
•Actual genome of Illumina CEO, Jay Flatley (donated for educational purposes)
•Genome Map, Health Cards and Reports to explore the wealth of information that can be obtained through accessing the genome
•Video journey into the genome
Key Features:
Genome Map
•Tour the landscape of chromosomes and see how genetic variants in different locations translate into health impacts or biological traits.
•View individual genes, their locations, and biological impacts
•Visualize where and how genome sequences differ from the "reference" human genome
•Learn how much we understand about the variation in the human genome and how much more we have to learn
Health Cards
•Explore disease risks, genetically determined conditions and predispositions, and carrier traits
•Discover how different genetic variants contribute to health risks and can be passed on to children
•Find out how changes in the genome affect drug response
Reports
•Investigate the possible health impacts of genetic variants for > 200 conditions!
•See reports that illustrate how genetic information will likely be delivered in the future and used by medical professionals.
Soon, you and your physician will be able to sequence, download and explore your own genome. To learn more about individual genome sequencing or to find out about upcoming MyGenome app store releases, please visit www.everygenome.com.
Saturday, 23 June 2012
Chief Data Scientist at EMC in Singapore - Job | LinkedIn
a little sad/pointless though if the software is free for the single node edition ...
http://www.linkedin.com/jobs?viewJob=&jobId=3178913&trk=rj_em&ut=0-LnajpqbIVlg1
Chief Data Scientist
EMC - Singapore
Job Description
Responsibilities:
- Partner directly with APJ regional leadership and regional field, Greenplum Data Science leadership, and customers/prospects to establish a robust vision for the build-out of APJ's Data Science team.
- While managing existing team members, lead the recruiting and onboarding of a larger APJ regional Data Science team that addresses vertical and analytical knowledge requirements.
- Drive evangelization and education of Data Science services to Greenplum's APJ sales force, in particular educating the field on how to communicate the vision and value of advanced analytics, how to qualify interested prospects, and how to propose Data Science services.
- While working with customers and prospects, leverage significant experience directly working with data to define analytics use-cases that address customer requirements for value generation, and partner with Data Scientists to execute on these.
- Advise customers and prospects on technology and tool selection to best meet their emerging analytics requirements and to best drive value-generation on existing and future data.
- Lead relationship development and technology evaluation for new prospective regional analytics-centric partnerships.
- Work directly with customers to educate them on Greenplum's technologies, analytical use-cases, pros/cons of emerging tools, etc.
- Assist in customer engagement management, requirements definition, project scoping, timeline management, and results documentation to ensure professional relationship management with regional customers.
- Travel, as needed, to meet with customers (roughly 40-50%).
Desired Skills & Experience
- 5-10 years of experience and a proven passion for generating insights from data, with a strong familiarity with the higher-level trends in data growth, open-source platforms, and public data sets.
- A proven track record of building the function of data science, analytics as a service, or teams of data miners / machine-learning practitioners
- 5 to 10 years of experience in managing small to mid-sized teams, preferably in the services functions.
- Significant experience evangelizing the value of data analytics to broad audiences.
- At least 3 years of work in a related role within the APJ region, showing strong understanding of country-level industry players, vertical market trends, and status of data utilization.
- Strong knowledge of statistical methods generally, and particularly in the areas of modeling and business analytics
- Experience working with a variety of statistical languages and packages, including R, S-Plus, SAS and Matlab, and/or Mahout
- Experience working with relational databases and/or distributed computing platforms, and their query interfaces, such as SQL, MapReduce, PIG, and Hive.
- Preferably, experience working hands-on with large-scale data sets
- Familiarity with additional programming languages, including Python, Java, and C/C++.
- Experience leveraging visualization software and techniques (including Tableau), and business intelligence (BI) software, such as Microstrategy, Cognos, Pentaho, etc.
- Technical knowledge of distributed computing platforms, and common data process flows from data instrumentation & generation, to ETL, to the data warehouse itself.
- Advanced degree (PhD or Masters) in an analytical or technical field (e.g. applied mathematics, statistics, physics, computer science, operations research)
- A strong business-orientation, able to select the appropriate complex quantitative methodologies in response to specific business goals
- A team player, who is excited by and motivated by hard technical challenges
- Results-driven, self-motivated, self-starter
- Excellent written, verbal, and presentation skills in at least 1 key language relevant for APJ in addition to English
- Ability to travel as-needed to meet with customers, throughout the region.
Greenplum is setting the pace in the Big Data Analytics space. We are growing rapidly and providing solutions to major companies in the industry.
Company Description
EMC provides the technologies and tools that can help you release the power of your information. We can help you design, build, and manage flexible, scalable, and secure information infrastructures. And with these infrastructures, you'll be able to intelligently and efficiently store, protect, and manage your information so that it can be made accessible, searchable, shareable, and, ultimately, actionable. We believe that information is a business's most important asset. Ideas—and the people who come up with them—are the only real differentiator. Our promise is to help you take that differentiator as far as possible. We will deliver on this promise by helping organizations of all sizes manage more information more effectively than ever before. We will provide solutions that meet and exceed your most demanding business and IT challenges. We will bring your information to life.
Friday, 22 June 2012
Why You Should Care About Segmental Duplications | Our 2 SNPs…(R)
excellent blog post showing how segmental duplications can skew your CNV analysis & SNP calling. The latter was something I wasn't aware of ...
Excerpted ...
Alert followers of this blog may recall a cautionary statement I made previously about working with Illumina CNV data — that males and females sometimes have different baseline signal intensity levels (this was more of a GenomeStudio software issue than a hardware problem).
To find out if this issue affects the Omni2.5, I ran a simple t-test to compare the Log-R Ratio (LRR) intensity values between males and females across the genome. The results are shown in the Manhattan Plot below.
http://blog.goldenhelix.com/?p=1153&_cldee=ZXBobGt5QG51cy5lZHUuc2c%3d
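The check itself is easy to reproduce; a minimal sketch with a hypothetical markers-by-samples matrix of LRR values:

import numpy as np
from scipy.stats import ttest_ind

def sex_bias_pvalues(lrr, is_male):
    # one t-test per marker: male vs female Log-R Ratio means
    t, p = ttest_ind(lrr[:, is_male], lrr[:, ~is_male], axis=1)
    return p

# toy data: 1000 markers x 200 samples with no real sex effect
lrr = np.random.normal(0.0, 0.2, size=(1000, 200))
is_male = np.arange(200) < 100
print(sex_bias_pvalues(lrr, is_male)[:5])

Plotting -log10(p) across the genome gives exactly the kind of Manhattan plot the post describes.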
Thursday, 21 June 2012
BMC Genomics| Identification and validation of copy number variants using SNP genotyping arrays from a large clinical cohort
BMC Genomics 2012, 13:241 doi:10.1186/1471-2164-13-241
Published: 15 June 2012
Abstract (provisional)
Background
Genotypes obtained with commercial SNP arrays have been extensively used in many large case-control or population-based cohorts for SNP-based genome-wide association studies for a multitude of traits. Yet, these genotypes capture only a small fraction of the variance of the studied traits. Genomic structural variants (GSV) such as Copy Number Variation (CNV) may account for part of the missing heritability, but their comprehensive detection requires either next-generation arrays or sequencing. Sophisticated algorithms that infer CNVs by combining the intensities from SNP-probes for the two alleles can already be used to extract a partial view of such GSV from existing data sets.
Results
Here we present several advances to facilitate the latter approach. First, we introduce a novel CNV detection method based on a Gaussian Mixture Model. Second, we propose a new algorithm, PCA merge, for combining copy-number profiles from many individuals into consensus regions. We applied both our new methods as well as existing ones to data from 5612 individuals from the CoLaus study who were genotyped on Affymetrix 500K arrays. We developed a number of procedures in order to evaluate the performance of the different methods. This includes comparison with previously published CNVs as well as using a replication sample of 239 individuals, genotyped with Illumina 550K arrays. We also established a new evaluation procedure that employs the fact that related individuals are expected to share their CNVs more frequently than randomly selected individuals. The ability to detect both rare and common CNVs provides a valuable resource that will facilitate association studies exploring potential phenotypic associations with CNVs.
Conclusion
Our new methodologies for CNV detection and their evaluation will help in extracting additional information from the large amount of SNP-genotyping data on various cohorts and use this to explore structural variants and their impact on complex traits.
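To make the Gaussian mixture idea concrete, here is a toy sketch using scikit-learn, fitting three components to log intensity ratios and reading copy number off the component means; the paper's actual model is more sophisticated, so treat this as illustration only.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
log_ratios = rng.normal(0.0, 0.15, size=(500, 1))  # toy probe data

gmm = GaussianMixture(n_components=3, random_state=0).fit(log_ratios)
states = gmm.predict(log_ratios)  # mixture component per probe
# rank components by fitted mean: lowest ~ deletion, highest ~ gain
order = np.argsort(gmm.means_.ravel())
copy_number = {int(comp): cn for cn, comp in enumerate(order, start=1)}
print([copy_number[int(s)] for s in states[:10]])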
Introduction to R: slides for a 5-day course at King's College London, 2011
http://rcourse.iop.kcl.ac.uk/2011/
saw this @ www.biostars.org
Tuesday, 19 June 2012
Two-Stage Extreme Phenotype Sequencing Design for Discovering and Testing Common and Rare Genetic Variants: Efficiency and Power.
Hum Hered. 2012 Jun 7;73(3):139-147. [Epub ahead of print]
Two-Stage Extreme Phenotype Sequencing Design for Discovering and Testing Common and Rare Genetic Variants: Efficiency and Power.
Kang G, Lin D, Hakonarson H, Chen J.
Source: Department of Biostatistics and Epidemiology, University of Pennsylvania, Philadelphia, Pa., USA.
Abstract: Next-generation sequencing technology provides an unprecedented opportunity to identify rare susceptibility variants. It is not yet financially feasible to perform whole-genome sequencing on a large number of subjects, and a two-stage design has been advocated to be a practical option. In stage I, variants are discovered by sequencing the whole genomes of a small number of carefully selected individuals. In stage II, the discovered variants of a large number of individuals are genotyped to assess associations. Individuals with extreme phenotypes are typically selected in stage I. Using simulated data for unrelated individuals, we explore two important aspects of this two-stage design: the efficiency of discovering common and rare single-nucleotide polymorphisms (SNPs) in stage I and the impact of incomplete SNP discovery in stage I on the power of testing associations in stage II. We applied a sum test and a sum of squared score test for gene-based association analyses evaluating the power of the two-stage design. We obtained the following results from extensive simulation studies and analysis of the GAW17 dataset. When individuals with trait values more extreme than the 99.7-99th quantile were included in stage I, the two-stage design could achieve the same power as or even higher than the one-stage design if the rare causal variants had large effect sizes. In such design, fewer than half of the total SNPs including more than half of the causal SNPs were discovered, which included nearly all SNPs with minor allele frequencies (MAFs) ≥5%, more than half of the SNPs with MAFs between 1% and 5%, and fewer than half of the SNPs with MAFs <1%. Although a one-stage design may be preferable to identify multiple rare variants having small to moderate effect sizes, our observations support using the two-stage design as a cost-effective option for next-generation sequencing studies. Copyright © 2012 S. Karger AG, Basel.
PMID: 22678112 [PubMed - as supplied by publisher]
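The stage-I selection rule described in the abstract is simple to sketch: pick individuals whose trait values fall beyond extreme quantiles for sequencing (the cut-offs here are illustrative):

import numpy as np

def select_extremes(trait, lower_q=0.01, upper_q=0.99):
    # indices of individuals in the tails of the trait distribution
    lo, hi = np.quantile(trait, [lower_q, upper_q])
    return np.where((trait <= lo) | (trait >= hi))[0]

trait = np.random.normal(size=10000)
print(len(select_extremes(trait)))  # ~200 individuals go to stage I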
Fwd: [Velvet-users] Velvet 1.2.07
Velvet 1.2.07 is now available on github or at www.ebi.ac.uk/~zerbino/velvet/velvet_latest.tgz.
In it:
- David Powell added file format option '-fmtAuto' to auto-detect compression (using gunzip/bunzip2) and format (only FastA or FastQ for now).
- Yasubumi Sakakibara and Tsuyoshi Hachiya updated MetaVelvet
- I silenced a bug in unit testing spotted by Nathan Weeks
- A compilation bug was corrected.
- I corrected a memory compilation bug reported by @thakki
Regards,
Daniel
_______________________________________________
Velvet-users mailing list
http://listserver.ebi.ac.uk/mailman/listinfo/velvet-users
Monday, 18 June 2012
Caution in Interpreting Results from Imputation Analysis When Linkage Disequilibrium Extends over a Large Distance: A Case Study on Venous Thrombosis.
PLoS One. 2012;7(6):e38538. Epub 2012 Jun 4.
Caution in Interpreting Results from Imputation Analysis When Linkage Disequilibrium Extends over a Large Distance: A Case Study on Venous Thrombosis.
Germain M, Saut N, Oudot-Mellakh T, Letenneur L, Dupuy AM, Bertrand M, Alessi MC, Lambert JC, Zelenika D, Emmerich J, Tiret L, Cambien F, Lathrop M, Amouyel P, Morange PE, Trégouët DA.
Source: INSERM UMR_S 937, ICAN Institute, Université Pierre et Marie Curie, Paris, France.
Abstract: By applying an imputation strategy based on the 1000 Genomes project to two genome-wide association studies (GWAS), we detected a susceptibility locus for venous thrombosis on chromosome 11p11.2 that was missed by previous GWAS analyses that had been conducted on the same datasets. A comprehensive linkage disequilibrium and haplotype analysis of the whole locus, where twelve SNPs exhibited association p-values lower than 2.23 × 10^-11, and the use of independent case-control samples demonstrated that the culprit variant was a rare variant located ∼1 Mb away from the original hits, not tagged by current genome-wide genotyping arrays and not even well imputed in the original GWAS samples. This variant was in fact rs1799963, also known as the FII G20210A prothrombin mutation. This work may be of major interest not only for its scientific impact but also for its methodological findings.
PMID: 22675575 [PubMed - in process]
Sunday, 17 June 2012
Two Adaptive Weighting Methods to Test for Rare Variant Associations in Family-Based Designs - Fang - 2012 - Genetic Epidemiology - Wiley Online Library
Keywords:
- family-based design;
- rare variants;
- adaptive weights;
- quantitative traits
Although next-generation DNA sequencing technologies have made rare variant association studies feasible and affordable, the development of powerful statistical methods for rare variant association studies is still under way. Most of the existing methods for rare variant association studies compare the number of rare mutations in a group of rare variants (in a gene or a pathway) between cases and controls. However, these methods assume that all causal variants are risk to diseases. Recently, several methods that are robust to the direction and magnitude of effects of causal variants have been proposed. However, they are applicable to unrelated individuals only, whereas family data have been shown to improve power to detect rare variants. In this article, we propose two adaptive weighting methods for rare variant association studies based on family data for quantitative traits. Using extensive simulation studies, we evaluate and compare our proposed methods with two methods based on the weights proposed by Madsen and Browning. Our results show that both proposed methods are robust to population stratification, robust to the direction and magnitude of the effects of causal variants, and more powerful than the methods using weights suggested by Madsen and Browning, especially when both risk and protective variants are present. Genet. Epidemiol. 36:499-507, 2012. © 2012 Wiley Periodicals, Inc.
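For reference, the Madsen and Browning weighting that the authors compare against down-weights each variant by its estimated control allele frequency, so rarer variants contribute more to a per-person burden score. A quick sketch from my reading of the original method (not this paper's family-based extension):

import numpy as np

def madsen_browning_scores(genotypes, is_control):
    # genotypes: individuals x variants matrix of minor-allele counts
    # (0/1/2); returns one weighted burden score per individual
    ctrl = genotypes[is_control]
    n_u = ctrl.shape[0]
    q = (ctrl.sum(axis=0) + 1.0) / (2.0 * n_u + 2.0)  # smoothed freq
    w = np.sqrt(n_u * q * (1.0 - q))
    return (genotypes / w).sum(axis=1)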
Biases and Errors on Allele Frequency Estimation and Disease Association Tests of Next-Generation Sequencing of Pooled Samples - Chen - 2012 - Genetic Epidemiology - Wiley Online Library
Detection of identity by descent using next-generation whole genome sequencing data
Background
Identity by descent (IBD) has played a fundamental role in the discovery of genetic loci underlying human diseases. Both pedigree-based and population-based linkage analyses rely on estimating recent IBD, and evidence of ancient IBD can be used to detect population structure in genetic association studies. Various methods for detecting IBD, including those implemented in the software programs fastIBD and GERMLINE, have been developed in the past several years using population genotype data from microarray platforms. Now, next-generation DNA sequencing data is becoming increasingly available, enabling the comprehensive analysis of genomes, including identifying rare variants. These sequencing data may provide an opportunity to detect IBD with higher resolution than previously possible, potentially enabling the detection of disease causing loci that were previously undetectable with sparser genetic data.
Results
Here, we investigate how different levels of variant coverage in sequencing and microarray genotype data influences the resolution at which IBD can be detected. This includes microarray genotype data from the WTCCC study, denser genotype data from the HapMap Project, low coverage sequencing data from the 1000 Genomes Project, and deep coverage complete genome data from our own projects. With high power (78%), we can detect segments of length 0.4 cM or larger using fastIBD and GERMLINE in sequencing data. This compares to similar power to detect segments of length 1.0 cM or higher with microarray genotype data. We find that GERMLINE has slightly higher power than fastIBD for detecting IBD segments using sequencing data, but also has a much higher false positive rate.
Conclusion
We further quantify the effect of variant density, conditional on genetic map length, on the power to resolve IBD segments. These investigations into IBD resolution may help guide the design of future next generation sequencing studies that utilize IBD, including family-based association studies, association studies in admixed populations, and homozygosity mapping studies.
Thursday, 14 June 2012
Fwd: Family-based association studies for next-generation sequencing.
Am J Hum Genet. 2012 Jun 8;90(6):1028-45.
Family-based association studies for next-generation sequencing.
Zhu Y, Xiong M.
Source: Human Genetics Center and Division of Biostatistics, The University of Texas School of Public Health, Houston, TX 77030, USA.
Abstract: An individual's disease risk is determined by the compounded action of both common variants, inherited from remote ancestors, that segregated within the population and rare variants, inherited from recent ancestors, that segregated mainly within pedigrees. Next-generation sequencing (NGS) technologies generate high-dimensional data that allow a nearly complete evaluation of genetic variation. Despite their promise, NGS technologies also suffer from remarkable limitations: high error rates, enrichment of rare variants, and a large proportion of missing values, as well as the fact that most current analytical methods are designed for population-based association studies. To meet the analytical challenges raised by NGS, we propose a general framework for sequence-based association studies that can use various types of family and unrelated-individual data sampled from any population structure and a universal procedure that can transform any population-based association test statistic for use in family-based association tests. We develop family-based functional principal-component analysis (FPCA) with or without smoothing, a generalized T(2), combined multivariate and collapsing (CMC) method, and single-marker association test statistics. Through intensive simulations, we demonstrate that the family-based smoothed FPCA (SFPCA) has the correct type I error rates and much more power to detect association of (1) common variants, (2) rare variants, (3) both common and rare variants, and (4) variants with opposite directions of effect from other population-based or family-based association analysis methods. The proposed statistics are applied to two data sets with pedigree structures. The results show that the smoothed FPCA has a much smaller p value than other statistics. Copyright © 2012 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
PMID: 22682329 [PubMed - in process]
SEQuel: improving the accuracy of genome assemblies.
Bioinformatics. 2012 Jun 15;28(12):i188-i196.
SEQuel: improving the accuracy of genome assemblies.
Ronen R, Boucher C, Chitsaz H, Pevzner P.
Source
Bioinformatics Graduate Program, Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093 and Department of Computer Science, Wayne State University, Detroit, MI 48202, USA.
Abstract
MOTIVATION:
Assemblies of next-generation sequencing (NGS) data, although accurate, still contain a substantial number of errors that need to be corrected after the assembly process. We develop SEQuel, a tool that corrects errors (i.e. insertions, deletions and substitution errors) in the assembled contigs. Fundamental to the algorithm behind SEQuel is the positional de Bruijn graph, a graph structure that models k-mers within reads while incorporating the approximate positions of reads into the model.
RESULTS:
SEQuel reduced the number of small insertions and deletions in the assemblies of standard multi-cell Escherichia coli data by almost half, and corrected between 30% and 94% of the substitution errors. Further, we show SEQuel is imperative to improving single-cell assembly, which is inherently more challenging due to higher error rates and non-uniform coverage; over half of the small indels and substitution errors in the single-cell assemblies were corrected. We apply SEQuel to the recently assembled Deltaproteobacterium SAR324 genome, which is the first bacterial genome with a comprehensive single-cell genome assembly, and make over 800 changes (insertions, deletions and substitutions) to refine this assembly.
AVAILABILITY:
SEQuel can be used as a post-processing step in combination with any NGS assembler and is freely available at http://bix.ucsd.edu/SEQuel/.
CONTACT:
PMID: 22689760 [PubMed - in process]
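The positional de Bruijn graph is the neat idea here: nodes are (k-mer, approximate position) pairs rather than bare k-mers, so identical k-mers from different repeat copies stay distinct. A toy sketch of the construction (binning positions is my simplification, not necessarily SEQuel's exact scheme):

from collections import defaultdict

def positional_dbg(reads_with_pos, k=31, bin_size=10):
    # reads_with_pos: iterable of (read, approximate start) pairs,
    # e.g. taken from each read's alignment to the draft contigs
    edges = defaultdict(int)
    for read, start in reads_with_pos:
        for i in range(len(read) - k):
            u = (read[i:i + k], (start + i) // bin_size)
            v = (read[i + 1:i + 1 + k], (start + i + 1) // bin_size)
            edges[(u, v)] += 1  # multiplicity supports consensus calls
    return edges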
Wednesday, 13 June 2012
Bioinformatician "Data Finish & Quality Control" - SEQanswers
Requirements
- Master's degree in Bioinformatics or related discipline with at least 1 year of experience, or Bachelor's degree with 3+ years of practical experience (I really do feel that a Master's doesn't make someone a better bioinformatician, but I assume you want a person with 3 years of experience, 2 of which may come from an MSc, and that you don't wish to hire a PhD)
- Very able in Unix, Linux and Windows operating systems and networks (are you discounting my Mac skills?)
- In depth experience with Unix shell scripts and Python
- Excellent teamwork, organizational and communicative skills
- Good spoken and written English
Desirable
- Experience in genomic data analysis
- Proficiency in SQL. (I suppose this is for the LIMS)
- Knowledge of Perl, R and/or C/C++ (ah, C++ is optional! I'm glad my C++ skills have slowly atrophied to non-existence)
- Experience with batch processing on a cluster
- Experience with next generation sequencing data
- Experience in a high-tech organization of several interacting, specialized teams
Responsibilities
- Ongoing development of the quality control pipeline on a computer cluster so as to assure automated continuous operation and integration with LIMS (Lab Information Management System)
- Maintenance of an automated system for determining sequence quality based on signal quality, alignment to reference genomes and other measurements
- Management, analysis and quality control of large amounts of sequence data
- Efficient testing and integration into the production pipeline of new software as it becomes available
- Development of new software and IT solutions for the fast-changing field of DNA sequencing
- Communication and troubleshooting with sequencing and bioinformatics groups
Tuesday, 12 June 2012
Elements of Bioinformatics
http://elements.eaglegenomics.com/
Sunday, 10 June 2012
[Biopython] EU-codefest
---------- Forwarded message ----------
From: "Peter Cock"
Date: Jun 10, 2012 6:25 PM
Subject: [Biopython] EU-codefest
Dear Biopythoneers,
Some of you might like to attend an Open-Bio Hackathon in Italy this
summer - 19 and 20 July 2012, in Lodi.
This is about a week after BOSC and the pre-BOSC CodeFest in California
http://www.open-bio.org/wiki/BOSC_2012
Peter
---------- Forwarded message ----------
From: *Pjotr Prins*
Date: Saturday, June 9, 2012
Subject: EU-codefest
Hi Chris and Peter,
Would you mind sending a reminder of the EU-codefest to your lists?
Registration form is up:
http://www.open-bio.org/wiki/EU_Codefest_2012
Three main topics will be worked on during the CodeFest:
NGS and high performance parsers for OpenBio projects.
RDF and semantic web for bioinformatics.
Bioinformatics pipelines definition, execution and distribution.
other tracks are welcome!
Pj.
_______________________________________________
Biopython mailing list -
http://lists.open-bio.org/mailman/listinfo/biopython
An Explanation of Velvet Parameter exp_cov | Homologus
Appropriate choice of the 'exp_cov' (expected coverage) parameter in Velvet is very important to get an assembly right. In the following figure, we show data from a calculation on a set of reads taken from a 3Kb region of a genome, reassembled with varying exp_cov parameters. The x-axis in the chart shows exp_cov and the y-axis shows the size of the largest scaffold assembled by Velvet.
http://www.homolog.us/blogs/2012/06/08/an-explanation-of-velvet-parameter-exp_cov/
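The usual gotcha is that exp_cov is k-mer coverage, not base coverage, which is lower by a factor of (L - k + 1)/L for read length L. A back-of-envelope sketch (the numbers are made up):

def kmer_coverage(n_reads, read_len, genome_size, k):
    # convert base coverage to the k-mer coverage Velvet works in
    base_cov = n_reads * read_len / genome_size
    return base_cov * (read_len - k + 1) / read_len

# e.g. 10 million 100 bp reads over a 5 Mb genome at k = 31
print(round(kmer_coverage(10_000_000, 100, 5_000_000, 31), 1))  # 140.0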
Friday, 8 June 2012
Lampreys delete 20% of their genome
Best Things in Life are Free: cost of NGS analysis & open source
Thursday, 7 June 2012
Released VAGUE 1.0 - a JVM-based GUI front-end for Velvet
VAGUE is a GUI for Velvet. It is written in JRuby but compiled to Java
bytecode and will run on Mac and Linux. You need to have the latest
Velvet binaries (>= 1.2.06), as David has made improvements to Velvet
to make VAGUE simpler to use. You can optionally install velvetk.pl,
which I announced recently on this list.
You can look at screenshots and download it from here:
http://bioinformatics.net.au/software.vague.shtml
Enjoy!
--
--Dr Torsten Seemann
--Scientific Director : Victorian Bioinformatics Consortium, Monash
University, AUSTRALIA
--Senior Researcher : VLSCI Life Sciences Computation Centre,
Parkville, AUSTRALIA
--http://www.bioinformatics.net.au/
_______________________________________________
Velvet-users mailing list
http://listserver.ebi.ac.uk/mailman/listinfo/velvet-users
Seagate GoFlex Desk Thunderbolt Adapter Review | StorageReview.com - Storage Reviews
With technology like Aspera and multi-part S3 upload, I wonder how many people are still using portable HDDs to transfer data. My lab has raw sequence data archived on GoFlex HDDs, coincidentally. Getting data out fast is a pain when you have it split across ten drives.
It would be cool to have this adaptor to speed things up though!
But these being SATA drives, as mentioned, I think you can only reach USB 3-class speeds. So even if I were to plug a disk into a Mac via Thunderbolt instead of pulling the data over wired LAN, the speed-up might only be about twofold.
For the price, I wonder if getting a small NAS with USB ports to back up the entire portable HDD, then using cron to pull the data to central storage, might be better.
The cool thing about the adaptor is that it really is just a SATA-to-Thunderbolt adaptor. If you have a GoFlex drive, you will know what I mean.
It would be cool if someone made a SATA RAID mirror with a Thunderbolt output adaptor! Then you wouldn't have to use SSDs to achieve higher speeds.
Sent from my iPad
Genomic Dark Matter: The reliability of short read mapping illustrated by the Genome Mappability Score
Motivation: Genome resequencing and short read mapping are two of the primary tools of genomics and are used for many important applications. The current state-of-the-art in mapping uses the quality values and mapping quality scores to evaluate the reliability of the mapping. These attributes, however, are assigned to individual reads and don't directly measure the problematic repeats across the genome. Here we present the Genome Mappability Score (GMS) as a novel measure of the complexity of resequencing a genome. The GMS is a weighted probability that any read could be unambiguously mapped to a given position, and thus measures the overall composition of the genome itself.
Results: We have developed the Genome Mappability Analyzer (GMA) to compute the GMS of every position in a genome. It leverages the parallelism of cloud computing to analyze large genomes, and enabled us to identify the 5-14% of the human, mouse, fly, and yeast genomes that are difficult to analyze with short reads. We examined the accuracy of the widely used BWA/SAMtools polymorphism discovery pipeline in the context of the GMS, and found discovery errors are dominated by false negatives, especially in regions with poor GMS. These errors are fundamental to the mapping process and cannot be overcome by increasing coverage. As such, the GMS should be considered in every resequencing project to pinpoint the dark matter of the genome, including known clinically relevant variants in these regions.
Availability: The source code and profiles of several model organisms are available at http://gma-bio.sourceforge.net
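The underlying idea is easy to prototype: for each position, ask whether the k-mers there occur exactly once in the genome. A toy sketch (the real GMS additionally weights by base qualities and alignment probabilities, and averages over all k-mers covering a position rather than just the one starting there):

from collections import Counter

def simple_mappability(genome, k=36):
    # 1.0 where the k-mer starting at a position is unique, else 0.0
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [1.0 if counts[km] == 1 else 0.0 for km in kmers]

print(simple_mappability("ACGTACGTTTACGTACGAAT", k=4)[:5])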
Wednesday, 6 June 2012
14TB of liver cancer genome data available in GigaDB
YT plugging a recent ACRG collaboration with BGI. 14TB of liver cancer genome data from this study is available in GigaDB.
http://gigadb.org/hepatocellular-carcinoma/
Genomic DNA was purified for at least 30-fold coverage paired-end (PE) sequencing, and PE reads were mapped on human reference genome (UCSC build hg19) and HBV (NC_003977). Two sequencing libraries with different insert size were constructed for each genomic DNA sample (200bp and 800bp). Paired end, 90bp read length sequencing was performed in the HiSeq 2000 sequencer according to the manufacturer's instructions. Raw gene expression profiling data of these human HCC samples have been deposited to GEO with the accession number GSE25097.
Raw data
History
May 31, 2012: Data released.
In accordance with our terms of use, please cite this dataset as:
Kan, Z; Zheng, H; Liu, X; Li, S; Barber, TD; Gong, Z; Gao, H; Hao, K; Willard, MD; Xu, J; Hauptschein, R; Rejto, PA; Fernandez, J; Wang, G; Zhang, Q; Wang, B; Chen, R; Wang, J; Lee, NP; Lee, WH; Ariyaratne, PN; Tennakoon, C; Mulawadi, FH; Wong, KF; Liu, AM; Chan, KL; Hu, Y; Chou, WC; Buser, C; Zhou, W; Lin, Z; Peng, Z; Yi, K; Chen, S; Li, L; Fan, X; Yang, J; Ye, R; Ju, J; Wang, K; Estrella, H; Deng, S; Wulur, IH; Liu, J; Ehsani, ME; Zhang, C; Loboda, A; Sung, WK; Aggarwal, A; Poon, RT; Fan, ST; Wang, J; Hardwick, J; Reinhard, C; Dai, H; Li, Y; Luk, JM; Mao, M; the Asian Cancer Research Group (2012): Hepatocellular carcinoma genomic data from the Asia Cancer Research Group. GigaScience. http://dx.doi.org/10.5524/100034
Related manuscript available at:
Accession codes associated with this data:
EMBL-EBI ENA ERP001196
MacBookPro - Debian Wiki
Fwd: [Velvet-users] velvetk.pl - choose a good k-value for your genome automatically
Hi all,
I have written a simple script to choose (or list) good k-values for
YOUR data with YOUR genome.
It's called velvetk.pl and it needs two things:
(1) the target genome size (can supply a number, e.g. 4.8M) or a fasta file of a close reference
(2) your read files (fasta/fastq and uncompressed/bzip2/gzip should work)
Example uses might be:
# For manual examination
% velvetk.pl --size 3.8M reads.fastq morereads.fa.gz morereads.fq.bz2 paired.fa
K #Kmers Kmer-Cov
91 34649310 34.6
93 27719448 27.7
95 20789586 20.8
# For automated scripts
% velvetk.pl --genome Ecoli.fna --best reads.fastq morereads.fa.gz morereads.fq.bz2
93
You can download it from here:
http://bioinformatics.net.au/software.velvetk.shtml
If it is deemed to work well, then we will aim to:
1. incorporate it as "velvetk" in the Velvet distribution
2. rewrite in "C" if needed
3. add a new "auto" option instead of a fixed k-value in velveth.
--
--Dr Torsten Seemann
--Scientific Director : Victorian Bioinformatics Consortium, Monash
University, AUSTRALIA
--Senior Researcher : VLSCI Life Sciences Computation Centre,
Parkville, AUSTRALIA
--http://www.bioinformatics.net.au/
_______________________________________________
Velvet-users mailing list