Rapid protein alignment in the cloud: HAMOND combines fast DIAMOND alignments with Hadoop parallelism
Here we introduce HAMOND, an application that uses Apache Hadoop to parallelize DIAMOND computation in order to scale out the calculation of alignments. HAMOND is fault tolerant and scales by utilizing large cloud computing infrastructures such as Amazon Web Services. HAMOND has been tested in comparative genomics analyses and showed promising results in both efficiency and accuracy.
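As a rough illustration of the scale-out idea, the sketch below splits a protein FASTA query set into fixed-size chunks that independent workers could each align against the same reference database. The abstract does not describe how HAMOND actually partitions its input, so the function names (`parse_fasta`, `chunk_records`) and the fixed chunk size are assumptions made for this example only.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def parse_fasta(text: str) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    seq_parts: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(seq_parts)
            header, seq_parts = line[1:], []
        elif header is not None:
            seq_parts.append(line)
    if header is not None:
        yield header, "".join(seq_parts)


def chunk_records(records: Iterable[Record], chunk_size: int) -> Iterator[List[Record]]:
    """Group records into lists of at most chunk_size (hypothetical split policy)."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk: List[Record] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        # Emit the final, possibly shorter, chunk.
        yield chunk
```

Because each query sequence is aligned independently of the others, the per-chunk results can simply be concatenated afterwards, and a failed chunk can be rerun on its own without repeating the rest.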
Vice versa, singleton genes can be identified to elucidate the specific properties of an individual genome. Since initial publication, the EDGAR platform has become one of the most established software tools in the field of comparative genomics.
Over the last years, the software has been continuously improved and a large number of new analysis features have been added. For the new version, EDGAR 2.0, the gene orthology estimation approach was newly designed and completely re-implemented. Among other new features, EDGAR 2.0 provides extended phylogenetic analysis features like AAI (Average Amino Acid Identity) and ANI (Average Nucleotide Identity) matrices, genome set size statistics and modernized visualizations like interactive synteny plots or Venn diagrams. Thereby, the software supports a quick and user-friendly survey of evolutionary relationships between microbial genomes and simplifies the process of obtaining new biological insights into their differential gene content.
All features are offered to the scientific community via a web-based and therefore platform-independent user interface, which allows easy browsing of precomputed datasets.
The web server is accessible at http://edgar.computational.bio.
Blom J, Kreis J, Spänig S, Juhre T, Bertelli C, Ernst C, Goesmann A (2016)
EDGAR 2.0: an enhanced software platform for comparative gene content analyses.
Nucleic Acids Research
Short DNA motifs are involved in a multitude of functions such as chromosome segregation, DNA replication or mismatch repair. The distribution of such motifs is often not random, and the specific chromosomal pattern relates to the respective motif function. Computational approaches that quantitatively assess such chromosomal motif patterns are therefore necessary. Here we present a new computer tool, DistAMo (Distribution Analysis of DNA Motifs). The algorithm uses codon redundancy to calculate the relative abundance of short DNA motifs from single genes to entire chromosomes. Comparative genomics analyses of the GATC-motif distribution in γ-proteobacterial genomes using DistAMo revealed that (i) genes beside the replication origin are enriched in GATCs, (ii) genome-wide GATC distribution follows a distinct pattern, and (iii) genes involved in DNA replication and repair are enriched in GATCs. These features are specific for bacterial chromosomes encoding a Dam methyltransferase. The new software is available as a stand-alone version or as an easy-to-use web-based server version at this link.
Sobetzko P, Jelonek L, Strickert M, Han W, Goesmann A, Waldminghaus T (2016)
DistAMo: A web-based tool to characterize DNA-motif distribution on bacterial chromosomes.
Motivation: Fast algorithms and well-arranged visualizations are required for the comprehensive analysis of the ever-growing size of genomic and transcriptomic next generation sequencing (NGS) data.
Results: ReadXplorer is a software tool offering straightforward visualization and extensive analysis functions for genomic and transcriptomic DNA sequences mapped on a reference. A unique specialty of ReadXplorer is the quality classification of the read mappings. It is incorporated in all analysis functions and displayed in ReadXplorer's various synchronized data viewers for (i) the reference sequence, its base coverage as (ii) normalizable plot and (iii) histogram, (iv) read alignments and (v) read pairs. ReadXplorer's analysis capability covers RNA secondary structure prediction, single nucleotide and deletion-insertion polymorphism (SNP and DIP) detection, genomic feature and general coverage analysis. Especially for RNA-Seq data, it offers differential gene expression analysis, transcription start site (TSS) and operon detection as well as RPKM value and read count calculations. Furthermore, ReadXplorer can combine or superimpose coverage of different data sets.
Hilker, R., Stadermann, K.B., Doppmeier, D., Kalinowski, J., Stoye, J., Straube, J., Winnebald, J., Goesmann, A. (2014) ReadXplorer - Visualization and Analysis of Mapped Sequences. Bioinformatics, btu205.
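For readers unfamiliar with the RPKM values mentioned in the ReadXplorer abstract, the following minimal sketch computes the commonly used definition: reads per kilobase of feature length per million mapped reads. It is not taken from ReadXplorer; the function name and parameters are assumptions for illustration, and how ReadXplorer selects the reads it counts (for example by mapping quality class) is not specified in the abstract.

```python
def rpkm(read_count: int, feature_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase of feature per Million mapped reads.

    Standard definition: read_count / (length_in_kb * total_reads_in_millions),
    which simplifies to read_count * 1e9 / (length_bp * total_reads).
    """
    if feature_length_bp <= 0:
        raise ValueError("feature_length_bp must be positive")
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2,000 bp gene in a library of
# 10 million mapped reads.
example_value = rpkm(500, 2_000, 10_000_000)
```

Normalizing by both feature length and library size makes expression values comparable between genes of different lengths and between sequencing runs of different depths.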
In recent years, the number of published genome sequences has increased substantially owing to major developments in next-generation sequencing (NGS) technologies, concomitant reduction of sequencing costs and improvements in assembly strategies. In 2011, the genome of Chinese hamster ovary (CHO)-K1 cells, the most frequently used mammalian production cell line for biopharmaceutical products, was published. In this issue, the genomes of several related CHO cell lines as well as the genome of the Chinese hamster are also presented. Although this information provides long-awaited and necessary insights for scientists working with these important production hosts, it also highlights a major drawback of short-read NGS technology, namely, the difficulty of assembling short-read data and scaffolding these sequences into a fully structured genome. This is especially critical for CHO cells, which are known to be genomically unstable, with frequent chromosome rearrangements and loss. In this correspondence to Nature Biotechnology, we describe how a chromosome sorting approach can facilitate genome assembly from short-read sequences.
Brinkrolf, K., Rupp, O., Laux, H., Kollin, F., Ernst, W., Linke, B., Kofler, R., Romand, S., Hesse, F., Budach, W. E., Galosy, S., Müller, D., Noll, T., Wienberg, J., Jostock, T., Leonard, M., Grillari, J., Tauch, A., Goesmann, A., Helk, B., Mott, J.E., Pühler, A., Borth, N. (2013).
Chinese hamster genome sequenced from sorted chromosomes.
Nature Biotechnology 31, 694–695.
The research area of metabolomics has achieved tremendous popularity and development in the last couple of years. Owing to its unique interdisciplinarity, it requires combining knowledge from various scientific disciplines. Advances in high-throughput technology and the consequently growing quality and quantity of data put new demands on applied analytical and computational methods. Exploration of the generated and analyzed datasets furthermore relies on powerful tools for data mining and visualization.
To cover and keep up with these requirements, we have created MeltDB 2.0, a next-generation web application addressing storage, sharing, standardization, integration and analysis of metabolomics experiments. New features improve both the efficiency and the effectiveness of the entire processing pipeline for chromatographic raw data, from pre-processing to the derivation of new biological knowledge. First, the generation of high-quality metabolic datasets has been vastly simplified. Second, the new statistics toolbox allows these datasets to be investigated according to a wide spectrum of scientific and explorative questions.
The system is publicly available at https://meltdb.cebitec.uni-bielefeld.de. A login is required but freely available.
Kessler, N., Bonte, A., Langenkämper, G., Niehaus, K., Goesmann, A., & Nattkemper, T.W. In Press. “MeltDB 2.0 - Advances of the metabolomics software system”. Bioinformatics 29(19).
For about two years, affordable Next Generation Sequencing (NGS) machines with a fast turnaround time have been available on the market. A team led by the University of Münster has now compared three different benchtop NGS platforms and how they evolved over time. More specifically, the consortium, comprising researchers from the Universities of Münster and Bielefeld, the Alfred Wegener Institute Bremerhaven (all based in Germany) and the University of Vienna in Austria, challenged the GS Junior (Roche; Titanium 400 base-pair [bp] chemistry), MiSeq (Illumina; 2x 150bp & 2x 250bp paired-end consumables) and PGM (Ion Torrent; 100bp, 200bp, 300bp & 400bp kits) with bacterial whole genome sequencing. Discrepancies to a high-quality reference genome sequence were furthermore clarified by traditional bidirectional Sanger sequencing.
The team found that the MiSeq made a very strong official debut, with only very few substitution errors and no insertion or deletion (indel) errors at consensus level. The GS Junior had by far the lowest throughput, making it more costly to operate than the other two platforms. The PGM evolved rapidly in the past two years, and with the newest 300/400bp chemistries this platform showed only one substitution error and a dramatically reduced number of indel errors. As these errors are systematic by nature (nearly all are related to homopolymer stretches in the sequence), appropriate software tools can compensate for them, says the last and corresponding author Dr. Dag Harmsen, a scientist from the Department for Periodontology at the University of Münster.
“The de novo assembly qualities of the MiSeq and PGM systems are amazingly good. Therefore, I expect both platforms being used routinely by early public health adopters for microbial epidemiologic surveillance to detect faster and more accurate outbreaks starting this year,” added Harmsen.
“To conduct a ‘fair’ NGS platform comparison is pretty hard. However, it is certainly the consensus accuracy, not the raw read accuracy, that is the relevant metric for normal end users,” explained the first author of the Nature Biotechnology publication Sebastian Jünemann, a bioinformatician from the Institute for Bioinformatics, Center for Biotechnology, Bielefeld University.
With such good and fast sequencing results, the focus is shifting away from the laboratory towards analyzing the huge amount of generated data, and turnkey software tools are needed more than ever. Thus, the next task is to work on user-friendly software solutions that bridge the gap from data to knowledge and open the door wide for routine application of NGS in clinical and public health microbiology, explained Harmsen.
Jünemann, S., Sedlazeck, F.J., Prior, K., Albersmeier, A., John, U., Kalinowski, J., Mellmann, A., Goesmann, A., von Haeseler, A., Stoye, J., Harmsen, D. (2013)
Updating benchtop sequencing performance comparison.
Nature Biotechnology 31, 294–296.