Platon: identification and characterization of bacterial plasmid contigs in short-read draft assemblies exploiting protein sequence-based replicon distribution scores
Plasmids are extrachromosomal genetic elements that replicate independently of the chromosome and play a vital role in the environmental adaptation of bacteria. Due to potential mobilization or conjugation capabilities, plasmids are important genetic vehicles for antimicrobial resistance genes and virulence factors with huge and increasing clinical implications. They are therefore subject to large genomic studies within the scientific community worldwide. As a result of rapidly improving next-generation sequencing methods, the quantity of sequenced bacterial genomes is constantly increasing, in turn raising the need for specialized tools to (i) extract plasmid sequences from draft assemblies, (ii) derive their origin and distribution, and (iii) further investigate their genetic repertoire. Recently, several bioinformatic methods and tools have emerged to tackle this issue; however, a combination of high sensitivity and specificity in plasmid sequence identification is rarely achieved in a taxon-independent manner. In addition, many software tools are not appropriate for large high-throughput analyses or cannot be included in existing software pipelines due to their technical design or software implementation.
In this study, we investigated differences in the replicon distributions of protein-coding genes on a large scale as a new approach to distinguish plasmid-borne from chromosome-borne contigs. We defined and computed statistical discrimination thresholds for a new metric: the replicon distribution score (RDS), which achieved an accuracy of 96.6 %. The final performance was further improved by the combination of the RDS metric with heuristics exploiting several plasmid-specific higher-level contig characterizations. We implemented this workflow in a new high-throughput taxon-independent bioinformatics software tool called Platon for the recruitment and characterization of plasmid-borne contigs from short-read draft assemblies. Compared to PlasFlow, Platon achieved a higher accuracy (97.5 %) and more balanced predictions (F1=82.6 %) tested on a broad range of bacterial taxa and better or equal performance against the targeted tools PlasmidFinder and PlaScope on sequenced Escherichia coli isolates.
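The accuracy and F1 figures quoted above can diverge considerably when one class dominates. The following is a minimal sketch of how both metrics are derived from a confusion matrix. It is not taken from Platon itself, and the counts in the usage note are hypothetical and chosen only for illustration.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all contigs that were classified correctly."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive (plasmid) class."""
    if tp == 0:
        # No true positives: precision and recall are 0 or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 90 plasmid contigs among 1000 contigs in total.
example = {"tp": 80, "fp": 10, "tn": 900, "fn": 10}
acc = accuracy(**example)
f1 = f1_score(example["tp"], example["fp"], example["fn"])
```

With these assumed counts the accuracy is 0.98 while the F1 score is about 0.89, because the many correctly classified chromosomal contigs raise accuracy but do not enter the F1 calculation.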
Platon is available at: http://platon.computational.bio
Schwengers, O., Barth, P., Falgenhauer, L., Hain, T., Chakraborty, T., & Goesmann, A.
Platon: identification and characterization of bacterial plasmid contigs in short-read draft assemblies exploiting protein sequence-based replicon distribution scores.
Microbial Genomics (2020), 95, 295. DOI: 10.1099/mgen.0.000398
ASA³P: An automatic and scalable pipeline for the assembly, annotation and higher-level analysis of closely related bacterial isolates
Whole genome sequencing of bacteria has become daily routine in many fields. Advances in DNA sequencing technologies and continuously dropping costs have resulted in a tremendous increase in the amounts of available sequence data. However, comprehensive in-depth analysis of the resulting data remains an arduous and time-consuming task. In order to keep pace with these promising but challenging developments and to transform raw data into valuable information, standardized analyses and scalable software tools are needed.
Here, we introduce ASA³P, a fully automatic, locally executable and scalable assembly, annotation and analysis pipeline for bacterial genomes. The pipeline automatically executes necessary data processing steps, i.e. quality clipping and assembly of raw sequencing reads, scaffolding of contigs and annotation of the resulting genome sequences. Furthermore, ASA³P conducts comprehensive genome characterizations and analyses, e.g. taxonomic classification, detection of antibiotic resistance genes and identification of virulence factors. All results are presented via an HTML5 user interface providing aggregated information, interactive visualizations and access to intermediate results in standard bioinformatics file formats. We distribute ASA³P in two versions: a locally executable Docker container for small-to-medium-scale projects and an OpenStack-based cloud computing version able to automatically create and manage self-scaling compute clusters. Thus, automatic and standardized analysis of hundreds of bacterial genomes becomes feasible within hours. The software and further information are available at: asap.computational.bio.
Schwengers, O., Hoek, A., Fritzenwanker, M., Falgenhauer, L., Hain, T., Chakraborty, T., & Goesmann, A.
ASA³P: An automatic and scalable pipeline for the assembly, annotation and higher-level analysis of closely related bacterial isolates.
PLoS Computational Biology (2020), 16(3), e1007134. DOI: 10.1371/journal.pcbi.1007134
The characterization of microbial communities based on sequencing and analysis of their genetic information has become a popular approach also referred to as metagenomics; in particular, the recent advances in sequencing technologies have enabled researchers to study even the most complex communities.
Metagenome analysis, the assignment of sequences to taxonomic and functional entities, however, remains a tedious task: large amounts of data need to be processed. There are a number of approaches addressing particular aspects, but scientific questions are often too specific to be answered by a general-purpose method.
We present MGX, a flexible and extensible client/server framework for the management and analysis of metagenomic datasets. MGX features a comprehensive set of adaptable workflows required for taxonomic and functional metagenome analysis, combined with an intuitive and easy-to-use graphical user interface offering customizable result visualizations. At the same time, MGX allows users to include their own data sources and devise custom analysis pipelines, thus enabling researchers to perform basic as well as highly specific analyses within a single application.
With MGX, we provide a novel metagenome analysis platform giving researchers access to the most recent analysis tools. MGX covers taxonomic and functional metagenome analysis, statistical evaluation, and a wide range of visualizations easing data interpretation. Its default taxonomic classification pipeline provides equivalent or superior results in comparison to existing tools.
Jaenicke S, Albaum SP, Blumenkamp P, Linke B, Stoye J and Goesmann A.
Flexible metagenome analysis using the MGX framework.
Rapid protein alignment in the cloud: HAMOND combines fast DIAMOND alignments with Hadoop parallelism
Here we introduce HAMOND, an application that uses Apache Hadoop to parallelize DIAMOND computation in order to scale out the calculation of alignments. HAMOND is fault-tolerant and scales by utilizing large cloud computing infrastructures such as Amazon Web Services. HAMOND has been tested in comparative genomics analyses and showed promising results in both efficiency and accuracy.
Vice versa, singleton genes can be identified to elucidate the specific properties of an individual genome. Since initial publication, the EDGAR platform has become one of the most established software tools in the field of comparative genomics.
Over the last years, the software has been continuously improved and a large number of new analysis features have been added. For the new version, EDGAR 2.0, the gene orthology estimation approach was newly designed and completely re-implemented. Among other new features, EDGAR 2.0 provides extended phylogenetic analysis features like AAI (Average Amino Acid Identity) and ANI (Average Nucleotide Identity) matrices, genome set size statistics and modernized visualizations like interactive synteny plots or Venn diagrams. Thereby, the software supports a quick and user-friendly survey of evolutionary relationships between microbial genomes and simplifies the process of obtaining new biological insights into their differential gene content.
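To make the notion of singleton genes and Venn-style gene content comparisons concrete, here is a minimal sketch using plain set operations. It assumes that orthology has already been resolved and that each genome is represented by a set of orthologous group identifiers. The function names and the toy data are hypothetical and do not reflect EDGAR's actual implementation or its orthology estimation approach.

```python
from collections import Counter


def shared_by_all(genomes: dict[str, set[str]]) -> set[str]:
    """Orthologous groups present in every genome (the central Venn region)."""
    if not genomes:
        return set()
    return set.intersection(*genomes.values())


def singletons(genomes: dict[str, set[str]]) -> dict[str, set[str]]:
    """Orthologous groups found in exactly one genome, reported per genome."""
    # Count in how many genomes each orthologous group occurs.
    occurrences = Counter(group for groups in genomes.values() for group in groups)
    return {
        name: {group for group in groups if occurrences[group] == 1}
        for name, groups in genomes.items()
    }


# Toy input: three genomes with made-up orthologous group IDs.
toy_genomes = {
    "genome_A": {"og1", "og2", "og3"},
    "genome_B": {"og1", "og2", "og4"},
    "genome_C": {"og1", "og5"},
}
common = shared_by_all(toy_genomes)
unique = singletons(toy_genomes)
```

In this toy example only `og1` is shared by all three genomes, and each genome has exactly one singleton group (`og3`, `og4` and `og5`, respectively).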
All features are offered to the scientific community via a web-based and therefore platform-independent user interface, which allows easy browsing of precomputed datasets.
The web server is accessible at http://edgar.computational.bio.
Blom J, Kreis J, Spänig S, Juhre T, Bertelli C, Ernst C, Goesmann A (2016)
EDGAR 2.0: an enhanced software platform for comparative gene content analyses.
Nucleic Acids Research
Short DNA motifs are involved in a multitude of functions such as chromosome segregation, DNA replication or mismatch repair. The distribution of such motifs is often not random, and the specific chromosomal pattern relates to the respective motif function. Computational approaches that quantitatively assess such chromosomal motif patterns are therefore necessary. Here we present a new computer tool, DistAMo (Distribution Analysis of DNA Motifs). The algorithm uses codon redundancy to calculate the relative abundance of short DNA motifs from single genes to entire chromosomes. Comparative genomics analyses of the GATC-motif distribution in γ-proteobacterial genomes using DistAMo revealed that (i) genes beside the replication origin are enriched in GATCs, (ii) genome-wide GATC distribution follows a distinct pattern, and (iii) genes involved in DNA replication and repair are enriched in GATCs. These features are specific for bacterial chromosomes encoding a Dam methyltransferase. The new software is available as a stand-alone version or as an easy-to-use web-based server version.
Sobetzko P, Jelonek L, Strickert M, Han W, Goesmann Alexander, Waldminghaus T (2016)
DistAMo: A web-based tool to characterize DNA-motif distribution on bacterial chromosomes.
Motivation: Fast algorithms and well-arranged visualizations are required for the comprehensive analysis of the ever-growing size of genomic and transcriptomic next generation sequencing (NGS) data.
Results: ReadXplorer is a software tool offering straightforward visualization and extensive analysis functions for genomic and transcriptomic DNA sequences mapped on a reference. A unique specialty of ReadXplorer is the quality classification of the read mappings. It is incorporated in all analysis functions and displayed in ReadXplorer's various synchronized data viewers for (i) the reference sequence, its base coverage as (ii) normalizable plot and (iii) histogram, (iv) read alignments and (v) read pairs. ReadXplorer's analysis capability covers RNA secondary structure prediction, single nucleotide and deletion-insertion polymorphism (SNP and DIP) detection, genomic feature and general coverage analysis. Especially for RNA-Seq data, it offers differential gene expression analysis, transcription start site (TSS) and operon detection as well as RPKM value and read count calculations. Furthermore, ReadXplorer can combine or superimpose coverage of different data sets.
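As an illustration of the RPKM values mentioned above, the following minimal sketch applies the standard definition (reads per kilobase of feature length per million mapped reads). It is not ReadXplorer code. The function name and the example numbers are assumptions made purely for illustration.

```python
def rpkm(read_count: int, feature_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of feature per million mapped reads.

    RPKM = read_count * 1e9 / (feature_length_bp * total_mapped_reads)
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and total mapped reads must be positive")
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2,000 bp feature, 10 million mapped reads in total.
example_rpkm = rpkm(500, 2_000, 10_000_000)
```

Normalizing by both feature length and sequencing depth is what makes values comparable between genes of different length and between data sets of different size. For the assumed numbers the result is 25.0.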
Hilker, R., Stadermann, K.B., Doppmeier, D., Kalinowski, J., Stoye, J., Straube, J., Winnebald, J., Goesmann, A., (2014) ReadXplorer - Visualization and Analysis of Mapped Sequences. Bioinformatics, btu205.
In recent years, the number of published genome sequences has increased substantially owing to major developments in next-generation sequencing (NGS) technologies, concomitant reduction of sequencing costs and improvements in assembly strategies. In 2011, the genome of Chinese hamster ovary (CHO)-K1 cells, the most frequently used mammalian production cell line for biopharmaceutical products, was published. In this issue, the genomes of several related CHO cell lines as well as of the genome of the Chinese hamster are also presented. Although this information provides long-awaited and necessary insights for scientists working with these important production hosts, it also highlights a major drawback of short-read NGS technology, namely, the difficulty of assembling short-read data and scaffolding these sequences into a fully structured genome. This is especially critical for CHO cells, which are known to be genomically unstable, with frequent chromosome rearrangements and loss. In this correspondence to Nature, we describe how a chromosome sorting approach can facilitate genome assembly from short-read sequences.
Brinkrolf, K., Rupp, O., Laux, H., Kollin, F., Ernst, W., Linke, B., Kofler, R., Romand, S., Hesse, F., Budach, W. E., Galosy, S., Müller, D., Noll, T., Wienberg, J., Jostock, T., Leonard, M., Grillari, J., Tauch, A., Goesmann, A., Helk, B., Mott, J.E., Pühler, A., Borth, N. (2013).
Chinese hamster genome sequenced from sorted chromosomes.
Nature Biotechnology 31, 694–695.
The research area of metabolomics has gained tremendous popularity and seen considerable development in the last couple of years. Owing to its unique interdisciplinarity, it requires combining knowledge from various scientific disciplines. Advances in high-throughput technology and the consequently growing quality and quantity of data put new demands on applied analytical and computational methods. Exploration of the generated and analyzed datasets furthermore relies on powerful tools for data mining and visualization.
To cover and keep up with these requirements, we have created MeltDB 2.0, a next-generation web application addressing storage, sharing, standardization, integration and analysis of metabolomics experiments. New features improve both the efficiency and the effectiveness of the entire processing pipeline of chromatographic raw data, from pre-processing to the derivation of new biological knowledge. First, the generation of high-quality metabolic datasets has been vastly simplified. Second, the new statistics toolbox allows these datasets to be investigated according to a wide spectrum of scientific and explorative questions.
The system is publicly available at https://meltdb.cebitec.uni-bielefeld.de. A login is required but freely available.
Kessler, N., Bonte, A., Langenkämper, G., Niehaus, K., Goesmann, A., & Nattkemper, T.W. In Press. “MeltDB 2.0 - Advances of the metabolomics software system”. Bioinformatics 29(19).
Affordable Next Generation Sequencing (NGS) machines with fast turnaround times have been on the market for about two years. A team led by the University of Münster has now compared three different benchtop NGS platforms and how they evolved over time. More specifically, the consortium, comprising researchers from the Universities of Münster and Bielefeld, the Alfred Wegener Institute Bremerhaven (all based in Germany) and the University of Vienna in Austria, challenged the GS Junior (Roche; Titanium 400 base-pair [bp] chemistry), MiSeq (Illumina; 2x 150bp & 2x 250bp paired-end consumables) and PGM (Ion Torrent; 100bp, 200bp, 300bp & 400bp kits) with bacterial whole genome sequencing. Discrepancies to a high-quality reference genome sequence were furthermore clarified by traditional bidirectional Sanger sequencing.
What the team found was that the MiSeq made a very strong official debut, with only very few substitution errors and no insertion or deletion (indel) errors at consensus level. The GS Junior had by far the lowest throughput, making it more costly to operate than the other two platforms. The PGM evolved rapidly over the past two years, and with the newest 300/400bp chemistries this platform showed only one substitution error and a dramatically reduced number of indel errors. As these errors are systematic by nature (nearly all are related to homopolymer stretches in the sequence), appropriate software tools can compensate for them, says the last and corresponding author Dr. Dag Harmsen, a scientist from the Department for Periodontology at the University of Münster.
“The de novo assembly qualities of the MiSeq and PGM systems are amazingly good. Therefore, I expect both platforms being used routinely by early public health adopters for microbial epidemiologic surveillance to detect faster and more accurate outbreaks starting this year,” added Harmsen.
“To conduct a ‘fair’ NGS platform comparison is pretty hard. However, it is certainly the consensus accuracy, not the raw read accuracy, that is the relevant metric for normal end users,” explained the first author of the Nature Biotechnology publication Sebastian Jünemann, a bioinformatician from the Institute for Bioinformatics, Center for Biotechnology, Bielefeld University.
As the focus with such good and fast sequencing results is shifting away from the laboratory towards analyzing the huge amount of generated data, turnkey software tools are needed more than ever. Thus, the next topic is to work on user-friendly software solutions that bridge the gap from data to knowledge and open the door wide for routine application of NGS in clinical and public health microbiology, explained Harmsen.
Jünemann, S., Sedlazeck, F.J., Prior, K., Albersmeier, A., John, U., Kalinowski, J., Mellmann, A., Goesmann, A., von Haeseler, A., Stoye, J., Harmsen, D. (2013)
Updating benchtop sequencing performance comparison.
Nature Biotechnology 31, 294–296.