
Open thesis topics

Within our group we can offer various topics in the field of applied bioinformatics, high-throughput data analysis, genome and metagenome research as well as postgenomics and systems biology. Below you can find a list of suggested open topics for BSc and MSc theses and student projects. For further details on each topic or alternative projects please contact us.

 

Comparative genome analysis of Streptococcus agalactiae (GBS) from elephants (M.Sc.)

Background

Group B streptococci (GBS) are fairly common. In livestock, they cause udder inflammation, most often seen in dairy cows.

In elephants, S. agalactiae is associated with paronychia.
Under human care, elephants are known to reach an advanced age. This comes with an age-related decline of the immune system, which can cause usually harmless skin or foot diseases to become chronic. Better knowledge of these bacterial infections is a vital foundation for optimized treatments and therapeutic approaches.

In a recent study by the Hessisches Landeslabor (Hesse State Laboratory, LHL), several S. agalactiae isolates were compared using microbiological methods, and extensive biochemical profiles were created.
Notably, the serotype could not be determined for a large number of isolates. For this reason, some isolates were sequenced so that a full comparative genome analysis can be carried out using current bioinformatics methods.

Thesis aims

  • Implementation of typical bioinformatic analyses (assembly, mapping, annotation, ...)
  • Comparative analysis of GBS isolates (ABR, pan- and core genome, virulence factors, ...)
  • Closer inspection of genes for serotyping

Prerequisites

  • Interest in solving biological/veterinary questions using bioinformatics
  • Extensive knowledge of the Linux command line
  • Ability to work independently and methodically

Contact: Linda Fenske

 

Workflow Design (Nextflow) (M.Sc.)

 

Background

Analysing (bacterial) sequence data for biological/medical questions often means repeating certain standard processes (QC, assembly, annotation etc.).

To improve reproducibility and simplify these processes, flexible pipelines with a wide palette of tools are used. Nextflow (or a similar workflow tool) is often used to support a variety of environments or to simplify installation.

With DSL2, Nextflow introduced a significant development of the Nextflow language, which promises better scalability and modularization of pipelines, along with a better design of workflows.

Thesis aims

  • Revision and updating of an existing workflow for analysing bacterial data
  • Migration of the workflow from Nextflow DSL1 to DSL2
  • Visualising the results (creating a GUI)

Prerequisites 

  • Extensive knowledge of the Linux command line
  • Knowledge of Nextflow or motivation to become acquainted with Nextflow
  • Programming knowledge in Python, Groovy (Nextflow) or similar
  • Knowledge and interest in visualisation and processing of data

Contact: Linda Fenske

 

Platon Bioinformatics Tool Enhancement for Faster Plasmid Identification (M.Sc.) - taken

Background

Modern high-throughput sequencing devices enable the rapid determination of sequence data obtained from interacting microbial communities without a prior cultivation step. This provides easy access to genetic information from otherwise unculturable microbiota. Computational interpretation of such data relies either on assigning raw sequencing reads to their source organisms in order to infer taxonomic origin or gene-coding content, or on assembling the metagenome datasets, thereby recovering longer contiguous DNA stretches of the underlying microbial genomes.

Assembled metagenomic contigs are typically clustered (most often based on coverage or nucleotide composition), yielding individual draft or complete genomes of novel bacterial species. In this process, however, contigs of non-chromosomal origin such as plasmids are often overlooked.

Still, the analysis of plasmids is of utmost importance, since they constitute a key mechanism of horizontal gene transfer between microbial hosts. They are known to harbor genes that are beneficial or essential for microbial fitness or survival under certain environmental conditions (e.g. in the presence of certain antimicrobial agents), or that enable their hosts to perform metabolic processes they would otherwise be unable to carry out (e.g. degradation of novel substrates).

Several bioinformatics applications have been developed for the computational identification of plasmid-borne contigs, most typically focusing on the extraction of plasmid contigs from the assemblies of individual draft genomes. Among these tools are Platon (Schwengers et al., 2020), PlasClass (Pellow et al., 2020) and PlasFlow (Krawczyk et al., 2018), of which Platon exhibits excellent performance, but its runtime characteristics currently impede its application to potentially large metagenome assemblies.

 

Thesis aims

  • Overhaul of the Platon code base, switching from a contig-centered approach to one based on bulk data processing in order to significantly decrease overall runtime.
  • Integration of certain sub-analysis steps, such as circularity testing, directly into the Python code base instead of invoking external tools (candidate libraries: Pyrodigal, pyHMMER, PyTrimal)
  • Conditional tool execution: Do not invoke additional tools if preceding steps already exclude a sequence from being a plasmid
  • Runtime and performance assessment with regard to the original implementation
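
The idea of conditional tool execution can be illustrated with a minimal Python sketch. It is not taken from the Platon code base: the function names, the length threshold and the toy "marker search" are hypothetical placeholders. It only shows how an ordered chain of checks can stop as soon as one step excludes a sequence, so that later (more expensive) steps are never invoked.

```python
from typing import Callable, List, Tuple

# A check has a name and a function that returns True if the sequence
# can still be a plasmid candidate after this step.
Check = Tuple[str, Callable[[str], bool]]


def passes_length_filter(seq: str, max_len: int = 500_000) -> bool:
    """Cheap check with a hypothetical threshold: very long contigs are excluded."""
    return len(seq) <= max_len


def passes_marker_search(seq: str) -> bool:
    """Toy stand-in for an expensive step such as a call to an external tool."""
    return "ATGC" in seq


def classify(seq: str, checks: List[Check]) -> Tuple[bool, List[str]]:
    """Run checks in order and stop at the first one that excludes the sequence.

    Returns the verdict and the names of the steps that were actually executed.
    """
    executed: List[str] = []
    for name, check in checks:
        executed.append(name)
        if not check(seq):
            return False, executed
    return True, executed


# Cheap checks come first so that expensive ones run as rarely as possible.
CHECKS: List[Check] = [
    ("length", passes_length_filter),
    ("markers", passes_marker_search),
]
```

Calling `classify(seq, CHECKS)` on a contig that fails the length filter returns after a single step, which is where the runtime saving would come from in a real pipeline.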

 

Requirements

  • Familiarity with Linux and (modular) Python programming (incl. unit testing)
  • Methodical way of working
  • Able to work independently

Contact: Oliver Schwengers

 

Develop and Compare Curare Modules for Different DGE Libraries (M.Sc.)

 

Background

Differential gene expression analysis (DGE) is a commonly used method in RNA sequencing, in which the expressions of different genes in samples from different conditions are statistically compared to identify relevant genes in stress or defense situations. To simplify the execution of these analyses, the software Curare was developed.

Currently, the R library DESeq2 is used for the statistical evaluation of expression data, but there are also alternative libraries such as edgeR or limma that pursue similar or completely different statistical approaches.

This Master's thesis aims to write, compare, and combine Curare modules for various DGE libraries. This requires working with different R libraries, integrating the evaluation into Curare (written in Snakemake), and visualizing the results in an HTML report.

Thesis aims

  • Write Curare modules for different DGE libraries and compare and combine them.
  • Learn about different R libraries for statistical analysis of expression data.
  • Integrate the analysis in Curare (written in Snakemake) and visualize the results in an HTML report.
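
One possible way to "combine" the results of several DGE libraries is a simple vote across tools. The following Python sketch is only an illustration under stated assumptions: the input format (adjusted p-values per gene and tool), the function names, the significance threshold and the example values are all invented and are not part of Curare.

```python
from collections import Counter
from typing import Dict, Set


def significant_genes(padj: Dict[str, float], alpha: float = 0.05) -> Set[str]:
    """Genes whose adjusted p-value is below alpha."""
    return {gene for gene, p in padj.items() if p < alpha}


def combine_calls(
    results: Dict[str, Dict[str, float]],
    alpha: float = 0.05,
    min_tools: int = 2,
) -> Dict[str, int]:
    """Count for each gene how many tools call it significant.

    Only genes supported by at least min_tools tools are returned.
    """
    votes: Counter = Counter()
    for padj in results.values():
        votes.update(significant_genes(padj, alpha))
    return {gene: n for gene, n in sorted(votes.items()) if n >= min_tools}


# Invented adjusted p-values, for illustration only.
example = {
    "DESeq2": {"geneA": 0.01, "geneB": 0.20, "geneC": 0.03},
    "edgeR": {"geneA": 0.02, "geneB": 0.04, "geneC": 0.50},
    "limma": {"geneA": 0.04, "geneB": 0.30, "geneC": 0.01},
}
consensus = combine_calls(example)
```

A vote is only one of several conceivable combination strategies. Comparing it with alternatives would be part of the thesis work.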

Contact: Patrick Blumenkamp

 

Reconstruction and visualization of KEGG metabolic pathways in the EDGAR platform (M.Sc.)

Background

EDGAR is a web-based platform for analyzing microbial data. It is developed by employees of the Bioinformatics and Systems Biology department at JLU Giessen and provides multifaceted methods for investigating genomes.

KEGG (Kyoto Encyclopedia of Genes and Genomes) provides curated databases and resources for (among other things) the functional annotation and classification of genes. In previous projects, KEGG functional categories for all organisms and their corresponding genes were computed in the EDGAR platform. These are currently displayed directly in two analysis modules, in purely quantitative terms.

MinPath is a program for reconstructing biological/metabolic pathways. It attempts to infer a minimal set of metabolic pathways that can explain the genes found in a given dataset, excluding redundant pathways. The above-mentioned KEGG categories will be used as input for this program.
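
The parsimony idea behind such a reconstruction can be sketched in a few lines of Python. The sketch below deliberately swaps in a simple greedy set-cover heuristic. It is not MinPath's actual algorithm, and the pathway names and gene family identifiers are invented for illustration.

```python
from typing import Dict, List, Set


def minimal_pathways(observed: Set[str], pathways: Dict[str, Set[str]]) -> List[str]:
    """Greedily pick pathways until every explainable observed family is covered."""
    # Families that occur in no pathway cannot be explained and are ignored.
    uncovered = {
        family
        for family in observed
        if any(family in members for members in pathways.values())
    }
    chosen: List[str] = []
    while uncovered:
        # Pick the pathway covering the most uncovered families;
        # sorting the names makes ties deterministic.
        best = max(sorted(pathways), key=lambda name: len(pathways[name] & uncovered))
        chosen.append(best)
        uncovered -= pathways[best]
    return chosen


# Invented toy data: "gluconeogenesis" is redundant because "glycolysis"
# already explains all of its observed families.
toy_pathways = {
    "glycolysis": {"F1", "F2", "F3"},
    "gluconeogenesis": {"F2", "F3"},
    "tca": {"F4"},
}
selected = minimal_pathways({"F1", "F2", "F3", "F4", "F9"}, toy_pathways)
```

In this toy case only two of the three pathways are needed to explain the observed families, which is the kind of reduction a comparative module would then visualize per genome.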

The goal of the project is to develop a comparative analysis module, based on KEGG pathway information, for the EDGAR platform.

Thesis Aims

  • Parse the available KEGG data in a structured manner and compute KEGG metabolic pathways for all given genomes in EDGAR using MinPath.
  • Design comparative visualizations for the EDGAR frontend using the resulting data, allowing users to interactively explore their data (see fig. 4 here as an example)
  • Adjust the project scope in consultation with the student depending on the project status to accommodate shared ideas, as EDGAR incorporates a wide selection of data with potential for creative analysis methods.

Requirements 

  • Programming skills in Python and JavaScript (can also be learned during the process)

  • Basic SQL database knowledge

 

PlasmidHunter: Validation of a metagenome-based plasmid search using public plasmid sequences (M.Sc.)

Background

Plasmids play an important role in the genetic variability of organisms. They replicate independently and are transferred between organisms, both within and between species. Therefore, plasmids are key drivers of horizontal gene transfer. Often, they are the effective and only difference between commensal and pathogenic bacterial strains. In recent years, it has become clear that plasmids are among the main mechanisms for the dissemination of antimicrobial resistance and hence are of special interest in medical microbiology. Detecting plasmids and analyzing their dissemination is an important epidemiological and scientific topic that might help to detect current and prevent future outbreaks of antibiotic resistance.

Whole-metagenome datasets of samples from different sources (soil, waste water, the human gut) are a promising source of known and unknown plasmids. For many of these samples, sequencing data is freely accessible in public databases, often annotated with additional meta information such as date, source and location of each sample.

Our project processes these datasets from the MGnify database in a standardized way via modern cloud technologies and makes them accessible to users for a fast search of new plasmids within this huge amount of data.

This Master's thesis aims to validate this search against existing plasmid databases (such as PLSDB) and to analyze the search results, including comprehensive visualizations.

Thesis Aims

  • Implementation of a workflow to process PLSDB entries with our existing search workflow
  • Statistical analysis of the results and screening for potentially interesting candidates for further analysis
  • Visualization of the results

Prerequisites 

  • Knowledge of command line tools and Python
  • Interest in cloud technologies
  • Prior experience with workflow systems, like Nextflow or Snakemake

Contact: Sebastian Beyvers

 

Webservice for searching gene families in plants (M.Sc.)

 

Background

The input is a list of protein sequences. In step 1a, a Pfam search is performed with the sequences to find common domains. In step 1b, a multiple sequence alignment of the sequences is calculated. The conserved regions are automatically extracted from the alignment to calculate HMMs. In step 2, the HMMs of the domains from 1a and 1b are used to search a database of plant proteins.
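
The extraction of conserved regions in step 1b could, for example, look like the following Python sketch. It is only one possible approach under stated assumptions: the identity threshold, the minimum region length and the toy alignment are invented, and the alignment is assumed to be given as equally long strings with "-" as the gap character.

```python
from collections import Counter
from typing import List, Tuple


def conserved_regions(
    alignment: List[str],
    min_identity: float = 0.8,
    min_length: int = 3,
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) column ranges of conserved alignment blocks.

    A column counts as conserved if its most common character is not a gap
    and occurs in at least min_identity of the sequences.
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    flags = []
    for col in range(width):
        residue, count = Counter(row[col] for row in alignment).most_common(1)[0]
        flags.append(residue != "-" and count / len(alignment) >= min_identity)

    regions: List[Tuple[int, int]] = []
    start = None
    # The trailing False closes a region that runs to the end of the alignment.
    for i, conserved in enumerate(flags + [False]):
        if conserved and start is None:
            start = i
        elif not conserved and start is not None:
            if i - start >= min_length:
                regions.append((start, i))
            start = None
    return regions


# Invented toy alignment of three sequences.
toy_alignment = [
    "MKTAYIAK-QR",
    "MKTAYLAKGQR",
    "MKTAY-AKAQR",
]
blocks = conserved_regions(toy_alignment)
```

Each returned column range could then be cut out of the alignment and passed to an HMM building tool.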

Thesis Aims

  • The results are visualized and made available for download
  • Steps 1 and 2 are also provided as a command-line tool

Prerequisites

  • The programming language(s) and frameworks can be freely chosen
  • Test data will be provided

Contact: Oliver Rupp

 

Ribosomal binding site prediction based on 16S-rRNA (M.Sc.)

 

Background

Bacterial translation is initiated by the assembly of ribosomal proteins as part of the translation initiation complex at the coding sequence (CDS) start site. For most CDS, there is a ribosomal binding site (RBS) immediately upstream of the gene, consisting of a 5-10bp spacer and a (partial or complete) Shine-Dalgarno sequence (SD) 5’-AGGAGG-3’ to which the ribosome binds. However, some genes have neither an SD nor a known RBS and are still expressed (Omotajo, D. et al., 2015). The Shine-Dalgarno sequence was first described in E. coli but is found in many bacterial genomes and is complementary to the anti-SD sequence at the 3′-end of 16S-rRNA.

The exact Shine-Dalgarno and spacer sequences vary between bacterial species. However, because the anti-Shine-Dalgarno sequence is present in the 16S-rRNA of each bacterial genome, it can be used to predict RBS in a species-independent manner.  Therefore, a deep learning approach using the 16S-rRNA sequences and the sequence upstream of the CDS is promising for accurately predicting the presence of RBS independent of species-specific variants.
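
The complementarity between the anti-SD sequence and the SD motif can be made concrete with a naive Python sketch. This is a simple exact-match scan, not the deep learning approach proposed for the thesis. The function names, the minimum match length and the example upstream sequences are assumptions made for illustration.

```python
from typing import Optional, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA or RNA sequence, returned as DNA."""
    return seq.upper().replace("U", "T").translate(_COMPLEMENT)[::-1]


def find_sd(
    upstream: str, anti_sd: str, min_match: int = 4
) -> Optional[Tuple[int, str, int]]:
    """Find the longest exact (partial) SD match in the region upstream of a CDS.

    The expected SD motif is derived from the anti-SD sequence of the 16S rRNA.
    Returns (position, matched motif, spacer length to the CDS start) or None.
    """
    if min_match < 1:
        raise ValueError("min_match must be at least 1")
    upstream = upstream.upper().replace("U", "T")
    sd = reverse_complement(anti_sd)
    # Try the full motif first, then progressively shorter sub-motifs.
    for length in range(len(sd), min_match - 1, -1):
        for offset in range(len(sd) - length + 1):
            motif = sd[offset:offset + length]
            pos = upstream.rfind(motif)  # prefer the hit closest to the CDS
            if pos != -1:
                spacer = len(upstream) - (pos + length)
                return pos, motif, spacer
    return None


# The anti-SD 5'-CCUCCU-3' is complementary to the SD 5'-AGGAGG-3'.
full_hit = find_sd("TTGACAAGGAGGTAACATA", "CCUCCU")
partial_hit = find_sd("TTGACATTGGAGTTAACATA", "CCUCCU")
```

Because the motif is computed from the supplied anti-SD sequence rather than hard-coded, the same scan works for any species whose 16S rRNA 3′-end is known, which is the property the proposed approach builds on.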

Thesis Aims

  • Design and implementation of a neural network for ribosomal binding site prediction in bacteria,
  • evaluation of the features used by the neural network, and
  • analysis of the presence of RBS in exemplary bacterial genomes

Prerequisites 

  • Prior experience with deep learning frameworks such as Tensorflow/Keras, or willingness to learn them
  • Prior experience in the development of documented code and dependency management or willingness to learn them

Contact: Julian Hahnfeld

 

Integrative Omics FAIR Workflow (M.Sc.)

Background

Processing and analysing 'omics data often requires applying predefined building blocks of code, e.g. for performing quality control, statistical analysis or machine learning. However, biologists and ecologists are often overwhelmed by the technical complexity of programmatic approaches and interfaces. Hence, scientific workflows can not only automate but also facilitate important recurring processes in high-throughput 'omics analysis.

The existing modularized iESTIMATE pipeline aims at automating and facilitating the complex analysis of ecological metabolomics data and the integration with other phenomics and preparation for sequencing and (meta-)genomics data. The central aim of the pipeline is to extract so-called molecular traits that explain molecular mechanisms in plants or microorganisms.

Thesis Aims

  • Revision and modularisation of existing code to create the R package "iESTIMATE"
  • Implementation of a workflow in Nextflow or Common Workflow Language (CWL) using test data, including unit tests and capture of provenance information
  • Publication of the R package and the workflow following the FAIR principles

Prerequisites 

  • Knowledge of R and a bit of Python
  • Knowledge of the Linux command line, containers, Nextflow (Groovy), YAML, or motivation to become acquainted with them
  • Keen interest in analysis of integrative 'omics data and in topics in molecular ecology

Contact: Kristian Peters