
Open Theses Topics

Current suggestions for B.Sc. / M.Sc. / Ph.D. thesis topics. Should you have an idea for a topic yourself, don't hesitate to contact us to discuss it.


The bioinformatics master's curriculum contains two lab rotations (Laborpraktika). Students work independently on a project, e.g. in programming or data analysis, to acquire the skills they need before starting their master's thesis. We always welcome interested students and have plenty of topics; some examples are listed here. Please get in contact with us!


Transmembrane Prediction via Algebraic Dynamic Programming

Almost 25 years ago, Hidden Markov Models (HMMs) were successfully used in the then-emerging field of bioinformatics to predict the topology of transmembrane proteins (original papers "A hidden Markov model for predicting transmembrane helices in protein sequences" and "Predicting Transmembrane Protein Topology with a Hidden Markov Model: Application to Complete Genomes").

The authors of the program TM-HMM were excited about the tight coupling between mathematical modeling via HMMs and the modeled molecular biology. They used multiple algorithms to "score" amino acid sequences, e.g. for being transmembrane proteins (Viterbi), or to mark sub-sequences as the helical regions that span the membrane (forward-backward). Labeling the many states of the HMM as 'inside', 'outside', 'transmembrane' or 'unlabeled' turns computation of the most probable "label" into an NP-hard problem, as proven much later by Broňa Brejová et al. The TM-HMM program is a great example of dynamic programming on real-world sized instances, although a modern version is now based on deep learning, which, in principle, also uses forward-backward ideas.

Within a thesis, you would re-implement the HMM in Algebraic Dynamic Programming so that we can then use novel auto-generated algorithmic ideas to potentially improve prediction accuracy for this long-standing problem.
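As a reminder of the kind of dynamic programming involved, here is a minimal Viterbi sketch for a toy two-state HMM. The two states, all probabilities and the set of "hydrophobic" residues are made-up assumptions for illustration; this is not the TM-HMM model and not ADP code.

```python
import math

# Toy model: every number below is an illustrative assumption.
STATES = ("membrane", "loop")
START = {"membrane": 0.5, "loop": 0.5}
TRANS = {
    "membrane": {"membrane": 0.9, "loop": 0.1},
    "loop": {"membrane": 0.1, "loop": 0.9},
}
HYDROPHOBIC = set("AILMFVWC")


def emission(state, residue):
    """Probability of emitting a hydrophobic or polar residue in a state."""
    hydrophobic = residue in HYDROPHOBIC
    if state == "membrane":
        return 0.8 if hydrophobic else 0.2
    return 0.3 if hydrophobic else 0.7


def viterbi(sequence):
    """Return the most probable state path for the sequence (in log-space)."""
    if not sequence:
        return []
    # scores[s] = best log-probability of any path that ends in state s
    scores = {
        s: math.log(START[s]) + math.log(emission(s, sequence[0]))
        for s in STATES
    }
    backpointers = []
    for residue in sequence[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(emission(s, residue))
            )
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Calling `viterbi("KKKKKLLLLLLLLLLKKKKK")` labels each residue with one of the two states. Replacing `max` with a sum over predecessors would turn the same recurrence into the forward algorithm.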


TAKEN: Speed-Up RNA prediction

We use the high-abstraction programming discipline "Algebraic Dynamic Programming" (ADP) to provide a wide variety of RNA secondary structure prediction programs (source code, web services). Our focus has been on short development times. Unfortunately, some implementation details, mainly outside the core ADP algorithm, slow down execution of the programs significantly. See the image for a profiling example: too much time is spent on unnecessary conversions from and to log-space (red) or on repeated computation of information (magenta) that should rather be done once before the core algorithm starts. Please give us a hand: benchmark current execution times, suggest speed-ups (there is a lot of low-hanging fruit), implement your ideas in our ADP compiler and benchmark your improvements, all as part of a rewarding thesis!

TAKEN: AI to predict Leukemia

The current working hypothesis for the development of leukemia comprises two independent steps. The first is a genetic mutation acquired in utero. A screening revealed the alarming fact that 1-5% of all newborn children carry such a mutation. Fortunately, only a small fraction of them will develop leukemia; it remains unclear why, i.e. what the triggering factors for the second step are. In a mouse model, we found that the stool microbiome was highly specific for the first step. If we are able to translate this to humans, we might obtain a diagnostic tool to identify children at risk. As always, we lack data to train such a prediction tool. However, a growing number of microbiome experiments collect these data, e.g. "Gut microbiome in pediatric acute leukemia: from prediction to cure". The task of this project would be two-fold: 1) collect microbial data from various experiments, and 2) use the data to train simple machine learning tools like Random Forest and test whether they can predict leukemia susceptibility.

Ubuntu Package for fold-grammars software (B.Sc.)

Create a GitHub Action to automatically deploy multiple software packages of the fold-grammars repository (pKiss, RNAshapes, ...) as Ubuntu Launchpad packages, as was done manually here: https://launchpad.net/~bibi-help/+archive/ubuntu/bibitools

TAKEN: Fungal genome annotation pipeline

The transition to a biobased economy involving the depolymerization and fermentation of renewable agro-industrial sources is a challenge that can only be met by achieving the efficient hydrolysis of biomass to monosaccharides. In nature, lignocellulosic biomass is mainly decomposed by fungi. Further read.

In collaboration with the Institute of Food Chemistry and Food Biotechnology, we are sequencing specific fungal genomes and aim to understand their functional capabilities through wet-lab experiments and in-silico genomic annotations. We use Funannotate for the latter; however, this pipeline needs to be adapted to run on our high-performance compute cluster and to be specialized for the fungal genomes, for which we generate short and long genomic reads as well as short transcriptomic reads.


SILVA - SEPP (M.Sc.)

The microbiome, i.e. the sum of bacteria, archaea, fungi and viruses living in and on your body, has an enormous impact on your health and well-being, e.g. on the risk of developing autoimmune diseases, the tendency towards obesity and chronic gut diseases, but also the predisposition to depression or Parkinson's.

The majority of high profile studies of the microbiome are technically based on sequencing the 16S ribosomal RNA gene (i.e. amplicon sequencing) as a proxy to discern the different bacteria / archaea living in a microbiome. Recent algorithmic advances in amplicon-based microbiome studies enable the inference of exact amplicon sequence fragments to increase taxonomic resolution. However, these short (e.g., 150-nucleotide [nt]) DNA sequence fragments do not contain sufficient phylogenetic signal to reproduce a reasonable phylogenetic tree, introducing a barrier in the utilization of critical phylogenetically aware metrics such as Faith's PD or UniFrac.

Valid phylogenetic trees can be created by placing the sequenced fragments into a given high quality reference tree. This has been done for the Greengenes reference, but it is an open question how well this approach works for the popular SILVA reference tree: https://msystems.asm.org/content/3/3/e00021-18.

The aim of this project would be to evaluate phylogenetic placements via SEPP into SILVA instead of Greengenes, and their impact on the ability to detect biological signals in samples that are compared via metrics using this SILVA-based tree. Furthermore, the use of SILVA-based insertion trees for taxonomic assignment shall be benchmarked within the TAX CREdiT framework.


RNAshapes studio front-end (B.Sc./M.Sc.)

Many RNA molecules do not encode proteins (as mRNAs do), but exert important biological functions by themselves. Most functions are realized through their three-dimensional structure, which is often more conserved than the primary nucleotide sequence. Thus, sequence homology search with e.g. BLAST often does not return good results.

The secondary structure of an RNA is the set of base-pairs formed between its nucleotides and functions as a scaffold for the final 3D structure. Since RNA folds hierarchically, prediction of the secondary structure gives valuable insights about an RNA, while true 3D predictions are mostly computationally intractable.

RNAshapes / pKiss / KnotInFrame and other software packages were developed at Bielefeld University and are popular tools to predict different aspects of secondary structures. They are based on the same algorithmic ideas (algebraic dynamic programming) and share the same code base back-end. The front end is written in Perl and lacks modern software engineering must-haves, like continuous testing, modularity or documentation.

Since our group is working on several algorithmic extensions of the back-end, we are looking for a motivated student who re-implements the front-end in Python 3, adds unit tests, creates bioconda packages, ... to allow for easy maintenance and distribution.


16S full length (M.Sc.)


http://www.clpmag.com/2017/04/smrt-sequencing/

Short read amplicon sequencing is still the state-of-the-art protocol to investigate large numbers of microbial samples. Although cheap and fast, phylogenetic resolution, i.e. the ability to discriminate different bacterial species / strains, is limited by the short read length of ~200 nucleotides. The other end of the spectrum is whole-metagenome shotgun sequencing. It is able to recover complete genomes of multiple bacterial species in a sample, but is expensive and needs high computational resources.

A compromise might be full-length 16S rRNA sequencing. Third-generation sequencing instruments like PacBio or Oxford Nanopore can span the full 16S gene (~1,800 bases) and thus potentially capture 10-fold more phylogenetic resolution. Benchmarks on mock communities (mixing ~20 bacterial isolates or rRNA genes) and on taxonomic assignments already exist, but it is unclear what the advantages and disadvantages are with respect to measuring biological effect sizes in real experimental settings.

In cooperation with the University Clinic Düsseldorf, we sequenced ~500 samples with short- and long-read technology. This provides an ideal benchmarking scenario to develop recommendations for Vx / full-length / whole-metagenome sequencing and to help future investigators decide on the most suitable sequencing strategy.


Benchmark genomic big data cancer processing pipelines (M.Sc.)


https://experiment.com/u/oSMmA

A single mutation in one of the three billion nucleotides that make up your genome might make the difference between a healthy life and suffering e.g. from cancer [1,2]. Our collaboration partner - the Department of Pediatric Oncology, Hematology and Clinical Immunology at Düsseldorf University - focuses on specific forms of childhood cancer and uses modern sequencing technologies to decipher and detect those mutations [3,4,5]. In fact, every human carries tens of thousands of mutations - compared to a given reference - and it requires elaborate computational pipelines to narrow down the list of potentially lethal mutations such that their wet lab team can concentrate on the most "promising" ones, i.e. those that might explain why a child has cancer - and ideally inform about possible treatments.

During the last six years, over 1,000 samples have been sequenced, producing 64 TB of raw data. The processing pipeline https://github.com/sjanssen2/spike encompasses over 50 individual computational steps which together took 25 CPU-years to produce meaningful mutation short lists.

However, during those years not only did the computational tools improve, but the human reference genome was also updated to a more accurate version (hg38) [6]. Furthermore, we are going to employ a new sequencing technology, producing even more raw data. With these three changes, we aim to improve resolution such that more of the still unresolved cases can be explained. Unfortunately, it is unclear how strongly they affect the filtration characteristics, and whether they impact positively or negatively on our capability to classify the important mutations [7].

Thorough evaluation of those three changes would be the topic of a master's thesis. Thanks to Düsseldorf's strong wet lab team, we are in the exceptional situation of having validation results (confirmation and rejection) for a high number of mutations, which would make up the gold standard for our evaluation.

If you love digging through really big data, are not afraid of commanding thousands of processors with a single keystroke from a command line, know how to program in a language such as Python, R or Java, and have an interest in boosting our cancer research, send an email to stefan.janssen@computational.bio.uni-giessen.de or stop by my office.

Further reading:

  1. Genomic profiling of Acute lymphoblastic leukemia in ataxia telangiectasia patients reveals tight link between ATM mutations and chromothripsis.
    Ratnaparkhe M. et al. Leukemia 2017.
  2. EBV Negative Lymphoma and Autoimmune Lymphoproliferative Syndrome Like Phenotype Extend the Clinical Spectrum of Primary Immunodeficiency Caused by STK4 Deficiency.
    Schipp C. et al. Frontiers in Immunology 2018.
  3. Genomics and drug profiling of fatal TCF3-HLF-positive acute lymphoblastic leukemia identifies recurrent mutation patterns and therapeutic options.
    Fischer U. et al. Nature Genetics 2015.
  4. Next-generation-sequencing of recurrent childhood high hyperdiploid acute lymphoblastic leukemia reveals mutations typically associated with high risk patients.
    Chen C. et al. Leukemia Research 2015.
  5. Next-generation-sequencing-based risk stratification and identification of new genes involved in structural and sequence variations in near haploid lymphoblastic leukemia.
    Chen C. et al. Genes Chromosomes Cancer 2013.
  6. Similarities and differences between variants called with human reference genome HG19 or HG38.
    Pan B. et al. BMC Bioinformatics 2019.
  7. Standards and Guidelines for Validating Next-Generation Sequencing Bioinformatics Pipeline.
    Roy S. et al. Journal of Molecular Diagnostics 2018.

Quantify folding ensemble differences (M.Sc.)

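Stub: Could differences between the folding ensembles of related RNA sequences be quantified via the Bray-Curtis dissimilarity of their dot plots, or via UniFrac with a tree that reflects neighborhood?

This might also be useful for FlowSoFine as an alternative to Gaussian weighting.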
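The Bray-Curtis idea could be sketched as follows. The representation of a dot plot as a dictionary from base-pair positions to pairing probabilities is an assumption made for this example, as is the premise that positions of both sequences are directly comparable (e.g. equal length or mapped through an alignment).

```python
def bray_curtis(dotplot_a, dotplot_b):
    """Bray-Curtis dissimilarity between two base-pair probability maps.

    Each dot plot maps a base-pair (i, j) to its probability.
    Pairs missing from one dot plot count as probability 0.
    """
    pairs = set(dotplot_a) | set(dotplot_b)
    numerator = sum(
        abs(dotplot_a.get(p, 0.0) - dotplot_b.get(p, 0.0)) for p in pairs
    )
    denominator = sum(
        dotplot_a.get(p, 0.0) + dotplot_b.get(p, 0.0) for p in pairs
    )
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical example values, not predictions of a real folding program.
wild_type = {(1, 10): 0.9, (2, 9): 0.8}
mutant = {(1, 10): 0.9, (3, 8): 0.8}
distance = bray_curtis(wild_type, mutant)
```

The value is 0 for identical ensembles and 1 when no base-pair is shared, so it behaves like the beta-diversity measure known from microbiome analysis.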


Surveys for the "American Gut Project"

The American Gut Project is the world's largest crowd-sourced, citizen science microbiome research project. You can take and send in a sample, and you'll receive a report that shows a "snapshot" of the microbes found in your sample, along with comparisons to the rest of the population. In addition, you'll add to a dataset that supports scientific research on the gut.

To collect these data, we present a general survey that gathers health and lifestyle information from participants (including a COVID-19 specific survey). At the moment, each survey is provided as a single, very long page. A repeated concern from participants is the "survey fatigue" they experience. UI/UX research this past summer suggested that this fatigue may be reduced if questions were presented one at a time with an indication of progress. The surveys themselves are described in JSON using a Vue.js compatible schema, which includes each question, its responses and the response type (e.g., single choice, multiple choice, etc.). Vue.js is currently used for rendering.

Join forces with the American Gut Developers and help improve the user experience to increase data quality and thus foster microbiome research! This programming project will focus on decomposing these structures to present individual questions. It should primarily revolve around some JavaScript modifications, with some HTML and Jinja2 templating.


UTF-8 for RNA

Bellman's GAP cannot parse non-ASCII bases, i.e. base modifications. But these are becoming ever more important, like pseudouridine in BioNTech's COVID-19 vaccine. A few thermodynamic parameters for modified bases are available that might be worth integrating into Vienna's parameters, in addition to enabling UTF-8 parsing in Bellman's GAP.
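A small Python illustration of the underlying difficulty: in UTF-8, a non-ASCII character occupies more than one byte, so any parser that treats each byte as one base miscounts positions. Using the Greek letter Ψ for pseudouridine is an assumption of this example; how Bellman's GAP reads its input internally is not shown here.

```python
# Ψ (U+03A8) stands in for pseudouridine; this notation is an assumption.
sequence = "GGAΨCC"
encoded = sequence.encode("utf-8")

number_of_bases = len(sequence)  # counts code points, one per base
number_of_bytes = len(encoded)   # counts bytes, Ψ needs two of them


def bases(raw: bytes) -> list:
    """Decode a UTF-8 byte string into a list with one entry per base."""
    return list(raw.decode("utf-8"))
```

Decoding first and indexing by character afterwards, as `bases` does, keeps sequence positions aligned with the base-pair indices used by the folding recurrences.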


What's in my sample

Stub: machine learning on the EMP dataset to guess metadata values for novel microbial samples, or to "enrich" existing metadata.


Non-Terminal Report Algebra

Similar to the automatic generation of "enum" algebras, one could think of automatically generated algebras that report non-terminal use. This could be handy for debugging outside code generation.

Plotting algebra

Another use for automatic algebra generation would be "drawing" candidates, e.g. as SVG.

Bioconda for FlowSoFine

Create two (bio)conda packages for the R software FlowSoFine and its app.

TAKEN: QIIME2 wrapper for Dimensionality Reduction techniques (B.Sc.)

Dimensionality reduction is necessary to project data of microbial experiments into 2D or 3D to allow interactive exploration and hypothesis generation. In QIIME2, one of the leading software platforms for microbial analysis, this is currently supported via Principal Coordinate Analysis (PCoA) with subsequent visualization via the Emperor tool. As with every projection, PCoA cannot preserve all pairwise distances which can generate visual artifacts that mislead the analyst. T-distributed Stochastic Neighbor Embedding (t-SNE) is an alternative to PCoA that focuses on local structures. A similar technique is Uniform Manifold Approximation and Projection (UMAP).
Within this project, existing QIIME2 plugins shall be extended such that users can not only use PCoA but also t-SNE and UMAP to compute projections that can be explored via Emperor. Python implementations and conda installable packages for t-SNE and UMAP exist but need to be properly wrapped into the Python3 based QIIME2 platform.


Plasmid hunter

We found an antibiotic resistance gene, encoded on a plasmid, in multiple clinical samples. The plasmid has swapped bacterial hosts and might spread further. In order to assess necessary countermeasures, we want to know whether the very same plasmid has been sequenced elsewhere in the world before - probably as bycatch of metagenomic projects.

We need to develop a system that can crawl through the huge raw data of ENA metagenomic WGS samples (currently 39,521 files totalling 46.0 TB) and test whether reads map to our plasmid. If so, we can create a world map of matches based on the geographic locations of the original sample sites.


TAKEN: Bioinformatics of microbial pattern changes detected by flow cytometry (M.Sc.)

Master's thesis of Nicole Dauzenroth

Background
The microbiota is an essential part of the body of many organisms and interacts with the host in countless ways. This interaction has a strong influence on the host's health status and well-being. Currently, deep sequencing methods such as 16S next-generation sequencing are the gold standard to assess the bacterial microbiome. However, alternative methods are available. A method recently described by Zimmermann et al. (E.J. Immunol. 2016) uses flow cytometry to discriminate bacteria based on DNA staining and size/shape.

At the "Leibniz Research Institute for Environmental Medicine" in Düsseldorf, we used this method and developed a novel and easy approach to evaluate characteristic features and differences between given microbiota samples. In our approach we use hexagonal binning across the bivariate flow cytometry data and the resulting hexagonal gates for dissimilarity calculations.
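The binning idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: pointy-top hexagons addressed by axial coordinates, relative event frequencies per hexagon, and Bray-Curtis as the dissimilarity. The function names are hypothetical and the actual implementation of our approach may differ in all of these choices.

```python
import math
from collections import Counter


def _cube_round(q, r):
    """Round fractional axial coordinates to the nearest hexagon."""
    s = -q - r
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    # Re-derive the coordinate with the largest rounding error
    # so that q + r + s == 0 holds again.
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return rq, rr


def hex_bin(x, y, size):
    """Axial coordinates (q, r) of the pointy-top hexagon containing (x, y)."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    return _cube_round(q, r)


def hex_profile(events, size):
    """Relative frequency of (x, y) events per hexagon."""
    counts = Counter(hex_bin(x, y, size) for x, y in events)
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}


def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity between two hexagon profiles."""
    cells = set(profile_a) | set(profile_b)
    diff = sum(abs(profile_a.get(c, 0.0) - profile_b.get(c, 0.0)) for c in cells)
    total = sum(profile_a.values()) + sum(profile_b.values())
    return diff / total if total > 0 else 0.0
```

The `size` parameter (distance from hexagon center to corner) controls the number of hexagons and is therefore the knob whose influence on clustering and distance matrices would be studied.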

In the context of a master's thesis, our newly developed approach should be compared to established tools for flow cytometry data analysis (FlowEMMi, FlowDiv, ...). Furthermore, the influence of the number of hexagons on automatic and representative clustering, as well as on the distance matrix of the treatment groups, should be analysed. Additionally, correlation analyses with sequencing data are envisaged.

Data
Flow cytometry files (*.fcs) of samples from murine microbiota. The microbiota was manipulated by different treatments (e.g. dioxin, nanoparticles, diets). Corresponding next generation sequencing data for correlation analysis.

Requirements
Knowledge of the R statistical programming language and/or Python. Creativity and persistence in solving data processing problems. An open mind for biological questions and the communication skills to explain your work to researchers from different fields.

Contact
For more information or questions, please contact or (IUF Düsseldorf)


TAKEN: Mine Qiita (B.Sc.)

Qiita aggregates over 230,000 publicly available microbiome samples. The majority was created following the Earth Microbiome Project protocol, i.e. using Illumina short read sequencing to obtain V4 amplicon data. These samples from hundreds of studies cover diverse ecosystems and thus capture large portions of bacterial diversity. Many of these bacteria are not culturable and don't even have a name.

Phylogenetically placing (see "SILVA-SEPP" project) all the sequences (approx. 16,000,000) of those microbiota into the same reference tree, e.g. Greengenes, should reveal interesting trends. All available references are known to be incomplete. Hot spots in the tree, where many different sequences accumulate, would point to potentially novel clades. Grafting de-novo trees of those sequences onto the reference might be a practical means to sharpen phylogenetic metrics or to suggest sub-sets of original samples that might be worth subjecting to whole genome shotgun sequencing to shed light on unseen organisms.

The project's aim is to develop a methodology to mine this rich dataset and highlight "interesting" patterns in the phylogenetic reference tree.

TAKEN: Influence of the genetic background on murine gut microbiome composition and diversity over three generations (M.Sc.)

The mammalian gut is a complex ecosystem harbouring approximately 10^14 microorganisms that play an important role in health and disease of the host. The main factors that contribute to the inter-individual variation of the intestinal microbiota are the environment, diet, age, gender and genotype.

Whether and to what extent the host genotype shapes the gut microbiome is still a subject of debate. While some studies attribute a decisive role to the genotype, others demonstrate that the foster-mother's gut microbiota, rather than the genotype, is essential for the future composition of the gut microbiota. However, these conclusions were drawn as secondary findings in settings that pursued other objectives and used different methods.

Thus, the main goal of this project is to study the drift of the gut microbiome of C57BL/6J and BALB/c inbred mice, which were naturally and simultaneously colonized with the same microbiome, in relation to host genotype, sex and cage. To this end, C57BL/6J and BALB/c embryos were implanted into B6CF1 recipient foster-mothers in order to obtain parental C57BL/6J and BALB/c mice naturally colonised with the same gut microbiota as the foster-mothers. The two mouse strains were bred completely separated from each other under the standardised environmental conditions of individually ventilated cages (IVCs) over three generations. The standardisation of the environmental conditions emphasises the role of intrinsic factors such as genotype or sex in microbiome variation. For the parental as well as each of the three following generations, the composition of the gut microbiome was recorded at 7 and 15 weeks of age by next-generation sequencing of the V3-V4 regions of the 16S rRNA genes isolated from fecal and cecal samples.

This 16S analysis project requires applying multiple alpha- and beta-diversity metrics to quantify the complexity and diversity of the different microbial communities. Furthermore, differentially abundant features must be identified via discrete false discovery rates or similar statistical tools. These analyses across the two genetic backgrounds will allow us to conclude whether the microbiome of C57BL/6J and BALB/c mice is shaped by the host genotype.

In addition, measurements of the cytokine levels defining the Th1/Th2 immune response in serum and of the calprotectin concentration in the cecal content will be used to correlate possible phenotypic outcomes with the gut microbiome. Overall, these aspects are essential for different pathologies in which the microbiota seems to be involved and a contribution of host genetics is assumed.
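As one concrete example of an alpha-diversity metric, the Shannon index of a single sample can be computed from its feature counts as sketched below. The choice of logarithm base 2 and the example counts are assumptions of this sketch; in the project itself, established implementations would be used rather than hand-written ones.

```python
import math


def shannon(counts):
    """Shannon diversity (base 2) of one sample given its feature counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # Features with zero count contribute nothing to the sum.
    return -sum(
        (c / total) * math.log2(c / total) for c in counts if c > 0
    )


# Hypothetical count vectors for two samples.
even_sample = [10, 10, 10, 10]
dominated_sample = [37, 1, 1, 1]
```

A perfectly even community of four features yields the maximum of 2 bits, while a community dominated by one feature scores lower, which is the kind of difference the thesis would compare between genotypes, sexes and cages.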


TAKEN: Selection of suitable reference genomes for downstream analysis of bacterial cohorts (M.Sc.)

The enormous success and ubiquitous application of next and third generation sequencing has led to a large number of available high-quality draft and complete microbial genomes in the public databases. Today, the NCBI RefSeq database contains ~16,000 complete bacterial genomes. Concurrently, the selection of appropriate reference genomes (RGs) is increasingly important as it has enormous implications for routine in-silico analyses, as for example in detection of single nucleotide polymorphisms, scaffolding of draft assemblies, comparative genomics, etc. To address this issue many databases, methods and tools have been published in recent years e.g. RefSeq, DNA-DNA hybridization, average nucleotide identity (ANI) as well as percentage of conserved DNA values and kmer hashing methods (Mash). Nevertheless, the sheer amount of currently available databases and potential RGs contained therein, together with the plethora of tools available, often requires manual selection of the most suitable RGs.

To tackle this issue, the bioinformatics command line tool ReferenceSeeker (https://github.com/oschwengers/referenceseeker) was recently designed and implemented. It combines a fast k-mer profile-based lookup of candidate reference genomes (CRGs) from high-quality databases with rapid computation of (mutual) highly specific ANI and conserved DNA values in a scalable implementation.

As the analysis of cohorts of microbial genomes becomes increasingly important (viral & bacterial outbreaks), this approach should be extended from providing the m best RGs for a single query genome to providing the best m RGs for a cohort of n query genomes.

Therefore, new scoring metrics and ranking methods need to be tested and evaluated. In addition, a precise assessment of the impact of the RG selection on subsequent analyses (SNP detection & creation of phylogenetic trees) is an interesting question which deserves further attention.
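To make the cohort question tangible, here is a deliberately naive sketch that ranks candidate references by their mean k-mer Jaccard similarity to all query genomes. It is not how ReferenceSeeker works internally; exact k-mer sets, the mean as aggregation and all names are assumptions chosen for illustration, and finding better scoring and ranking schemes is precisely the open question.

```python
def kmers(sequence, k=4):
    """Set of all k-mers contained in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_references(queries, references, m=1, k=4):
    """Names of the m references most similar, on average, to all queries.

    queries: list of query genome sequences (the cohort)
    references: dict mapping reference name -> sequence
    """
    if not queries:
        raise ValueError("at least one query genome is required")
    query_sets = [kmers(q, k) for q in queries]
    scores = {}
    for name, sequence in references.items():
        ref_set = kmers(sequence, k)
        scores[name] = sum(jaccard(ref_set, qs) for qs in query_sets) / len(query_sets)
    return sorted(scores, key=scores.get, reverse=True)[:m]


# Hypothetical toy sequences; real genomes are millions of bases long.
cohort = ["ACGTACGTAA", "ACGTACGTAC"]
candidates = {"refA": "ACGTACGTAC", "refB": "TTTTGGGGCC"}
best = rank_references(cohort, candidates, m=1)
```

Replacing the mean by, for instance, the minimum over all queries would favour references that are acceptable for every cohort member instead of excellent for most of them.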


TAKEN: QC 16S trimming

16S NGS data often have not been trimmed, or were uploaded to ENA untrimmed. When such data are imported into Qiita, the user should be warned about this situation. How can it be detected?

Idea: map fragments to the Greengenes (GG) representative set and check whether leading mismatches occur.
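A minimal sketch of the mismatch check, assuming that each read has already been mapped and is available together with the reference segment it aligns to, position by position and without gaps. The function names, the run length and the thresholds are hypothetical choices for illustration.

```python
def leading_unmatched(read, reference, min_run=5):
    """Number of positions before the first run of `min_run` matches.

    Both strings are assumed to be aligned position by position (no gaps).
    Returns the number of compared positions if no such run exists.
    """
    n = min(len(read), len(reference))
    run = 0
    for i in range(n):
        if read[i] == reference[i]:
            run += 1
            if run == min_run:
                return i - min_run + 1
        else:
            run = 0
    return n


def fraction_suspicious(pairs, min_leading=3, min_run=5):
    """Fraction of (read, reference) pairs with a long unmatched prefix."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    flagged = sum(
        1 for read, ref in pairs
        if leading_unmatched(read, ref, min_run) >= min_leading
    )
    return flagged / len(pairs)
```

If a large fraction of reads in a study shows such an unmatched prefix, Qiita could display a warning that the data appear to be untrimmed.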


Grammar to SVG

We think of the search space of an optimization problem in ADP in terms of a tree grammar. Such grammars can be drawn as collections of little trees / forests, which is a great way to communicate design decisions to others. However, for the actual computer program, we need to translate these drawings into something the machine can handle, i.e. ASCII text. To keep program and design in sync, it would be great to extend our compiler gapc such that it can directly generate SVG graphics from the ASCII version of the tree grammars.

TAKEN: CLI for RNAhybrid 3.0

The program RNAhybrid predicts potential targets for miRNAs. It was written in the early days of ADP in Haskell (version 1) and was ported into C via ADPc (version 2) some 18 years ago. It is time to lift this program into Bellman's GAP with new energy parameters, temperature modifications, fixes to the underlying grammar, and many more algorithmic tricks.

Besides the core algorithm, we need a modern command line interface (CLI) written in Python, accompanied by a rich test bed and high-quality documentation.


TAKEN: Benchmark for RNAhybrid 3.0

The program RNAhybrid predicts potential targets for miRNAs. It was written in the early days of ADP in Haskell (version 1) and was ported into C via ADPc (version 2) some 18 years ago. It is time to lift this program into Bellman's GAP with new energy parameters, temperature modifications, fixes to the underlying grammar, and many more algorithmic tricks. To ensure that our reimplementation faithfully reproduces the original version(s), we need to set up a larger test set and create a system that runs the different software versions and compares their results for equality. The next step would be to test whether changes to the algorithm improve prediction accuracy, which we would ideally test with the same system.
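The core of such a comparison system could look like the sketch below. To stay self-contained, each software version is represented by a plain Python callable; in practice these callables would wrap calls to the actual binaries and parse their output, and both the names and the result format are assumptions of this example.

```python
def compare_versions(implementations, test_inputs):
    """Run every implementation on every input and collect disagreements.

    implementations: dict mapping a version name to a callable(input) -> result
    test_inputs: iterable of hashable inputs
    Results must be hashable so that they can be compared as a set.
    Returns a dict mapping each input with differing results to
    the per-version results.
    """
    disagreements = {}
    for item in test_inputs:
        results = {name: run(item) for name, run in implementations.items()}
        if len(set(results.values())) > 1:
            disagreements[item] = results
    return disagreements


# Stand-ins for two program versions; "v3" deviates on one input.
versions = {
    "v2": lambda seq: len(seq),
    "v3": lambda seq: len(seq) if seq != "GGG" else -1,
}
report = compare_versions(versions, ["ACGU", "GGG"])
```

An empty report means all versions agree on the whole test set; any entry pinpoints an input that needs closer inspection.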

Additional open thesis topics are offered by our partner lab of Prof. Dr. Alexander Goesmann for Bioinformatics & Systems Biology. You will find open topics here.