Open Topics
Here, you can find open topics for internships, Master’s and Bachelor’s theses, as well as research projects within our group. If you’re interested in any of these topics, please feel free to contact us to discuss your preferences in more detail. If you have your own research topic idea, please don’t hesitate to contact us to discuss and develop your concept further. You can start a research project in our group and develop it for your thesis in a direction aligned with your interests.
Improvements for MCC (Internship)
• Keywords: Metabolic modeling, Software Development, Curation
Overview
Long-term usability is a key attribute of good software. Project and code maintenance, however, can be quite challenging, especially in academia. Recently, Mostolizadeh et al. published the Mass-Charge Curation Tool. To the best of our knowledge, it is the first tool of its kind to automate curation of mass and charge balances in genome-scale metabolic models (GEMs). In the context of GEMs, this is a major advancement because annotating and curating such models is still a manual process. To enhance the user experience and create a codebase that is easy to maintain in the long term, we aim to improve MCC.
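To make the idea of a mass and charge balance concrete, the following is a minimal sketch in plain Python. It is not how MCC works internally; the function names and the small metabolite table are assumptions made for illustration, and the formula parser only handles flat formulas without brackets.

```python
import re
from collections import Counter


def parse_formula(formula: str) -> Counter:
    """Count atoms in a flat formula such as 'C6H12O6' (no brackets, no isotopes)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def imbalance(reaction: dict, metabolites: dict) -> dict:
    """Return the net element and charge surplus of a reaction.

    `reaction` maps metabolite IDs to stoichiometric coefficients
    (negative = consumed). An empty result means the reaction is balanced.
    """
    totals = Counter()
    for met_id, coeff in reaction.items():
        formula, charge = metabolites[met_id]
        for element, n in parse_formula(formula).items():
            totals[element] += coeff * n
        totals["charge"] += coeff * charge
    return {key: value for key, value in totals.items() if value != 0}


# Illustrative data: (formula, charge) per metabolite, using ATP hydrolysis.
METABOLITES = {
    "atp": ("C10H12N5O13P3", -4),
    "h2o": ("H2O", 0),
    "adp": ("C10H12N5O10P2", -3),
    "pi": ("HO4P", -2),
    "h": ("H", 1),
}
balanced = {"atp": -1, "h2o": -1, "adp": 1, "pi": 1, "h": 1}
missing_proton = {"atp": -1, "h2o": -1, "adp": 1, "pi": 1}

print(imbalance(balanced, METABOLITES))        # {}
print(imbalance(missing_proton, METABOLITES))  # {'H': -1, 'charge': -1}
```

A reaction that has lost its proton shows up with one hydrogen and one unit of charge missing on the product side, which is the kind of inconsistency that curation has to resolve.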
Objective
We identified several ways to improve the MCC code base. In addition to improving the code’s general performance, we plan to enhance its structure and replace some dependencies. We hope these changes will improve the user and developer experience.
- Use uv as a modern project management tool
- Replace pandas with polars under the hood
- Implement a command-line interface
- Download databases instead of issuing API requests
- Make general code improvements
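As a rough idea of what a command-line interface could look like, here is a minimal sketch based on Python's standard argparse module. The program name, the option names, and the defaults are hypothetical placeholders and do not describe the existing MCC code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # All option names below are hypothetical, not an existing MCC interface.
    parser = argparse.ArgumentParser(
        prog="mcc",
        description="Curate mass and charge balance of a metabolic model.",
    )
    parser.add_argument("model", help="path to the input SBML model")
    parser.add_argument("-o", "--output", default="curated.xml",
                        help="path for the curated model")
    parser.add_argument("--data-dir", default="data",
                        help="directory holding downloaded database files")
    parser.add_argument("--update-db", action="store_true",
                        help="refresh the local database files before curation")
    return parser


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    # A real implementation would call the curation routine here.
    print(f"curating {args.model} -> {args.output} (databases in {args.data_dir})")
    return 0
```

Keeping the parser in its own function makes it easy to test the interface without running the curation itself, for example with `build_parser().parse_args(["model.xml", "--update-db"])`.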
Requirements
• Good knowledge of Python is essential
• Basic knowledge of metabolic models
• Initial experience with libraries like polars or numpy is a plus
References:
• MCC
• Polars on GitHub
• UV on GitHub
Contact: lukas.beierle
________________________________________________________________________________
Assessment of the Functionality in the Reconstructed Genome-scale Metabolic Models (Internship / Thesis)
• Keywords: Python, metabolic modeling, MEMOTE, Pathways
Overview
The MEMOTE score serves as the standard for evaluating genome-scale metabolic models and is widely accepted within systems biology. However, this score exhibits certain limitations. Its primary components are stoichiometric consistency and model annotation. While the MEMOTE score reflects the degree of curation of a metabolic model, it does not assess model functionality or the underlying metabolic processes. This project aims to introduce a set of complementary metrics designed to evaluate and compare metabolic models based on the metabolic functions they represent.
Objective
A set of complementary metrics has been defined to assess the underlying genome and metabolism represented in a metabolic model. The objective is to quantify the extent to which a model captures an organism’s metabolic potential. These metrics combine annotations from the reference genome or proteome with pathway databases. Several metrics can be computed directly from the model using established methods, including flux balance analysis and flux variability analysis. A detailed list of the metrics is available upon request. The metrics are also intended to facilitate the comparison of multiple metabolic models.
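To illustrate what a pathway-based metric could look like, the sketch below computes the fraction of each pathway's reactions that appear in a model. This is only an assumed example and not one of the project's actual metrics (that list is available upon request); the function name and the reaction identifiers are placeholders.

```python
def pathway_coverage(model_reactions: set, pathways: dict) -> dict:
    """Fraction of each pathway's reactions that are present in the model."""
    return {
        name: len(reactions & model_reactions) / len(reactions)
        for name, reactions in pathways.items()
        if reactions  # skip empty pathway definitions
    }


# Placeholder identifiers; real input would come from a pathway database.
pathways = {
    "glycolysis": {"PGI", "PFK", "FBA", "TPI"},
    "tca": {"CS", "ACONT", "ICDH", "AKGDH"},
}
model_reactions = {"PGI", "PFK", "FBA", "CS", "ICDH"}

coverage = pathway_coverage(model_reactions, pathways)
print(coverage)  # {'glycolysis': 0.75, 'tca': 0.5}
```

A score like this only says whether reactions are present. Whether they can carry flux would have to be checked separately, for example with flux balance analysis.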
Requirements
• A good knowledge of Python is mandatory.
• Initial experience with metabolic models in SBML format is a plus.
• Knowledge of or experience with databases like KEGG, BRENDA, or MetaCyc is also a plus.
References:
• MEMOTE on GitHub
• Cobrapy on GitHub
Contact: lukas.beierle
________________________________________________________________________________
Curation of gene annotation in metabolic models (Internship)
• Keywords: Metabolic models, Reference Genome
Overview
The annotation and curation of metabolic models require significant time and effort, particularly when annotating the model's genes with high accuracy. Recreating or reannotating most existing metabolic models is challenging because reference genomes, gene identifiers, and annotations change over time. This project aims to systematically review the current state of gene annotations in genome-scale metabolic models (GEMs) and assess the need for updates.
Objective
The project will calculate gene and annotation coverage metrics using a set of reference GEMs and their corresponding source genomes. These metrics will inform decisions on whether to update primary gene identifiers for specific GEMs. In addition to evaluating annotation quality, the project aims to incorporate missing annotations into the models using existing data. The resulting application will enable users to either populate missing annotation fields or update existing identifiers to the latest versions.
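The two kinds of coverage mentioned above could be computed roughly as follows. This is a minimal sketch under the assumption that gene identifiers and annotation fields have already been extracted into plain Python sets and dictionaries; the function names and the example identifiers are hypothetical.

```python
def gene_coverage(model_genes: set, genome_genes: set) -> float:
    """Share of model gene IDs that still exist in the current reference genome."""
    if not model_genes:
        return 0.0
    return len(model_genes & genome_genes) / len(model_genes)


def annotation_coverage(annotations: dict, key: str) -> float:
    """Share of model genes that carry a non-empty annotation for `key`."""
    if not annotations:
        return 0.0
    annotated = sum(1 for fields in annotations.values() if fields.get(key))
    return annotated / len(annotations)


# Placeholder identifiers for illustration only.
model_genes = {"g1", "g2", "g3", "g4"}
genome_genes = {"g1", "g2", "g3", "g9"}
annotations = {
    "g1": {"uniprot": "id_a"},
    "g2": {"uniprot": ""},
    "g3": {"uniprot": "id_c"},
    "g4": {},
}

print(gene_coverage(model_genes, genome_genes))     # 0.75
print(annotation_coverage(annotations, "uniprot"))  # 0.5
```

A low gene coverage would hint that the primary identifiers are outdated, while a low annotation coverage points to fields that could be filled from existing data.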
Requirements
• Good knowledge of Python is essential.
• Initial experience with metabolic models in SBML/cobrapy is a plus.
• Basic knowledge about genome file formats is also a plus.
Contact: lukas.beierle
________________________________________________________________________________
Antiviral peptide simulation against Epstein-Barr virus (Thesis)
• Keywords: Molecular dynamics simulation, Epstein-Barr virus, antiviral peptides
Overview
This project is planned as a collaboration between our research group and the group of Prof. Franz Cemič at the University of Applied Sciences (THM) in Giessen. The primary goal is to identify an antiviral peptide from literature or databases that demonstrates activity against the Epstein-Barr virus. Subsequently, the project will simulate the peptide’s attachment or binding to the viral protein surface.
Objective
The initial stage involves identifying an antiviral peptide with experimentally validated activity against Epstein-Barr virus (EBV) by searching relevant literature and specialized databases, such as AVPdb. The subsequent step is to select an appropriate simulation target, such as blocking or binding to viral surface proteins or inducing membrane lysis. The final stage consists of setting up and conducting the simulations.
Requirements
• A strong background in maths or physics is recommended.
• Knowledge of GROMACS or other MD-simulation software is a plus.
• Knowledge of the Linux command-line and Python is required.
Contact: lukas.beierle
________________________________________________________________________________
The impact of annotation quality on genome-scale metabolic reconstructions (Internship / Thesis)
• Keywords: Metabolic modeling, Metabolic reconstruction, genome annotation
Overview
Annotated genomes are essential for constructing draft genome-scale metabolic models. This project aims to investigate how annotation quality and completeness affect the resulting draft reconstructions. To this end, we will create test cases based on well-annotated genomes that have been used to create GEMs. These test cases will assess the impact of missing, incomplete, or incorrect annotations on reconstruction tools. The project will use the CarveMe and gapseq tools for draft reconstruction.
Objective
The initial step involves precisely defining the test cases and determining the methods for modifying genome annotations, such as manual or random alterations. The analysis may be extended to include the core and pan genomes of the selected organisms. The resulting draft reconstructions will be evaluated for functionality and completeness.
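One way to generate a random alteration is to remove the annotation from a fixed fraction of genes in a reproducible manner. The sketch below assumes annotations are available as a simple gene-to-label dictionary; the function name, the seed handling, and the example labels are illustrative choices and not a fixed part of the project.

```python
import random


def drop_annotations(annotations: dict, fraction: float, seed: int = 0) -> dict:
    """Return a copy in which a random `fraction` of genes has lost its annotation."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # a fixed seed keeps the test case reproducible
    n_drop = round(fraction * len(annotations))
    dropped = set(rng.sample(sorted(annotations), n_drop))
    return {
        gene: ("" if gene in dropped else label)
        for gene, label in annotations.items()
    }


# Placeholder annotations for ten genes.
original = {f"gene{i}": f"function_{i}" for i in range(10)}
degraded = drop_annotations(original, fraction=0.5, seed=42)

print(sum(1 for label in degraded.values() if not label))  # 5
```

Running the reconstruction tools on several such degraded copies, each with a different seed or fraction, would give a first impression of how sensitive the drafts are to missing annotations.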
Requirements
• A good knowledge of Python is recommended.
• Any knowledge about genome annotation or metabolic models is a plus.
References:
• Carveme on GitHub
• Gapseq on GitHub
Contact: lukas.beierle
________________________________________________________________________________
Large-scale Literature and Text-mining pipeline (Internship (master) / Thesis)
• Keywords: Python, PubMed, Text-mining, Literature crawling
Overview
Text and literature mining is a flexible tool for knowledge discovery with numerous applications. Previously, a prototype workflow was developed to identify closely related publications from a set of references. The workflow utilized the titles, keywords, and abstracts of all publications available in the download section of the PubMed literature database. The application was trained on a dataset of reference publications using a support vector machine. Text, titles, and abstracts were embedded using a pretrained large language model (LLM). To enhance user experience and eliminate dependency on reference publications, the project aims to create a precomputed search index of all text documents. This central index will offer greater flexibility and can be updated with new publications as required. The next step is to identify an appropriate search framework or algorithm, such as vector databases or retrieval-augmented generation (RAG) frameworks that utilize large language models (LLMs) to query the search index.
Objective
The first step is to implement a small download mechanism that will allow us to update the search index based on PubMed database releases. Next, we will compare tools such as FAISS to identify an appropriate, user-friendly data structure for the search index. The third step is to implement the search function to identify related publications or extract specific information directly from the index. Future plans include extending the search function to support pre-selection of full-text PDFs, potentially using additional tools such as Docling.
Additionally, this pipeline will be integrated into the NBREATH-DB database to facilitate information updates and reduce manual workload.
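The core of such a search index can be illustrated without any external library. The sketch below is a brute-force cosine-similarity search in plain Python, used here as a simple stand-in for a dedicated tool such as FAISS; the class name, the toy three-dimensional vectors, and the document identifiers are assumptions for illustration.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equally long vectors (0.0 if one is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class FlatIndex:
    """Brute-force index: stores every vector and scans all of them per query."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add(self, doc_id, vector):
        self.ids.append(doc_id)
        self.vectors.append(list(vector))

    def search(self, query, k=3):
        scored = [(cosine(query, vec), doc_id)
                  for doc_id, vec in zip(self.ids, self.vectors)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(doc_id, score) for score, doc_id in scored[:k]]


# Toy embeddings; real ones would come from a pretrained language model.
index = FlatIndex()
index.add("doc_a", [0.9, 0.1, 0.0])
index.add("doc_b", [0.0, 1.0, 0.0])
index.add("doc_c", [0.7, 0.7, 0.0])

hits = index.search([1.0, 0.0, 0.0], k=2)
print([doc_id for doc_id, _ in hits])  # ['doc_a', 'doc_c']
```

Because new documents are simply appended, an index of this kind can be updated after each PubMed release. A scan over all vectors becomes slow for millions of abstracts, which is the main reason to compare specialized index structures.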
Requirements
• A good knowledge of Python is recommended.
• Basic knowledge of LLMs, text mining, or vector databases is a plus.
References:
• FAISS on GitHub
• pubmed_parser on GitHub
• Docling on GitHub
Contact: lukas.beierle
________________________________________________________________________________
Architectures for deep learning based antimicrobial peptide generation (Internship / Thesis)
• Keywords: neural networks, dense, convolutional, recurrent, embeddings
Overview
The demand for new antibiotic treatments or alternative medications remains unmet. In a recent work (see the preprint below), we compared different model types for their ability to generate antimicrobial peptides, a promising class of molecules for new antibiotics. In this project, we want to go a step further and examine whether the architectures of these models differ clearly in how well they generate sequences. Therefore, a systematic comparison of different neural network architectures for variational and Wasserstein autoencoders is planned.
Objective
We want to systematically compare several neural network architectures with respect to sequence generation. Here, architecture refers to the type of layers used: dense, convolutional, or recurrent. In simple terms, we want to see whether a single architecture performs best for a given autoencoder. Different sequence encodings can also be tested alongside the architectures.
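As an example of a sequence encoding, the sketch below turns a peptide into a one-hot matrix, which is a common input format for dense, convolutional, and recurrent layers. It is only an illustration; the padding scheme, the function name, and the example sequence are assumptions and do not reflect the encodings used in our previous work.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(sequence: str, max_len: int) -> list:
    """Encode a peptide as a max_len x 20 matrix, zero-padded on the right."""
    if len(sequence) > max_len:
        raise ValueError("sequence is longer than max_len")
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for position, residue in enumerate(sequence.upper()):
        # Non-standard residues raise a KeyError in this simple version.
        matrix[position][index[residue]] = 1
    return matrix


encoded = one_hot("GIGK", max_len=6)
print(len(encoded), len(encoded[0]))  # 6 20
```

Fixing `max_len` gives every peptide the same input shape, so the same encoded data set can be fed to all architectures under comparison.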
Requirements
• Good knowledge of Python and the Linux command-line is essential
• Knowledge about workflow management systems like nextflow or snakemake is an advantage
• Knowledge about genome-scale metabolic models is also a plus
References:
• metaGEM on GitHub
• CarveMe on GitHub (metabolic model creation tool)
• Zorrilla, Francisco, et al. "metaGEM: reconstruction of genome scale metabolic models directly from metagenomes." Nucleic acids research 49.21 (2021): e126-e126. (It is for metagenomics)
Contact: lukas.beierle
________________________________________________________________________________
pymCADRE (Internship)
• Keywords: mCADRE, algorithms, optimization, Python packaging
Overview
In certain cases, analyzing the metabolism of a particular tissue or cell requires developing a context-specific metabolic model. This process is frequently referred to as contextualization. In previous studies, members of our research group contributed to the development of pymCADRE, a contextualization algorithm for metabolic models. pymCADRE is essentially the Python implementation of the original mCADRE algorithm, which was written in MATLAB. In certain studies, mCADRE-based algorithms have demonstrated superior accuracy in contextualizing mammalian cells compared to analogous algorithms. The objective of this project is to modernize and refine pymCADRE's implementation, with the overarching goal of enhancing its performance, usability, and long-term maintainability.
Objective
A review of the current implementation of pymCADRE is necessary to assess its quality and performance. In addition to dependency updates, changes that could improve runtime and computational performance should be considered. To improve maintainability, a modern Python packaging toolchain, such as uv, should be integrated into the project. Additional tests with pytest are under consideration, along with a comprehensive user and developer manual.
Requirements
• A very good knowledge of Python is mandatory.
• Initial experience with the following libraries is also an advantage: cobrapy and numpy (optional: numba and polars).
• A good knowledge of algorithms and mathematics is also required.
• Knowledge about Python packaging is also recommended.
• Initial experience with contextualization or metabolic models is a plus.
References:
• mCADRE publication
• pymCADRE publication
• pymCADRE GitHub
• UV tool on GitHub
Contact: lukas.beierle
________________________________________________________________________________
Metabolic tasks for model validation (Internship)
• Keywords: metabolic models, metabolic reactions, simulation
Overview
Metabolic models constitute a versatile framework for in silico studies of metabolic processes. Recent studies have demonstrated that incorporating specific metabolic tasks during model development and validation can yield beneficial outcomes. In general, these tasks refer to essential reactions that most cells and tissues must perform. Richelle et al. published a list of 210 of these tasks.
Objective
The objective of this project is to implement a Python script that evaluates a cobrapy model against the 210 metabolic tasks. The script must have no external dependencies other than cobrapy. The model's evaluation across all tasks should be summarized or visualized at the end.
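The final summary step could be as simple as the sketch below, which groups task outcomes and counts the passes per group. It assumes the evaluation itself has already produced one record per task; the record layout, the group names, and the function names are hypothetical.

```python
from collections import defaultdict


def summarize(results) -> dict:
    """Count passed and total tasks per group from (task_id, group, passed) records."""
    counts = defaultdict(lambda: [0, 0])  # group -> [passed, total]
    for _task_id, group, passed in results:
        counts[group][1] += 1
        counts[group][0] += int(bool(passed))
    return {group: (ok, total) for group, (ok, total) in counts.items()}


def format_report(summary: dict) -> str:
    """Render one line per group plus an overall line."""
    lines = [f"{group}: {ok}/{total} tasks passed"
             for group, (ok, total) in sorted(summary.items())]
    ok_all = sum(ok for ok, _ in summary.values())
    total_all = sum(total for _, total in summary.values())
    lines.append(f"overall: {ok_all}/{total_all} tasks passed")
    return "\n".join(lines)


# Placeholder outcomes; real ones would come from testing the model.
results = [
    ("task_1", "energy", True),
    ("task_2", "energy", False),
    ("task_3", "lipids", True),
]
summary = summarize(results)
print(format_report(summary))
```

Since only the standard library is used, this part adds no dependency beyond cobrapy.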
Requirements
• Good knowledge of Python is essential.
• Initial experience with SBML models or cobrapy is a plus.
References:
• Troppo, a tool implementing these tasks
• Gopalakrishnan et al.
• Richelle et al.
Contact: lukas.beierle
________________________________________________________________________________
Proof of concept: Namespace translation for metabolic models (Internship)
• Keywords: Metabolic models, BiGG, VMH, MetaNetX, data mining
Overview
The primary namespace of a metabolic model is derived from the identifiers of the database used to create the model. This can complicate subsequent tasks, including model evaluation and contextualization, because some systems biology applications require a specific namespace and are incompatible with the namespaces of other databases. Currently, no application is available to translate a model's primary namespace from one database to another. A critical concern is that entities may be lost during translation, as not every object in one database is guaranteed to have an entry in the target database.
Objective
The objective of this project is to demonstrate the feasibility of translating one model’s primary namespace to another target namespace. If entities are absent after translation, it is necessary to cross-check them against multiple databases. The present investigation uses the Ensembl Biomart tool and the MetaNetX database as proxies for the most accurate translation.
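The translation step, including the cross-check for absent entities, can be sketched with plain dictionaries. The mapping tables below stand in for cross-references taken from resources such as MetaNetX; the function name and all identifiers are placeholders, not real database entries.

```python
def translate_ids(ids, primary: dict, fallback: dict = None):
    """Map IDs through `primary`, then `fallback`; report what stays unmapped."""
    fallback = fallback or {}
    translated, missing = {}, []
    for source_id in ids:
        target = primary.get(source_id) or fallback.get(source_id)
        if target is None:
            missing.append(source_id)  # no entry in any consulted database
        else:
            translated[source_id] = target
    return translated, missing


# Placeholder cross-reference tables.
primary = {"src_1": "tgt_1"}
fallback = {"src_2": "tgt_2"}

translated, missing = translate_ids(["src_1", "src_2", "src_3"], primary, fallback)
print(translated)  # {'src_1': 'tgt_1', 'src_2': 'tgt_2'}
print(missing)     # ['src_3']
```

Returning the unmapped identifiers explicitly makes it possible to quantify how much of a model would be lost for a given pair of namespaces, which is the central question of the proof of concept.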
Requirements
• Good knowledge of Python is essential.
• Initial experience with SBML models, cobrapy, or metabolic modeling is a plus.
References:
• BiGG database content
• VMH database content
• MetaNetX database content
• Biomart database
• Mergem, a tool for partially translating namespaces
Contact: lukas.beierle
________________________________________________________________________________
Large-scale draft reconstructions of microbial communities (Internship / B.Sc Thesis)
• Keywords: Workflow, reconstruction, genome-scale metabolic models
Overview
A particular concern with genome-scale metabolic models is that the code used to create them is often unavailable. This is especially notable in studies that develop multiple models, such as a series of draft reconstructions or a community model. Reproducibility matters here: if other researchers later adopt a model, they should be able to recreate it with the most up-to-date annotations and references. A reproducible pipeline is also of considerable benefit in research on microbial communities, because it ensures that all models are constructed with the same tools under identical conditions, which makes the results more reliable. Another important aspect of modeling a microbial community is that creating a high-quality model for each member can, in the worst case, take years. The workflow in this project therefore starts from annotated genomes and aims to generate a draft community model with a high level of automation, so that each member has at least a draft model.
Objective
The objective of this project is to develop a small workflow that creates multiple genome-scale metabolic models directly from annotated genomes. This requires a comparative analysis of the available reconstruction and annotation tools, followed by their integration into the workflow. Our group maintains a list of the members of a particular microbial community, which can serve as a direct test case for reconstructing a set of models from the genomes of the community.
Requirements
• Good knowledge of Python and the Linux command-line is essential
• Knowledge about workflow management systems like nextflow or snakemake is an advantage
• Knowledge about genome-scale metabolic models is also a plus
References:
• metaGEM on GitHub
• CarveMe on GitHub (metabolic model creation tool)
• Zorrilla, Francisco, et al. "metaGEM: reconstruction of genome scale metabolic models directly from metagenomes." Nucleic acids research 49.21 (2021): e126-e126. (It is for metagenomics)
Contact: lukas.beierle
________________________________________________________________________________
Quality reporting and visualization of metabolic models (Internship / B.Sc project)
• Keywords: Workflow, visualization, genome-scale metabolic models
Overview
Metabolic models are abstract representations of the complete metabolism of a microorganism, a specific tissue, or a cell. Such a model typically consists of interconnected reactions, metabolites, and genes. These models tend to be very complex, especially for eukaryotic organisms. Moreover, they are not stored in a format that humans can easily access or read, which makes their analysis challenging. A detailed analysis is crucial for understanding metabolic models, yet few tools can generate comprehensive and easily understandable reports on them. This project explores different methods for visualizing the interconnected components of metabolic models, with the objective of improving analysis and making these models more accessible and user-friendly.
Objective
The objective of this project is to produce a comprehensive and accessible report for metabolic models. This encompasses the retrieval of metadata, the analysis of annotations relating to reactions and metabolites in the model, and the visualization of individual components (compartments, reactions, genes). A further consideration is the visualization of the biomass objective function, or more precisely, the objective function of the model, inclusive of all its reactants and products. The final step in the process involves compiling all figures, along with their respective descriptions, into a single PDF file.
Requirements
• Good knowledge of Python
• Knowledge of metabolic models is a plus.
References:
Several tools for the analysis of metabolic models:
• refineGEMs on GitHub
• Memote on GitHub
Sample visualization of different reactions: CORDA algorithm on GitHub
Contact: lukas.beierle
_______________________________________________________________________________
Gene expression in genome-scale metabolic models (Internship / Thesis)
• Keywords: Genome-scale metabolic models, gene expression, modeling
Overview
A substantial proportion of genome-scale metabolic models focuses on prokaryotic organisms. The transition to models of eukaryotic organisms poses a significant challenge. For instance, eukaryotic gene expression can be modeled with a GEM, but each component that interacts with the gene, such as enzymes, cofactors, or transcription factors, must be defined by its own reactions. This results in a sizable collection of sub-reaction networks for expressing individual genes.
Objective
This project aims to develop a proof-of-concept model for the preliminary design of components associated with eukaryotic gene expression reactions. Genome-scale metabolic models of plants may serve as a valuable template. If this succeeds, we aim to incorporate the eukaryotic gene expression reactions into our model of the Epstein-Barr virus.
Requirements
• Basic knowledge of Python and the Linux command-line
• Knowledge about genome-scale metabolic models is an advantage
• No fear of literature research
References:
• Lynch, Michael, and Georgi K. Marinov. "The bioenergetic costs of a gene." Proceedings of the National Academy of Sciences 112.51 (2015): 15690-15695.
• Feist, Adam M., et al. "Reconstruction of biochemical networks in microorganisms." Nature Reviews Microbiology 7.2 (2009): 129-143.
Contact: lukas.beierle
________________________________________________________________________________
Updating / re-implementing popular systems biology software (Internship)
• Keywords: Systems biology, software development
Overview
Like other bioinformatics domains, systems biology relies on open-source software developed by different research groups or individuals. In many cases, these projects are abandoned once their funding ends.
Objective
Most systems biology projects focused on genome-scale metabolic model reconstruction rely on tools of this kind. One example is BOFdat, which creates biomass objective functions. Another is NCMW, a tool for creating nasal microbial communities. The goal is to completely rewrite one of these tools, fixing the bugs and adding the useful features that have accumulated in recent years without any development activity. The precise implementation details and objectives will be defined with the supervisor at the start of the project.
Requirements
• Good knowledge of Python is essential
• Knowledge about the Linux command line and software development is a plus
• Knowledge about the tools mentioned is a plus
References:
• BOFdat on GitHub
• Lachance, Jean-Christophe, et al. "BOFdat: Generating biomass objective functions for genome-scale metabolic models from experimental data." PLoS computational biology 15.4 (2019): e1006971.
• NCMW on GitHub