PhD project of Julian Hahnfeld: Exploring Small Proteins: Advancing Prediction with Deep Learning for Bacterial sORFs and Specialized Databases

Small proteins with fewer than 100 and, in particular, fewer than 50 amino acids are still largely unexplored. They are encoded by small open reading frames (sORFs) and represent an important but often neglected part of the genetic repertoire of bacteria. In recent years, the development of ribosome profiling protocols has led to an increasing number of newly detected small proteins. Despite this, small proteins are frequently overlooked during computational gene prediction and automated genome annotation. In addition, functional descriptions often cannot be assigned to predicted small proteins because public databases lack homologs with high sequence similarity. For this reason, new approaches for the in silico prediction of bacterial sORFs and small proteins, as well as specialized small protein databases, are needed.

For the prediction approach, deep learning techniques are promising: they have provided excellent results for many conventional biological questions and can learn relevant features directly from sequence data.

Since the gene and protein features of sORFs partly differ from those of longer genes, traditional gene prediction algorithms perform poorly on sORFs, mainly because of high false positive rates. These differing features can be used to develop a new sORF-specific prediction approach. For this purpose, the current state of known sORFs and small proteins in public databases was investigated to find suitable features and potential biases. Promising features, such as the relative GC content of genes, amino acid composition, transcription initiation and termination mechanisms, and physicochemical properties of proteins, were compared between sORFs and long ORFs.
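
As an illustration of how two of the sequence-derived features named above could be computed, the following is a minimal Python sketch and not the project's actual implementation. The function names are hypothetical, and "relative GC content" is assumed here to mean the difference between the GC fraction of an ORF and that of the whole genome; the project may define it differently.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def gc_content(dna: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    dna = dna.upper()
    if not dna:
        return 0.0
    return (dna.count("G") + dna.count("C")) / len(dna)


def relative_gc_content(orf: str, genome_gc: float) -> float:
    """GC fraction of the ORF minus the genome-wide GC fraction.

    Assumption: 'relative' is taken as a simple difference.
    """
    return gc_content(orf) - genome_gc


def amino_acid_composition(protein: str) -> dict[str, float]:
    """Return the relative frequency of each standard amino acid."""
    # Drop a trailing stop symbol if the translation includes one.
    protein = protein.upper().rstrip("*")
    counts = Counter(protein)
    # Only standard residues contribute to the total.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}
```

Feature values of this kind can be computed separately for sORFs and for long ORFs and then compared between the two groups.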

Based on these distinct features, a new deep-learning-based approach for the prediction of sORFs will be developed and analyzed in terms of feature importance and model performance.