Quality Control / Clipping
ASA³P provides quality overviews of all sequenced reads before and after quality clipping. During clipping, reads unsuitable for subsequent analysis steps are filtered out. Read quality is measured via FastQC, and a check for potential contamination is conducted via FastQ Screen. Reads sequenced on an Illumina platform are quality clipped with Trimmomatic. Reads sequenced on a Pacific Biosciences platform are not quality clipped, as this is performed internally by the HGAP 4 assembler. ONT reads are quality clipped via Filtlong.
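To illustrate what quality clipping means in practice, the following is a minimal sketch of sliding-window clipping on a single FASTQ record. It is not Trimmomatic's or Filtlong's actual implementation. The function names, the window size, the quality threshold and the minimum length are assumptions chosen for illustration only.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_line]


def sliding_window_clip(sequence, quality_line, window=4, min_mean=20, min_length=5):
    """Cut the read at the first window whose mean quality is below min_mean.

    Returns the clipped (sequence, quality) pair, or None when the remaining
    read is shorter than min_length and would be filtered out.
    All thresholds are illustrative assumptions, not pipeline defaults.
    """
    scores = phred_scores(quality_line)
    cut = len(scores)  # keep the whole read unless a bad window is found
    for start in range(0, max(len(scores) - window + 1, 0)):
        window_scores = scores[start:start + window]
        if sum(window_scores) / window < min_mean:
            cut = start
            break
    if cut < min_length:
        return None
    return sequence[:cut], quality_line[:cut]


if __name__ == "__main__":
    # 'I' encodes Phred 40 and '#' encodes Phred 2 under the Phred+33 offset.
    print(sliding_window_clip("ACGTACGTAC", "IIIIII####"))
    print(sliding_window_clip("ACGT", "####"))
```

Reads whose clipped remainder falls below the minimum length are dropped entirely, which corresponds to the filtering described above.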
The order and orientation of assembled contigs are somewhat arbitrary. During a scaffolding step, ASA³P maps the contigs onto a set of closely related, user-provided reference genomes in order to rearrange them. Taking this additional information into account, a scaffolder can fix order and orientation and merge multiple contigs into scaffolds. For this step ASA³P internally uses MeDuSa, a modern multi-reference scaffolder. Finally, raw contigs as well as oriented and linked scaffolds are mapped onto all provided reference genomes in order to compare the results of this step.
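The idea of fixing order and orientation can be sketched for the simplest case of a single reference genome. This is not MeDuSa's algorithm, which combines several references. The `ContigHit` type and its fields are hypothetical and stand in for whatever a contig-to-reference mapping would report.

```python
from dataclasses import dataclass

# Translation table for complementing nucleotides (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(COMPLEMENT)[::-1]


@dataclass
class ContigHit:
    """Hypothetical summary of where one contig maps on the reference."""
    name: str
    sequence: str
    ref_start: int  # leftmost mapped position on the reference
    strand: str     # "+" or "-"


def order_and_orient(hits):
    """Sort contigs by reference position and flip reverse-strand contigs."""
    ordered = sorted(hits, key=lambda hit: hit.ref_start)
    return [
        (hit.name, hit.sequence if hit.strand == "+" else reverse_complement(hit.sequence))
        for hit in ordered
    ]


if __name__ == "__main__":
    hits = [
        ContigHit("contig_2", "AAGT", 500, "-"),
        ContigHit("contig_1", "ACGG", 10, "+"),
    ]
    for name, seq in order_and_orient(hits):
        print(name, seq)
```

A real scaffolder additionally has to resolve conflicting placements between references and decide which neighbouring contigs can be joined, which this sketch does not attempt.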
To annotate contigs and scaffolds, ASA³P internally uses Prokka. For high-quality annotation, genus-specific information is used: ASA³P uses genus-specific BLAST databases comprising all RefSeq genome annotations related to a certain genus. To further increase annotation quality, ASA³P uses a combination of smaller high-quality databases such as CARD for antimicrobial resistance genes and VFDB for virulence factors.
For the taxonomic classification of bacterial isolates ASA³P uses three distinct methods:
Kmer profile search
16S sequence homology
Comparison of average nucleotide identities (ANI)
Kmer profiles are analyzed via Kraken, and the resulting kmer profile hits are extracted from a custom RefSeq-based database. To search for 16S homology, the pipeline uses Infernal to extract the best-scoring 16S sequence and subsequently queries it against the RDP 16S database.
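As a rough illustration of what a kmer profile is, the sketch below counts overlapping kmers in a sequence and compares two profiles by the fraction of shared kmers. This is a toy comparison under assumed parameters (`k=4`, hypothetical function names). It does not reproduce Kraken's classification algorithm or its database format.

```python
from collections import Counter


def kmer_profile(sequence, k=4):
    """Count all overlapping kmers of length k in a sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def shared_kmer_fraction(query, reference, k=4):
    """Fraction of distinct query kmers that also occur in the reference."""
    query_kmers = set(kmer_profile(query, k))
    reference_kmers = set(kmer_profile(reference, k))
    if not query_kmers:
        return 0.0  # query shorter than k: nothing to compare
    return len(query_kmers & reference_kmers) / len(query_kmers)


if __name__ == "__main__":
    print(kmer_profile("ACGTACGT"))
    print(shared_kmer_fraction("ACGTACGT", "TTACGTAC"))
```

Production classifiers use much longer kmers and indexed databases, so the value of `k` here is only suitable for reading the example.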
Multilocus Sequence Typing (MLST)
MLST is a typing method for closely related bacterial strains within a species. Genomes are searched with BLAST against public databases containing 5 to 7 carefully selected loci for each typed organism. Each combination of alleles determines a unique sequence type.
ASA³P uses its own implementation based on BLASTn and the public database PubMLST.
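The final step of MLST, turning a combination of alleles into a sequence type, amounts to a table lookup. The sketch below assumes the allele numbers per locus have already been determined, for example from BLASTn hits. The locus names and the profile table are invented toy data, not a real PubMLST scheme, and the function is not ASA³P's implementation.

```python
def sequence_type(allele_calls, scheme_loci, profiles):
    """Look up the sequence type for a set of allele calls.

    allele_calls: dict mapping locus name to allele number for one genome
    scheme_loci:  ordered tuple of the loci that make up the typing scheme
    profiles:     dict mapping allele-number tuples to sequence types
    Returns None if a locus is missing or the combination is unknown.
    """
    try:
        key = tuple(allele_calls[locus] for locus in scheme_loci)
    except KeyError:
        return None  # at least one locus could not be called
    return profiles.get(key)


if __name__ == "__main__":
    # Hypothetical three-locus scheme with made-up profiles.
    loci = ("locusA", "locusB", "locusC")
    profiles = {(1, 1, 1): 1, (1, 2, 1): 2}
    print(sequence_type({"locusA": 1, "locusB": 2, "locusC": 1}, loci, profiles))
```

Returning `None` for unknown combinations keeps novel allele profiles distinguishable from successfully typed genomes.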
Antibiotic Resistance Detection (ABR)
There are many different molecular mechanisms of ABR, which poses a major bioinformatic challenge. To address this, ASA³P takes advantage of the Comprehensive Antibiotic Resistance Database (CARD) and its corresponding search tool. CARD provides its own sophisticated ontology to classify detected ABRs. To the best of our knowledge, it is the only database/tool which can detect, classify and describe several different types of ABR, e.g. gene homology and mutation-driven mechanisms.
Virulence Factor (VF) Detection
As VFs have a major impact on whether bacterial strains are harmless or severe pathogens, ASA³P detects potential VFs. To do so, the pipeline searches for VFs using the virulence factor database (VFDB).
To assess an isolate's genome size compared to a reference genome and to subsequently enable the calling of single nucleotide variants, quality-clipped reads are mapped to the reference genome. For reads sequenced on Illumina, ONT and Pacific Biosciences platforms, ASA³P uses Bowtie 2, Minimap2 and blasr, respectively. Finally, the generated Sequence Alignment/Map (SAM) files are converted to ordered Binary Alignment/Map (BAM) files via SAMtools.
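The platform-dependent choice of mapper described above can be expressed as a simple lookup. This is only a sketch of the dispatch logic. The platform keys and the function name are assumptions, and no command lines are built because the exact invocations used by the pipeline are not given here.

```python
# Mapping from sequencing platform to the read mapper named in the text.
# The lowercase platform keys are hypothetical identifiers.
MAPPER_BY_PLATFORM = {
    "illumina": "Bowtie 2",
    "ont": "Minimap2",
    "pacbio": "blasr",
}


def select_mapper(platform):
    """Return the read mapper for a sequencing platform name."""
    key = platform.strip().lower()
    if key not in MAPPER_BY_PLATFORM:
        raise ValueError(f"unsupported sequencing platform: {platform!r}")
    return MAPPER_BY_PLATFORM[key]


if __name__ == "__main__":
    for name in ("Illumina", "ONT", "PacBio"):
        print(name, "->", select_mapper(name))
```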
Single Nucleotide Polymorphism (SNP)
SNP analyses provide variant information at single-nucleotide resolution compared to a reference genome. ASA³P uses the SAMtools tool suite to call SNPs from mapped read files. Genomic variants in the resulting Variant Call Format (VCF) file are then filtered via SnpSift. Finally, the filtered variants are annotated via SnpEff in order to predict their effects.
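To make the filtering step concrete, the sketch below keeps VCF records whose QUAL value (the sixth tab-separated column of a VCF data line) meets a minimum threshold. It is a plain-Python stand-in written for illustration, not SnpSift, and the threshold of 30 is an assumption rather than the pipeline's setting.

```python
def filter_vcf_lines(lines, min_qual=30.0):
    """Keep header lines and records whose QUAL is at least min_qual.

    VCF data lines are tab-separated with the fixed columns
    CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO.
    Records with a missing QUAL (".") are dropped in this sketch.
    """
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # meta-information and column header lines
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]
        if qual != "." and float(qual) >= min_qual:
            kept.append(line)
    return kept


if __name__ == "__main__":
    records = [
        "##fileformat=VCFv4.2",
        "\t".join(["chr1", "100", ".", "A", "G", "50.0", ".", "DP=20"]),
        "\t".join(["chr1", "200", ".", "C", "T", "10.0", ".", "DP=3"]),
    ]
    for line in filter_vcf_lines(records):
        print(line)
```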
Core and Pan Genome
Coding sequences (CDS) of the analysed genomes are clustered and assigned to gene abundance groups via Roary. These groups consist of genes present in all genomes (core), genes present in at least two but not all genomes (accessory) and genes unique to a single genome (singletons).
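The three abundance groups can be derived from a presence table of gene clusters per genome. The following sketch applies the definitions given above to toy data. The function name and the input layout are assumptions, and it does not reproduce Roary's clustering or its output files.

```python
def classify_clusters(presence, genome_count):
    """Assign gene clusters to core, accessory or singleton groups.

    presence:     dict mapping a cluster name to the genomes containing it
    genome_count: total number of analysed genomes
    """
    groups = {"core": [], "accessory": [], "singleton": []}
    for cluster, genomes in presence.items():
        n = len(set(genomes))
        if n == genome_count:
            groups["core"].append(cluster)       # present in all genomes
        elif n >= 2:
            groups["accessory"].append(cluster)  # present in several, not all
        elif n == 1:
            groups["singleton"].append(cluster)  # unique to one genome
    return groups


if __name__ == "__main__":
    # Toy presence table for three hypothetical genomes g1, g2 and g3.
    presence = {
        "geneA": {"g1", "g2", "g3"},
        "geneB": {"g1", "g2"},
        "geneC": {"g3"},
    }
    print(classify_clusters(presence, genome_count=3))
```

The union of all three groups corresponds to the pan genome of the analysed isolates.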