Fig. 1

From: KEGG orthology prediction of bacterial proteins using natural language processing

Schematic overview of our pipeline. In this study, we started by collecting KO and non-KO data from the KEGG GENES database to construct our classifier (left). Subsequently, we employed the classifier to mine protein sequences for the identification of potential KOs and used an embedding-based clustering module to assign a specific K number (middle). To validate our results, we performed structural alignment between the candidate KO sequences and the known sequences in the KEGG database (right).