Dinsdale EA, Edwards RA, Hall D, Angly F, Breitbart M, Brulc JM, et al. Functional metagenomic profiling of nine biomes. Nature. 2008;452(7187):629–32. https://doi.org/10.1038/nature06810.
Carlson-Jones JA, Kontos A, Kennedy D, Martin J, Lushington K, McKerral J, et al. The microbial abundance dynamics of the paediatric oral cavity before and after sleep. J Oral Microbiol. 2020;12(1):1741254.
Bartle L, Mitchell JG, Paterson JS. Evaluating the cytometric detection and enumeration of the wine bacterium, Oenococcus oeni. Cytom Part A. 2021;99(4):399–406.
Wattam AR, Davis JJ, Assaf R, Boisvert S, Brettin T, Bun C, et al. Improvements to PATRIC, the all-bacterial bioinformatics database and analysis resource center. Nucleic Acids Res. 2017;45(D1):D535–42. https://doi.org/10.1093/nar/gkw1017.
Haft DH, DiCuccio M, Badretdin A, Brover V, Chetvernin V, O’Neill K, et al. RefSeq: an update on prokaryotic genome annotation and curation. Nucleic Acids Res. 2018;46(D1):D851–60.
Oliveira C, Domingues L. Guidelines to reach high-quality purified recombinant proteins. Appl Microbiol Biotechnol. 2018;102(1):81–92. https://doi.org/10.1007/s00253-017-8623-8.
Consortium GO. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 2004;32(suppl_1):258–61. https://doi.org/10.1093/nar/gkh036.
Overbeek R, Olson R, Pusch GD, Olsen GJ, Davis JJ, Disz T, et al. The SEED and the rapid annotation of microbial genomes using subsystems technology (RAST). Nucleic Acids Res. 2014;42(D1):D206–14.
Tatusova T, DiCuccio M, Badretdin A, Chetvernin V, Nawrocki EP, Zaslavsky L, et al. NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res. 2016;44(14):6614–24. https://doi.org/10.1093/nar/gkw569.
Gough J, Karplus K, Hughey R, Chothia C. Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol. 2001;313(4):903–19.
Antczak M, Michaelis M, Wass MN. Environmental conditions shape the nature of a minimal bacterial genome. Nat Commun. 2019;10(1):3100. https://doi.org/10.1038/s41467-019-10837-2.
Schnoes AM, Ream DC, Thorman AW, Babbitt PC, Friedberg I. Biases in the experimental annotations of protein function and their effect on our understanding of protein function space. PLoS Comput Biol. 2013;9(5):e1003063.
Wen J, Zhang Y, Yau SST. k-mer Sparse matrix model for genetic sequence and its applications in sequence comparison. J Theor Biol. 2014;363:145–50. https://doi.org/10.1016/j.jtbi.2014.08.028.
Zhang Y, Wen J, Yau SST. Phylogenetic analysis of protein sequences based on a novel k-mer natural vector method. Genomics. 2019;111(6):1298–305. https://doi.org/10.1016/j.ygeno.2018.08.010.
Asgari E, Mofrad MR. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE. 2015;10(11):e0141287.
Unsal S, Atas H, Albayrak M, Turhan K, Acar AC, Doğan T. Learning functional properties of proteins with language models. Nat Mach Intell. 2022;4(3):227–45. https://doi.org/10.1038/s42256-022-00457-9.
Cai Y, Wang J, Deng L. SDN2GO: an integrated deep learning model for protein function prediction. Front Bioeng Biotechnol. 2020;8:391. https://doi.org/10.3389/fbioe.2020.00391.
Kim S, Lee H, Kim K, Kang J. Mut2Vec: distributed representation of cancerous mutations. BMC Med Genomics. 2018;11(2):33. https://doi.org/10.1186/s12920-018-0349-7.
Yin R, Luo Z, Zhuang P, Lin Z, Kwoh CK. VirPreNet: a weighted ensemble convolutional neural network for the virulence prediction of influenza A virus using all eight segments. Bioinformatics. 2021;37(6):737–43. https://doi.org/10.1093/bioinformatics/btaa901.
Ostrovsky-Berman M, Frankel B, Polak P, Yaari G. Immune2vec: embedding B/T cell receptor sequences in ℝ^N using natural language processing. Front Immunol. 2021. https://doi.org/10.3389/fimmu.2021.680687.
Heinzinger M, Elnaggar A, Wang Y, Dallago C, Nechaev D, Matthes F, et al. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics. 2019;20(1):723. https://doi.org/10.1186/s12859-019-3220-8.
Nambiar A, Heflin M, Liu S, Maslov S, Hopkins M, Ritz A. Transforming the language of life: transformer neural networks for protein prediction tasks. In: Proceedings of the 11th ACM international conference on bioinformatics, computational biology and health informatics. 2020. pp. 1–8.
Wang D, Zhang Q, Yuan C-A, Qin X, Huang Z-K, Shang L. Motif discovery via convolutional networks with K-mer embedding. In: Huang D-S, Jo K-H, Huang Z-K, editors. Intelligent computing theories and application. Cham: Springer International Publishing; 2019. p. 374–82.
Le NQK, Huynh T-T. Identifying SNAREs by incorporating deep learning architecture and amino acid embedding representation. Front Physiol. 2019;10:1501. https://doi.org/10.3389/fphys.2019.01501.
Parks DH, Chuvochina M, Chaumeil P-A, Rinke C, Mussig AJ, Hugenholtz P. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat Biotechnol. 2020;38(9):1079–86. https://doi.org/10.1038/s41587-020-0501-8.
Wattam AR, Abraham D, Dalay O, Disz TL, Driscoll T, Gabbard JL, et al. PATRIC, the bacterial bioinformatics database and analysis resource. Nucleic Acids Res. 2014;42(D1):D581–91.
Abu-Doleh AA, Al-Jarrah OM, Alkhateeb A. Protein contact map prediction using multi-stage hybrid intelligence inference systems. J Biomed Inform. 2012;45(1):173–83. https://doi.org/10.1016/j.jbi.2011.10.008.
Rives A, Goyal S, Meier J, Guo D, Ott M, Zitnick CL, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. bioRxiv. 2019. p. 622803.
Villegas-Morcillo A, Makrodimitris S, van Ham RCHJ, Gomez AM, Sanchez V, Reinders MJT. Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function. Bioinformatics. 2020. https://doi.org/10.1093/bioinformatics/btaa701.
Boeckmann B, Bairoch A, Apweiler R, Blatter M-C, Estreicher A, Gasteiger E, et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003;31(1):365–70. https://doi.org/10.1093/nar/gkg095.
Asgari E. protVec_100d_3grams.csv. Harvard Dataverse 2015. https://doi.org/10.7910/DVN/JMFHTN/CVPAUK.
Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150–2.
Rehurek R, Sojka P. Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks. Citeseer; 2010.
Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17(3):261–72.
Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992;89(22):10915–9. https://doi.org/10.1073/pnas.89.22.10915.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825–30.
Murphy LR, Wallqvist A, Levy RM. Simplified amino acid alphabets for protein fold recognition and implications for folding. Protein Eng. 2000;13(3):149–52.
Caliński T, Harabasz J. A dendrite method for cluster analysis. Commun Stat. 1974;3(1):1–27. https://doi.org/10.1080/03610927408827101.
Galili T. dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics. 2015;31(22):3718–20.
Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9(11):2579–605.
Sievers F, Higgins DG. Clustal omega, accurate alignment of very large numbers of sequences. In: Multiple sequence alignment methods. Springer; 2014. p. 105–16.
Louca S, Polz MF, Mazel F, Albright MBN, Huber JA, O’Connor MI, et al. Function and functional redundancy in microbial systems. Nat Ecol Evol. 2018;2(6):936–43. https://doi.org/10.1038/s41559-018-0519-1.
Lim JM, Kim G, Levine RL. Methionine in proteins: it’s not just for protein initiation anymore. Neurochem Res. 2019;44(1):247–57.
Bileschi ML, Belanger D, Bryant DH, Sanderson T, Carter B, Sculley D, et al. Using deep learning to annotate the protein universe. Nat Biotechnol. 2022. https://doi.org/10.1038/s41587-021-01179-w.
ElAbd H, Bromberg Y, Hoarfrost A, Lenz T, Franke A, Wendorff M. Amino acid encoding for deep learning applications. BMC Bioinformatics. 2020;21(1):1–14.
Alley EC, Khimulya G, Biswas S, AlQuraishi M, Church GM. Unified rational protein engineering with sequence-based deep representation learning. Nat Methods. 2019;16(12):1315–22. https://doi.org/10.1038/s41592-019-0598-1.
Chiu B, Crichton G, Korhonen A, Pyysalo S. How to train good word embeddings for biomedical NLP. In: Proceedings of the 15th workshop on biomedical natural language processing. 2016. pp. 166–74.
Ghosh S, Chakraborty P, Cohn E, Brownstein JS, Ramakrishnan N. Characterizing diseases from unstructured text: a vocabulary driven word2vec approach. In: Proceedings of the 25th ACM international on conference on information and knowledge management. 2016. pp. 1129–38.
Öztürk H, Ozkirimli E, Özgür A. A novel methodology on distributed representations of proteins using their interacting ligands. Bioinformatics. 2018;34(13):i295–303.
Dusserre E, Padró M. Bigger does not mean better! We prefer specificity. In: IWCS 2017, 12th international conference on computational semantics, short papers. 2017.
Littmann M, Bordin N, Heinzinger M, Schütze K, Dallago C, Orengo C, et al. Clustering FunFams using sequence embeddings improves EC purity. Bioinformatics. 2021;37(20):3449–55.
Seo S, Oh M, Park Y, Kim S. DeepFam: deep learning based alignment-free method for protein family modeling and prediction. Bioinformatics. 2018;34(13):i254–62. https://doi.org/10.1093/bioinformatics/bty275.
Cantu VA, Salamon P, Seguritan V, Redfield J, Salamon D, Edwards RA, et al. PhANNs, a fast and accurate tool and web server to classify phage structural proteins. PLoS Comput Biol. 2020;16(11):e1007845.