Huiting Chen, Zhaozhong Zhu, Ye Qiu, Xingyi Ge, Heping Zheng and Yousong Peng. Prediction of coronavirus 3C-like protease cleavage sites using machine-learning algorithms[J]. Virologica Sinica, 2022, 37(3): 437-444. doi: 10.1016/j.virs.2022.04.006

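Developing a machine-learning method to predict cleavage sites of the coronavirus 3C-like protease

  • Corresponding author: Yousong Peng, pys2013@hnu.edu.cn
  • Received Date: 2021-11-01
    Accepted Date: 2022-04-02
  • The coronavirus 3C-like protease is a cysteine protease that plays an important role in viral infection and immune escape. However, effective tools for rapidly determining its cleavage sites are still lacking. This study first systematically investigated the diversity of the sites at which the coronavirus 3C-like protease cleaves coronavirus polyproteins, and found that the cleavage sites are highly conserved across genera such as Alphacoronavirus, Betacoronavirus and Gammacoronavirus, with strong amino acid preferences near the cleavage sites. Based on this finding, we used amino acid indexes to represent the amino acid sequence near the cleavage sites and then built a random forest model to predict cleavage sites of the coronavirus 3C-like protease. The model reached an AUC of 0.96 in cross-validation, indicating good predictive performance. To further evaluate its predictive ability, we compiled a separate test dataset consisting of 90 proteins from multiple coronavirus hosts, all experimentally confirmed to be cleaved by the coronavirus 3C-like protease. On this test dataset, the random forest model achieved an AUC of 0.95 and correctly predicted 80% of the cleavage sites, indicating good practical utility. We then used the model to predict that 1,352 human proteins may be cleaved by the coronavirus 3C-like protease; these proteins are enriched in cytoskeleton-related functions such as microtubules and actin. Finally, we built an online server for the random forest model to help predict cleavage sites of the coronavirus 3C-like protease. In summary, this study provides a rapid and effective tool for determining the cleavage sites of the coronavirus 3C-like protease and offers a reference for revealing the molecular mechanisms of coronavirus pathogenicity.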

Prediction of coronavirus 3C-like protease cleavage sites using machine-learning algorithms

  • Corresponding author: Yousong Peng, pys2013@hnu.edu.cn
  • Received Date: 01 November 2021
    Accepted Date: 02 April 2022
  • The coronavirus 3C-like (3CL) protease, a cysteine protease, plays an important role in viral infection and immune escape. However, effective tools for determining the cleavage sites of the 3CL protease are still lacking. This study systematically investigated the diversity of the cleavage sites of the coronavirus 3CL protease on the viral polyprotein, and found that the cleavage motifs were highly conserved for viruses in the genera Alphacoronavirus, Betacoronavirus and Gammacoronavirus. Strong residue preferences were observed at the positions neighboring the cleavage sites. A random forest (RF) model was built to predict the cleavage sites of the coronavirus 3CL protease, with residues in the cleavage motifs represented by amino acid indexes; the model achieved an AUC of 0.96 in cross-validation. The RF model was further tested on an independent test dataset composed of cleavage sites on 99 proteins from multiple coronavirus hosts. It achieved an AUC of 0.95 and correctly predicted 80% of the cleavage sites. The RF model then predicted 1,352 human proteins to be cleaved by the 3CL protease. These proteins were enriched in several GO terms related to the cytoskeleton, such as the microtubule, actin and tubulin. Finally, a webserver named 3CLP was built to predict the cleavage sites of the coronavirus 3CL protease based on the RF model. Overall, the study provides an effective tool for identifying cleavage sites of the 3CL protease and offers insights into the molecular mechanism underlying the pathogenicity of coronaviruses.
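
    The feature-representation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the window size (four residues on each side of the scissile bond, P4 to P4' in the Schechter and Berger notation), the choice of the Kyte-Doolittle hydropathy scale as the single amino acid index, the function names `candidate_windows` and `encode`, and the toy peptide are all assumptions made for illustration. The abstract does not state which indexes or window length were used.

```python
# Kyte-Doolittle hydropathy scale, one example of an amino acid index.
# ASSUMPTION: the study's actual set of indexes is not given in the abstract.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def candidate_windows(seq, left=4, right=4):
    """Yield (p1_position, window) for every candidate cleavage site.

    p1_position is the 1-based index of the residue immediately before the
    candidate scissile bond (P1). The window covers `left` residues ending
    at P1 followed by `right` residues starting at P1'.
    ASSUMPTION: left=right=4 is a hypothetical window size.
    """
    for i in range(left - 1, len(seq) - right):
        yield i + 1, seq[i - left + 1 : i + 1 + right]


def encode(window, index=HYDROPATHY):
    """Map each residue of a window to its amino acid index value."""
    return [index[aa] for aa in window]


if __name__ == "__main__":
    toy_peptide = "TSAVLQSGFRK"  # hypothetical input sequence
    for position, window in candidate_windows(toy_peptide):
        print(position, window, encode(window))
```

    Each encoded window is a fixed-length numeric vector, so a collection of them (labelled as cleaved or not cleaved) could serve as the input matrix for a classifier such as a random forest. With several amino acid indexes, the per-index vectors would simply be concatenated.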

      Corresponding author: Yousong Peng, pys2013@hnu.edu.cn
    • Bioinformatics Center, College of Biology, Hunan Provincial Key Laboratory of Medical Virology, Hunan University, Changsha, 410082, China
