Our previous studies have shown that human virus receptor proteins have distinctive features, including a high level of N-glycosylation, a large number of interaction partners in the human PPI network, and high expression in 32 common human tissues (Zhang et al. 2019). To identify potential receptors of the human-infecting virome, we first built an RF model to distinguish human virus receptor proteins from other human membrane proteins based on these features. RF models built on each individual feature achieved AUCs ranging from 0.51 to 0.61 in five-fold cross-validation (Table 1). Combining all three features substantially improved the model, giving an AUC of 0.70 and a prediction accuracy of 0.72 (Table 1).
| Model with different sets of features | Feature number | Acc | Sen | Spe | AUC |
|---|---|---|---|---|---|
| N-gly | 1 | 0.59 | 0.58 | 0.59 | 0.59 |
| PPI | 1 | 0.62 | 0.60 | 0.62 | 0.61 |
| Expression | 1 | 0.50 | 0.51 | 0.50 | 0.51 |
| N-gly + PPI + Expression | 3 | 0.72 | 0.68 | 0.72 | 0.70 |
| AAC (top 10) | 10 | 0.70 | 0.73 | 0.70 | 0.71 |
| N-gly + PPI + Expression + AAC (top 10) | 13 | 0.76 | 0.75 | 0.76 | 0.76 |

N-gly: N-glycosylation; PPI: node degree in the human PPI network; Expression: expression in 32 human tissues; AAC: amino acid composition; Acc: accuracy; Sen: sensitivity; Spe: specificity; AUC: area under the receiver operating characteristic curve.
Table 1. The predictive performances of random-forest models using different sets of features.
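For reference, the four metrics reported in Table 1 can be computed from predicted scores as in the minimal sketch below. The function names, the 0.5 decision threshold, and the toy labels and scores are illustrative assumptions and are not taken from the study.

```python
def confusion_counts(labels, scores, threshold=0.5):
    """Return (TP, FP, TN, FN); a score above the threshold is called positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s > threshold
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 0 and predicted_positive:
            fp += 1
        elif y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def acc_sen_spe(labels, scores, threshold=0.5):
    """Accuracy, sensitivity (true-positive rate) and specificity (true-negative rate)."""
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
    acc = (tp + tn) / (tp + fp + tn + fn)
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    return acc, sen, spe


def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical data): 1 = receptor, 0 = other membrane protein.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]
acc, sen, spe = acc_sen_spe(labels, scores)
area = auc(labels, scores)
```

The pairwise AUC is quadratic in the number of proteins, which is acceptable for a sketch; a rank-based formulation would be preferable for large data sets.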
For comparison, we also developed RF models to distinguish human virus receptors from other human membrane proteins based on protein sequences alone. The amino acid composition (AAC) of the protein sequences was first used as the feature set. The AUC of the RF models increased as the number of most important AAC features (N) increased from 1 to 10, and began to decrease once N exceeded 10 (Fig. 1A). The RF model based on the top ten AAC features had an AUC of 0.71 and a prediction accuracy of 0.70, similar to those of the model based on the combination of the three protein features described above. Further analysis showed that an RF model based on the frequencies of two-amino-acid k-mers did not improve much on the AAC-based model (Fig. 1B). Therefore, to limit model complexity, only the top ten AAC features were used in the sequence-based modeling.
Figure 1. AUC of the random-forest model based on the top N features of AAC (A; N = 1-20) or of two-amino-acid k-mers of protein sequences (B; N = 1-400).
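The two sequence-based feature sets compared above can be sketched as follows: AAC is the frequency of each of the 20 standard amino acids (k = 1), and the two-amino-acid k-mer set is the frequency of each of the 400 possible dipeptides (k = 2). The function name, the handling of non-standard residues and the example sequence are assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def kmer_frequencies(sequence, k=1):
    """Return the frequency of every possible k-mer over the 20 standard amino acids.

    k=1 gives the amino acid composition (20 features);
    k=2 gives the two-amino-acid k-mer frequencies (400 features).
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Assumption: windows containing non-standard residues (e.g. X) are skipped.
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    return {kmer: (c / total if total else 0.0) for kmer, c in counts.items()}


seq = "MKTAYIAKQR"  # hypothetical peptide, not a real receptor sequence
aac = kmer_frequencies(seq, k=1)
dipeptides = kmer_frequencies(seq, k=2)
```

The feature counts (20 and 400) match the ranges of N shown in Fig. 1; selecting the top N of these by importance would be a separate step that depends on the fitted model.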
To further improve the model for predicting the receptorome of the human-infecting virome, the three protein features and the top ten AAC features were combined in a single RF model. This model achieved an AUC of 0.76, with a prediction accuracy, sensitivity and specificity of 0.76, 0.75 and 0.76, respectively (Table 1), and was used for all further analysis.
Based on the RF model, the receptorome was predicted from human cell membrane proteins. The model assigned each human cell membrane protein a score ranging from 0 to 1, with higher scores indicating a greater likelihood of being a virus receptor. A total of 1424 proteins with scores greater than 0.5 were considered to constitute the receptorome of the human-infecting virome. Table 2 lists the top 20 human cell membrane proteins and their scores (for all human cell membrane proteins, see Supplementary Table S2).
| Gene name | Protein name | RF score | Gene name | Protein name | RF score |
|---|---|---|---|---|---|
| ITGAV | Integrin alpha-V | 0.959 | PTPRJ | Receptor-type tyrosine-protein phosphatase eta | 0.903 |
| SCARB1 | Scavenger receptor class B member 1 | 0.948 | KDR | Vascular endothelial growth factor receptor 2 | 0.903 |
| NCAM1 | Neural cell adhesion molecule 1 | 0.943 | IL6ST | Interleukin-6 receptor subunit beta | 0.900 |
| ITGB1 | Integrin beta-1 | 0.940 | SELP | P-selectin | 0.898 |
| IGF2R | Cation-independent mannose-6-phosphate receptor | 0.928 | HSPA8 | Heat shock cognate 71 kDa protein | 0.895 |
| ITGA6 | Integrin alpha-6 | 0.927 | EGFR | Epidermal growth factor receptor | 0.895 |
| HLA-DRA | HLA class Ⅱ histocompatibility antigen, DR alpha chain | 0.926 | TNFRSF14 | Tumor necrosis factor receptor superfamily member 14 | 0.895 |
| ITGA3 | Integrin alpha-3 | 0.914 | IL7R | Interleukin-7 receptor subunit alpha | 0.892 |
| CR2 | Complement receptor type 2 | 0.911 | KIT | Mast/stem cell growth factor receptor Kit | 0.891 |
| LDLR | Low-density lipoprotein receptor | 0.911 | SLAMF1 | Signaling lymphocytic activation molecule | 0.891 |
Table 2. Top 20 human cell membrane proteins and their scores assigned by the random-forest model.
Next, we investigated the prediction of virus-receptor interactions. In a previous study, Lasso et al. (2019) predicted 282,528 PPIs between human proteins and 1001 human-infecting viruses. From that study, 9395 PPIs between 718 viral RBPs from 693 human-infecting viruses and 314 human cell membrane proteins were extracted for further analysis (see Supplementary Table S3). Each viral RBP was predicted to interact with between 1 and 65 human cell membrane proteins, with a median of 10. For each viral RBP, the RBP-interacting cell membrane proteins were ranked by their RF scores to select the most likely receptor (Supplementary Table S3).
To validate the ranking produced by the RF model, 25 experimentally validated interactions between viral RBPs and receptors were extracted. For each pair of viral RBP and receptor, the rank of the real receptor among the predicted RBP-interacting proteins was obtained, and the corresponding rank percentage was calculated (Materials and Methods). Eight real receptors were ranked first by the RF model (Table 3), and nearly 70% (17/25) of the real receptors were ranked in the top three. The median rank percentage of the real receptors among all RBP-interacting human cell membrane proteins was 0.20, suggesting that a real receptor would typically be ranked in the top 20% of all candidates by the RF model.
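The per-RBP ranking step described above can be sketched as follows. The RF scores are those listed in Table 2, whereas the RBP identifiers and the RBP-protein pairs are hypothetical and serve only to illustrate the grouping and sorting; the function name is likewise an assumption.

```python
from collections import defaultdict


def rank_candidates(pairs, rf_scores):
    """Group predicted (RBP, protein) pairs by RBP and sort each group
    by RF score, highest first, so the first entry is the most likely receptor."""
    by_rbp = defaultdict(list)
    for rbp, protein in pairs:
        by_rbp[rbp].append(protein)
    return {
        rbp: sorted(proteins, key=lambda p: rf_scores[p], reverse=True)
        for rbp, proteins in by_rbp.items()
    }


# RF scores taken from Table 2.
rf_scores = {"ITGAV": 0.959, "LDLR": 0.911, "EGFR": 0.895, "SLAMF1": 0.891}

# Hypothetical predicted interactions, for illustration only.
pairs = [
    ("RBP_1", "EGFR"),
    ("RBP_1", "ITGAV"),
    ("RBP_1", "SLAMF1"),
    ("RBP_2", "LDLR"),
    ("RBP_2", "EGFR"),
]

ranked = rank_candidates(pairs, rf_scores)
```

Because the score is a property of the membrane protein alone, two RBPs that share the same candidates receive the same ordering; the predicted PPIs determine which candidates are considered for each RBP.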
| Virus name | RBP | Real viral receptor | Num of RBP-interacting proteins | Rank by LR | Rank by RF score |
|---|---|---|---|---|---|
| SARS-CoV | S | ACE2 | 31 | –* | 22 |
| MERS-CoV | S | DPP4 | 8 | – | 2 |
| Echovirus E6 | VP1 | CD55 | 13 | 5 | 2 |
| Echovirus E11 | VP1 | CD55 | 9 | 4 | 2 |
| Echovirus E7 | VP1 | CD55 | 7 | – | 3 |
| Echovirus E13 | VP1 | CD55 | 11 | 4 | 1 |
| Echovirus E20 | VP1 | CD55 | 12 | 5 | 1 |
| Echovirus E29 | VP1 | CD55 | 13 | 6 | 2 |
| Echovirus E33 | VP1 | CD55 | 13 | 6 | 1 |
| Enterovirus C | VP1 | PVR | 5 | – | 1 |
| Hepacivirus C | E1 | EGFR | 17 | 10 | 2 |
| MACV | GPC | TFRC | 2 | – | 1 |
| Measles virus | H | NECTIN4 | 18 | – | 18 |
| Measles virus | H | SLAMF1 | 18 | 2 | 2 |
| Hendra virus | G | EFNB2 | 5 | – | 1 |
| Nipah virus | G | EFNB2 | 5 | – | 1 |
| HAdV-A | L5 | CXADR | 25 | – | 16 |
| HAdV-C | L5 | CXADR | 5 | 4 | 5 |
| HAdV-D | L5 | CXADR | 28 | 4 | 15 |
| HAdV-E | L5 | CXADR | 33 | 3 | 24 |
| HSV-1 | US6 | TNFRSF14 | 28 | – | 3 |
| HSV-1 | US6 | NECTIN1 | 28 | – | 11 |
| HSV-2 | US6 | NECTIN1 | 34 | – | 14 |
| HSV-2 | US6 | TNFRSF14 | 34 | 23 | 3 |
| HIV-1 | env | CD4 | 21 | – | 1 |
| Top 1 | | | | 0 | 8 (3)# |
| Top 3 | | | | 2 | 17 (9)# |
| Top 5 | | | | 8 | 18 (10)# |
| Median rank percentage | | | | 0.43 | 0.20 (0.14)# |

The median rank percentage of real virus receptors among RBP-interacting human cell membrane proteins and the numbers of real virus receptors ranked in the top one, three and five are summarized in the bottom rows.
MACV machupo mammarenavirus, HAdV-A human mastadenovirus A, HAdV-C human mastadenovirus C, HAdV-E human mastadenovirus E, HAdV-D human mastadenovirus D, HSV-1 human alphaherpesvirus 1, HSV-2 human alphaherpesvirus 2, HIV-1 human immunodeficiency virus 1.
*No LR was provided in Lasso's work since there were resolved complex structures between the RBP and the receptor.
#The numbers in brackets are the values obtained when considering only the 12 viral RBP-receptor pairs for which LRs were available from Lasso's work.
Table 3. The ranks of real virus receptors among the RBP-interacting human cell membrane proteins by likelihood ratio (LR) and random-forest (RF) score.
The LR provided in Lasso's work can also be used to rank the RBP-interacting proteins. LRs were available for 12 of the 25 experimentally validated viral RBP-receptor pairs. For comparison, the RBP-interacting human cell membrane proteins of these 12 pairs were ranked by LR. With this ranking, no real receptor was ranked first and only two real receptors were ranked in the top three. The median rank percentage of the real receptors was 0.43 when ranking by LR, compared with 0.14 for the same 12 pairs when ranking by the RF model (Table 3).
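The summary rows of Table 3 can be recomputed from its data rows with the short sketch below. It assumes that the rank percentage is the rank of the real receptor divided by the number of RBP-interacting proteins; the exact definition is given in Materials and Methods and is not restated here, so this formula is an assumption. The function and variable names are illustrative.

```python
from statistics import median

# (number of RBP-interacting proteins, rank by LR or None, rank by RF score),
# one tuple per row of Table 3, in table order.
ROWS = [
    (31, None, 22), (8, None, 2), (13, 5, 2), (9, 4, 2), (7, None, 3),
    (11, 4, 1), (12, 5, 1), (13, 6, 2), (13, 6, 1), (5, None, 1),
    (17, 10, 2), (2, None, 1), (18, None, 18), (18, 2, 2), (5, None, 1),
    (5, None, 1), (25, None, 16), (5, 4, 5), (28, 4, 15), (33, 3, 24),
    (28, None, 3), (28, None, 11), (34, None, 14), (34, 23, 3), (21, None, 1),
]


def summarize(rank_total_pairs):
    """Count receptors in the top 1/3/5 and take the median rank percentage.

    Assumption: rank percentage = rank / number of candidates.
    """
    ranks = [r for r, _ in rank_total_pairs]
    pcts = [r / n for r, n in rank_total_pairs]
    return {
        "top1": sum(r <= 1 for r in ranks),
        "top3": sum(r <= 3 for r in ranks),
        "top5": sum(r <= 5 for r in ranks),
        "median_pct": round(median(pcts), 2),
    }


with_lr = [row for row in ROWS if row[1] is not None]  # the 12 pairs with an LR

rf_all = summarize([(rf, n) for n, _, rf in ROWS])
rf_subset = summarize([(rf, n) for n, _, rf in with_lr])
lr_subset = summarize([(lr, n) for n, lr, _ in with_lr])
```

Under this assumed definition the sketch gives the same counts and medians as the bottom rows of Table 3: 8, 17, 18 and 0.20 for the RF score on all 25 pairs; 3, 9, 10 and 0.14 for the RF score on the 12 pairs with LRs; and 0, 2, 8 and 0.43 for the LR.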
Previous studies have shown that ACE2, the receptor of SARS-CoV-2 (Hoffmann et al. 2020; Zhou et al. 2020), is expressed at a low level in the lung and the upper respiratory tract (Qi et al. 2020; Zhang et al. 2020), which suggests that SARS-CoV-2 may use alternative receptors. We therefore used the model to predict alternative receptors for SARS-CoV-2. Lasso's study predicted PPIs between the spike proteins of two coronaviruses, severe acute respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coronavirus (MERS-CoV), and 28 human cell membrane proteins that were members of the receptorome of human-infecting viruses. We hypothesized that SARS-CoV-2 is likely to use these spike-interacting proteins as alternative receptors. The spike-interacting proteins were ranked by their RF scores, and their expression levels in 32 common human tissues are shown in Fig. 2. Most of them, including APP, EZR and CD4, had higher expression levels than ACE2 in the lung.
Figure 2. The predicted alternative receptors of SARS-CoV-2 (left) and their expression in 32 human tissues (bottom). The predicted alternative receptors are ranked by RF score. Expression level is measured in transcripts per million (TPM) and colored according to the legend at the top right. White indicates that no data were available. The lung is highlighted by an arrow, and ACE2 is marked by an asterisk.