This repository contains the official implementation of our paper "EGAT: Edge Aggregated Graph Attention Networks and Transfer Learning Improve Protein-Protein Interaction Site Prediction".
We implemented our method using PyTorch and Deep Graph Library (DGL). Please install both before running our code. Installation instructions are available at the following links:
- Please download the pretrained model weight-file "pytorch_model.bin" from here.
- Place this weight-file in the folder "EGAT/inputs/ProtBert_model". If you use this pretrained model for your paper, please cite the paper "ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing".
- Please download the pretrained model weight-file "egat_model_weight.dat" from here.
- Place this weight-file in the folder "EGAT/models".
To store input-features, navigate to the folder "EGAT/inputs". In this folder, follow any of the following steps:
- Store the PDB files of the isolated proteins to be used for prediction in the folder "pdb_files". Rename the PDB files in the format "<an arbitrary name>_<chain IDs>". Please see the example PDB files provided in this folder. Provide the real chain IDs (as given in the PDB file) after the underscore ("_"). (In the provided examples, <an arbitrary name> is the PDB ID of a complex in which the input protein is one of the subunits. This is not mandatory.)
- List all the protein names in the file "protein_list.txt".
- From the command line, cd to the "EGAT" folder (where the "run_egat.py" file is located).
- Please run the following command:
python run_egat.py
- The command above will generate the results in the "EGAT/outputs" folder.
- The output is stored as a pickle file in the "EGAT/outputs" folder. To open it, run the following commands in the Python interpreter:
import pickle
output = pickle.load(open('EGAT/outputs/prediction_and_attention_scores.pkl', 'rb'))
- In the above commands, the "output" variable is a Python dictionary with four keys: 'pred', 'protein_info', 'edges', and 'attention_scores'.
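As a minimal sketch of the loading step above, the helper below opens the pickle file with a context manager (so the file handle is closed afterwards) and checks that the four keys listed above are present. The names `load_egat_output` and `EXPECTED_KEYS` are illustrative and not part of the EGAT code; the path in the commented example is the one used in the commands above.

```python
import pickle

# The four keys described above; the constant name itself is hypothetical.
EXPECTED_KEYS = ("pred", "protein_info", "edges", "attention_scores")


def load_egat_output(path):
    """Load the EGAT output pickle and verify that it has the expected keys."""
    with open(path, "rb") as handle:
        output = pickle.load(handle)
    missing = [key for key in EXPECTED_KEYS if key not in output]
    if missing:
        raise KeyError(f"missing keys in EGAT output: {missing}")
    return output


# Example (assumes run_egat.py has already produced the file):
# output = load_egat_output("EGAT/outputs/prediction_and_attention_scores.pkl")
```

Wrapping the load in a function keeps the check in one place when several output files are inspected in the same session.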
- To access the predicted numeric propensity, please run the following commands:
prediction = output['pred']
protein_index = 0
print(prediction[protein_index])
In the above commands, "protein_index" is the index of the protein name in the "protein_list.txt" file. You can set it to any valid index; e.g., for the protein name at index 2 (the third row of "protein_list.txt"), set protein_index = 2.
These commands print the predicted numeric propensities of all the residues in the protein at index 0 of the "protein_list.txt" file. The propensities are printed sequentially, following the order of the residues in the input PDB file of this protein.
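To illustrate one way of using the propensities, the sketch below ranks residues by predicted propensity. It assumes that `prediction[protein_index]` is a one-dimensional sequence with one score per residue, in the residue order of the input PDB file as described above. The function `top_residues` and its parameter `k` are hypothetical names, not part of EGAT.

```python
def top_residues(propensities, k=5):
    """Return (residue_index, propensity) pairs for the k highest scores.

    Indices are 0-based positions in the residue order of the input PDB file.
    """
    scored = [(index, float(score)) for index, score in enumerate(propensities)]
    # Sort by descending propensity; ties keep the original residue order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Example with the "output" dictionary loaded earlier:
# print(top_residues(output['pred'][0], k=10))
```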
- To access general information about the input proteins, please run the following commands:
protein_information = output['protein_info']
protein_index = 0
print(protein_information[protein_index])
These commands print a Python dictionary corresponding to the protein at index 0 of the "protein_list.txt" file. This dictionary contains the number of residues in the protein under the key 'seq_length'.
- To access the edges of the graph representations of the input proteins, please run the following commands:
graph_edges = output['edges']
protein_index = 0
print(graph_edges[protein_index])
These commands print a numpy array corresponding to the protein at index 0 of the "protein_list.txt" file. Each row of this array corresponds to a neighborhood and contains the indices of the neighboring nodes (residues) of one residue, the center of the neighborhood (please see our paper for more details). The row index is the index of this center residue. The following example commands print the neighborhood (neighboring residue indices) of the residue with index 2:
center_node = 2
print(graph_edges[protein_index][center_node])
- To access the attention scores associated with the edges, please run the following commands:
attention_scores = output['attention_scores']
protein_index = 0
print(attention_scores[protein_index])
These commands print a numpy array corresponding to the protein at index 0 of the "protein_list.txt" file. Each row of this array contains the attention scores of the edges in the corresponding neighborhood. In the following example commands, the center of the neighborhood is the residue with index 2, and the commands print the attention scores associated with the edges from its neighboring residues (nodes) to this residue:
center_node = 2
print(attention_scores[protein_index][center_node])
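The edge and attention arrays can also be read together. The sketch below pairs each neighbor of a center residue with its attention score. It assumes that row `center_node` of `output['edges'][protein_index]` and of `output['attention_scores'][protein_index]` are one-dimensional, have the same length, and list the neighbors in the same order. This alignment is an assumption based on the description above, so please verify it against your own output. `neighbor_attention` is a hypothetical helper name, not part of EGAT.

```python
def neighbor_attention(output, protein_index, center_node):
    """Pair each neighbor index of center_node with its attention score.

    Assumes both rows are one-dimensional, equally long, and aligned.
    """
    neighbors = list(output["edges"][protein_index][center_node])
    scores = list(output["attention_scores"][protein_index][center_node])
    if len(neighbors) != len(scores):
        raise ValueError("edges and attention_scores rows differ in length")
    pairs = [(int(n), float(s)) for n, s in zip(neighbors, scores)]
    # Strongest attention first.
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs


# Example with the "output" dictionary loaded earlier:
# for neighbor, score in neighbor_attention(output, 0, 2):
#     print(neighbor, score)
```

The explicit length check makes a mismatch between the two arrays fail loudly instead of silently truncating the pairs.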