Biological Information Extraction from Large Language Models (LLMs)
This is the official code for the following papers:
- Automated Extraction of Molecular Interactions and Pathway Knowledge using Large Language Model, Galactica: Opportunities and Challenges
- Comparative Performance Evaluation of Large Language Models for Extracting Molecular Interactions and Pathway Knowledge
The code was implemented with Python 3.8; the dependency versions are listed in `requirements.txt`. The experiments use the following datasets:
- STRING DB: the human (Homo sapiens) protein network for performing a protein-protein interaction (PPI) recognition task.
- KEGG DB: the KEGG human pathways identified in a recent study as activated in response to low-dose radiation exposure.
- INDRA DB: a set of human gene regulatory relation statements that represent mechanistic interactions between biological agents.
To reproduce the results of the experiments, use the bash script `run.sh`, changing the model and data paths as needed.
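The prompts and generation settings used in the papers are defined in `run.sh` and the accompanying scripts. As a rough illustration only, a single prompt-based query against one of the evaluated models might look like the sketch below; the checkpoint name `facebook/galactica-6.7b` and the prompt wording are assumptions, not the repository's exact configuration.

```python
# Illustrative sketch only: querying Galactica with a yes/no PPI prompt via
# Hugging Face Transformers. The checkpoint name and prompt wording are assumptions;
# the actual prompts and decoding settings are defined in this repository's scripts.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/galactica-6.7b"  # assumed Hugging Face checkpoint for Galactica (6.7B)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Question: Does the protein TP53 physically interact with the protein MDM2? Answer:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```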
The results of the experiments are shown below. All experiments were conducted on 8×NVIDIA V100 GPUs; note that a different number of GPUs or a different batch size can produce slightly different results.
- STRING Task1 - Precision of the generated binding proteins for 1K protein samples.
- STRING Task2 - Micro F-scores for randomly selected positive and negative pairs (i.e., 1K = 500 positive + 500 negative); an evaluation sketch follows the table below.
- Model prediction consistency between Task1 and Task2.
Model | STRING Task1 | STRING Task2 | Consistency |
---|---|---|---|
Galactica (6.7B) | 0.166 | 0.552 | 0.726 |
LLaMA (7B) | 0.043 | 0.484 | 0.984 |
Alpaca (7B) | 0.052 | 0.521 | 0.784 |
RST (11B) | 0.146 | 0.529 | 1.000 |
BioGPT-Large (1.5B) | 0.100 | 0.504 | 0.814 |
BioMedLM (2.7B) | 0.069 | 0.643 | 0.861 |
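As referenced above, the Task2-style balanced-pair scoring could be sketched as follows. This is illustrative only, not the repository's evaluation code; `sample_pairs` and `predict_yes_no` are hypothetical helpers.

```python
# Illustrative sketch only: Task2-style balanced-pair evaluation. sample_pairs and
# predict_yes_no are hypothetical helpers, not functions from this repository.
import random
from sklearn.metrics import f1_score

def sample_pairs(interacting, non_interacting, n_per_class=500, seed=0):
    """Randomly select a balanced set of positive (label 1) and negative (label 0) pairs."""
    rng = random.Random(seed)
    pos = rng.sample(interacting, n_per_class)
    neg = rng.sample(non_interacting, n_per_class)
    return [(pair, 1) for pair in pos] + [(pair, 0) for pair in neg]

def micro_f_score(pairs, predict_yes_no):
    """predict_yes_no(pair) should return 1 for a 'yes' answer and 0 for 'no'."""
    y_true = [label for _, label in pairs]
    y_pred = [predict_yes_no(pair) for pair, _ in pairs]
    return f1_score(y_true, y_pred, average="micro")
```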
- KEGG Task1 - Precision of the generated genes that belong to the top 20 pathways relevant to low-dose radiation exposure; a precision sketch follows the table below.
- KEGG Task2 - Micro F-scores for randomly selected positive and negative pairs (i.e., 1K = 500 positive + 500 negative).
- Model prediction consistency between Task1 and Task2.
Model | KEGG Task1 | KEGG Task2 | Consistency |
---|---|---|---|
Galactica (6.7B) | 0.256 | 0.564 | 0.917 |
LLaMA (7B) | 0.180 | 0.562 | 0.881 |
Alpaca (7B) | 0.268 | 0.522 | 1.000 |
RST (11B) | 0.255 | 0.514 | 0.000 |
BioGPT-Large (1.5B) | 0.550 | 0.497 | 0.923 |
BioMedLM (2.7B) | 0.514 | 0.568 | 0.821 |
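As referenced above, a Task1-style precision can be read as the fraction of generated genes that actually belong to the reference pathway gene set. The sketch below is illustrative only; the gene names are made up and the repository's own scoring may differ.

```python
# Illustrative sketch only: Task1-style precision, i.e., the fraction of generated
# genes that actually belong to the reference KEGG pathway gene set.
def task1_precision(generated_genes, pathway_genes):
    """Precision = |generated ∩ reference| / |generated| (case-insensitive)."""
    generated = {g.upper() for g in generated_genes}
    reference = {g.upper() for g in pathway_genes}
    return len(generated & reference) / max(len(generated), 1)

# Hypothetical example: two of the three generated genes are in the reference set.
print(task1_precision(["TP53", "CDKN1A", "FAKE1"], ["TP53", "CDKN1A", "MDM2", "ATM"]))  # ~0.667
```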
- INDRA Task - Micro F-scores with 1K samples for each class; a scoring sketch follows the table below.
Model | 2-class | 3-class | 4-class | 5-class | 6-class |
---|---|---|---|---|---|
Galactica (6.7B) | 0.704 | 0.605 | 0.567 | 0.585 | 0.597 |
LLaMA (7B) | 0.351 | 0.293 | 0.254 | 0.219 | 0.212 |
Alpaca (7B) | 0.736 | 0.645 | 0.556 | 0.636 | 0.535 |
RST (11B) | 0.640 | 0.718 | 0.597 | 0.667 | 0.614 |
BioGPT-Large (1.5B) | 0.474 | 0.390 | 0.293 | 0.328 | 0.288 |
BioMedLM (2.7B) | 0.542 | 0.408 | 0.307 | 0.230 | 0.195 |
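For the INDRA setting, free-text generations have to be mapped onto one of the candidate relation classes before a micro F-score can be computed. The sketch below is illustrative only: the label set and the normalization rule are assumptions, not the papers' actual class definitions.

```python
# Illustrative sketch only: normalize free-text answers to candidate relation labels
# and compute a micro-averaged F-score. The label set is an assumed example; the
# papers define the actual 2- to 6-class settings.
from sklearn.metrics import f1_score

LABELS = ["activation", "inhibition", "phosphorylation"]  # assumed example classes

def normalize(answer, labels=LABELS):
    """Map a generated answer to the first candidate label it mentions, else 'other'."""
    answer = answer.lower()
    return next((label for label in labels if label in answer), "other")

gold = ["activation", "inhibition", "phosphorylation"]
raw = ["The relationship is Activation.", "inhibition", "This causes phosphorylation of the substrate."]
pred = [normalize(a) for a in raw]
print(f1_score(gold, pred, average="micro"))  # 1.0 for this toy example
```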
If you use this code, please cite the following papers:
@inproceedings{park2023automated,
title={Automated Extraction of Molecular Interactions and Pathway Knowledge using Large Language Model, Galactica: Opportunities and Challenges},
author={Park, Gilchan and Yoon, Byung-Jun and Luo, Xihaier and L{\'o}pez-Marrero, Vanessa and Johnstone, Patrick and Yoo, Shinjae and Alexander, Francis},
booktitle={The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks},
pages={255--264},
year={2023}
}
@inproceedings{Park2023ComparativePE,
title={Comparative Performance Evaluation of Large Language Models for Extracting Molecular Interactions and Pathway Knowledge},
author={Gilchan Park and Byung-Jun Yoon and Xihaier Luo and Vanessa L{\'o}pez-Marrero and Patrick Johnstone and Shinjae Yoo and Francis J. Alexander},
year={2023}
}