Biological Information Extraction from Large Language Models (LLMs)
This is the official code for the following papers:
- Automated Extraction of Molecular Interactions and Pathway Knowledge using Large Language Model, Galactica: Opportunities and Challenges
- Comparative Performance Evaluation of Large Language Models for Extracting Molecular Interactions and Pathway Knowledge
The code was implemented with Python 3.8; the dependency versions are listed in `requirements.txt`. The experiments use the following datasets:
- STRING DB: the human (Homo sapiens) protein network for performing a protein-protein interaction (PPI) recognition task.
- KEGG DB: the KEGG human pathways identified in a recent study as activated in response to low-dose radiation exposure.
- INDRA DB: a set of human gene regulatory relation statements that represent mechanistic interactions between biological agents.
To reproduce the results of the experiments, use the bash script `run.sh`, changing the model and data paths as needed.
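The prompts and generation settings used in the papers are defined in `run.sh` and the accompanying scripts. As a rough illustration only, a single prompt-based query against one of the evaluated models might look like the sketch below; the checkpoint name `facebook/galactica-6.7b` and the prompt wording are assumptions, not the repository's exact configuration.

```python
# Illustrative sketch only: querying Galactica with a yes/no PPI prompt via
# Hugging Face Transformers. The checkpoint name and prompt wording are assumptions;
# the actual prompts and decoding settings are defined in this repository's scripts.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/galactica-6.7b"  # assumed Hugging Face checkpoint for Galactica (6.7B)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Question: Does the protein TP53 physically interact with the protein MDM2? Answer:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```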
The results of the experiments are shown below. All experiments were conducted on 8×NVIDIA V100 GPUs; note that a different number of GPUs or a different batch size can produce slightly different results.
- STRING Task1 - Precision of the generated binding proteins for 1K protein samples.
- STRING Task2 - Micro F-scores for randomly selected positive and negative pairs (i.e., 1K = 500 positive + 500 negative); an evaluation sketch follows the table below.
- Model prediction consistency between Task1 and Task2.
Model | STRING Task1 | STRING Task2 | Consistency |
---|---|---|---|
Galactica (6.7B) | 0.166 | 0.552 | 0.726 |
LLaMA (7B) | 0.043 | 0.484 | 0.984 |
Alpaca (7B) | 0.052 | 0.521 | 0.784 |
RST (11B) | 0.146 | 0.529 | 1.000 |
BioGPT-Large (1.5B) | 0.100 | 0.504 | 0.814 |
BioMedLM (2.7B) | 0.069 | 0.643 | 0.861 |
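As referenced above, the Task2-style balanced-pair scoring could be sketched as follows. This is illustrative only, not the repository's evaluation code; `sample_pairs` and `predict_yes_no` are hypothetical helpers.

```python
# Illustrative sketch only: Task2-style balanced-pair evaluation. sample_pairs and
# predict_yes_no are hypothetical helpers, not functions from this repository.
import random
from sklearn.metrics import f1_score

def sample_pairs(interacting, non_interacting, n_per_class=500, seed=0):
    """Randomly select a balanced set of positive (label 1) and negative (label 0) pairs."""
    rng = random.Random(seed)
    pos = rng.sample(interacting, n_per_class)
    neg = rng.sample(non_interacting, n_per_class)
    return [(pair, 1) for pair in pos] + [(pair, 0) for pair in neg]

def micro_f_score(pairs, predict_yes_no):
    """predict_yes_no(pair) should return 1 for a 'yes' answer and 0 for 'no'."""
    y_true = [label for _, label in pairs]
    y_pred = [predict_yes_no(pair) for pair, _ in pairs]
    return f1_score(y_true, y_pred, average="micro")
```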
- KEGG Task1 - Precision of the generated genes that belong to the top 20 pathways relevant to low-dose radiation exposure; a precision sketch follows the table below.
- KEGG Task2 - Micro F-scores for randomly selected positive and negative pairs (i.e., 1K = 500 positive + 500 negative).
- Model prediction consistency between Task1 and Task2.
Model | KEGG Task1 | KEGG Task2 | Consistency |
---|---|---|---|
Galactica (6.7B) | 0.256 | 0.564 | 0.917 |
LLaMA (7B) | 0.180 | 0.562 | 0.881 |
Alpaca (7B) | 0.268 | 0.522 | 1.000 |
RST (11B) | 0.255 | 0.514 | 0.000 |
BioGPT-Large (1.5B) | 0.550 | 0.497 | 0.923 |
BioMedLM (2.7B) | 0.514 | 0.568 | 0.821 |
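As referenced above, a Task1-style precision can be read as the fraction of generated genes that actually belong to the reference pathway gene set. The sketch below is illustrative only; the gene names are made up and the repository's own scoring may differ.

```python
# Illustrative sketch only: Task1-style precision, i.e., the fraction of generated
# genes that actually belong to the reference KEGG pathway gene set.
def task1_precision(generated_genes, pathway_genes):
    """Precision = |generated ∩ reference| / |generated| (case-insensitive)."""
    generated = {g.upper() for g in generated_genes}
    reference = {g.upper() for g in pathway_genes}
    return len(generated & reference) / max(len(generated), 1)

# Hypothetical example: two of the three generated genes are in the reference set.
print(task1_precision(["TP53", "CDKN1A", "FAKE1"], ["TP53", "CDKN1A", "MDM2", "ATM"]))  # ~0.667
```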
- INDRA Task - Micro F-scores with 1K samples for each class; a scoring sketch follows the table below.
Model | 2-class | 3-class | 4-class | 5-class | 6-class |
---|---|---|---|---|---|
Galactica (6.7B) | 0.704 | 0.605 | 0.567 | 0.585 | 0.597 |
LLaMA (7B) | 0.351 | 0.293 | 0.254 | 0.219 | 0.212 |
Alpaca (7B) | 0.736 | 0.645 | 0.556 | 0.636 | 0.535 |
RST (11B) | 0.640 | 0.718 | 0.597 | 0.667 | 0.614 |
BioGPT-Large (1.5B) | 0.474 | 0.390 | 0.293 | 0.328 | 0.288 |
BioMedLM (2.7B) | 0.542 | 0.408 | 0.307 | 0.230 | 0.195 |
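For the INDRA setting, free-text generations have to be mapped onto one of the candidate relation classes before a micro F-score can be computed. The sketch below is illustrative only: the label set and the normalization rule are assumptions, not the papers' actual class definitions.

```python
# Illustrative sketch only: normalize free-text answers to candidate relation labels
# and compute a micro-averaged F-score. The label set is an assumed example; the
# papers define the actual 2- to 6-class settings.
from sklearn.metrics import f1_score

LABELS = ["activation", "inhibition", "phosphorylation"]  # assumed example classes

def normalize(answer, labels=LABELS):
    """Map a generated answer to the first candidate label it mentions, else 'other'."""
    answer = answer.lower()
    return next((label for label in labels if label in answer), "other")

gold = ["activation", "inhibition", "phosphorylation"]
raw = ["The relationship is Activation.", "inhibition", "This causes phosphorylation of the substrate."]
pred = [normalize(a) for a in raw]
print(f1_score(gold, pred, average="micro"))  # 1.0 for this toy example
```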
If you use this code, please cite the following papers:
@inproceedings{park2023automated,
title={Automated Extraction of Molecular Interactions and Pathway Knowledge using Large Language Model, Galactica: Opportunities and Challenges},
author={Park, Gilchan and Yoon, Byung-Jun and Luo, Xihaier and L{\'o}pez-Marrero, Vanessa and Johnstone, Patrick and Yoo, Shinjae and Alexander, Francis},
booktitle={The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks},
pages={255--264},
year={2023}
}
@inproceedings{Park2023ComparativePE,
title={Comparative Performance Evaluation of Large Language Models for Extracting Molecular Interactions and Pathway Knowledge},
author={Gilchan Park and Byung-Jun Yoon and Xihaier Luo and Vanessa L{\'o}pez-Marrero and Patrick Johnstone and Shinjae Yoo and Francis J. Alexander},
year={2023}
}