building an extended version of primekg that include OMIM phenotypes and genes #12

abearab · 2023-08-11T09:35:55Z

Big thanks to @marinkaz for recommending a pull request to share my work here.

Updates:

Including scripts to gather updated data from the OMIM portal using their API (credit to @amirieb)
Including scripts to process and extract relevant information from downloaded OMIM datasets (downloaded in May 2023).
Uploading a detailed notebook to build an extended version of PrimeKG (this can be a good example for users who want to append other resources to PrimeKG and build an extended version).
A formatting update for the README file (I could add some notes about my work but I kept that for later).

Also:

A discussion about these updates – Are there any OMIM nodes? #9
Improvements in the context of knowledge graph resources TDC#207

extended version of primekg that include OMIM phenotypes and genes.

payalchandak · 2023-09-11T15:35:22Z

Hi @abearab thank you for submitting this PR! I'm wondering if you are reading the OMIM information from a locally downloaded json file? It would be great if you could make this re-useable for others by including the source website and any wget commands! Thanks

ayushnoori · 2023-09-11T15:36:50Z

Thanks @abearab! We will review. Agree with @payalchandak, would be great to include the source information. Also, please feel free to add notes on your work to the README as part of this PR.

abearab · 2023-09-11T17:47:54Z

Hi @payalchandak and @ayushnoori

I'm wondering if you are reading the OMIM information from a locally downloaded json file? It would be great if you could make this re-useable for others by including the source website and any wget commands!

I already added a notebook (see this link) as part of this PR that describes how I made the local json file. Please let me know if that is what you meant or you need me to provide more info.

mims-harvard#12

abearab · 2023-09-12T08:47:33Z

Also, please feel free to add notes on your work to the README as part of this PR.

@ayushnoori plz see the updated README file. Let me know what you think. Thanks!

abearab · 2023-09-30T23:10:42Z

@payalchandak @ayushnoori

To give you an update, I just committed a few more features as a module in TDC. I believe it helps to build, handle, explore, and integrate knowledge graphs based on the final PrimeKG data format (i.e. simply a pandas data frame with same column names). This might be useful for future updates of PrimeKG itself and improves building graphs from other resources in a compatible format to PrimeKG.

mims-harvard/TDC#207

from tdc.resource import PrimeKG  # PrimeKG loader from TDC
from tdc.utils.knowledge_graph import KnowledgeGraph, build_KG

primekg = PrimeKG(path='./datasets/PrimeKG')

1. List nodes from a given source

primekg.get_nodes_by_source('NCBI').head()

id	type	name	source
9796	gene/protein	PHYHIP	NCBI
7918	gene/protein	GPANK1	NCBI
8233	gene/protein	ZRSR2	NCBI
4899	gene/protein	NRF1	NCBI
5297	gene/protein	PI4KA	NCBI

2. Extract a subgraph through running a pandas query:

subgr = primekg.copy()

subgr.run_query('(x_name == "Olaparib" | y_name == "Olaparib")')

Here, subgr.df_raw contains the data frame before the query and subgr.df contains the subgraph.
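For readers without the TDC module at hand, the same filtering idea can be sketched with plain pandas. This is only an illustration of a `DataFrame.query` call on PrimeKG-style columns, not the implementation of `run_query`; the toy rows and the variable names `df_raw` and `df` are placeholders chosen to mirror the attributes mentioned above.

```python
import pandas as pd

# Toy edge table with PrimeKG-style column names.
# The rows are placeholders, not records taken from PrimeKG.
edges = pd.DataFrame(
    {
        "relation": ["rel_1", "rel_1", "rel_2"],
        "x_name": ["Olaparib", "DRUG_B", "GENE_C"],
        "y_name": ["GENE_A", "GENE_C", "Olaparib"],
    }
)

df_raw = edges.copy()  # full table, left untouched

# Keep every edge in which either endpoint is named "Olaparib".
df = df_raw.query('(x_name == "Olaparib" | y_name == "Olaparib")')

print(len(df_raw), len(df))
```

Keeping the unfiltered table next to the filtered one makes it cheap to run a different query later without reloading the data.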

3. Build a knowledge graph from scratch

my_kg = build_KG(
    indices=REPLACE,  # a list to assign row names of output data frame
    relation=REPLACE,  # a list or string to assign values
    display_relation=REPLACE,  # a list or string to assign values

    x_id=REPLACE,  # a list or string to assign values
    x_type=REPLACE,  # a list or string to assign values
    x_name=REPLACE,  # a list or string to assign values
    x_source=REPLACE,  # a list or string to assign values

    y_id=REPLACE,  # a list or string to assign values
    y_type=REPLACE,  # a list or string to assign values
    y_name=REPLACE,  # a list or string to assign values
    y_source=REPLACE  # a list or string to assign values
)

my_kg will be a KnowledgeGraph object in the same format as PrimeKG, which can simply be concatenated with it.
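A minimal sketch of what concatenating two graphs in this format could look like at the data frame level. It assumes both tables carry the column names listed in the `build_KG` call above; `concat_kgs` and `edge` are hypothetical helpers written for this example and are not part of TDC.

```python
import pandas as pd

# Column names taken from the build_KG arguments above.
COLUMNS = [
    "relation", "display_relation",
    "x_id", "x_type", "x_name", "x_source",
    "y_id", "y_type", "y_name", "y_source",
]


def concat_kgs(*frames):
    """Stack edge tables that share the same columns and drop exact duplicate edges."""
    for frame in frames:
        missing = set(COLUMNS) - set(frame.columns)
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
    combined = pd.concat([f[COLUMNS] for f in frames], ignore_index=True)
    return combined.drop_duplicates().reset_index(drop=True)


def edge(relation, x, y):
    """Build one placeholder edge row with every expected column filled in."""
    return {
        "relation": relation, "display_relation": relation,
        "x_id": x, "x_type": "placeholder", "x_name": x, "x_source": "SRC",
        "y_id": y, "y_type": "placeholder", "y_name": y, "y_source": "SRC",
    }


base = pd.DataFrame([edge("r1", "a", "b"), edge("r1", "b", "c")])
extra = pd.DataFrame([edge("r1", "b", "c"), edge("r2", "c", "d")])

combined = concat_kgs(base, extra)
print(len(combined))
```

Checking the columns up front turns a silent misalignment (extra NaN-filled columns after `pd.concat`) into an explicit error.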

ayushnoori · 2023-10-13T03:27:37Z

@payalchandak, an update: @abearab and I were able to meet today. Abe let me know that there is no rush to merge this PR, since this PR is an extension of prior work done with Zak via SIBMI several years ago. Abe will keep developing this PR and also mims-harvard/TDC#207, and will keep us posted.

@abearab, glad we were able to touch base today. Please feel free to keep myself and @payalchandak posted as your work progresses. As discussed, when we are ready to add OMIM phenotypes and genes to the KG, we will merge this PR. Thanks!

payalchandak · 2023-10-22T06:49:21Z

Thanks a lot @abearab and @ayushnoori! Looking forward to expanding PrimeKG further, please keep me posted.

ayushnoori · 2023-12-28T20:14:54Z

Hi @abearab, we're reviewing this PR. This looks great! A few comments and requests:

May you please run append_omim.ipynb again but use auxillary/kg_raw.csv? This is the KG before we take the largest connected component (please see build_graph.ipynb), and it is best to add new edges to the KG before taking the LCC so we don't unnecessarily lose any nodes. This may resolve the omim_genes_missing problem that you note in append_omim.ipynb.
It would be super helpful if you could prepare a summary of the changes to PrimeKG by adding OMIM nodes: e.g., the number of OMIM phenotypes already in PrimeKG, the number of OMIM phenotypes you add, the number of new edges added (stratified by edge type), the original vs. final edge counts, etc. Your Venn diagrams in append_omim.ipynb look fantastic – it would be helpful to have them in a single document with descriptions so we can review together before merging this PR.
If data sharing policies permit, may you please also upload the OMIM source files used to construct the OMIM extension of the KG to Harvard Dataverse where we can link to them in the README? Please also feel free to create a new README file to direct users to the exact relative file path where these OMIM source files must be stored.
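The summary requested above could be produced along these lines. This is a sketch under stated assumptions: both tables are pandas data frames with `relation`, `x_id`, and `y_id` columns, an edge is identified by that triple, and `summarize_extension` is a hypothetical helper rather than anything in the repository. The toy rows are placeholders.

```python
import pandas as pd


def summarize_extension(kg_before, kg_after):
    """Compare two edge tables and count the edges that only exist in the newer one."""
    key = ["relation", "x_id", "y_id"]
    # A left merge with indicator=True marks rows of kg_after that have no match in kg_before.
    merged = kg_after.merge(
        kg_before[key].drop_duplicates(), on=key, how="left", indicator=True
    )
    new_edges = merged[merged["_merge"] == "left_only"]
    return {
        "edges_before": len(kg_before),
        "edges_after": len(kg_after),
        "new_edges_by_relation": new_edges["relation"].value_counts().to_dict(),
    }


kg_before = pd.DataFrame(
    {"relation": ["rel_a", "rel_b"], "x_id": ["1", "D1"], "y_id": ["2", "1"]}
)
omim_edges = pd.DataFrame(
    {
        "relation": ["mim_gene", "mim_gene", "mim_disease"],
        "x_id": ["M1", "M2", "M3"],
        "y_id": ["1", "2", "D1"],
    }
)
kg_after = pd.concat([kg_before, omim_edges], ignore_index=True)

summary = summarize_extension(kg_before, kg_after)
print(summary)
```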

Thanks so much! cc: @payalchandak @marinkaz

abearab · 2023-12-29T08:45:04Z

Hi @ayushnoori, Thanks for your reply here.

May you please run append_omim.ipynb again but use auxillary/kg_raw.csv? This is the KG before we take the largest connected component (please see build_graph.ipynb), and it is best to add new edges to the KG before taking the LCC so we don't unnecessarily lose any nodes. This may resolve the omim_genes_missing problem that you note in append_omim.ipynb.

Sounds good, I'll do this and get back to you.

It would be super helpful if you could prepare a summary of the changes to PrimeKG by adding OMIM nodes: e.g., the number of OMIM phenotypes already in PrimeKG, the number of OMIM phenotypes you add, the number of new edges added (stratified by edge type), the original vs. final edge counts, etc. Your Venn diagrams in append_omim.ipynb look fantastic – it would be helpful to have them in a single document with descriptions so we can review together before merging this PR.

Sure.

If data sharing policies permit, may you please also upload the OMIM source files used to construct the OMIM extension of the KG to Harvard Dataverse where we can link to them in the README? Please also feel free to create a new README file to direct users to the exact relative file path where these OMIM source files must be stored.

Please see this page – https://www.omim.org/api. I wanted to upload the files to Harvard Dataverse on my end, but I thought it might need approval from OMIM, so I stopped to get some guidance here. Maybe @marinkaz already has that permission or she can tell us how we should proceed? I'll be happy to share the required files with you so you can upload them to the same "dataverse". Updating the README file sounds good; I'll be happy to help with that, too.

I'll add more commits here and be happy to discuss more. Thanks

(1) using `kg_raw` table, (2) using TDC KG data function

abearab · 2023-12-29T11:34:55Z

@ayushnoori plz see updates in the README file and append_omim.ipynb notebook.

One note: the MONDO database I used here is most likely not the same as the one you are using. Thus, I'm skipping some node/edge annotations. Let me know if you have any suggestions.

In 5489754, I made subgraphs, as mentioned earlier, in the same format as the current version of PrimeKG. Also, this part can be updated once mims-harvard/TDC#207 is merged.

cc @payalchandak

ayushnoori · 2024-01-01T01:42:23Z

Thanks for the updates, @abearab!

May you please run append_omim.ipynb again but use auxillary/kg_raw.csv? This is the KG before we take the largest connected component (please see build_graph.ipynb), and it is best to add new edges to the KG before taking the LCC so we don't unnecessarily lose any nodes. This may resolve the omim_genes_missing problem that you note in append_omim.ipynb.

Thinking about this – it would be great to make append_omim.ipynb a script which could be called by build_graph.ipynb. That is, when constructing the KG using build_graph.ipynb, the user should have the choice to add OMIM nodes before taking the LCC, for example, by setting a variable flag use_omim = True.

Alternatively, you could modify append_omim.ipynb to start from kg_raw.csv, then copy the KG standardization steps from build_graph.ipynb into append_omim.ipynb.
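The ordering concern can be illustrated with a small, dependency-free sketch. It assumes edges are plain `(x, y)` pairs treated as undirected; `build_graph` and `largest_connected_component` are hypothetical stand-ins written for this example, not the code in build_graph.ipynb, and the node labels are placeholders.

```python
from collections import defaultdict, deque


def largest_connected_component(edges):
    """Return the node set of the largest connected component of an undirected edge list."""
    adjacency = defaultdict(set)
    for x, y in edges:
        adjacency[x].add(y)
        adjacency[y].add(x)

    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects one component.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node] - component:
                component.add(neighbor)
                queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def build_graph(raw_edges, omim_edges, use_omim=False):
    """Optionally add OMIM edges first, then keep only edges inside the largest component."""
    edges = list(raw_edges) + (list(omim_edges) if use_omim else [])
    keep = largest_connected_component(edges)
    return [(x, y) for x, y in edges if x in keep and y in keep]


raw_edges = [("A", "B"), ("B", "C"), ("D", "E")]
omim_edges = [("E", "A"), ("F", "D")]

without_omim = build_graph(raw_edges, omim_edges, use_omim=False)
with_omim = build_graph(raw_edges, omim_edges, use_omim=True)
print(len(without_omim), len(with_omim))
```

In this toy case, taking the largest component first drops nodes D and E, so the OMIM edge F–D would later point at a node that no longer exists. Adding the OMIM edges first keeps all six nodes connected.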

Thus, I'm skipping some node/edge annotations.

I'm not sure what you mean by this. Could you please provide a few examples? We should try and minimize node duplication between OMIM and MONDO as much as possible.

Also, in reviewing the README:

relation
mim_disease                        9599
mim_gene                          16636
mim_phenotype                    574128
mim_phenotypic_series              4111
mim_phenotypic_series_disease       549
phenotype_map                      7259

Can you please explain or provide examples for phenotypic_series, phenotypic_series_disease, and phenotype_map edge types? I looked at the OMIM documentation but this wasn't clear to me.

Please see this page – https://www.omim.org/api. I wanted to upload files into harvard dataverse on my end but I thought it may need approvals from OMIM so I stopped it to get some guidance here.

I'll touch base with @marinkaz about sharing the OMIM-augmented KG publicly. In the interim, if you could privately share with us the OMIM-augmented KG files (e.g., via email), that would be great! We can then perform some standard checks on our end.

ayushnoori · 2024-01-01T02:21:58Z

Hi Abe, from Marinka:

We can upload the files to https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IXA7BM using a consistent naming scheme, such as OMIM-augmented-edges.csv, etc. The files should contain KG only, no derivative OMIM database, and absolutely no verbatim copies of any OMIM file. Abe should provide a notebook on how to parse the data and build a knowledge graph, but we cannot host the raw dataset in the repo or Dataverse.

If you send us (Payal, Marinka, and myself) the relevant files, we will upload them.

ayushnoori · 2024-01-01T02:33:04Z

README.md

Updated to fix typos and add clarity. LGTM!

ayushnoori

Few minor comments, but LGTM. Let's merge for now and we can work with @abearab to address and improve in later versions.

ayushnoori · 2024-01-01T02:34:20Z

datasets/omim/omim-api.ipynb

We might want to move this file to datasets/processing_scripts/omim/omim-api.ipynb for consistency.

Also, can we make this into a Python script so we can add to primary_data_resources.sh?

Sure, it can be a Python script and be called from your .sh file. You would need to request an API key from OMIM. I'm not sure how relevant this is, but the key could be added to this repository as a secret for automated actions: https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions
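A minimal sketch of how such a script might pick up the key without hard-coding it. The environment variable name `OMIM_API_KEY`, the helper `omim_entry_url`, the base URL, and the `mimNumber`, `format`, and `apiKey` query parameters are all assumptions here; they should be checked against the OMIM API documentation at https://www.omim.org/api before use. The MIM number in the demo is a placeholder.

```python
import os
from urllib.parse import urlencode

# Assumed base URL of the OMIM API; confirm against the official documentation.
OMIM_API_BASE = "https://api.omim.org/api"


def omim_entry_url(mim_number, api_key=None):
    """Build a request URL for one OMIM entry, reading the key from the environment if needed."""
    key = api_key or os.environ.get("OMIM_API_KEY")  # hypothetical variable name
    if not key:
        raise RuntimeError("Set OMIM_API_KEY or pass api_key explicitly.")
    params = {"mimNumber": mim_number, "format": "json", "apiKey": key}
    return f"{OMIM_API_BASE}/entry?{urlencode(params)}"


# The URL can then be fetched with any HTTP client, or echoed for use with wget in a shell script.
print(omim_entry_url(100100, api_key="YOUR_KEY"))
```

In a GitHub Actions workflow, the same variable could be populated from a repository secret so the key never appears in the repository itself.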

ayushnoori · 2024-01-01T02:34:47Z

datasets/processing_scripts/omim_tools.py

See comment above about moving to /omim for consistency.

I assumed this could be a place for keeping and adding more functions as "processing scripts" for OMIM.

The notebook in /datasets/omim/ was added for your records, to show how I gathered the required datasets; the data files will stay in that folder. But feel free to reorganize as you wish 👍

ayushnoori · 2024-01-01T02:38:57Z

knowledge_graph/append_omim.ipynb

This looks really great! Thanks for the detailed documentation, @abearab.

ayushnoori · 2024-01-01T02:44:45Z

@abearab, thanks for this great update to PrimeKG. Reviewed and merged.

Would be great if we could work together to address the minor points raised above (#12 (comment)) in a future update. For now, could you please send us the CSV files generated by your script (e.g., standalone OMIM edges and OMIM-augmented PrimeKG)? Thanks!

abearab · 2024-01-02T09:45:14Z

Hi @ayushnoori

Thanks for merging this!! plz see my responses below.

May you please run append_omim.ipynb again but use auxillary/kg_raw.csv? This is the KG before we take the largest connected component (please see build_graph.ipynb), and it is best to add new edges to the KG before taking the LCC so we don't unnecessarily lose any nodes. This may resolve the omim_genes_missing problem that you note in append_omim.ipynb.

Thinking about this – it would be great to make append_omim.ipynb a script which could be called by build_graph.ipynb. That is, when constructing the KG using build_graph.ipynb, the user should have the choice to add OMIM nodes before taking the LCC, for example, by setting a variable flag use_omim = True.

Alternatively, you could modify append_omim.ipynb to start from kg_raw.csv, then copy the KG standardization steps from build_graph.ipynb into append_omim.ipynb.

Thus, I'm skipping some node/edge annotations.

I'm not sure what you mean by this. Could you please provide a few examples? We should try and minimize node duplication between OMIM and MONDO as much as possible.

To build links with MONDO, I downloaded a file from the MONDO website and noticed some inconsistencies. I also didn't know how best to label nodes from MONDO. You may need to rerun the append_omim notebook with your own local MONDO files for consistency.

Also, to my understanding, MONDO is not covering OMIM "gene" entries, and that was the main motivation for me to start this integration. I'm not sure what is the best way to avoid redundant MONDO and OMIM nodes in the final KG, I'm not very familiar with MONDO.
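One possible way to reduce duplicate disease nodes is sketched below. It assumes a cross-reference table with hypothetical columns `omim_id` and `mondo_id` is available from a local MONDO file, and that edges carry `x_id`/`x_source` and `y_id`/`y_source` columns. `collapse_omim_onto_mondo` is not part of the PR, and all identifiers in the demo are placeholders.

```python
import pandas as pd


def collapse_omim_onto_mondo(edges, xref):
    """Replace OMIM node ids with MONDO ids wherever a cross-reference exists."""
    lookup = dict(zip(xref["omim_id"], xref["mondo_id"]))
    out = edges.copy()
    for side in ("x", "y"):
        is_omim = out[f"{side}_source"] == "OMIM"
        mapped = out[f"{side}_id"].map(lookup)
        replace = is_omim & mapped.notna()
        # Only rows with a known mapping are rewritten; unmapped OMIM nodes
        # (for example gene entries) keep their OMIM id and source.
        out.loc[replace, f"{side}_id"] = mapped[replace]
        out.loc[replace, f"{side}_source"] = "MONDO"
    return out.drop_duplicates().reset_index(drop=True)


edges = pd.DataFrame(
    {
        "relation": ["phenotype_map", "phenotype_map"],
        "x_id": ["OMIM:P1", "MONDO:X1"],
        "x_source": ["OMIM", "MONDO"],
        "y_id": ["OMIM:G1", "OMIM:G1"],
        "y_source": ["OMIM", "OMIM"],
    }
)
xref = pd.DataFrame({"omim_id": ["OMIM:P1"], "mondo_id": ["MONDO:X1"]})

deduplicated = collapse_omim_onto_mondo(edges, xref)
print(deduplicated)
```

Because the toy OMIM phenotype maps onto an existing MONDO node, the two rows collapse into one, while the OMIM gene node is left as it is.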

Also, in reviewing the README:
relation
mim_disease                        9599
mim_gene                          16636
mim_phenotype                    574128
mim_phenotypic_series              4111
mim_phenotypic_series_disease       549
phenotype_map                      7259
Can you please explain or provide examples for phenotypic_series, phenotypic_series_disease, and phenotype_map edge types? I looked at the OMIM documentation but this wasn't clear to me.

Please see this page – https://www.omim.org/api. I wanted to upload files into harvard dataverse on my end but I thought it may need approvals from OMIM so I stopped it to get some guidance here.

Phenotypic Series (PS) are OMIM efforts for charting phenotype sets, i.e. multiple OMIM phenotype pages will be grouped in one PS. For instance, this is something being used in UDN for patient phenotyping and diagnosis task purposes.

I'll touch base with @marinkaz about sharing the OMIM-augmented KG publicly. In the interim, if you could privately share with us the OMIM-augmented KG files (e.g., via email), that would be great! We can then perform some standard checks on our end.

Hi Abe, from Marinka:

We can upload the files to https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IXA7BM using a consistent naming scheme, such as OMIM-augmented-edges.csv, etc. The files should contain KG only, no derivative OMIM database, and absolutely no verbatim copies of any OMIM file. Abe should provide a notebook on how to parse the data and build a knowledge graph, but we cannot host the raw dataset in the repo or Dataverse.

If you send us (Payal, Marinka, and myself) the relevant files, we will upload them.

The omim-api.ipynb file in this PR is exactly what Marinka is asking for regarding "data gathering" and basic data cleaning related to our task here. If you get an OMIM API key you can run it yourself and get the most updated data from OMIM. The append_omim.ipynb is then how I extracted data from the resulting files downloaded from OMIM.

I think this part is better to be revised with your local database, for instance, I have some assumptions about MONDO which can be naive or wrong. Here I'm skipping some information for y nodes that needs revision:

Anyway, I'll be happy to share any files if that's helpful. Just let me know if you need that from me.

Best

abearab added 5 commits August 9, 2023 01:00

add omim api notebook

2c11839

add script

488434d

add PrimeKG+

0217270

extended version of primekg that include OMIM phenotypes and genes.

move details

6eab68d

reformat

3a7408f

payalchandak requested a review from ayushnoori September 11, 2023 15:31

describe OMIM

7807141

mims-harvard#12

abearab mentioned this pull request Sep 30, 2023

Improvements in the context of knowledge graph resources mims-harvard/TDC#207

Merged

abearab mentioned this pull request Oct 26, 2023

Suggesting a new data function: Knowledge Graph Mastery mims-harvard/TDC#211

Closed

abearab mentioned this pull request Dec 16, 2023

ML prediction for DAC combination therapies GilbertLabUCSF/Decitabine-treatment#5

Closed

2 tasks

Merge branch 'mims-harvard:main' into main

1efca0f

abearab added 3 commits December 29, 2023 02:35

run append_omim.ipynb

5489754

(1) using `kg_raw` table, (2) using TDC KG data function

Add details to README file

29e206e

minor updates

18b04d8

abearab added 3 commits December 29, 2023 14:46

move "updates" section

9ab4791

add Table of Contents

cd88f61

add a missing link

d2ba6ce

Update description of OMIM data coverage

978cc0c

ayushnoori reviewed Jan 1, 2024


ayushnoori approved these changes Jan 1, 2024


ayushnoori merged commit 7985e67 into mims-harvard:main Jan 1, 2024

ayushnoori mentioned this pull request Jan 1, 2024

Are there any OMIM nodes? #9

Closed
