building an extended version of primekg that include OMIM phenotypes and genes #12

Merged (14 commits) on Jan 1, 2024

abearab
Contributor

@abearab abearab commented Aug 11, 2023

Big thanks to @marinkaz for recommending a pull request to share my work here.

Updates:

  • Including scripts to gather updated data from the OMIM portal using their API (credit to @amirieb)
  • Including scripts to process and extract relevant information from downloaded OMIM datasets (downloaded in May 2023).
  • Uploading a detailed notebook to build an extended version of PrimeKG (this can be a good example for users who want to append other resources to PrimeKG and build an extended version).
  • A formatting update for the README file (I could add some notes about my work but I kept that for later).

Also:

abearab added 5 commits August 9, 2023 01:00
extended version of primekg that include OMIM phenotypes and genes.
@payalchandak
Collaborator

Hi @abearab thank you for submitting this PR! I'm wondering if you are reading the OMIM information from a locally downloaded json file? It would be great if you could make this re-useable for others by including the source website and any wget commands! Thanks

@ayushnoori
Member

Thanks @abearab! We will review. Agree with @payalchandak, would be great to include the source information. Also, please feel free to add notes on your work to the README as part of this PR.

@abearab
Contributor Author

abearab commented Sep 11, 2023

Hi @payalchandak and @ayushnoori

I'm wondering if you are reading the OMIM information from a locally downloaded json file? It would be great if you could make this re-useable for others by including the source website and any wget commands!

I already added a notebook (see this link) as part of this PR that describes how I made the local json file. Please let me know if that is what you meant or you need me to provide more info.

@abearab
Contributor Author

abearab commented Sep 12, 2023

Also, please feel free to add notes on your work to the README as part of this PR.

@ayushnoori plz see the updated README file. Let me know what you think. Thanks!

@abearab
Contributor Author

abearab commented Sep 30, 2023

@payalchandak @ayushnoori

To give you an update, I just committed a few more features as a module in TDC. I believe it helps to build, handle, explore, and integrate knowledge graphs based on the final PrimeKG data format (i.e., simply a pandas data frame with the same column names). This might be useful for future updates of PrimeKG itself, and it makes it easier to build graphs from other resources in a PrimeKG-compatible format.

mims-harvard/TDC#207


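from tdc.resource import PrimeKG  # assumed import path; the original snippet did not import PrimeKG
from tdc.utils.knowledge_graph import KnowledgeGraph, build_KG

primekg = PrimeKG(path = './datasets/PrimeKG')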

1. List nodes from a given source

primekg.get_nodes_by_source('NCBI').head()
id type name source
9796 gene/protein PHYHIP NCBI
7918 gene/protein GPANK1 NCBI
8233 gene/protein ZRSR2 NCBI
4899 gene/protein NRF1 NCBI
5297 gene/protein PI4KA NCBI

2. Extract a subgraph through running a pandas query:

subgr = primekg.copy()

subgr.run_query('(x_name == "Olaparib" | y_name == "Olaparib")')

Here, subgr.df_raw contains the data frame before the query and subgr.df contains the subgraph.
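
As a rough illustration of what such a query does at the pandas level, here is a minimal sketch on a toy edge table. It assumes only that the graph is a data frame with PrimeKG-style `x_name` / `y_name` columns; the rows are made up for illustration, and `run_query` itself may do more than this.

```python
import pandas as pd

# Toy edge table with PrimeKG-style columns (rows are illustrative only).
edges = pd.DataFrame(
    {
        "relation": ["drug_protein", "drug_protein", "drug_protein"],
        "x_name": ["Olaparib", "Aspirin", "PARP2"],
        "y_name": ["PARP1", "PTGS1", "Olaparib"],
    }
)

# Keep every edge that touches the node of interest on either side.
subgraph = edges.query('x_name == "Olaparib" | y_name == "Olaparib"')

# The original table is untouched; the result keeps the original row labels.
print(subgraph)
```

Because `query` returns a new data frame, the unfiltered table stays available, which mirrors the `df_raw` / `df` split described above.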

3. Build a knowledge graph from scratch

my_kg = build_KG(
    indices=REPLACE,           # a list to assign row names of the output data frame
    relation=REPLACE,          # a list or string to assign values
    display_relation=REPLACE,  # a list or string to assign values

    x_id=REPLACE,              # a list or string to assign values
    x_type=REPLACE,            # a list or string to assign values
    x_name=REPLACE,            # a list or string to assign values
    x_source=REPLACE,          # a list or string to assign values

    y_id=REPLACE,              # a list or string to assign values
    y_type=REPLACE,            # a list or string to assign values
    y_name=REPLACE,            # a list or string to assign values
    y_source=REPLACE,          # a list or string to assign values
)

my_kg will be a KnowledgeGraph object in the same format as PrimeKG, so the two can simply be concatenated.
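
A minimal sketch of that concatenation step, assuming both graphs are plain pandas data frames with identical column names. The column list follows the `build_KG` arguments above; the toy rows and the `drop_duplicates` step are illustrative assumptions, not taken from the TDC module.

```python
import pandas as pd

COLUMNS = [
    "relation", "display_relation",
    "x_id", "x_type", "x_name", "x_source",
    "y_id", "y_type", "y_name", "y_source",
]

# Placeholder rows; real graphs would hold actual identifiers and sources.
shared_edge = ("rel_a", "rel a", "1", "gene/protein", "GENE_A", "SRC1",
               "2", "disease", "DISEASE_B", "SRC2")
new_edge = ("rel_b", "rel b", "3", "gene/protein", "GENE_C", "SRC3",
            "2", "disease", "DISEASE_B", "SRC2")

base_kg = pd.DataFrame([shared_edge], columns=COLUMNS)
extra_kg = pd.DataFrame([shared_edge, new_edge], columns=COLUMNS)

# Refuse to combine tables whose schemas differ.
if list(base_kg.columns) != list(extra_kg.columns):
    raise ValueError("Both graphs must share the same columns")

# Stack the edge tables, then drop edges present in both graphs.
combined = (
    pd.concat([base_kg, extra_kg], ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
)
```

Checking the schema first keeps a silently misaligned concatenation (extra all-NaN columns) from slipping into the combined graph.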

@ayushnoori
Member

@payalchandak, an update: @abearab and I were able to meet today. Abe let me know that there is no rush to merge this PR, since this PR is an extension of prior work done with Zak via SIBMI several years ago. Abe will keep developing this PR and also mims-harvard/TDC#207, and will keep us posted.

@abearab, glad we were able to touch base today. Please feel free to keep myself and @payalchandak posted as your work progresses. As discussed, when we are ready to add OMIM phenotypes and genes to the KG, we will merge this PR. Thanks!

@payalchandak
Collaborator

Thanks a lot @abearab and @ayushnoori! Looking forward to expanding PrimeKG further, please keep me posted.

@ayushnoori
Member

ayushnoori commented Dec 28, 2023

Hi @abearab, we're reviewing this PR. This looks great! A few comments and requests:

  • May you please run append_omim.ipynb again but use auxillary/kg_raw.csv? This is the KG before we take the largest connected component (please see build_graph.ipynb), and it is best to add new edges to the KG before taking the LCC so we don't unnecessarily lose any nodes. This may resolve the omim_genes_missing problem that you note in append_omim.ipynb.

  • It would be super helpful if you could prepare a summary of the changes to PrimeKG by adding OMIM nodes: e.g., the number of OMIM phenotypes already in PrimeKG, the number of OMIM phenotypes you add, the number of new edges added (stratified by edge type), the original vs. final edge counts, etc. Your Venn diagrams in append_omim.ipynb look fantastic – it would be helpful to have them in a single document with descriptions so we can review together before merging this PR.

  • If data sharing policies permit, may you please also upload the OMIM source files used to construct the OMIM extension of the KG to Harvard Dataverse where we can link to them in the README? Please also feel free to create a new README file to direct users to the exact relative file path where these OMIM source files must be stored.

Thanks so much! cc: @payalchandak @marinkaz
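
The requested summary boils down to set arithmetic on node identifiers plus per-relation edge counts. A minimal sketch, using made-up identifiers and relation names rather than real OMIM or PrimeKG data (the function name `summarize_extension` is hypothetical):

```python
from collections import Counter


def summarize_extension(kg_nodes, new_nodes, kg_edges, new_edges):
    """Compare a base graph with an extension.

    Nodes are hashable ids; edges are (relation, x_id, y_id) tuples.
    """
    kg_nodes, new_nodes = set(kg_nodes), set(new_nodes)
    kg_edges, new_edges = set(kg_edges), set(new_edges)
    added_edges = new_edges - kg_edges
    return {
        "nodes_already_present": len(kg_nodes & new_nodes),
        "nodes_added": len(new_nodes - kg_nodes),
        "edges_before": len(kg_edges),
        "edges_after": len(kg_edges | new_edges),
        # New edges stratified by edge type.
        "edges_added_by_relation": dict(Counter(rel for rel, _, _ in added_edges)),
    }


summary = summarize_extension(
    kg_nodes={"n1", "n2", "n3"},
    new_nodes={"n3", "n4"},
    kg_edges={("rel_a", "n1", "n2"), ("rel_a", "n2", "n3")},
    new_edges={("rel_a", "n2", "n3"), ("rel_b", "n3", "n4")},
)
print(summary)
```

The two node counts correspond to the overlap and the extension-only region of a two-set Venn diagram.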

@abearab
Contributor Author

abearab commented Dec 29, 2023

Hi @ayushnoori, Thanks for your reply here.

  • May you please run append_omim.ipynb again but use auxillary/kg_raw.csv? This is the KG before we take the largest connected component (please see build_graph.ipynb), and it is best to add new edges to the KG before taking the LCC so we don't unnecessarily lose any nodes. This may resolve the omim_genes_missing problem that you note in append_omim.ipynb.

Sounds good, I'll do this and get back to you.

  • It would be super helpful if you could prepare a summary of the changes to PrimeKG by adding OMIM nodes: e.g., the number of OMIM phenotypes already in PrimeKG, the number of OMIM phenotypes you add, the number of new edges added (stratified by edge type), the original vs. final edge counts, etc. Your Venn diagrams in append_omim.ipynb look fantastic – it would be helpful to have them in a single document with descriptions so we can review together before merging this PR.

Sure.

  • If data sharing policies permit, may you please also upload the OMIM source files used to construct the OMIM extension of the KG to Harvard Dataverse where we can link to them in the README? Please also feel free to create a new README file to direct users to the exact relative file path where these OMIM source files must be stored.

Please see this page – https://www.omim.org/api. I wanted to upload the files to Harvard Dataverse on my end, but I thought it might need approval from OMIM, so I stopped to get some guidance here. Maybe @marinkaz already has that permission, or she can tell us how we should proceed? I'll be happy to share the required files with you so you can upload them to the same Dataverse. Updating the README file sounds good; I'll be happy to help with that, too.

I'll add more commits here and be happy to discuss more. Thanks

(1) using `kg_raw` table, (2) using TDC KG data function
@abearab
Contributor Author

abearab commented Dec 29, 2023

@ayushnoori plz see updates in the README file and append_omim.ipynb notebook.

One note: the MONDO database I used here is most likely not the same as the one you are using. Thus, I'm skipping some node/edge annotations. Let me know if you have any suggestions.

5489754: I made subgraphs, as mentioned earlier, in the same format as the current version of PrimeKG. Also, this part can be updated once mims-harvard/TDC#207 is merged.


cc @payalchandak

@ayushnoori
Member

Thanks for the updates, @abearab!

May you please run append_omim.ipynb again but use auxillary/kg_raw.csv? This is the KG before we take the largest connected component (please see build_graph.ipynb), and it is best to add new edges to the KG before taking the LCC so we don't unnecessarily lose any nodes. This may resolve the omim_genes_missing problem that you note in append_omim.ipynb.

Thinking about this – it would be great to make append_omim.ipynb a script which could be called by build_graph.ipynb. That is, when constructing the KG using build_graph.ipynb, the user should have the choice to add OMIM nodes before taking the LCC, for example, by setting a variable flag use_omim = True.

Alternatively, you could modify append_omim.ipynb to start from kg_raw.csv, then copy the KG standardization steps from build_graph.ipynb into append_omim.ipynb.
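
To make the ordering argument concrete, here is a small self-contained sketch. It is not the code in build_graph.ipynb; the function names, the `use_omim` parameter (borrowed from the suggestion above), and the toy edges are all illustrative assumptions. It shows that edges appended before the largest-connected-component (LCC) step can rescue nodes that would otherwise be dropped.

```python
from collections import defaultdict, deque


def largest_connected_component(edges):
    """Return the node set of the largest connected component (undirected)."""
    adjacency = defaultdict(set)
    for x, y in edges:
        adjacency[x].add(y)
        adjacency[y].add(x)

    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects one component.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def build_graph(raw_edges, omim_edges=(), use_omim=False):
    """Optionally append extra edges, then keep only edges inside the LCC."""
    edges = list(raw_edges) + (list(omim_edges) if use_omim else [])
    keep = largest_connected_component(edges)
    return [(x, y) for x, y in edges if x in keep and y in keep]


raw = [("a", "b"), ("b", "c"), ("d", "e")]  # d-e is a small separate island
omim = [("c", "d")]                          # hypothetical edge bridging the island

without_omim = build_graph(raw, omim, use_omim=False)
with_omim = build_graph(raw, omim, use_omim=True)
```

With the flag off, the `d`-`e` island is discarded by the LCC step; with it on, the bridging edge joins the island to the main component, so all four edges survive.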

Thus, I'm skipping some node/edge annotations.

I'm not sure what you mean by this. Could you please provide a few examples? We should try and minimize node duplication between OMIM and MONDO as much as possible.

Also, in reviewing the README:

relation
mim_disease                        9599
mim_gene                          16636
mim_phenotype                    574128
mim_phenotypic_series              4111
mim_phenotypic_series_disease       549
phenotype_map                      7259

Can you please explain or provide examples for phenotypic_series, phenotypic_series_disease, and phenotype_map edge types? I looked at the OMIM documentation but this wasn't clear to me.

Please see this page – https://www.omim.org/api. I wanted to upload files into harvard dataverse on my end but I thought it may need approvals from OMIM so I stopped it to get some guidance here.

I'll touch base with @marinkaz about sharing the OMIM-augmented KG publicly. In the interim, if you could privately share with us the OMIM-augmented KG files (e.g., via email), that would be great! We can then perform some standard checks on our end.

@ayushnoori
Member

Hi Abe, from Marinka:

We can upload the files to https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IXA7BM using a consistent naming scheme, such as OMIM-augmented-edges.csv, etc. The files should contain KG only, no derivative OMIM database, and absolutely no verbatim copies of any OMIM file. Abe should provide a notebook on how to parse the data and build a knowledge graph, but we cannot host the raw dataset in the repo or Dataverse.

If you send us (Payal, Marinka, and myself) the relevant files, we will upload them.

Member

Updated to fix typos and add clarity. LGTM!

Member

@ayushnoori ayushnoori left a comment

Few minor comments, but LGTM. Let's merge for now and we can work with @abearab to address and improve in later versions.

Member

We might want to move this file to datasets/processing_scripts/omim/omim-api.ipynb for consistency.

Member

Also, can we make this into a Python script so we can add to primary_data_resources.sh?

Contributor Author

Sure, it can be a Python script called from your .sh file. You would need to request an API key from OMIM. I'm not sure how relevant this is, but the key could be added to this repository as a secret for automated actions: https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions
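
A rough sketch of how such a script might pick up the key without hard-coding it. The environment variable name `OMIM_API_KEY`, the helper `omim_entry_url`, the base URL, and the query parameter names are assumptions for illustration and should be checked against the OMIM API documentation; the snippet only builds a URL and makes no network request.

```python
import os
from urllib.parse import urlencode

# Assumed base URL; verify against the OMIM API documentation.
OMIM_API_BASE = "https://api.omim.org/api"


def omim_entry_url(mim_number, api_key=None):
    """Build a request URL for one OMIM entry without hard-coding the key."""
    # Hypothetical variable name; in CI it could be populated from a repository secret.
    api_key = api_key or os.environ.get("OMIM_API_KEY")
    if not api_key:
        raise RuntimeError("Set OMIM_API_KEY before running this script.")
    # Parameter names below are assumptions, not confirmed by this thread.
    query = urlencode({"mimNumber": mim_number, "format": "json", "apiKey": api_key})
    return f"{OMIM_API_BASE}/entry?{query}"


print(omim_entry_url(100100, api_key="DUMMY_KEY"))
```

Reading the key from the environment keeps it out of the notebook and the repository history, whichever way the script ends up being invoked.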

Member

See comment above about moving to /omim for consistency.

Contributor Author

I assumed this could be a place for keeping and adding more functions as "processing scripts" for OMIM.

The notebook in /datasets/omim/ has just been added for your records, to show how I gathered the required datasets; the data files will stay in that folder. But feel free to reorganize as you wish 👍

Member

This looks really great! Thanks for the detailed documentation, @abearab.

@ayushnoori ayushnoori merged commit 7985e67 into mims-harvard:main Jan 1, 2024
@ayushnoori
Member

@abearab, thanks for this great update to PrimeKG. Reviewed and merged.

Would be great if we could work together to address the minor points raised above (#12 (comment)) in a future update. For now, could you please send us the CSV files generated by your script (e.g., standalone OMIM edges and OMIM-augmented PrimeKG)? Thanks!

@abearab
Contributor Author

abearab commented Jan 2, 2024

Hi @ayushnoori

Thanks for merging this!! plz see my responses below.

May you please run append_omim.ipynb again but use auxillary/kg_raw.csv? This is the KG before we take the largest connected component (please see build_graph.ipynb), and it is best to add new edges to the KG before taking the LCC so we don't unnecessarily lose any nodes. This may resolve the omim_genes_missing problem that you note in append_omim.ipynb.

Thinking about this – it would be great to make append_omim.ipynb a script which could be called by build_graph.ipynb. That is, when constructing the KG using build_graph.ipynb, the user should have the choice to add OMIM nodes before taking the LCC, for example, by setting a variable flag use_omim = True.

Alternatively, you could modify append_omim.ipynb to start from kg_raw.csv, then copy the KG standardization steps from build_graph.ipynb into append_omim.ipynb.

Thus, I'm skipping some node/edge annotations.

I'm not sure what you mean by this. Could you please provide a few examples? We should try and minimize node duplication between OMIM and MONDO as much as possible.

To build links with MONDO, I downloaded a file from the MONDO website and noticed some inconsistencies; I also didn't know how best to label nodes from MONDO. You may need to rerun the append_omim notebook with your own local MONDO files for consistency.

Also, to my understanding, MONDO does not cover OMIM "gene" entries, and that was my main motivation for starting this integration. I'm not sure what the best way is to avoid redundant MONDO and OMIM nodes in the final KG, as I'm not very familiar with MONDO.
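
One possible way to reduce such duplication, sketched under the assumption that an OMIM-to-MONDO cross-reference table is available: rewrite each OMIM id to its MONDO equivalent where one exists, and leave the rest (for example, gene entries) untouched. The function name and every identifier below are placeholders, not real OMIM or MONDO ids, and this is not what append_omim.ipynb currently does.

```python
def merge_duplicate_nodes(edges, omim_to_mondo):
    """Rewrite OMIM ids to their MONDO equivalent where a cross-reference exists."""

    def canonical(node_id):
        return omim_to_mondo.get(node_id, node_id)

    merged, seen = [], set()
    for relation, x_id, y_id in edges:
        edge = (relation, canonical(x_id), canonical(y_id))
        if edge not in seen:  # drop edges that became identical after mapping
            seen.add(edge)
            merged.append(edge)
    return merged


# Placeholder ids only; a real mapping would come from MONDO's cross-references.
xrefs = {"OMIM:p1": "MONDO:m1"}
edges = [
    ("disease_protein", "MONDO:m1", "GENE:g1"),
    ("disease_protein", "OMIM:p1", "GENE:g1"),  # same disease under its OMIM id
    ("disease_protein", "OMIM:p2", "GENE:g2"),  # no MONDO xref, kept as OMIM
]
deduplicated = merge_duplicate_nodes(edges, xrefs)
```

Entries without a cross-reference keep their OMIM id, so OMIM-only content is preserved while overlapping disease nodes collapse into one.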

Also, in reviewing the README:

relation
mim_disease                        9599
mim_gene                          16636
mim_phenotype                    574128
mim_phenotypic_series              4111
mim_phenotypic_series_disease       549
phenotype_map                      7259

Can you please explain or provide examples for phenotypic_series, phenotypic_series_disease, and phenotype_map edge types? I looked at the OMIM documentation but this wasn't clear to me.

Please see this page – https://www.omim.org/api. I wanted to upload files into harvard dataverse on my end but I thought it may need approvals from OMIM so I stopped it to get some guidance here.

Phenotypic Series (PS) are OMIM's way of charting sets of related phenotypes, i.e., multiple OMIM phenotype pages are grouped into one PS. For instance, they are used in UDN for patient phenotyping and diagnosis tasks.

I'll touch base with @marinkaz about sharing the OMIM-augmented KG publicly. In the interim, if you could privately share with us the OMIM-augmented KG files (e.g., via email), that would be great! We can then perform some standard checks on our end.

Hi Abe, from Marinka:

We can upload the files to https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IXA7BM using a consistent naming scheme, such as OMIM-augmented-edges.csv, etc. The files should contain KG only, no derivative OMIM database, and absolutely no verbatim copies of any OMIM file. Abe should provide a notebook on how to parse the data and build a knowledge graph, but we cannot host the raw dataset in the repo or Dataverse.

If you send us (Payal, Marinka, and myself) the relevant files, we will upload them.

The omim-api.ipynb file in this PR is exactly what Marinka is asking for regarding "data gathering" and the basic data cleaning related to our task here. If you get an OMIM API key, you can run it yourself and get the most up-to-date data from OMIM. The append_omim.ipynb notebook then shows how I extracted data from the resulting files downloaded from OMIM.

I think this part would be better revised with your local database; for instance, I have made some assumptions about MONDO that may be naive or wrong. Here I'm skipping some information for y nodes that needs revision (see the screenshot attached to the original comment).


Anyway, I'll be happy to share any files if that's helpful. Just let me know if you need that from me.

Best
