building an extended version of primekg that include OMIM phenotypes and genes #12
extended version of primekg that include OMIM phenotypes and genes.
Hi @abearab, thank you for submitting this PR! I'm wondering if you are reading the OMIM information from a locally downloaded JSON file? It would be great if you could make this reusable for others by including the source website and any wget commands! Thanks
Thanks @abearab! We will review. Agree with @payalchandak, would be great to include the source information. Also, please feel free to add notes on your work to the README as part of this PR.
Hi @payalchandak and @ayushnoori
I already added a notebook (see this link) as part of this PR that describes how I made the local JSON file. Please let me know if that is what you meant or if you need me to provide more info.
@ayushnoori plz see the updated README file. Let me know what you think. Thanks!
To give you an update, I just committed a few more features as a module in TDC. I believe it helps to build, handle, explore, and integrate knowledge graphs based on the final PrimeKG data format (i.e., simply a pandas data frame with the same column names). This might be useful for future updates of PrimeKG itself, and it makes it easier to build graphs from other resources in a PrimeKG-compatible format.

    from tdc.utils.knowledge_graph import KnowledgeGraph, build_KG
    primekg = PrimeKG(path = './datasets/PrimeKG')

1. List nodes from a given source

    primekg.get_nodes_by_source('NCBI').head()

2. Extract a subgraph by running a pandas query

    subgr = primekg.copy()
    subgr.run_query('(x_name == "Olaparib" | y_name == "Olaparib")')

3. Build a knowledge graph from scratch

    my_kg = build_KG(
        indices          = REPLACE,  # a list to assign row names of output data frame
        relation         = REPLACE,  # a list or string to assign values
        display_relation = REPLACE,  # a list or string to assign values
        x_id             = REPLACE,  # a list or string to assign values
        x_type           = REPLACE,  # a list or string to assign values
        x_name           = REPLACE,  # a list or string to assign values
        x_source         = REPLACE,  # a list or string to assign values
        y_id             = REPLACE,  # a list or string to assign values
        y_type           = REPLACE,  # a list or string to assign values
        y_name           = REPLACE,  # a list or string to assign values
        y_source         = REPLACE   # a list or string to assign values
    )
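For readers without the TDC module installed, the data format described above can be sketched with pandas alone. This is a minimal illustration, not the module's implementation: the two rows are made-up placeholders rather than real PrimeKG records, and treating `run_query` as roughly equivalent to `DataFrame.query` is an assumption.

```python
import pandas as pd

# Column names taken from the build_KG arguments listed in the comment above.
COLUMNS = [
    "relation", "display_relation",
    "x_id", "x_type", "x_name", "x_source",
    "y_id", "y_type", "y_name", "y_source",
]

# Placeholder rows for illustration only; these are not real PrimeKG records.
rows = [
    ["drug_protein", "target", "D1", "drug", "Olaparib",
     "SRC_A", "G1", "gene/protein", "GENE1", "SRC_B"],
    ["drug_protein", "target", "D2", "drug", "OtherDrug",
     "SRC_A", "G2", "gene/protein", "GENE2", "SRC_B"],
]
kg = pd.DataFrame(rows, columns=COLUMNS)

# The same filter expression as in the comment, evaluated with plain pandas.
subgraph = kg.query('(x_name == "Olaparib" | y_name == "Olaparib")')
print(subgraph[["x_name", "relation", "y_name"]])
```

Because the graph is just an edge table, any pandas operation (filtering, merging, concatenating edges from another source) applies directly, which is what makes a shared column layout useful.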
@payalchandak, an update: @abearab and I were able to meet today. Abe let me know that there is no rush to merge this PR, since this PR is an extension of prior work done with Zak via SIBMI several years ago. Abe will keep developing this PR and also mims-harvard/TDC#207, and will keep us posted. @abearab, glad we were able to touch base today. Please feel free to keep me and @payalchandak posted as your work progresses. As discussed, when we are ready to add OMIM phenotypes and genes to the KG, we will merge this PR. Thanks!
Thanks a lot @abearab and @ayushnoori! Looking forward to expanding PrimeKG further, please keep me posted.
Hi @abearab, we're reviewing this PR. This looks great! A few comments and requests:
Thanks so much! cc: @payalchandak @marinkaz
Hi @ayushnoori, Thanks for your reply here.
Sounds good, I'll do this and get back to you.
Sure.
Please see this page – https://www.omim.org/api. I wanted to upload the files to Harvard Dataverse on my end, but I thought that might need approval from OMIM, so I stopped to get some guidance here. Maybe @marinkaz already has that permission, or she can tell us how we should proceed? I'll be happy to share the required files with you so you can upload them to the same "dataverse". Updating the README file sounds good; I'll be happy to help with that, too. I'll add more commits here and will be happy to discuss more. Thanks
(1) using `kg_raw` table, (2) using TDC KG data function
@ayushnoori plz see the updates in the README file and the append_omim.ipynb notebook. One note: the MONDO database I used here is most likely not the same as the one you are using. Thus, I'm skipping some node/edge annotations. Let me know if you have suggestions. 5489754 I made subgraphs, as mentioned earlier, in the same format as the current version of PrimeKG. Also, this part can be updated once mims-harvard/TDC#207 is merged.
Thanks for the updates, @abearab!
Thinking about this – it would be great to make Alternatively, you could modify
I'm not sure what you mean by this. Could you please provide a few examples? We should try and minimize node duplication between OMIM and MONDO as much as possible. Also, in reviewing the README:
Can you please explain or provide examples for
I'll touch base with @marinkaz about sharing the OMIM-augmented KG publicly. In the interim, if you could privately share with us the OMIM-augmented KG files (e.g., via email), that would be great! We can then perform some standard checks on our end.
Hi Abe, from Marinka:
If you send us (Payal, Marinka, and myself) the relevant files, we will upload them.
Updated to fix typos and add clarity. LGTM!
Few minor comments, but LGTM. Let's merge for now and we can work with @abearab to address and improve in later versions.
We might want to move this file to datasets/processing_scripts/omim/omim-api.ipynb for consistency.
Also, can we make this into a Python script so we can add it to primary_data_resources.sh?
Surely it can be a Python script and be called in your .sh file. You would need to request an API key from OMIM. I'm not sure how relevant it is, but the key could be added to this repository as a secret for automated actions: https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions
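A script along these lines could read the key from the environment rather than hard-coding it, which also fits the GitHub Actions secrets approach. The sketch below is illustrative only: the environment variable name `OMIM_API_KEY` and the helper names are hypothetical, and the endpoint and parameter names (`mimNumber`, `format`, `apiKey`) are assumptions based on OMIM's public API documentation that should be checked against https://www.omim.org/api before use.

```python
import json
import os
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; verify against the OMIM API documentation.
OMIM_ENTRY_ENDPOINT = "https://api.omim.org/api/entry"


def build_entry_url(mim_number, api_key=None):
    """Build the request URL for one OMIM entry (hypothetical helper)."""
    # Fall back to an environment variable so the key never lives in the repo.
    key = api_key or os.environ.get("OMIM_API_KEY")
    if not key:
        raise RuntimeError("No API key given and OMIM_API_KEY is not set.")
    params = {"mimNumber": mim_number, "format": "json", "apiKey": key}
    return f"{OMIM_ENTRY_ENDPOINT}?{urlencode(params)}"


def fetch_entry(mim_number, api_key=None):
    """Download one entry and parse the JSON response (performs a network call)."""
    with urlopen(build_entry_url(mim_number, api_key), timeout=30) as response:
        return json.load(response)
```

In a shell script, the key would be exported first (for example from a CI secret) and the Python script invoked afterwards; nothing here runs a request until `fetch_entry` is called.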
See comment above about moving to /omim for consistency.
I assumed this could be a place for keeping and adding more functions as "processing scripts" for OMIM. The notebook in /datasets/omim/ has just been added for your records to show how I gathered the required datasets, and the data files will stay in that folder. But feel free to reorganize as you wish 👍
This looks really great! Thanks for the detailed documentation, @abearab.
@abearab, thanks for this great update to PrimeKG. Reviewed and merged. Would be great if we could work together to address the minor points raised above (#12 (comment)) in a future update. For now, could you please send us the CSV files generated by your script (e.g., standalone OMIM edges and OMIM-augmented PrimeKG)? Thanks!
Hi @ayushnoori Thanks for merging this!! plz see my responses below.
To build links with MONDO I downloaded a file from MONDO website and I noticed some inconsistencies, also I didn't know how best to label nodes from MONDO. You may need to rerun Also, to my understanding, MONDO is not covering OMIM "gene" entries, and that was the main motivation for me to start this integration. I'm not sure what is the best way to avoid redundant MONDO and OMIM nodes in the final KG, I'm not very familiar with MONDO.
Phenotypic Series (PS) are OMIM efforts for charting phenotype sets, i.e. multiple OMIM phenotype pages will be grouped in one PS. For instance, this is something being used in UDN for patient phenotyping and diagnosis task purposes.
The I think this part is better to be revised with your local database, for instance, I have some assumptions about MONDO which can be naive or wrong. Here I'm skipping some information for Anyway, I'll be happy to share any files if that's helpful. Just let me know if you need that from me. Best |
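To make the Phenotypic Series idea above concrete, here is a small sketch of how one series grouping several phenotype pages could be written as edges in the PrimeKG column layout. Everything specific in it is hypothetical: the IDs are placeholders rather than real OMIM records, and the relation and node-type labels are invented for illustration, not taken from the merged code.

```python
import pandas as pd

# PrimeKG-style column layout (same names as the build_KG arguments).
COLUMNS = [
    "relation", "display_relation",
    "x_id", "x_type", "x_name", "x_source",
    "y_id", "y_type", "y_name", "y_source",
]

# Hypothetical mapping: phenotypic series -> member phenotype entries.
ps_members = {
    "PS_A": ["MIM_1", "MIM_2"],
    "PS_B": ["MIM_3"],
}

# One edge per (phenotype, series) pair; labels below are invented.
rows = [
    {
        "relation": "phenotypic_series_member",
        "display_relation": "member of",
        "x_id": mim, "x_type": "omim_phenotype",
        "x_name": mim, "x_source": "OMIM",
        "y_id": ps, "y_type": "omim_phenotypic_series",
        "y_name": ps, "y_source": "OMIM",
    }
    for ps, members in ps_members.items()
    for mim in members
]
ps_edges = pd.DataFrame(rows, columns=COLUMNS)

# Recover the grouping: which phenotype entries belong to each series.
print(ps_edges.groupby("y_id")["x_id"].apply(list))
```

Keeping the series as its own node type means the grouping can be dropped or kept when merging with MONDO-derived disease nodes, without touching the phenotype nodes themselves.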
Big thanks to @marinkaz for recommending a pull request to share my works here.
Updates:
Also: