Code and data for the WSDM 2025 short paper
- The unit database is stored in `data/KG.csv`. The corresponding code for retrieving the data is in `data/query_data_WSDM.ipynb`.
- The Horn rules (for generating counterfactuals) are generated using `data/data_processing.ipynb`, section "# Rule Mining using Expert Causal Graph".
- The causal discovery experiments and the corresponding results are presented in `causal/causal_discovery.ipynb`.
- The counterfactual prediction experiments are done using `causal/counterf_prediction_command.py` with the commands:
    python counterf_prediction_command.py -rule cg -pca 0.9
    python counterf_prediction_command.py -rule cg -pca 0.5
    python counterf_prediction_command.py -rule cg -pca 0.1
- The results of counterfactual prediction are presented in `causal/show.ipynb`.
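The three counterfactual prediction runs differ only in the `-pca` value, so they can be scripted. The following is a minimal sketch and not part of the repository: the helper names `build_commands` and `run_all` are hypothetical, and running from the `causal/` directory is an assumption based on where the script is stored.

```python
import subprocess
import sys

# The three -pca values used in the commands listed in this README.
PCA_VALUES = ("0.9", "0.5", "0.1")


def build_commands(rule="cg", pca_values=PCA_VALUES):
    """Return one argument list per run of counterf_prediction_command.py."""
    return [
        [sys.executable, "counterf_prediction_command.py", "-rule", rule, "-pca", pca]
        for pca in pca_values
    ]


def run_all(cwd="causal"):
    """Run the commands one after another, stopping at the first failure.

    cwd="causal" is an assumption: the script lives in causal/ and the
    commands are written without a directory prefix.
    """
    for cmd in build_commands():
        subprocess.run(cmd, cwd=cwd, check=True)
```

Calling `run_all()` from the repository root would then execute the three runs in sequence.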
# ROLE #
Act as an expert on identifying causal relationships between properties within a knowledge graph.
# CONTEXT #
You are examining the causal relationships among properties within the knowledge graph.
- Property: age;
- Property: episodeType;
- Property: familyGender;
- Property: hasBio;
- Property: hasFamilyCancerType;
- Property: hasSmokingHabit;
- Property: locatedIn;
- Property: relapse;
- Property: sex;
- Property: stage;
# OBJECTIVES #
- Analyze all these properties in the # CONTEXT # to identify all possible causal relationships among them.
- Each identified causal relationship should be supported by evidence from academic studies, referenced using title, DOI or PMCID to ensure confidentiality and verify the reliability of the studies.
# INSTRUCTIONS #
- Examine and understand each property carefully.
- Identify all possible causal relationships among the properties in the # CONTEXT #; each identified causal relationship must be supported by the title, DOI or PMCID of relevant academic research.
- Provide, in a code box, the identified causal relationships in the previous step as a Python set of tuples, each with format: ([A], [B]) representing a causal relationship from property [A] to property [B].
- Do not introduce directed loops along the causal relationships.
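The last instruction of this prompt (no directed loops) can be checked mechanically on the returned set of tuples. The following is a minimal sketch using Kahn's algorithm; the function name `is_acyclic` is hypothetical and the example edges are made up for illustration, they are not results from the paper.

```python
from collections import defaultdict, deque


def is_acyclic(edges):
    """Return True if the directed edges (A, B) contain no directed loop."""
    successors = defaultdict(set)
    indegree = defaultdict(int)
    nodes = set()
    for cause, effect in edges:
        nodes.update((cause, effect))
        if effect not in successors[cause]:
            successors[cause].add(effect)
            indegree[effect] += 1
    # Kahn's algorithm: repeatedly remove nodes with no incoming edge.
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    # Any node left over sits on (or downstream of) a directed loop.
    return removed == len(nodes)


# Made-up edges for illustration only.
example = {("hasSmokingHabit", "stage"), ("stage", "relapse")}
print(is_acyclic(example))                                    # True
print(is_acyclic(example | {("relapse", "hasSmokingHabit")}))  # False
```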
# ROLE #
Act as an expert in lung cancer research, focusing on identifying causal relationships between properties within a knowledge graph of non-small cell lung cancer (NSCLC) patients.
# CONTEXT #
You are examining the causal relationships among properties of the `NLCPatient` class within a knowledge graph, with rich metadata that describe their human-understandable label (`rdfs:label`), human-understandable meaning (`rdfs:comment`), domain (`rdfs:domain`), and range (`rdfs:range`):
- Property: age
  `rdfs:label`: "age"
  `rdfs:comment`: "The age of NSCLC patients, classified as 'Young' (<= 50 years) or 'Old' (> 50 years)."
  `rdfs:domain`: `NLCPatient`
  `rdfs:range`: `xsd:integer`;
- Property: episodeType
  `rdfs:label`: "treatment episode type"
  `rdfs:comment`: "The type of treatment episodes undergone by NSCLC patients, which could include categories such as: 'OralTargetedTherapy', 'Immunotherapy', 'Oral_and_intravenous_chemotherapy', 'Chemotherapy', 'Concomitant_QT-RT', 'QT_intravenous', 'Sequential_QT-RT', 'OralChemotherapy', 'Intravenous_chemotherapy_+_immunotherapy', 'Other', 'Adjuvant_QT-RT', 'HormonalTherapy', 'Neoadjuvant_QT-RT'."
  `rdfs:domain`: `NLCPatient`
  `rdfs:range`: `xsd:string`;
- Property: familyGender
  `rdfs:label`: "family gender"
  `rdfs:comment`: "The gender of NSCLC patients' cancered family antecedents, either 'Women' or 'Men'."
  `rdfs:domain`: `NLCPatient`
  `rdfs:range`: `xsd:string`;
- Property: hasBio
  `rdfs:label`: "biomarker test result"
  `rdfs:comment`: "The biomarker test results of NSCLC patients, including ALK or EGFR; 'other biomarker' includes MET, HER2, FGFR1, KRAS, RET, PDL1, HER2Mut, ROS1, BRAF."
  `rdfs:domain`: `NLCPatient`
  `rdfs:range`: `xsd:string`;
- Property: hasFamilyCancerType
  `rdfs:label`: "family cancer type"
  `rdfs:comment`: "The type of cancer in the family of NSCLC patients, which can be 'Major' representing cancer types in ('Lung', 'Colorectal', 'Head and neck', 'Uterus/cervical', 'Esophagogastric', 'Prostate'), 'Breast' cancer or 'Minor' which represents other cancer types."
  `rdfs:domain`: `NLCPatient`
  `rdfs:range`: `xsd:string`;
- Property: hasSmokingHabit
  `rdfs:label`: "smoking habits"
  `rdfs:comment`: "The smoking habits of NSCLC patients, classified as 'Non-Smoker' or 'Smoker'."
  `rdfs:domain`: `NLCPatient`
  `rdfs:range`: `xsd:string`;
- Property: locatedIn
  `rdfs:label`: "hospital location"
  `rdfs:comment`: "The geographic location or region in Spain where the hospital (where the NSCLC patient receives diagnosis) is located."
  `rdfs:domain`: `NLCPatient`
  `rdfs:range`: `xsd:string`;
- Property: relapse
  `rdfs:label`: "relapse or progression"
  `rdfs:comment`: "Indicates whether the NSCLC patient has experienced a relapse or progression of the lung cancer after initial treatment, classified as either 'Relapse' or 'Progression'."
  `rdfs:domain`: `NLCPatient`
  `rdfs:range`: `xsd:string`;
- Property: sex
  `rdfs:label`: "gender"
  `rdfs:comment`: "The gender of NSCLC patients, either 'male' or 'female'."
  `rdfs:domain`: `NLCPatient`
  `rdfs:range`: `xsd:string`;
- Property: stage
  `rdfs:label`: "cancer stage"
  `rdfs:comment`: "The diagnosed stage of lung cancer of the patient, classified according to the TNM staging system (e.g., Stage I, Stage II, Stage III, Stage IV) which indicates the severity and spread of the cancer."
  `rdfs:domain`: `NLCPatient`
  `rdfs:range`: `xsd:string`.
Here are some associations between properties recognized in the knowledge graph:
associations = [('hasBio', 'hasFamilyCancerType'), ('locatedIn', 'relapse'), ('hasFamilyCancerType', 'hasSmokingHabit'), ('hasBio', 'locatedIn'), ('locatedIn', 'hasSmokingHabit'), ('hasBio', 'stage'), ('sex', 'relapse'), ('relapse', 'stage'), ('sex', 'hasSmokingHabit'), ('age', 'relapse'), ('age', 'hasSmokingHabit'), ('hasBio', 'familyGender'), ('familyGender', 'relapse'), ('hasBio', 'relapse'), ('hasBio', 'hasSmokingHabit'), ('age', 'hasBio'), ('familyGender', 'hasSmokingHabit'), ('hasBio', 'episodeType'), ('hasBio', 'sex'), ('relapse', 'hasSmokingHabit'), ('hasSmokingHabit', 'stage'), ('episodeType', 'relapse'), ('episodeType', 'hasSmokingHabit'), ('hasFamilyCancerType', 'relapse')]
where each tuple denotes an association between two properties of the `NLCPatient` class.
# OBJECTIVES #
- Analyze all these properties in the # CONTEXT # to identify all possible causal relationships among them.
- Each identified causal relationship should be supported by evidence from academic studies, referenced using title, DOI or PMCID to ensure confidentiality and verify the reliability of the studies.
# INSTRUCTIONS #
- Examine and understand the metadata of each property carefully.
- For each tuple ([A], [B]) in the `associations` list, determine whether a causal relationship exists between properties [A] and [B] and, if so, what the causal direction is.
- Identify all possible causal relationships within the `associations` list, where the properties are described in the # CONTEXT #; each identified causal relationship must be supported by the title, DOI or PMCID of relevant academic research.
- Provide, in a code box, the identified causal relationships in the previous step as a Python set of tuples, each with format: ([A], [B]) representing a causal relationship from property [A] to property [B].
- Do not introduce directed loops along the causal relationships.
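Because this prompt asks for the answer as a Python set of tuples in a code box, restricted to pairs from the association list, a reply can be parsed and checked automatically. The following is a minimal sketch, not the code used in the repository: `extract_edges` and `split_by_association` are hypothetical helper names, and the reply text and the shortened association list are made up for illustration.

```python
import ast
import re

FENCE = "`" * 3  # code-box delimiter, built this way to keep the example fence-safe


def extract_edges(reply):
    """Parse the first fenced code box in an LLM reply into a set of (A, B) tuples."""
    pattern = re.escape(FENCE) + r"(?:python)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no code box found in reply")
    body = match.group(1).strip()
    # Accept either a bare literal or an assignment such as `edges = {...}`.
    if "=" in body.split("{", 1)[0]:
        body = body.split("=", 1)[1].strip()
    return {tuple(edge) for edge in ast.literal_eval(body)}


def split_by_association(edges, associations):
    """Separate edges backed by an association (in either direction) from the rest."""
    linked = {frozenset(pair) for pair in associations}
    supported = {edge for edge in edges if frozenset(edge) in linked}
    return supported, edges - supported


# Made-up reply and a shortened association list, for illustration only.
reply = (
    "Identified relationships:\n"
    + FENCE + "python\n"
    + "{('hasSmokingHabit', 'stage'), ('stage', 'relapse'), ('age', 'stage')}\n"
    + FENCE
)
associations = [("relapse", "stage"), ("hasSmokingHabit", "stage")]

supported, unsupported = split_by_association(extract_edges(reply), associations)
print(sorted(supported))    # [('hasSmokingHabit', 'stage'), ('stage', 'relapse')]
print(sorted(unsupported))  # [('age', 'stage')]
```

Associations are treated as undirected here, so an edge counts as supported when its two properties appear together in a tuple regardless of order. `ast.literal_eval` is used instead of `eval` so that the reply is only ever read as a literal.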