Bachelor Thesis, Artificial Intelligence @ Johannes Kepler University
Author: Martin Dallinger ([email protected])
Linz Institute of Technology Secure and Correct Systems Lab
Supervisor: Univ.-Prof. Priv.-Doz. DDI Dr. Stefan Rass
This project offers a regression-based approach over fully explainable fuzzy rules (Mamdani FIS) to discover biases in arbitrary datasets.
More details (especially technical documentation) will follow in the coming months.
The web-app runs in your browser🎉. https://xai.martin-dallinger.me
Fuzzy-Regression XAI is a regression-based approach that uses fully explainable fuzzy rules (a Mamdani Fuzzy Inference System) to discover and analyze biases in arbitrary datasets with a numerical target variable. The tool uses fuzzy inference to create simple basis functions. One such basis function could be If NumberOfRooms is high AND If District is downtown then Price is veryhigh, which is generated automatically by the tool and assigned a coefficient that shows how important the rule is in the regression system.
It is recommended to remove outliers, especially on the target variable (this can be done with outlier_filtering)! This is important because the fuzzy sets (veryhigh, high, ...) are spread with equal width over the variable's range, so extreme values stretch every set.
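To make the mechanics concrete, here is a minimal TypeScript sketch of equal-width triangular fuzzy sets and of a rule's firing strength (the minimum over its antecedent memberships). The function names and exact set shapes are illustrative assumptions, not taken from this repository:

```typescript
type MembershipFn = (x: number) => number;

// Partition [lo, hi] into equal-width triangular sets, one per label.
// Because the widths are equal, a single extreme outlier stretches every set,
// which is why outlier filtering on the target variable matters.
function equalWidthPartition(lo: number, hi: number, labels: string[]): Map<string, MembershipFn> {
  const step = (hi - lo) / (labels.length - 1);
  const sets = new Map<string, MembershipFn>();
  labels.forEach((label, i) => {
    const center = lo + i * step;
    // 1 at the center, falling linearly to 0 one step away
    sets.set(label, (x) => Math.max(0, 1 - Math.abs(x - center) / step));
  });
  return sets;
}

// A rule such as "If NumberOfRooms is high AND If District is downtown then
// Price is veryhigh" fires with the minimum of its antecedent memberships;
// that firing strength is the basis-function value the regression weights.
const firingStrength = (memberships: number[]) => Math.min(...memberships);

const labels = ["verylow", "low", "medium", "high", "veryhigh"];
const rooms = equalWidthPartition(1, 9, labels);  // NumberOfRooms in [1, 9]
const muHigh = rooms.get("high")!(6);             // 0.5: 6 rooms is halfway to "high" (center 7)
const strength = firingStrength([muHigh, 1]);     // crisp "District is downtown" contributes 1
```

Note how the set centers (1, 3, 5, 7, 9 here) depend only on the observed range, which is exactly why a single outlier at, say, 100 rooms would distort all five sets.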
- Suggested Node.js version: 22.11.0
- npm (Node Package Manager) v6.0 or higher

Install dependencies with npm install.
You can either serve this project as an API or compile it and import it in the browser.
First, clone the repo and get the necessary dependencies:
git clone https://github.com/S0urC10ud/xai-fuzzy-regrules
cd xai-fuzzy-regrules
npm i
npm run build:web # or: npm run build:node
Start the server:
npm run build:node
npm start
Furthermore, the following debugging configuration can be used with VS Code:
{
"version": "0.2.0",
"configurations": [
{
"type": "node",
"request": "launch",
"name": "Launch TypeScript (Dev)",
"runtimeExecutable": "node",
"runtimeArgs": ["--inspect"],
"args": ["-r", "ts-node/register", "src/api/index.ts"],
"sourceMaps": true,
"outFiles": ["${workspaceFolder}/dist/**/*.js"],
"skipFiles": ["<node_internals>/**"]
},
{
"type": "node",
"request": "launch",
"name": "Launch JavaScript (Prod)",
"program": "${workspaceFolder}/dist/api/index.js",
"runtimeArgs": ["--inspect"],
"sourceMaps": true,
"skipFiles": ["<node_internals>/**"],
"outFiles": ["${workspaceFolder}/dist/**/*.js"]
}
]
}
To run the frontend, first bundle it:
npm run build:web
and then serve it with the web server of your choice (e.g. http-server) from the root directory of this project.
Uploads a CSV file along with metadata to process the data and generate regression models. Request:
Headers: Content-Type: multipart/form-data
Body:
- csvFile: The CSV file to be uploaded.
- metadata: JSON string containing configuration parameters.
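A minimal sketch of assembling this request in Node 22, where fetch, FormData and Blob are available as globals. The host, port and route path below are assumptions, not documented here; check src/api/index.ts for the actual endpoint:

```typescript
// Minimal metadata; see the full example below for all available options.
const metadata = {
  split_char: ";",
  decimal_point: ".",
  target_var: "Salary",
  lasso: { regularization: 1 },
  num_vars: 2,
};

const csv = "Name;Salary\nAlice;52000\nBob;48000\n";

// multipart/form-data body: one file part, one JSON-string part
const form = new FormData();
form.append("csvFile", new Blob([csv], { type: "text/csv" }), "data.csv");
form.append("metadata", JSON.stringify(metadata));

// Hypothetical URL - adjust host/port/path to your deployment:
// const res = await fetch("http://localhost:3000/upload", { method: "POST", body: form });
```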
Example request metadata:
{
"split_char": ";", // split character for the CSV-file
"decimal_point": ".", // decimal point for numbers in the CSV-file (usually ./,)
"target_var": "Salary", // target column from csv file to explain
"lasso": {
"regularization": 1, // the lambda parameter for lasso regularization, it is highly recommended to set this to a non-zero value, especially when there are more rules than samples (small datasets)
"max_lasso_iterations": 10000, // default is 10000 - lasso is applied iteratively until convergence or until this max-iterations-counter is hit
"lasso_convergance_tolerance": 1e-4 // default is 1e-4, this is the threshold for the absolute value of difference between beta[i-1] and beta[i] until we say it converged
},
"rule_filters": {
"l1_row_threshold": 0.1, // row/(2*threshold) will be serialized to a string and checked for duplicates
"l1_column_threshold": 0.1, // same as l1_row_threshold but column-wise
"dependency_threshold": 0, // if the residual from the Gram-Schmidt orthogonalization has a norm lower than this value, the vector is considered being linearly dependent - set to 0 to disable
"significance_level": 0.05, // for the lasso-test with H0 that the coefficient Beta=0
"remove_insignificant_rules": false, // remove rules that are not statistically significant, requires compute_pvalues to be true
"only_whitelist": false, // disables the rule generation and forces the system only to use the specified whitelist-rules
"rule_priority_filtering": {
"enabled": true, // default: false, filters for minimum rule priority (computation described in rule_priority_weights),
"min_priority": 0.04 // all rules with a priority geq this value will survive (but intercept is exempted) - NOTE: Priorities can also be negative, because leverage may be negative
}
},
"compute_pvalues": true, // if you want pValues in the output (H0 = Rule not needed), set this to true - disadvantage: the computation take much longer (for each non-filtered basis function a model has to be fit)
"numerical_fuzzification": ["veryhigh", "high", "medium", "low", "verylow"], // defines the fuzzy sets - possible values: verylow, low, mediumlow, medium, mediumhigh, high, veryhigh
"numerical_defuzzification": ["veryhigh", "high", "medium", "low", "verylow"], // same as above
"return_contributions": false, // default: false, returns the contribution matrix from regression shaped [rules][records] containing row-normalized contributions
"variance_threshold": 1e-5, // Columns with a variance smaller than this value can be removed, set to 0 to disable
"remove_low_variance": false, // Defaults to false, toggles only warn vs. actually remove columns below variance threshold
"include_intercept": true, // Defaults to true; Determines, whether at absolute 0 the model should be forced to go to 0 or if an intercept can be used to offset it - this parameter cannot be removed from colinearities or the significance-test
"re_fit_after_removing_insignificant_rules": false, // only able to be active if remove_insignificant_rules is true
"outlier_filtering": {
"Salary": { // column name
"method": "VariableBounds",
"min": 25000,
"max": 85000
}
//another example: "TAX": {
// "method": "IQR",
// "outlier_iqr_multiplier": 4
//}
},
"num_vars": 2, // number of antecedents to combine - will scale compute quadratically
"whitelist": [ // these rules will definitely be included
"If CRIM is high AND If PTRATIO is high then MEDV is verylow",
"If DIS is low AND If INDUS is high then MEDV is verylow"
],
"blacklist": [ // these rules will be removed from the generation
"If CRIM is high AND If RM is high then MEDV is verylow",
"If DIS is high AND If LSTAT is high then MEDV is veryhigh"
],
"rule_priority_weights": { // weighting for ordering the rules - the order is important for the linear dependency threshold removal
"support_weight": 1, // support_weight * rule.support (see association rule mining theory) +
"leverage_weight": 10, // leverage_weight * rule.leverage (see association rule mining theory) +
"num_antecedents_weight": 0, // num_antecedents_weight * (1 / numAntecedents) +
"whitelist_boolean_weight": 1000 // + whitelist_boolean_weight if the rule is a whitelisted rule
},
}
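The rule priority that rule_priority_filtering compares against min_priority is, per the comments in rule_priority_weights, a weighted sum. A sketch of that computation (the interface and function names are made up for illustration):

```typescript
interface RuleStats {
  support: number;        // as in association rule mining
  leverage: number;       // may be negative, so priority may be negative
  numAntecedents: number;
  isWhitelist: boolean;
}

interface PriorityWeights {
  support_weight: number;
  leverage_weight: number;
  num_antecedents_weight: number;
  whitelist_boolean_weight: number;
}

// priority = support_weight * support + leverage_weight * leverage
//          + num_antecedents_weight * (1 / numAntecedents)
//          + whitelist_boolean_weight (only if the rule is whitelisted)
function rulePriority(rule: RuleStats, w: PriorityWeights): number {
  return (
    w.support_weight * rule.support +
    w.leverage_weight * rule.leverage +
    w.num_antecedents_weight * (1 / rule.numAntecedents) +
    (rule.isWhitelist ? w.whitelist_boolean_weight : 0)
  );
}

const weights = { support_weight: 1, leverage_weight: 10, num_antecedents_weight: 0, whitelist_boolean_weight: 1000 };
const p = rulePriority({ support: 0.05, leverage: 0.044, numAntecedents: 1, isWhitelist: false }, weights);
```

With these weights, whitelisted rules always outrank generated ones, which matters because rule order drives the linear-dependency removal.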
Example response/result for the biased_salaries dataset (in example_unveiling_biases):
{
"mean_absolute_error": 1146.3913891140153,
"root_mean_squared_error": 1437.1583491639678,
"r_squared": 0.9784692594108033,
"mean_absolute_percentage_error": 2.718759531208799,
"sorted_rules": [
{
"title": "Intercept",
"coefficient": 10.438564951856518,
"isWhitelist": true,
"support": 0,
"leverage": 0,
"priority": 0,
"pValue": null,
"secondaryRules": [],
"mostAffectedCsvRows": []
},
{
"title": "If HiringManager is B AND If Gender is other then Salary is high",
"coefficient": 0.15621443630865187,
"isWhitelist": false,
"support": 0.0050217609641781055,
"leverage": 0.004312964706227824,
"priority": 0.04815140802645635,
"pValue": 0,
"secondaryRules": [],
"mostAffectedCsvRows": [
1398,
2084,
2703,
1955,
747,
2875,
697,
220,
519,
2032
]
},
{
"title": "If JobPosition is management then Salary is high",
"coefficient": 0.12827094712560508,
"isWhitelist": false,
"support": 0.05222631402745229,
"leverage": 0.04440870824123596,
"priority": 0.4963133964398119,
"pValue": 0.5402194952880761,
"secondaryRules": [],
"mostAffectedCsvRows": [
2101,
1878,
637,
1373,
2175,
2494,
2673,
2024,
2614,
1625
]
},
{
"title": "If UniversityReputation is veryhigh then Salary is medium",
"coefficient": 0.07444006501921702,
"isWhitelist": false,
"support": 0.09005691329092735,
"leverage": 0.04015234412170404,
"priority": 0.4915803545079678,
"pValue": 0.6993551771227084,
"secondaryRules": [],
"mostAffectedCsvRows": [
171,
2959,
2971,
2582,
1840,
282,
495,
2873,
1367,
1131
]
},
{
"title": "If HiringManager is A then Salary is medium",
"coefficient": 0.06186564613038564,
"isWhitelist": false,
"support": 0.0746568463341145,
"leverage": 0.03921736967770954,
"priority": 0.46683054311121,
"pValue": 0.13624170116805234,
"secondaryRules": [],
"mostAffectedCsvRows": [
566,
2972,
295,
1893,
2733,
489,
1186,
634,
2058,
2137
]
},
...
{
"title": "If Experience is medium then Salary is medium",
"coefficient": -0.060380365296388974,
"isWhitelist": false,
"support": 0.14328757951121526,
"leverage": 0.012699938770494051,
"priority": 0.27028696721615575,
"pValue": 0.31524925552392524,
"secondaryRules": [],
"mostAffectedCsvRows": [
1064,
2778,
1193,
2917,
80,
2631,
1153,
673,
2344,
687
]
},
{
"title": "If UniversityReputation is verylow then Salary is verylow",
"coefficient": -0.07684795806488968,
"isWhitelist": false,
"support": 0.05423501841312354,
"leverage": 0.03819967992088023,
"priority": 0.4362318176219259,
"pValue": 0.9512387038579269,
"secondaryRules": [],
"mostAffectedCsvRows": [
1027,
230,
883,
46,
2779,
229,
1100,
235,
2372,
1246
]
},
{
"title": "If Experience is low then Salary is verylow",
"coefficient": -0.11512082566839403,
"isWhitelist": false,
"support": 0.0572480749916304,
"leverage": 0.03510805500321727,
"priority": 0.4083286250238032,
"pValue": 0.017694714456290894,
"secondaryRules": [],
"mostAffectedCsvRows": [
734,
464,
1032,
1569,
2549,
493,
527,
1897,
971,
2896
]
},
{
"title": "If Experience is verylow then Salary is verylow",
"coefficient": -0.19793330948201468,
"isWhitelist": false,
"support": 0.010713090056913292,
"leverage": 0.0083218553694735,
"priority": 0.09393164375164828,
"pValue": 0,
"secondaryRules": [],
"mostAffectedCsvRows": [
2754,
1751,
2330,
2531,
919,
1146,
1089,
98,
341,
315
]
}
],
"warnings": [
"Removed 13 rows due to outlier filter on Salary",
"0 Duplicate columns detected and removed based on L1-Norm < 0.1: 0",
{
"Removed Rules from priority filtering": [
"If Experience is low AND If RANDOM is verylow then Salary is verylow",
"If GPA is medium AND If HiringManager is G then Salary is low",
"If RANDOM is low AND If HiringManager is E then Salary is low",...]
},
"Lasso did not converge after 10000 iterations - maximum difference: 0.0003146107681146759",
[
{
"Removed from pValue-computation due to low coefficient (<1e-5)": [
"If JobPosition is management then Salary is medium",
"If Gender is male AND If JobPosition is management then Salary is high",
"If Experience is medium AND If UniversityReputation is veryhigh then Salary is medium",...]
}
]
]
}
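As a usage sketch, such a response can be scanned for rules whose antecedents mention a protected attribute and whose coefficient is non-negligible. This heuristic is illustrative only and not part of the tool:

```typescript
interface SortedRule {
  title: string;
  coefficient: number;
  pValue: number | null;
}

// Flag rules that mention a protected attribute in an antecedent and carry a
// non-negligible coefficient - a simple screening heuristic, not a verdict.
function flagSuspectRules(rules: SortedRule[], protectedAttrs: string[], minAbsCoef = 0.05): SortedRule[] {
  return rules.filter(r =>
    Math.abs(r.coefficient) >= minAbsCoef &&
    protectedAttrs.some(a => r.title.includes(`If ${a} is `))
  );
}

// Two rules taken from the example response above:
const rules: SortedRule[] = [
  { title: "If HiringManager is B AND If Gender is other then Salary is high", coefficient: 0.156, pValue: 0 },
  { title: "If Experience is low then Salary is verylow", coefficient: -0.115, pValue: 0.018 },
];
const suspects = flagSuspectRules(rules, ["Gender"]);
```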
This project is licensed under the GNU GPL v3.
For any questions or suggestions, please contact me at [email protected].
assets/icon.png was generated with DALL-E.
GitHub Copilot was used to speed up coding tedious/boilerplate parts of the project.