German Credit Risk - Bias#
This notebook computes the gender bias of models developed on the German Credit Risk dataset.
Source: https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
from sklearn.datasets import fetch_openml
from fairscoring.metrics import bias_metric_pe, bias_metric_eo, bias_metric_cal, \
WassersteinMetric, CalibrationMetric
from fairscoring.metrics.roc import bias_metric_roc, bias_metric_xroc
from tqdm.notebook import tqdm
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)
Load and pre-process data#
Load German Credit Risk data from OpenML#
openML_ID = 46116
data = fetch_openml(data_id=openML_ID)
features = data.data.copy()
target = data.target
Preprocessing#
# Drop index Column
# features.drop("Unnamed:_0", axis=1, inplace=True)
# Fill n/a
features['Saving accounts'] = features['Saving accounts'].astype(object).fillna('no_inf')
features['Checking account'] = features['Checking account'].astype(object).fillna('no_inf')
# Small beautification
features['Purpose'] = features['Purpose'].replace("'domestic appliances'", "domestic appliances")
num_columns = ['Credit amount', 'Duration']
cat_columns = ['Job', 'Housing', 'Saving accounts', 'Checking account', 'Purpose', 'Sex']
Encoding#
ordinal_enc = OrdinalEncoder().fit(features[cat_columns])
features[cat_columns]=ordinal_enc.transform(features[cat_columns])
features[cat_columns]=features[cat_columns].astype(int)
categorical = pd.get_dummies(features[cat_columns].astype(str), drop_first=True)
numerical = MinMaxScaler().fit_transform(features[num_columns])
target_encoder = LabelEncoder()
target= target_encoder.fit_transform(target)
Training#
Train-Test Split#
log_reg_data=pd.concat([pd.DataFrame(categorical), pd.DataFrame(numerical)], axis=1)
log_reg_data=log_reg_data.rename(columns = {0:'Credit amount', 1:'Duration'})
X_train, X_test, y_train, y_test = train_test_split(
log_reg_data.astype(float), target.astype(int), test_size=0.33, random_state=42)
Train LogReg Model#
Cross-Validation to check for stability#
shuffle = KFold(n_splits=5, shuffle=True, random_state=2579)
logreg = LogisticRegression(max_iter=1000)
ROC_Values=cross_val_score(logreg, X_train , y_train, cv=shuffle, scoring="roc_auc")
print('\nROC AUC values for 5-fold Cross Validation:\n',ROC_Values)
print('\nStandard Deviation of ROC AUC of the models:', round(ROC_Values.std(),3))
print('\nFinal Average ROC AUC of the model:', round(ROC_Values.mean(),3))
ROC AUC values for 5-fold Cross Validation:
[0.62074468 0.76400111 0.78967544 0.75111461 0.71436404]
Standard Deviation of ROC AUC of the models: 0.059
Final Average ROC AUC of the model: 0.728
Final Model#
logreg = sm.Logit(y_train, X_train).fit()
# performing predictions on the test datdaset
y_pred = logreg.predict(X_test)
y_pred_train = logreg.predict(X_train)
prediction_test = list(map(round, y_pred))
prediction_train = list(map(round, y_pred_train))
Optimization terminated successfully.
Current function value: 0.506663
Iterations 6
Train debiased LogReg Model#
Remove Gender Information#
X_train_wosex = X_train.drop(X_train.columns[[19,19]], axis=1)
X_test_wosex = X_test.drop(X_train.columns[[19,19]], axis=1)
Cross-Validation to check for stability#
shuffle = KFold(n_splits=5, shuffle=True, random_state=2579)
logreg_wosex = LogisticRegression(max_iter=1000)
ROC_Values=cross_val_score(logreg_wosex, X_train_wosex, y_train, cv=shuffle, scoring="roc_auc")
print('\nROC AUC values for 5-fold Cross Validation:\n',ROC_Values)
print('\nStandard Deviation of ROC AUC of the models:', round(ROC_Values.std(),3))
print('\nFinal Average ROC AUC of the model:', round(ROC_Values.mean(),3))
ROC AUC values for 5-fold Cross Validation:
[0.6087766 0.77904709 0.78507539 0.73826383 0.71299342]
Standard Deviation of ROC AUC of the models: 0.064
Final Average ROC AUC of the model: 0.725
Final Model#
logreg_wosex = sm.Logit(y_train, X_train_wosex).fit()
y_pred_wosex = logreg_wosex.predict(X_test_wosex)
y_pred_train_wosex = logreg_wosex.predict(X_train_wosex)
roc_score_logreg_wosex = roc_auc_score(y_test, y_pred_wosex)
roc_score_logreg_wosex_train = roc_auc_score(y_train, y_pred_train_wosex)
print('The ROC-AUC of the Logistic Regression is', roc_score_logreg_wosex)
print('The train-ROC-AUC of the Logistic Regression is', roc_score_logreg_wosex_train)
Optimization terminated successfully.
Current function value: 0.510914
Iterations 6
The ROC-AUC of the Logistic Regression is 0.7712395693717844
The train-ROC-AUC of the Logistic Regression is 0.765014029809344
Bias Measures#
Prepare Dataset#
attribute = data.data.loc[X_test.index,"Sex"]
groups = ['female', 'male']
favorable_target = target_encoder.transform(["good"])[0]
models = [
("LogReg", y_pred),
("LogReg (debiased)", y_pred_wosex),
]
List of bias metrics#
metrics = [
bias_metric_eo, # Standardized Equal Opportunity
bias_metric_pe, # Standardized Predictive Equality
bias_metric_cal, # Standardized Calibration Equality
bias_metric_roc, # ROC-Bias
bias_metric_xroc, # xROC-Bias
WassersteinMetric(fairness_type="EO",name="Equal Opportunity (U)", score_transform="rescale"),
WassersteinMetric(fairness_type="PE",name="Predictive Equality (U)", score_transform="rescale"),
CalibrationMetric(weighting="scores",name="Calibration (U)", score_transform="rescale"),
]
Compute Bias Metrics#
Compute all bias metrics for the dataset
results = []
for metric in tqdm(metrics):
for model, scores in models:
# Compute bias
bias = metric.bias(
scores, y_test, attribute,
groups=groups,
favorable_target=favorable_target,
min_score=0, max_score=1,
n_permute=1000, seed=2579)
# Store result
results.append((metric, model, bias))
C:\dev\fair-scoring-public\src\fairscoring\metrics\calibration.py:81: RuntimeWarning: invalid value encountered in divide
fraction_of_positives = np.where(nonzero, bin_true / bin_total, np.nan)
C:\dev\fair-scoring-public\src\fairscoring\metrics\calibration.py:82: RuntimeWarning: invalid value encountered in divide
mean_predicted_value = np.where(nonzero, bin_sums / bin_total, np.nan)
C:\dev\fair-scoring-public\src\fairscoring\metrics\calibration.py:81: RuntimeWarning: invalid value encountered in divide
fraction_of_positives = np.where(nonzero, bin_true / bin_total, np.nan)
C:\dev\fair-scoring-public\src\fairscoring\metrics\calibration.py:82: RuntimeWarning: invalid value encountered in divide
mean_predicted_value = np.where(nonzero, bin_sums / bin_total, np.nan)
C:\dev\fair-scoring-public\src\fairscoring\metrics\calibration.py:81: RuntimeWarning: invalid value encountered in divide
fraction_of_positives = np.where(nonzero, bin_true / bin_total, np.nan)
C:\dev\fair-scoring-public\src\fairscoring\metrics\calibration.py:82: RuntimeWarning: invalid value encountered in divide
mean_predicted_value = np.where(nonzero, bin_sums / bin_total, np.nan)
Result Table I#
Models vertically arranged This corresponds to table C2 in the publication.
results = [[
metric.name,
model,
f"{bias.bias:.3f}",
f"{bias.pos_component:.0%}",
f"{bias.neg_component:.0%}",
f"{bias.p_value:.2f}" ] for metric, model, bias in results
]
df = pd.DataFrame(results, columns=["metric", "model", "total", "pos", "neg", "p-value"])
df.set_index(["metric", "model"], inplace=True)
df
| total | pos | neg | p-value | ||
|---|---|---|---|---|---|
| metric | model | ||||
| Equal Opportunity | LogReg | 0.083 | 1% | 99% | 0.04 |
| LogReg (debiased) | 0.048 | 93% | 7% | 0.32 | |
| Predictive Equality | LogReg | 0.092 | 0% | 100% | 0.09 |
| LogReg (debiased) | 0.025 | 62% | 38% | 0.99 | |
| Calibration | LogReg | 0.291 | 46% | 54% | 0.35 |
| LogReg (debiased) | 0.299 | 58% | 42% | 0.26 | |
| ROC bias | LogReg | 0.044 | 98% | 2% | 0.80 |
| LogReg (debiased) | 0.050 | 98% | 2% | 0.69 | |
| xROC bias | LogReg | 0.133 | 0% | 100% | 0.02 |
| LogReg (debiased) | 0.057 | 93% | 7% | 0.54 | |
| Equal Opportunity (U) | LogReg | 0.041 | 3% | 97% | 0.13 |
| LogReg (debiased) | 0.036 | 97% | 3% | 0.23 | |
| Predictive Equality (U) | LogReg | 0.078 | 1% | 99% | 0.10 |
| LogReg (debiased) | 0.024 | 74% | 26% | 0.98 | |
| Calibration (U) | LogReg | 0.246 | 40% | 60% | 0.57 |
| LogReg (debiased) | 0.225 | 75% | 25% | 0.84 |
Result Table II#
Models horizontally arranged This corresponds to table 2 in the publication.
model_names = [name for name, _ in models]
blocks = [df[df.index.get_level_values(1) == name] for name in model_names]
for i in range(len(blocks)):
blocks[i].set_index(blocks[i].index.droplevel("model"))
blocks[i] = blocks[i].reset_index()
blocks[i].drop("model", axis=1, inplace=True)
if i == 0:
metric_col = blocks[i]["metric"]
blocks[i].drop("metric", axis=1, inplace=True)
df2 = pd.concat([metric_col] + blocks, axis=1, keys=[""]+model_names)
df2.set_index(df2.columns[0],inplace=True)
df2.index.names = ["Metric"]
df2
| LogReg | LogReg (debiased) | |||||||
|---|---|---|---|---|---|---|---|---|
| total | pos | neg | p-value | total | pos | neg | p-value | |
| Metric | ||||||||
| Equal Opportunity | 0.083 | 1% | 99% | 0.04 | 0.048 | 93% | 7% | 0.32 |
| Predictive Equality | 0.092 | 0% | 100% | 0.09 | 0.025 | 62% | 38% | 0.99 |
| Calibration | 0.291 | 46% | 54% | 0.35 | 0.299 | 58% | 42% | 0.26 |
| ROC bias | 0.044 | 98% | 2% | 0.80 | 0.050 | 98% | 2% | 0.69 |
| xROC bias | 0.133 | 0% | 100% | 0.02 | 0.057 | 93% | 7% | 0.54 |
| Equal Opportunity (U) | 0.041 | 3% | 97% | 0.13 | 0.036 | 97% | 3% | 0.23 |
| Predictive Equality (U) | 0.078 | 1% | 99% | 0.10 | 0.024 | 74% | 26% | 0.98 |
| Calibration (U) | 0.246 | 40% | 60% | 0.57 | 0.225 | 75% | 25% | 0.84 |