The current accuracy-fairness trade-off used to evaluate LDA credit models is likely insufficient to meet regulatory requirements. Other criteria - such as conceptual soundness and explainability - are also critically important.
As I noted in my previous blog post, we find ourselves amidst a fierce "score war" in which well-resourced fintechs are challenging the reign of traditional credit scores by deploying AI technologies and alternative data to produce a new breed of credit scores purported to be more accurate, inclusive, and fair. While my previous remarks focused on some "under the radar" risks to consider when evaluating these new scoring models, in this post I take a deeper dive into the important interdependency of model explainability and fairness that arises when selecting a final model that meets both fair lending and safety-and-soundness regulatory requirements.[1]
But, first, let's start with some important background. In 2001, the statistician Leo Breiman used the phrase "Rashômon effect" to describe the tendency of AI/ML algorithms to produce multiple model training solutions that are equally (or nearly equally) predictive on a given dataset, but with potentially very different "explanations" - i.e., different sets of predictors and/or different predictor weights.[2] He chose the term "Rashômon" because the phenomenon is conceptually similar to the theme of the 1950 Akira Kurosawa film Rashômon, in which four witnesses provide different accounts of the facts and events surrounding a brutal attack and death. Applied to the AI/ML context, each of the multiple training solutions - according to Breiman - is a different Rashômon "witness" offering a different, but equally accurate, explanation of the "events" (i.e., equal predictive accuracy on the target variable in a given dataset).
More recently, the Rashômon effect has been explored further by leading AI/ML researchers with the following developments:
Expanding and formalizing the concept into a "Rashômon set"[3] - that is, formally specifying the collection of models that achieve effectively the same predictive accuracy on a given dataset. The researchers then use Rashômon sets to facilitate the search for more interpretable model alternatives (with the same predictive accuracy). Importantly, the authors note that a true Rashômon set can span multiple AI/ML architectures (i.e., members of the set are not limited to different versions of the same AI/ML architecture selected for training). A minimal code sketch of this idea appears just after this list.
Linking the under-specification of an AI/ML model architecture to the existence of a Rashômon set, and showing how members of a Rashômon set can exhibit very different performance levels outside of the model development sample (e.g., very different stress performance levels).[4] In contrast to [3] above, here the authors work with Rashômon sets drawn from the same AI/ML architecture as the original trained model, with each variation produced by a subtle perturbation of the model's hyperparameters (e.g., a different random initialization of weights).
Using Rashômon sets in the search for "less discriminatory alternatives" in order to "debias" credit scoring models of disparate impact.[5]
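To make the first of these developments concrete, here is a minimal sketch of what enumerating an approximate Rashômon set might look like in practice: train several candidate models (spanning different architectures) on the same data and retain every candidate whose holdout accuracy falls within a small tolerance of the best performer. The synthetic data, the particular candidate models, and the 1% tolerance are illustrative assumptions on my part, not prescriptions from the papers cited above.

```python
# Illustrative sketch only: enumerate an approximate Rashomon set by keeping every
# candidate model whose holdout accuracy is within a small tolerance of the best.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}
accuracy = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in candidates.items()}

best = max(accuracy.values())
tolerance = 0.01  # "effectively the same" accuracy: within 1 point of the best
rashomon_set = [name for name, acc in accuracy.items() if best - acc <= tolerance]
print(accuracy, rashomon_set)
```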
How Rashômon Sets Facilitate the Debiasing of Credit Scoring Models
For consumer lending, "disparate impact" has historically referred to a facially-neutral lending policy or procedure (e.g., a minimum loan amount policy for credit approval) that has a disproportionate adverse impact on one or more protected class groups (e.g., racial/ethnic minorities, females, or the elderly). However, for disparate impact to constitute an illegal form of lending discrimination, the triggering lending policy or procedure must either: (1) fail the "business necessity" test, or (2) be shown to have a reasonable "less discriminatory alternative" ("LDA") policy or procedure.[6] Accordingly, the existence of disparate lending outcomes is a necessary, but not sufficient, condition for a fair lending violation.
With the advent of more automated consumer loan underwriting and pricing processes driven by AI/ML technologies and alternative data, disparate impact discrimination theory has undergone a fairly significant evolution over the last few years. Now:
Credit scoring algorithms can be considered the triggering facially-neutral "policy or procedure" when unequal lending outcomes are observed between protected and non-protected class groups; and
The "business necessity" defense is typically focused on statistical evidence of the algorithm's predictive accuracy relative to credit performance.[7]
However, even with a statistically-sound, facially-neutral algorithm, lenders could still fail the LDA test if another version of the algorithm is shown to produce acceptable predictive accuracy but with a lower degree of disparate impact. This latter condition - which is all the more relevant given the existence of the Rashômon sets described previously - has significant implications for traditional credit scoring model development workflows and traditional fair lending testing programs.
In light of this new LDA-centric, disparate impact assessment of algorithmic credit models, a growing set of algorithmic "debiasing" techniques has emerged that leverages the existence of the Rashômon set to identify "less discriminatory alternatives" - that is, specific algorithmic alternatives with predictive accuracy similar to that of the original trained model, but with less disparate impact as measured, for example, by improved fairness metrics.[8] In the more intuitive version of these debiasing techniques, the AI/ML algorithm's training process is modified so that the optimization covers two objectives: (1) predictive accuracy and (2) fairness - with a hyperparameter determining the relative importance of each objective. The developer can then vary this hyperparameter across successive training runs to identify different members of the Rashômon set.[9] That is, by including a second dimension of model performance (i.e., fairness), the initially homogeneous Rashômon set becomes differentiated - thereby providing a means to select a specific member according to both objectives.
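As a concrete illustration of this bi-objective training idea - a simplification of my own, not any particular vendor's or paper's actual debiasing method - the sketch below fits a simple logistic scoring model by gradient descent on a loss that combines the usual log-loss with a squared demographic-parity penalty. The hyperparameter lam sets the relative weight on fairness, and sweeping it across training runs produces the differentiated alternatives described above. The toy data, the choice of penalty, and all parameter values are illustrative assumptions.

```python
# Illustrative sketch of bi-objective (accuracy + fairness) training. The fairness
# term is a squared demographic-parity gap (difference in mean predicted approval
# between groups); lam controls the relative weight of the two objectives.
import numpy as np

def train_debiased_logit(X, y, group, lam, lr=0.1, epochs=2000):
    """X: features; y: 1 = good outcome; group: 1 = protected class; lam: fairness weight."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))            # predicted probability
        grad_acc = X.T @ (p - y) / n                 # gradient of the log-loss
        gap = p[group == 1].mean() - p[group == 0].mean()
        s = p * (1.0 - p)                            # derivative of the sigmoid
        d_gap = (X[group == 1] * s[group == 1][:, None]).mean(axis=0) \
              - (X[group == 0] * s[group == 0][:, None]).mean(axis=0)
        grad_fair = 2.0 * gap * d_gap                # gradient of the squared parity gap
        w -= lr * (grad_acc + lam * grad_fair)
    return w

# Sweep the fairness weight to trace out candidate alternatives (toy synthetic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
group = (rng.random(2000) < 0.3).astype(int)
y = (X[:, 0] + 0.5 * group + rng.normal(size=2000) > 0).astype(int)
weights_by_lam = {lam: train_debiased_logit(X, y, group, lam) for lam in [0.0, 1.0, 5.0]}
```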
The general debiasing process is illustrated in Chart 1 below, where the two training objectives are represented on the two axes and each circle represents an alternative algorithm generated by the debiasing process. The original trained model is denoted in blue and was trained without regard to fairness - thereby resulting in maximum predictive accuracy. I refer to the alternatives within the green box as the "traditional" Rashômon set, as they all share nearly the same predictive accuracy level. The remaining alternatives involve meaningfully lower predictive accuracy and, technically, may lie outside the traditional Rashômon set, but they are still relevant alternatives to the extent that the lender considers the reduction in predictive accuracy to be acceptable (taking into account safety-and-soundness considerations). The outer edge or "frontier" of the alternatives (in colors other than gray) is referred to as the Pareto Frontier and represents the specific set of LDA alternatives for the lender's consideration. This is because every alternative in gray is outperformed along one or both objectives by an alternative on the Pareto Frontier (i.e., there is an alternative on the Pareto Frontier that is more accurate, fairer, or both).
The green alternative is identified as the lender's likely selection because: (1) it lies on the Pareto Frontier, (2) it offers significantly improved model fairness relative to the original trained model (blue), and (3) the improvement in fairness comes with only a slight reduction in overall model accuracy. Importantly, however, there is no objectively optimal point on the Pareto Frontier. For example, as shown in Chart 1, the "fairest" model involves a significant reduction in accuracy, while the most accurate model has the highest level of disparate lending outcomes. Neither of these alternatives may represent a practical choice for the lender. Rather, the specific alternative selected by the lender depends on its relative assessment of the risks associated with model accuracy (safety-and-soundness) and fairness (consumer compliance) - which, of course, is a judgment call that requires appropriate governance and oversight.
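For readers who want to see the frontier logic in code, here is a minimal sketch of the dominance screen described above: given each alternative's accuracy and fairness scores (both oriented so that higher is better), keep only the alternatives that no other alternative beats on both objectives. The candidate names and scores are purely illustrative, and choosing among the surviving frontier points remains the judgment call just discussed.

```python
# Illustrative sketch: filter candidate models down to the Pareto Frontier.
# Inputs are (name, accuracy, fairness) tuples; higher is better on both axes.
def pareto_frontier(alternatives):
    frontier = []
    for name, acc, fair in alternatives:
        dominated = any(
            a2 >= acc and f2 >= fair and (a2 > acc or f2 > fair)
            for _, a2, f2 in alternatives
        )
        if not dominated:
            frontier.append((name, acc, fair))
    return frontier

candidates = [
    ("original", 0.820, 0.70),  # most accurate, least fair (the "blue" model)
    ("alt_A",    0.815, 0.85),  # small accuracy give-up, much fairer
    ("alt_B",    0.760, 0.93),  # fairest, but a notable accuracy loss
    ("alt_C",    0.780, 0.80),  # beaten by alt_A on both objectives
]
print(pareto_frontier(candidates))  # original, alt_A, and alt_B remain; alt_C drops out
```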
The Interdependence of AI Credit Model Fairness and Explainability
Historically, model explainability and fairness - while both recognized as important properties of AI credit scoring models - have largely been explored along separate research tracks, although variable-importance measures derived from explainability techniques are sometimes used within the debiasing process to expedite or prioritize the search for fairer alternatives. From a more holistic perspective, however, debiasing techniques tend not to assess whether the alternative, fairer algorithm is "conceptually sound" in light of typical consumer credit default behaviors, applicable economic theory, or business expert input.[10] That is, while all members of a Rashômon set present alternative "witness" explanations of the same events, not all of these explanations will be considered plausible, as some may defy accepted causal knowledge.
Additionally, it is important to note that, during the debiasing process, alternative algorithms are created by: (1) dropping certain predictors from the original trained model, (2) adding new predictors, (3) changing the weights on existing predictors, and/or (4) changing the AI/ML algorithm's hyperparameters in order to improve the fairness metric.[11] Accordingly, some alternatives may contain new predictors, exclude original predictors, or change the directional relationships of existing predictors in non-intuitive ways in order to improve the fairness metric - thereby compromising the conceptual soundness of the debiased model and elevating the lender's safety-and-soundness risk.
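One simple way to operationalize this concern is a directional screen applied to any debiased alternative before adoption, as sketched below: estimate each predictor's direction of effect on the model's predicted default probability and flag any predictor whose estimated direction contradicts credit-domain expectations. The feature names, expected signs, and crude finite-difference approach are illustrative assumptions of mine, and a screen like this addresses only one facet of conceptual soundness.

```python
# Illustrative sketch of a directional (conceptual-soundness) screen for a debiased
# model. Expected signs reflect hypothetical credit-domain knowledge.
import numpy as np

EXPECTED_SIGN = {
    "utilization_rate": +1,   # higher revolving utilization -> higher default risk
    "months_on_file":   -1,   # longer credit history -> lower default risk
    "recent_inquiries": +1,   # more recent inquiries -> higher default risk
}

def direction_flags(model, X, feature_names, delta=1.0):
    """Approximate each feature's direction of effect on the predicted default
    probability via a crude finite difference, and flag sign disagreements."""
    base = model.predict_proba(X)[:, 1].mean()
    flags = {}
    for j, name in enumerate(feature_names):
        X_shift = X.copy()
        X_shift[:, j] += delta                      # nudge one feature upward
        observed = np.sign(model.predict_proba(X_shift)[:, 1].mean() - base)
        expected = EXPECTED_SIGN.get(name)
        if expected is not None and observed not in (0, expected):
            flags[name] = {"expected": expected, "observed": int(observed)}
    return flags  # a non-empty result warrants conceptual-soundness review
```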
What this means is that the current accuracy-fairness trade-off depicted in Chart 1 is likely insufficient for an AI credit model to be fully regulatory-compliant. In addition to satisfying (1) general safety-and-soundness requirements associated with overall predictive accuracy and (2) fair lending compliance requirements, the AI credit model should also (3) be of a form that is considered conceptually sound under regulatory model risk management guidance, and (4) produce sensible local explanations that support adverse action compliance requirements. Solving this multi-dimensional optimization problem will certainly be a challenge, and it may involve additional trade-offs between fairness and explainability that dampen the fairness improvement achievable under traditional debiasing techniques. Additionally, potential solutions may involve a greater reliance on inherently interpretable machine learning architectures to add algorithmic structure that supports model conceptual soundness.[12]
Wherever the ultimate solution may lie, we undoubtedly need further research to get there - but rather than proceeding along separate tracks, the fairness and explainability research paths now need to converge. Whatever the outcome, it is clear that the traditional AI credit model development workflow needs to change to incorporate the collateral processes necessary to produce a credit scoring algorithm that is fully compliant with applicable regulatory requirements.
* * *
[1] For the purposes of this blog post, the terms "explainability" and "interpretability" are used interchangeably to refer to the ability to identify and describe the estimated relationships between model inputs and outputs as coded within the AI/ML algorithm. I note, however, that these terms have separate technical meanings within the realm of AI/ML research.
[2] See Breiman, Leo. "Statistical Modeling: The Two Cultures". Statistical Science, Vol. 16, No. 3 (Aug., 2001), pp. 199-215.
[3] See Fisher, Aaron, Cynthia Rudin, and Francesca Dominici. "All Models are Wrong, but Many are Useful: Learning a Variable’s Importance by Studying an Entire Class of Prediction Models Simultaneously". Journal of Machine Learning Research, 20 (2019) 1-81.
[4] See D’Amour, A., et al., “Underspecification Presents Challenges for Credibility in Modern Machine Learning”, arXiv pre-print arXiv:2011.03395, 2020.
[5] See Gill, Navdeep, Patrick Hall, Kim Montgomery, and Nicholas Schmidt. "A Responsible Machine Learning Workflow with Focus on Interpretable Models, Post-hoc Explanation, and Discrimination Testing". Information, Vol. 11, No. 3 (2020), Article 137. doi:10.3390/info11030137.
[6] For one of many legal references, see "Proving Discrimination - Disparate Impact" at the U.S. Department of Justice Civil Rights Division. In general, these two defenses require a showing that the challenged policy or procedure is demonstrably related to a significant, legitimate business goal and that there is no alternative policy or procedure that is comparably effective but with less disparate impact.
[7] There is an unsettled question with federal regulators and enforcement agencies as to whether statistical soundness / predictive accuracy at the overall model-level is sufficient to prove business necessity, or whether each individual predictive variable must also satisfy business necessity requirements.
[8] While a multitude of general fairness metrics has been proposed by various researchers and practitioners, the most common metrics used in consumer lending are the adverse impact ratio (for credit decisions) and the standardized mean difference (for loan interest rates). I note, however, that these metrics do not adjust for underlying differences in credit quality across protected and non-protected class consumers.
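For concreteness, a minimal sketch of how these two metrics are typically computed from raw outcomes (with no credit-quality adjustment) is shown below; the array names are illustrative.

```python
# Illustrative sketch of the two common fair lending metrics referenced above.
import numpy as np

def adverse_impact_ratio(approved, protected):
    """AIR: approval rate of the protected-class group divided by that of the control group."""
    approved, protected = np.asarray(approved), np.asarray(protected)
    return approved[protected == 1].mean() / approved[protected == 0].mean()

def standardized_mean_difference(rate, protected):
    """SMD: difference in mean interest rate between groups, scaled by the overall std. dev."""
    rate, protected = np.asarray(rate), np.asarray(protected)
    return (rate[protected == 1].mean() - rate[protected == 0].mean()) / rate.std(ddof=1)
```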
[9] This assumes that the Rashômon set is limited to alternative algorithms within the same AI/ML architecture as the original model (e.g., Random Forest algorithms). However, technically, the Rashômon set can include alternative algorithms associated with AI/ML architectures different from that of the original model (e.g., the original model is a Random Forest algorithm, but alternative members of the Rashômon set come from neural networks and boosted decision trees)[2]. This distinction has a strong practical impact on the search for less discriminatory alternatives, with the latter imposing significantly higher burdens on a lender relative to the former. Currently, there is no regulatory guidance on how broad a lender's search for an LDA must be.
[10] See "Evaluation of Conceptual Soundness" in Comptroller's Handbook: Model Risk Management, Office of the Comptroller of the Currency, Version 1.0, August 2021.
[11] Under the Rashômon set definition contained in [3] above, this could also include changing the AI/ML architecture.
[12] See, for example, Sudjianto, Agus and Aijun Zhang. "Designing Inherently Interpretable Machine Learning Models". arXiv preprint arXiv:2111.01743, 2021; and Rudin, Cynthia. "Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead". arXiv preprint arXiv:1811.10154v3, 2019.
© Pace Analytics Consulting LLC, 2023.