Banks have used credit scoring models for decades and have managed the fair lending risks of these models through long-standing compliance programs informed by regulatory guidance[1] and frequent supervision. However, over the last several years, a growing chorus of voices has expressed grave concerns about the inherent fairness of algorithmic-based credit decisions - going so far as to call the use of such algorithms "robo-discrimination."
Since credit scores and automated underwriting models have been used responsibly for decades, it is legitimate to ask what is different about today's environment that is driving this increased attention and concern. Surveying the current industry and regulatory landscape, a picture begins to emerge - one based on the convergence of the following five forces, which together are reshaping the popular narrative around the fairness of algorithmic-based lending:
The increasing replacement of traditional credit model methodologies with much more advanced, but much less transparent, AI/ML algorithms.
The growing availability and use of "alternative data" - that is, broad-based financial and non-financial consumer-level data aggregated by third parties and used to expand the set of potential predictive signals for consumer credit modeling - particularly by fintechs.
A new, more aggressive federal and state consumer protection environment focused on fairer access to credit - supplemented by a cadre of non-profits, academics, and technology company researchers proffering studies of how biased data, biased algorithms, biased developers, and/or biased model outcomes impede such access and, therefore, require algorithmic-based credit models to be "de-biased" through new analytical approaches (e.g., "using AI to de-bias AI").
The rise of fintechs and the corresponding "score wars" in which the new AI-powered start-ups battle for consumer loan origination market share through a public relations strategy emphasizing the superior accuracy and fairness of their credit models relative to traditional methods of credit risk evaluation.
The emergence of third-party AI startups providing consumer lenders with "de-biasing" services to reduce the disparate impact risk of their credit models.
So, what is a Compliance Officer to do in this new environment? Are their existing fair lending compliance programs no longer effective in today's new AI/ML-based world? And, if not, how should their compliance programs evolve?
While the marketplace is filled with academic and commercial messaging regarding credit model bias and potential compliance solutions, there remains a degree of unease within the consumer banking community about adopting and deploying these solutions (and the corresponding algorithms) without knowing the views of their primary financial regulator. To date, these regulators have been relatively silent - apart from information collection activities, reiterations of existing regulatory requirements and guidance, and statements of general concern around the potential risks of such algorithms.[2]
With the regulators trailing behind the marketplace, there is a significant degree of uncertainty clouding the regulatory landscape - thereby hindering many banks' innovation around, and implementation of, AI-based credit scoring algorithms that may lower losses, expand financial access, and reduce lending disparities. In what follows, I highlight the six most critical open questions related to AI bias for which consumer lenders need regulatory guidance in order to promote further financial innovation.
1. How Should AI Credit Model Bias Be Measured?
Most credit models are designed to predict the relative likelihood that an applicant becomes seriously delinquent, or defaults, within a specific period of time after loan origination - for example, within 12 or 24 months. This relative likelihood is expressed as a numeric score - calibrated to a specific range, for example 300-850 - with higher scores representing higher applicant credit quality, and vice versa. Lenders typically apply a threshold - or score "cut-off" - based on their credit policy, to determine the subset of applications that are approvable. Of course, there are other relevant credit policies that also impact the final credit decision - such as ability to pay and, where relevant, collateral considerations - so having a passing score does not guarantee credit approval. Nevertheless, it is an important first step in the overall approval process.
With this bigger picture of the credit decisioning process in mind, an important question arises: how should one evaluate credit model bias?
One currently popular bias measure is called the Adverse Impact Ratio ("AIR") which measures the relative "approval" rates of protected and non-protected class applicants - assuming a specific score cut-off. Essentially, the AIR simply reflects whether there is equality of outcomes between groups - that is, are protected class groups approved at the same rate as non-protected class groups. If they have a lower approval rate, then the AIR will be less than one, and vice versa. Some proponents of the AIR bias metric suggest that an AIR value less than 0.80 represents problematic model bias (disparate impact), and reference the EEOC's "four-fifths rule" as support for this threshold.[3]
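To make the AIR calculation concrete, below is a minimal sketch in Python. It assumes a simple applicant-level dataset containing a model score and a (proxied) demographic group label; the column names, group codes, and the 620 score cut-off are illustrative assumptions rather than features of any particular lender's credit policy or any regulatory standard.

```python
import pandas as pd

def adverse_impact_ratio(df, score_col, group_col, protected, reference, cutoff):
    """AIR = protected-class 'approval' rate divided by the reference-group
    'approval' rate, where approval is defined solely as scoring at or above
    the credit-policy score cut-off."""
    approved = df[score_col] >= cutoff
    protected_rate = approved[df[group_col] == protected].mean()
    reference_rate = approved[df[group_col] == reference].mean()
    return protected_rate / reference_rate

# Illustrative applicant data: scores on a 300-850 scale; the cut-off is hypothetical.
apps = pd.DataFrame({
    "score": [640, 700, 580, 615, 720, 655, 590, 610],
    "group": ["P", "R", "P", "P", "R", "R", "P", "R"],
})
air = adverse_impact_ratio(apps, "score", "group", protected="P", reference="R", cutoff=620)
print(f"AIR = {air:.2f}")  # values below 0.80 are flagged by some practitioners
```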
While the AIR may meaningfully serve as a general disparate impact risk indicator, it is an odd measure of credit model bias as it only considers the credit model's predictions - not the model's accuracy. For example, if the model predicts a higher expected default rate for a protected class group (e.g., 5%) relative to a non-protected class group (e.g., 3.5%), and these relative predictions result in: (1) lower approval rates for the protected class group and, therefore, (2) an AIR that falls below the 0.80 threshold, why isn't the accuracy of these default predictions relevant to the disparate impact risk evaluation? More generally, if I evaluate the credit model's predictive accuracy across all demographic groups, and find no evidence of prediction bias for any demographic group, shouldn't this fact mitigate the fair lending concerns raised by the AIR disparity by providing a legitimate business justification?
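One way to operationalize this notion of group-level prediction accuracy is a simple calibration comparison of predicted versus observed default rates within each (proxied) demographic group. The sketch below is illustrative only - the synthetic data, assumed default rates, and column names are hypothetical - and a predicted-to-observed ratio near 1.0 for every group is just one possible working definition of "no prediction bias."

```python
import numpy as np
import pandas as pd

def group_calibration(df, group_col, pd_col, outcome_col):
    """Compare mean predicted default rate to observed default rate by group.
    A predicted-to-observed ratio near 1.0 suggests no systematic over- or
    under-prediction for that group."""
    out = df.groupby(group_col).agg(
        predicted=(pd_col, "mean"),
        observed=(outcome_col, "mean"),
        n=(outcome_col, "size"),
    )
    out["predicted_to_observed"] = out["predicted"] / out["observed"]
    return out

# Synthetic illustration: a near-accurate model scored on two (proxied) groups.
rng = np.random.default_rng(0)
n = 10_000
group = rng.choice(["P", "R"], size=n)
true_pd = np.where(group == "P", 0.050, 0.035)        # assumed true default rates
pred_pd = true_pd * rng.normal(1.0, 0.02, size=n)     # predictions close to the truth
default = rng.binomial(1, true_pd)

scored = pd.DataFrame({"group": group, "pred_pd": pred_pd, "default": default})
print(group_calibration(scored, "group", "pred_pd", "default"))
```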
Not necessarily, according to the current industry narrative - and this is where regulatory guidance is sorely needed.
Some proponents of AI de-biasing would argue that "business justification" presumes the absence of less discriminatory alternatives ("LDAs") and, therefore, a search for such alternatives should be performed when the AIR falls below the 0.80 threshold - regardless of model accuracy. By adopting this approach, proponents are essentially arguing that model predictive accuracy - at both the aggregate and the demographically-specific levels - is an insufficient defense to a disparate impact allegation. In particular, their position is that lenders should consider alternative credit model configurations that reduce the model's predicted outcome disparities (and, thereby, increase the model's AIR) - even if such alternative models are less predictively accurate - although no guidance is provided as to where on the predictive accuracy-disparate impact trade-off curve a lender is expected to end up.
To be fair, this does not mean proponents of AI de-biasing are advocating for less accurate credit models; however, they do suggest that: (1) a required search for less discriminatory alternatives is consistent with their interpretation of applicable anti-discrimination laws, and (2) lenders should identify and consider less accurate models as part of their fair lending compliance program. While this advocacy for expanded financial access is admirable, I note that such an approach carries the potential for unintended consequences. In particular, adopting a less discriminatory alternative credit model with less accuracy (but improved fairness) opens the lender up to potential regulatory challenges associated with safety-and-soundness as well as UDAAP. As I have discussed elsewhere, the modifications to a model's internal structure necessary to produce a less discriminatory alternative may inadvertently compromise the model's conceptual soundness, robustness, and/or stability in a manner that is inconsistent with sound model risk management. With respect to UDAAP, if the less discriminatory alternative model intentionally under-predicts future default rates for protected class customer groups in order to increase approval rates, and the resulting loan approvals default at higher rates than the model predicted, then the lender may be criticized for approving loans that it knows the customers will likely not repay - thereby damaging their credit reports and impairing their future access to credit.
As this discussion has shown, there is currently significant uncertainty over how credit model "bias" should be measured. On the one hand, traditional approaches would suggest measuring such bias by comparing the credit model's relative predictive accuracy for protected class and non-protected class borrower groups, while - alternatively - proponents of the newer AI de-biasing approaches favor a comparison of model outcomes (e.g., credit approvals) between protected and non-protected class groups. The difference in these two approaches is significant and the implications for fair lending risk levels, fair lending compliance programs, and potential regulatory challenge for unintended consequences (e.g., lending safety-and-soundness as well as potential UDAAP claims) makes this an area needing formal regulatory guidance.
2. Is Model-Level Disparate Impact Testing Sufficient or Must Variable-Level Testing Be Performed as Well?
Historically, traditional credit risk models used a relatively manageable number of predictive variables (e.g., < 30) that made variable-level disparate impact assessments relatively straightforward and efficient. Additionally, prior to the widespread adoption of demographic proxies, model-level disparate impact testing was seldom performed due to the inability to compare actual and predicted model outcomes by race/ethnicity or gender.
However, with today's heavy reliance on demographic proxies such as BISG for non-mortgage fair lending testing, model-level disparate impact testing has become both feasible and more common - although important risks and limitations are present. With such model-level testing, and given the fact that algorithmic-based credit models - particularly those that rely on alternative data - can contain hundreds, if not 1,000+, predictive variables, there is a legitimate question as to whether variable-level disparate impact testing is still needed. Arguments against such testing include the following:
Variable-level analysis is not as straightforward in AI/ML models as it is in more traditional credit risk models that have a linear model structure. This is because most AI/ML model architectures slice, dice, and re-combine the model's original data inputs into more complex, multidimensional, fine-grained predictive features. Accordingly, the original data inputs are not really the direct predictive features of the model; rather, the model's direct predictive features are created by the AI/ML algorithm through a "Frankensteinian" process by which pieces of multiple data inputs are extracted and combined together into a set of downstream predictive attributes that are deemed to have strong predictive signals for explaining observed consumer credit outcomes. Given this process, performing disparate impact testing on the original data inputs is likely not meaningful as they are not directly and independently driving the model's final predicted outcomes.
Even if we could completely and accurately extract all of these Frankensteinian features from the model, the sheer size of the feature set - given that we start from a base of 100+ or 1,000+ original data inputs - likely makes such feature-by-feature disparate impact analysis intractable and highly resource-intensive.
Logically, if the overall model-level analysis indicates an absence of disparate impact risk, it is difficult to understand the value of the additional variable-level analysis. That is, even if the variable-level analysis identified one or more potential disparate impact risks, the fact that such risks appear to be mitigated at the overall model level (i.e., after combining and netting out the influences of all the model predictive attributes) would appear to make the variable-level results moot. Alternatively, if the model-level disparate impact testing did identify potential risk, then a variable-level drill-down might be useful to help identify the root causes of the model-level effects (a simplified illustration follows this list).
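As a simplified illustration of such a drill-down - assuming, for tractability, a linear (logistic) scoring model rather than a complex AI/ML architecture - the group-level gap in mean model scores can be decomposed into per-variable contributions: each coefficient multiplied by the difference in group means of its input. The variable names, data, and coefficients below are hypothetical.

```python
import pandas as pd

def score_gap_decomposition(model_coefs, X, groups, protected, reference):
    """For a linear score s = X @ coefs, attribute the protected-vs-reference
    difference in mean scores to individual variables:
    contribution_j = coef_j * (mean_j(protected) - mean_j(reference))."""
    mean_p = X[groups == protected].mean()
    mean_r = X[groups == reference].mean()
    contrib = pd.Series(model_coefs, index=X.columns) * (mean_p - mean_r)
    return contrib.sort_values()  # most negative = largest drag on protected-class scores

# Hypothetical inputs: coefficients from a fitted model scoring the log-odds of non-default.
X = pd.DataFrame({
    "credit_score":   [700, 650, 640, 720, 610, 680],
    "dti":            [0.30, 0.42, 0.45, 0.28, 0.50, 0.35],
    "months_on_file": [120, 60, 48, 150, 36, 90],
})
groups = pd.Series(["R", "P", "P", "R", "P", "R"])
coefs = [0.012, -3.5, 0.004]  # illustrative fitted coefficients
print(score_gap_decomposition(coefs, X, groups, protected="P", reference="R"))
```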
Nevertheless, there are reasonable counterarguments for the utility of variable-level analysis:
The algorithm's creation of the Frankensteinian predictive features, their obscurity, and their intractability are all red flags indicating that developers and users really have little understanding of the true drivers of the model's predictions. Such predictive features could very well represent demographic proxies - thereby imparting a degree of disparate treatment into the model's predictions.
Even if the model's predictions are considered highly accurate - in the aggregate and for individual demographic groups - this result could be due to the impact of potential demographic proxies. For example, the model may struggle with explaining the higher observed default rate of a protected class group - leading to a systematic underprediction of such defaults. In response, the algorithm could create a complex combination of the original data inputs to close this gap - thereby inadvertently creating a demographic proxy. Indeed, the greater the number of data inputs available, the more chances the algorithm has of finding complex combinations of such inputs to create the proxy.
Under these arguments, variable-level testing is critical - even in the absence of overall model-level disparities. However, extracting and analyzing the algorithm's complex predictive features may not currently be feasible with existing tools and, therefore, may require companies to forgo the benefits of more complex algorithms in exchange for simpler algorithms that are amenable to such variable-level analyses.
Additionally, should potential demographic proxies be identified and excised from the algorithm, the result will be a model that systematically underpredicts default rates for the protected class. While such an outcome does not disadvantage protected class groups (and, in fact, improves the adverse impact ratio), companies may now be exposed to potential UDAAP claims if the approved loans to this group end up defaulting at higher rates than the model predicted.
Overall, there is currently no clear answer to this question. Regulatory guidance is sorely needed.
3. What Makes a Predictive Attribute a Potential Protected Class Proxy?
The objective of variable-level disparate impact testing is to identify predictive features that may inadvertently serve as proxies for prohibited customer demographics - such as race/ethnicity, gender, or age. Typically, such features are identified by calculating the correlations of model variables with customer demographic membership. For example, if the values of a particular predictive factor are "highly correlated" with, or highly predictive of, customer race/ethnicity, then that predictive factor may be deemed a potential demographic proxy.
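A minimal sketch of this type of screening follows, assuming a modeling dataset paired with a (proxied) binary protected-class indicator. Each candidate variable is scored on how well it alone separates the two groups - here via a univariate AUC, which is only one of several possible association measures - and the 0.60 flag threshold, variable names, and synthetic data are illustrative assumptions, not regulatory standards.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def proxy_screen(X, group_indicator, flag_threshold=0.60):
    """Rank candidate model variables by how strongly each one, on its own,
    predicts protected-class membership (AUC of 0.5 = no association).
    Variables above the (illustrative) threshold are flagged for review."""
    rows = []
    for col in X.columns:
        auc = roc_auc_score(group_indicator, X[col])
        auc = max(auc, 1 - auc)  # direction-agnostic measure of association
        rows.append({"variable": col, "group_auc": auc, "flag": auc >= flag_threshold})
    return pd.DataFrame(rows).sort_values("group_auc", ascending=False)

# Illustrative synthetic data: 1 = protected-class member (e.g., from a BISG-type proxy).
rng = np.random.default_rng(1)
n = 5_000
group = rng.binomial(1, 0.3, size=n)
X = pd.DataFrame({
    "dti":          rng.normal(0.35 + 0.05 * group, 0.10, size=n),
    "credit_score": rng.normal(700 - 30 * group, 50, size=n),
    "random_noise": rng.normal(0, 1, size=n),
})
print(proxy_screen(X, group))
```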
While this evaluation may seem simple, straightforward, and intuitive, it's actually not.
In nearly all areas of consumer credit, we typically observe that certain protected class groups exhibit higher historical credit default rates than corresponding non-protected class groups. For example, we may observe that Hispanics exhibit a historical credit default rate of 8% while Whites exhibit a historical credit default rate of 5%. Assuming the same credit behavior for the two groups[4], the higher Hispanic default rate reflects a lower credit quality distribution of Hispanics relative to Whites.
Mathematically, the only way we can have distributional differences in group-level credit quality is if one or more model variables are correlated with race/ethnicity - for example, Hispanics exhibit higher average debt-to-income ratios than Whites, or Hispanics exhibit lower average FICO scores than Whites. So, on the one hand, we need model variables that are correlated with race/ethnicity to create an accurate predictive model, but - on the other hand - such variables may be considered to be demographic proxies that violate fair lending laws and regulations.
How does a Compliance Officer tell the difference?
Unfortunately, there is no regulatory guidance to assist with this determination. Some would suggest that if the predictive attribute has a logical, causal connection to borrower credit behavior, then there is a presumption that the attribute is not a proxy - even if there is an elevated correlation with customer demographic membership (i.e., there is sufficient "business justification"). However, others might argue that, even with such a causal relationship, the lender is required to consider less discriminatory alternatives (even if, somewhat counterfactually, such lower-correlated attributes would produce less accurate predictions of protected class default rates - thereby creating risks of a UDAAP claim as well as safety-and-soundness concerns).
Without clear regulatory guidance as to whether a demographically-correlated attribute is a legitimate predictive factor or an impermissible demographic proxy, it is likely that lenders will adopt a more conservative approach and over-exclude such attributes from credit models - thereby reducing the intended benefits of the technology and potentially shifting the lender's regulatory risks toward UDAAP and safety-and-soundness.
4. What is the Ultimate Objective of AI Credit Model De-Biasing?
I have previously written about current "de-biasing" processes in which machine learning algorithms are modified to produce a set of "less discriminatory alternative" credit risk models that have different combinations of predictive accuracy and disparate impact. In that post, I write:
"...it is important to note that during the de-biasing process, alternative algorithms are created by: (1) dropping certain predictors from the original trained model, (2) adding new predictors, (3) changing the weights on existing predictors, and/or (4) changing the AI/ML algorithm's hyperparameters in order to improve the fairness metric. Accordingly, it is possible that some alternatives may contain new predictors, exclude original predictors, or change the directional relationship of existing predictors in non-intuitive ways in order to improve the fairness metric - thereby compromising the conceptual soundness of the de-biased model and elevating the lender's safety-and-soundness risk."
The main point here is that there may not be a "free lunch" when it comes to improving a credit model's fairness - even if the reduction in predictive accuracy is considered minor and acceptable - once one includes other relevant model properties in the evaluation. For example, I note that some popular de-biasing processes achieve a reduced disparate impact by fundamentally altering the set of predictive relationships that drive the model's estimates of borrower credit quality. However, no constraints are imposed on how "radical" such alterations can be, or on whether the altered relationships are considered to be conceptually sound - that is, consistent with business intuition, applicable business policies and practices, and known borrower behavior.
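One way a lender might operationalize a basic conceptual-soundness screen on a candidate less discriminatory alternative - at least for models whose directional relationships can be read off directly - is to compare each predictor's estimated relationship before and after de-biasing and flag sign reversals or counterintuitive directions for review. The sketch below assumes linear-in-parameters models; the predictor names and coefficients are hypothetical.

```python
import pandas as pd

def direction_check(original_coefs, debiased_coefs, expected_signs):
    """Flag predictors whose de-biased coefficient flips sign relative to the
    original model or contradicts the documented, business-intuitive sign."""
    df = pd.DataFrame({"original": original_coefs, "debiased": debiased_coefs,
                       "expected_sign": expected_signs})
    df["sign_flip"] = (df["original"] * df["debiased"]) < 0
    df["counterintuitive"] = (df["debiased"] * df["expected_sign"]) < 0
    return df

# Hypothetical coefficients (log-odds of default) before and after de-biasing.
original = pd.Series({"dti": 3.2, "utilization": 2.1, "months_on_file": -0.01})
debiased = pd.Series({"dti": 3.0, "utilization": -0.4, "months_on_file": -0.01})
expected = pd.Series({"dti": 1, "utilization": 1, "months_on_file": -1})  # documented intuition
print(direction_check(original, debiased, expected))
```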
Accordingly, current de-biasing processes may not filter out those less discriminatory alternative model configurations that may raise safety-and-soundness concerns from model validators and regulatory examiners and, therefore, may provide lenders with incomplete information upon which to evaluate viable LDA models. Given the important interdependence of fairness and safety-and-soundness considerations inherent to credit models, this represents a critical area in which the CFPB and the applicable safety-and-soundness regulator need to be aligned.
5. Can Demographic Data Be Used to Remove Disparate Impact Without Violating ECOA?
Current model de-biasing approaches generally take the following forms:
Training data augmentation - for example, weighting protected class data observations more heavily than non-protected class data observations during model training to better balance demographic representativeness, or modifying the actual credit default outcomes of the protected class and non-protected class sample members prior to model training to better balance demographically the credit performance outcomes to which the algorithm is calibrated.
Model loss function augmentation - for example, instead of calibrating the model's parameters to optimize just one model training objective (i.e., predictive accuracy), the model's loss function is augmented so that model parameters are chosen to optimize a dual objective (i.e., predictive accuracy and model fairness).
Iterative model search - for example, a large set of potential models is generated by iteratively estimating models that contain different subsets of the original data inputs, or that use different algorithm-specific hyperparameter settings. After each potential model is estimated, it is evaluated for predictive accuracy and fairness. After all the iterations are completed, the potential models are ranked by predictive accuracy and fairness, and the lender selects a specific model configuration from that ranking (a simplified sketch of this approach follows the list).
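Below is a bare-bones sketch of the third approach, under several simplifying assumptions: logistic regression candidates fit on random feature subsets, predictive accuracy measured by AUC, fairness measured by the AIR at a fixed (hypothetical) approval depth, and demographic data used only to evaluate each fitted candidate - never as a model input. It is not a production de-biasing tool, and all names and data are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def air(scores, groups, protected, reference, cutoff):
    approved = scores >= cutoff
    return approved[groups == protected].mean() / approved[groups == reference].mean()

def iterative_search(X, y, groups, n_candidates=50, approval_rate=0.60, seed=0):
    """Fit candidate models on random feature subsets; record accuracy (AUC)
    and fairness (AIR) for each. Demographics are used only for evaluation."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(n_candidates):
        k = rng.integers(2, X.shape[1] + 1)
        cols = list(rng.choice(X.columns, size=k, replace=False))
        model = LogisticRegression(max_iter=1000).fit(X[cols], y)
        p_good = model.predict_proba(X[cols])[:, 1]          # P(non-default)
        cutoff = np.quantile(p_good, 1 - approval_rate)       # fixed approval depth
        results.append({
            "features": cols,
            "auc": roc_auc_score(y, p_good),
            "air": air(p_good, groups, protected=1, reference=0, cutoff=cutoff),
        })
    return pd.DataFrame(results).sort_values(["air", "auc"], ascending=False)

# Illustrative usage on synthetic data (hypothetical feature names):
rng = np.random.default_rng(42)
n = 3_000
groups = rng.binomial(1, 0.3, size=n)                      # 1 = protected (proxied)
X = pd.DataFrame({
    "credit_score":   rng.normal(700 - 25 * groups, 50, size=n),
    "dti":            rng.normal(0.35 + 0.04 * groups, 0.10, size=n),
    "utilization":    rng.normal(0.40, 0.20, size=n),
    "months_on_file": rng.normal(90, 40, size=n),
})
logit = -3.0 + 0.01 * (700 - X["credit_score"]) + 2.0 * X["dti"]
y_good = (rng.random(n) > 1 / (1 + np.exp(-logit))).astype(int)   # 1 = non-default
print(iterative_search(X, y_good, groups).head())
```

The ranking at the end is deliberately naive; in practice, a lender would examine the full accuracy/fairness trade-off across candidates rather than sort on a single criterion.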
A crucial feature of all de-biasing approaches is that demographic data is used either directly or indirectly in the model development process - raising concerns that such usage violates fair lending laws and regulations. In effect, through the de-biasing process, lenders shift their fair lending risk from disparate impact to disparate treatment.
This risk is more pronounced for the de-biasing approaches in which demographic data is used directly in the model development process - such as with training data augmentation and model loss function augmentation. Alternatively, under the iterative model search approach, demographic data is used only indirectly - to evaluate the fairness of each potential model after estimation - which appears to significantly lower the disparate treatment concern. Nevertheless, none of the primary financial regulators have commented on these alternative approaches - a silence that has created significant regulatory uncertainty for Compliance Officers and consumer lenders seeking to produce algorithms with improved financial access and fairness.
6. How Far Must a Lender Go in Researching "Less Discriminatory Alternative" Models?
As discussed above, proponents of model de-biasing view credit algorithms as the "policy" that causes disparate impact and, therefore, advocate that lenders search for less discriminatory alternative models - even if there are arguments and evidence to support the algorithm's business necessity / justification. Unfortunately, no constraints or limits on this search have been proposed - and it is unclear how much time, effort, and resources lenders are expected to devote to searching for these potential LDAs.
For example, assume that a lender develops a new AI credit scoring model using logistic regression and 20 linear predictive attributes. During fair lending testing, the model's AIR for African American vs. White applicants is 0.75 - indicating that the expected approval rate for African Americans is 25% lower than the expected approval rate for Whites. Despite this AIR value, the model's overall predictive accuracy is high - and group-level predictive accuracy for both African Americans and Whites is equal. Nevertheless:
Must the lender search through all possible combinations of the 20 predictive attributes for LDAs - that is, 2^20 = 1,048,576 different models? What if there are 50 predictive attributes and, therefore, 2^50 ≈ 1,125,900,000,000,000 different combinations? (The sketch following these questions makes this arithmetic concrete.)
Must the lender include all potential attribute interactions in its search - even though the original model relationships are strictly linear? What about other potential functional forms?
Must the lender search for and consider additional data attributes outside of the original model development sample? In this example where the sample includes 20 predictive attributes, is the lender expected to search for additional attributes to add to the original sample? If so, how much search is expected?
Must the lender explore additional model methodologies outside of logistic regression? That is, is the lender expected to consider a random forest model? A gradient boosted tree model? If so, how many different model methodologies is the lender expected to search?
How much hyperparameter tuning / searching is expected? Must the lender explore all possible hyperparameter configurations?
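To put the combinatorics behind these questions in concrete terms, the short sketch below counts candidate models under a few illustrative definitions of the search space; the "drop-at-most-two-attributes" bound is purely hypothetical and is not drawn from any regulatory standard.

```python
from math import comb

def candidate_count(n_attributes):
    """Number of distinct attribute subsets (each defining one candidate model),
    ignoring interactions, alternative functional forms, algorithms, and
    hyperparameters - all of which multiply this count further."""
    return 2 ** n_attributes

for n in (10, 20, 50):
    print(f"{n} attributes -> {candidate_count(n):,} candidate subset models")

# One purely hypothetical way to bound the search: consider only models that
# drop at most two of the original 20 attributes.
n = 20
bounded = 1 + comb(n, 1) + comb(n, 2)
print(f"drop-at-most-two search over {n} attributes -> {bounded} candidate models")
```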
While de-biasing proponents would likely say that lenders should conduct a "reasonable" search for LDAs, there is no guidance as to what "reasonable" is in this context. And what happens if, despite the lender's efforts, an adversary finds an alternative model configuration that exhibits less bias? What is needed here are some guardrails to help Compliance Officers and lenders reduce the regulatory uncertainty associated with proposed de-biasing solutions.
* * *
ENDNOTES:
[1] See, for example, the appendix to OCC Risk Bulletin 97-24 ("Compliance Issues: ECOA (Regulation B) and the Fair Housing Act") and the appendix to the FFIEC's Interagency Fair Lending Examination Procedures ("Considering Automated Underwriting and Credit Scoring").
[2] See, for example, the interagency RFI "Request for Information and Comment on Financial Institutions' Use of Artificial Intelligence, Including Machine Learning" issued on March 30, 2021, the CFPB's May 26, 2022 statement confirming ECOA's adverse action requirements apply equally to AI-based credit models, and the CFPB's December 15, 2021 blog post calling on tech workers to serve as whistleblowers if they "...have detailed knowledge of the algorithms and technologies used by companies and ... know of potential discrimination or other misconduct within the CFPB’s authority."
[3] I note that the use of an AIR threshold of 0.80 has not been formally endorsed by the federal financial regulators and, therefore, introduces an additional layer of uncertainty with respect to measuring potential credit model bias. Additionally, some industry participants advocate for a stricter 0.90 AIR threshold - see Fair Lending Monitorship of Upstart Network’s Lending Model: Second Report of the Independent Monitor, November 10, 2021.
[4] This assumption is actually required for a model to be in compliance with ECOA and other applicable fair lending laws and regulations. That is, if we were to alternatively assume that Hispanics behave differently than Whites with respect to similar credit-related characteristics - such as debt-to-income ratios, credit scores, etc. - then such an assumption would require us to consider explicitly the customer's race/ethnicity within the model's architecture and/or during the model's use. However, this is prohibited under current fair lending laws and regulations.
© Pace Analytics Consulting LLC, 2023