What did you expect?
This case study follows Jonathan Blum, a New York-based author and GIS novice, as he tests the assumed, unquestioned relationship between loan grades and interest rates. As you read his story, keep in mind that this workflow can answer a variety of different questions.
Most of us probably expect, for example, that communities with higher average incomes will pay higher average income taxes. But is this consistently true? Where is it less true or more true across the country? We might believe that agricultural areas with the best growing conditions will produce the highest yields. Is this always the case everywhere? If not, why not? Wouldn't it be reasonable to assume that schools with better teacher-to-student ratios will have higher test scores?
As Jonathan finds out, measuring and then mapping expected relationships like these can sometimes lead to unexpected, and very interesting, findings.
How much does money cost?
If you needed money to consolidate debt, pay for a wedding, take a vacation, fix your home, or cover other expenses, would you apply for an online loan? Last year millions of people answered that question with a resounding yes! If we choose to join them, what will our loan interest rate be? Most of us take for granted that a poor credit score will translate directly to a higher interest rate. It's a valid assumption ... isn't it?
Jonathan Blum wants to know more. He is a New York-based author writing a layman's guide to information science, and as part of his work he has been investigating online marketplace lending. These Internet-based, typically non-bank lenders use proprietary computer algorithms and credit checking tools to connect borrowers, lenders, and investors. They can provide loan approval and funding within minutes or hours. Contrast that with loan processing by traditional banks, which can take days or even weeks. While online marketplace lending is still small compared to the trillions of dollars in loans provided by traditional banks, growth rates for this industry are exponential. This is attracting attention not only from borrowers and investors, but also from lawyers and regulators.
Do online marketplace lenders unfairly discriminate?
Thus far, online lending has thrived with minimal oversight. This is changing. The U.S. Supreme Court recently upheld a regulator's use of disparate impact in a housing-related discrimination case. Disparate impact can occur when loan decisions that are not intentionally discriminatory result in discriminatory outcomes. A policy of only funding home loans above $200,000, for example, even if applied consistently, could have the unintended impact of redlining if the average home value in a region's minority neighborhoods is less than $200,000. Avoiding disparate impact is difficult for lenders because it isn't exposed until many loans have been made.
After direct interviews with company executives and experts in online marketplace lending, Jonathan decides to see if he can teach himself enough about GIS and spatial analysis to map the online lending frontier and test assumptions about what drives loan interest rates. In the end, credit scores do matter. They are an important component in calculating loan grades and setting interest rates, but they don't tell the whole story.
Jonathan begins his exploratory analysis by seeing which areas of the country are participating in online lending, and where interest rates are highest and lowest. He then creates a very basic model to predict average interest rates from average loan grades. Jonathan's results are interesting and certainly not what he was expecting.
What data will Jonathan need?
Lending Club provides loan data that can be easily downloaded, linked to ZIP3 areas, and analyzed. (ZIP3 areas are the geometry defined by the first three digits of a standard 5-digit ZIP Code.) Jonathan downloads data for all of the loans Lending Club accepted or rejected between August 2007 and September 2015. The data is then summarized, yielding the total number of loan submissions, the total number of loans issued, the average interest rate for loans issued, and the average loan grade for loans issued within each ZIP3 area. He also obtains Business Analyst data for the total number of households within each ZIP3 for 2014.
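The summarization step can be sketched roughly as follows. This is a minimal illustration, not Jonathan's actual workflow: the record layout, the field names, the sample values, and the numeric grade ranking (A = 1 through G = 7) are all assumptions made for the example.

```python
from collections import defaultdict

# Assumed ranking used to average letter grades (not taken from the case study).
GRADE_RANK = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7}

# Hypothetical loan submissions; every value here is invented for illustration.
loans = [
    {"zip": "10001", "issued": True, "rate": 7.5, "grade": "A"},
    {"zip": "10025", "issued": True, "rate": 12.5, "grade": "C"},
    {"zip": "10036", "issued": False, "rate": None, "grade": None},
    {"zip": "36602", "issued": True, "rate": 15.0, "grade": "D"},
    {"zip": "36695", "issued": False, "rate": None, "grade": None},
]


def summarize_by_zip3(records):
    """Group submissions by the first three ZIP digits and summarize each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["zip"][:3]].append(rec)

    summary = {}
    for zip3, recs in groups.items():
        issued = [r for r in recs if r["issued"]]
        n = len(issued)
        summary[zip3] = {
            "submissions": len(recs),
            "issued": n,
            # Averages are only defined where at least one loan was issued.
            "avg_rate": sum(r["rate"] for r in issued) / n if n else None,
            "avg_grade_rank": sum(GRADE_RANK[r["grade"]] for r in issued) / n if n else None,
        }
    return summary


zip3_summary = summarize_by_zip3(loans)
for zip3, stats in sorted(zip3_summary.items()):
    print(zip3, stats)
```

Household counts from a separate source could then be attached to each ZIP3 entry using the same three-digit key.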
Where is the online lending frontier?
Jonathan examines the number of Lending Club loan applications submitted each year. There is a clear increase in the number of submissions.
He wonders if participation in online lending is evenly distributed across the United States. Since there will be more loan applications (more of most everything, in fact) in the ZIP3s that have more people, a map of loan application counts would not be very helpful here. It would probably only reveal where most people in the contiguous United States live. Consequently, to get a picture of the online lending frontier (locations where online lending is concentrating), he must create a rate variable. He divides the number of loan applications in each ZIP3 by the number of households in each ZIP3 to get per-household online loan application rates. He maps these rates using hot spot analysis. The result, shown below, identifies which areas of the country are participating most heavily in online lending (red) and which areas are not (blue).
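As a rough sketch of the two steps just described, the code below computes per-household rates and then a hot spot z-score for each area. It assumes the hot spot analysis is based on the Getis-Ord Gi* statistic with simple fixed-distance neighborhoods, which is a common choice but is not stated in the case study. The area names, coordinates, counts, and the neighborhood radius are invented.

```python
import math

# Hypothetical ZIP3 areas with centroid coordinates; all values are invented.
areas = [
    {"zip3": "W1", "xy": (0.0, 0.0), "applications": 900, "households": 10000},
    {"zip3": "W2", "xy": (1.0, 0.0), "applications": 2000, "households": 20000},
    {"zip3": "W3", "xy": (2.0, 0.0), "applications": 400, "households": 5000},
    {"zip3": "E1", "xy": (10.0, 0.0), "applications": 100, "households": 10000},
    {"zip3": "E2", "xy": (11.0, 0.0), "applications": 600, "households": 30000},
    {"zip3": "E3", "xy": (12.0, 0.0), "applications": 80, "households": 8000},
]

# Rate variable: applications per household, so populous areas do not dominate.
rates = [a["applications"] / a["households"] for a in areas]
coords = [a["xy"] for a in areas]


def getis_ord_gi_star(values, points, radius):
    """Return a Gi* z-score per location using binary weights within `radius`."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(max(0.0, sum(v * v for v in values) / n - mean * mean))
    scores = []
    for i in range(n):
        # Each location counts as its own neighbor (distance 0), as Gi* requires.
        w = [1.0 if math.dist(points[i], p) <= radius else 0.0 for p in points]
        w_sum = sum(w)
        w_sq_sum = sum(x * x for x in w)
        numerator = sum(wj * vj for wj, vj in zip(w, values)) - mean * w_sum
        denominator = std * math.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
        # Guard against degenerate cases (constant values or all-inclusive radius).
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    return scores


z_scores = getis_ord_gi_star(rates, coords, radius=1.5)
for area, rate, z in zip(areas, rates, z_scores):
    print(f'{area["zip3"]}: rate={rate:.3f}, Gi* z={z:+.2f}')
```

In this toy data, E2 has more applications than W3 but a much lower rate, which is the reason for mapping rates instead of counts. Positive z-scores mark areas surrounded by high rates (red on a hot spot map) and negative z-scores mark areas surrounded by low rates (blue).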
The dark red areas on both the west and east coasts have the most intense clustering of high per-household online loan application rates, followed by southern areas near Atlanta, Montgomery, and north of Miami. In contrast, vast expanses of the country appear not to be participating in online lending at all. Households in Iowa, Nebraska, North Dakota, Maine, and pockets of South Dakota and Idaho are either not interested, not aware, or not able to participate.
Are interest rates consistent across the country?
Having determined that the level of participation in online lending varies across the country, Jonathan wonders if the average interest rates people pay for their online loans vary as well.
To ensure that the average interest rate computed for each ZIP3 is truly representative and reliable (see Kennedy, 1989), Jonathan is advised to focus his remaining analyses on ZIP3s with a minimum of 30 funded loans. With these ZIP3s selected, he performs another hot spot analysis, this time mapping average interest rates.
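The selection step amounts to a simple threshold filter. The sketch below assumes a per-ZIP3 summary keyed by ZIP3 code; the codes, counts, and rates shown are invented.

```python
MIN_FUNDED_LOANS = 30  # threshold stated in the case study

# Hypothetical per-ZIP3 summaries; all values are invented.
zip3_stats = {
    "100": {"issued": 412, "avg_rate": 12.1},
    "366": {"issued": 57, "avg_rate": 14.3},
    "592": {"issued": 12, "avg_rate": 13.0},
}

# Keep only areas whose average rests on enough funded loans to be reliable.
reliable = {
    zip3: stats
    for zip3, stats in zip3_stats.items()
    if stats["issued"] >= MIN_FUNDED_LOANS
}
excluded = sorted(set(zip3_stats) - set(reliable))

print("kept:", sorted(reliable))
print("excluded:", excluded)
```

An average computed from only a handful of loans can swing widely with a single unusual loan, so areas below the threshold are left off the map.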
Notice that there is a definite geography to the online loan interest rates. Red areas are locations where the highest average interest rates concentrate. Similarly, the blue areas are locations with concentrations of the lowest average interest rates. Excluded from the map are locations with fewer than 30 funded loans.
So why are interest rates in Alabama higher than interest rates around San Francisco?
Company executives, online lending experts, and the Lending Club website all confirm that interest rates are a function of loan grades. The logic is simple enough. The borrowers who are assigned A and B loan grades tend to have the healthiest credit metrics and represent the lowest lending risk. Grades C and D, down to G, tend to have progressively lower credit scores. Once a loan grade is assigned to a loan application, the corresponding interest rate can simply be obtained from a table. Consequently, if interest rates are higher in Alabama, as shown in the hot spot map above, it is fair to assume it is because the loan grades assigned there reflect riskier loans. A risky borrower in San Francisco should be just as risky in Mobile.
Ever the skeptic, Jonathan sees if he can build his own predictive model to confirm the published relationship between interest rates and loan grades. He learns about Ordinary Least Squares (OLS) regression and creates a basic model. His model uses only the average loan grade in each ZIP3 to predict the average interest rate in each ZIP3.
The OLS tool computes the predicted average interest rate values and then creates the residual map shown below. The residual map indicates where Jonathan's model predicted well, where it predicted too high, and where it predicted too low. If a prediction for a particular ZIP3 is too high, it means that the actual average interest rate value for that ZIP3 is lower than expected, given the associated average loan grade rank. Similarly, if the prediction is too low, it means the actual average interest rate is higher than expected, given the corresponding average loan grade rank.
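A one-predictor OLS model and its residuals can be sketched in a few lines. The numbers below are invented, and the labels simply restate the interpretation given above: a positive residual means the model predicted too low, and a negative residual means it predicted too high.

```python
def fit_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical ZIP3 averages (grade rank: lower is better); values are invented.
avg_grade_rank = [1.5, 2.0, 2.5, 3.0, 3.5]
avg_rate = [8.0, 10.5, 12.0, 14.5, 16.0]

intercept, slope = fit_ols(avg_grade_rank, avg_rate)
predicted = [intercept + slope * g for g in avg_grade_rank]

# Residual = actual - predicted.
residuals = [actual - pred for actual, pred in zip(avg_rate, predicted)]

# Positive residual: actual rate is higher than expected (prediction too low).
# Negative residual: actual rate is lower than expected (prediction too high).
labels = ["too low" if r > 0 else "too high" for r in residuals]

for g, r, lab in zip(avg_grade_rank, residuals, labels):
    print(f"grade rank {g}: residual {r:+.2f} (prediction {lab})")
```

Mapping these residuals by ZIP3, with one color for each label, gives the kind of residual map the case study describes.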
Jonathan understands this and realizes that if interest rates are purely a function of loan grades, his simple model should confirm two things:
- The measure, called R2, should be high (above 0.90 on a scale from 0.00 to 1.00). R2 quantifies how well, overall, average loan grade values predict average interest rate values.
- In addition, model predictions should be consistently accurate across the country. The OLS model might predict a bit high for one ZIP3 and a bit low for another ZIP3, but these overpredictions and underpredictions should be randomly distributed among all the ZIP3s. Jonathan should not see any clustering of the overpredictions (blue) or the underpredictions (red) on the OLS residual map.
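Both checks can be expressed numerically. The sketch below computes R2 directly and uses global Moran's I as one common way to test whether residuals cluster in space. The choice of Moran's I, the neighbor structure (six areas in a row, each adjacent to the next), and all values are assumptions for illustration, not details from the case study.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def morans_i(values, weights):
    """Global Moran's I for `values` given a square spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / w_sum) * (cross / sum(d * d for d in dev))


# Check 1: overall fit (invented actual and predicted average rates).
r2 = r_squared([8.0, 10.5, 12.0, 14.5, 16.0], [8.2, 10.2, 12.2, 14.2, 16.2])

# Check 2: spatial pattern of residuals for six areas arranged in a row.
n = 6
chain = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]
clustered = [-0.4, -0.3, -0.5, 0.4, 0.5, 0.3]    # similar residuals sit together
alternating = [-0.4, 0.4, -0.4, 0.4, -0.4, 0.4]  # neighbors always differ

i_clustered = morans_i(clustered, chain)
i_alternating = morans_i(alternating, chain)

print(f"R2 = {r2:.3f}")
print(f"Moran's I, clustered residuals:   {i_clustered:+.2f}")
print(f"Moran's I, alternating residuals: {i_alternating:+.2f}")
```

A clearly positive Moran's I indicates that overpredictions and underpredictions are grouping together geographically, which is the pattern a well-specified model should not produce. A high R2 alone does not rule this out.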
Results are not quite what Jonathan expected. He notes from the OLS output that the R2 value is 0.94. This indicates that, overall, the average loan grade rankings did a very good job of predicting the average interest rate values. Looking at the residual map below, however, reveals a problem. The residual map shows where the predicted average interest rate values are too high (blue) and where they are too low (red). Notice that the entire state of Mississippi is blue. Jonathan expected a random pattern of overpredictions and underpredictions, but there is nothing spatially random about lower-than-expected interest rates for an entire state. Apparently, average loan grade rankings are not an effective predictor of average interest rates in that part of the country.
Finding lower-than-expected interest rates throughout the state of Mississippi is important. It gives the impression, at least, of either intentional bias or disparate impact. In any case, it is clear that average interest rates are not purely a function of average loan grades everywhere in the country.
Where do average loan grade rankings have the biggest impact on average interest rates?
When the relationship between two variables is strong, you can predict the value of one from the other. This is what Jonathan did with his simple OLS model above. The OLS method, however, summarizes relationship strength using a single value (a single coefficient). In other words, it assumes the relationship between average loan grades and average interest rates is the same for every ZIP3 in the country. If Jonathan wants to examine how this relationship changes (that is, if he wants to see where average loan grade rankings have a larger or smaller impact on average interest rates), he needs to learn about another regression technique called Geographically Weighted Regression (GWR). GWR computes a coefficient for each ZIP3. Where coefficients are large, changes in the average loan grade ranking will have a larger impact on average interest rates; where coefficients are small, changes in average loan grade rankings will have a smaller impact on average interest rates. Jonathan creates a map of the GWR regression coefficients, shown below. The darkest areas reflect locations where the relationship between average loan grade rankings and average interest rates is strongest. Improvements in average loan grade rankings in these locations will have the largest impact on reducing interest rates. Conversely, a change in the average loan grade ranking in the lightest areas (where the relationship is weak) will have the smallest impact on average interest rates.
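The idea behind GWR can be sketched as a separate weighted regression for each location, with nearby areas weighted more heavily than distant ones. The version below is a simplified illustration that assumes a Gaussian distance kernel with a fixed, hand-picked bandwidth; production GWR tools typically offer other kernels and choose the bandwidth automatically. All coordinates and values are invented.

```python
import math


def local_fit(i, coords, x, y, bandwidth):
    """Weighted least squares centered on location i; returns (intercept, slope)."""
    # Gaussian kernel: weight decays smoothly with distance from location i.
    w = [math.exp(-0.5 * (math.dist(coords[i], c) / bandwidth) ** 2) for c in coords]
    w_sum = sum(w)
    mean_x = sum(wj * xj for wj, xj in zip(w, x)) / w_sum
    mean_y = sum(wj * yj for wj, yj in zip(w, y)) / w_sum
    sxx = sum(wj * (xj - mean_x) ** 2 for wj, xj in zip(w, x))
    sxy = sum(wj * (xj - mean_x) * (yj - mean_y) for wj, xj, yj in zip(w, x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical areas: a western group where rates respond strongly to grade rank
# and an eastern group where they barely respond. All values are invented.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0),
          (10.0, 0.0), (11.0, 0.0), (12.0, 0.0), (13.0, 0.0)]
avg_grade_rank = [1.5, 2.0, 2.5, 3.0, 1.5, 2.5, 2.0, 3.0]
avg_rate = [8.0, 10.0, 12.0, 14.0, 9.5, 10.5, 10.0, 11.0]

local_slopes = [
    local_fit(i, coords, avg_grade_rank, avg_rate, bandwidth=2.0)[1]
    for i in range(len(coords))
]

for xy, slope in zip(coords, local_slopes):
    print(f"location {xy}: local coefficient {slope:.2f}")
```

In this toy example the local coefficients are close to 4 in the western group and close to 1 in the eastern group, whereas a single OLS fit would report one blended coefficient for all eight areas. Mapping the local coefficients is what reveals where the relationship is strong and where it is weak.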
The map above suggests interest rates are not purely a function of loan grades, at least not everywhere. In both Mississippi and much of Kansas, for example, there is a weak relationship between average loan grades and average interest rates. Correspondingly, interest rates are lower than expected, on average, throughout Mississippi and higher than expected, on average, in much of Kansas. This has tangible and material consequences. Differences in loan interest rates impact the entire economy. When access to loans is limited because of high interest rates, people tend to save, to spend less, and businesses tend to scale down. When loan interest rates are low, people are more willing to both borrow and spend and businesses are more likely to expand.
The question asked at the beginning of this case study is: do online lenders unfairly discriminate? Researchers have found evidence of both race and gender discrimination in a variety of online marketplaces. Jonathan's exploratory analysis contributes to this important research area by uncovering evidence of geographic discrimination associated with online lending. Jonathan has only considered loan grades, however. Despite published tables indicating a direct relationship between loan grades and interest rates, the maps above suggest other factors must also be involved. For example, some researchers are finding that as many as one-third of borrowers will purposely choose the loan with the fastest funding time over the one with the lowest interest rate.
Armed and ready for a better story
Jonathan is not a professional data scientist. He is not looking to statistically prove one outcome over another. He is a journalist. His job is to report on and inform emerging debates around the important story of online lending. Maps and the analyses diagrammed here are new, critical storytelling tools.
Jonathan can now sketch the geography of online lending on napkins. He can send email and post on social media about his work. He can disclose a transparent, statistically viable argument that stakeholders can easily respond to.
He is finding that good maps are ideal material for regulators and managers as well. Maps create neutral storytelling grounds that teams of people can collaborate on and easily understand. Regulators seem to be more willing to talk; managers are more transparent. As scrutiny increases around marketplace lending, Jonathan expects these maps will help focus the debate.
Jonathan senses that competitors will also be drawn to these maps. This case study only considers data from Lending Club, but billions of dollars in other loans have been made. Perhaps competing firms will find opportunities by offering lower rates in the locations associated with higher-than-expected interest rates.
Data-driven mapping provides a powerful tool for storytelling.