The geography of online lending, workflow

Workflow using ArcGIS Pro

Create a hot spot map of average interest rates

If you haven't done so already, download and unzip the data package provided at the top of this workflow.
Open ArcGIS Pro and browse to the GeographyOfOnlineLending.ppkx project package.
Once the project opens, right-click the ZIP3LoanData layer in the Contents pane and click Attribute Table.

Notice that each ZIP3 (each three-digit ZIP Code) has data for the total number of loan applications submitted, the total number of loans issued (accepted loans), the average interest rate for all loans issued, average loan grade ranking for all loans issued, and total number of households.

To avoid the small numbers problem (Kennedy, 1989) and ensure the average interest rate reported for each ZIP3 is both reliable and representative, you will focus your analysis on ZIP3s where at least 30 loans have been funded.

Begin by searching for the Select Layer By Attribute tool in the Geoprocessing pane.
To select the ZIP3s with at least 30 loans, run the Select Layer By Attribute tool with the following parameters. Click the Add Clause button to create the Expression.
- Layer Name or Table View : ZIP3LoanData
- Selection type : New selection
- Expression : Number of loans issued is Greater Than or Equal to 30
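The selection rule above is inclusive: a ZIP3 with exactly 30 issued loans is kept. As a minimal sketch of that logic outside ArcGIS Pro, the following plain-Python filter uses made-up records; the ZIP3 codes and the `loans_issued` key are hypothetical and are not the project's actual field names.

```python
# Hypothetical records: the ZIP3 codes and the "loans_issued" key are
# made up for illustration and are not the project's real field names.
zip3_records = [
    {"zip3": "100", "loans_issued": 412},
    {"zip3": "594", "loans_issued": 30},
    {"zip3": "821", "loans_issued": 29},
    {"zip3": "036", "loans_issued": 7},
]

MIN_LOANS = 30  # threshold used in this workflow


def has_enough_loans(record, minimum=MIN_LOANS):
    """Return True when the record meets the minimum loan count (inclusive)."""
    return record["loans_issued"] >= minimum


# Keep only the ZIP3s whose averages rest on at least 30 loans.
selected = [r for r in zip3_records if has_enough_loans(r)]
selected_codes = [r["zip3"] for r in selected]
print(selected_codes)  # ['100', '594']
```

Note that the ZIP3 with 30 loans is retained while the one with 29 is dropped, which is the behavior of the Greater Than or Equal to clause.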

Find and open the Copy Features tool and use it to create a new feature class containing only the ZIP3s that have at least 30 accepted loans:
- Input Features : ZIP3LoanData
- Output Feature Class : the name of your output feature class such as ZIP3Data4Analysis
Right-click ZIP3LoanData in Contents and Remove it from the map document so it is out of the way. You will use ZIP3Data4Analysis for the remaining analyses.

To create a hot spot map of the average loan interest rates, you will use the Hot Spot Analysis tool. The tool has a number of parameters. In addition to the Input Feature Class, Input Field, and Output Feature Class, there are several other parameters including Conceptualization of Spatial Relationships, Distance Band or Threshold Distance, and a check box to indicate whether or not you want to apply a False Discovery Rate Correction (FDR).

The default Conceptualization of Spatial Relationships is Fixed distance band. The advantage of this choice is that it keeps the scale of your analysis consistent (fixed) across the study area.

Dive-in:

Spatial analysis tools, like Hot Spot Analysis, evaluate each feature (each ZIP3 in this case) within the context of neighboring features; your selection for the Conceptualization of Spatial Relationships parameter determines which features will be neighbors and which will not. If you select one of the polygon contiguity methods, for example, a feature's neighbors will be those that touch it. When the polygons in a dataset have very different sizes, this conceptualization will result in different scales of analysis across your study area. Notice in the ZIP3 maps below that polygons in the eastern part of the country are much smaller than polygons in the western part of the country. Consequently, a feature and its neighbors in the West will cover a much larger area (will have a larger scale of analysis) than a feature and its neighbors in the East. Keeping the scale of analysis the same across your study area is often best because of something called MAUP (the Modifiable Areal Unit Problem). When you change your scale of analysis, you can get different answers (often because when you change your scale of analysis, you change the question you are asking). Here is a simple example. Suppose you want to ask the question: are there sufficient physicians to serve the population? You look at the number of people and the number of physicians in a particular county (scale of analysis is the county) and see that the ratio matches the country as a whole. You answer: Yes, there are sufficient physicians. Next, you change your scale of analysis to ZIP Codes. Examining the ZIP Codes throughout the county, you notice that all of the people live in the southernmost ZIP Codes and all of the physicians are located far away in the northernmost ZIP Codes. Your answer is now: No, there are not sufficient physicians. Notice that the answers are 180 degrees different, yet both are correct for their particular scale of analysis.
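The physician example can be worked through with numbers. The sketch below is purely illustrative: the four ZIP Codes, their populations, their physician counts, and the national rate of 2.5 physicians per 1,000 people are all invented so that the county-level and ZIP Code-level answers disagree in the way described above.

```python
# Hypothetical numbers for one county split into four ZIP Codes.
zip_codes = [
    {"name": "north-1", "people": 0, "physicians": 6},
    {"name": "north-2", "people": 0, "physicians": 4},
    {"name": "south-1", "people": 2500, "physicians": 0},
    {"name": "south-2", "people": 1500, "physicians": 0},
]
NATIONAL_RATE = 2.5  # assumed physicians per 1,000 people


def physicians_per_1000(people, physicians):
    """Physicians per 1,000 residents, or None when nobody lives there."""
    if people == 0:
        return None
    return 1000 * physicians / people


# Scale of analysis 1: the county as a whole.
county_people = sum(z["people"] for z in zip_codes)
county_physicians = sum(z["physicians"] for z in zip_codes)
county_rate = physicians_per_1000(county_people, county_physicians)

# Scale of analysis 2: each ZIP Code where people actually live.
populated_rates = {}
for z in zip_codes:
    rate = physicians_per_1000(z["people"], z["physicians"])
    if rate is not None:
        populated_rates[z["name"]] = rate

print("County rate:", county_rate)
print("Rates where people live:", populated_rates)
```

At the county scale the rate matches the assumed national rate, while every populated ZIP Code has a rate of zero. The data is identical in both cases; only the scale of analysis changed.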

The default value for the Distance Band or Threshold Distance parameter is the minimum distance to ensure every feature (every ZIP3) has at least one neighbor. Often, this is not the best choice (see Selecting a fixed-distance band value). In this case, however, you are analyzing individual loan application data that, unfortunately, is only identified geographically by a three-digit ZIP Code; no other location information is available. Because the default distance provided by Hot Spot Analysis, when no distance value is given, represents the minimum valid distance (scale of analysis), it is the most appropriate distance band value for this data.
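To make the idea of a minimum distance that gives every feature at least one neighbor concrete, here is a small sketch. It assumes distances are measured between feature centroids in a projected coordinate system, and the four centroids are invented. It illustrates the concept only and is not the tool's internal code.

```python
import math

# Hypothetical projected centroid coordinates (in kilometers).
centroids = {
    "A": (0.0, 0.0),
    "B": (3.0, 4.0),
    "C": (6.0, 8.0),
    "D": (6.0, 20.0),
}


def nearest_neighbor_distance(name, points):
    """Distance from one feature to its closest other feature."""
    x, y = points[name]
    return min(
        math.hypot(x - ox, y - oy)
        for other, (ox, oy) in points.items()
        if other != name
    )


def minimum_valid_distance_band(points):
    """Smallest band at which no feature is left without a neighbor.

    This is the largest of the nearest-neighbor distances: the most
    isolated feature dictates the band for everyone.
    """
    return max(nearest_neighbor_distance(name, points) for name in points)


band = minimum_valid_distance_band(centroids)
print(band)  # 12.0
```

Features A, B, and C each have a neighbor within 5 km, but the isolated feature D needs 12 km to reach C, so 12 km becomes the band for the whole study area.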

Hot Spot Analysis will visit each ZIP3 and compute the average interest rate for that ZIP3 and any surrounding ZIP3s within the Distance Band or Threshold Distance specified. If this local average interest rate is significantly higher than the average interest rate for all ZIP3s across the country, the ZIP3 is a hot spot. If this local average interest rate is significantly lower than the average interest rate for all ZIP3s across the country, the ZIP3 is a cold spot. Applying the FDR correction to account for multiple testing and spatial dependence is always a good idea.
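The comparison of a local average against the global average is formalized by the Getis-Ord Gi* statistic, which Hot Spot Analysis reports as a z-score. The sketch below is a simplified, plain-Python version under stated assumptions: binary weights (1 inside the fixed distance band, 0 outside), each feature counted as its own neighbor, and six invented features placed along a line. It is meant to show the shape of the calculation, not to reproduce the tool's output.

```python
import math


def gi_star(values, coords, band):
    """Getis-Ord Gi* z-scores using a binary fixed distance band.

    Each feature counts as its own neighbor (distance 0 <= band).
    """
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    scores = []
    for xi, yi in coords:
        weights = [
            1.0 if math.hypot(xi - xj, yi - yj) <= band else 0.0
            for xj, yj in coords
        ]
        w_sum = sum(weights)
        w_sq_sum = sum(w * w for w in weights)
        local_sum = sum(w * v for w, v in zip(weights, values))
        # Observed local sum minus the sum expected under the global mean.
        numerator = local_sum - mean * w_sum
        denominator = s * math.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
        # Guard: undefined when all values are equal or every feature
        # is a neighbor of every other feature.
        scores.append(numerator / denominator if denominator else 0.0)
    return scores


# Hypothetical average interest rates for six features along a line,
# spaced one unit apart: low rates on the left, high on the right.
rates = [10.0, 10.0, 10.0, 14.0, 15.0, 16.0]
locations = [(float(i), 0.0) for i in range(6)]
z_scores = gi_star(rates, locations, band=1.0)
for rate, z in zip(rates, z_scores):
    print(f"rate={rate:5.1f}  Gi* z-score={z:6.2f}")
```

Features on the low-rate side receive negative z-scores (candidate cold spots) and features on the high-rate side receive positive z-scores (candidate hot spots). Whether a score is statistically significant depends on the p-value thresholds, which is where the FDR correction comes in.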

Each of the Hot Spot Analysis parameters is described in the tool documentation, including information on how to select an appropriate Conceptualization of Spatial Relationships and how to select an appropriate fixed-distance band value.

Find and open the Hot Spot Analysis tool and run it with the following parameters.
- Input Feature Class : ZIP3Data4Analysis
- Input Field : AveInterestRate
- Output Feature Class : the name of your output feature class such as InterestRateHSA
- Conceptualization of Spatial Relationships : Fixed distance band
- Apply False Discovery Rate (FDR) Correction : checked
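To see why an FDR correction matters when a test is run once per feature, the sketch below applies the classic Benjamini-Hochberg procedure to a handful of invented p-values. This procedure is used here as a stand-in to illustrate the idea; the exact way the tool computes its adjusted thresholds may differ, so treat this as an assumption rather than a description of the tool.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking which p-values remain significant.

    Classic Benjamini-Hochberg step-up procedure: sort the p-values,
    find the largest rank k with p(k) <= alpha * k / m, and keep every
    test ranked at or below k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant


# Hypothetical p-values, one per feature.
p_values = [0.041, 0.001, 0.20, 0.012, 0.035]

uncorrected = [p <= 0.05 for p in p_values]
corrected = benjamini_hochberg(p_values, alpha=0.05)
print("Significant without correction:", sum(uncorrected))  # 4
print("Significant with FDR correction:", sum(corrected))   # 2
```

Four of the five features pass an uncorrected 0.05 threshold, but only the two strongest results survive the correction. With hundreds of ZIP3s tested, this kind of adjustment reduces the number of features flagged by chance alone.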

The red areas below are hot spots (statistically significant clusters of high loan interest rates). The blue areas are cold spots (statistically significant clusters of low interest rates). Notice that Alabama has higher-than-expected average interest rates.

Hot Spot Analysis of average interest rates

Create a model of average interest rate values

You will next see how well average loan grade rankings explain (predict) average interest rates using Ordinary Least Squares regression (OLS). If average loan grade values effectively predict the average interest rate values, you should get a high R² value and the model overpredictions and underpredictions (residuals) should exhibit a spatially random pattern.
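The two diagnostics mentioned above, adjusted R² and the residuals, can be computed by hand for a one-variable model. The following sketch fits a simple least-squares line in plain Python. The six grade and rate pairs are invented for illustration and do not come from the project data.

```python
def simple_ols(x, y):
    """Fit y = intercept + slope * x and return fit diagnostics."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual = actual - predicted. Positive means the model predicted
    # too low (underprediction); negative means it predicted too high.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    k = 1  # number of explanatory variables
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return slope, intercept, r2, adjusted_r2, residuals


# Hypothetical average loan grade ranks and average interest rates.
grades = [8, 10, 11, 12, 13, 15]
rates = [9.1, 11.2, 11.8, 13.1, 13.9, 16.2]

slope, intercept, r2, adjusted_r2, residuals = simple_ols(grades, rates)
print(f"slope={slope:.3f}  intercept={intercept:.3f}")
print(f"R2={r2:.4f}  adjusted R2={adjusted_r2:.4f}")
print("residuals:", [round(r, 3) for r in residuals])
```

Adjusted R² is always slightly lower than R² for an imperfect fit because it penalizes the model for each explanatory variable. The residuals always sum to zero for a model with an intercept, so what matters for this workflow is not their total but where the positive and negative values fall on the map.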

Find and open the Ordinary Least Squares regression (OLS) tool and run it with the following parameters:
- Input Feature Class : ZIP3Data4Analysis
- Unique ID field : ZIP3NUM
- Output Feature Class : the name of your output feature class such as AveIntRatesOLSAveLoanGrade
- Dependent Variable : Average Interest Rate
- Explanatory Variable : Average Loan Grade
To see the OLS report, hover over the progress bar at the bottom of the Geoprocessing pane and click the icon to open the tool progress messages.

When the report opens, you may resize the window by using the cursor to grab the lower right corner of the message window. Every item in the message window is fully described in the tool documentation. For your present purposes, however, you are only interested in the adjusted R² value.

Notice that the adjusted R² value is 0.942152. This tells you that the average loan grade values explain about 94 percent of the variation in the average interest rate values. As expected, this is a high adjusted R² value (R² results range from 0 to 1.0).
The residual map produced by Ordinary Least Squares regression (OLS) is shown below. The blue areas are locations where the model predicted too high; the actual average interest rate is lower than expected, given the corresponding average loan grade. Similarly, the red areas are locations where the model predicted too low; the actual average interest rate is higher than expected, given the corresponding average loan grade. Notice that the underpredictions and overpredictions are far from randomly distributed. Most notable is the very strong cluster of overpredictions (blue) in Mississippi.
OLS Residual Map

Create a map showing the relationship between loan grades and interest rates

The residual map from Ordinary Least Squares regression (OLS) makes clear that average loan grades are not good predictors of average interest rates in Mississippi. You will use Geographically Weighted Regression (GWR) to further explore this relationship across the country.

With Ordinary Least Squares regression (OLS), only one model is constructed to represent all of the ZIP3s in your dataset. With only one model, only one coefficient is computed and this coefficient reflects the strength of the relationship between the average loan grade and average interest rate variables for all ZIP3s. Geographically Weighted Regression (GWR), on the other hand, creates a regression model for every ZIP3 in your dataset. This means there are 815 models and, correspondingly, 815 coefficients reflecting the strength of the relationship between loan grades and interest rates. Geographically Weighted Regression (GWR) computes a potentially different coefficient for every model. When you create a map of these coefficients, you can see how the relationship between loan grades and interest rates varies across the country.
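The difference between one global coefficient and a coefficient per location can be sketched with a distance-weighted regression. The example below is a simplified illustration, not the tool's algorithm: it assumes a Gaussian distance-decay kernel and uses six invented features in two distant clusters, where the relationship has a slope of 1 in the west and 2 in the east.

```python
import math


def local_slope(target, coords, x, y, bandwidth):
    """Slope of a distance-weighted least-squares line fitted at target.

    Weights follow a Gaussian kernel (an assumption for this sketch),
    so features near the target dominate the fit.
    """
    tx, ty = target
    weights = [
        math.exp(-0.5 * (math.hypot(cx - tx, cy - ty) / bandwidth) ** 2)
        for cx, cy in coords
    ]
    w_sum = sum(weights)
    mean_x = sum(w * xi for w, xi in zip(weights, x)) / w_sum
    mean_y = sum(w * yi for w, yi in zip(weights, y)) / w_sum
    sxx = sum(w * (xi - mean_x) ** 2 for w, xi in zip(weights, x))
    sxy = sum(
        w * (xi - mean_x) * (yi - mean_y)
        for w, xi, yi in zip(weights, x, y)
    )
    return sxy / sxx


# Hypothetical features: a western cluster and an eastern cluster.
coords = [(0, 0), (1, 0), (2, 0), (100, 0), (101, 0), (102, 0)]
grade = [1, 2, 3, 1, 2, 3]  # explanatory variable
rate = [1, 2, 3, 2, 4, 6]   # dependent variable

# One local model per feature, calibrated mostly from nearby features.
local_slopes = [local_slope(c, coords, grade, rate, bandwidth=5.0) for c in coords]

# A very large bandwidth weights every feature equally, which
# approximates a single global model.
global_slope = local_slope(coords[0], coords, grade, rate, bandwidth=1e9)

print("local slopes:", [round(s, 3) for s in local_slopes])
print("global slope:", round(global_slope, 3))
```

The single global slope lands at about 1.5, a value that describes neither cluster well, while the local slopes recover roughly 1 in the west and 2 in the east. Mapping the local coefficients is what reveals this kind of regional variation.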

Geographically Weighted Regression (GWR) calibrates each local model using only nearby ZIP3s. Further, this subset of ZIP3s is weighted to give the nearer ZIP3s more influence on the calibration process than ZIP3s that are farther away. Your parameter choices for Kernel type and Bandwidth method will determine which neighboring features are in or out of the calibration process.

If you select Fixed for the Kernel type parameter, it means a particular distance band will determine if a feature is included. If you select Adaptive, it means a particular number of nearest neighboring features will determine if a feature is included. The Fixed kernel has the advantage of ensuring the scale of analysis remains constant.
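The practical difference between the two kernel types shows up when features are unevenly spaced. This sketch compares a distance-based neighbor search with a count-based one, using invented points: a dense eastern cluster and a sparse western group. The function names are hypothetical helpers written for this example.

```python
import math


def fixed_neighbors(target, points, distance):
    """Names of all other features within a fixed distance of target."""
    tx, ty = points[target]
    return sorted(
        name
        for name, (x, y) in points.items()
        if name != target and math.hypot(x - tx, y - ty) <= distance
    )


def adaptive_neighbors(target, points, k):
    """Names of the k nearest other features, however far away they are."""
    tx, ty = points[target]
    ranked = sorted(
        (math.hypot(x - tx, y - ty), name)
        for name, (x, y) in points.items()
        if name != target
    )
    return sorted(name for _, name in ranked[:k])


# Hypothetical centroids: small, dense polygons in the east and
# large, sparse polygons in the west.
points = {
    "E1": (0, 0), "E2": (1, 0), "E3": (0, 1), "E4": (1, 1),
    "W1": (-50, 0), "W2": (-80, 0), "W3": (-50, 40),
}

print("Fixed, E1:", fixed_neighbors("E1", points, distance=2))
print("Fixed, W1:", fixed_neighbors("W1", points, distance=2))
print("Adaptive, E1:", adaptive_neighbors("E1", points, k=2))
print("Adaptive, W1:", adaptive_neighbors("W1", points, k=2))
```

With a fixed distance of 2, E1 finds three neighbors while W1 finds none, so the scale is constant but the amount of information per model is not. With two adaptive neighbors, every feature gets the same number of neighbors, but W1 has to reach 40 units to find them while E1 reaches only 1 unit, so the scale of analysis varies across the study area.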

The Bandwidth method you select indicates the criteria that will be used to find an optimal distance band or optimal number of neighbors. Both the Akaike Information Criterion (AICc) and the Cross Validation (CV) methods are appropriate. If you are using Geographically Weighted Regression (GWR) to get accurate predictions, you will want to make sure your model includes all the key explanatory variables and then experiment with both bandwidth methods to see which one yields the highest adjusted R² and lowest AICc values. Both of these diagnostics are provided in the GWR output report.

Here, you are using Geographically Weighted Regression (GWR) only to explore the relationship between loan grades and interest rates.

Find and open the Geographically Weighted Regression (GWR) tool and run it with the following parameters:
- Input features : ZIP3Data4Analysis
- Dependent variable : Average Interest Rate
- Explanatory variable(s) : Average Loan Grade Rank
- Output feature class : the name of your output feature class such as AveIntRatesGWRAveLoanGrade
- Kernel type : Fixed
- Bandwidth method : Akaike Information Criterion

Ouch! The tool gives an error. It sounds serious, too!

Actually, this message is more of a caution than a stop sign. When you see this error, you should check the following:

- Do you have sufficient features? Because Geographically Weighted Regression (GWR) creates local models, you should have 100 or more features in your Input features dataset. You have 815, so this is not the problem.
- Geographically Weighted Regression (GWR) will fail if two or more explanatory variables are redundant. In your case, there is only one explanatory variable in your model, so this is not a problem.
- Geographically Weighted Regression (GWR) will also fail when there isn't much variation in the explanatory variable values. Since the loan grades are mean rankings with little variation, this may be what is causing the problem. This may not be fatal, however. Often subtracting the mean from the explanatory variable values allows Geographically Weighted Regression (GWR) to solve. Let's try this.
Note:
There are additional suggestions given in the tool documentation for dealing with the severe model design error. Once you've ruled out all other problems, you should identify the explanatory variables that are creating the problem (remove each explanatory variable one by one to identify the stinkers) and try transforming them as described below.

Use the Add Field tool with the following parameters to create a field to hold the transformed values.
- Input Table : ZIP3Data4Analysis
- Field Name : tAveLoanGrade
- Field Type : Float

Determine the mean value for the Average Loan Grade Ranking field by creating a histogram (which also reports the mean, median, and standard deviation).

Right-click the ZIP3Data4Analysis layer in the Contents pane, select Create Chart, and choose Histogram. On the Chart pane, set the X-Axis: Number parameter to Average Loan Grade Rank.

Notice that the mean value is 12.17.

Histogram of average loan grade rankings

Use the Calculate Field tool to compute the transformation of subtracting the mean value from each of the AveLoanGrade values:
- Input Table : ZIP3Data4Analysis
- Field Name : tAveLoanGrade
- Expression : !AveLoanGrade! - 12.17
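The transformation is simple enough to check by hand. The sketch below mean-centers a small set of invented grade values and shows two things: the slope of the fitted line does not change, and the centered values are no longer nearly collinear with the intercept term. The second point is offered as one plausible reason the transformation helps a regression solve; the workflow itself only states that it often does. All data values and helper names here are hypothetical.

```python
import math
from statistics import fmean

# Hypothetical values clustered tightly around 12, as mean rankings would be.
ave_loan_grade = [11.6, 12.0, 12.1, 12.3, 12.9]
ave_interest_rate = [11.9, 12.6, 12.8, 13.3, 14.4]

# Same idea as the Calculate Field expression: value minus the mean.
grade_mean = fmean(ave_loan_grade)
t_ave_loan_grade = [g - grade_mean for g in ave_loan_grade]


def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line."""
    mean_x, mean_y = fmean(x), fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cosine_with_intercept(x):
    """Cosine of the angle between x and a column of ones.

    Values near 1 mean x is almost indistinguishable from the
    intercept column; 0 means the two are orthogonal.
    """
    n = len(x)
    return sum(x) / (math.sqrt(n) * math.sqrt(sum(xi * xi for xi in x)))


slope_raw, intercept_raw = fit_line(ave_loan_grade, ave_interest_rate)
slope_centered, intercept_centered = fit_line(t_ave_loan_grade, ave_interest_rate)

print(f"slope raw={slope_raw:.4f}  centered={slope_centered:.4f}")
print(f"intercept raw={intercept_raw:.4f}  centered={intercept_centered:.4f}")
print(f"cosine raw={cosine_with_intercept(ave_loan_grade):.4f}")
print(f"cosine centered={cosine_with_intercept(t_ave_loan_grade):.4f}")
```

Because the slope is unchanged, the coefficient you map later still describes the same relationship between loan grades and interest rates. Only the intercept shifts: after centering it equals the mean interest rate.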
Run Geographically Weighted Regression (GWR) again, this time using the transformed variable:
- Input features : ZIP3Data4Analysis
- Dependent variable : Average Interest Rate
- Explanatory variable(s) : tAveLoanGrade
- Output feature class : the name of your output feature class such as AveIntRatesGWRAveLoanGrade2
- Kernel type : Fixed
- Bandwidth method : Akaike Information Criterion
This time, there is no error. Expand the status area at the bottom of the Geoprocessing pane to see the tool execution messages. Notice that the model has improved. Because Geographically Weighted Regression (GWR) allows relationships to change at each ZIP3, the adjusted R² value is now almost 0.97 (up from 0.94), meaning the model explains almost 97 percent of the variation in average interest rates.

Like Ordinary Least Squares regression (OLS), the output map from Geographically Weighted Regression (GWR) shows where the model predictions are either higher or lower than the actual average interest rate values. More interesting to your objectives, the output layer also contains a field with the coefficient value for each ZIP3. The larger the coefficient, the stronger the relationship is between average interest rates and average loan grades.

Right-click the AveIntRatesGWRAveLoanGrade2 layer in the Contents pane and choose Symbology.
Set the Field to Coefficient #1 tAveLoanGrade. Set the Method to Quantile. Pick a color ramp that best represents small to large (a graduated color ramp rather than a divergent color ramp). The Yellow-Orange-Brown continuous color ramp, for example, works well.
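The Quantile method places roughly the same number of features in each class, which is why it spreads the color ramp evenly across the map even when the coefficient values are bunched together. The sketch below shows one simple way to derive such class breaks. The coefficient values are invented, and the exact break placement ArcGIS Pro uses may differ from this approach.

```python
from bisect import bisect_left


def quantile_breaks(values, classes):
    """Upper break value of each class, with near-equal counts per class."""
    ordered = sorted(values)
    n = len(ordered)
    breaks = []
    for c in range(1, classes + 1):
        # Index of the last value in class c (integer ceiling of c*n/classes).
        last = (c * n + classes - 1) // classes - 1
        breaks.append(ordered[last])
    return breaks


def classify(value, breaks):
    """Zero-based class index: the first class whose break is >= value."""
    return bisect_left(breaks, value)


# Hypothetical GWR coefficients, in no particular order.
coefficients = [1.10, 0.82, 1.42, 0.95, 1.21, 0.91, 1.04, 1.30, 1.02, 1.15]

breaks = quantile_breaks(coefficients, classes=5)
counts = [0] * 5
for value in coefficients:
    counts[classify(value, breaks)] += 1

print("breaks:", breaks)  # [0.91, 1.02, 1.1, 1.21, 1.42]
print("counts:", counts)  # [2, 2, 2, 2, 2]
```

Each of the five classes holds two features, so each shade of the color ramp covers the same number of ZIP3s. The trade-off is that the class widths are uneven, so two features in the same class can differ more than two features in adjacent classes.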

The lightest areas are the locations where average loan grades have a weak correlation to average interest rates. The darkest areas are locations where this relationship is strong.

GWR coefficient map: relationship between average interest rates and average loan grade rankings

Final thoughts

This case study presents a workflow that tests the assumed, unquestioned relationship between average loan grades and average interest rates. It evaluates notions of equity and disparate impacts that manifest spatially. These same methods could be applied to other applications where there are assumed correlations. For example, locations with higher average incomes should pay higher average income taxes. Is this consistently true? Where is it less or more true across the country? Locations with appropriate growing conditions should produce higher yields. Is this always the case, everywhere? If not, why not? Schools with better teacher-to-student ratios should have higher test scores. Testing and mapping these expected relationships may lead to unexpected, and very interesting, findings.