The geography of online lending, workflow—Analytics

Workflow using ArcMap

Create a hot spot map of average interest rates

If you haven't done so already, download and unzip the data package provided at the top of this workflow.
Double-click the OnlineLendingAnalysis.mpk map package to open it.
Open the table for the ZIP3LoanData layer by right-clicking the layer in the Table of Contents and clicking Open Attribute Table.

Notice that each ZIP3 (each three-digit ZIP Code) has data for the total number of loan applications submitted, the total number of loans issued (accepted loans), the average interest rate for all loans issued, average loan grade ranking for all loans issued, and total number of households.

To avoid the small numbers problem (Kennedy, 1989) and ensure the average interest rate reported for each ZIP3 is both reliable and representative, you will focus your analysis on ZIP3s where at least 30 loans have been funded.

Use the Search window to find and open the Select Layer By Attribute tool.
To select the ZIP3s with at least 30 loans, run the Select Layer By Attribute tool with the following parameters:
- Layer Name or Table View : ZIP3LoanData
- Selection type : NEW_SELECTION
- Expression : AcceptedLoans >= 30

Find and open the Copy Features tool and use it to create a new feature class containing only the ZIP3s that have at least 30 accepted loans:
- Input Features : ZIP3LoanData
- Output Feature Class : the name of your output feature class such as ZIP3Data4Analysis
Right-click ZIP3LoanData in the Table of Contents and Remove it from the map document so it is out of the way. You will use ZIP3Data4Analysis for the remaining analyses.
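The selection and copy steps above amount to a simple filter. The sketch below shows the same logic in plain Python, outside ArcMap; the three records and their values are made up for illustration, and only the field names come from the ZIP3LoanData table.

```python
# Hypothetical ZIP3 records; values are invented, field names mirror ZIP3LoanData.
zip3_loan_data = [
    {"ZIP3": "100", "AcceptedLoans": 412, "AveInterestRate": 13.1},
    {"ZIP3": "590", "AcceptedLoans": 12, "AveInterestRate": 17.9},
    {"ZIP3": "303", "AcceptedLoans": 30, "AveInterestRate": 13.8},
]

MIN_LOANS = 30  # threshold used to avoid the small numbers problem


def select_reliable(records, min_loans=MIN_LOANS):
    """Return copies of the records with at least min_loans accepted loans."""
    return [dict(r) for r in records if r["AcceptedLoans"] >= min_loans]


# Equivalent of Select Layer By Attribute followed by Copy Features.
zip3_data_4_analysis = select_reliable(zip3_loan_data)
print([r["ZIP3"] for r in zip3_data_4_analysis])
```

Note that a ZIP3 with exactly 30 accepted loans is kept, which matches the >= operator in the tool expression.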

To create a hot spot map of the average loan interest rates, you will use the Hot Spot Analysis tool. In addition to Input Feature Class, Input Field, and Output Feature Class, the tool has several other parameters, including Conceptualization of Spatial Relationships, Distance Band or Threshold Distance, and a check box to indicate whether you want to apply a False Discovery Rate (FDR) Correction.

The default Conceptualization of Spatial Relationships is a fixed distance band (FIXED_DISTANCE_BAND). The advantage of this choice is that it keeps the scale of your analysis consistent (fixed) across the study area.

More details:

Spatial analysis tools, like Hot Spot Analysis, evaluate each feature (each ZIP3 in this case) within the context of neighboring features; your selection for the Conceptualization of Spatial Relationships parameter determines which features will be neighbors and which will not. If you select one of the polygon contiguity methods, for example, a feature's neighbors will be those that touch it. When the polygons in a dataset have very different sizes, this conceptualization will result in different scales of analysis across your study area. Notice in the ZIP3 maps below that polygons in the eastern part of the country are much smaller than polygons in the western part of the country. Consequently, a feature and its neighbors in the west will cover a much larger area (will have a larger scale of analysis) than a feature and its neighbors in the east.

Keeping the scale of analysis the same across your study area is often best because of something called MAUP (the Modifiable Areal Unit Problem). When you change your scale of analysis, you can get different answers (often because when you change your scale of analysis, you change the question you are asking).

Here is a simple example. Suppose you want to ask the question: are there sufficient physicians to serve the population? You look at the number of people and the number of physicians in a particular county (scale of analysis is the county) and see that the ratio matches the country as a whole. You answer: Yes, there are sufficient physicians. Next, you change your scale of analysis to ZIP Codes. Examining the ZIP Codes throughout the county, you notice that all of the people live in the southernmost ZIP Codes and all of the physicians are located far away in the northernmost ZIP Codes. Your answer is now: No, there are not sufficient physicians. Notice that the answers are 180 degrees different, yet both are correct for their particular scale of analysis.
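The physician example can be worked through with a few lines of arithmetic. In the sketch below, the county, its four ZIP Codes, the counts, and the national benchmark of 0.5 physicians per 1,000 people are all hypothetical values chosen only to reproduce the situation described above.

```python
# Hypothetical county made of four ZIP Codes, listed north to south.
zip_codes = [
    {"name": "north-1", "people": 0, "physicians": 6},
    {"name": "north-2", "people": 0, "physicians": 4},
    {"name": "south-1", "people": 12000, "physicians": 0},
    {"name": "south-2", "people": 8000, "physicians": 0},
]
NATIONAL_PHYSICIANS_PER_1000 = 0.5  # assumed benchmark, not real data


def physicians_per_1000(people, physicians):
    """Physicians per 1,000 residents; None when nobody lives in the area."""
    if people == 0:
        return None
    return 1000.0 * physicians / people


# Scale of analysis: the county as a whole.
county_people = sum(z["people"] for z in zip_codes)
county_physicians = sum(z["physicians"] for z in zip_codes)
county_rate = physicians_per_1000(county_people, county_physicians)

# Scale of analysis: individual ZIP Codes.
zip_rates = {
    z["name"]: physicians_per_1000(z["people"], z["physicians"])
    for z in zip_codes
}
underserved = [
    name for name, rate in zip_rates.items()
    if rate is not None and rate < NATIONAL_PHYSICIANS_PER_1000
]

print("County rate:", county_rate)
print("Underserved ZIP Codes:", underserved)
```

At the county scale the rate equals the benchmark, while at the ZIP Code scale every populated ZIP Code falls below it. The data are identical; only the scale of analysis changed.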

The default value for the Distance Band or Threshold Distance parameter is the minimum distance to ensure every feature (every ZIP3) has at least one neighbor. Often, this is not the best choice (see Selecting a fixed-distance band value). In this case, however, you are analyzing individual loan application data that, unfortunately, is only identified geographically by a three-digit ZIP Code; no other location information is available. Because the default distance provided by Hot Spot Analysis, when no distance value is given, represents the minimum valid distance (scale of analysis), it is the most appropriate distance band value for this data.

Hot Spot Analysis will visit each ZIP3 and compute the average interest rate for that ZIP3 and any surrounding ZIP3s within the Distance Band or Threshold Distance specified. If this local average interest rate is significantly higher than the average interest rate for all ZIP3s across the country, it is a hot spot. If this local interest rate is significantly lower than the average interest rate for all ZIP3s across the country, it is a cold spot. Applying the FDR correction to account for multiple testing and spatial dependence is always a good idea.
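The Hot Spot Analysis tool is built on the Getis-Ord Gi* statistic, which turns the comparison of local and global averages described above into a z-score for each feature. The following is a minimal pure-Python sketch of that statistic with binary fixed-distance weights. It is not the tool's implementation, and the six point locations and interest rates are invented for illustration.

```python
import math


def gi_star(points, values, distance_band):
    """Getis-Ord Gi* z-score per feature using binary fixed-distance weights.

    points: list of (x, y) tuples; values: list of floats.
    Each feature counts as its own neighbor (the "star" in Gi*).
    """
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    z_scores = []
    for xi, yi in points:
        # Weight is 1 for features inside the distance band, 0 otherwise.
        w = [
            1.0 if math.hypot(xi - xj, yi - yj) <= distance_band else 0.0
            for xj, yj in points
        ]
        sum_w = sum(w)
        sum_w2 = sum(wi * wi for wi in w)
        local_sum = sum(wi * v for wi, v in zip(w, values))
        denom = s * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        z_scores.append((local_sum - mean * sum_w) / denom)
    return z_scores


# Hypothetical centroids along a line, with high rates at one end.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)]
rates = [18.0, 17.5, 14.0, 13.5, 11.0, 10.5]
z_scores = gi_star(points, rates, distance_band=1.0)
for point, z in zip(points, z_scores):
    print(point, round(z, 2))
```

Large positive z-scores mark candidate hot spots and large negative z-scores mark candidate cold spots. The sketch assumes no feature has every other feature as a neighbor, because the denominator is zero in that case.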

Each of the Hot Spot Analysis parameters is described in the tool documentation, including information on how to select an appropriate Conceptualization of Spatial Relationships and how to select an appropriate fixed-distance band value.

Find and open the Hot Spot Analysis tool and run it with the following parameters.
- Input Feature Class : ZIP3Data4Analysis
- Input Field : AveInterestRate
- Output Feature Class : the name of your output feature class such as InterestRateHSA
- Conceptualization of Spatial Relationships : FIXED_DISTANCE_BAND
- Apply False Discovery Rate (FDR) Correction : checked
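To get a feel for what an FDR correction does, consider the standard Benjamini-Hochberg procedure sketched below. Treat it as an illustration of the general idea only: the tool documentation describes the exact correction Hot Spot Analysis applies, and the p-values here are hypothetical.

```python
def fdr_significant(p_values, alpha=0.05):
    """Benjamini-Hochberg procedure: return one True/False flag per p-value."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # The threshold tightens for the smallest p-values.
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    flags = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            flags[idx] = True
    return flags


# Hypothetical p-values, one per feature.
p_values = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
uncorrected = [p < 0.05 for p in p_values]
corrected = fdr_significant(p_values, alpha=0.05)
print("Significant without correction:", sum(uncorrected))
print("Significant with FDR correction:", sum(corrected))
```

With these values, four features pass the uncorrected 0.05 threshold but only two survive the correction, which is why a map produced with the correction checked typically shows fewer significant features.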

The red areas below are hot spots (statistically significant clusters of high loan interest rates). The blue areas are cold spots (statistically significant clusters of low interest rates). Notice that Alabama has higher-than-expected average interest rates.

Hot Spot Analysis of average interest rates

Create a model of average interest rate values

You will next see how well average loan grade rankings explain (predict) average interest rates using Ordinary Least Squares regression (OLS). If average loan grade values effectively predict the average interest rate values, you should get a high R² value, and the model overpredictions and underpredictions (residuals) should exhibit a spatially random pattern.

The OLS tool writes its output report to the Results window and, when you turn off background processing, it also displays tool messages in a Progress dialog box.

To ensure you see the OLS report while the tool runs, turn off background processing by clicking the Geoprocessing menu item and clicking Geoprocessing Options. Uncheck the Enable box for Background Processing.
Now that you've disabled background processing, find and open the Ordinary Least Squares regression (OLS) tool and run it with the following parameters:
- Input Feature Class : ZIP3Data4Analysis
- Unique ID field : ZIP3NUM
- Output Feature Class : the name of your output feature class such as AveIntRatesOLSAveLoanGrade
- Dependent Variable : AveInterestRate
- Explanatory Variable : AveLoanGrade

The output from your OLS model is shown below. Every item in that report is fully described in the tool documentation. For your present purposes, however, you are only interested in the R² value.

Notice that the adjusted R² value is 0.942152. This tells you that the average loan grade values explain about 94 percent of the variation in the average interest rate values. As expected, this is a high adjusted R² value (R² results range from 0.00 to 1.00).
The residual map produced by Ordinary Least Squares (OLS) is shown below. The blue areas are locations where the model predicted too high; the actual average interest rate is lower than expected, given the corresponding average loan grade. Similarly, the red areas are locations where the model predicted too low; the actual average interest rate is higher than expected, given the corresponding average loan grade. Notice that the underpredictions and overpredictions are far from randomly distributed. Most notable is the very strong cluster of overpredictions (blue) in Mississippi.
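The quantities discussed above (R², adjusted R², and the sign of each residual) can be reproduced for a single explanatory variable with a short sketch. The six loan grade and interest rate pairs are hypothetical; the tool itself reports many more diagnostics than shown here.

```python
def fit_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    predicted = [intercept + slope * xi for xi in x]
    # Residual = actual minus predicted.
    residuals = [yi - pi for yi, pi in zip(y, predicted)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    k = 1  # number of explanatory variables
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return {"slope": slope, "intercept": intercept, "r2": r2,
            "adj_r2": adj_r2, "residuals": residuals}


# Hypothetical values standing in for AveLoanGrade and AveInterestRate.
ave_loan_grade = [8.0, 10.0, 11.0, 12.0, 13.0, 15.0]
ave_interest_rate = [10.9, 12.4, 13.2, 13.6, 14.9, 16.1]
model = fit_ols(ave_loan_grade, ave_interest_rate)

# Positive residual: actual rate is higher than predicted (underprediction, red).
# Negative residual: actual rate is lower than predicted (overprediction, blue).
labels = ["underprediction" if r > 0 else "overprediction"
          for r in model["residuals"]]
print(round(model["r2"], 3), round(model["adj_r2"], 3))
print(labels)
```

The sketch only computes the residuals; judging whether they are spatially random requires mapping them or running a spatial autocorrelation test on them.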

Create a map showing the relationship between loan grades and interest rates

The residual map from Ordinary Least Squares (OLS) makes clear that average loan grades are not good predictors of average interest rates in Mississippi. You will use Geographically Weighted Regression (GWR) to further explore this relationship across the country.

With Ordinary Least Squares (OLS), only one model is constructed to represent all of the ZIP3s in your dataset. With only one model, only one coefficient is computed, and this coefficient reflects the strength of the relationship between the average loan grade and average interest rate variables, for all ZIP3s. Geographically Weighted Regression (GWR), on the other hand, creates a regression model for every ZIP3 in your dataset. This means there are 815 models and, correspondingly, 815 coefficients reflecting the strength of the relationship between loan grades and interest rates. Geographically Weighted Regression (GWR) computes a potentially different coefficient for every model. When you create a map of these coefficients, you can see how the relationship between loan grades and interest rates varies across the country.

Geographically Weighted Regression (GWR) calibrates each local model using only nearby ZIP3s. Further, this subset of ZIP3s is weighted to give the nearer ZIP3s more influence on the calibration process than ZIP3s that are farther away. Your parameter choices for Kernel type and Bandwidth method will determine which neighboring features are in or out of the calibration process.

If you select FIXED for the Kernel type parameter, a particular distance band will determine whether a feature is included. If you select ADAPTIVE, a particular number of nearest neighboring features will determine whether a feature is included. The FIXED kernel has the advantage of ensuring the scale of analysis remains constant.
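The idea of one distance-weighted model per feature can be sketched in a few lines. The example below assumes a Gaussian weighting function and a fixed bandwidth of one map unit; both are illustrative choices, the data are invented, and the code is not the GWR tool's implementation.

```python
import math


def local_coefficients(points, x, y, bandwidth):
    """Fit one distance-weighted regression per feature.

    Returns a list of (intercept, slope) pairs, one per feature.
    A FIXED kernel uses the same bandwidth distance at every location.
    """
    results = []
    for px, py in points:
        # Gaussian weights: nearer features influence the local model more.
        w = [
            math.exp(-0.5 * (math.hypot(px - qx, py - qy) / bandwidth) ** 2)
            for qx, qy in points
        ]
        sw = sum(w)
        mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sw
        mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mean_x) * (yi - mean_y)
                  for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx
        results.append((mean_y - slope * mean_x, slope))
    return results


# Hypothetical features along a west-to-east line. In the west the rate rises
# by 1.0 per grade step; in the east it rises by only 0.2 per grade step.
points = [(i, 0) for i in range(8)]
grade = [8.0, 10.0, 12.0, 14.0, 8.0, 10.0, 12.0, 14.0]
rate = [13.0, 15.0, 17.0, 19.0, 12.6, 13.0, 13.4, 13.8]

slopes = [slope for _, slope in local_coefficients(points, grade, rate, 1.0)]
print([round(s, 2) for s in slopes])
```

A single global model would report one compromise slope for all eight features, whereas the local slopes recover the strong relationship in the west and the weak one in the east. An ADAPTIVE kernel would instead vary the bandwidth by location so that each local model draws on a set number of neighbors.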

The Bandwidth method you select indicates the criterion that will be used to find an optimal distance band or optimal number of neighbors. Both the Akaike Information Criterion (AICc) and the Cross Validation (CV) methods are appropriate. If you are using Geographically Weighted Regression (GWR) to get accurate predictions, you will want to make sure your model includes all the key explanatory variables and experiment with both bandwidth methods to see which one yields the highest adjusted R² and lowest AICc values. Both of these diagnostics are provided in the GWR output report.

Here, you are using Geographically Weighted Regression (GWR) only to explore the relationship between loan grades and interest rates.

Find and open the Geographically Weighted Regression (GWR) tool and run it with the following parameters:
- Input features : ZIP3Data4Analysis
- Dependent variable : AveInterestRate
- Explanatory variable(s) : AveLoanGrade
- Output feature class : the name of your output feature class such as AveIntRatesGWRAveLoanGrade
- Kernel type : FIXED
- Bandwidth method : AICc

Ouch! The tool gives an error. It sounds serious, too!

Actually, this message is more of a caution than a stop sign. When you see this error, you should check the following:

Do you have sufficient features? Because Geographically Weighted Regression (GWR) creates local models, you should have 100 or more features in your Input features dataset. You have 815, so this is not the problem.
Geographically Weighted Regression (GWR) will fail if two or more explanatory variables are redundant. In your case, there is only one explanatory variable in your model, so this is not a problem.
Geographically Weighted Regression (GWR) will also fail when there isn't much variation in the explanatory variable values. Since the loan grades are mean rankings with little variation, this may be what is causing the problem. This may not be fatal, however. Often, subtracting the mean from the explanatory variable values allows Geographically Weighted Regression (GWR) to solve. Let's try this.
Note:
There are additional suggestions given in the tool documentation for dealing with the severe model design error. Once you've ruled out all other problems, you should identify the explanatory variables that are creating the problem (remove each explanatory variable one by one to identify the culprits) and try transforming them as described below.

Use the Add Field tool with the following parameters to create a field to hold the transformed values.
- Input Table : ZIP3Data4Analysis
- Field Name : tAveLoanGrade
- Field Type : FLOAT
Open the ZIP3Data4Analysis table by right-clicking the layer in the Table of Contents and clicking Open Attribute Table.
Right-click the Average Loan Grade Rank (AveLoanGrade) field and click Statistics.

Notice that the mean value is 12.166066. You will subtract this mean value from each average loan grade rank value (AveLoanGrade).

Highlight the mean value and copy it using Ctrl+C.
Use the Calculate Field tool to subtract the mean value from each of the AveLoanGrade values (if you copied the mean value using Ctrl+C in the previous step, you can paste it into the second part of the Expression using Ctrl+V; otherwise, just type the Expression as shown):
- Input Table : ZIP3Data4Analysis
- Field Name : tAveLoanGrade
- Expression : [AveLoanGrade] - 12.166066
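The transformation is ordinary mean centering. The sketch below uses a handful of hypothetical values to show its two relevant properties: the centered values average to zero, and the regression slope is unchanged, so the coefficient keeps the same interpretation. The sketch computes the mean from its own toy data rather than using the 12.166066 reported by the Statistics window.

```python
def mean_center(values):
    """Subtract the mean from every value; return (centered values, mean)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values], mean


def ols_slope(x, y):
    """Slope of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical AveLoanGrade values with little variation around their mean.
ave_loan_grade = [11.8, 12.0, 12.1, 12.2, 12.3, 12.6]
ave_interest_rate = [13.0, 13.3, 13.5, 13.6, 13.8, 14.2]

t_ave_loan_grade, grade_mean = mean_center(ave_loan_grade)

print("Mean before centering:", round(grade_mean, 6))
print("Mean after centering:", round(sum(t_ave_loan_grade) / 6, 6))
print("Slope before:", round(ols_slope(ave_loan_grade, ave_interest_rate), 6))
print("Slope after:", round(ols_slope(t_ave_loan_grade, ave_interest_rate), 6))
```

Only the intercept changes: after centering, it represents the predicted interest rate at the average loan grade rather than at a loan grade of zero.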
Run Geographically Weighted Regression (GWR) again, this time using the transformed variable:
- Input features : ZIP3Data4Analysis
- Dependent variable : AveInterestRate
- Explanatory variable(s) : tAveLoanGrade
- Output feature class : the name of your output feature class such as AveIntRatesGWRAveLoanGrade2
- Kernel type : FIXED
- Bandwidth method : AICc
This time, there is no error. Also, notice that Geographically Weighted Regression (GWR), because it allows relationships to change at each ZIP3, has improved the model results. The adjusted R² value is now almost 97 percent (up from 94 percent).

Like Ordinary Least Squares (OLS), the output map from Geographically Weighted Regression (GWR) shows where the model overpredicted and underpredicted the average interest rate values. More interesting to your objectives, the output layer also contains a field with the coefficient value for each ZIP3. The larger the coefficient, the stronger the relationship is between average interest rates and average loan grades.

Right-click the AveIntRatesGWRAveLoanGrade2 layer in the Table of Contents, click Properties, and click the Symbology tab.
Set the field Value to Coefficient #1 tAveLoanGrade. Set the Classification to Quantile. Finally, select a color ramp that best represents small to large (a graduated color ramp rather than a divergent color ramp). The Yellow to Dark Red color ramp, for example, works well.
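Quantile classification places roughly the same number of features in each class, so every color on the map covers a similar number of ZIP3s. The sketch below is one simple way to compute such class breaks; ArcMap may position its breaks slightly differently, and the coefficient values are hypothetical.

```python
import math


def quantile_breaks(values, n_classes):
    """Upper class limits giving each class about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    breaks = []
    for k in range(1, n_classes + 1):
        # Position of the last value that belongs to class k.
        idx = max(0, math.ceil(k * n / n_classes) - 1)
        breaks.append(ordered[idx])
    return breaks


def classify(value, breaks):
    """Return the zero-based class index for a value."""
    for class_index, upper in enumerate(breaks):
        if value <= upper:
            return class_index
    return len(breaks) - 1


# Hypothetical GWR coefficients for ten ZIP3s.
coefficients = [0.61, 0.75, 0.58, 0.80, 0.66, 0.72, 0.69, 0.83, 0.77, 0.64]
breaks = quantile_breaks(coefficients, n_classes=5)

counts = [0] * len(breaks)
for c in coefficients:
    counts[classify(c, breaks)] += 1

print("Class breaks:", breaks)
print("Features per class:", counts)
```

With ten values and five classes, each class receives two features. Class 0 would be drawn in the lightest color of a graduated ramp and class 4 in the darkest.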

The lightest areas are the locations where average loan grades have a weak correlation to average interest rates. The darkest areas are locations where this relationship is strong.

Relationship between average interest rates and average loan grade rankings

Final thoughts

This case study presents a workflow that tests the assumed, unquestioned relationship between average loan grades and average interest rates. It evaluates notions of equity and disparate impact as they appear across space. These same methods could be applied to other applications where there are assumed correlations. For example, locations with higher average incomes should pay higher average income taxes. Is this consistently true? Where is it less or more true across the country? Locations with appropriate growing conditions should produce higher yields. Is this always the case everywhere? If not, why not? Schools with better teacher-to-student ratios should have higher test scores. Testing and mapping these expected relationships may lead to unexpected, and very interesting, findings.