Workflow using ArcMap
Create a hot spot map of average interest rates
- If you haven't done so already, download and unzip the data package provided at the top of this workflow.
- Double-click the OnlineLendingAnalysis.mpk map package to open it.
- Open the table for the ZIP3LoanData layer by right-clicking the layer in the Table of Contents and clicking Open Attribute Table.
- Use the Search window to find and open the Select Layer By Attribute tool.
- To select the ZIP3s with at least 30 accepted loans, run the Select Layer By Attribute tool with the following parameters:
- Layer Name or Table View : ZIP3LoanData
- Selection type : NEW_SELECTION
- Expression : AcceptedLoans >= 30
- Find and open the Copy Features tool and use it to create a new feature class containing only the ZIP3s that have at least 30 accepted loans:
- Input Features : ZIP3LoanData
- Output Feature Class : the name of your output feature class such as ZIP3Data4Analysis
- Right-click ZIP3LoanData in the Table of Contents and Remove it from the map document so it is out of the way. You will use ZIP3Data4Analysis for the remaining analyses.
- Find and open the Hot Spot Analysis tool and run it with the following parameters:
- Input Feature Class : ZIP3Data4Analysis
- Input Field : AveInterestRate
- Output Feature Class : the name of your output feature class such as InterestRateHSA
- Conceptualization of Spatial Relationships : FIXED_DISTANCE_BAND
- Apply False Discovery Rate (FDR) Correction : checked
Notice that each ZIP3 (each three-digit ZIP Code) has data for the total number of loan applications submitted, the total number of loans issued (accepted loans), the average interest rate for all loans issued, average loan grade ranking for all loans issued, and total number of households.
To avoid the small numbers problem (Kennedy, 1989) and ensure the average interest rate reported for each ZIP3 is both reliable and representative, you will focus your analysis on ZIP3s where at least 30 loans have been funded.
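The 30-loan threshold can be illustrated outside ArcMap with a few lines of Python. This is only a sketch of the selection logic: the records below are hypothetical stand-ins for rows of the ZIP3LoanData table, and `select_reliable` is an illustrative helper, not part of any Esri tool.

```python
# Hypothetical records standing in for rows of the ZIP3LoanData attribute table.
zip3_rows = [
    {"ZIP3": "100", "AcceptedLoans": 412, "AveInterestRate": 13.1},
    {"ZIP3": "059", "AcceptedLoans": 7, "AveInterestRate": 17.9},
    {"ZIP3": "331", "AcceptedLoans": 30, "AveInterestRate": 13.6},
]

MIN_LOANS = 30  # threshold used in this workflow


def select_reliable(rows, min_loans=MIN_LOANS):
    """Keep only rows whose average is based on at least min_loans loans."""
    return [row for row in rows if row["AcceptedLoans"] >= min_loans]


reliable = select_reliable(zip3_rows)
print([row["ZIP3"] for row in reliable])  # ['100', '331']
```

The ZIP3 with only 7 loans is dropped because an average over so few loans can swing widely with a single unusual loan.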
To create a hot spot map of the average loan interest rates, you will use the Hot Spot Analysis tool. The tool has a number of parameters. In addition to Input Feature Class, Input Field, and Output Feature Class, there are several other parameters including Conceptualization of Spatial Relationships, Distance Band or Threshold Distance, and a check box to indicate whether or not you want to apply a False Discovery Rate Correction (FDR).
The default Conceptualization of Spatial Relationships is a fixed distance band (FIXED_DISTANCE_BAND). The advantage of this choice is that it keeps the scale of your analysis consistent (fixed) across the study area.
The default value for the Distance Band or Threshold Distance parameter is the minimum distance to ensure every feature (every ZIP3) has at least one neighbor. Often, this is not the best choice (see Selecting a fixed-distance band value). In this case, however, you are analyzing individual loan application data that, unfortunately, is only identified geographically by a three-digit ZIP Code; no other location information is available. Because the default distance provided by Hot Spot Analysis, when no distance value is given, represents the minimum valid distance (scale of analysis), it is the most appropriate distance band value for this data.
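The idea of a minimum distance at which every feature has at least one neighbor can be sketched as the largest nearest-neighbor distance in the dataset. The example below assumes each ZIP3 is represented by a single point (for instance a centroid) in projected coordinates; the coordinates are made up, and the tool's own computation may differ in detail.

```python
import math


def min_distance_for_one_neighbor(points):
    """Smallest band in which every point has at least one neighbor.

    This equals the largest nearest-neighbor distance in the dataset.
    """
    largest = 0.0
    for i, (xi, yi) in enumerate(points):
        nearest = min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        largest = max(largest, nearest)
    return largest


# Hypothetical centroid coordinates; the last point is relatively isolated.
centroids = [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0), (10.0, 0.0)]
band = min_distance_for_one_neighbor(centroids)
print(band)  # 7.0
```

The most isolated feature drives the result, which is why this default can be larger than ideal when a few features lie far from the rest.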
Hot Spot Analysis will visit each ZIP3 and compute the average interest rate for that ZIP3 and any surrounding ZIP3s within the Distance Band or Threshold Distance specified. If this local average interest rate is significantly higher than the average interest rate for all ZIP3s across the country, it is a hot spot. If this local interest rate is significantly lower than the average interest rate for all ZIP3s across the country, it is a cold spot. Applying the FDR correction to account for multiple testing and spatial dependence is always a good idea.
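The comparison of a local neighborhood against the whole study area is what the Getis-Ord Gi* statistic measures. The following is a minimal sketch of that statistic, assuming binary fixed-distance weights (1 inside the band, 0 outside, with each feature counted as its own neighbor) and point locations; the data are invented and the function name `gi_star` is illustrative, not an Esri API.

```python
import math


def gi_star(values, points, band):
    """Return a Gi* z-score for every feature using binary distance-band weights."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    scores = []
    for xi, yi in points:
        # Weight is 1 for features inside the band (the feature itself included).
        w = [
            1.0 if math.hypot(xi - xj, yi - yj) <= band else 0.0
            for xj, yj in points
        ]
        sum_w = sum(w)
        sum_w2 = sum(wj * wj for wj in w)
        numerator = sum(wj * v for wj, v in zip(w, values)) - mean * sum_w
        denominator = s * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        # If every feature is a neighbor (or values never vary), the score is undefined;
        # this sketch reports 0.0 in that case.
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    return scores


# Six hypothetical locations on a line: low rates in the west, high rates in the east.
locations = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)]
rates = [11.0, 11.0, 11.0, 15.0, 15.0, 15.0]
z_scores = gi_star(rates, locations, band=1.0)
print([round(z, 2) for z in z_scores])
```

Negative z-scores appear where low values cluster (candidate cold spots) and positive z-scores where high values cluster (candidate hot spots). Whether a score counts as significant depends on the confidence level and any multiple-testing correction.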
Each of the Hot Spot Analysis parameters is described in the tool documentation, including information on how to select an appropriate Conceptualization of Spatial Relationships and how to select an appropriate fixed-distance band value.
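To give a feel for what a False Discovery Rate correction does, here is a sketch of the Benjamini-Hochberg procedure, one widely used FDR method. Treat it as an illustration of the general idea only: it is an assumption that this matches the tool's internal procedure, and the p-values are invented.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking which p-values remain significant after FDR control."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= alpha * k / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant


p_values = [0.001, 0.008, 0.039, 0.041, 0.20]
uncorrected = [p <= 0.05 for p in p_values]
corrected = benjamini_hochberg(p_values)
print(uncorrected)  # [True, True, True, True, False]
print(corrected)    # [True, True, False, False, False]
```

Two results that pass an uncorrected 0.05 threshold no longer count as significant once the number of tests is taken into account.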
Create a model of average interest rate values
You will next see how well average loan grade rankings explain (predict) average interest rates using Ordinary Least Squares regression (OLS). If average loan grade values effectively predict the average interest rate values, you should get a high R2 value, and the model overpredictions and underpredictions (residuals) should exhibit a spatially random pattern.
The OLS tool writes its output report to the Results window and, when you turn off background processing, it also displays tool messages in a Progress dialog box.
- To ensure you see the OLS report while the tool runs, turn off background processing by clicking the Geoprocessing menu item and clicking Geoprocessing Options. Uncheck the Enable box for Background Processing.
- Now that you've disabled background processing, find and open the Ordinary Least Squares regression (OLS) tool and run it with the following parameters:
- Input Feature Class : ZIP3Data4Analysis
- Unique ID field : ZIP3NUM
- Output Feature Class : the name of your output feature class such as AveIntRatesOLSAveLoanGrade
- Dependent Variable : AveInterestRate
- Explanatory Variable : AveLoanGrade
- Notice that the adjusted R2 value is 0.942152. This tells you that the average loan grade values explain about 94 percent of the variation in the average interest rate values. As expected, this is a high adjusted R2 value (R2 results range from 0.00 to 1.00).
- The residual map produced by Ordinary Least Squares (OLS) is shown below. The blue areas are locations where the model predicted too high; the actual average interest rate is lower than expected, given the corresponding average loan grade. Similarly, the red areas are locations where the model predicted too low; the actual average interest rate is higher than expected, given the corresponding average loan grade. Notice that the underpredictions and overpredictions are far from randomly distributed. Most notable is the very strong cluster of overpredictions (blue) in Mississippi.
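The quantities discussed in these steps (the single global coefficient, adjusted R2, and the residuals that are mapped as overpredictions and underpredictions) can be reproduced for a one-variable model in plain Python. This is a sketch with invented numbers, not the OLS tool's implementation; `simple_ols` is an illustrative name.

```python
def simple_ols(x, y):
    """Fit y = intercept + slope * x and return slope, intercept, adjusted R2, residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual = actual minus predicted.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    k = 1  # number of explanatory variables
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return slope, intercept, adj_r2, residuals


# Hypothetical average loan grades and average interest rates for five ZIP3s.
grades = [8, 10, 12, 14, 16]
rates = [10.0, 11.9, 13.2, 15.1, 16.8]
slope, intercept, adj_r2, residuals = simple_ols(grades, rates)

for grade, res in zip(grades, residuals):
    # Negative residual: the model predicted too high (overprediction, blue on the map).
    # Positive residual: the model predicted too low (underprediction, red on the map).
    label = "overprediction" if res < 0 else "underprediction"
    print(grade, round(res, 2), label)
```

A high adjusted R2 says nothing about where the residuals fall, which is why the residual map is inspected separately for spatial clustering.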
Create a map showing the relationship between loan grades and interest rates
The residual map from Ordinary Least Squares (OLS) makes clear that average loan grades are not good predictors of average interest rates in Mississippi. You will use Geographically Weighted Regression (GWR) to further explore this relationship across the country.
With Ordinary Least Squares (OLS), only one model is constructed to represent all of the ZIP3s in your dataset. With only one model, only one coefficient is computed, and this coefficient reflects the strength of the relationship between the average loan grade and average interest rate variables, for all ZIP3s. Geographically Weighted Regression (GWR), on the other hand, creates a regression model for every ZIP3 in your dataset. This means there are 815 models and, correspondingly, 815 coefficients reflecting the strength of the relationship between loan grades and interest rates. Geographically Weighted Regression (GWR) computes a potentially different coefficient for every model. When you create a map of these coefficients, you can see how the relationship between loan grades and interest rates varies across the country.
Geographically Weighted Regression (GWR) calibrates each local model using only nearby ZIP3s. Further, this subset of ZIP3s is weighted to give the nearer ZIP3s more influence on the calibration process than ZIP3s that are farther away. Your parameter choices for Kernel type and Bandwidth method will determine which neighboring features are in or out of the calibration process.
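A local model of this kind can be sketched as a weighted least-squares fit in which each feature's weight decays with its distance from the location being calibrated. The example assumes a Gaussian distance-decay kernel and point locations; the data, the bandwidth, and the function name `local_fit` are all illustrative rather than taken from the tool.

```python
import math


def local_fit(target, coords, x, y, bandwidth):
    """Weighted least-squares slope and intercept calibrated at one location."""
    tx, ty = target
    # Gaussian kernel: nearer features receive weights closer to 1.
    w = [
        math.exp(-0.5 * (math.hypot(cx - tx, cy - ty) / bandwidth) ** 2)
        for cx, cy in coords
    ]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical data: a western cluster where rate = 1 + 1 * grade and an
# eastern cluster where rate = 1 + 2 * grade.
coords = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
grade = [1, 2, 3, 1, 2, 3]
rate = [2, 3, 4, 3, 5, 7]

west_slope, _ = local_fit((1, 0), coords, grade, rate, bandwidth=1.0)
east_slope, _ = local_fit((11, 0), coords, grade, rate, bandwidth=1.0)
print(round(west_slope, 3), round(east_slope, 3))
```

Repeating the fit at every feature yields one coefficient per location, and mapping those coefficients shows where the relationship is stronger or weaker.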
If you select FIXED for the Kernel type parameter, a particular distance band determines whether a feature is included. If you select ADAPTIVE, a particular number of nearest neighboring features determines whether a feature is included. The FIXED kernel has the advantage of ensuring the scale of analysis remains constant.
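The difference between the two kernel types comes down to how the neighborhood is chosen. The sketch below contrasts the two rules on made-up point locations; the function names are hypothetical and the example ignores the distance weighting applied afterward.

```python
import math


def distance(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])


def fixed_neighbors(target, points, distance_band):
    """Indexes of all points within distance_band of the target."""
    return [i for i, p in enumerate(points) if distance(target, p) <= distance_band]


def adaptive_neighbors(target, points, k):
    """Indexes of the k points nearest the target, whatever their distance."""
    order = sorted(range(len(points)), key=lambda i: distance(target, points[i]))
    return order[:k]


# Hypothetical centroids: a dense group on the left, sparse features on the right.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (30, 0)]
target = (10, 0)  # coincides with points[3], so it is selected as its own neighbor

print(fixed_neighbors(target, points, distance_band=3))  # [3]
print(adaptive_neighbors(target, points, k=3))           # [3, 2, 1]
```

With a fixed band, a sparse area may contribute very few features to a local model; with an adaptive count, the same area reaches farther to collect its neighbors, so the scale of analysis changes from place to place.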
The Bandwidth method you select indicates the criteria that will be used to find an optimal distance band or optimal number of neighbors. Both the Akaike Information Criterion (AICc) and the Cross Validation (CV) methods are appropriate. If you are using Geographically Weighted Regression (GWR) to get accurate predictions, you will want to make sure your model includes all the key explanatory variables and experiment with both bandwidth methods to see which one yields the highest adjusted R2 and lowest AICc values. Both of these diagnostics are provided in the GWR output report.
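As a rough illustration of bandwidth selection by cross validation, the sketch below scores candidate bandwidths with a leave-one-out criterion: each feature is predicted from a local fit that excludes it, and the squared errors are summed. It assumes a Gaussian kernel and the textbook form of the CV score, which may not match the tool's exact computation; AICc is not shown. All data and names are hypothetical.

```python
import math


def loo_prediction(i, coords, x, y, bandwidth):
    """Predict y[i] from a Gaussian-weighted local fit that leaves out feature i."""
    tx, ty = coords[i]
    idx = [j for j in range(len(x)) if j != i]
    w = [
        math.exp(-0.5 * (math.hypot(coords[j][0] - tx, coords[j][1] - ty) / bandwidth) ** 2)
        for j in idx
    ]
    sw = sum(w)
    mx = sum(wj * x[j] for wj, j in zip(w, idx)) / sw
    my = sum(wj * y[j] for wj, j in zip(w, idx)) / sw
    sxx = sum(wj * (x[j] - mx) ** 2 for wj, j in zip(w, idx))
    sxy = sum(wj * (x[j] - mx) * (y[j] - my) for wj, j in zip(w, idx))
    slope = sxy / sxx
    return my + slope * (x[i] - mx)


def cv_score(coords, x, y, bandwidth):
    """Sum of squared leave-one-out prediction errors for one bandwidth."""
    return sum(
        (y[i] - loo_prediction(i, coords, x, y, bandwidth)) ** 2
        for i in range(len(x))
    )


# Hypothetical data: the relationship is rate = grade in the west, rate = 3 * grade in the east.
coords = [(0, 0), (1, 0), (2, 0), (3, 0), (20, 0), (21, 0), (22, 0), (23, 0)]
grade = [1, 2, 3, 4, 1, 2, 3, 4]
rate = [1, 2, 3, 4, 3, 6, 9, 12]

candidates = [2.0, 50.0]
best = min(candidates, key=lambda b: cv_score(coords, grade, rate, b))
print(best)  # 2.0
```

The small bandwidth wins here because the relationship genuinely differs between the two clusters; a very large bandwidth behaves almost like a single global model and predicts both clusters poorly.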
Here, you are using Geographically Weighted Regression (GWR) only to explore the relationship between loan grades and interest rates.
- Find and open the Geographically Weighted Regression (GWR) tool and run it with the following parameters:
- Input features : ZIP3Data4Analysis
- Dependent variable : AveInterestRate
- Explanatory variable(s) : AveLoanGrade
- Output feature class : the name of your output feature class such as AveIntRatesGWRAveLoanGrade
- Kernel type : FIXED
- Bandwidth method : AICc
- Do you have sufficient features? Because Geographically Weighted Regression (GWR) creates local models, you should have 100 or more features in your Input features dataset. You have 815, so this is not the problem.
- Geographically Weighted Regression (GWR) will fail if two or more explanatory variables are redundant. In your case, there is only one explanatory variable in your model, so this is not a problem.
- Geographically Weighted Regression (GWR) will also fail when there isn't much variation in the explanatory variable values. Since the loan grades are mean rankings with little variation, this may be what is causing the problem. This may not be fatal, however. Often, subtracting the mean from the explanatory variable values allows Geographically Weighted Regression (GWR) to solve. Let's try this.
- Use the Add Field tool with the following parameters to create a field to hold the transformed values.
- Input Table : ZIP3Data4Analysis
- Field Name : tAveLoanGrade
- Field Type : FLOAT
- Open the ZIP3Data4Analysis table by right-clicking the layer in the Table of Contents and clicking Open Attribute Table.
- Right-click the Average Loan Grade Rank (AveLoanGrade) field and click Statistics.
- Highlight the mean value and copy it using Ctrl+C.
- Use the Calculate Field tool to compute the transformation of subtracting the mean value from each of the AveLoanGrade values (if you copied the mean value using Ctrl+C in the previous step, you can paste it into the second part of the Expression using Ctrl+V; otherwise, just type the Expression as shown):
- Input Table : ZIP3Data4Analysis
- Field Name : tAveLoanGrade
- Expression : [AveLoanGrade] - 12.166066
- Run Geographically Weighted Regression (GWR) again, this time using the transformed variable:
- Input features : ZIP3Data4Analysis
- Dependent variable : AveInterestRate
- Explanatory variable(s) : tAveLoanGrade
- Output feature class : the name of your output feature class such as AveIntRatesGWRAveLoanGrade2
- Kernel type : FIXED
- Bandwidth method : AICc
- This time, there is no error. Also, notice that Geographically Weighted Regression (GWR), because it allows relationships to change at each ZIP3, has improved the model results. The adjusted R2 value is now almost 97 percent (up from 94 percent).
- Right-click the AveIntRatesGWRAveLoanGrade2 layer in the Table of Contents, click Properties, and click the Symbology tab.
- Set the field Value to Coefficient #1 tAveLoanGrade. Set the Classification to Quantile. Finally, select a color ramp that best represents small to large (a graduated color ramp rather than a divergent color ramp). The Yellow to Dark Red color ramp, for example, works well.
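The transformation applied with Calculate Field in the steps above is ordinary mean-centering: subtract the field's mean from every value. A minimal sketch with invented values follows; `mean_center` is an illustrative helper, and the numbers are not the actual AveLoanGrade data.

```python
def mean_center(values):
    """Return the mean and the list of values with the mean subtracted."""
    mean = sum(values) / len(values)
    return mean, [v - mean for v in values]


# Hypothetical average loan grade rankings with little variation around their mean.
ave_loan_grade = [11.8, 12.0, 12.1, 12.3, 12.6]
mean, t_ave_loan_grade = mean_center(ave_loan_grade)

print(round(mean, 2))                              # 12.16
print([round(v, 2) for v in t_ave_loan_grade])     # [-0.36, -0.16, -0.06, 0.14, 0.44]
```

Centering shifts every value by the same constant, so the differences between ZIP3s, and therefore the slope relating loan grade to interest rate, are unchanged; only the intercept moves. The centered values are simply distributed around zero.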
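Ouch! The first Geographically Weighted Regression (GWR) run, using the untransformed AveLoanGrade variable, gives an error. It sounds serious, too!
Actually, this message is more of a caution than a stop sign. When you see this error, check the three items listed in the steps above: whether you have sufficient features, whether any explanatory variables are redundant, and whether the explanatory variable values have enough variation.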
Final thoughts
This case study presents a workflow that tests the assumed, unquestioned relationship between average loan grades and average interest rates. It evaluates notions of equity and disparate impacts that manifest spatially. These same methods could be applied to other applications where there are assumed correlations. For example, locations with higher average incomes should pay higher average income taxes. Is this consistently true? Where is it less or more true across the country? Locations with appropriate growing conditions should produce higher yields. Is this always the case everywhere? If not, why not? Schools with better teacher-to-student ratios should have higher test scores. Testing and mapping these expected relationships may lead to unexpected, and very interesting, findings.