When you run the Exploratory Regression tool, the primary output is a report. The report appears in the geoprocessing messages window when the tool runs in the foreground, and it can also be accessed from the Results window. Optionally, a table is created that can help you further investigate the models that were tested. One purpose of the report is to help you determine whether the candidate explanatory variables you are considering yield any properly specified OLS models. Even when there are no passing models (models that meet all of the criteria you specified when you launched the Exploratory Regression tool), the output shows you which variables are consistent predictors and helps you determine which diagnostics are giving you problems. Strategies for addressing problems associated with each of the diagnostics are given in the Regression Analysis Basics document (see Common regression problems, consequences, and solutions) and in What they don't tell you about regression analysis. For more information about how to determine whether you have a properly specified OLS model, see Regression Analysis Basics and Interpreting OLS results.
The report
The Exploratory Regression report has five distinct sections. Each section is described below.
1. Best models by number of explanatory variables
The first set of summaries in the output report is grouped by the number of explanatory variables in the models tested. If you specify 1 for the Minimum Number of Explanatory Variables parameter and 5 for the Maximum Number of Explanatory Variables parameter, you will have five summary sections. Each section lists the three models with the highest adjusted R2 values along with all passing models. Each summary section also includes the diagnostic values for each model listed: the corrected Akaike Information Criterion (AICc), the Jarque-Bera p-value (JB), Koenker's studentized Breusch-Pagan p-value (K(BP)), the largest Variance Inflation Factor (VIF), and a measure of residual spatial autocorrelation, the Global Moran's I p-value (SA). These summaries give you an idea of how well your models are predicting (Adj R2) and whether any models pass all of the diagnostic criteria you specified. If you accepted all of the default Search Criteria (the Minimum Acceptable Adj R Squared, Maximum Coefficient p-value Cutoff, Maximum VIF Value Cutoff, Minimum Acceptable Jarque Bera p-value, and Minimum Acceptable Spatial Autocorrelation p-value parameters), any models included in the Passing Models list will be properly specified OLS models.
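To make the pass/fail logic concrete, the following sketch (plain Python, not part of the tool itself) shows how a single model's diagnostics are compared against the Search Criteria thresholds. The threshold values shown are illustrative placeholders, not a statement of the tool's defaults; confirm the actual default values in the tool dialog for your ArcGIS release.

```python
# A minimal sketch of how the Search Criteria determine a "passing" model.
# The threshold defaults below are illustrative; confirm the actual defaults
# in the Exploratory Regression tool dialog.

def passes_search_criteria(adj_r2, max_coef_p, max_vif, jb_p, sa_p,
                           min_adj_r2=0.5, max_p_cutoff=0.05,
                           max_vif_cutoff=7.5, min_jb_p=0.1, min_sa_p=0.1):
    """Return True if a candidate model satisfies every diagnostic criterion."""
    return (adj_r2 >= min_adj_r2 and          # model explains enough variance
            max_coef_p <= max_p_cutoff and    # every coefficient is significant
            max_vif <= max_vif_cutoff and     # no serious multicollinearity
            jb_p >= min_jb_p and              # residuals look normally distributed
            sa_p >= min_sa_p)                 # residuals not spatially autocorrelated

# Example: a model with good fit but spatially autocorrelated residuals fails.
print(passes_search_criteria(adj_r2=0.62, max_coef_p=0.01, max_vif=2.3,
                             jb_p=0.45, sa_p=0.002))  # False
```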
If there aren't any passing models, the rest of the output report still provides useful information about variable relationships and can help you decide how to move forward.
2. Exploratory Regression Global Summary
The Exploratory Regression Global Summary section is an important place to start, especially if you haven't found any passing models, because it shows you why none of the models are passing. This section lists the five diagnostic tests and the percentage of models that passed each of those tests. If you don’t have any passing models, this summary will help you figure out which diagnostic test is giving you trouble.
Often the diagnostic giving you problems will be the Global Moran's I test for spatial autocorrelation (SA). When all of the models tested have spatially autocorrelated regression residuals, it most often indicates you are missing key explanatory variables. One of the best ways to find missing explanatory variables is to examine the map of residuals created by the Ordinary Least Squares (OLS) regression tool. Choose one of the exploratory regression models that performed well for all of the other criteria (use the lists of highest adjusted R-squared values, or select a model from the optional output table), and run OLS using that model. Examine the resulting residual map to see if it provides any clues about what might be missing. Try to think of as many candidate spatial variables as you can (distance to major highways, hospitals, or other key geographic features, for example). Consider trying spatial regime variables: if all of your underpredictions are in rural areas, for example, create a dummy variable to see if it improves your exploratory regression results.
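The sketch below outlines this residual-mapping and dummy-variable workflow with ArcPy (ArcMap-era syntax). The feature class, field names, and the rural definition are hypothetical, and the OLS tool's parameter order should be verified against the tool reference for your ArcGIS version.

```python
import arcpy

arcpy.env.overwriteOutput = True

# Hypothetical inputs: a feature class with a unique ID field, a dependent
# variable, and the explanatory variables from a promising exploratory
# regression model.
tracts = r"C:\data\analysis.gdb\tracts"              # hypothetical path
explanatory = "MEDIAN_INCOME;PCT_COLLEGE;DIST_HWY"   # hypothetical field names

# Run OLS so the residuals can be mapped and inspected.
# Verify the parameter order against the Ordinary Least Squares tool reference.
arcpy.OrdinaryLeastSquares_stats(tracts, "TRACT_ID",
                                 r"C:\data\analysis.gdb\tracts_ols",
                                 "CALL_VOLUME", explanatory)

# Example spatial regime variable: flag rural features with a 0/1 dummy field
# ("rural" is defined here by a hypothetical population-density field).
arcpy.AddField_management(tracts, "RURAL_DUM", "SHORT")
arcpy.CalculateField_management(tracts, "RURAL_DUM",
                                "1 if !POP_DENSITY! < 100 else 0",
                                "PYTHON_9.3")
```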
The other diagnostic that is commonly problematic is the Jarque-Bera test for normally distributed residuals. When none of your models pass the Jarque-Bera (JB) test, you are having a problem with model bias. Common sources of model bias include:
- Nonlinear relationships
- Data outliers
Viewing a scatterplot matrix of the candidate explanatory variables in relation to your dependent variable will show you if you have either of these problems. Additional strategies are outlined in Regression Analysis Basics. If your models are failing the Spatial Autocorrelation test (SA), fix those issues first. The bias may be the result of missing key explanatory variables.
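If you have exported the attribute table, the same checks can be made outside ArcGIS. The sketch below uses pandas and SciPy with hypothetical column names: a scatterplot matrix to look for nonlinear relationships and outliers, and the Jarque-Bera test applied to a column of OLS residuals.

```python
import pandas as pd
from pandas.plotting import scatter_matrix
from scipy import stats
import matplotlib.pyplot as plt

# Hypothetical: the attribute table exported to CSV, with the dependent
# variable and candidate explanatory variables as columns.
df = pd.read_csv("tracts_attributes.csv")
fields = ["CALL_VOLUME", "MEDIAN_INCOME", "PCT_COLLEGE", "DIST_HWY"]

# A scatterplot matrix reveals nonlinear relationships and data outliers.
scatter_matrix(df[fields], figsize=(8, 8), diagonal="hist")
plt.show()

# The Jarque-Bera test itself: a small p-value indicates the residuals
# (read here from a hypothetical column) are not normally distributed.
jb_stat, jb_p = stats.jarque_bera(df["OLS_RESIDUAL"].dropna())
print(f"Jarque-Bera statistic: {jb_stat:.3f}, p-value: {jb_p:.4f}")
```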
3. Summary of Variable Significance
The Summary of Variable Significance section provides information about variable relationships and how consistent those relationships are. Each candidate explanatory variable is listed with the percentage of models in which it was statistically significant; the variables with the largest values in the % Significant column appear first. You can also see how stable each relationship is by examining the % Negative and % Positive columns. Strong predictors are consistently significant (% Significant) and have a stable relationship (primarily negative or primarily positive).
This section can also help you work more efficiently. That is especially important when you are working with many candidate explanatory variables (more than 50) and want to try models with five or more predictors. When you have a large number of explanatory variables and are testing many combinations, the calculations can take a long time, and in some cases the tool will not finish at all because it runs out of memory. A good approach is to gradually increase the number of models tested: start by setting both the Minimum Number of Explanatory Variables and the Maximum Number of Explanatory Variables to 2, then 3, then 4, and so on, as sketched below. With each run, remove the variables that are rarely statistically significant in the models tested. The Summary of Variable Significance section will help you find the variables that are consistently strong predictors. Removing even one candidate explanatory variable from your list can greatly reduce the time it takes the Exploratory Regression tool to complete.
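The outline below shows one way to script that incremental strategy with ArcPy. The input dataset, field names, and the keyword parameter names are assumptions for illustration; check the Exploratory Regression tool reference for the exact signature in your release.

```python
import arcpy

arcpy.env.overwriteOutput = True

tracts = r"C:\data\analysis.gdb\tracts"    # hypothetical input feature class
candidates = "MEDIAN_INCOME;PCT_COLLEGE;DIST_HWY;PCT_RENTER;POP_DENSITY"

# Run the tool several times, testing two-variable models first, then
# three-variable models, and so on. After each run, inspect the Summary of
# Variable Significance and drop weak candidates before the next, larger run.
for n_vars in (2, 3, 4):
    arcpy.ExploratoryRegression_stats(
        tracts, "CALL_VOLUME", candidates,
        Output_Report_File=r"C:\data\explore_{0}vars.txt".format(n_vars),
        Minimum_Number_of_Explanatory_Variables=n_vars,
        Maximum_Number_of_Explanatory_Variables=n_vars)
    # ...review the report, then edit `candidates` to remove weak variables...
```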
4. Summary of Multicollinearity
The Summary of Multicollinearity section of the report can be used in conjunction with the Summary of Variable Significance section to understand which candidate explanatory variables may be removed from your analysis in order to improve performance. The Summary of Multicollinearity section tells you how many times each explanatory variable was included in a model with high multicollinearity, and the other explanatory variables that were also included in those models. When two (or more) explanatory variables are frequently found together in models with high multicollinearity, it indicates that those variables may be telling the same story. Since you only want to include variables that are explaining a unique aspect of the dependent variable, you may want to choose only one of the redundant variables to include in further analysis. One approach is to use the strongest of the redundant variables based on the Summary of Variable Significance.
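If you want to double-check redundancy among candidate variables outside the tool, one option is to compute variance inflation factors directly with statsmodels, as sketched below. The CSV file and column names are hypothetical.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical: candidate explanatory variables exported to CSV.
df = pd.read_csv("tracts_attributes.csv")
candidates = ["MEDIAN_INCOME", "MEDIAN_HOME_VALUE", "PCT_COLLEGE", "DIST_HWY"]

# VIF for each candidate, computed against all of the others. Values well
# above your VIF cutoff suggest the variable is redundant with the rest.
X = sm.add_constant(df[candidates].dropna())
for i, name in enumerate(candidates, start=1):  # column 0 is the constant
    print(name, round(variance_inflation_factor(X.values, i), 2))
```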
5. Additional diagnostic summaries
The final diagnostic summaries show the highest Jarque-Bera p-values (Summary of Residual Normality) and the highest Global Moran’s I p-values (Summary of Residual Autocorrelation). To pass these diagnostic tests, you are looking for large p-values.
These summaries are not especially useful when your models are passing the Jarque-Bera and spatial autocorrelation (Global Moran's I) tests, because if your significance level is 0.1, every model with a p-value larger than 0.1 passes equally. They are useful, however, when you do not have any passing models and you want to see how far you are from having normally distributed residuals, or residuals that are free from statistically significant spatial autocorrelation. For instance, if all of the p-values in the Jarque-Bera summary are 0.000000, you are clearly far from having normally distributed residuals. If, on the other hand, the p-values are near 0.092, you know you are close to having residuals that are normally distributed (in fact, depending on the significance level you chose, a p-value of 0.092 might be passing). These summaries show how serious the problem is and, when none of your models are passing, which variables are associated with the models that come closest to passing.
The table
If you provided a value for the Output Results Table, a table will be created containing all models that met your Maximum Coefficient p-value Cutoff and Maximum VIF Value Cutoff criteria. Even if you do not have any passing models, there is a good chance that you will have some models in the output table. Each row in the table represents a model meeting your criteria for coefficient and VIF values. The columns in the table provide the model diagnostics and explanatory variables. The diagnostics listed are Adjusted R-Squared (R2), corrected Akaike Information Criteria (AICc), Jarque-Bera p-value (JB), Koenker’s studentized Breusch-Pagan p-value (BP), Variance Inflation Factor (VIF), and Global Moran’s I p-value (SA). You may want to sort the models by their AICc values. The lower the AICc value, the better the model performed. You can sort the AICc values in ArcMap by double-clicking on the AICc column. If you are choosing a model to use in an OLS analysis (in order to examine the residuals), remember to choose a model with a low AICc value and passing values for as many of the other diagnostics as possible. For example, if you have looked at your output report and you know that Jarque-Bera was the diagnostic that gave you trouble, you would look for the model with the lowest AICc value that met all of the criteria except for Jarque-Bera.
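If you prefer to work with the output table programmatically, the sketch below reads it into pandas, ranks models by AICc, and filters for near-misses. The table path and field names are assumptions based on the diagnostics listed above; confirm the actual column names by opening the table first.

```python
import arcpy
import pandas as pd

# Hypothetical path to the Output Results Table created by Exploratory Regression.
results_table = r"C:\data\analysis.gdb\exploratory_results"

# Field names are assumed to mirror the diagnostics described above;
# confirm them against the actual table before running this.
fields = ["AdjR2", "AICc", "JB", "BP", "VIF", "SA"]
arr = arcpy.da.TableToNumPyArray(results_table, fields, skip_nulls=True)
df = pd.DataFrame(arr)

# Rank models from lowest (best) AICc to highest.
df = df.sort_values("AICc")

# Example: the best-performing models that fail only the Jarque-Bera test,
# assuming 0.1 was used as the minimum acceptable p-value and 7.5 as the
# VIF cutoff (both illustrative).
near_miss = df[(df["JB"] < 0.1) & (df["SA"] >= 0.1) & (df["VIF"] <= 7.5)]
print(near_miss.head())
```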
Additional resources
If you're new to regression analysis in ArcGIS, we strongly encourage you to watch the free Esri Virtual Campus Training Seminar on Regression, then work through the Regression Analysis tutorial before using Exploratory Regression.
You may also want to see:
- Learn more about how Exploratory Regression works
- What they don't tell you about regression analysis
- Regression analysis basics
Burnham, K.P. and D.R. Anderson. 2002. Model Selection and Multimodel Inference: a practical information-theoretic approach, 2nd Edition. New York: Springer. Section 1.5.