Available with Geostatistical Analyst license.

The following terms and concepts arise repeatedly in geostatistics and within Geostatistical Analyst.

Term | Description |
---|---|
Cross validation | A technique used to assess how accurate an interpolation model is. In Geostatistical Analyst, cross validation leaves one point out and uses the rest to predict a value at that location. The point is then added back into the dataset, and a different one is removed. Repeating this for every sample in the dataset provides pairs of predicted and known values that can be compared to assess the model's performance. Results are usually summarized as the mean error and root mean square error. |
Deterministic methods | In Geostatistical Analyst, deterministic methods are those that create surfaces from measured points based on either the extent of similarity (for example, inverse distance weighted) or the degree of smoothing (for example, radial basis functions). They do not provide a measure of the uncertainty (error) of the predictions. |
Geostatistical layer | Results produced by the Geostatistical Wizard and many of the geoprocessing tools in the Geostatistical Analyst toolbox are stored in a surface called a geostatistical layer. Geostatistical layers can be used to make maps of the results, view and revise the interpolation method's parameter values (by opening them in the Geostatistical Wizard), create other types of geostatistical layers (such as prediction error maps), and export the results to raster or vector (contour, filled contour, and point) formats. |
Geostatistical methods | In Geostatistical Analyst, geostatistical methods are those based on statistical models that include autocorrelation (the statistical relationships among the measured points). These techniques can produce prediction surfaces as well as some measure of the uncertainty (error) associated with the predictions. |
Interpolation | A process that uses measured values taken at known sample locations to predict (estimate) values for unsampled locations. Geostatistical Analyst offers several interpolation methods, which differ in their underlying assumptions, data requirements, and ability to generate different types of output (for example, predicted values as well as the errors [uncertainties] associated with them). |
Kernel | A weighting function used in several of the interpolation methods offered in Geostatistical Analyst. Typically, higher weights are assigned to sample values that are close to the location where a prediction is being made, and lower weights are assigned to sample values that are farther away. |
Kriging | A collection of interpolation methods that rely on semivariogram models of spatial autocorrelation to generate predicted values, errors associated with the predictions, and other information regarding the distribution of possible values for each location in the study area (through quantile and probability maps, or via geostatistical simulation, which provides a set of possible values for each location). |
Search neighborhood | Most interpolation methods use a local subset of the data to make predictions. Imagine a moving window: only data within the window is used to make a prediction at its center. This is done because samples far from the prediction location contribute little additional information, and because restricting the calculation to nearby samples reduces the computing time required to generate predicted values for the entire study area. The search neighborhood (the number of nearby samples used and their spatial configuration within the window) affects the prediction surface and should be chosen with care. |
Semivariogram | A function that describes the differences (variance) between samples separated by varying distances. Typically, the semivariogram shows low variance at small separation distances and larger variance at greater separation distances, indicating that the data is spatially autocorrelated. Semivariograms estimated from sample data are called empirical semivariograms; they are represented as a set of points on a graph. A function fitted to these points is known as a semivariogram model. The semivariogram model is a key component of kriging (a powerful interpolation method that can provide predicted values, errors associated with the predictions, and information about the distribution of possible values for each location in the study area). |
Simulation | In geostatistics, a technique that extends kriging by producing many possible versions of a predicted surface (in contrast to kriging, which produces one surface). The set of predicted surfaces provides a wealth of information that can be used to describe the uncertainty in a predicted value for a particular location, the uncertainty for a set of predicted values in an area of interest, or a set of predicted values that can be used as input to a second model (physical, economic, and so forth) to assess risk and make better-informed decisions. |
Spatial autocorrelation | The tendency for sample values taken close to one another to be more alike than sample values taken far apart, a property natural phenomena often exhibit. Some interpolation methods require an explicit model of spatial autocorrelation (for example, kriging), others rely on an assumed degree of spatial autocorrelation without providing a way to measure it (for example, inverse distance weighting), and others require no notion of spatial autocorrelation at all. Note that when spatial autocorrelation exists, traditional statistical methods (which rely on independence among observations) cannot be used reliably. |
Transformation | A data transformation applies a function (log, Box-Cox, arcsine, normal score) to the data to change the shape of its distribution and/or stabilize the variance, that is, to reduce the relationship between the mean and the variance (for example, when data variability increases as the mean value increases). |
Validation | Similar to cross validation, but instead of using the same dataset to build and evaluate the model, two datasets are used: one to build the model and the other as an independent test of its performance. If only one dataset is available, the Subset Features tool can be used to randomly split it into training and test subsets. |
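To make the cross-validation procedure concrete, the following Python sketch performs leave-one-out cross validation with a simple inverse distance weighted predictor and summarizes the errors as a mean error and root mean square error. The helpers `idw_predict` and `cross_validate` are illustrative stand-ins, not functions of Geostatistical Analyst.

```python
import math

def idw_predict(x, y, samples, power=2):
    """Inverse distance weighted prediction at (x, y) from (x, y, value) samples."""
    num = den = 0.0
    for sx, sy, sv in samples:
        d = math.hypot(x - sx, y - sy)
        if d == 0:
            return sv  # a prediction at a sample location is the sample value itself
        w = 1.0 / d ** power
        num += w * sv
        den += w
    return num / den

def cross_validate(samples):
    """Leave one point out, predict it from the rest, repeat for every point.
    Returns (mean error, root mean square error)."""
    errors = []
    for i, (x, y, v) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        errors.append(idw_predict(x, y, rest) - v)
    n = len(errors)
    mean_err = sum(errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return mean_err, rmse

pts = [(0, 0, 1.0), (1, 0, 2.0), (0, 1, 2.0), (1, 1, 3.0), (0.5, 0.5, 2.0)]
me, rmse = cross_validate(pts)
```

A mean error near zero suggests the predictions are unbiased on average, while the root mean square error reflects their typical magnitude of error.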
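The kernel idea, higher weight for nearby samples and lower weight for distant ones, can be sketched as a distance-decay function. The `gaussian_kernel_weights` helper and its bandwidth parameter below are hypothetical, chosen only to illustrate the concept.

```python
import math

def gaussian_kernel_weights(distances, bandwidth):
    """Assign each sample a weight that decays with distance, then normalize
    so the weights sum to 1."""
    raw = [math.exp(-(d / bandwidth) ** 2) for d in distances]
    total = sum(raw)
    return [w / total for w in raw]

# Three samples at increasing distance from the prediction location.
weights = gaussian_kernel_weights([0.5, 1.0, 2.0], bandwidth=1.0)
```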
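A search neighborhood can be sketched as selecting the nearest samples that fall inside a window around the prediction location. The `search_neighborhood` function below is a simplified illustration (a circular window with a maximum neighbor count); Geostatistical Analyst's actual neighborhoods offer more configuration, such as sectors.

```python
import math

def search_neighborhood(x, y, samples, max_neighbors, radius):
    """Return the nearest samples to (x, y) that fall inside a circular window."""
    candidates = []
    for sx, sy, sv in samples:
        d = math.hypot(x - sx, y - sy)
        if d <= radius:
            candidates.append((d, sx, sy, sv))
    candidates.sort()  # nearest first
    return [(sx, sy, sv) for _, sx, sy, sv in candidates[:max_neighbors]]

locs = [(0, 0, 1.0), (1, 0, 2.0), (3, 0, 3.0), (0.5, 0, 4.0)]
neighbors = search_neighborhood(0, 0, locs, max_neighbors=2, radius=2.0)
```

Only the returned subset would be passed to the interpolator, which is what keeps prediction over a large study area fast.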
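An empirical semivariogram bins squared differences between sample pairs by separation distance: for each lag bin, gamma(h) = sum((z_i - z_j)^2) / (2 * N(h)), where N(h) is the number of pairs in the bin. The sketch below assumes 2D points with attribute values and fixed-width lag bins.

```python
import math

def empirical_semivariogram(samples, lag_width, n_lags):
    """Average half the squared value difference over all sample pairs,
    grouped into distance bins of width lag_width."""
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            xi, yi, zi = samples[i]
            xj, yj, zj = samples[j]
            h = math.hypot(xi - xj, yi - yj)
            b = int(h // lag_width)
            if b < n_lags:
                sums[b] += (zi - zj) ** 2
                counts[b] += 1
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]

obs = [(0, 0, 1.0), (1, 0, 1.2), (2, 0, 2.0), (3, 0, 2.5)]
gam = empirical_semivariogram(obs, lag_width=1.5, n_lags=2)
```

For spatially autocorrelated data, the values rise with the lag, which is the pattern a semivariogram model is then fitted to.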
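A log transform is a common way to stabilize variance when spread grows with the mean. The small dataset below is invented purely for illustration: after the transform, the spread of the high-valued group is much closer to that of the low-valued group.

```python
import math
import statistics

# Invented data whose variability grows with the mean (a proportional effect).
low = [1.0, 1.2, 0.9, 1.1]
high = [10.0, 14.0, 7.0, 12.0]

# The log transform compresses large values, weakening the mean-variance link.
log_low = [math.log(v) for v in low]
log_high = [math.log(v) for v in high]

ratio_raw = statistics.stdev(high) / statistics.stdev(low)
ratio_log = statistics.stdev(log_high) / statistics.stdev(log_low)
```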
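The random training/test split used for validation can be mimicked in a few lines; `train_test_split` below is an illustrative stand-in for what the Subset Features tool does with a feature class, not the tool itself.

```python
import random

def train_test_split(features, test_fraction, seed=42):
    """Randomly partition a dataset into training and test subsets."""
    shuffled = features[:]
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train_set, test_set = train_test_split(list(range(10)), test_fraction=0.3)
```

The model is then built from the training subset only, and its errors are measured against the held-out test subset.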