Exploratory spatial data analysis
Since the Chernobyl accident in 1986, contamination levels have been collected in nearby areas. The half-life of the major contaminant, cesium-137, is approximately 30 years, and contamination is likely to vary spatially across the region; however, it is not feasible to take samples everywhere. Using interpolation, values at locations that have not been sampled can be estimated from the collected data (Krivoruchko 2012).
Due to the complex meteorological conditions following the accident, contamination is likely a combination of both large-scale and small-scale variations. It is known, for example, that rain fell for several days after the nuclear accident in the area to the northeast of the sample area. Sample sites closer to the Chernobyl nuclear power plant are likely to have the highest cesium-137 levels. Some measurement error is also inevitable given that radioactive decay is a stochastic process. Using kriging to predict values, both large- and small-scale variations can be reconstructed, and if the data is not precise, in most cases the method can filter out measurement error.
With some understanding of the area and the processes that took place, such as the wind conditions and rainfall, looking at the data distribution can reveal more information. Exploring the data will, in part, help determine which kriging model is appropriate for this data. It also gives you knowledge of effective parameters for use in the modeling process.
Exploratory spatial data analysis is a crucial first step in statistical analysis; it helps you ensure your analysis is apposite.
Data stationarity
Kriging assumes that data exhibits stationarity. Stationarity means that statistical properties do not depend on exact locations. The mean of a sample at one location is equal to the mean at any other location; data variance is constant in the area being analyzed, and the correlation (covariance or semivariogram) between any two locations depends only on the distance between them, not their exact locations.
A number of steps can be followed to identify whether your data meets these assumptions, and in cases where it doesn't, it may be possible to compensate with options such as transformation and trend removal as part of the analysis. Ensuring data assumptions are met is crucial to obtaining valid results. Prediction error is particularly sensitive to model assumption violations.
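As a rough illustration of one such check, the study area can be divided into blocks and the local means and variances compared; under stationarity these should be roughly constant. The following is a minimal Python sketch with simulated stand-in values (the coordinates and the `cs137` array are hypothetical, not the case-study data):

```python
import numpy as np

# Hypothetical sample data: x, y coordinates and cesium-137 readings.
rng = np.random.default_rng(0)
x, y = rng.uniform(0, 100, 200), rng.uniform(0, 100, 200)
cs137 = rng.lognormal(mean=0.0, sigma=1.0, size=200)

# Split the region into four quadrants and compare means and variances.
# Under stationarity these local statistics should be roughly constant.
for name, mask in {
    "SW": (x < 50) & (y < 50),
    "SE": (x >= 50) & (y < 50),
    "NW": (x < 50) & (y >= 50),
    "NE": (x >= 50) & (y >= 50),
}.items():
    vals = cs137[mask]
    print(f"{name}: n={vals.size:3d}  mean={vals.mean():6.2f}  var={vals.var():6.2f}")
```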
Examining the distribution of data
Kriging is an optimal predictor when data follows a normal (Gaussian) distribution. Plotting the cesium-137 data as a histogram shows the frequency distribution of the data along with some summary statistics. To meet the assumption of data normality, the distribution in the histogram should be bell-shaped; the mean and median values should be similar, the skewness value should be around 0, and the kurtosis value around 3. None of these are evident in the cesium-137 data. This data can still be used in a kriging model if a transformation can be applied so that the transformed values have a normal distribution.
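These summary diagnostics are simple to reproduce outside the Histogram tool. A minimal Python sketch, using simulated stand-in values; note that the "around 3" convention corresponds to non-excess (Pearson) kurtosis:

```python
import numpy as np
from scipy import stats

# Hypothetical cesium-137 sample values; replace with the real readings.
rng = np.random.default_rng(1)
cs137 = rng.lognormal(mean=0.0, sigma=1.0, size=500)

print("mean   :", np.mean(cs137))
print("median :", np.median(cs137))
print("skew   :", stats.skew(cs137))                    # ~0 for normal data
print("kurt   :", stats.kurtosis(cs137, fisher=False))  # ~3 for normal data
```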
The data shows a skewed distribution with only a few large values. A logarithmic transformation may bring the data closer to a normal distribution, so lognormal kriging will likely produce more accurate predictions than kriging without the data transformation option. For further validation, the normal QQ plot (probability plot) can be used to graph the data distribution against the standard normal distribution (shown as a line). Using a log transformation clearly brings the data closer to a normal distribution and provides a reasonable option. Further exploration can reveal more information.
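The same comparison can be sketched with SciPy's probability plot, again with simulated stand-in values; points falling close to the reference line indicate agreement with the standard normal distribution:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
cs137 = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # hypothetical readings

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# QQ plot of the raw values against the standard normal distribution.
stats.probplot(cs137, dist="norm", plot=axes[0])
axes[0].set_title("Raw cesium-137")

# QQ plot of the log-transformed values; points should fall near the line.
stats.probplot(np.log(cs137), dist="norm", plot=axes[1])
axes[1].set_title("Log-transformed cesium-137")

plt.tight_layout()
plt.show()
```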
Using the normal score transformation in the simple kriging model not only lets you visualize your data but also reports diagnostic information about the data distribution. The Akaike information criterion (AIC), a measure of the relative quality of a statistical model, is used to choose the optimal base distribution and the number of distributions needed for a normal distribution approximation (Gribov and Krivoruchko 2012). Indeed, a lognormal model is calculated as the default distribution; however, rather than just one, seven modifiers are needed to obtain a reasonable normal distribution approximation. This is most clearly seen with the normal QQ plot.
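The sketch below illustrates the general idea of ranking candidate distributions by AIC: fit each by maximum likelihood and compute AIC = 2k − 2 ln L. This is only an analogy for what the wizard reports, not the multiplicative skewing algorithm of Gribov and Krivoruchko (2012), and the data are simulated stand-ins:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
cs137 = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # hypothetical readings

# AIC = 2k - 2*ln(L); the preferred model has the minimum AIC value.
for name, dist in [("normal", stats.norm),
                   ("lognormal", stats.lognorm),
                   ("student's t", stats.t)]:
    params = dist.fit(cs137)                      # maximum-likelihood fit
    loglik = np.sum(dist.logpdf(cs137, *params))  # log-likelihood at the MLE
    aic = 2 * len(params) - 2 * loglik
    print(f"{name:12s} AIC = {aic:8.1f}")
```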
Exploring spatial structure
A value that is significantly higher than the others, known as an outlier, can cause problems when choosing the spatial correlation model (semivariogram). A high value located close to Chernobyl is evident in all the above approaches. This value could be removed from the model-building process but then included when the data is used for prediction; it should be investigated at the modeling stage.
Stationarity is evident when any two locations a similar distance and direction apart also have a similar squared difference in value. Using pairs of sample locations, the spatial autocorrelation in the data can be examined and plotted as a semivariogram cloud. If spatial correlation exists, pairs of points that are close together should have small differences and so plot low on the y-axis; as distance increases, the squared differences should increase and plot higher on the y-axis. This is not strongly evident in this data; in particular, the semivariogram values are very different for pairs involving the largest cesium value near Chernobyl. For this high value, at different distances to other points, the squared differences are always large (high on the y-axis) rather than varying from small to large. Furthermore, this structure, which is evidence of spatial dependence in the data, can also be seen in the semivariogram surface.
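A semivariogram cloud is straightforward to sketch by hand: for every pair of samples, plot the separation distance against half the squared difference in values. A minimal Python sketch with hypothetical locations and values:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
xy = rng.uniform(0, 100, size=(150, 2))   # hypothetical sample locations
z = rng.lognormal(0.0, 1.0, size=150)     # hypothetical cesium-137 values

# One point per pair of samples: separation distance vs. half the
# squared difference in values (the classical semivariogram cloud).
h = pdist(xy)                                         # pairwise distances
gamma = 0.5 * pdist(z[:, None], metric="sqeuclidean")  # 0.5 * (z_i - z_j)^2

plt.scatter(h, gamma, s=4, alpha=0.3)
plt.xlabel("separation distance")
plt.ylabel("0.5 * (z_i - z_j)^2")
plt.title("Semivariogram cloud")
plt.show()
```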
Looking for global trends
A transformation brings the cesium data closer to a normal distribution; however, the normal score transformation should only be used with data that has no trend.
The Trend Analysis tool provides a three-dimensional perspective of the data, with polynomials fit through the scatterplots on the projected planes. The stick heights reflect the cesium-137 value at each sample location. Rotating the data to 315 degrees, so that Chernobyl is located in the back corner of the grid, reveals a strong trend in two perpendicular directions, north-northeast and south-southeast.
Data detrending can be explored interactively using the simple kriging model. For this analysis a first-order polynomial is appropriate, as shown by the goodness-of-fit diagnostic. A value of zero on the trend surface analysis slider is equivalent to using the Global Polynomial Interpolation method, and any greater value uses Local Polynomial Interpolation. When the polynomial coefficients are calculated locally, the number of points involved in the calculation at each location in the data domain decreases, resulting in a map with more detail. Usually, the large-scale data variation is clearly evident at about one-third (a slider value of 33).
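The global end of this range (a slider value of 0) corresponds to fitting a single first-order polynomial to the whole domain and working with the residuals. A minimal Python sketch of that detrending step, with hypothetical trending data:

```python
import numpy as np

rng = np.random.default_rng(5)
x, y = rng.uniform(0, 100, 200), rng.uniform(0, 100, 200)
z = 0.05 * x - 0.03 * y + rng.normal(0, 1, 200)  # hypothetical trending data

# First-order (planar) trend: z ~ a + b*x + c*y, fit by least squares.
A = np.column_stack([np.ones_like(x), x, y])
coef, *_ = np.linalg.lstsq(A, z, rcond=None)

# Kriging then models the residuals after the trend has been removed.
residuals = z - A @ coef
print("fitted plane coefficients:", coef)
print("residual mean/var:", residuals.mean(), residuals.var())
```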
Confirming that detrending was appropriate for this data, the default transformation approximation method has now changed to the Student's t-distribution because the data residuals are closer to a normal distribution.
Examining local variation
Continuing to explore whether the assumptions of stationarity are reasonable for the cesium-137 data, the local-scale variability can be evaluated using Voronoi polygons. Voronoi (or Thiessen) polygons show the areas of influence around each sample point.
The Voronoi Map tool helps investigate local data variability. Dividing the data into five categories using entropy or standard deviation shows some distinct areas of local variability in the northeast and south of the sampling region. These patterns are more evident with the log-transformed data; at a local scale, before detrending, there is clearly an issue with local stationarity in the data.
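The idea behind these Voronoi diagnostics can be sketched directly: treat two samples as neighbors when their Voronoi cells share an edge, then compute a local statistic over each point and its neighbors. A minimal Python sketch with hypothetical data (the tool's entropy option bins the values first; plain standard deviation is used here for simplicity):

```python
import numpy as np
from scipy.spatial import Voronoi

rng = np.random.default_rng(6)
xy = rng.uniform(0, 100, size=(150, 2))   # hypothetical sample locations
z = rng.lognormal(0.0, 1.0, size=150)     # hypothetical cesium-137 values

vor = Voronoi(xy)

# Build the natural-neighbor lists: two points are neighbors when their
# Voronoi cells share an edge (a "ridge").
neighbors = {i: set() for i in range(len(xy))}
for i, j in vor.ridge_points:
    neighbors[i].add(j)
    neighbors[j].add(i)

# Local standard deviation over each cell and its neighbors; under
# stationarity these values should not cluster strongly in space.
local_sd = np.array([np.std(z[list(neighbors[i] | {i})])
                     for i in range(len(xy))])
print("local SD range:", local_sd.min(), "to", local_sd.max())
```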
Predicting cesium-137 with kriging
Together with an understanding of the processes behind the data, this exploration has shown that the kriging assumptions can be met using appropriate options as part of carrying out kriging. Detrending the data and using the normal score transformation bring the data closer to meeting the assumptions of stationarity and normality required by kriging. Simple kriging offers both data detrending and data transformation options and may therefore provide a reasonable model; however, some evidence of varying local spatial dependence remains in the data. This, together with knowledge of processes that differed across the area, suggests that empirical Bayesian kriging (EBK) could be a more appropriate kriging model for this data. EBK uses local models to capture small-scale effects and, unlike other kriging models, fits multiple models across the region (Krivoruchko 2012). Using EBK, therefore, the areas that were known to differ in process and observed to differ in data structure will be modeled with appropriate local models. Since both approaches are valid, models should be built using both simple kriging and EBK, and validation diagnostics should be used to select the final model. Exploring the cesium data has helped you gain a better understanding of the data, select appropriate parameters, and identify the appropriate kriging model for prediction.
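For reference, EBK is also available as a geoprocessing tool, so the final model can be scripted once parameters have been chosen interactively. A minimal arcpy sketch, assuming the dataset and field names used in the workflow below; the transformation choice and output names are illustrative, not prescribed settings:

```python
import arcpy

# The Geostatistical Analyst extension must be licensed and enabled.
arcpy.CheckOutExtension("GeoStats")

# Hypothetical layer/output names; adjust paths and parameters to your data.
arcpy.ga.EmpiricalBayesianKriging(
    in_features="Cs137",
    z_field="CS137_CI_K",
    out_ga_layer="Cs137_EBK",
    out_raster="cs137_ebk_prediction",
    transformation_type="LOGEMPIRICAL",  # log-empirical option for skewed data
)
```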
Workflow using ArcGIS Desktop
The ArcGIS Geostatistical Analyst extension is required for this analysis; ensure that the extension is enabled.
Examining the distribution of data
- Use the Histogram tool under Explore Data (ESDA tools) with the sample values of cesium-137. Look for a bell-shaped distribution and check the reported diagnostics for the following:
- Similar mean and median values
- Skewness around 0
- Kurtosis value around 3
- Change to a log transformation and check the diagnostics again.
- Use the Normal QQPlot ESDA tool and use a log transformation again.
- For further data exploration, knowing that a lognormal transformation will likely produce more accurate predictions, it can be helpful to add a new field with log values. Use Add Field to create a double field and Calculate Field to populate it with the log of the cesium-137 values (see the arcpy sketch after this list).
- Open the Geostatistical wizard and select Kriging/CoKriging. Enter Cs137 as the source dataset and CS137_CI_K as the data field. On the next menu, ensure simple kriging is selected (the default).
- The multiplicative skewing method shows that seven modifiers are required to transform this data so that it closely resembles a standard normal distribution (Gribov and Krivoruchko 2012). Here, however, you can explore transformations to gain a deeper understanding using different data distributions. Use the AIC to judge the quality of the fitted model (the preferred model is the one with the minimum AIC value):
- Select the Student's t-distribution without modifiers (number of modifiers = 1), which is symmetric and bell-shaped, like the normal distribution. Try increasing the number of bins to 50 to better see the distribution.
- Apply the lognormal distribution to the data (number of modifiers = 1). This slightly improves the results, but note the AIC value. Look at the normal QQ plot by changing the tab at the bottom of the graph and see how closely your data fits along the line that represents the standard normal distribution.
- Return to step 1 in the Geostatistical wizard (still in simple kriging) and this time use your log-transformed data as the data field. Although this gives improved results, clearly the data cannot be described by one common theoretical distribution. Advanced modeling that combines the data detrending and transformation is required.
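For the Add Field and Calculate Field step referenced above, a minimal arcpy sketch (the new field name LOG_CS137 is a hypothetical choice; the dataset and source field names follow the wizard steps):

```python
import arcpy

# Add a double field and populate it with the log of the cesium-137 values.
arcpy.management.AddField("Cs137", "LOG_CS137", "DOUBLE")
arcpy.management.CalculateField(
    "Cs137", "LOG_CS137", "math.log(!CS137_CI_K!)", "PYTHON3"
)
```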
Exploring spatial structure
- The Semivariogram/Covariance Cloud can be used to explore spatial correlation. Spatial correlation exists where pairs of points that are close together have small differences and the squared difference increases with distance.
- Highlighting data points that do not exhibit spatial correlation allows you to isolate any values that may be data errors or simply problematic in model construction, such as data outliers.
- The views in the exploratory tools and ArcMap are linked, so selecting (brushing) will highlight data points on both the map and in the ESDA tool.
- Repeat the same process using the log-transformed data.
Looking for global trends
- The Trend Analysis tool in the Explore Data menu allows you to see your data in three dimensions. The sample points are plotted on the x,y plane and the value is shown by the height of a stick.
- Rotate the locations to help see if a strong trend is visible in any direction.
- Repeat the process using the lognormal cesium values. The trend is more evident.
- In the Geostatistical wizard, click Kriging/CoKriging and then click Simple Kriging and select the order of trend removal to be first.
- Change the slider values to explore possible detrended surfaces.
Examining local variation
- Open the Voronoi map (under Explore Data), change the type to Entropy, and use the region border to clip the surface.
- The map shows two areas, the northeast and the south, where spatial correlation differs. Using the log-transformed data makes this pattern more evident.
- Change the type to Standard Deviation for confirmation of varying spatial dependence in the data.
References
Krivoruchko, K. (2011). Spatial Statistical Data Analysis for GIS Users. ESRI Press.
Gribov, A., and K. Krivoruchko (2012). "New Flexible Non-parametric Data Transformation for Trans-Gaussian Kriging." Geostatistics Oslo 2012, Quantitative Geology and Geostatistics, Volume 17, Part 1, pp. 51–65, Springer Netherlands.
Krivoruchko, K. (2012). "Empirical Bayesian Kriging." ArcUser, Fall 2012. http://www.esri.com/news/arcuser/1012/empirical-byesian-kriging.html
Krivoruchko, K. (2012). "Modeling Contamination Using Empirical Bayesian Kriging." ArcUser, Fall 2012. http://www.esri.com/news/arcuser/1012/modeling-contamination-using-empirical-bayesian-kriging.html
The cesium data used in this case study is distributed with the first reference above.