
K-fold cross validation in GRASS GIS

Author: Paulo van Breugel
Updated on: 2017-06-01

Introduction

Model calibration is typically done by maximizing the model's performance on a set of training (also called calibration) data. During model calibration, you can get models to better fit the training data by increasing the complexity of the model. The problem is that this does not necessarily improve the predictive power of a model. On the contrary, additional complexity can lead to models that describe noise in the data instead of the underlying trends or relationships. This is called overfitting and reduces the model's ability to make predictions about an independent set of test data.

For model testing this means that using the same training data for model evaluation can be misleading and generally should be avoided. This is true even for models that are not overly complex; one can expect models to perform less well on a new data set than on the data set used for fitting / calibration. So if you have a number of models and want to find out which one performs best, you'll need to test each model's ability to generalize by evaluating its performance on a set of data not used for training (see here for an example).

A common technique to estimate the accuracy of a predictive model is cross-validation. There are different types of cross-validation, but they all partition the original sample data into a training set to calibrate the model, and a test set to evaluate it. One way to do this is k-fold cross-validation. In k-fold cross-validation, the original sample is randomly partitioned into a number of subsamples (let's say we have k subsamples) with an approximately equal number of records. Of these subsamples, a single subsample is retained as the validation data for testing the model, and the remaining subsamples are combined to be used as training data. The cross-validation process is then repeated as many times as there are subsamples (i.e., k times), with each of the subsamples used exactly once as the validation data (Table 1). The k evaluation results can then be averaged (or otherwise combined) to produce a single estimate. The advantage of this method is that all observations are used for both training and validation, and each observation is used for validation exactly once.
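
To make this concrete, here is a minimal, generic sketch of k-fold cross-validation in plain Python / Numpy. The fit and evaluate arguments are hypothetical placeholders for a model-fitting and a model-scoring function; they are not part of GRASS GIS or of the data used in this tutorial.

import numpy as np

def kfold_cv(x, y, k, fit, evaluate):
    # Randomly partition the record indices into k roughly equal folds
    idx = np.random.permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i, test in enumerate(folds):
        # All folds except fold i together form the training set
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(x[train], y[train])                   # hypothetical fit
        scores.append(evaluate(model, x[test], y[test]))  # hypothetical scorer
    return np.mean(scores)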

Table 1. Illustration of data partitioning in a 4-fold cross-validation, with training data used to train the model, and test data to validate the model. The estimated accuracy of the models can then be computed as the average accuracy across the 4 models.

         Subsample 1   Subsample 2   Subsample 3   Subsample 4
Run 1    test          train         train         train
Run 2    train         test          train         train
Run 3    train         train         test          train
Run 4    train         train         train         test

Functions for modeling and machine learning in e.g., R and Python's Scikit-learn often contain built-in cross-validation routines. But it is also fairly easy to build such a routine yourself. You can even do this in LibreOffice, although for more flexibility, you'll need to use a programming language like R or Python. In the example below, I will build a k-fold cross-validation routine in Python / pyGRASS in order to evaluate the predictive performance of two interpolation techniques: inverse distance weighting (IDW), a simple spatial interpolation technique, and bilinear spline interpolation with Tykhonov regularization (bspline).

There are various ways to run GRASS GIS commands. We'll use the PyGRASS API, which offers a convenient interface to GRASS GIS modules and makes it easy to combine Python and GRASS functions. See this wiki page for more information.
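
For readers new to PyGRASS, calling a GRASS module from Python is a matter of passing the module name and its parameters to the Module class. A minimal illustration (the imports used in the rest of this tutorial follow in the next code block):

from grass.pygrass.modules import Module

# Print the settings of the current computational region
# (the equivalent of running 'g.region -p' on the command line)
Module("g.region", flags="p")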

As the evaluation statistic, we will use the root mean square error (RMSE). The RMSE of $n$ predicted values ($\hat{y}_i$) for observations of a dependent variable ($y_i$) is computed as the square root of the mean of the squared deviations:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$
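
In Numpy this boils down to a one-liner. As a small convenience sketch (my own helper function, not one that is used later in this tutorial):

import numpy as np

def rmse(observed, predicted):
    # Square root of the mean of the squared deviations
    return np.sqrt(np.mean((np.asarray(predicted) -
                            np.asarray(observed)) ** 2))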

An example data set

In this example we'll use the IDW and Bspline techniques to estimate the elevation in North Carolina (USA). As base layer we'll use the raster layer with the elevation in North Carolina (elev_state_500m) from the NC sample data. You can download the sample data set here.

For this exercise we are going to assume we only have information about the elevation at 150 randomly selected locations. We will use these points to estimate the elevation at other locations in the state. So effectively, we are going to use the point data to 'reconstruct' the elevation map.

We use the v.random function to create these points, restricted to the state boundary. The corresponding values of the elevation raster are written to the attribute table later on with the v.what.rast function (see the next section).

# Load required modules
from grass.pygrass.modules import Module

# Set region and mask
Module("g.region", raster="elev_state_500m")
Module("r.mask", raster="elev_state_500m@PERMANENT")

# Create a point layer with 150 random points within the state boundary. The
# column 'rnd' receives a random value between 0 and zmax; it is not used
# further in this tutorial
Module("v.random", output="samplepoints", npoints=150, zmax=1, column="rnd",
       restrict="nc_state@PERMANENT", seed=26)

To avoid repeating the same lines over and over again, I assume that all code in this tutorial is run in the same session. So once a module is imported in Python, we do not have to import it again.

Inverse Distance Weighting

Using IDW to create an interpolation surface of the elevation

The inverse distance weighting (IDW) technique is a method in which cell values are estimated as a weighted average of the known values of neighbouring points. The weight of a point is computed as the inverse of the distance to that point, raised to a user-defined power. This means that the closer a point is to the centre of the cell being estimated, the more influence, or weight, it has in the averaging process. The method provides robust estimates and, even though it is based on simple assumptions, may sometimes perform better than e.g., kriging (Babak & Deutsch 2009). We use the GRASS GIS function v.surf.idw (Shapiro & Kelly 2015). As parameter settings, we set the number of interpolation points to the 12 nearest neighbours and the power parameter (weighting factor) to 2.
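
To make the weighting explicit, here is a small stand-alone sketch of the IDW estimate for a single cell. This illustrates the formula only; it is not how v.surf.idw is implemented internally.

import numpy as np

def idw_estimate(distances, values, power=2):
    # Weights are the inverse of the distance raised to the power parameter
    weights = 1.0 / np.asarray(distances, dtype=float) ** power
    return np.sum(weights * np.asarray(values)) / np.sum(weights)

# Example: three known points at 1, 2 and 4 map units from the cell centre
print(idw_estimate([1, 2, 4], [100.0, 120.0, 150.0]))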

# Load required modules
from subprocess import PIPE

# Sample the elevation at the point locations; the values are written to the
# column 'calibration_obs' of the attribute table
Module("v.what.rast", map="samplepoints", raster="elev_state_500m@PERMANENT",
       column="calibration_obs")

# Interpolate the elevation surface and assign the same color table as the
# original elevation map
Module("v.surf.idw", input="samplepoints", column="calibration_obs",
       npoints=12, power=2, output="elev_idw")
Module("r.colors", map="elev_idw", raster="elev_state_500m")

# Compute the difference between the original and estimated elevation
Module("r.mapcalc", expression="absdiff_elev_idw = elev_state_500m - elev_idw")
Module("r.colors", map="absdiff_elev_idw", color="differences")

# Compute the relative difference as a fraction of the original values
expression = "reldiff_elev_idw = (elev_state_500m - elev_idw) / elev_state_500m"
Module("r.mapcalc", expression=expression)

# Define color scheme
Module("r.colors", map="reldiff_elev_idw", rules='-',
       stdin_=('0% #d53e4f\n-1 #fc8d59\n-0.5 #fee08b\n0 white\n0.5 #e6f598'
       '\n1 #99d594\n100% #3288bd\nend'))

Simple validation

A visual comparison of the original and estimated elevation maps (Figure 1) shows that the absolute differences between the original and estimated values are largest at higher altitudes. These differences, over 1 kilometer in some locations, can be explained by the steep slopes in the mountain region. To improve on this result, one could therefore decide to take more sample points in the mountainous regions. However, note that although small in absolute terms, in relative terms the differences are largest in the coastal region. These differences may have important repercussions; a difference of 1 meter may make the difference between dry or wet feet at times of high water.


Figure 1. A) Elevation map of North Carolina. B) Elevation estimation based on inverse distance weighting interpolation of the elevation at 150 random sample points. C) Residual map with the differences between A and B. D) Relative differences between A and B, computed as (A-B)/A. Maps C and D are overlaid with the 150 sample locations. See here for the shell script used to create this figure.

For a global evaluation of the model performance, we use the root mean square error (RMSE). We could use the full actual and estimated elevation maps as input. However, normally you would only have the actual and estimated elevation at the sample point locations, so that is what we are going to use here as well. To get these values, we use the GRASS v.what.rast function. This function writes the raster values to the attribute table of the point layer (in the example below, the actual and estimated values are in the columns calibration_obs and calibration_idw respectively).

# Get estimated elevation values at sample point locations
# (we already have the actual values in the column 'calibration_obs')
Module("v.what.rast", map="samplepoints", raster="elev_idw",
       column="calibration_idw")

Next, we import the values in the two columns as a Numpy array. The first step is to extract the two relevant columns from the attribute table of the samplepoints vector layer with db.select, which gives us a text string. This string is then converted to a 2D Numpy array, which is subsequently used to compute the RMSE (see np.diff, np.mean and np.sqrt for more information).

# Load required modules
import numpy as np

# Import the values into a Numpy array
sqlstat = "SELECT calibration_obs,calibration_idw FROM samplepoints"
stats = Module("db.select", flags="c", sql=sqlstat,
               stdout_=PIPE).outputs.stdout
stats = stats.replace("\n", "|")[:-1].split("|")
stats = (np.asarray([float(x) for x in stats], dtype="float").
         reshape(len(stats) // 2, 2))

# Compute the root mean square error
rmse = np.sqrt(np.mean(np.diff(stats, axis=1)**2))

print(rmse)
0.0484

In this case we have an RMSE of approx. 0.05. This is extremely small considering the elevation ranges from 0 to 1952 meters. So does this mean this technique results in highly accurate estimates? Not really; the technique results in very accurate estimates at the calibration points, but not necessarily at other locations. As is already evident in Figure 1 C and D, a high model accuracy does not necessarily equal a high predictive power.

Cross validation - explaining the steps

As argued earlier, evaluating a model based on the calibration points may inflate your evaluation statistics. A better approach to evaluate the predictive performance is to use k-fold cross-validation. In the example below, we are going to carry out a 4-fold cross-validation. First, let's go through the individual steps.

The first step is to divide the point layer samplepoints into four groups using the v.kcv function. The function writes the group membership to a user-defined column in the attribute table. In the example below, this is the column part.

# Partition sample points in 4 groups
Module("v.kcv", map="samplepoints", npartitions=4, column="part")

The second step is to run the IDW using the points from groups 2-4 for calibration and the points from group 1 for validation (first row in Table 1). To do so, we use v.extract to extract all points belonging to groups 2-4 into a new layer tmpcalibration and all points belonging to group 1 into a new layer tmpvalidation.

# Note that the flag 'r' means that we make a reverse selection,
# i.e., we select all points where part is not 1
Module("v.extract", flags="r", input="samplepoints", output="tmpcalibration",
       where="part=1")
Module("v.extract", input="samplepoints", output="tmpvalidation",
       where="part=1")

Next, we run v.surf.idw using the tmpcalibration point data set as input.

# Create interpolation raster based on IDW with points from group 2-4 as input
Module("v.surf.idw", input="tmpcalibration", column="calibration_obs",
       npoints=12, power=2, output="elev_idw2")

Now we compute the RMSE based on the prediction errors, i.e., the differences between the observed and estimated elevation at the validation point locations (given by the tmpvalidation point layer).

The individual differences between the actual observed and estimated values are called residuals when the calculations are performed over the data sample that was used for estimation (previous example). They are called prediction errors when computed out-of-sample (this example).

# Get actual and estimated values at validation point locations
Module("v.what.rast", map="tmpvalidation", raster="elev_state_500m",
       column="actual")
Module("v.what.rast", map="tmpvalidation", raster="elev_idw2", column="idw")

# Import the values into a Numpy array
sqlstat = "SELECT actual,idw FROM tmpvalidation"
stats = Module("db.select", flags="c", sql=sqlstat,
               stdout_=PIPE).outputs.stdout
stats = stats.replace("\n", "|")[:-1].split("|")
stats = (np.asarray([float(x) for x in stats], dtype="float").
         reshape(len(stats) // 2, 2))

# Compute the root mean square error
rmse = np.sqrt(np.mean(np.diff(stats, axis=1)**2))

print(rmse)
75.3491565543

Cross validation - full example

We need to repeat the above for all other combinations of three subsamples, each time using the remaining subsample as validation data (following the scheme in Table 1).

# Define the list that will hold the RMSE values
rmse = []

for i in range(1, 5):

    # Create calibration and validation point data sets
    Module("v.extract", flags="r", input="samplepoints",
           output="tmpcalibration", where="part={}".format(i),
           overwrite=True)
    Module("v.extract", input="samplepoints", where="part={}".format(i),
           output="tmpvalidation", overwrite=True)

    # Run IDW
    Module("v.surf.idw", input="tmpcalibration", column="calibration_obs",
           npoints=12, power=2, output="tmp_cv", overwrite=True)

    # Get values for validation points
    Module("v.what.rast", map="tmpvalidation", raster="elev_state_500m",
           column="actual")
    Module("v.what.rast", map="tmpvalidation", raster="tmp_cv", column="idw")

    # Compute the RMSE of this fold
    stats = Module("db.select", sql="SELECT actual,idw FROM tmpvalidation",
                   flags="c", stdout_=PIPE).outputs.stdout
    stats = stats.replace("\n", "|")[:-1].split("|")
    stats = (np.asarray([float(x) for x in stats], dtype="float").
             reshape(len(stats) // 2, 2))
    rmse.append(np.sqrt(np.mean(np.diff(stats, axis=1)**2)))

results = ("RSME (mean +/- stddev) = {} ± {}".
           format(np.asarray(rsme).mean().round(2),
                  np.asarray(rsme).std().round(2)))
print results

RSME (mean +/- stddev) = 75.57 ± 29.012

The mean RMSE of 75.57 shows that the prediction errors are considerably larger than the residuals. This illustrates again that a high accuracy on the calibration data does not say much about the predictive power of the interpolation technique. The cross-validation statistic provides a convenient way to quickly evaluate the predictive power of the model or modelling technique. But remember, it is a summary statistic. Always make sure to carefully examine the residual map as well.

We can use the same approach to test the predictive power of other interpolation or modelling techniques. Comparing the results allows us to evaluate which technique performs best. Below, we'll run a 4-fold cross-validation to evaluate the bspline interpolation method.

Bilinear spline interpolation

Introduction

Bilinear spline interpolation is a method that estimates values using a mathematical function that minimizes overall surface curvature, resulting in a smooth surface that passes through, or with regularization close to, the input points. We use the GRASS GIS function v.surf.bspline, which computes a bilinear spline interpolation with Tykhonov regularization.

Parameters

The Tykhonov regularization parameter (lambda_i) acts to smooth the interpolation. With a small lambda_i, the interpolated surface closely follows the observation points; a larger value will produce a smoother interpolation. To estimate the lambda that produces an interpolation that best fits the original observation data, v.surf.bspline offers a "leave-one-out" cross validation procedure (see the manual page for more details). Running this gives us λ = 0.0057; a sketch of the call is given below.
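
The call could look something like the sketch below; check the v.surf.bspline manual page of your GRASS version for the exact flags and parameters.

# Leave-one-out cross validation (flag 'c') to find the Tykhonov
# regularization parameter; the lambda estimates are printed to the console
Module("v.surf.bspline", flags="c", input="samplepoints",
       column="calibration_obs", ew_step=74000, ns_step=74000)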

There is also the option to estimate the point density and distances, which can be used to determine the best length of the spline steps in the east-west and north-south directions (see the sketch below). The estimated mean distance between the points of our data set is 37 km. Based on this outcome, we'll use spline steps (ew_step and ns_step) of double that distance (74 km).
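
A sketch of how this estimate could be obtained (again, consult the manual page for the details of your GRASS version):

# Estimate point density and mean distance between points, then quit (flag 'e')
Module("v.surf.bspline", flags="e", input="samplepoints")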

Cross validation - full example

Below I'll give an example of running a cross-validation routine with v.surf.bspline. Note that we already have the samplepoints point layer, created earlier, with the column part that partitions the points into four groups.

# Define the list that will hold the RMSE values
rmse = []

# Start loop
for i in range(1, 5):

    # Create calibration and validation point data sets
    Module("v.extract", flags="r", input="samplepoints",
           output="tmpcalibration", where="part={}".format(i),
           overwrite=True)
    Module("v.extract", input="samplepoints", where="part={}".format(i),
           output="tmpvalidation", overwrite=True)

    # Run v.surf.bspline on the ith calibration set
    Module("v.surf.bspline", input="tmpcalibration", column="calibration_obs",
           ew_step=74000, ns_step=74000, method='bilinear', lambda_i=0.0057,
           solver='cholesky', raster_output="tmp_cv", overwrite=True)

    # Get values for validation points
    Module("v.what.rast", map="tmpvalidation", raster="elev_state_500m",
           column="actual")
    Module("v.what.rast", map="tmpvalidation", raster="tmp_cv",
           column="bspline")

    # Compute the RMSE of this fold
    stats = Module("db.select", sql="SELECT actual,bspline FROM tmpvalidation",
                   flags="c", stdout_=PIPE).outputs.stdout
    stats = stats.replace("\n", "|")[:-1].split("|")
    stats = (np.asarray([float(x) for x in stats], dtype="float").
             reshape(len(stats) // 2, 2))
    rmse.append(np.sqrt(np.mean(np.diff(stats, axis=1)**2)))

results = ("RMSE (mean ± stddev) = {} ± {}".
           format(np.asarray(rmse).mean().round(2),
                  np.asarray(rmse).std().round(2)))
print(results)

RMSE (mean ± stddev) = 114.2 ± 9.72

The RMSE is higher than we got with IDW (114 compared to 76). This suggests that the IDW method provides better results than the bspline method in this example. On the other hand, the bspline RMSE is more consistent between the cross-validation runs (smaller standard deviation). This may suggest that IDW is more sensitive to the location (and number?) of the sample points used. To find out, we could repeat the cross-validation several times, each time using different sample points as input. A possible starting point is sketched below, but I'll leave the full analysis to the interested reader ;-)
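
Such a repeat run could be sketched as follows. Note that run_4fold_cv() is a hypothetical helper that would wrap the 4-fold loop shown earlier and return the mean RMSE; it is not defined in this tutorial.

# Repeat the cross-validation, each time with a fresh set of random points
mean_rmse = []
for seed in (1, 2, 3, 4, 5):
    Module("v.random", output="samplepoints", npoints=150, zmax=1,
           column="rnd", restrict="nc_state@PERMANENT", seed=seed,
           overwrite=True)
    Module("v.what.rast", map="samplepoints",
           raster="elev_state_500m@PERMANENT", column="calibration_obs")
    Module("v.kcv", map="samplepoints", npartitions=4, column="part")
    mean_rmse.append(run_4fold_cv())  # hypothetical helper, see text above
print("mean = {}, stddev = {}".format(np.mean(mean_rmse), np.std(mean_rmse)))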

Comparison based on full interpolation results

As input for the computation of the root mean square error (RMSE), we used the actual (observed) and estimated elevation at the sample point locations. However, in this specific case, we have the observed and estimated (interpolated) values for all raster cells. This means we can also use the full data set to evaluate the predictive performance of IDW and bspline.

v.surf.bspline

# v.surf.bspline

# Create an interpolation map using all sample points
Module("v.surf.bspline", input="samplepoints", column="calibration_obs",
       ew_step=74000, ns_step=74000, method='bilinear', lambda_i=0.0057,
       solver='cholesky', raster_output="elev_bspline")

# Evaluation using the observed and predicted raster layers: first compute
# the squared differences, then derive the RMSE from the r.univar statistics
# as sqrt(sum of squared differences / number of cells)
expr = "tmpmap1 = exp(elev_bspline - elev_state_500m@PERMANENT, 2)"
Module("r.mapcalc", expression=expr, overwrite=True)
outp = Module("r.univar", map="tmpmap1", flags="g",
              stdout_=PIPE).outputs.stdout
outp = dict(z.split('=') for z in outp.split('\n') if z)
RMSE_bspline = np.sqrt(float(outp['sum']) / float(outp['n']))

print(RMSE_bspline)
104.052756907

v.surf.idw

# v.surf.idw (we already did this before, but let's run it again)

# Create an interpolation map using all sample points
Module("v.surf.idw", input="samplepoints", column="calibration_obs",
       npoints=12, power=2, output="elev_idw", overwrite=True)

# Evaluation using the observed and predicted raster layers, as above
expr = "tmpmap2 = exp(elev_idw - elev_state_500m@PERMANENT, 2)"
Module("r.mapcalc", expression=expr, overwrite=True)
outp = Module("r.univar", map="tmpmap2", flags="g",
              stdout_=PIPE).outputs.stdout
outp = dict(z.split('=') for z in outp.split('\n') if z)
RMSE_idw = np.sqrt(float(outp['sum']) / float(outp['n']))

print(RMSE_idw)
96.5215380443

Comparison

The different results are compared in the table below. The validation based on the full data set confirms that IDW performs better in this example. However, the differences are noticeably smaller than we found with the k-fold cross-validation (Table 2).

Table 2. RMSE of the 4-fold cross-validation (mean ± standard deviation) and of the evaluation based on the full data set.

           4-fold CV (mean ± sd)   Full data set
IDW        75.57 ± 29.01           96.52
Bspline    114.20 ± 9.72           104.05

To find out more about these differences, one could run the same analyses a number of times, using different sets of calibration points. Results may furthermore differ depending on the number of calibration points and how they are sampled (e.g., randomly or systematically). Hopefully this tutorial provides some pointers on how this can be done.

References

  • Babak, O., & Deutsch, C.V. 2009. Statistical approach to inverse distance interpolation. Stochastic Environmental Research and Risk Assessment 23: 543–553
  • Shapiro, M., & Kelly, P. 2015. v.surf.idw. GRASS Development Team.
  • Yao, X., Fu, B., Lü, Y., Sun, F., Wang, S., & Liu, M. 2013. Comparison of four spatial interpolation methods for estimating soil moisture in a complex terrain catchment. PLoS ONE 8: e54660. https://doi.org/10.1371/journal.pone.0054660

Tags


Software: @grassgis @pygrass @python @libreoffice
Subject: @interpolation @crossvalidation @idw @bspline