
boxplot of vector attribute data

Author: Paulo van Breugel
Updated on: 23-03-19

Introduction

A boxplot is a convenient way to get a quick overview of the distribution of your data, to compare the distributions of different datasets and to identify outliers in your dataset. The tutorial Plotting GRASS data in Python provides examples of how you can use Python scripts in GRASS GIS to create your own plots, including boxplots. But for a quick exploration of your data, having a dedicated tool would be much more convenient.

You can add such functionality by installing the d.vect.colbp addon. The addon, based on the d.vect.colhist addon, lets you draw a boxplot of the values in a vector map attribute column. Additional options let you select a subset of the values, group the values of the column according to the categories in a second column, rotate the plot and the x-axis labels, and include notches. As the main idea is to use this for quick exploration of your data, there are not many other options to change the appearance of the plot. However, you can save the plot as an SVG file and edit it further in, e.g., Inkscape.

Install the addon

First, you need to install the d.vect.colbp addon. You can do this using the wxGUI Extension Manager (Menu: Settings → Addon extensions → Install extensions from addons). Or you can install the addon from the command line using the g.extension module.

g.extension extension=d.vect.colbp

Create a boxplot

Let's see how this addon works. We'll use a vector layer from the Dutch Central Bureau of Statistics (CBS) with key demographic statistics per neighbourhood and municipality [source], with the province as additional information [source].

Let's say we want to know how the population is distributed across the neighbourhoods. The boxplot is one way to visualize the distribution (the histogram is the other, possibly better way, but bear with me), so let's create one.

d.vect.colbp map=buurten2018 column=AANT_INW

As you can see in the results, there is something wrong with the data (Figure 2A). As it turns out, the data includes the value -99999999, which codes for no data (yes, I know, always check the metadata first).


Figure 2: Boxplot of the number of inhabitants per neighbourhood. A) boxplot based on raw data, B) boxplot after removing the no-data code.

You can use the where parameter (an SQL WHERE condition) to select which records you want to include or, as in the example below, to exclude. Of course, you can also use the GUI (see this screenshot).

d.vect.colbp map=buurten2018@kerncijfers500m column=AANT_INW where="AANT_INW IS NOT -99999999"

This time the results look better (Figure 2B). And, for the curious ones, you can check that the no-data records are the water bodies (see the yellow areas in this screenshot).
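To see why a single no-data code can wreck a boxplot, here is a minimal Python sketch. The inhabitant counts are made up for illustration (they are not the CBS data), and the `summary` helper is hypothetical, not part of the addon. It computes the five-number summary with and without the -99999999 code.

```python
import statistics

NODATA = -99999999  # no-data code found in the attribute table

# Hypothetical inhabitant counts, made up for illustration
inhabitants = [120, 450, 980, 1500, 2300, NODATA, 760, NODATA, 3100, 640]


def summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


# Raw values: the no-data code becomes the minimum and stretches the axis
raw = summary(inhabitants)

# Filtered values: same idea as the where condition in the command above
clean = summary([v for v in inhabitants if v != NODATA])

print("raw:  ", raw)
print("clean:", clean)
```

With the no-data code included, the minimum is -99999999, so the actual distribution is squeezed into a thin line at the top of the plot, which is what Figure 2A shows.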

Creating grouped boxplots

Boxplots are particularly handy for comparing the distributions of different groups of observations. If the attribute table has a second column with the name or ID of these groups, you can use that column to create a grouped boxplot, i.e., one boxplot for each group. You do this using the group_by parameter.

d.vect.colbp -r map=buurten2018@kerncijfers500m column=AANT_INW \
where="AANT_INW >= 0" group_by=provincie

For the attentive reader, yes, I used a different where statement above. It does the same but is a bit shorter. Furthermore, I used the -r flag to rotate the x-axis labels; otherwise, they would not fit.


Figure 4: Distribution of the number of inhabitants per neighbourhood. Data source: CBS.

The results in Figure 4 show that the largest neighbourhoods (in terms of the number of inhabitants) can be found in the province of Zuid-Holland. Beyond the outliers, however, it is difficult to assess the differences between the provinces.
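Conceptually, grouping by a second column just means splitting the values by category before drawing one box per category. The Python sketch below illustrates that idea with made-up records (the numbers are not the CBS data, and this is not the addon's actual code). It also ranks the groups by their median, which is often easier to compare than the outliers.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (province, inhabitants) records, made up for illustration
records = [
    ("Drenthe", 300),
    ("Utrecht", 1200),
    ("Drenthe", 500),
    ("Zuid-Holland", 900),
    ("Utrecht", 1500),
    ("Zuid-Holland", 4000),
    ("Drenthe", 400),
]

# Split the values by category: one list of values per province
groups = defaultdict(list)
for province, count in records:
    groups[province].append(count)

# Rank the provinces by the median of their values, lowest first
ordered = sorted(groups, key=lambda name: median(groups[name]))

for name in ordered:
    print(name, median(groups[name]))
```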

Two options that can help to get a better idea of the differences are to leave out the outliers (use the -o flag) and to sort the boxplots in ascending (use order=ascending) or descending (use order=descending) order by their median. A third option is to use notched boxplots (use the -n flag). The notch displays an approximately 95% confidence interval around the median [1,2]. And, to make it easier to read the labels, we can use horizontal plots (use the -h flag). The command to draw the plot is given below. Alternatively, use the GUI (see the screenshot).

d.vect.colbp map=buurten2018@kerncijfers500m column=AANT_INW \
where="AANT_INW >= 0" group_by=provincie -h -n -o order=ascending

The resulting plot provides a quick overview of the differences between the provinces in terms of the distribution of the number of inhabitants per neighbourhood. Neighbourhoods in Noord-Holland are clearly larger than in other provinces. Utrecht and Zuid-Holland come in second and third place, but the differences are small. And if you prefer smaller neighbourhoods, you'll be better off in the northern provinces (Drenthe, Friesland, Groningen).


Figure 6: Distribution of the number of inhabitants per neighbourhood, ordered by the median. Data source: CBS.
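For those who wonder what the notches and outliers in such a plot represent, here is a rough Python sketch. It assumes the common conventions, which are also matplotlib's defaults: points beyond 1.5 times the interquartile range (IQR) from the box count as outliers, and the notch spans the median plus or minus 1.57 × IQR / √n. I have not checked that the addon uses exactly these settings, and the `boxplot_stats` helper and the input values are made up for illustration.

```python
import math
import statistics


def boxplot_stats(values):
    """Median, IQR, outliers and notch limits under the 1.5 x IQR convention."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1

    # Points beyond these fences are drawn as outliers
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]

    # Approximate 95% confidence interval around the median (the notch)
    half_width = 1.57 * iqr / math.sqrt(len(values))

    return {
        "median": med,
        "iqr": iqr,
        "outliers": outliers,
        "notch": (med - half_width, med + half_width),
    }


# Hypothetical values with one obvious outlier
values = [10, 12, 13, 15, 16, 18, 19, 21, 95]
stats = boxplot_stats(values)
print(stats)
```

If the notches of two boxes do not overlap, that is a rough indication that their medians differ.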

Another example

So far, we have used the absolute number of inhabitants per neighbourhood. In terms of quality of living, the population density might be a more important measure. Where can we find the neighbourhoods with the highest population densities? Let's see.

# Add columns
v.db.addcolumn map=buurten2018 \
columns="Area double precision,Density double precision"

# Compute the surface area per polygon (neighbourhood); the next step assumes m²
v.to.db map=buurten2018 option=area columns=Area

# Compute the population density (inhabitants per km²)
db.execute sql="UPDATE buurten2018 SET Density = AANT_INW / (Area / 1000000)"

# Plot the results as boxplots
d.vect.colbp -h -n -o map=buurten2018 column=Density where="AANT_INW >= 0" group_by=provincie order=ascending

The clear winners are Noord-Holland, Zuid-Holland and Utrecht. Although those who like a quieter environment might think otherwise, of course.


Figure 7: Distribution of population densities of neighbourhoods, ordered by the median.
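As a quick check of the arithmetic in the SQL statement above, here is the same calculation as a small Python sketch. It assumes the area is stored in square metres, so dividing by 1,000,000 gives square kilometres. The function name and the example numbers are made up for illustration.

```python
def density_per_km2(inhabitants, area_m2):
    """Inhabitants per km2, mirroring AANT_INW / (Area / 1000000)."""
    # Skip no-data codes (negative counts) and empty polygons
    if inhabitants < 0 or area_m2 <= 0:
        return None
    return inhabitants / (area_m2 / 1_000_000)


# A hypothetical neighbourhood: 1500 inhabitants on 250,000 m2 (0.25 km2)
print(density_per_km2(1500, 250_000))

# A no-data record yields None instead of a misleading negative density
print(density_per_km2(-99999999, 250_000))
```

Note that the SQL statement itself does not skip the no-data records. It is the where condition in the plotting command that keeps the resulting negative densities out of the plot.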

Afterword

GRASS GIS is a powerful tool for spatial analysis, which includes several tools to quickly explore raster data. For vector data, there are fewer data exploration tools available out of the box. But as this example shows, it is worth checking out the addons to see if somebody else has created something you can use.

If you are more into Python, the integration between Python and GRASS GIS is possibly even better. See for example this post on Plotting GRASS data in Python. Actually, many add-ons are written in Python, including the d.vect.colbp addon. So if you know how to plot data in Python, it will only be a very small step to make your own GRASS GIS addon.

If you have questions

If you have questions or comments about the text, let me know. You can use this contact form. Please make sure to include the page title ("boxplot of vector attribute data") or page name ("boxplot of vector attribute data").