Header

Plotting GRASS data in Python

Author: Paulo van Breugel
Last updated on: 26-04-17

Introduction

GRASS GIS offers some useful but basic plotting options for raster data. However, for plotting of data in attribute tables and for more advanced graphs, we need to use other software tools. Python offers various libraries for plotting data. The Matplotlib for example is a well-established and powerful plotting library that can be used to produce publication quality figures in a variety of hard-copy formats and interactive environments across platforms. It has a lot of flexibility to allow the user to completely customize the look of the output. The downside is that it has a fairly steep learning curve. A relative new package is the ggplot package. It is based on R's ggplot2 and the Grammar of Graphics so will be familiar to those coming from R. Other packages are e.g., Seaborn, Bokeh, pygal and plotly. Each offer interesting features which I will try to explore in later posts.

Here, we will look at Pandas plot(). Panda's plot function provides a useful wrapper around several matplotlib plotting routines for dataframes and series, allowing for quick and handy plotting of data frames. At the same time, you can (and sometimes will have to) use modifiers from matplotlib / pyplot to further tune your plot.

Based on two cases, we will explore some of the possibilities these two library offers and how this can be used together with GRASS GIS. We will be using a number of libraries / modules, including the module for Miscellaneous operating system interfaces (OS), GRASS Python scripting library (grass script), the Python Data Analysis Library (pandas), the pandas I/O API (pandas.io), matplotlib's plotting framework (pyplot), the numpy scientific computing package, the sqlite3 interface for SQLite databases and the tabulate package for printing tabular data. If you haven't installed these libaries yet, see the respective webpages for instructions.

# Load libraries
import os
import grass.script as gs
import pandas as pd
from pandas.io import sql
import matplotlib.pyplot as plt
import numpy as np
import sqlite3
from tabulate import tabulate

It is furthermore a good idea to set the working directory. This is where all the output files will be written to.

# Set working director
os.chdir("/home/paulo/")

Example 1. Distribution of bus route lengths

In the first example, we'll create a simple barplot visualizing the length of the bus routes of the Wolfline bus service of the NC State University. Each step is explained in some details. If you are more interested in a quick overview of the code used to create the final figure, we'll provide a summary of the code on github.

Data preparation

First we need to get the data. See this tutorial for a more detailed explanation of how the data was created and imported as a Pandas dataframe.

# Read in the attribute table
sqlpath = gs.read_command("db.databases", driver="sqlite").replace('\n', '')
con = sqlite3.connect(sqlpath)
sqlstat="SELECT ROUTE, sum(length) as totallength FROM busroutes_nc group by ROUTE"
df = pd.read_sql_query(sqlstat, con).dropna()
con.close()

Distances are in meters. Let's convert that to kilometers and add that to a new column in the table.

# Create a column with distances in km
df['Length (km)'] = df['totallength'] / 1000

We can use print the table to screen, but what if we want to show the table in a report or on a web page? With the tabulate package we can export the table in various formats including html and Latex. Below, we use the pipe option because this is the same format ZIM-wiki, the editor I use to write these posts, uses for tables.

# Print the table in pretty format
df2 = df.ix[:,(0,2)]
print tabulate(df2, headers=df2.columns, tablefmt="pipe", showindex=False,
               floatfmt=".2f")

Table 1. Bus rout lengths

ROUTE Length (km)
1 9.93
2 8.93
3 7.76
4 8.97
5 8.22
6 10.36
7 5.86
7a 5.94
8 8.45
9 8.54
10 7.21
11 8.69

Creating a basic barplot

The table provides a quick overview of the length of each of the routes. For the more visually inclined persons, we can also create a figure; a barplot in this case. With the data in a dataframe, this is very easy. In the line below, we use the data in column 'Length (km)' as input and define the kind of plot as 'bar'.

df['Length (km)'].plot(kind="bar")


Figure 1. Length of Wolfline bus routes of the NC State University

Tweaking the barplot - part 1

Useful for a quick overview, but it doesn't really look great. Luckily this is easy to remedy with a few small changes. First, we sort the data in ascending order.

df2 = df.ix[:,(0,2)]
df2 = df2.set_index('ROUTE')
df2 = df2.sort_values(by=['Length (km)'])

We'll furthermore add labels for the axes and change the color of the bars. And just to explore the different options, let's create a horizontal bar plot this time (this is done by setting kind='barh').

dfplot = df2.plot(kind='barh', legend=None, title='NC bus routes', color='green')
dfplot.set_ylabel('Routes')
dfplot.set_xlabel('Length (km)')

You can also combine the last two lines in one:

dfplot.set(ylabel='Routes', xlabel='Length (km)')


Figure 2. Length of Wolfline bus routes of the NC State University

Tweaking the barplot even further

Better, but still far from print / publication ready. A quick way to change the overall appearance of your plot is to set a different style. There are a number of pre-defined styles provided by matplotlib style package. One of these is the "ggplot" style, which emulates the aesthetics of ggplot2 (a popular plotting package for R). To use a style, you first need to import the matplotlib.pyplot module (done that already in the beginning of this tutorial), after which you can define the style as shown below.

plt.style.use('ggplot')

Now you can define the plot again. This time we stick with the vertical bars and we rotate the labels on the x-axis using the rot argument.

df2 = df.ix[:,(0,2)]
df2 = df2.set_index('ROUTE')
df2 = df2.sort_values(by=['Length (km)'])
dfplot = df2.plot(kind='bar', legend=None, title='NC bus routes', color='orange', rot=0)
dfplot.set(xlabel='Routes', ylabel='Length (km)')

To further improve the appearance of the plot we can also remove the upper/right axis tick marks.

dfplot.xaxis.set_ticks_position('bottom')
dfplot.yaxis.set_ticks_position('left')


Figure 3. Length of Wolfline bus routes of the NC State University

There are other nice style sheets and you can define your own. Run print(plt.style.available) for a list of available styles or check out this gallery. To further turn this plot into publication ready plot, you can play around with the options that matplotlib supplies. This falls outside the scope of this tutorial, but have a look at this tutorial, which provides an example of how far you can go to tune your images with matplotlib. You can download the code used to create Figure here.

Example 2. Schools and children

In our next example, we will look at the distance to schools in south-west Wake County in North Carolina and how these are distributed within the different municipals. The layers we'll use are the vector layers schools_wake, boundary_municp and boundary_county. The layers give the location of the schools, the municipal boundaries and the country boundaries respectively. In the example we'll run all commands from the Python console, but you can of course run the equivalent shell commands, or use the GUI interface. For an overview of the code used to create the final figure, see this github page.

Preparation of the layers

First step is to extract boundaries of Wake county (v.extract) and cut out municipalities that fall within the boundaries of Wake (v.overlay). Note that we are ignoring the fact that some of the municipalities extend beyond the counties boundaries. Also note that we use the parameter olayer in the v.overlay command to determine which attribute table is copied over. By setting the first number at 0, we are telling the function to not merge the attribute tables of the input layers. By setting the second number at 1 and third at 0, we are telling the function to transfer the attributes of the first input map (the municipality layer) to layer 1 of the new vector map and ignore the attributes of the WakeCounty boundary map.

# Cut out target layers
gs.run_command("v.extract", input="boundary_county", where="NAME='WAKE'",
               output="WakeCounty")
gs.run_command("v.overlay", ainput="boundary_municp", binput="WakeCounty",
               operator="and", output="WakeMunicp", olayer="0,1,0")

In the attribute table of WakeMunicp there is a column giving the surface area of the polygons in acres. I am from Europe, so I prefer the metric system. We will therefore compute the area in hectares using the v.to.db function.

gs.run_command("v.to.db", map="WakeMunicp", option="area", columns="AREA",
               units="hectares")

Now, let's see how many municipalities we have and how large they are. Like in the first example, we will use a sqlite query to extract the required data directly in aggregated form.

sqlpath = gs.read_command("db.databases", driver="sqlite").replace('\n', '')
con = sqlite3.connect(sqlpath, isolation_level=None)
sqlstat = ("SELECT TOWN_CODE as ID, MB_NAME as Municipality, sum(AREA) as "
           "'Area (ha)' FROM WakeMunicp GROUP BY MB_NAME")
df = pd.read_sql_query(sqlstat, con).dropna()

Now we can print the table out using the tabulate function. As before, I am using the 'pipe' format, but you can use any format that suits you best.

# Print the table in pretty format
print tabulate(df, headers=df.columns, tablefmt="latex", showindex=False,
	floatfmt=".2f")

Table 2. The municipalities and the area falling within the boundaries of Wake county.

ID Municipality Area (ha)
0501 Apex 3066.67
0503 Cary 12538.33
0507 Fuquay-Varina 2126.10
0508 Garner 3461.92
0510 Holly Springs 2856.32
0512 Knightdale 838.28
0517 Morrisville 1745.42
0520 Raleigh 33283.20
0521 Rolesville 555.85
0527 Wake Forest 3264.31
0529 Wendell 530.02
0532 Zebulon 834.45

As you can see in the table, there is one municipality, Durham-Chapel Hill, with a very small area. A closer look at the map shows that this municipality lies almost completely in the neighboring county Durham. We will henceforth assume that this is a result of inaccuracy of the map's boundaries.

There are furthermore a number of polygons without municipality ID (I leave it up to you to figure out why these do not show up in Table 2). Checking those areas using e.g., openmaps shows that these are largely non-built up areas. We can use v.db.droprow to remove these features as well as the polygons classified as Durham. With g.rename we replace the original layer with the new layer.

# Remove municipality Durham and non-classified polygons, and
# read in the attribute table again as df
sqlstat = "MB_NAME='Durham' OR MB_NAME IS NULL"
gs.run_command("v.db.droprow", input="WakeMunicp", where=sqlstat,
            output="WakeMunicp2")
gs.run_command("g.rename", vector="WakeMunicp2,WakeMunicp", overwrite=True)

Creating a map

First let's create a map of the area using the afore mentioned three layers. If all you need is a quick overview, you may as well use the GUI tools. On the other hand, when you have to repeat the same procedure, the command line will save you time. It furthermore makes it easier for others to reproduce the map and allows you to combine it with other tools, e.g., for creating the color table as in the example below.

We will color the municipalities using the qualitative palet Paired-12 from ColorBrewer2. We can copy/paste the colors from the Colorbrewer2 website, but an easier way is to use the Python library Palettable. Palettable is a Python library that can supply color maps for matplotlib or other application, including all ColorBrewer2 palettes. Here we used it to create a list of tuples with the colors in the Paired-12 palette (line 2).

# Get the Paired-12 color palet
from palettable.colorbrewer.qualitative import Paired_12
cls = Paired_12.colors
csl = [":".join(map(str, x)) for x in cls]

We can write the Paired-12 colors to a new column (RGBcolor) in the attribute table of the WakeMunicp vector layer.

# Write color codes to a new column  in the WakeMunicp attribute table.
sqlstat = "ALTER TABLE WakeMunicp ADD RGBcolor varchar;"
sql.execute(sqlstat, con)
for i in xrange(len(df)):
    sql.execute("UPDATE WakeMunicp SET RGBcolor='{}' WHERE TOWN_CODE='{}';".
                format(csl[i], df['ID'].iloc[i]), con)
con.close()

We also write the color to a legend file that we will use later to construct the legend.

# Write legend file
lgt = "legend/area_curved"
with open("tmplegendfile", "w") as text_file:
    for i in xrange(len(df)):
        text_file.write("{}|{}|5|lf|none|{}|1|area||\n".
                        format(df.iloc[i, 1], lgt, csl[i]))

We can now plot the map using d.mon and d.vect. The map will be build up using the WakeCounty and WakeMunicp layers. For the color of the municipals, d.vect will use the colors in the RGBcolor column in WakeMunicp. Note that instead of printing the map to file, you can also print it to screen (see d.mon for details).

# Print out map
gs.run_command("g.region", vector="WakeCounty", res=100)
gs.run_command("d.mon", start="cairo", width=600, height=450,
			   output="mymap.png")
gs.run_command("d.frame", flags="c", frame="map1", at="0,100,30,100")
gs.run_command("d.vect", map="WakeCounty", fill_color="#fffef1",
			   color="53:53:53", flags="s")
gs.run_command("d.vect", map="WakeMunicp", rgb_column="RGBcolor", font="Sans",
			   label_size="10")

Last step is to add the legend (d.legend.vect), which is defined by the legend file we created before.

gs.run_command("d.frame", flags="c", frame="map2", at="5,100,0,29")
gs.run_command("d.legend.vect", flags="b", at="5,95", font="Roboto:Light",
			   title="Municipals", input="tmplegendfile", border_color="none",
			   title_fontsize=14, fontsize=12)
gs.run_command("d.mon", stop="cairo")

The map is shown below. We could have created the map using matplotlib basemap, but to be honest I haven't seen the point yet as GRASS provides all the tools I need. Perhaps something for a future post.

Figure 4. Map of the municipals of Wake county

Distances to school within the county

There are different ways to describe the distribution of the schools. Here we will look at the distribution and median of distances to the nearest school within the county. First we need to define the region of interest (g.region). We'll use the spatial extent of the layer WakeCounty. We also need to set the resolution of the region. In the example below, the resolution is set to 50 meter.

# Set region
gs.run_command("g.region", flags="a", vector="WakeCounty", res="50")

Next, we compute for each raster cell the distance to the nearest school. We use the r.grow.distance function to compute the as-the-crow-flies distance to each school. This function requires raster point data as input, so we first convert the vector point layer to a raster, and then run r.grow.distance.

# Convert school point layer to raster layer and compute distances to schools
gs.run_command("v.to.rast", input="schools_wake", output="schools", use="attr",
               attribute_column="cat")
gs.run_command("r.grow.distance", flags="m", input="schools",
               distance="dist2schools")

the above will result in an underestimation of the distance. For a method to compute actual 'travel distances', check out the r.cost function.

We are mostly interested in the distances within the urban areas, so we mask out all areas outside the municipal boundaries. We then use the function grass.script.array.read() to read in the raster. Note that this function respects the MASK, i.e., the values of raster cells outside the municipal boundaries will be set to NA (no data).

# Set mask
gs.run_command("r.mask", vector="WakeMunicp")

# Import raster layer as numpy array
import grass.script.array as garray
distmap = garray.array()
distmap.read("dist2schools", null=np.nan)

To convert the 2D array to a 1D numpy array we use the np.flatten() function. We furthermore remove all NA values. This is done as follows: the np.isnan() function tests element-wise for NaN and returns a boolean array (True for NaN and False otherwise). The ~ sign inverts this array, which can thus be used to select all elements in distmap that are not NaN.

# Convert from 2D to 1D array, remove nodata values and
# divide by 1000 to convert from meters to kilometers
distmap = distmap.flatten()
distmap = distmap[~np.isnan(distmap)] / 1000

Creating a histogram

Now we have the data, we can create a plot. However, first let's set a plot style and the figure size. The latter is done using pl.figure(). This function expects you to define the plot size in inches. By defining the dpi you can define the width and height in pixels as shown below.

If you use Spyder IDE and want to print the plot to file, you may first have to disable the interactive mode with plt.ioff(). If you want to enable it again, use plt.ion().

# Set the plot style and create new figure
plt.style.use('ggplot')

# Set size of figure in pixels
my_dpi=100
histn = plt.figure(figsize=(450/my_dpi, 350/my_dpi), dpi=my_dpi)

We will use matplotlib's hist function to create the histogram. We'll use 'stepfilled' as histogram type and set normed to True to get fractional instead of absolute numbers. Finally, we add titles for the X and Y axes.

# Define histogram
plt.hist(distmap, bins=50, histtype="stepfilled", color="grey", normed=True)
plt.xlabel("Distances (km)")
plt.ylabel("Fraction of cells")

It might also be useful to add the median as a vertical line.

vl = np.median(distmap)
plt.axvline(x=vl, color="red", linestyle="dashed", linewidth=2)

And, for the finishing touch, we can add an annotation with arrow to provide the reader with the value of the median.

txt = "median = {:.2f}".format(float(vl))
y = histn[0].max() - 0.05
x = vl + 0.1
plt.annotate(txt, xy=(x, y), xytext=(x + 1, y - 0.1),
			 arrowprops=dict(facecolor='black', shrink=0.05))

To be able to use the figure later, we can save the plot to file using plt.savefig() before closing it (plt.close()).

plt.savefig(filename='dist2school_county.png', bbox_inches='tight')
plt.close()


Figure 5. Distribution of distances to the nearest schools in Wake country

It should be noted that GRASS has two different tools to create histograms with one or more rasters as input. For a quick overview you probably want to use these tools. On the other hand, creating plots in Python offers the advantage of more flexibility and advanced plotting options (e.g., adding annotations like in the example above), and it is great for repetitive tasks.

Distances to schools per municipal

Above we have the distribution of distances to the nearest schools within the urban areas in Wake county. The original question, however, to what extent distances to schools in south-west Wake County in North Carolina differ between the municipals. To answer this question, we need to split the distance map per municipal. To do this, we first convert the vector layer WakeMunicp to a raster layer, using the columns OBJECTID and MB_NAME to define the attribute and attribute labels respectively.

# Convert the WakeMunicp map to raster layer.
gs.run_command("v.to.rast", input="WakeMunicp", output="WakeMunicp",
               use="attr", attribute_column="OBJECTID", type="area",
               label_column="MB_NAME")

We have now two raster layers, WakeMunicp and dist2schools, which for each raster cell the ID (and name) of the municipal and the distance to the nearest school respectively. We export the two layers as csv file using r.stats and read in the file as a Pandas dataframe. By setting the -n flag, we exclude raster cells where one or both layers have nodata (see the r.stats manual page for more details). This effectively means that we only include the raster cells that fall within the municipal boundaries.

# Export the maps as csv file, and read in as a dataframe
gs.run_command("r.stats", input=["WakeMunicp", "dist2schools"],
               output="tmp1.csv", separator="comma", flags="al1gn")
df3 = pd.read_csv("tmp1.csv", sep=",", header=None, usecols=[0,1,2,3,4])
df3.columns = ["x", "y", "ID", "municipals", "distance2school"]

Due to an issue reported here, you may end up with missing labels in the municipals column. If so, see below.

As an alternative to the approach above, we can also get the raster values using r.out.xyz (which is a wrapper of r.stats).

gs.run_command("r.out.xyz", input=["WakeMunicp", "dist2schools"],
               output="tmp2.csv", separator="comma")
df4 = pd.read_csv("tmp2.csv", sep=",", header=None)
df4.columns = ["x", "y", "ID", "distance2school"]

The table df4 is the same as the table df3 created earlier, except that r.out.zyz does not export raster labels, so we need to add these to df4. We do this by first importing a list of all ID_municp and the corresponding names of the Municipals, and link these df4.

sqlpath = gs.read_command("db.databases", driver="sqlite").replace('\n', '')
con = sqlite3.connect(sqlpath, isolation_level=None)
sqlstat = ("SELECT OBJECTID as ID, MB_NAME as Municipals FROM WakeMunicp "
           "GROUP BY OBJECTID")
df5 = pd.read_sql_query(sqlstat, con).dropna()

Now we can merge the two tables df4 and df5 using a left-join. This returns all records from the left table (df4), and the matched records from the right table (df5). The resulting table df6 should be the same as table df3.

df6 = pd.merge(df4, df5, on='ID', how='left')

Plotting the data

Good, now that we have a table with the distances to the schools and the municipal, we can create a boxplot showing for each municipal the distribution and median of distances from each raster cell within that municipal to the nearest school. Pandas offers different options to create boxplots. Boxplot can be drawn calling Series.plot.box() and DataFrame.plot.box(), or DataFrame.boxplot() to visualize the distribution of values within each column. Below, we are using the last one. The function allows automatic grouping of values in one or more column(s) based on categories in another column using the by parameter. Here we want to plot the distances to schools grouped by municipal.

df3.boxplot(column="distance2school", by="municipals", figsize=(4.5,4))
plt.savefig(filename='d2s_munic1.png', bbox_inches='tight')
plt.close()


Figure 6. Distribution of distances to nearest schools within each of Wake county's 12 municipals

Tweaking the image

Very easy, but not exactly good-looking, nor very clear with those overlapping labels. There are a number of ways we can further modify the boxplot. We can set the style (here we will use 'seaborn-bright') and we can set the size of the figure, like we did before, using the figsize parameter. We can furthermore use the rot parameter to rotate the x-axis labels and the patch_artist to created boxes with a fill color. In addition, we can change some properties of the different elements of the boxplot, like the fliers (flierprop parameter) and the boxprops parameter.

# Open figure and set style
plt.style.use('seaborn-bright')

# Define properties of the flier, boxes and median
flierprops = dict(marker='o', markerfacecolor='none', markersize=1)
boxprops = dict(linestyle='-', linewidth=1)
medianprops = dict(linestyle='-', linewidth=4)

# Create the boxplot, with as return type a dictionary
bp = df3.boxplot(column="distance2school", by="municipals",rot=90,
            figsize=(4.7, 4.3), notch=True, flierprops=flierprops,
            boxprops=boxprops, return_type="dict", patch_artist=True)

Unfortunately, not all properties can be changed this way, such as the colors of the box and median. A workaround is to specify the return_type as dict (as we did above). This will return the boxplot properties directly in a dictionary, which is indexed by each column that was plotted in the boxplot. I.e., the dictionary one or more dictionaries (one per plotted column), which in turn contains lists describing the different elements of the plot. In our example, we bp only contains one dictionary (distance2school) as we only used one column for the plotting.

> bp.keys()
['distance2school']

> bp['distance2school'].keys()
['boxes', 'fliers', 'medians', 'means', 'whiskers', 'caps']

Knowing this, we can assign colors and other properties directly to the different elements. See here, here, and here for details.

# boxplot style adjustments
[[item.set_color((0.2,0.2,0.2)) for item in bp[key]['medians']] for key in bp.keys()]
[[item.set_linewidth(2) for item in bp[key]['medians']] for key in bp.keys()]
[[item.set_color((0.6,0.6,0.6)) for item in bp[key]['fliers']] for key in bp.keys()]
[[item.set_color((0.6,0.6,0.6)) for item in bp[key]['whiskers']] for key in bp.keys()]
[[item.set_linestyle('--') for item in bp[key]['whiskers']] for key in bp.keys()]
[[item.set_linewidth(2) for item in bp[key]['whiskers']] for key in bp.keys()]
[[item.set_color((0.6,0.6,0.6)) for item in bp[key]['caps']] for key in bp.keys()]
[[item.set_color((0.3,0.3,0.3)) for item in bp[key]['boxes']] for key in bp.keys()]
[[item.set_facecolor((0.9,0.9,0.9)) for item in bp[key]['boxes']] for key in bp.keys()]

We already saw how we can get rid of the title and sub-title:

plt.title("")
plt.suptitle("")

For adjustment of the e.g., tick labels, we need the Axes object (this is what one would normally call the plot) of the current figure, which we get by calling gca().

the matlibplot terminology can be somewhat confusing. A brief explanation of the different terms can be found here.

# label adjustments
ax = plt.gca()

ax.set_ylabel("distance to schools (km)", fontsize=11)
ax.set_xlabel("Municipals", fontsize=11)
ax.tick_params(axis='y', labelsize=10)
ax.tick_params(axis='x', labelsize=10)

# Remove ticks from upper and right axes
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')

Due to the changes in e.g., labels, the plot may not fit completely within the figure area. The tight_layout() function will automatically adjust the plot parameters so that the plot fits in to the figure area.

plt.tight_layout()
plt.savefig(filename='d2s_municp.png')

We won't close the figure yet, as we are going to further change it below. This is actually a handy feature, you can keep on tweaking / changing the current figure, which makes it easy to save multiple versions to file (e.g., when you want to examine different versions).


Figure 7. Distribution of distances to the nearest schools within each of Wake county's 12 municipals

A bit more of tweaking

Now that we know how to change the individual elements of the figure, we can also assign separate colors to the boxes, for which we will use the same colors as for the map of the municipals. The attentive reader, yes, we already did define the fill color (box facet color) earlier. The code below will overwrite this.

# Get the colors
from palettable.colorbrewer.qualitative import Paired_12
cls = Paired_12.colors

# Assign colors to boxes
item.set_color((0.3,0.3,0.3)) for item in bp[key]['boxes' for key in bp.keys()]
for i, color in enumerate(cls):
	clr = tuple([float(x)/255 for x in color])
	bp['distance2school']['boxes'][i].set_facecolor(clr)

And let's save the plot again. This time we also close the plot.

plt.savefig(filename='d2s_municp_clr.png')
plt.close('all')


Figure 8. Distribution of distances to the nearest schools within each of Wake county's 12 municipals

In this overview of the distribution of distances in the 12 municipals of Wake county, we can easily pick out the municipal where overall distances to the nearest schools are highest (Holly Springs) and lowest (Knightdale). To make it easier to see the ranking of the different municipals, we can order the boxplots according to median distance of each municipal to the nearest school.

Ordered boxplots

There doesn’t seem to be an obvious way to sort pandas boxplots like this. However, I found a clever solution on stack exchange (explained in more detail here). This solution consists of two steps. First, we need to transform the dataframe by moving the values of each municipal in a separate column (essentially converting the long table df3 to a wide table).

dfw = pd.DataFrame({col:vals['distance2school'] for col, vals in df3.groupby('municipals')})

Next step is to sort the order of the columns based on their median value. We do this by creating a Pandas Series (meds) with the sorted median value of each column. We use the index of this frame to sort the columns in the dfw dataframe.

meds = dfw.median().sort_values()
dfw = dfw[meds.index]

The resulting dataframe can be used as input in the matplotlib/pyplot boxplot function. This function makes a box and whisker plot for each column in the input dataframe in the order the columns have in the dataframe.

# Set style and plot size
plt.style.use('seaborn-white')
fig, ax = plt.subplots(figsize=(4.8, 4.3))

# Create the plot
bp = dfw[meds.index].boxplot(rot=90, patch_artist=True, notch=True,
        return_type='dict')
plt.tight_layout()

# Save to file
plt.savefig(filename='d2s_municp_ordered.png')


Figure 9. Distribution of distances to the nearest schools within each of Wake county's 12 municipals, ordered by their median distance.

We can now further tweak the figure following the same steps as above. There are some differences, however, because the structure of the object ax returned by Matplotlib's boxplot command is different from that returned by Pandas boxplot command. First step, let's add labels, change the size of the tick labels and remove the ticks from the upper and right axes.

ax.set_ylabel("distance to schools (km)", fontsize=11)
ax.set_xlabel("Municipals", fontsize=11)
ax.tick_params(axis='y', labelsize=10)
ax.tick_params(axis='x', labelsize=10)
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')

We already got the colors for the boxes, but as a reminder, let's get them again. Next, we create a Dataframe(coltab) with three columns with the R, G and B color values. As index we use the names of the municipals. They are sorted in alphabetic order which corresponds to the order in which we assigned the colors to the municipals previously (Figure 4 and Figure 7).

# Get colors
from palettable.colorbrewer.qualitative import Paired_12
cls = Paired_12.colors

# Create Dataframe with R, G, B values
municp = pd.Series(df3['municipals'].unique()).sort_values()
coltab = pd.DataFrame(
		 data=[[float(x)/255 for x in item] for item in cls],
		 index=municp)

Now we can sort the rows in coltab using the index of the meds (which was the series with median distances of the municipals sorted in ascending order). Last step is to combine the R, G and B columns into a Series with RGB color codes.

coltab = coltab.reindex(meds.index)
colseries = coltab.apply(lambda x: (x[0], x[1], x[2]), axis=1)

Now all is ready to set the properties of the different plot elements. Note that here we do not use the list comprehension method we used earlier. Calling the standard matplotlib boxplot function returns a dictionary containing for each plot element (median, fliers, whiskers, caps, boxes) a list that describes the different characteristics (color, line width, line style, etc.). This dictionary output makes styling the boxplot easy.

plt.setp(bp['medians'], color=(0.2,0.2,0.2), linewidth=2, linestyle='-')
plt.setp(bp['fliers'], color=(0.8,0.8,0.8), marker='o', markersize=1)
plt.setp(bp['whiskers'], color=(0.6,0.6,0.6), linestyle='--', linewidth=2)
plt.setp(bp['caps'], color=(0.6,0.6,0.6))
plt.setp(bp['boxes'], color=(0.3,0.3,0.3), linewidth=1)

To assign the different colors to each of the boxes, we need to loop over each of the individual boxes and assign the corresponding color.

	for i, patch in enumerate(bp['boxes']):
		patch.set_facecolor(colseries[i])

Now, we tighten the margins and print the figure

plt.tight_layout()
plt.savefig(filename='d2s_municp_ordered2.png')
plt.close('all')


Figure 10. Distribution of distances to the nearest schools within each of Wake county's 12 municipals, ordered by their median distance.

Final notes

These examples show that with the Pandas plot() library is it easy to create plots using GRASS GIS vector or raster data. It is also fairly easy to do some simple customizations, but for more advanced tweaking of the plot, you will need to resort directly to mapplotlib's functions and methods, using its more object-based way to make the plots. Although potentially very powerfull, there is a steep learning curve. It took me some time to get a grasp of the basics. In part because of unfamiliarity with matplotlib's terminology (patch_artist isn't exactly the first term that came to my mind when I need to define the fill color of a boxplot) and partly because it took some time and effort to get used to the way the documentation is organized and functions are described. If I compare it with the R documentation, the Python documentation seems to focus more programmers / software developers whereas R's documentation is more targeted at the user (non-programmer). Of course, as a long-time R user, I am probably biased in that respect, and what is arguably most important is that like for R, there is an incredible amount of online information and examples produced by a very active user community, to help you quickly on your way.

Further reading