
GRASS and Pandas - from attribute table to pandas dataframe

Author: Paulo van Breugel
Updated on: 2017-04-07

Introduction

In this tutorial I show how to import the attribute table of a vector layer in a GRASS GIS database into a Pandas data frame. Pandas (short for Python Data Analysis Library) provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. For people familiar with R, the Pandas data frame is an object similar to the R data frame. Data frames resemble the most common way in which spreadsheets are used: data in rectangular form, with columns holding variables and rows holding observations. An important characteristic is that a data frame, like a spreadsheet, can hold different types of data in different columns: numbers, character data, dates and so on.
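To make this concrete, here is a minimal sketch of a Pandas data frame holding a different data type in each column. The route data below is made up for illustration and is not part of the North Carolina data set:

```python
import pandas as pd

# A small data frame with a different data type in each column
df = pd.DataFrame({
    "route": ["1", "7a", "10"],          # character data
    "length": [9926.9, 5942.1, 7209.3],  # floating-point numbers
    "express": [False, True, False],     # booleans
})

# One dtype per column: object, float64, bool
print(df.dtypes)
```

Printing df.dtypes shows that each column keeps its own data type, just like the columns of a spreadsheet.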

I use the North Carolina data set, a demo data set in GRASS GIS format that can be downloaded from the GRASS GIS website. I use the vector layer with the bus routes of the NC State University bus service Wolfline (busroutesall) as an example. I first compute the length of the line segments of each route and upload the values to the attribute table of the vector layer. Next, I import the attribute table into a Pandas data frame and compute the total length per route. I show how this can be done using different approaches.


Routes of the NC State University bus service Wolfline (red lines). Background layer is OpenStreetMap (OSM: provider OSM-WMS Uni Heidelberg).

Compute the route lengths

The next screenshots show how you can compute the length of each of the route segments. The values are written to a new column in the attribute table of the vector layer with the bus routes. The same can be done (arguably more conveniently) on the command line or in Python, as shown further below.

Using the GUI


Step 1: Copy the original layer 'busroutesall' to 'busroutes_nc' using g.copy.


Step 2: Add a column 'length' to the attribute table of busroutes_nc using v.db.addcolumn.


Step 3: Upload the length of the segments to the newly created column using v.to.db.

On the command line

The three steps shown above can also be carried out on the command line:

g.copy vector=busroutesall,busroutes_nc
v.db.addcolumn map=busroutes_nc columns="length double precision"
v.to.db map=busroutes_nc@PERMANENT option=length columns=length

Using Python

If you are writing a Python script, you may want to implement the steps above as part of it. GRASS GIS offers different APIs (application programming interfaces) that provide direct access to the many functions of GRASS GIS. Below I use the GRASS Python Scripting Library. More information on the use of GRASS GIS and Python can be found here. If you are new to using Python scripts in GRASS, make sure to check that out first. Note that you need to run a GRASS Python script from inside a GRASS session, i.e. either from the command line or from the Python shell embedded in the wxGUI.

In the script below, the required library, the grass.script library, is imported first. Next are steps 1-3 above, this time as a Python script. If you are familiar with the rgrass7 package in R, you'll notice that the syntax is very similar.

# import libraries
import grass.script as gs
import pandas as pd
import sqlite3

# Compute length of route segments.
gs.run_command("g.copy", vector=["busroutesall", "busroutes_nc"])
gs.run_command("v.db.addcolumn", map="busroutes_nc", columns="length double precision")
gs.run_command("v.to.db", map="busroutes_nc", option="length", columns="length")

Examining the results

You can use the Attribute Table Manager to inspect the results. As you will see, you now have a new column which holds, for each line segment, its length (in meters). The problem is that we do not want the length per segment, but per route. So what we need to do is aggregate the data, summing up the lengths of the segments of each route. This is what we'll do in the next section.

Import the attribute table in Pandas

Now, let's read the attribute table into a Pandas data frame. Again, the first step is to import the required libraries, which in this case include the grass.script library as well as the pandas and sqlite3 libraries.

Next, you use the convenient command db.databases to get the path to the SQLite database that holds the attribute table. You could of course provide the path directly (in this case that would be ~/data/GRASSdb/nc_spm_08_grass7/PERMANENT/sqlite), but this is much easier. You then use that path to connect to the database (the sqlite3.connect call below).

Now, you can use the read_sql_query function to aggregate the data and read in the results using an SQL query. The advantage, especially if you are dealing with very large data tables, is that the data is aggregated in the SQLite database first, and you only read in the results. When you are done, make sure to close the connection to the database.

# import libraries
import grass.script as gs
import pandas as pd
import sqlite3

# Read in the attribute table
sqlpath = gs.read_command("db.databases", driver="sqlite").replace('\n', '')
con = sqlite3.connect(sqlpath)
sqlstat = "SELECT ROUTE, sum(length) AS totallength FROM busroutes_nc GROUP BY ROUTE"
df = pd.read_sql_query(sqlstat, con)
con.close()
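Because the code above needs a running GRASS session, here is a self-contained sketch of the same pattern using an in-memory SQLite database. The segment data is made up, but the table name and columns mirror the busroutes_nc attribute table:

```python
import sqlite3

import pandas as pd

# Build a small in-memory table that mimics the busroutes_nc attribute table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE busroutes_nc (cat INTEGER, ROUTE TEXT, length REAL)")
con.executemany(
    "INSERT INTO busroutes_nc VALUES (?, ?, ?)",
    [(1, "1", 5000.0), (2, "1", 4926.9), (3, "10", 7209.3)],
)

# Aggregate in SQLite first, then read only the summed lengths into Pandas
sqlstat = "SELECT ROUTE, sum(length) AS totallength FROM busroutes_nc GROUP BY ROUTE"
df = pd.read_sql_query(sqlstat, con)
con.close()
print(df)
```

The resulting data frame has one row per route, with the summed segment lengths in the totallength column.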

Examine the data

To check the data types of the columns in your data frame, use the dtypes attribute. This shows that the data types of the columns ROUTE and totallength are object and float64, respectively. Note that the data type object is assigned to a column that holds mixed types (numbers and strings). The shape attribute shows the dimensions of the table.

> df.dtypes

Out[28]:
ROUTE           object
totallength    float64
dtype: object

> df.shape

Out[5]: (13, 2)

So let's have a look at the data. You can view the data by printing it to the console with the print function, or, if you want to show the first few lines only, with the head method. In Spyder you can also open the data frame in a separate window as an editable table. If you examine the whole table, you will see that the names of the routes in the column ROUTE indeed consist of both numbers and strings.

> df.head()

Out[7]:
  ROUTE  totallength
0  None   534.347153
1     1  9926.863490
2    10  7209.341378
3    11  8686.935003
4     2  8933.922698

As you can see, there are one or more segments without a bus route number. We can use the dropna() function in Pandas to remove all rows with a missing value (NaN or None) in at least one of the columns. You can now use the data, e.g., for further analysis, or plot it.

> df = df.dropna()
> df.head()

Out[9]:
  ROUTE  totallength
1     1  9926.863490
2    10  7209.341378
3    11  8686.935003
4     2  8933.922698
5     3  7756.842336

Alternative approaches

The approach above is not the only one possible. You can, for example, also use the GRASS function db.select to query the database. A disadvantage of this approach is that it takes a bit more effort to get the data into a data frame. But if you need the data as a list rather than a data frame, this approach may be preferable. For GRASS GIS addons, for example, you may not want to introduce an extra dependency on Pandas.

# Aggregate data using SQL group by and read the result into Python
import numpy as np

sqlstat = "SELECT ROUTE, sum(length) AS totallength FROM busroutes_nc GROUP BY ROUTE"
out = gs.read_command("db.select", sql=sqlstat)

# Convert the pipe-separated string into a list of lists; strip() removes
# the trailing newline that would otherwise yield an empty row
rows = [z.split('|') for z in out.strip().split('\n')]
colnames = rows.pop(0)

# Convert the list to a Pandas data frame and remove rows with empty / nodata
# values. Empty strings are replaced by NaN first, so dropna() can then
# remove those rows.
df = pd.DataFrame(rows, columns=colnames)
df['ROUTE'] = df['ROUTE'].replace('', np.nan)
df = df.dropna()
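The string handling above is easier to see with a self-contained sketch. The pipe-separated string below is made up, mimicking the output format of db.select:

```python
import numpy as np
import pandas as pd

# db.select returns one pipe-separated string; mimic that with toy data
out = "ROUTE|totallength\n1|9926.86\n|534.35\n10|7209.34\n"

# First line holds the column names; the rest are data rows
rows = [z.split("|") for z in out.strip().split("\n")]
colnames = rows.pop(0)
df = pd.DataFrame(rows, columns=colnames)

# Empty strings are not seen as missing values, so replace them by NaN first
df["ROUTE"] = df["ROUTE"].replace("", np.nan)
df = df.dropna()
print(df)
```

Note that all values read this way are strings; convert numeric columns with, e.g., pd.to_numeric if you need to compute with them.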

A second alternative is to read in the whole table using the read_sql_query function, and then use the Pandas pivot_table function to aggregate the values in the column 'length' by ROUTE. Although it makes less sense in this particular case, you may for example use this approach when you need to do a whole range of different computations on the attribute table. The pivot_table function is ideal for number crunching and data aggregation, offering similar (and more) functionality to the pivot tables in Excel and LibreOffice.

# Read in the attribute table
sqlpath = gs.read_command("db.databases", driver="sqlite").replace('\n', '')
con = sqlite3.connect(sqlpath)
sqlstat = "SELECT * FROM busroutes_nc"
df = pd.read_sql_query(sqlstat, con)
con.close()

# Remove rows with nodata and aggregate the lengths per route
df = df.dropna()
df = df.pivot_table(index='ROUTE', values='length', aggfunc="sum")
print(df)

             length
ROUTE
1       9926.863490
10      7209.341378
11      8686.935003
2       8933.922698
3       7756.842336
4       8972.997628
5       8216.070478
6      10359.835495
7       5858.153859
7a      5942.113479
8       8453.168063
9       8542.797914
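As a self-contained illustration of the pivot_table pattern, here is the same aggregation on made-up segment data (not the actual Wolfline routes):

```python
import pandas as pd

# Toy attribute data standing in for the busroutes_nc table
df = pd.DataFrame({
    "ROUTE": ["1", "1", "10", "10", "10"],
    "length": [5000.0, 4926.9, 2000.0, 2209.3, 3000.0],
})

# Sum the segment lengths per route, as a spreadsheet pivot table would;
# other aggfunc values ("mean", "count", ...) work the same way
totals = df.pivot_table(index="ROUTE", values="length", aggfunc="sum")
print(totals)
```

The result is a data frame indexed by ROUTE, with one aggregated length value per route.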

Other sources

Besides the various links in this post, make sure to check out this very nice tutorial on using Pandas data frames by DataCamp. And see this tutorial by Data Carpentry if you want to know more about how to deal with data types and formats in Python.