Author: Paulo van Breugel
Updated on: 11-08-18
In this tutorial I show how to import an attribute table of a vector layer in a GRASS GIS database into a Pandas data frame. Pandas stands for Python Data Analysis Library which provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. For people familiar with R, the Pandas data frame is an object similar to the R data frame. They are a lot like the most common way in which spreadsheets are used, with the data presented in rectangular form with columns holding variables and rows holding observations. An important characteristic is that the data frame, like a spreadsheet, can hold different types of data in different columns: numbers, character data, dates and so on.
I use the North Carolina data set, which is a demo data set in GRASS GIS format that can be downloaded from the GRASS GIS website. I will use the vector layer with the bus routes of the NC State University bus service Wolfline (busroutesall) as example. I first compute the length of the line segments for each route and subsequently upload the values to the attribute table of the vector layer. Next, I import the attribute table into a Pandas data frame and compute the total length per route. I show how this can be done using different approaches.
Compute the route lengths
The next screenshots show how you can compute the length of each of the route segments (click on the images to open a larger version). The values will be written to a new column in the attribute table of the vector layer with bus routes. The same can be done (arguably more conveniently) using the command line or Python, as is subsequently shown.
Using the GUI
On the command line
The 3 steps shown above can also be carried out on the command line:
g.copy vector=busroutesall,busroutes_nc v.db.addcolumn map=busroutes_nc columns="length double precision" v.to.db map=busroutes_nc@PERMANENT option=length columns=length
If you are writing a python script, you may want to implement the steps above as part of your python script. GRASS GIS offers different API's (application programming interface) that provide direct access to the many functions of GRASS GIS. Below I use the GRASS Python Scripting Library. More information on the use of GRASS GIS and Python can be found here. If you are new to using Python scripts in GRASS, make sure to check it out first. Note that you need to run the GRASS Python script from inside a GRASS session, i.e. either from the command line or from the Python shell embedded in the wxGUI.
In the script below, first the required library is imported, viz. the grass.script library. Next are steps 1-3 above, but this time as a python script. For those of you familiar with rgrass7 addon in R, you'll notice that the syntax is very similar.
# import libraries import grass.script as gs import pandas as pd import sqlite3 # Compute length of route segments. gs.run_command("g.copy", vector=["busroutesall", "busroutes_nc"]) gs.run_command("v.db.addcolumn", map="busroutes_nc", columns="length double precision") gs.run_command("v.to.db", map="busroutes_nc", option="length", columns="length ")
Examining the results
You can use the Attribute Table Manager to inspect the results. As you will see, you have now a new column with holds for each line segment the length (in meters). The problem is that we do not want to have the length per segment, but per route. So what we need to do is to aggregate the data, summing up the lenght of each segments of each of the routes. This is what we'll do in the next section.
Import the attribute table in Pandas
Now, let's read in the attribute table into a Pandas data frame. Again, first step is to import the required libraries, which in this case include the grass.script library as well as the pandas and sqlite3 libraries.
Next, you use the convenient command db.databases to get the path to the sqlite database that holds the attribute table. You can of course provide the path directly (in this case that would be ~/data/GRASSdb/nc_spm_08_grass7/PERMANENT/sqlite), but this is much easier. You use the path to connect to the database (line 7-8 below).
Now, you can use the read_sql_query function to aggregate the data and read in the results using a SQL query. The advantage, especially if you are dealing with very large data tables, is that the data is aggregated in the SQLite database first, and you are only reading in the results. After you are ready, make sure to close the connection to the database.
# import libraries import grass.script as gs import pandas as pd import sqlite3 # Read in the attribute table sqlpath = gs.read_command("db.databases", driver="sqlite").replace('\n', '') con = sqlite3.connect(sqlpath) sqlstat="SELECT ROUTE, sum(length) as totallength FROM busroutes_nc group by ROUTE" df = pd.read_sql_query(sqlstat, con) con.close()
Examine the data
To check the data types of the columns in your dataframe, use the dtype function. This will show you that the data types for the columns cat, ROUTE and length are respectively integer, object and float. Note that the data type object is assigned to a column if it has mixed types (numbers and strings). The shape function will show you the dimensions of the table.
> df.dtypes Out: cat int64 ROUTE object length float64 dtype: object > df.shape Out: (13, 2)
So let's have a look at the data. You can view the data by printing the data to the console using the print command, or if you want to show a few lines only, the head command. In Spyder you can also open the dataframe in a separate screen as an editable table. If you examine the whole table you will see that the names of the routes in the column ROUTE indeed consist of both numbers and strings.
> df.head() Out: ROUTE totallength 0 None 534.347153 1 1 9926.863490 2 10 7209.341378 3 11 8686.935003 4 2 8933.922698
As you can see, there are one or more segments without busroute number. We can use the dropna() function in Pandas to remove all rows with in at least one of the column a missing value (this includes both NaN or None). You can now use the data, e.g., for further analysis or you can plot the data.
> df = df.dropna() > df.head() Out: ROUTE totallength 1 1 9926.863490 2 10 7209.341378 3 11 8686.935003 4 2 8933.922698 5 3 7756.842336
The approach above is not the only one possible. You can for example also use the GRASS function db.select to query the database. Disadvantage of this approach is that it requires a bit more effort to get the data into a Dataframe. But if you need the data as a list rather than Dataframe, this approach might be preferable. For example, for GRASS GIS addons, you may not want to introduce the extra dependency on Pandas.
# Aggregate data using SQL group by and read into Python sqlstat="SELECT ROUTE, sum(length) as totallength FROM busroutes_nc group by ROUTE" df = gs.read_command("db.select", sql=sqlstat) # Convert string in list df = [z.split('|') for z in df.split('\n')] colnames = df.pop(0) # Convert list to panda dataframe and remove rows with empty / nodata # Note that the replace statement replaces empty strings for nodata, # so we can then use dropna() to remove those rows. df = pd.DataFrame(df, columns=colnames) df['ROUTE'].replace('', pd.np.nan, inplace=True) df = df.dropna()
A second alternative is to read in the whole table using the read_sql_query function , and then use Pandas pivot_table function to aggregate the values in the column 'length' by ROUTE. Although it makes less sense in this particular case, you may for example use this approach when you need to do a whole range of different computations on the attribute table. The pivot_table function is ideal for number crunching and data aggregation, offering similar (and more) functionality as the pivot table in Excel and Libreoffice.
# Read in the attribute table sqlpath = gs.read_command("db.databases", driver="sqlite").replace('\n', '') con = sqlite3.connect(sqlpath) sqlstat="SELECT * FROM busroutes_nc" df = pd.read_sql_query(sqlstat, con) con.close() # Remove rows with nodata and aggregate data df = df.dropna() df.pivot_table(index='ROUTE', values='length', aggfunc="sum") print(df) ROUTE 1 9926.863490 10 7209.341378 11 8686.935003 2 8933.922698 3 7756.842336 4 8972.997628 5 8216.070478 6 10359.835495 7 5858.153859 7a 5942.113479 8 8453.168063 9 8542.797914 Name: length, dtype: float64
Besides the various links in this post, make sure to check out this very nice tutorial on using Panda's dataframe by Datacamp. And see this tutorial by Datacarpentry if you want to know more about how to deal with data types and formats in Python.
Software: @grassgis @python
Tools: @pandas @sqlite3 @db_databases
Subject: @data_import @modis
If you have questions
If you have questions or comments about the text, let me know. You can use this contact form. Please make sure to include the page title ("GRASS and Pandas - from attribute table to pandas dataframe") or page name ("grass-pandas-dataframe").