Use R to obtain gbif data

Author: Paulo van Breugel
Updated on: 2016-10-08



The Global Biodiversity Information Facility (GBIF) is an international open data infrastructure that allows anyone, anywhere to access data about all types of life on Earth, shared across national boundaries via the Internet. GBIF provides a single point of access through http://www.gbif.org/ to species records shared freely by hundreds of institutions worldwide. The data accessible through GBIF relate to evidence about more than 1.6 million species, collected over three centuries of natural history exploration and including current observations from citizen scientists, researchers and automated monitoring programs.

There are various ways to import GBIF data, including directly from the website as comma delimited file (csv) and using the v.in.gbif addon for GRASS. Here, however, we'll use the rgbif package for R to obtain the data. In the link section some tutorials are listed that illustrate the use of different R packages.

R / GRASS integration

We ultimately want to store the downloaded data in a GRASS GIS database. One can use GRASS GIS and R in conjunction in two different ways. One is to use R within a GRASS GIS session. This means that we start R (or RStudio) from the GRASS GIS command line. The other option is to use GRASS GIS within a R session. This means we connect to a GRASS GIS location/mapset from within R (or RStudio). For more details, see the GRASS GIS Wiki. The first approach is arguably the easiest, and is what we will be using below. If you just want to download and use the data in R, you can simply skip the parts about GRASS GIS.

An example - Kalmia latifolia

Mountain Laurel (Kalmia latifolia) is a mid-sized shrub that can be found in acidic forests, bluffs, bogs, along sandhill streams, and in a wide range of other habitats, nearly ubiquitous in the mountains, up to at least 1600m, while more restricted in habitat in the lower Piedmont and Coastal Plain (NC wildflower website). In this example, we will download the presence points of this species in the USA. All it really takes is 5 lines of code.

# Get the data and convert it in the right format,
# reproject it to the right projection, and export to GRASS GIS
KL <- occ_data(scientificName="Kalmia latifolia", hasCoordinate=TRUE,
		   country="US", hasGeospatialIssue=FALSE)$data
coordinates(KL) <- cbind(KL$decimalLongitude, KL$decimalLatitude)
proj4string(KL) = CRS("+init=epsg:4326")
KLnew <- spTransform(KL, getLocationProj())
writeVECT(SDF=KLnew[,c(1,2,5)], vname="KL3", driver="SQLite")

Of course, more work might be needed to check for inconsistencies or errors in the data. There are various packages in R that can assist you with this, but that is for another time.

Import GBIF data in R

Open R and load libraries

In this example we start GRASS GIS first and then started R (or Rstudio) from the GRASS GIS command line. After opening R, we load the libraries rgbif, maptools (for displaying a map of the data in R), the raster package, and rgrass7 (to interact with GRASS GIS).


Define the extent of the region of interest

The GBIF data set, from which we will download the data later, contains data for the whole world. If you are only interested in occurrence data in a specific region, you can define the bounding box of your area on interest, as in the first example below, or provide the name of the country, as in the second example. So, suppose we are interested in the occurrence data in North Carolina. The bounding box of the state is north: 36:36:30N, south: 33:36N, west: 84:20W, and east: 75:05W. We can set this extent in R by creating a vector with the minimum and maximum X-coordinates and the minimum and maximum Y-coordinates.

# Define extent
NCext <- c(-84.333, -75.08333, 33.6, 36.60833)

If you are running R from within a GRASS GIS session, you can also read in GRASS location metadata using the gmeta functions from the rgrass7 package. That will give you a list object with information about the location, mapset, region settings and projection.

NCmeta <- gmeta()

Now, first you need to get the names of the list elements that define the bounding box. As it turns out, the names of the list elements for the north, south, east and west coordinates are n, s, e and w. You can use this to extract them and define the extent.

# Get the names of the list objects

# Create a vector with the bounding box
NCext <- c(NCmeta$e, NCmeta$w, NCmeta$n, NCmeta$s)

An arguably easier way to get the bounding box is to use the helper function gmeta2grd. This function creates a GridTopology object from the current GRASS mapset region definitions. This can be converted to a SpatialGrid, from which the bounding box can be extracted using the bbox function.

NCext <- bbox(SpatialGrid(grid=gmeta2grd()))

If you want to check if the extent is correctly defined, you can plot the bounding box on top of the world maps. In the script below, in line 4, the NCext object is first converted to object of the class extent using the extent function from the raster package, and subsequently converted to a SpatialPolygon.

# Plot data
plot(wrld_simpl, col = "white", axes = T)
NCbbox <- as(extent(NCext), "SpatialPolygons")
plot(NCbbox, col = "red", add=TRUE)

The map below (click to enlarge) shows in red the bounding box that was defined above.

Import the species point data in R using rgbif{rgbif}

OK, so now we have defined the extent, it is time to get the point data of our target species; Kalmia latifolia. The rgbif package has a very convenient function occ_data to do this. To start with, we need the scientific name. Then we'll need to set the parameter hasCoordinate to TRUE to ensure only records with coordinates are downloaded. The area from which we want to download the data can be set by the parameter geometry, which is set to the bounding box we had defined earlier. There are a lot more parameters to further fine-tune the search, see the help file for details.

NCext <- bbox(SpatialGrid(grid=gmeta2grd()))
KL <- occ_data(scientificName="Kalmia latifolia", hasCoordinate=TRUE, geometry=NCext,

The KL is a list with two elements. The first is a list with the metadata and the second a dataframe with the actual records, including the coordinates. We can quickly map the points in the dataframe using the convenient gbifmap function.

gbifmap(KL$data, mapdatabase="state")

This will plot a simple map of the USA with state boundaries and the location of Kalmia latifolia (click on map to enlarge).

If you want something more involved, you can use for example ggplot2 directly to create your own map. This time, we are downloading the data for the whole of the USA. Instead of the geometry parameter, we can use the country parameter, which is a convenient way to define the country of interest (see the help file for details).

# Download Kalmia latifolia presence data for the USA
KL <- occ_data(scientificName="Kalmia latifolia", hasCoordinate=TRUE, country="US",

# Get USA state map
NCstate = map_data("state")

# Create the map
NCstate = map_data("state")
p <- ggplot(NCstate, aes(long, lat)) + geom_polygon() + coord_equal()
p <- ggplot() + coord_fixed(1.1)
p <- p + geom_polygon(data = NCstate, aes(x=long, y = lat, group = group),
                      fill="white", col="grey", lwd=0.3)
p <- p + geom_point(data=KL$data, aes(x = decimalLongitude, y=decimalLatitude,
                      color="Presence points"), cex=0.5) + labs(color = "Kalmia latifolia")
p <- p + geom_polygon(data = NCstate, aes(x=long, y = lat, group = group),
                      fill=rgb(1,1,1,0), col=rgb(0.8,0.8,0.8,0.3), lwd=0.1)
p <- p + theme(legend.position=c(0.14,0.15))
ggsave(filename="distribution of Kalmia_latifolia in USA.png", plot=p, width=17, height=8.5, dpi=300, scale=1, unit="cm")

The map (click to enlarge) clearly shows that the species is largely confined to the east of the USA. I would normally check the data for consistency and possible errors (notice e.g., the one odd point in the state Washington in the northwest) , but I will skip that step for now.

Import the data in GRASS GIS

Convert the data to SpatialPointsDataFrame

Concerting the data in a SpatialPointsDataFrame. This is done by assigning the coordinates. Next, we need to define the coordinate reference system (CRS). The GBIF data is downloaded in latlon WGS84, which as the EPGS code EPS 4326. We can use the proj4string and the CRS functions to set the projection of our SpatialPointsDataFrame.

coordinates(KL) <- cbind(KL$decimalLongitude, KL$decimalLatitude)
latlong = "+init=epsg:4326"
proj4string(KL) = CRS(latlong)

Export to the GRASS database

Next, we can use the writeVECT function from the rgrass7 package to export the point layer to the GRASS GIS database. Note, this only works if you started R from within the GRASS GIS session, as mentioned earlier. In the example below, only a few columns are exported, because I won't be needing the other ones. If you want to retain all columns, note that some columns may contain characters causing problems when importing in GRASS. In that case you will need to check and clean the offending columns.

writeVECT(SDF=KL[,c(1,2,5)], vname="KL", driver="SQLite")

Special case - reprojecting

What if the downloaded data is in another projection than the projection of the Location/Mapset in your GRASS database? The coordinate reference system (CRS) of the GBIF point data is latlon (WGS 84). However, in this tutorial we'll be working with the North Carolina (NC) sample GRASS database, which can be downloaded from the GRASS GIS website. The projection of this dataset is NAD83(HARN) / North Carolina (EPSG 3358). This means that the data needs to be reprojected. There are various ways one can do this. (A) One option is to reproject the data in R before writing it to GRASS. (B) A second option is to import the data in a location/mapset with the same projection as the point data (as we have just done above), and than change to the target Location/mapset in GRASS and use use r.proj to import and reproject the data from the first location/mapset into the target location/mapset. For an example of option B, see this tutorial.

We will go for option A here. As said, the data downloaded earlier is in latlon WGS84. We can get the projection details of our target mapset using the getLocationProj function from the rgrass7 package (Note, I am assuming that we have started GRASS with the target Location/mapset). Next, to reproject the SpatialPointsDataFrame, we use the spTransform function from the sp package (note that there is also a spTransform function in the rgdal package, which does the same). The last step is to export the data into the GRASS GIS database.

# Get the projection of the GRASS mapset and reproject the data
proj <- getLocationProj()
KLnew <- spTransform(KL, proj)
writeVECT(SDF=KLnew[,c(1,2,5)], vname="KL3", driver="SQLite")

And below the map, but this time created in GRASS GIS, using the d.* family of GRASS functions, as explained in the tutorial on building a map on the command line.


As usual, there are more than one way to skin the proverbial cat. Below some links of articles / blog posts about using different R packages to retrieve data from GBIF.