Cell Processing Workflow
Workflow for Processing Worldclim Variable to Datastore
This document provides a broad view of the process to prepare Worldclim for access via the API described in https://github.com/eightysteele/Spatial-Data-Library/wiki/REST-API.
The workhorse for bulk loading is https://github.com/eightysteele/Spatial-Data-Library/blob/master/sdl/sdl.py. This script processes one Worldclim tile (see http://www.worldclim.org/tiles.php), or any part of one, with a single command-line call. Run the following to see help for the command-line options:
./sdl.py --help
The basic command to execute the entire bulkloading process for a single tile is -c full. All other commands are subsets of the process that can be used for convenience or testing. Here's what a full processing call looks like:
./sdl.py -c full -k 32 -v ~/Data/SDL/worldclim/32 -w ~/SDL/workspace -g ~/Spatial-Data-Library/data/gadm/Terrestrial-10min-unbuffered-dissolved.shp -u http://localhost:5984 -d worldclim-test -n 120 -b 25000 -f -120,0 -t -90,-30 -l tile32.log &
-k is the number of the tile to load. Tile numbering follows that given on the Worldclim download page (http://www.worldclim.org/tiles.php), with values from 00 to 411.
-v is the directory into which the Worldclim variables are downloaded and in which they are processed; in this example, ~/Data/SDL/worldclim/32. The script uses curl to get the Worldclim files for all of the variables at 30 arc-second resolution and extracts them into the given directory.
-w is the workspace directory where temporary files used in processing are stored, including the optional log file. If a tile is being split into sections to avoid processing large areas of ocean where there are no data, choose a distinct workspace directory for each section.
-g is the path to the file to use for clipping. We constructed the file Terrestrial-10min-unbuffered-dissolved.shp by first polygonizing, then dissolving the Worldclim tmean6 layer (http://biogeo.ucdavis.edu/data/climate/worldclim/1_4/grid/cur/tmean_10m_bil.zip) at 10-minute resolution. The resulting clipping file is one multipolygon that includes all 10-minute cells having data. Clipping with this file ensures that all cells having data are processed, while keeping the complexity of the clipping layer to a minimum to speed processing. Cells along the borders of this low-resolution clipping layer can still end up with no values, but these are discarded in the `starspancsvdir2couchcsvs()` function before loading to the data store.
-u is the URL for the CouchDB data repository where cell data will be stored for retrieval on demand by App Engine.
-d is the name of the database on CouchDB.
-n is the number of cells per degree of longitude at the equator and determines the resolution of the overall grid pattern. All cells use this option to determine their latitudinal dimension in degrees, while the longitudinal dimension of the cell varies with latitude to maintain constant area across all cells on the globe. The default geodetic model is the WGS84 ellipsoid.
-b is the batch size - the number of cells to process before sending the results to CouchDB in a batch. The higher the number, the more efficient the loading process. In practice we use batch sizes of 25000.
-f is the coordinate (lng,lat) of the northwest corner of the bounding box to process. To process an entire tile, use the coordinates of the bounding box of the tile (e.g., -120,0 and -90,-30 for tile 32, as in the full call above). Processing some tiles can be greatly optimized by running sdl.py one or more times with bounding boxes inside the tile that include only terrestrial area. For example, Tile 32 can be processed quickly by running sdl.py twice, each time with a small bounding box that includes only the area of one group of islands in that region.
-t is the coordinate (lng,lat) of the southeast corner of the bounding box, used in tandem with the -f option above.
-l is the name of a log file in the workspace directory (given by the -w option) in which to store informational messages about processing. Some commands within the sdl.py script are executed as subprocess calls whose output goes to stdout rather than to the log file. The value none can be supplied to log to the console instead of to a file.
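As a rough illustration of how the -k, -f, -t and -n options relate to one another, here is a minimal Python sketch. It is not part of sdl.py; the function names are hypothetical. It assumes that tiles are 30 x 30 degrees, that the first digit of a tile number is the row (counted south from 90 degrees north), and that the remaining digits are the column (counted east from 180 degrees west). That layout is inferred from the 00 to 411 numbering and from the tile 32 bounding box used above, so check it against the Worldclim tile map before relying on it.

```python
TILE_DEGREES = 30  # assumed tile width and height in degrees


def tile_bounds(tile):
    """Return ((west, north), (east, south)) in degrees for a tile key such as '32'.

    Assumes the first digit is the row (0-4) and the rest is the column (0-11).
    """
    tile = str(tile)
    if len(tile) not in (2, 3) or not (tile.isascii() and tile.isdigit()):
        raise ValueError("not a tile number: %r" % tile)
    row, col = int(tile[0]), int(tile[1:])
    # Reject out-of-range keys and non-canonical ones such as '105'.
    if row > 4 or col > 11 or str(col) != tile[1:]:
        raise ValueError("tile number out of range: %r" % tile)
    west = -180 + TILE_DEGREES * col
    north = 90 - TILE_DEGREES * row
    return (west, north), (west + TILE_DEGREES, north - TILE_DEGREES)


def bbox_options(tile):
    """Build the -f (northwest) and -t (southeast) arguments for a whole tile."""
    (west, north), (east, south) = tile_bounds(tile)
    return ["-f", "%d,%d" % (west, north), "-t", "%d,%d" % (east, south)]


def cell_size_arcsec(cells_per_degree):
    """Latitudinal cell size in arc-seconds for a given -n value."""
    return 3600.0 / cells_per_degree


if __name__ == "__main__":
    print(" ".join(bbox_options("32")))
    print(cell_size_arcsec(120))
```

For tile 32 this reproduces the -f -120,0 -t -90,-30 pair from the full call above, and -n 120 corresponds to 30 arc-second cells, matching the resolution of the downloaded Worldclim files. The sketch covers only whole tiles; the smaller terrestrial bounding boxes described under -f still have to be chosen by hand.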