Cell Processing Workflow
Workflow for Processing Worldclim Variable to Datastore
This document provides a broad view of the process to prepare Worldclim for access via the API described in https://github.com/eightysteele/Spatial-Data-Library/wiki/REST-API.
#sdl.py
The workhorse for bulkloading is https://github.com/eightysteele/Spatial-Data-Library/blob/master/sdl/sdl.py. This script is set up to process one Worldclim tile (see http://www.worldclim.org/tiles.php) or any part of a Worldclim tile with one command-line call. Run the following to see the command line option help:
./sdl.py --help
The basic command to execute the entire bulkloading process for a single tile is -c full. All other commands are subsets of the process that can be used for convenience or testing. Here's what a full processing call looks like:
./sdl.py -c full -k 32 -v ~/Data/SDL/worldclim/32 -w ~/SDL/workspace -g ~/Spatial-Data-Library/data/gadm/Terrestrial-10min-unbuffered-dissolved.shp -u http://localhost:5984 -d worldclim-test -n 120 -b 50000 -f -120,0 -t -90,-30 -l tile32.log &
- `-k` number of the tile to load. Tile numbering follows that given on the Worldclim download page (http://www.worldclim.org/tiles.php), with values from 00 to 411.
- `-v` directory where the Worldclim variables will be downloaded and processed, in this example, ~/Data/SDL/worldclim/32. The script uses curl to get the Worldclim files for all of the variables at 30-second resolution and extracts them into the given directory.
- `-w` workspace directory where temporary files used in processing are stored, including the optional log file. If a tile is being parsed into sections to avoid processing large areas of ocean where there are no data, choose a distinct workspace directory for each section.
- `-g` path to the file to use for clipping. We constructed the file `Terrestrial-10min-unbuffered-dissolved.shp` by first polygonizing, then dissolving the Worldclim tmean6 layer (http://biogeo.ucdavis.edu/data/climate/worldclim/1_4/grid/cur/tmean_10m_bil.zip) at 10-minute resolution. The resulting clipping file is one multipolygon including all 10-minute cells having data. Clipping with this file ensures that all cells having data are processed while keeping the complexity of the clipping layer to a minimum to speed processing. Cells around the borders of this low-resolution clipping layer can still have no values, but these are discarded in the `starspancsvdir2couchcsvs()` function before loading to the data store.
- `-u` URL for the CouchDB data repository where cell data will be stored for retrieval on demand by App Engine.
- `-d` database name on CouchDB.
- `-n` number of cells per degree of longitude at the equator, which determines the resolution of the overall grid pattern. This option sets the latitudinal dimension of the cells in degrees, while the longitudinal dimension of cells varies with latitude to maintain constant area across all cells on the globe. A value of 120 creates cells with a resolution comparable to Worldclim (30 seconds) at the equator. The default geodetic model for the grid is the WGS84 ellipsoid.
- `-b` batch size - the number of cells to process before sending the results to CouchDB in a batch. The higher the number, the more efficient the loading process. In practice we use batch sizes of 50000.
- `-f` is the coordinate (lng,lat) of the northwest corner of the bounding box to process. To process an entire tile, use the coordinates of the bounding box of the tile (e.g., -120,0 -90,-30 for tile 32). Processing some tiles can be greatly optimized by running `sdl.py` one or more times with bounding boxes inside the tile that include only terrestrial area. For example, Tile 32 can be processed quickly by running `sdl.py` twice, once for each small bounding box within the tile including only the area of the islands in that region.
- `-t` is the coordinate (lng,lat) of the southeast corner of the bounding box, used in tandem with the `-f` option above.
- `-l` is the name of a log file in the workspace directory (given by the `-w` option) in which to store processing messages. Some commands within the `sdl.py` script (such as starspan) are executed as subprocess calls whose output goes to stdout rather than to the log file. Redirect the output to a file if you want to store and review this output after processing. The value `none` can be supplied to log to the console instead of to a file.
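The whole-tile bounding box passed to `-f` and `-t` can be derived from the tile number. The following is a minimal sketch, not part of `sdl.py`: the function names are hypothetical, and it assumes that the first digit of a tile number is the row (counted south from 90N), the remaining digits are the column (counted east from 180W), and every tile spans 30 x 30 degrees. These assumptions reproduce the corners used for tile 32 in the example call above.

```python
def tile_bounding_box(tile):
    """Return ((west, north), (east, south)) in degrees for a tile such as '32'.

    Assumption: first digit = row (0 starts at 90N), remaining digits = column
    (0 starts at 180W), and each tile covers 30 x 30 degrees.
    """
    row, col = int(tile[0]), int(tile[1:])
    if not (0 <= row <= 4 and 0 <= col <= 11):
        raise ValueError("tile out of range: %r" % tile)
    west = -180 + 30 * col
    north = 90 - 30 * row
    return (west, north), (west + 30, north - 30)


def format_corner(corner):
    """Format a (lng, lat) pair as the lng,lat string used by -f and -t."""
    return "%d,%d" % corner


northwest, southeast = tile_bounding_box("32")
print("-f", format_corner(northwest), "-t", format_corner(southeast))
# prints: -f -120,0 -t -90,-30
```

For tiles that are mostly ocean, smaller hand-picked bounding boxes inside the tile will usually process faster than the whole-tile box computed here.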
#sdl.py commands
As mentioned above under the -c command line argument, part or all of the cell processing workflow can be executed for a given Tile or section of a Tile. Following are summaries of the different commands available:
- `prepareworkspace` - Checks the workspace directory provided in the `-w` argument to see that it is empty. Aborts if it is not, to avoid overwriting previously processed data. Otherwise creates and checks that the workspace directory exists.
- `getworldclimtile` - Downloads the Worldclim 30-second zipped generic grid files for the Tile given by the `-k` argument into the directory specified by the `-v` argument, unzips them, and removes the zip file.
- `cliptileonly` - Creates a new clipping shapefile called `[k]-clipped`, where `k` is the value provided in the `-k` argument, in the workspace given by the `-w` argument. The shapefile is the intersection of the bounding box provided in the `-f` and `-t` arguments and the shapefile representing the area having data given by the `-g` argument. The resulting clip file is used to reduce the area processed to only those areas having data at the 10-minute resolution of Worldclim.
- `batchcells2shapes` - Prepares the workspace directory as in the `prepareworkspace` command, clips the bounding box to the area having data as in the `cliptileonly` command, creates shapefiles in batches of cells given by the `-n` argument, and clips the batches to the area in the bounding box having data. Resulting shapefiles of clipped batches of cells are stored in the `/batches` subdirectory of the workspace directory given by the `-w` argument.
- `starspan` - Extracts statistics on variables in the Worldclim tile for the batches of cells in the `/batches` subdirectory of the workspace directory given in the `-w` argument. Creates one csv file containing extracted statistics for every batch shape file in the `/batches` subdirectory.
- `starspan2couch` - Processes the avg statistic from starspan-produced csv files in the `/batches` subdirectory of the workspace directory to csv files containing rows with a cell key and a document with all variables for each cell. These csv files are stored in a `/forcouch` subdirectory of the given workspace and cover the range of cells in the batch and contain only cells having data - determined by checking that the values of alt (Altitude), bio12 (Annual Precipitation), and tmax1 (Maximum Temperature, January) are not all equal to 0. The cell, document format of the csv files is needed for processing to load to either CouchDB or App Engine.
- `tilesection2couchcsvs` - Does everything that the command `batchcells2shapes` does, followed by the processing achieved by executing the `starspan` and `starspan2couch` commands. Use this command after the `getworldclimtile` command to completely process a Tile or section of a Tile up to the point of having csv files ready for loading to CouchDB and App Engine from the `/forcouch` subdirectory of the given workspace directory.
- `couchfromcsvs` - Loads the cells from all of the csv files in the `/forcouch` subdirectory of the given workspace directory into the CouchDB database given by the `-d` argument on the server given by the `-u` argument.
- `csv2appengine`
- `full`
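The data check described for `starspan2couch` and the batched loading controlled by `-b` can be illustrated together. This is a sketch under stated assumptions, not the code in `sdl.py`: the function names, the (cell key, variables) pair layout, the use of the cell key as the document `_id`, and the sample values are all made up for illustration. The payload shape follows CouchDB's `_bulk_docs` endpoint, which accepts a JSON object with a `docs` array; the sketch only builds the request bodies and sends nothing.

```python
import json

# Variables checked to decide whether a cell has data (from the text above).
KEY_VARIABLES = ("alt", "bio12", "tmax1")


def has_data(variables):
    """True unless alt, bio12, and tmax1 are all equal to 0."""
    return not all(float(variables[name]) == 0 for name in KEY_VARIABLES)


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def bulk_payloads(cells, batch_size):
    """Yield one JSON body per batch, shaped for CouchDB's _bulk_docs endpoint.

    Assumption: `cells` yields (cell_key, variables) pairs and the cell key is
    used as the document _id.
    """
    kept = ((key, doc) for key, doc in cells if has_data(doc))
    for batch in batches(kept, batch_size):
        docs = [dict(doc, _id=key) for key, doc in batch]
        yield json.dumps({"docs": docs})


# Made-up sample cells, for illustration only.
cells = [
    ("cell-a", {"alt": 0, "bio12": 0, "tmax1": 0}),  # all three are 0: dropped
    ("cell-b", {"alt": 12, "bio12": 340, "tmax1": 281}),
    ("cell-c", {"alt": 0, "bio12": 95, "tmax1": 0}),
    ("cell-d", {"alt": 3, "bio12": 0, "tmax1": 0}),
]
for body in bulk_payloads(cells, batch_size=2):
    print(body)
```

With a batch size of 2, the three cells that pass the check are split into one batch of two documents and one batch of one. A larger batch size means fewer requests for the same number of cells.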
#Typical workflow example - Tile01
- getworldclimtile
- tilesection2couchcsvs
- couchfromcsvs
- csv2appengine
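The sequence above could be scripted by building one `sdl.py` call per command. The sketch below only assembles and prints the command lines. It assumes that every command accepts the same set of options, which the script's `--help` output should confirm, and the option values are placeholders patterned on the full example earlier in this document rather than tested settings for Tile 01. `build_calls` is a hypothetical helper name.

```python
# Commands of the typical workflow, in the order they are run.
WORKFLOW = [
    "getworldclimtile",
    "tilesection2couchcsvs",
    "couchfromcsvs",
    "csv2appengine",
]


def build_calls(options, commands=WORKFLOW):
    """Return one argument list per command, each reusing the same options."""
    calls = []
    for command in commands:
        argv = ["./sdl.py", "-c", command]
        for flag, value in options:
            argv.extend([flag, value])
        calls.append(argv)
    return calls


# Placeholder values (assumptions), following the pattern of the full example.
options = [
    ("-k", "01"),
    ("-v", "~/Data/SDL/worldclim/01"),
    ("-w", "~/SDL/workspace"),
    ("-u", "http://localhost:5984"),
    ("-d", "worldclim-test"),
]

for argv in build_calls(options):
    print(" ".join(argv))
```

Printing the calls first makes it easy to review them before running anything. If the lists are later passed to `subprocess.run`, note that `~` is not expanded without a shell, so the paths would need `os.path.expanduser` first.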