Commit 78bdda9

ensure direct read of datasets is possible (#5)
* ensure direct read is possible
* finish db cleaning
* update docstrings
* fix test
* update readme
* fix tests once more
* expand json docs
1 parent b27fce6 commit 78bdda9

8 files changed: +283 -263 lines changed

README.md

Lines changed: 20 additions & 15 deletions
@@ -36,25 +36,28 @@ In [1]: import geodatasets
 In [2]: geodatasets.data
 Out[2]:
 {'geoda': {'airbnb': {'url': 'https://geodacenter.github.io/data-and-lab//data/airbnb.zip',
-'license': 'CC-0',
-'attribution': 'GeoDa Data and Lab',
+'license': 'NA',
+'attribution': 'Center for Spatial Data Science, University of Chicago',
 'name': 'geoda.airbnb',
 'description': 'Airbnb rentals, socioeconomics, and crime in Chicago',
+'geometry_type': 'Polygon',
 'nrows': 77,
-'ncols': 20,
+'ncols': 21,
 'details': 'https://geodacenter.github.io/data-and-lab//airbnb/',
 'hash': 'a2ab1e3f938226d287dd76cde18c00e2d3a260640dd826da7131827d9e76c824',
 'filename': 'airbnb.zip'},
 'atlanta': {'url': 'https://geodacenter.github.io/data-and-lab//data/atlanta_hom.zip',
-'license': 'CC-0',
-'attribution': 'GeoDa Data and Lab',
+'license': 'NA',
+'attribution': 'Center for Spatial Data Science, University of Chicago',
 'name': 'geoda.atlanta',
 'description': 'Atlanta, GA region homicide counts and rates',
+'geometry_type': 'Polygon',
 'nrows': 90,
-'ncols': 23,
+'ncols': 24,
 'details': 'https://geodacenter.github.io/data-and-lab//atlanta_old/',
-'hash': 'missing',
-'filename': 'atlanta_hom.zip'},
+'hash': 'a33a76e12168fe84361e60c88a9df4856730487305846c559715c89b1a2b5e09',
+'filename': 'atlanta_hom.zip',
+'members': ['atlanta_hom/atl_hom.geojson']},
 ...
 ```

@@ -69,8 +72,8 @@ And one to get the local path. If the file is not available in the cache, it wil
 downloaded first.

 ```py
-Out[4]: '/Users/martin/Library/Caches/geodatasets/airbnb.zip'
 In [4]: geodatasets.get_path('geoda airbnb')
+Out[4]: '/Users/martin/Library/Caches/geodatasets/airbnb.zip'
 ```

 You can also get all the details:
@@ -79,12 +82,13 @@ You can also get all the details:
 In [5]: geodatasets.data.geoda.airbnb
 Out[5]:
 {'url': 'https://geodacenter.github.io/data-and-lab//data/airbnb.zip',
-'license': 'CC-0',
-'attribution': 'GeoDa Data and Lab',
+'license': 'NA',
+'attribution': 'Center for Spatial Data Science, University of Chicago',
 'name': 'geoda.airbnb',
 'description': 'Airbnb rentals, socioeconomics, and crime in Chicago',
+'geometry_type': 'Polygon',
 'nrows': 77,
-'ncols': 20,
+'ncols': 21,
 'details': 'https://geodacenter.github.io/data-and-lab//airbnb/',
 'hash': 'a2ab1e3f938226d287dd76cde18c00e2d3a260640dd826da7131827d9e76c824',
 'filename': 'airbnb.zip'}
@@ -96,12 +100,13 @@ Or using the name query:
 In [6]: geodatasets.data.query_name('geoda airbnb')
 Out[6]:
 {'url': 'https://geodacenter.github.io/data-and-lab//data/airbnb.zip',
-'license': 'CC-0',
-'attribution': 'GeoDa Data and Lab',
+'license': 'NA',
+'attribution': 'Center for Spatial Data Science, University of Chicago',
 'name': 'geoda.airbnb',
 'description': 'Airbnb rentals, socioeconomics, and crime in Chicago',
+'geometry_type': 'Polygon',
 'nrows': 77,
-'ncols': 20,
+'ncols': 21,
 'details': 'https://geodacenter.github.io/data-and-lab//airbnb/',
 'hash': 'a2ab1e3f938226d287dd76cde18c00e2d3a260640dd826da7131827d9e76c824',
 'filename': 'airbnb.zip'}
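
Note on the name query: the README resolves the query `'geoda airbnb'` to the entry named `geoda.airbnb`, and the `get_path` docstring in this commit describes the match as ignoring letter case, spaces, dashes and other characters. A minimal sketch of how such a tolerant lookup could work is shown here. It is not the library's implementation; `DATA`, `normalize` and `query_name_sketch` are hypothetical names used only for illustration.

```python
import re

# Hypothetical stand-in for a small slice of the metadata shown in the README.
DATA = {
    "geoda": {
        "airbnb": {"name": "geoda.airbnb", "filename": "airbnb.zip"},
        "atlanta": {"name": "geoda.atlanta", "filename": "atlanta_hom.zip"},
    }
}


def normalize(name):
    """Lowercase the name and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def query_name_sketch(data, name):
    """Return the first entry whose normalized name equals the normalized query."""
    wanted = normalize(name)
    for provider in data.values():
        for entry in provider.values():
            if normalize(entry["name"]) == wanted:
                return entry
    raise ValueError(f"No matching item found for the query '{name}'.")


entry = query_name_sketch(DATA, "geoda airbnb")
print(entry["filename"])
```

Comparing normalized strings keeps the lookup independent of separators, so `'geoda airbnb'`, `'GeoDa-Airbnb'` and `'geoda.airbnb'` would all resolve to the same entry under this assumption.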

ci/dev.yaml

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,8 @@ dependencies:
   # tests
   - pytest
   - pytest-cov
+  - geopandas-base
+  - pyogrio
   - pip
   - pip:
       - git+https://github.com/fatiando/pooch.git@main

ci/latest.yaml

Lines changed: 2 additions & 0 deletions
@@ -7,3 +7,5 @@ dependencies:
   # tests
   - pytest
   - pytest-cov
+  - geopandas-base
+  - pyogrio

doc/source/contributing.md

Lines changed: 8 additions & 2 deletions
@@ -22,6 +22,7 @@ schema to add a single dataset:
 "attribution": "University of Github",
 "name": "dataset_name",
 "description": "Contents of my file",
+"geometry_type": "Polygon",
 "nrows": 77,
 "ncols": 20,
 "details": "https://your-site.com/link-to-explanantion/",
@@ -43,6 +44,7 @@ you can group then within a `Bunch` using the following schema:
 "attribution": "University of Github",
 "name": "dataset_name",
 "description": "Contents of my file",
+"geometry_type": "Polygon",
 "nrows": 77,
 "ncols": 20,
 "details": "https://your-site.com/link-to-explanantion/",
@@ -55,11 +57,13 @@ you can group then within a `Bunch` using the following schema:
 "attribution": "University of Github",
 "name": "dataset_name",
 "description": "Contents of my file",
+"geometry_type": "Point",
 "nrows": 77,
 "ncols": 20,
 "details": "https://your-site.com/link-to-explanantion/",
 "hash": "a2ab1e3f938226d287dd76cde18c00e2d3a260640dd826da7131827d9e76c824",
-"filename": "my_file.zip"
+"filename": "my_file.zip",
+"members": ["use_only_this.geojson"]
 }
 },
 }
@@ -68,7 +72,9 @@ you can group then within a `Bunch` using the following schema:
 It is mandatory to always specify at least `name`, `url`, `hash` and `filename`. `hash`
 is a sha256 hash of the file to check that a user gets the expected file and a
 `filename` specifies how the downloaded file will be called. Ensure that it has a correct
-suffix. Don't forget to add any other custom attributes you'd like.
+suffix. Don't forget to add any other custom attributes you'd like. Attribute `members` has
+a specific meaning and specifies file (or files in case of ESRI Shapefile) that shall be
+extracted from the archive and used.

 ## Code and documentation
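
Note on the mandatory fields: the contributing guide requires `name`, `url`, `hash` and `filename`, with `hash` being a sha256 digest of the downloaded file. The sketch below shows one way a contributor might compute that digest and check an entry before submitting it. It uses only the standard library; `sha256_of_file`, `missing_keys` and `REQUIRED_KEYS` are hypothetical helpers and are not part of geodatasets.

```python
import hashlib

# The four keys the contributing guide lists as mandatory.
REQUIRED_KEYS = ("name", "url", "hash", "filename")


def sha256_of_file(path, chunk_size=65536):
    """Return the hex sha256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def missing_keys(entry):
    """List the mandatory keys that are absent from a dataset entry."""
    return [key for key in REQUIRED_KEYS if key not in entry]


# An incomplete entry: 'hash' and 'filename' still need to be filled in.
draft = {"name": "dataset_name", "url": "https://your-site.com/my_file.zip"}
print(missing_keys(draft))
```

Reading in chunks keeps memory use flat for large archives; the resulting hex string is what would go into the `hash` field.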

geodatasets/api.py

Lines changed: 44 additions & 4 deletions
@@ -55,6 +55,8 @@ def get_path(name):
     contain the same letters in the same order as the item's name irrespective
     of the letter case, spaces, dashes and other characters.

+    For Datasets containing multiple files, the archive is automatically extracted.
+
     Parameters
     ----------
     name : str
@@ -81,7 +83,20 @@ def get_path(name):
     >>> path2
     '/Users/martin/Library/Caches/geodatasets/airbnb.zip'
     """
-    return CACHE.fetch(data.query_name(name).filename)
+    dataset = data.query_name(name)
+    if "members" in dataset.keys():
+        unzipped_files = CACHE.fetch(
+            dataset.filename, processor=pooch.Unzip(members=dataset.members)
+        )
+        if len(unzipped_files) == 1:
+            return unzipped_files[0]
+        elif len(unzipped_files) > 1:  # shapefile
+            return [f for f in unzipped_files if f.endswith(".shp")][0]
+        else:
+            raise
+
+    else:
+        return CACHE.fetch(dataset.filename)


 def fetch(name):
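
Note on the path selection in `get_path`: with `members` set, the new code returns the single extracted file, or the `.shp` file when several were extracted. The final `else` branch uses a bare `raise`, which has no active exception to re-raise there and would surface as a `RuntimeError` in Python 3; a multi-file archive without a `.shp` member would fail with an `IndexError`. The following is a sketch of the same selection rule with explicit errors, offered as one possible alternative rather than as the project's code. `select_path` is a hypothetical helper name.

```python
def select_path(unzipped_files):
    """Pick the path to hand to a reader from a list of extracted files.

    A single file is returned as is. For several files the ESRI Shapefile
    convention is assumed and the '.shp' component is returned.
    """
    if len(unzipped_files) == 1:
        return unzipped_files[0]

    shapefiles = [f for f in unzipped_files if f.endswith(".shp")]
    if shapefiles:
        return shapefiles[0]

    # Covers both an empty list and several files without a '.shp' member.
    raise ValueError(
        f"Cannot determine which file to read from {list(unzipped_files)!r}."
    )


print(select_path(["nybb_22c/nybb.shx", "nybb_22c/nybb.shp", "nybb_22c/nybb.dbf"]))
```

Keeping the rule in a pure function makes it testable without touching the cache or the network.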
@@ -94,6 +109,8 @@ def fetch(name):
     contain the same letters in the same order as the item's name irrespective
     of the letter case, spaces, dashes and other characters.

+    For Datasets containing multiple files, the archive is automatically extracted.
+
     Parameters
     ----------
     name : str, list
@@ -106,18 +123,41 @@ def fetch(name):
     Examples
     --------
     >>> geodatasets.fetch('nybb')
-    Downloading file 'nybb_22c.zip' from 'https://data.cityofnewyork.us/api/geospatial/\
-tqmj-j8zm?method=export&format=Original' to '/Users/martin/Library/Caches/geodatasets'.
+    Downloading file 'nybb_22c.zip' from 'https://data.cityofnewyork.us/api/geospatial\
+/tqmj-j8zm?method=export&format=Original' to '/Users/martin/Library/Caches/geodatasets'.
+    Extracting 'nybb_22c/nybb.shp' from '/Users/martin/Library/Caches/geodatasets/nybb_\
+22c.zip' to '/Users/martin/Library/Caches/geodatasets/nybb_22c.zip.unzip'
+    Extracting 'nybb_22c/nybb.shx' from '/Users/martin/Library/Caches/geodatasets/nybb_\
+22c.zip' to '/Users/martin/Library/Caches/geodatasets/nybb_22c.zip.unzip'
+    Extracting 'nybb_22c/nybb.dbf' from '/Users/martin/Library/Caches/geodatasets/nybb_\
+22c.zip' to '/Users/martin/Library/Caches/geodatasets/nybb_22c.zip.unzip'
+    Extracting 'nybb_22c/nybb.prj' from '/Users/martin/Library/Caches/geodatasets/nybb_\
+22c.zip' to '/Users/martin/Library/Caches/geodatasets/nybb_22c.zip.unzip'

     >>> geodatasets.fetch(['geoda airbnb', 'geoda guerry'])
     Downloading file 'airbnb.zip' from 'https://geodacenter.github.io/data-and-lab//dat\
 a/airbnb.zip' to '/Users/martin/Library/Caches/geodatasets'.
     Downloading file 'guerry.zip' from 'https://geodacenter.github.io/data-and-lab//dat\
 a/guerry.zip' to '/Users/martin/Library/Caches/geodatasets'.
+    Extracting 'guerry/guerry.shp' from '/Users/martin/Library/Caches/geodatasets/guerr\
+y.zip' to '/Users/martin/Library/Caches/geodatasets/guerry.zip.unzip'
+    Extracting 'guerry/guerry.dbf' from '/Users/martin/Library/Caches/geodatasets/guerr\
+y.zip' to '/Users/martin/Library/Caches/geodatasets/guerry.zip.unzip'
+    Extracting 'guerry/guerry.shx' from '/Users/martin/Library/Caches/geodatasets/guerr\
+y.zip' to '/Users/martin/Library/Caches/geodatasets/guerry.zip.unzip'
+    Extracting 'guerry/guerry.prj' from '/Users/martin/Library/Caches/geodatasets/guerr\
+y.zip' to '/Users/martin/Library/Caches/geodatasets/guerry.zip.unzip'

     """
     if isinstance(name, str):
         name = [name]

     for n in name:
-        _ = CACHE.fetch(data.query_name(n).filename)
+        dataset = data.query_name(n)
+        if "members" in dataset.keys():
+            _ = CACHE.fetch(
+                data.query_name(n).filename,
+                processor=pooch.Unzip(members=dataset.members),
+            )
+        else:
+            _ = CACHE.fetch(data.query_name(n).filename)
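
Note on member extraction: the commit delegates extraction to `pooch.Unzip(members=...)`, and the docstring output above shows the listed members landing in a sibling `<archive>.unzip` directory. The sketch below is a standard-library analogue of that behaviour, meant only to illustrate what "extract just these members" means. It is not how pooch implements it, and `extract_members` is a hypothetical helper.

```python
import os
import tempfile
import zipfile


def extract_members(archive_path, members, target_dir):
    """Extract only the listed members and return their paths on disk."""
    extracted = []
    with zipfile.ZipFile(archive_path) as archive:
        for member in members:
            # ZipFile.extract returns the path of the file it created.
            extracted.append(archive.extract(member, path=target_dir))
    return extracted


# Build a throwaway archive that mirrors the example in the contributing guide.
with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "my_file.zip")
    with zipfile.ZipFile(archive_path, "w") as zf:
        zf.writestr("use_only_this.geojson", "{}")
        zf.writestr("ignore_me.txt", "not needed")

    paths = extract_members(
        archive_path, ["use_only_this.geojson"], archive_path + ".unzip"
    )
    print([os.path.basename(p) for p in paths])
```

Members that are not listed stay inside the archive, which is the point of the `members` attribute for archives that bundle documentation or alternative formats.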
