Commit ea03bb4

Added README docs. Also, changed API a bit. (#7)
- Using `chunksize`, which is a Dask convention
- Accepts strings or `FeatureCollection`s
- Added a docstring to the main function
- Aaron is an author
- Removed the pandas dependency
1 parent c859cf0 commit ea03bb4

File tree

4 files changed: +98 -10 lines

README.md

Lines changed: 73 additions & 1 deletion
@@ -1,6 +1,78 @@
 # dask-ee
 
-Earth Engine FeatureCollections via Dask Dataframes
+Google Earth Engine `FeatureCollection`s via Dask DataFrames
+
+## How to use
+
+Install with pip:
+```shell
+pip install --upgrade dask-ee
+```
+
+Then, authenticate Earth Engine:
+```shell
+earthengine authenticate --quiet
+```
+
+In your Python environment, you may now import the library:
+
+```python
+import ee
+import dask_ee
+```
+
+You'll need to initialize Earth Engine before working with data:
+```python
+ee.Initialize()
+```
+
+From here, you can read Earth Engine FeatureCollections like they are DataFrames:
+```python
+ddf = dask_ee.read_ee("WRI/GPPD/power_plants")
+ddf.head()
+```
+These work like Pandas DataFrames, but they are lazily evaluated via [Dask](https://dask.org/).
+
+Feel free to do any analysis you wish. For example:
+```python
+# Thanks @aazuspan, https://www.aazuspan.dev/blog/dask_featurecollection
+(
+  ddf[ddf.comm_year.gt(1940) & ddf.country.eq("USA") & ddf.fuel1.isin(["Coal", "Wind"])]
+  .astype({"comm_year": int})
+  .drop(columns=["geo"])
+  .groupby(["comm_year", "fuel1"])
+  .agg({"capacitymw": "sum"})
+  .reset_index()
+  .sort_values(by=["comm_year"])
+  .compute(scheduler="threads")
+  .pivot_table(index="comm_year", columns="fuel1", values="capacitymw", fill_value=0)
+  .plot()
+)
+```
+![Coal vs Wind in the US since 1940](demo.png)
+
+There are a few other useful things you can do.
+
+For one, you may pass in a pre-processed `ee.FeatureCollection`. This allows full utilization
+of the Earth Engine API.
+
+```python
+import dask_ee
+
+fc = (
+  ee.FeatureCollection("WRI/GPPD/power_plants")
+  .filter(ee.Filter.gt("comm_year", 1940))
+  .filter(ee.Filter.eq("country", "USA"))
+)
+ddf = dask_ee.read_ee(fc)
+```
+
+In addition, you may change the `chunksize`, which controls how many rows are included in each
+Dask partition.
+```python
+ddf = dask_ee.read_ee("WRI/GPPD/power_plants", chunksize=7_000)
+ddf.head()
+```
 
 ## License
 ```

dask_ee/read.py

Lines changed: 19 additions & 6 deletions
@@ -29,11 +29,24 @@
 # TODO(#4): Support 'auto' chunks, where we calculate the maximum allowed page size given the number of
 # bytes in each row.
 def read_ee(
-    fc: ee.FeatureCollection, io_chunks: t.Union[int, t.Literal['auto']] = 5_000
+    fc: t.Union[ee.FeatureCollection, str],
+    chunksize: t.Union[int, t.Literal['auto']] = 5_000,
 ) -> dd.DataFrame:
+  """Read Google Earth Engine FeatureCollections into a Dask Dataframe.
 
-  if io_chunks == 'auto':
-    raise NotImplementedError('Auto `io_chunks` are not implemented yet!')
+  Args:
+    fc: A Google Earth Engine FeatureCollection or valid string path to a FeatureCollection.
+    chunksize: The number of rows per partition to use.
+
+  Returns:
+    A dask DataFrame with paged Google Earth Engine data.
+  """
+
+  if isinstance(fc, str):
+    fc = ee.FeatureCollection(fc)
+
+  if chunksize == 'auto':
+    raise NotImplementedError('Auto chunksize is not implemented yet!')
 
   # Make all the getInfo() calls at once, up front.
   fc_size, all_info = ee.List([fc.size(), fc.limit(0)]).getInfo()
@@ -42,12 +55,12 @@ def read_ee(
   columns.update(all_info['columns'])
   del columns['system:index']
 
-  divisions = tuple(range(0, fc_size, io_chunks))
+  divisions = tuple(range(0, fc_size, chunksize))
 
   # TODO(#5): Compare `toList()` to other range operations, like getting all index IDs via `getInfo()`.
-  pages = [ee.FeatureCollection(fc.toList(io_chunks, i)) for i in divisions]
+  pages = [ee.FeatureCollection(fc.toList(chunksize, i)) for i in divisions]
   # Get the remainder, if it exists. `io_chunks` are not likely to evenly partition the data.
-  d, r = divmod(fc_size, io_chunks)
+  d, r = divmod(fc_size, chunksize)
   if r != 0:
     pages.append(ee.FeatureCollection(fc.toList(r, d)))
   divisions += (fc_size,)
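To make the paging arithmetic in this diff concrete: each Dask partition corresponds to one `toList(count, offset)` page of the collection. The helper below, `page_bounds`, is a name invented for this sketch (it is not part of dask-ee, which builds `divisions` plus a remainder page instead), but the (offset, count) pairs it returns describe the same partitioning:

```python
def page_bounds(fc_size: int, chunksize: int) -> list:
    """Return (offset, count) pairs covering all fc_size rows in chunksize pages."""
    bounds = []
    for offset in range(0, fc_size, chunksize):
        # The final page is shorter whenever chunksize does not evenly divide fc_size.
        bounds.append((offset, min(chunksize, fc_size - offset)))
    return bounds

# 12,000 rows at the default chunksize of 5_000: two full pages plus a 2,000-row remainder.
print(page_bounds(12_000, 5_000))  # [(0, 5000), (5000, 5000), (10000, 2000)]
```

This is also why the commit's TODO(#4) about `'auto'` chunks matters: a chunksize tuned to row size would bound how much data each `getInfo`-style page request pulls at once.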

demo.png

39.4 KB

pyproject.toml

Lines changed: 6 additions & 3 deletions
@@ -1,12 +1,13 @@
 [project]
 name = "dask-ee"
 dynamic = ["version"]
-description = "Google Earth Engine FeatureCollections via Dask Dataframes."
+description = "Google Earth Engine FeatureCollections via Dask DataFrames."
 readme = "README.md"
 requires-python = ">=3.8"
 license = {text = "Apache-2.0"}
 authors = [
   {name = "Alexander Merose", email = "al@merose.com"},
+  {name = "Aaron Zuspan"}
 ]
 classifiers = [
   "Development Status :: 4 - Beta",
@@ -22,11 +23,13 @@ classifiers = [
   "Programming Language :: Python :: 3.11",
   "Programming Language :: Python :: 3.12",
   "Topic :: Scientific/Engineering :: Atmospheric Science",
+  "Topic :: Scientific/Engineering :: GIS",
+  "Topic :: Scientific/Engineering :: Hydrology",
+  "Topic :: Scientific/Engineering :: Oceanography",
 ]
 dependencies = [
   "earthengine-api>=0.1.374",
-  "pandas",
-  "dask",
+  "dask[dataframe]",
 ]
 
 [project.optional-dependencies]

0 commit comments
