|
1 | 1 | ======================================================================================= |
2 | | -PACKAGE_NAME Release Notes |
| 2 | +PUDL Data Catalog Release Notes |
3 | 3 | ======================================================================================= |
4 | 4 |
|
5 | 5 | .. _release-v0-1-0: |
6 | 6 |
|
7 | 7 | --------------------------------------------------------------------------------------- |
8 | | -0.1.0 (2022-XX-XX) |
| 8 | +0.1.0 (2022-04-21) |
9 | 9 | --------------------------------------------------------------------------------------- |
10 | 10 |
|
11 | | -This is a Header |
12 | | -^^^^^^^^^^^^^^^^ |
13 | | -* Briefly describe the substantial changes to the code in here when you make a PR. |
14 | | -* That way and users we have documentation as to what's going on. |
15 | | -* You can refer to the relevant pull request using the ``pr`` role: :pr:`1` |
16 | | -* Don't hesitate to give shoutouts to folks who contributed like :user:`cmgosnell` |
17 | | -* You can link to issues that were closed like this: :issue:`2,3,4` |
18 | | - |
19 | | -Bug Fixes |
20 | | -^^^^^^^^^ |
21 | | -* It's good to make a note of any known bugs that are fixed by the release, and refer |
22 | | - to the relevant issues. |
| 11 | +First Release |
| 12 | +^^^^^^^^^^^^^ |
| 13 | +* We're excited to start providing bulk, versioned, programmatic access to the PUDL |
| 14 | + data, starting with the EPA CEMS hourly emissions data. This is still experimental. |
| 15 | +* The data is available in a Google cloud object store, via an Intake data catalog, and |
| 16 | + is stored in Apache Parquet files. |
| 17 | +* We're still working out some performance and metadata issues, but it's at least |
| 18 | + nominally functional, and we wanted to get it out early and see if we could get some |
| 19 | + feedback. |
| 20 | +* Currently there's a single-file and a partitioned version of the same data. We |
| 21 | + recommend using the single-file version (the source named ``hourly_emissions_epacems`` |
| 22 | + in the catalog) since performance is generally better and we need to work on making |
| 23 | + per-file local caching more efficient before it's worth using the partitioned data.
| 24 | +* Thanks to :user:`martindurant` for helping us get things set up and helping us debug |
| 25 | + some issues. |
23 | 26 |
|
24 | 27 | Known Issues |
25 | 28 | ^^^^^^^^^^^^ |
26 | | -* It's also good to list any remaining known problems, and link to their issues too. |
| 29 | +* Local caching of the Parquet files works, but for both the monolithic and partitioned
| 30 | + versions of the data it will typically cache the entire dataset immediately upon first
| 31 | + access. This is because the metadata describing what data is in which file is only
| 32 | + available within the Parquet files themselves, so every file has to be accessed in
| 33 | + order to filter the entire dataset. Since the data is several GB, it can take a while |
| 34 | + to cache initially. Subsequent access is fast. See :issue:`4` |
| 35 | +* Accessing the year-state partitioned version of the data is much slower than the |
| 36 | + monolithic single-file version. We don't really understand why. For now it's
| 37 | + recommended to use the monolithic EPA CEMS data. See :issue:`8` |