Commit a1b7f22

more updates to readme

1 parent 0c8bb1d

3 files changed: 82 additions, 73 deletions


README.md

Lines changed: 12 additions & 5 deletions
@@ -7,9 +7,9 @@ This Python project processes Government of Canada service-related data, merging
 
 ### Key Features
 - **Data ingestion**: Downloads and processes service inventory and service standard performance data.
-- **Dataset Merging**: Combines service inventory data from 2018-2023 historical datasets and 2024+ datasets from the Open Government Portal.
-- **Quality Assurance**: Identifies and flags inconsistencies in datasets.
-- **Output Generation**: Produces structured CSVs that reflect the latest information on the Open Government Portal.
+- **Dataset merging**: Combines service inventory data from 2018-2023 historical datasets and 2024+ datasets from the Open Government Portal.
+- **Quality assurance**: Identifies and flags inconsistencies in datasets.
+- **Output generation**: Produces structured CSVs that reflect the latest information on the Open Government Portal.
 
 Service inventory and service standard performance data are collected as a requirement under the [Policy on Service and Digital](https://www.tbs-sct.canada.ca/pol/doc-eng.aspx?id=32603).

@@ -39,6 +39,8 @@ python main.py # Runs full processing pipeline
 ```
 #### Optional Arguments:
 - `--local`: Runs the script without downloading new datasets.
+- `--live`: Runs the script without any snapshot-related calculations.
+- `--download`: Downloads the input data without running the process.
 - `--help`: Provides additional help.
 
 ---
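These flags correspond to the `argparse` switches added to `main.py` in this same commit; a minimal, self-contained sketch of that parser (the wrapper function `build_parser` is illustrative, the flag names and help strings come from the diff below):

```python
import argparse

def build_parser():
    # Mirrors the parser wired up in main.py's main() as of this commit
    parser = argparse.ArgumentParser(
        description="Process service data and generate outputs.")
    parser.add_argument("--local", action="store_true",
                        help="Use local inputs without downloading new ones.")
    parser.add_argument("--live", action="store_true",
                        help="Run process without running update for snapshots.")
    parser.add_argument("--download", action="store_true",
                        help="Download the input data without running the processing script.")
    return parser

args = build_parser().parse_args(["--download"])
print(args.download, args.local)  # True False
```

`--help` is generated automatically by argparse from the help strings.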
@@ -65,7 +67,7 @@ python main.py # Runs full processing pipeline
 - **Update Frequency**: Ad-hoc
 
 ### [Utilities developed for GC Service Inventory data analysis](https://github.com/gc-performance/utilities)
-- **Files**: `org_var.csv`, `serv_prog.csv`
+- **Files**: `org_var.csv`, `serv_prog.csv`, `sid_registry.csv`
 - **Content**: A manually updated list of every organization, department, and agency with their associated names mapped to a single numeric ID (`org_var.csv`). Long-form program names from the 2018-2023 service inventory mapped to program IDs from Departmental Plans and Results Reports (`serv_prog.csv`).
 - **Update Frequency**: Ad-hoc
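The `org_var.csv` mapping described above (every name variant for an organization resolving to one numeric ID) amounts to a plain dictionary lookup; the column names, sample rows, and ID below are invented for illustration, not the real file contents:

```python
import csv
import io

# Hypothetical org_var.csv excerpt: the real file maps each organization
# name variant to a single numeric ID (field names here are assumed).
sample = (
    "org_name,org_id\n"
    "Treasury Board of Canada Secretariat,999\n"
    "TBS,999\n"
)

org_id_by_name = {row["org_name"]: row["org_id"]
                  for row in csv.DictReader(io.StringIO(sample))}
print(org_id_by_name["TBS"])  # 999
```

Every variant resolving to the same ID is what makes joining the inventory files on organization practical.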

@@ -138,6 +140,7 @@ For a more detailed description of each file and field, please consult [README_i
 - `context.md` - Context on this dataset for use with LLM.
 - `database.dbml` - **Draft** schema defining a database model.
 - `tidy-script` - Bash script producing file paths for deleting inputs, outputs, caches, etc.
+- `README_indicators.md` - Detailed information about datasets produced by script
 
 
 ### Python script files (src/)
@@ -160,6 +163,10 @@ For a more detailed description of each file and field, please consult [README_i
 - `generate_reference.py`: script for generating field names and types for all output files, see ref/ directory
 - `reference_fields.csv`: Table of all tables, fields, and datatypes for use with test script
 
+### Github workflows (.github/workflows)
+
+- `generate-files.yml`: Github actions script that produces releases on a given schedule or on an ad-hoc basis.
+
 ---
 
 ### Release Schedule
@@ -179,7 +186,7 @@ For a more detailed description of each file and field, please consult [README_i
 
 ---
 ## Directory structure for project
-*Given that files produced by the script are made available in releases, all transitory input and output files are no longer tracked with git, or included in the repo. Releases have a flat structure, so the directory structure below is only relevant if you clone the repo and run the script.*
+*Given that files produced by the script are made available in releases, all transitory input and output files are no longer tracked with git, or included in the repo. The exception is the input snapshots, which are a part of the repo. Releases have a flat structure, so the directory structure below is only relevant if you clone the repo and run the script.*
 
 ```
 .

README_indicators.md

Lines changed: 2 additions & 2 deletions
@@ -474,7 +474,7 @@ Unique list of service IDs with latest reporting year and department. Generated
 - `service_scope_ext_or_ent`: Calculated field that indicates whether the service is external or internal enterprise to assist in quick filtering of relevant services. Refers to reported value from latest fiscal year
 
 #### `si_all.csv`
-Full service inventory merging 2018–2023 datasets with the 2024 dataset. All `service_scope` included. See list of fields for `si.csv`. Generated by `src/merge.py/merge_si`.
+Full service inventory merging 2018–2023 datasets with the 2024 dataset. All `service_scope` included, not just `EXTERN` and `ENTERPRISE`. See list of fields for `si.csv`. Generated by `src/merge.py/merge_si`.
 
 #### `ss_all.csv`
-Full service standard dataset merging 2018–2023 datasets with the 2024 dataset. All `service_scope` included. See list of fields for `ss.csv`. Generated by `src/merge.py/merge_ss`.
+Full service standard dataset merging 2018–2023 datasets with the 2024 dataset. All `service_scope` included, not just `EXTERN` and `ENTERPRISE`. See list of fields for `ss.csv`. Generated by `src/merge.py/merge_ss`.
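The `si_all.csv` vs `si.csv` distinction documented above comes down to whether rows are filtered on `service_scope`; a toy sketch (rows and field values invented; the real merge happens in `src/merge.py/merge_si`):

```python
# Toy illustration: si_all keeps every service_scope value, while the
# filtered view keeps only EXTERN and ENTERPRISE. Rows are invented.
rows_2018_2023 = [{"service_id": "SRV001", "service_scope": "INTERN"}]
rows_2024_plus = [{"service_id": "SRV002", "service_scope": "EXTERN"}]

si_all = rows_2018_2023 + rows_2024_plus          # all service_scope kept
si = [r for r in si_all if r["service_scope"] in ("EXTERN", "ENTERPRISE")]
print(len(si_all), len(si))  # 2 1
```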

main.py

Lines changed: 68 additions & 66 deletions
@@ -62,7 +62,6 @@ def get_config():
         '2025-2026':'https://donnees-data.tpsgc-pwgsc.gc.ca/ba1/cp-pc/cp-pc-2526-eng.csv',
         '2026-2027':'https://donnees-data.tpsgc-pwgsc.gc.ca/ba1/cp-pc/cp-pc-2627-eng.csv'
     }
-
 
     program_urls_fr = {
         '2018-2019':'https://donnees-data.tpsgc-pwgsc.gc.ca/ba1/cp-pc/cp-pc-1819-fra.csv',
@@ -75,6 +74,7 @@ def get_config():
         '2025-2026':'https://donnees-data.tpsgc-pwgsc.gc.ca/ba1/cp-pc/cp-pc-2526-fra.csv',
         '2026-2027':'https://donnees-data.tpsgc-pwgsc.gc.ca/ba1/cp-pc/cp-pc-2627-fra.csv'
     }
+
     APP_COLS = [
         'num_applications_by_phone',
         'num_applications_online',
@@ -162,6 +162,7 @@ def main():
     parser = argparse.ArgumentParser(description="Process service data and generate outputs.")
     parser.add_argument("--local", action="store_true", help="Use local inputs without downloading new ones.")
     parser.add_argument("--live", action="store_true", help="Run process without running update for snapshots.")
+    parser.add_argument("--download", action="store_true", help="Download the input data without running the processing script.")
     args = parser.parse_args()
 
     # Set up logging
@@ -187,71 +188,72 @@ def main():
         download_program_csv_files(config)
 
     # Merge historical data
-    try:
-        logger.info("Merging historical data...")
-        si = merge_si(config)
-        ss = merge_ss(config)
-    except:
-        logger.error("Merging historical data failed", exc_info=True)
-        sys.exit(1)
-
-    # Generate processed files
-    logger.info("Generating processed files...")
-    si_ss_dict = process_files(si, ss, config)
-
-    # Run QA checks
-    logger.info("Running QA checks...")
-    qa_check(si, ss, config)
-
-    # Copying files from raw to utils
-    logger.info("Copying files from input to utils...")
-    build_ifoi(config)
-    copy_org_var(config)
-    build_data_dictionary(config)
-
-    # Run snapshots unless "live" arg was passed
-    if not args.live: # if the "live" option was passed, don't run the snapshots
-        snapshots_list = config['snapshots_list']
-        for snapshot in snapshots_list:
-            # Merge historical snapshot data
-            logger.info("Processing snapshots: %s", snapshot)
-            try:
-                logger.info("Merging historical data for snapshots...")
-                si_snap = merge_si(config, snapshot)
-                ss_snap = merge_ss(config, snapshot)
-            except:
-                logger.error("Merging historical data for snapshots failed", exc_info=True)
-                sys.exit(1)
-
-            # Generate processed files
-            logger.info("Generating processed snapshot files...")
-            si_ss_snap_dict = process_files(si_snap, ss_snap, config, snapshot)
-
-            # Compare snapshots to live data
-            try:
-                logger.info("Comparing snapshot to live data...")
-                si_compare_dict = {
-                    'df_base': si_ss_snap_dict['si'],
-                    'df_comp': si_ss_dict['si'],
-                    'base_name': f"{snapshot}_si",
-                    'comp_name':"si",
-                    'key_name':"fy_org_id_service_id",
-                    'file_name':"si_comparison"
-                }
-                build_compare_file(si_compare_dict, config, snapshot)
-
-                ss_compare_dict = {
-                    'df_base': si_ss_snap_dict['ss'],
-                    'df_comp': si_ss_dict['ss'],
-                    'base_name': f"{snapshot}_ss",
-                    'comp_name':"ss",
-                    'key_name':"fy_org_id_service_id_std_id",
-                    'file_name':"ss_comparison"
-                }
-                build_compare_file(ss_compare_dict, config, snapshot)
-            except:
-                logger.error("Comparing snapshot to live data failed", exc_info=True)
-
+    if not args.download: # If the "download" option was passed, do not run the process
+        try:
+            logger.info("Merging historical data...")
+            si = merge_si(config)
+            ss = merge_ss(config)
+        except:
+            logger.error("Merging historical data failed", exc_info=True)
+            sys.exit(1)
+
+        # Generate processed files
+        logger.info("Generating processed files...")
+        si_ss_dict = process_files(si, ss, config)
+
+        # Run QA checks
+        logger.info("Running QA checks...")
+        qa_check(si, ss, config)
+
+        # Copying files from raw to utils
+        logger.info("Copying files from input to utils...")
+        build_ifoi(config)
+        copy_org_var(config)
+        build_data_dictionary(config)
+
+        # Run snapshots unless "live" arg was passed
+        if not args.live: # if the "live" option was passed, don't run the snapshots
+            snapshots_list = config['snapshots_list']
+            for snapshot in snapshots_list:
+                # Merge historical snapshot data
+                logger.info("Processing snapshots: %s", snapshot)
+                try:
+                    logger.info("Merging historical data for snapshots...")
+                    si_snap = merge_si(config, snapshot)
+                    ss_snap = merge_ss(config, snapshot)
+                except:
+                    logger.error("Merging historical data for snapshots failed", exc_info=True)
+                    sys.exit(1)
+
+                # Generate processed files
+                logger.info("Generating processed snapshot files...")
+                si_ss_snap_dict = process_files(si_snap, ss_snap, config, snapshot)
+
+                # Compare snapshots to live data
+                try:
+                    logger.info("Comparing snapshot to live data...")
+                    si_compare_dict = {
+                        'df_base': si_ss_snap_dict['si'],
+                        'df_comp': si_ss_dict['si'],
+                        'base_name': f"{snapshot}_si",
+                        'comp_name':"si",
+                        'key_name':"fy_org_id_service_id",
+                        'file_name':"si_comparison"
+                    }
+                    build_compare_file(si_compare_dict, config, snapshot)
+
+                    ss_compare_dict = {
+                        'df_base': si_ss_snap_dict['ss'],
+                        'df_comp': si_ss_dict['ss'],
+                        'base_name': f"{snapshot}_ss",
+                        'comp_name':"ss",
+                        'key_name':"fy_org_id_service_id_std_id",
+                        'file_name':"ss_comparison"
+                    }
+                    build_compare_file(ss_compare_dict, config, snapshot)
+                except:
+                    logger.error("Comparing snapshot to live data failed", exc_info=True)
+
 
     # Log completion time
     elapsed_time = time.perf_counter() - start_time
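The restructured `main()` above gates the entire pipeline on `--download` and the snapshot pass on `--live`. Stripped of the real work, that gating reduces to the sketch below (step names are illustrative stand-ins, not the actual function calls):

```python
def run_pipeline(download=False, live=False, snapshots=("snap-a",)):
    """Sketch of main()'s control flow after this commit; steps are stubbed."""
    steps = ["fetch_inputs"]            # inputs are fetched first
    if download:                        # --download: stop after fetching
        return steps
    steps += ["merge", "process", "qa", "copy_utils"]
    if not live:                        # --live skips the snapshot pass
        for snap in snapshots:
            steps.append(f"snapshot:{snap}")
    return steps

print(run_pipeline(download=True))   # ['fetch_inputs']
print(run_pipeline(live=True))       # ['fetch_inputs', 'merge', 'process', 'qa', 'copy_utils']
```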
