Commit a1b7f22

more updates to readme

1 parent 0c8bb1d

3 files changed: 82 additions, 73 deletions


README.md

Lines changed: 12 additions & 5 deletions
@@ -7,9 +7,9 @@ This Python project processes Government of Canada service-related data, merging
 
 ### Key Features
 - **Data ingestion**: Downloads and processes service inventory and service standard performance data.
-- **Dataset Merging**: Combines service inventory data from 2018-2023 historical datasets and 2024+ datasets from the Open Government Portal.
-- **Quality Assurance**: Identifies and flags inconsistencies in datasets.
-- **Output Generation**: Produces structured CSVs that reflect the latest information on the Open Government Portal.
+- **Dataset merging**: Combines service inventory data from 2018-2023 historical datasets and 2024+ datasets from the Open Government Portal.
+- **Quality assurance**: Identifies and flags inconsistencies in datasets.
+- **Output generation**: Produces structured CSVs that reflect the latest information on the Open Government Portal.
 
 Service inventory and service standard performance data are collected as a requirement under the [Policy on Service and Digital](https://www.tbs-sct.canada.ca/pol/doc-eng.aspx?id=32603).

@@ -39,6 +39,8 @@ python main.py # Runs full processing pipeline
 ```
 #### Optional Arguments:
 - `--local`: Runs the script without downloading new datasets.
+- `--live`: Runs the script without any snapshot-related calculations.
+- `--download`: Downloads the input data without running the process.
 - `--help`: Provides additional help.
 
 ---
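These flags correspond to the `argparse` switches added to `main.py` in this same commit; a minimal, self-contained sketch of that parser (the wrapper function `build_parser` is illustrative, the flag names and help strings come from the diff below):

```python
import argparse

def build_parser():
    # Mirrors the parser wired up in main.py's main() as of this commit
    parser = argparse.ArgumentParser(
        description="Process service data and generate outputs.")
    parser.add_argument("--local", action="store_true",
                        help="Use local inputs without downloading new ones.")
    parser.add_argument("--live", action="store_true",
                        help="Run process without running update for snapshots.")
    parser.add_argument("--download", action="store_true",
                        help="Download the input data without running the processing script.")
    return parser

args = build_parser().parse_args(["--download"])
print(args.download, args.local)  # True False
```

`--help` is generated automatically by argparse from the help strings.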
@@ -65,7 +67,7 @@ python main.py # Runs full processing pipeline
 - **Update Frequency**: Ad-hoc
 
 ### [Utilities developed for GC Service Inventory data analysis](https://github.com/gc-performance/utilities)
-- **Files**: `org_var.csv`, `serv_prog.csv`
+- **Files**: `org_var.csv`, `serv_prog.csv`, `sid_registry.csv`
 - **Content**: A manually updated list of every organization, department, and agency with their associated names mapped to a single numeric ID (`org_var.csv`). Long-form program names from the 2018-2023 service inventory mapped to program IDs from Departmental Plans and Results Reports (`serv_prog.csv`).
 - **Update Frequency**: Ad-hoc
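The `org_var.csv` mapping described above (every name variant for an organization resolving to one numeric ID) amounts to a plain dictionary lookup; the column names, sample rows, and ID below are invented for illustration, not the real file contents:

```python
import csv
import io

# Hypothetical org_var.csv excerpt: the real file maps each organization
# name variant to a single numeric ID (field names here are assumed).
sample = (
    "org_name,org_id\n"
    "Treasury Board of Canada Secretariat,999\n"
    "TBS,999\n"
)

org_id_by_name = {row["org_name"]: row["org_id"]
                  for row in csv.DictReader(io.StringIO(sample))}
print(org_id_by_name["TBS"])  # 999
```

Every variant resolving to the same ID is what makes joining the inventory files on organization practical.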

@@ -138,6 +140,7 @@ For a more detailed description of each file and field, please consult [README_i
 - `context.md` - Context on this dataset for use with LLM.
 - `database.dbml` - **Draft** schema defining a database model.
 - `tidy-script` - Bash script producing file paths for deleting inputs, outputs, caches, etc.
+- `README_indicators.md` - Detailed information about datasets produced by script
 
 
 ### Python script files (src/)
@@ -160,6 +163,10 @@ For a more detailed description of each file and field, please consult [README_i
 - `generate_reference.py`: script for generating field names and types for all output files, see ref/ directory
 - `reference_fields.csv`: Table of all tables, fields, and datatypes for use with test script
 
+### Github workflows (.github/workflows)
+
+- `generate-files.yml`: Github actions script that produces releases on a given schedule or on an ad-hoc basis.
+
 ---
 
 ### Release Schedule
@@ -179,7 +186,7 @@ For a more detailed description of each file and field, please consult [README_i
 
 ---
 ## Directory structure for project
-*Given that files produced by the script are made available in releases, all transitory input and output files are no longer tracked with git, or included in the repo. Releases have a flat structure, so the directory structure below is only relevant if you clone the repo and run the script.*
+*Given that files produced by the script are made available in releases, all transitory input and output files are no longer tracked with git, or included in the repo. The exception is the input snapshots, which are a part of the repo. Releases have a flat structure, so the directory structure below is only relevant if you clone the repo and run the script.*
 
 ```
 .

README_indicators.md

Lines changed: 2 additions & 2 deletions
@@ -474,7 +474,7 @@ Unique list of service IDs with latest reporting year and department. Generated
 - `service_scope_ext_or_ent`: Calculated field that indicates whether the service is external or internal enterprise to assist in quick filtering of relevant services. Refers to reported value from latest fiscal year
 
 #### `si_all.csv`
-Full service inventory merging 2018–2023 datasets with the 2024 dataset. All `service_scope` included. See list of fields for `si.csv`. Generated by `src/merge.py/merge_si`.
+Full service inventory merging 2018–2023 datasets with the 2024 dataset. All `service_scope` included, not just `EXTERN` and `ENTERPRISE`. See list of fields for `si.csv`. Generated by `src/merge.py/merge_si`.
 
 #### `ss_all.csv`
-Full service standard dataset merging 2018–2023 datasets with the 2024 dataset. All `service_scope` included. See list of fields for `ss.csv`. Generated by `src/merge.py/merge_ss`.
+Full service standard dataset merging 2018–2023 datasets with the 2024 dataset. All `service_scope` included, not just `EXTERN` and `ENTERPRISE`. See list of fields for `ss.csv`. Generated by `src/merge.py/merge_ss`.
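The `si_all.csv` vs `si.csv` distinction documented above comes down to whether rows are filtered on `service_scope`; a toy sketch (rows and field values invented; the real merge happens in `src/merge.py/merge_si`):

```python
# Toy illustration: si_all keeps every service_scope value, while the
# filtered view keeps only EXTERN and ENTERPRISE. Rows are invented.
rows_2018_2023 = [{"service_id": "SRV001", "service_scope": "INTERN"}]
rows_2024_plus = [{"service_id": "SRV002", "service_scope": "EXTERN"}]

si_all = rows_2018_2023 + rows_2024_plus          # all service_scope kept
si = [r for r in si_all if r["service_scope"] in ("EXTERN", "ENTERPRISE")]
print(len(si_all), len(si))  # 2 1
```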

main.py

Lines changed: 68 additions & 66 deletions
@@ -62,7 +62,6 @@ def get_config():
         '2025-2026':'https://donnees-data.tpsgc-pwgsc.gc.ca/ba1/cp-pc/cp-pc-2526-eng.csv',
         '2026-2027':'https://donnees-data.tpsgc-pwgsc.gc.ca/ba1/cp-pc/cp-pc-2627-eng.csv'
     }
-
 
     program_urls_fr = {
         '2018-2019':'https://donnees-data.tpsgc-pwgsc.gc.ca/ba1/cp-pc/cp-pc-1819-fra.csv',
@@ -75,6 +74,7 @@ def get_config():
         '2025-2026':'https://donnees-data.tpsgc-pwgsc.gc.ca/ba1/cp-pc/cp-pc-2526-fra.csv',
         '2026-2027':'https://donnees-data.tpsgc-pwgsc.gc.ca/ba1/cp-pc/cp-pc-2627-fra.csv'
     }
+
     APP_COLS = [
         'num_applications_by_phone',
         'num_applications_online',
@@ -162,6 +162,7 @@ def main():
     parser = argparse.ArgumentParser(description="Process service data and generate outputs.")
     parser.add_argument("--local", action="store_true", help="Use local inputs without downloading new ones.")
     parser.add_argument("--live", action="store_true", help="Run process without running update for snapshots.")
+    parser.add_argument("--download", action="store_true", help="Download the input data without running the processing script.")
     args = parser.parse_args()
 
     # Set up logging
@@ -187,71 +188,72 @@ def main():
         download_program_csv_files(config)
 
     # Merge historical data
-    try:
-        logger.info("Merging historical data...")
-        si = merge_si(config)
-        ss = merge_ss(config)
-    except:
-        logger.error("Merging historical data failed", exc_info=True)
-        sys.exit(1)
-
-    # Generate processed files
-    logger.info("Generating processed files...")
-    si_ss_dict = process_files(si, ss, config)
-
-    # Run QA checks
-    logger.info("Running QA checks...")
-    qa_check(si, ss, config)
-
-    # Copying files from raw to utils
-    logger.info("Copying files from input to utils...")
-    build_ifoi(config)
-    copy_org_var(config)
-    build_data_dictionary(config)
-
-    # Run snapshots unless "live" arg was passed
-    if not args.live: # if the "live" option was passed, don't run the snapshots
-        snapshots_list = config['snapshots_list']
-        for snapshot in snapshots_list:
-            # Merge historical snapshot data
-            logger.info("Processing snapshots: %s", snapshot)
-            try:
-                logger.info("Merging historical data for snapshots...")
-                si_snap = merge_si(config, snapshot)
-                ss_snap = merge_ss(config, snapshot)
-            except:
-                logger.error("Merging historical data for snapshots failed", exc_info=True)
-                sys.exit(1)
-
-            # Generate processed files
-            logger.info("Generating processed snapshot files...")
-            si_ss_snap_dict = process_files(si_snap, ss_snap, config, snapshot)
-
-            # Compare snapshots to live data
-            try:
-                logger.info("Comparing snapshot to live data...")
-                si_compare_dict = {
-                    'df_base': si_ss_snap_dict['si'],
-                    'df_comp': si_ss_dict['si'],
-                    'base_name': f"{snapshot}_si",
-                    'comp_name':"si",
-                    'key_name':"fy_org_id_service_id",
-                    'file_name':"si_comparison"
-                }
-                build_compare_file(si_compare_dict, config, snapshot)
-
-                ss_compare_dict = {
-                    'df_base': si_ss_snap_dict['ss'],
-                    'df_comp': si_ss_dict['ss'],
-                    'base_name': f"{snapshot}_ss",
-                    'comp_name':"ss",
-                    'key_name':"fy_org_id_service_id_std_id",
-                    'file_name':"ss_comparison"
-                }
-                build_compare_file(ss_compare_dict, config, snapshot)
-            except:
-                logger.error("Comparing snapshot to live data failed", exc_info=True)
-
+    if not args.download: # If the "download" option was passed, do not run the process
+        try:
+            logger.info("Merging historical data...")
+            si = merge_si(config)
+            ss = merge_ss(config)
+        except:
+            logger.error("Merging historical data failed", exc_info=True)
+            sys.exit(1)
+
+        # Generate processed files
+        logger.info("Generating processed files...")
+        si_ss_dict = process_files(si, ss, config)
+
+        # Run QA checks
+        logger.info("Running QA checks...")
+        qa_check(si, ss, config)
+
+        # Copying files from raw to utils
+        logger.info("Copying files from input to utils...")
+        build_ifoi(config)
+        copy_org_var(config)
+        build_data_dictionary(config)
+
+        # Run snapshots unless "live" arg was passed
+        if not args.live: # if the "live" option was passed, don't run the snapshots
+            snapshots_list = config['snapshots_list']
+            for snapshot in snapshots_list:
+                # Merge historical snapshot data
+                logger.info("Processing snapshots: %s", snapshot)
+                try:
+                    logger.info("Merging historical data for snapshots...")
+                    si_snap = merge_si(config, snapshot)
+                    ss_snap = merge_ss(config, snapshot)
+                except:
+                    logger.error("Merging historical data for snapshots failed", exc_info=True)
+                    sys.exit(1)
+
+                # Generate processed files
+                logger.info("Generating processed snapshot files...")
+                si_ss_snap_dict = process_files(si_snap, ss_snap, config, snapshot)
+
+                # Compare snapshots to live data
+                try:
+                    logger.info("Comparing snapshot to live data...")
+                    si_compare_dict = {
+                        'df_base': si_ss_snap_dict['si'],
+                        'df_comp': si_ss_dict['si'],
+                        'base_name': f"{snapshot}_si",
+                        'comp_name':"si",
+                        'key_name':"fy_org_id_service_id",
+                        'file_name':"si_comparison"
+                    }
+                    build_compare_file(si_compare_dict, config, snapshot)
+
+                    ss_compare_dict = {
+                        'df_base': si_ss_snap_dict['ss'],
+                        'df_comp': si_ss_dict['ss'],
+                        'base_name': f"{snapshot}_ss",
+                        'comp_name':"ss",
+                        'key_name':"fy_org_id_service_id_std_id",
+                        'file_name':"ss_comparison"
+                    }
+                    build_compare_file(ss_compare_dict, config, snapshot)
+                except:
+                    logger.error("Comparing snapshot to live data failed", exc_info=True)
+
 
     # Log completion time
     elapsed_time = time.perf_counter() - start_time
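The restructured `main()` above gates the entire pipeline on `--download` and the snapshot pass on `--live`. Stripped of the real work, that gating reduces to the sketch below (step names are illustrative stand-ins, not the actual function calls):

```python
def run_pipeline(download=False, live=False, snapshots=("snap-a",)):
    """Sketch of main()'s control flow after this commit; steps are stubbed."""
    steps = ["fetch_inputs"]            # inputs are fetched first
    if download:                        # --download: stop after fetching
        return steps
    steps += ["merge", "process", "qa", "copy_utils"]
    if not live:                        # --live skips the snapshot pass
        for snap in snapshots:
            steps.append(f"snapshot:{snap}")
    return steps

print(run_pipeline(download=True))   # ['fetch_inputs']
print(run_pipeline(live=True))       # ['fetch_inputs', 'merge', 'process', 'qa', 'copy_utils']
```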
