datacommonsorg · vishalworkdatacommon · Dec 19, 2025 · Dec 22, 2025 · Dec 23, 2025 · Dec 25, 2025
diff --git a/statvar_imports/us_college_ipeds/README.md b/statvar_imports/us_college_ipeds/README.md
@@ -20,14 +20,16 @@ The import process involves downloading raw data, preprocessing it to remove des
 *   **Input files**:
     *   Raw data files are downloaded from the source and stored in a GCP bucket.
     *   `run.sh`: Downloads the raw data files from the GCP bucket into the `input_files/` directory.
+    *   `preprocess.py`: Cleans the raw CSV files by removing initial descriptive rows.
     *   `metadata.csv`: Configuration file for the data processing script.
-    *   `pvmap.csv`: Property-value mapping files used by the processor.
+    *   `pvmap/`: Directory containing property-value mapping files used by the processor.
 
 *   **Transformation pipeline**:
     1.  Raw data files are downloaded from the source to a GCP bucket.
-    2.  `run.sh` script is executed to download these files to the `input_files/` directory and remove descriptive header rows
-    3.  The `stat_var_processor.py` tool is run on each cleaned CSV file, as specified in `manifest.json`.
-    4.  The processor uses the `metadata.csv` and respective `pvmap.csv` files to generate the final `output.csv` and `output.tmcf` files, placing them in the `output/` directory.
+    2.  The `run.sh` script is executed to download these files to the `input_files/` directory.
+    3.  The `preprocess.py` script is executed to remove descriptive header rows from the downloaded CSV files.
+    4.  The `stat_var_processor.py` tool is run on each cleaned CSV file, as specified in `manifest.json`.
+    5.  The processor uses `metadata.csv` and the corresponding mapping file in `pvmap/` to generate the final `output.csv` and `output.tmcf` files, placing them in the `output/` directory.
 
 *   **Data Quality Checks**:
     *   Linting is performed on the generated output files using the DataCommons import tool.
@@ -37,12 +39,13 @@ The import process involves downloading raw data, preprocessing it to remove des
 
 ## Autorefresh
 
-This import is considered semi-automated because the initial data download to the GCP bucket might require manual intervention. However, once in the bucket, the `run.sh` script can copy the files to input_files folder
+This import is considered semi-automated because the initial data download to the GCP bucket might require manual intervention. However, once the files are in the bucket, the `run.sh` and `preprocess.py` scripts automate the download and cleaning steps.
 
 *   **Steps**:
     1.  Ensure raw data files are in the specified GCP bucket.
-    2.  Execute `run.sh` to fetch the raw data files into `input_files/` and then it preprocess the input files to remove descriptive header rows
-    3.  The `stat_var_processor.py` tool is then run (as defined in `manifest.json`) on the preprocessed files to generate the final artifacts for ingestion.
+    2.  Execute `run.sh` to fetch the raw data files into `input_files/`.
+    3.  Execute `preprocess.py` to clean the input files by removing descriptive header rows.
+    4.  The `stat_var_processor.py` tool is then run (as defined in `manifest.json`) on the preprocessed files to generate the final artifacts for ingestion.
 
 ---
 
@@ -52,7 +55,7 @@ To run the import manually, follow these steps in order.
 
 ### Step 1: Download Raw Data (via `run.sh`)
 
-This script downloads the raw data from the GCP bucket to the `input_files/` directory and then preprocesses them to remove descriptive header rows
+This script downloads the raw data from the GCP bucket to the `input_files/` directory.
 
 **Usage**:
 
@@ -62,7 +65,19 @@ bash run.sh
 
 ---
 
-### Step 2: Process the Data for Final Output
+### Step 2: Preprocess the Data (via `preprocess.py`)
+
+This script cleans the downloaded CSV files in the `input_files/` directory by removing descriptive header rows.
+
+**Usage**:
+
+```shell
+python3 preprocess.py
+```
+
+---
+
+### Step 3: Process the Data for Final Output
 
 This step involves running the `stat_var_processor.py` for each input file as specified in `manifest.json`. An example command is shown below:
 
@@ -76,7 +91,7 @@ _Note: This command needs to be executed for all 10 input files as defined in `m
 
 ---
 
-### Step 3: Validate the Output Files
+### Step 4: Validate the Output Files
 
 This command validates the generated files for formatting and semantic consistency before ingestion.
 
@@ -86,4 +101,4 @@ This command validates the generated files for formatting and semantic consisten
 java -jar /path/to/datacommons-import-tool.jar lint -d 'output/'
 ```
 
-This step ensures that the generated artifacts are ready for ingestion into Data Commons.
+This step ensures that the generated artifacts are ready for ingestion into Data Commons.
diff --git a/statvar_imports/us_college_ipeds/manifest.json b/statvar_imports/us_college_ipeds/manifest.json
@@ -9,6 +9,7 @@
             "provenance_description": "Data focuses on Full-Time Equivalent (FTE) enrollment across U.S. postsecondary institutions",
             "scripts": [
             "run.sh",
+            "preprocess.py",
                 "../../tools/statvar_importer/stat_var_processor.py --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf --input_data=input_files/controlOfInstitution_data.csv --pv_map=pvmap/controlOfInstitution_pvmap.csv --config_file=metadata.csv --output_path=output/ControlOfInstitution_output",
                 "../../tools/statvar_importer/stat_var_processor.py --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf --input_data=input_files/degreeOfUrbanization_data.csv --pv_map=pvmap/degreeOfUrbanization_pvmap.csv --config_file=metadata.csv --output_path=output/DegreeOfUrbanization_output",
                 "../../tools/statvar_importer/stat_var_processor.py --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf --input_data=input_files/highestDegreeOffered_data.csv --pv_map=pvmap/highestDegreeOffered_pvmap.csv --config_file=metadata.csv --output_path=output/HighestDegreeOffered_output",

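The three `stat_var_processor.py` entries visible in this hunk follow one naming pattern, and the README notes that the command has to be run for all 10 input files. A minimal sketch of generating those command strings is below. `processor_command` is a hypothetical helper, not part of the PR, and the pattern is inferred only from the three entries shown, so the remaining datasets may not follow it.

```python
# Values copied from the manifest entries shown in the hunk above.
PROCESSOR = "../../tools/statvar_importer/stat_var_processor.py"
STATVAR_MCF = "gs://unresolved_mcf/scripts/statvar/stat_vars.mcf"


def processor_command(dataset):
    """Build a manifest-style command string for one dataset name.

    Assumes the pattern seen in the visible entries: lower camel case for the
    input and pvmap files, first letter upper-cased for the output prefix.
    """
    output_name = dataset[0].upper() + dataset[1:]
    return (
        f"{PROCESSOR} --existing_statvar_mcf={STATVAR_MCF} "
        f"--input_data=input_files/{dataset}_data.csv "
        f"--pv_map=pvmap/{dataset}_pvmap.csv "
        f"--config_file=metadata.csv "
        f"--output_path=output/{output_name}_output"
    )


for name in ("controlOfInstitution", "degreeOfUrbanization", "highestDegreeOffered"):
    print(processor_command(name))
```

Comparing the generated strings against `manifest.json` would be one way to spot a typo in a hand-written entry.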
diff --git a/statvar_imports/us_college_ipeds/preprocess.py b/statvar_imports/us_college_ipeds/preprocess.py
@@ -0,0 +1,38 @@
+"""Strips the descriptive rows that precede the header in downloaded CSV files."""
+import os
+
+def clean_csv(file_path):
+    """Rewrites file_path in place, dropping every line before the 'Year' header."""
+    with open(file_path, 'r') as f:
+        lines = f.readlines()
+
+    # The header row is taken to be the first line that starts with 'Year'.
+    start_index = -1
+    for i, line in enumerate(lines):
+        if line.strip().startswith('Year'):
+            start_index = i
+            break
+
+    if start_index != -1:
+        cleaned_content = lines[start_index:]
+        with open(file_path, 'w') as f:
+            f.writelines(cleaned_content)
+        print(f"Cleaned {file_path} successfully, removed {start_index} initial rows.")
+    else:
+        print(f"Could not find 'Year' in {file_path}. No changes made.")
+
+def clean_csv_in_directory(directory):
+    """Runs clean_csv on every .csv file directly inside directory."""
+    if not os.path.isdir(directory):
+        print(f"Directory '{directory}' not found.")
+        return
+
+    csv_files = [f for f in os.listdir(directory) if f.endswith('.csv')]
+
+    for csv_file in csv_files:
+        file_path = os.path.join(directory, csv_file)
+        clean_csv(file_path)
+
+if __name__ == '__main__':
+    input_directory = 'input_files'
+    clean_csv_in_directory(input_directory)
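The `startswith('Year')` check in `clean_csv` would also match a descriptive row that happens to begin with the word "Year" (for instance "Year-over-year figures ..."). A minimal sketch of a stricter variant is below, assuming the real header's first CSV cell is exactly `Year`. That is an assumption about the raw files, which this diff does not show. `find_header_index` and `strip_preamble` are hypothetical names, and the sample text is made up for illustration.

```python
import csv


def find_header_index(lines, header_cell="Year"):
    """Return the index of the first line whose first CSV cell equals header_cell, or -1."""
    for i, line in enumerate(lines):
        # Parse the single line as CSV; blank lines yield an empty row.
        row = next(csv.reader([line]), [])
        if row and row[0].strip() == header_cell:
            return i
    return -1


def strip_preamble(text, header_cell="Year"):
    """Return text without the lines before the header row; unchanged if no header is found."""
    lines = text.splitlines(keepends=True)
    start = find_header_index(lines, header_cell)
    return text if start == -1 else "".join(lines[start:])


# Made-up sample: two descriptive rows, a blank line, then the header and one data row.
sample = (
    "Enrollment trends\n"
    "Year-over-year figures are estimates\n"
    "\n"
    "Year,Public,Private nonprofit,Private for-profit\n"
    "2024,1,2,3\n"
)
cleaned = strip_preamble(sample)
print(cleaned)
```

Working on strings rather than file paths also keeps the logic testable without touching `input_files/`.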
diff --git a/statvar_imports/us_college_ipeds/run.sh b/statvar_imports/us_college_ipeds/run.sh
@@ -4,42 +4,4 @@ SCRIPT_PATH=$(realpath "$(dirname "$0")")
 
 mkdir -p "input_files"
 
-gsutil -m cp -r gs://unresolved_mcf/IPEDS/Enrollment_FTE_National/input_files/*.csv "$SCRIPT_PATH/input_files"
-
-python3 <<'END_PYTHON'
-import os
-import pandas as pd
-
-def clean_csv(file_path):
-    with open(file_path, 'r') as f:
-        lines = f.readlines()
-
-    start_index = -1
-    for i, line in enumerate(lines):
-        if line.strip().startswith('Year'):
-            start_index = i
-            break
-
-    if start_index != -1:
-        cleaned_content = lines[start_index:]
-        with open(file_path, 'w') as f:
-            f.writelines(cleaned_content)
-        print(f"Cleaned {file_path} successfully, removed {start_index} initial rows.")
-    else:
-        print(f"Could not find 'Year' in {file_path}. No changes made.")
-
-def clean_csv_in_directory(directory):
-    if not os.path.isdir(directory):
-        print(f"Directory '{directory}' not found.")
-        return
-
-    csv_files = [f for f in os.listdir(directory) if f.endswith('.csv')]
-
-    for csv_file in csv_files:
-        file_path = os.path.join(directory, csv_file)
-        clean_csv(file_path)
-
-if __name__ == '__main__':
-    input_directory = 'input_files'
-    clean_csv_in_directory(input_directory)
-END_PYTHON
+gsutil -m cp -r gs://unresolved_mcf/IPEDS/Enrollment_FTE_National/input_files/*.csv "$SCRIPT_PATH/input_files"
diff --git a/statvar_imports/us_college_ipeds/test_data/ControlOfInstitution_output.csv b/statvar_imports/us_college_ipeds/test_data/ControlOfInstitution_output.csv
@@ -0,0 +1,46 @@
+observationAbout,observationDate,value,variableMeasured,#input
+country/USA,2024,10895410,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:2:2
+country/USA,2024,3884457,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:2:3
+country/USA,2024,1344830,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:2:4
+country/USA,2023,10568750,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:3:2
+country/USA,2023,3812142,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:3:3
+country/USA,2023,1260771,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:3:4
+country/USA,2022,10591338,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:4:2
+country/USA,2022,3818557,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:4:3
+country/USA,2022,1274605,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:4:4
+country/USA,2021,10985128,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:5:2
+country/USA,2021,3802117,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:5:3
+country/USA,2021,1301882,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:5:4
+country/USA,2020,11366064,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:6:2
+country/USA,2020,3852214,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:6:3
+country/USA,2020,1245792,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:6:4
+country/USA,2019,11420024,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:7:2
+country/USA,2019,3842713,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:7:3
+country/USA,2019,1238471,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:7:4
+country/USA,2018,11470565,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:8:2
+country/USA,2018,3788721,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:8:3
+country/USA,2018,1229025,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:8:4
+country/USA,2017,11429561,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:9:2
+country/USA,2017,3764543,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:9:3
+country/USA,2017,1411905,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:9:4
+country/USA,2016,11441625,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:10:2
+country/USA,2016,3741014,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:10:3
+country/USA,2016,1530130,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:10:4
+country/USA,2015,11490719,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:11:2
+country/USA,2015,3740928,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:11:3
+country/USA,2015,1769917,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:11:4
+country/USA,2014,11573864,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:12:2
+country/USA,2014,3697921,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:12:3
+country/USA,2014,2039179,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:12:4
+country/USA,2013,11682275,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:13:2
+country/USA,2013,3676175,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:13:3
+country/USA,2013,2157925,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:13:4
+country/USA,2012,11924029,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:14:2
+country/USA,2012,3651247,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:14:3
+country/USA,2012,2544278,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:14:4
+country/USA,2011,12059233,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:15:2
+country/USA,2011,3650465,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:15:3
+country/USA,2011,2633725,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:15:4
+country/USA,2010,11804731,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:16:2
+country/USA,2010,3520524,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:16:3
+country/USA,2010,2515909,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:16:4
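The `#input` column in this test file appears to record the source file, row, and column behind each observation (for example `input_files/controlOfInstitution_data.csv:2:2`). That reading is inferred from the values above, not from documented behaviour of `stat_var_processor.py`. Under that assumption, a small sanity check on the golden file could look like the sketch below. `parse_input_ref` and `observations_per_year` are hypothetical helpers.

```python
import csv
import io
from collections import Counter

# Excerpt copied from the first rows of ControlOfInstitution_output.csv above.
EXCERPT = """observationAbout,observationDate,value,variableMeasured,#input
country/USA,2024,10895410,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:2:2
country/USA,2024,3884457,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:2:3
country/USA,2024,1344830,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:2:4
country/USA,2023,10568750,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:3:2
"""


def parse_input_ref(ref):
    """Split 'path:row:col' into (path, row, col), reading the last two fields as integers."""
    path, row, col = ref.rsplit(":", 2)
    return path, int(row), int(col)


def observations_per_year(csv_text):
    """Count observation rows per observationDate value."""
    counts = Counter()
    for record in csv.DictReader(io.StringIO(csv_text)):
        counts[record["observationDate"]] += 1
    return counts


rows = list(csv.DictReader(io.StringIO(EXCERPT)))
print(parse_input_ref(rows[0]["#input"]))
print(observations_per_year(EXCERPT))
```

Run against the full file, a check like this would flag any year that does not have exactly three observations (public, private not-for-profit, private for-profit).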