37 changes: 26 additions & 11 deletions statvar_imports/us_college_ipeds/README.md
@@ -20,14 +20,16 @@ The import process involves downloading raw data, preprocessing it to remove des
* **Input files**:
* Raw data files are downloaded from the source and stored in a GCP bucket.
* `run.sh`: Downloads the raw data files from the GCP bucket into the `input_files/` directory.
* `preprocess.py`: Cleans the raw CSV files by removing initial descriptive rows.
* `metadata.csv`: Configuration file for the data processing script.
* `pvmap.csv`: Property-value mapping files used by the processor.
* `pvmap/`: Directory containing property-value mapping files used by the processor.

* **Transformation pipeline**:
1. Raw data files are downloaded from the source to a GCP bucket.
2. `run.sh` script is executed to download these files to the `input_files/` directory and remove descriptive header rows
3. The `stat_var_processor.py` tool is run on each cleaned CSV file, as specified in `manifest.json`.
4. The processor uses the `metadata.csv` and respective `pvmap.csv` files to generate the final `output.csv` and `output.tmcf` files, placing them in the `output/` directory.
2. `run.sh` script is executed to download these files to the `input_files/` directory.
3. `preprocess.py` script is executed to remove descriptive header rows from the downloaded CSV files.
4. The `stat_var_processor.py` tool is run on each cleaned CSV file, as specified in `manifest.json`.
5. The processor uses the `metadata.csv` and respective `pvmap.csv` files to generate the final `output.csv` and `output.tmcf` files, placing them in the `output/` directory.

* **Data Quality Checks**:
* Linting is performed on the generated output files using the DataCommons import tool.
@@ -37,12 +39,13 @@ The import process involves downloading raw data, preprocessing it to remove des

## Autorefresh

This import is considered semi-automated because the initial data download to the GCP bucket might require manual intervention. However, once in the bucket, the `run.sh` script can copy the files to input_files folder
This import is considered semi-automated because the initial data download to the GCP bucket might require manual intervention. However, once in the bucket, the `run.sh` and `preprocess.py` scripts automate the download and cleaning process.

* **Steps**:
1. Ensure raw data files are in the specified GCP bucket.
2. Execute `run.sh` to fetch the raw data files into `input_files/` and then it preprocess the input files to remove descriptive header rows
3. The `stat_var_processor.py` tool is then run (as defined in `manifest.json`) on the preprocessed files to generate the final artifacts for ingestion.
2. Execute `run.sh` to fetch the raw data files into `input_files/`.
3. Execute `preprocess.py` to clean the input files by removing descriptive header rows.
4. The `stat_var_processor.py` tool is then run (as defined in `manifest.json`) on the preprocessed files to generate the final artifacts for ingestion.

---

@@ -52,7 +55,7 @@ To run the import manually, follow these steps in order.

### Step 1: Download Raw Data (via `run.sh`)

This script downloads the raw data from the GCP bucket to the `input_files/` directory and then preprocesses them to remove descriptive header rows
This script downloads the raw data from the GCP bucket to the `input_files/` directory.

**Usage**:

@@ -62,7 +65,19 @@ bash run.sh

---

### Step 2: Process the Data for Final Output
### Step 2: Preprocess the Data (via `preprocess.py`)

This script cleans the downloaded CSV files in the `input_files/` directory by removing descriptive header rows.

**Usage**:

```shell
python3 preprocess.py
```

---

### Step 3: Process the Data for Final Output

This step involves running the `stat_var_processor.py` for each input file as specified in `manifest.json`. An example command is shown below:

@@ -76,7 +91,7 @@ _Note: This command needs to be executed for all 10 input files as defined in `m

---

### Step 3: Validate the Output Files
### Step 4: Validate the Output Files

This command validates the generated files for formatting and semantic consistency before ingestion.

@@ -86,4 +101,4 @@ This command validates the generated files for formatting and semantic consisten
java -jar /path/to/datacommons-import-tool.jar lint -d 'output/'
```

This step ensures that the generated artifacts are ready for ingestion into Data Commons.
This step ensures that the generated artifacts are ready for ingestion into Data Commons.
1 change: 1 addition & 0 deletions statvar_imports/us_college_ipeds/manifest.json
@@ -9,6 +9,7 @@
"provenance_description": "Data focuses on Full-Time Equivalent (FTE) enrollment across U.S. postsecondary institutions",
"scripts": [
"run.sh",
"preprocess.py",
"../../tools/statvar_importer/stat_var_processor.py --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf --input_data=input_files/controlOfInstitution_data.csv --pv_map=pvmap/controlOfInstitution_pvmap.csv --config_file=metadata.csv --output_path=output/ControlOfInstitution_output",
"../../tools/statvar_importer/stat_var_processor.py --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf --input_data=input_files/degreeOfUrbanization_data.csv --pv_map=pvmap/degreeOfUrbanization_pvmap.csv --config_file=metadata.csv --output_path=output/DegreeOfUrbanization_output",
"../../tools/statvar_importer/stat_var_processor.py --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf --input_data=input_files/highestDegreeOffered_data.csv --pv_map=pvmap/highestDegreeOffered_pvmap.csv --config_file=metadata.csv --output_path=output/HighestDegreeOffered_output",
35 changes: 35 additions & 0 deletions statvar_imports/us_college_ipeds/preprocess.py
Contributor

Can you add the LICENSE text that we usually add for all source files?

# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@@ -0,0 +1,35 @@
import os
Contributor

Can you please add a top-level description to this file, like the one you already documented in the README?

"""
Cleans the raw CSV files by removing initial descriptive rows.
"""

import pandas as pd
Contributor

I don't see any usage of pandas in this file. Can we remove it from the list of imports then?


def clean_csv(file_path):
    with open(file_path, 'r') as f:
        lines = f.readlines()

    start_index = -1
    for i, line in enumerate(lines):
        if line.strip().startswith('Year'):
            start_index = i
            break

    if start_index != -1:
        cleaned_content = lines[start_index:]
        with open(file_path, 'w') as f:
            f.writelines(cleaned_content)
Contributor

We generally avoid modifying the original source files so that we can trace any issues back easily. Can we write the content to new files and have statvar processor work on them instead?

        print(f"Cleaned {file_path} successfully, removed {start_index} initial rows.")
Contributor

Replace the print statements with `logging.info` using absl logging.

Contributor Author

Addressed.

    else:
        print(f"Could not find 'Year' in {file_path}. No changes made.")

def clean_csv_in_directory(directory):
    if not os.path.isdir(directory):
        print(f"Directory '{directory}' not found.")
        return

    csv_files = [f for f in os.listdir(directory) if f.endswith('.csv')]

    for csv_file in csv_files:
        file_path = os.path.join(directory, csv_file)
        clean_csv(file_path)

if __name__ == '__main__':
    input_directory = 'input_files'
    clean_csv_in_directory(input_directory)
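
A minimal sketch of how two of the review suggestions above (leave the downloaded files untouched, log instead of print) might be combined. It is not the code that was merged: the `cleaned_files` directory name and the `dst_path` and `header_prefix` parameters are assumptions made for illustration, and the standard-library `logging` module stands in for absl logging so the snippet has no third-party dependency.

```python
import logging
import os


def clean_csv(src_path, dst_path, header_prefix='Year'):
    """Copy src_path to dst_path without the rows before the header row.

    Returns the number of rows dropped, or -1 if no header row was found
    (in which case nothing is written).
    """
    with open(src_path, 'r') as f:
        lines = f.readlines()

    for i, line in enumerate(lines):
        if line.strip().startswith(header_prefix):
            # Write to a separate file so the raw download stays traceable.
            with open(dst_path, 'w') as f:
                f.writelines(lines[i:])
            logging.info('Wrote %s, dropped %d initial rows.', dst_path, i)
            return i

    logging.warning("Could not find '%s' in %s; skipped.", header_prefix,
                    src_path)
    return -1


def clean_csv_in_directory(src_dir, dst_dir):
    """Clean every CSV in src_dir into dst_dir, leaving src_dir untouched."""
    if not os.path.isdir(src_dir):
        logging.error("Directory '%s' not found.", src_dir)
        return
    os.makedirs(dst_dir, exist_ok=True)
    for name in sorted(os.listdir(src_dir)):
        if name.endswith('.csv'):
            clean_csv(os.path.join(src_dir, name), os.path.join(dst_dir, name))


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    # 'cleaned_files' is a hypothetical output directory name.
    clean_csv_in_directory('input_files', 'cleaned_files')
```

With this layout, the `--input_data` paths in `manifest.json` would need to point at the cleaned directory rather than `input_files/`.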
40 changes: 1 addition & 39 deletions statvar_imports/us_college_ipeds/run.sh
Contributor

Similar to the previous comment, please add the license text to the top of this file.

@@ -4,42 +4,4 @@ SCRIPT_PATH=$(realpath "$(dirname "$0")")

mkdir -p "input_files"

gsutil -m cp -r gs://unresolved_mcf/IPEDS/Enrollment_FTE_National/input_files/*.csv "$SCRIPT_PATH/input_files"

python3 <<'END_PYTHON'
import os
import pandas as pd

def clean_csv(file_path):
    with open(file_path, 'r') as f:
        lines = f.readlines()

    start_index = -1
    for i, line in enumerate(lines):
        if line.strip().startswith('Year'):
            start_index = i
            break

    if start_index != -1:
        cleaned_content = lines[start_index:]
        with open(file_path, 'w') as f:
            f.writelines(cleaned_content)
        print(f"Cleaned {file_path} successfully, removed {start_index} initial rows.")
    else:
        print(f"Could not find 'Year' in {file_path}. No changes made.")

def clean_csv_in_directory(directory):
    if not os.path.isdir(directory):
        print(f"Directory '{directory}' not found.")
        return

    csv_files = [f for f in os.listdir(directory) if f.endswith('.csv')]

    for csv_file in csv_files:
        file_path = os.path.join(directory, csv_file)
        clean_csv(file_path)

if __name__ == '__main__':
    input_directory = 'input_files'
    clean_csv_in_directory(input_directory)
END_PYTHON
gsutil -m cp -r gs://unresolved_mcf/IPEDS/Enrollment_FTE_National/input_files/*.csv "$SCRIPT_PATH/input_files"
Contributor

Please add a newline at the end of this file.

@@ -0,0 +1,46 @@
observationAbout,observationDate,value,variableMeasured,#input
country/USA,2024,10895410,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:2:2
country/USA,2024,3884457,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:2:3
country/USA,2024,1344830,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:2:4
country/USA,2023,10568750,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:3:2
country/USA,2023,3812142,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:3:3
country/USA,2023,1260771,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:3:4
country/USA,2022,10591338,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:4:2
country/USA,2022,3818557,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:4:3
country/USA,2022,1274605,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:4:4
country/USA,2021,10985128,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:5:2
country/USA,2021,3802117,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:5:3
country/USA,2021,1301882,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:5:4
country/USA,2020,11366064,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:6:2
country/USA,2020,3852214,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:6:3
country/USA,2020,1245792,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:6:4
country/USA,2019,11420024,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:7:2
country/USA,2019,3842713,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:7:3
country/USA,2019,1238471,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:7:4
country/USA,2018,11470565,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:8:2
country/USA,2018,3788721,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:8:3
country/USA,2018,1229025,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:8:4
country/USA,2017,11429561,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:9:2
country/USA,2017,3764543,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:9:3
country/USA,2017,1411905,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:9:4
country/USA,2016,11441625,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:10:2
country/USA,2016,3741014,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:10:3
country/USA,2016,1530130,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:10:4
country/USA,2015,11490719,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:11:2
country/USA,2015,3740928,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:11:3
country/USA,2015,1769917,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:11:4
country/USA,2014,11573864,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:12:2
country/USA,2014,3697921,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:12:3
country/USA,2014,2039179,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:12:4
country/USA,2013,11682275,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:13:2
country/USA,2013,3676175,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:13:3
country/USA,2013,2157925,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:13:4
country/USA,2012,11924029,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:14:2
country/USA,2012,3651247,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:14:3
country/USA,2012,2544278,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:14:4
country/USA,2011,12059233,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:15:2
country/USA,2011,3650465,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:15:3
country/USA,2011,2633725,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:15:4
country/USA,2010,11804731,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PublicEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:16:2
country/USA,2010,3520524,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedNotForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:16:3
country/USA,2010,2515909,dcid:Count_Student_EnrolledInCollegeOrGraduateSchool_FullTimeEquivalent_PrivatelyOwnedForProfitEstablishment_PostSecondaryInstitution,input_files/controlOfInstitution_data.csv:16:4