This is the official repository for the PM100 HPC job workload dataset, containing the scripts used to create the final data. The official dataset can be found in .
- `extract_data.py`: the script containing the functions used to extract the final job power consumption from the job traces and power logs. It also shows an example of a pipeline that extracts such values from data structured like the M100 dataset.
- `inspect_data.py`: the script reporting the operations performed on the final PM100 data to produce the plots in the `plots` folder. It also provides a function to load the data and examples of how to inspect it (a minimal loading sketch follows this list).
- `documentation`: the folder containing documentation of the final dataset, such as the job features description.
- `plots`: the folder containing the plots presented in the paper.
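As a quick illustration of inspecting the data, the sketch below loads the final dataset with pandas. The file name is a placeholder, and `inspect_data.py` contains the actual loading function.

```python
# Minimal sketch: load the final PM100 data and take a first look.
import pandas as pd

pm100_path = "pm100.parquet"  # hypothetical path to the final dataset file

df = pd.read_parquet(pm100_path)
print(df.columns)  # the job features, described in documentation/job_features.md
print(df.head())   # a first look at the jobs
```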
All the packages used in the project are listed in `requirements.txt`; the Python version used was 3.6.8.
It is good practice to create a virtual environment and then install the required packages with `pip3 install -r requirements.txt`.
To extract the PM100 dataset from M100, you first need to download the corresponding data from Zenodo.
After downloading the data for the whole period (YY-MM from 20-05 to 20-10) or just a subset of it, extract the archives. The tables needed by the scripts are:
- `year_month=YY-MM/plugin=job_table/metric=job_info_marconi100/a_0.parquet`: Parquet file containing the job data;
- `year_month=YY-MM/plugin=ipmi_pub/metric=ps0_input_power/a_0.parquet`: Parquet file containing the first power socket metrics;
- `year_month=YY-MM/plugin=ipmi_pub/metric=ps1_input_power/a_0.parquet`: Parquet file containing the second power socket metrics.
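For reference, the snippet below sketches how these three tables for a single month could be read with pandas; the month `20-05` is only an example.

```python
# Sketch: read the three required tables for one month of the covered period.
import pandas as pd

base = "year_month=20-05"  # example month; substitute the YY-MM you downloaded

jobs = pd.read_parquet(f"{base}/plugin=job_table/metric=job_info_marconi100/a_0.parquet")
ps0 = pd.read_parquet(f"{base}/plugin=ipmi_pub/metric=ps0_input_power/a_0.parquet")
ps1 = pd.read_parquet(f"{base}/plugin=ipmi_pub/metric=ps1_input_power/a_0.parquet")
```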
The job tables for the different months can be merged by running `merge_m100_tables.py`, specifying the path to the downloaded data in the `job_table_data_path` variable.
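The actual merge logic lives in `merge_m100_tables.py`; the sketch below only illustrates the idea, assuming the directory layout listed above and a hypothetical output file name.

```python
# Illustrative sketch of merging the monthly job tables into a single file.
import glob
import pandas as pd

job_table_data_path = "/path/to/downloaded/data"  # set to the Zenodo download location

# Collect the monthly job tables following the directory layout shown above.
monthly_files = sorted(glob.glob(
    job_table_data_path
    + "/year_month=*/plugin=job_table/metric=job_info_marconi100/a_0.parquet"
))
merged = pd.concat((pd.read_parquet(f) for f in monthly_files), ignore_index=True)
merged.to_parquet("job_table_merged.parquet")  # hypothetical name for the merged table
```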
Before launching the extraction, the variables `job_table_path`, `ps[0, 1]_table_path`, and `final_table_path` must be initialized with the paths to the downloaded data and the desired output file for the dataset.
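For example, these variables might be initialized as follows; all paths are placeholders to adapt to your local layout.

```python
# All paths below are placeholders.
job_table_path = "job_table_merged.parquet"  # merged job table from the previous step
ps0_table_path = "/path/to/metric=ps0_input_power/a_0.parquet"  # first power socket metrics
ps1_table_path = "/path/to/metric=ps1_input_power/a_0.parquet"  # second power socket metrics
final_table_path = "pm100.parquet"  # desired output file for the extracted dataset
```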
The result of the execution is a Parquet file containing the data structured as presented in the `documentation/job_features.md` file.
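As a quick sanity check, the schema of the produced file can be printed and compared against the feature descriptions; a sketch using pyarrow, with a placeholder path, follows.

```python
# Print the schema of the output file to compare it with documentation/job_features.md.
import pyarrow.parquet as pq

print(pq.read_schema("pm100.parquet"))  # placeholder; use the value of final_table_path
```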