-
Notifications
You must be signed in to change notification settings - Fork 5
Allow supplemental API data to be passed to ColdfrontFetchProcessor #282
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -22,11 +22,23 @@ | |
| ] | ||
|
|
||
|
|
||
| SUPPLEMENTAL_START_DATE = "Start Date" | ||
| SUPPLEMENTAL_END_DATE = "End Date" | ||
|
|
||
|
|
||
| @functools.lru_cache | ||
| def get_rates_info(): | ||
| return load_from_url() | ||
|
|
||
|
|
||
| def _is_in_time_range(start, end) -> bool: | ||
| # Leveraging inherent lexicographical order of YYYY-MM strings | ||
| return ( | ||
| start <= invoice_settings.invoice_month | ||
| and invoice_settings.invoice_month <= end | ||
| ) | ||
|
|
||
|
|
||
| class Loader: | ||
| @functools.lru_cache | ||
| def get_csv_invoice_filepath_list(self) -> list[str]: | ||
|
|
@@ -126,13 +138,6 @@ def get_nonbillable_projects(self) -> pandas.DataFrame: | |
| 3. Is Timed: Boolean indicating if the nonbillable status is time-bound | ||
| """ | ||
|
|
||
| def _is_in_time_range(timed_object) -> bool: | ||
| # Leveraging inherent lexicographical order of YYYY-MM strings | ||
| return ( | ||
| timed_object["start"] <= invoice_settings.invoice_month | ||
| and invoice_settings.invoice_month <= timed_object["end"] | ||
| ) | ||
|
|
||
| project_list = [] | ||
| with open(invoice_settings.nonbillable_projects_filepath) as file: | ||
| projects_dict = yaml.safe_load(file) | ||
|
|
@@ -142,7 +147,7 @@ def _is_in_time_range(timed_object) -> bool: | |
| cluster_list = project.get("clusters") | ||
|
|
||
| if project.get("start"): | ||
| if not _is_in_time_range(project): | ||
| if not _is_in_time_range(project["start"], project["end"]): | ||
| continue | ||
|
|
||
| if cluster_list: | ||
|
|
@@ -154,7 +159,7 @@ def _is_in_time_range(timed_object) -> bool: | |
| for cluster in cluster_list: | ||
| cluster_start_time = cluster.get("start") | ||
| if cluster_start_time: | ||
| if _is_in_time_range(cluster): | ||
| if _is_in_time_range(cluster["start"], cluster["end"]): | ||
| project_list.append((project_name, cluster["name"], True)) | ||
| elif not cluster_start_time: | ||
| project_list.append((project_name, cluster["name"], False)) | ||
|
|
@@ -179,5 +184,20 @@ def get_nonbillable_timed_projects(self) -> list[tuple[str, str]]: | |
| ].itertuples(index=False, name=None) | ||
| ) | ||
|
|
||
| def get_supplement_api_data(self) -> pandas.DataFrame: | ||
| supplemental_df = pandas.DataFrame() | ||
| if invoice_settings.supplement_api_data_filepath: | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think you should log when a config file is load and when it's not loaded. |
||
| supplemental_df = pandas.read_csv( | ||
| invoice_settings.supplement_api_data_filepath | ||
| ) | ||
| in_time_range_mask = supplemental_df.apply( | ||
| lambda row: _is_in_time_range( | ||
| row[SUPPLEMENTAL_START_DATE], row[SUPPLEMENTAL_END_DATE] | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the supplemental data as of now[1] has no start and end date, so this would just error out. What is the behaviour supposed to be when no start or end date is found? Erroring out? logging? or assume the data is applicable (cc: @joachimweyl) [1] https://github.com/CCI-MOC/invoicing-private-data/blob/main/project_api_data.csv
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Log, and we should add dates. We will need to work with RH to obtain dates. The template is supposed to be gathering this but all of these projects are pre the template |
||
| ), | ||
| axis=1, | ||
| ) | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Suggestion: not that what you are doing is incorrect, but you could do the following which is more pandas like: in_time_range_mask = (
(supplemental_df[SUPPLEMENTAL_START_DATE] <= invoice_settings.invoice_month) &
(invoice_settings.invoice_month <= supplemental_df[SUPPLEMENTAL_END_DATE])
)This way we don't have to operate row by row. Not that we are running into performance issues but this is more of a vector operation. |
||
| supplemental_df = supplemental_df[in_time_range_mask] | ||
| return supplemental_df | ||
|
|
||
|
|
||
| loader = Loader() | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -26,13 +26,21 @@ | |
| CF_ATTR_INSTITUTION_SPECIFIC_CODE = "Institution-Specific Code" | ||
| CF_ATTR_IS_COURSE = "Is Course?" | ||
|
|
||
| SUPPLEMENTAL_PROJECT_ID = "Project - Allocation Name" | ||
| SUPPLEMENTAL_PROJECT_NAME = "Project - Title" | ||
| SUPPLEMENTAL_PI = "Manager (PI)" | ||
| SUPPLEMENTAL_CLUSTER_NAME = "Cluster Name" | ||
|
|
||
|
|
||
| @dataclass | ||
| class ColdfrontFetchProcessor(processor.Processor): | ||
| nonbillable_projects: pandas.DataFrame = field( | ||
| default_factory=loader.get_nonbillable_projects | ||
| ) | ||
| coldfront_data_filepath: str = invoice_settings.coldfront_api_filepath | ||
| supplement_api_data: pandas.DataFrame = field( | ||
| default_factory=lambda: loader.get_supplement_api_data() | ||
| ) | ||
|
|
||
| @functools.cached_property | ||
| def coldfront_client(self): | ||
|
|
@@ -125,6 +133,20 @@ def _get_allocation_data(self, coldfront_api_data): | |
| except KeyError: | ||
| continue | ||
|
|
||
| for _, row in self.supplement_api_data.iterrows(): | ||
| project_id = row[SUPPLEMENTAL_PROJECT_ID] | ||
| project_name = row[SUPPLEMENTAL_PROJECT_NAME] | ||
| pi_name = row[SUPPLEMENTAL_PI] | ||
| cluster_name = row[SUPPLEMENTAL_CLUSTER_NAME] | ||
|
|
||
| allocation_data[(project_id, cluster_name)] = { | ||
| invoice.PROJECT_FIELD: project_name, | ||
| invoice.PI_FIELD: pi_name, | ||
| invoice.INSTITUTION_ID_FIELD: "N/A", | ||
| invoice.CLUSTER_NAME_FIELD: cluster_name, | ||
| invoice.IS_COURSE_FIELD: False, # (TODO) Quan Assuming supplemental data does not contain course info? | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What do you mean by "Assuming supplemental data does not contain course info?"
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @joachimweyl The supplemental data's purpose is to provide data that the Coldfront API would normally have (an allocation's name, PI, whether it belongs to a course). The current supplemental data file does not currently have a column to indicate if a project belongs in a course. I wanted to ask if we want to assume projects listed in this file can be assumed to never be in courses. @joachimweyl Adding the extra column to indicate course-membership is simple.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please check the supplemental data file against the original; it looks like it is out of date. |
||
| } | ||
|
|
||
| return allocation_data | ||
|
|
||
| def _validate_allocation_data(self, allocation_data): | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a docstring here and in general going forward?
The reason I ask this is because I was wondering why this supplemental data like pi name and institution name are time bound, but after digging around turns it's a design decision. So, the docstring could convey that useful information (even linking to the issue or comment where we made this non obvious choice).