11 changes: 6 additions & 5 deletions README.md
@@ -3,16 +3,16 @@
 The [Annual Meeting](http://www.trb.org/AnnualMeeting/AnnualMeeting.aspx) of the [Transportation Research Board (TRB)](http://www.trb.org/Main/Home.aspx) is attended by over 10,000 participants. The core feature of the meeting is the set of sessions devoted to the presentation of research. Research papers are submitted to TRB and assigned to TRB committees. The TRB committees are composed of volunteers from academia and industry. These committees must review research papers and curate worthy entries into TRB sessions during the narrow time window from the paper deadline on August 31st to the posting of the preliminary Annual Meeting agenda in early December. For committees that receive large numbers of papers, this is a difficult task. The purpose of Chandra Bot is to use data and analysis to make the review process more efficient, effective, and fair. The project name is an homage to [Professor Chandra Bhat](http://www.caee.utexas.edu/prof/bhat/home.html) of the University of Texas -- the idea being that if we could only clone Professor Bhat and have him review each paper, the review process would be perfect.
 
 ## Data Model
-In order to organize our thinking and structure our code, we started with a [data model](/chandra_bot/chandra_bot_data_model.proto). It includes:
+In order to organize our thinking and structure our code, we started with a [data model](/chandra_bot/data_model_pydantic.py). It includes:
 * Humans -- humans write and review papers;
 * Papers -- research articles submitted to the Annual Meeting;
 * Reviews -- reviews of submitted papers; and,
 * Numerous other supporting data types and relationships.
 
-The data model is realized as a [Protocol Buffer](https://developers.google.com/protocol-buffers), which provides an abstraction between the model and the underlying software implementation.
+The data model is built on top of [Pydantic](https://docs.pydantic.dev/latest/), which performs data validation and serialization.
 
 ## Prototype Software
-Prototype software is created to efficiently explore the underlying data. It allows for any number of easy examinations. For example, say Reviewer A only uses a portion of the one to five scale used to rate TRB papers, giving each paper a score of 3, 4, or 5. Reviewer B similarly uses a portion of the scale, giving each paper a score of 1, 2, or 3. When a TRB committee receives scores from Reviewer A and Reviewer B, would it not be more efficient, effective, and fair if the committee could easily normalize these scores to each reviewers internal scoring system? The proposition puct forward here is that a useful data model paired with software is the first step in implementing committees with such tooling. A relatively unique feature of the TRB Annual Meeting is that a relatively small number of reviewers review papers from a relatively small number of authors every year. This allows the opportunity to find patterns and extract information from a time series of data that can be made useful in a relatively short period of time.
+Prototype software is created to efficiently explore the underlying data. It allows for any number of easy examinations. For example, say Reviewer A only uses a portion of the one to five scale used to rate TRB papers, giving each paper a score of 3, 4, or 5. Reviewer B similarly uses a portion of the scale, giving each paper a score of 1, 2, or 3. When a TRB committee receives scores from Reviewer A and Reviewer B, would it not be more efficient, effective, and fair if the committee could easily normalize these scores to each reviewer's internal scoring system? The proposition put forward here is that a useful data model paired with software is the first step in equipping committees with such tooling. A distinctive feature of the TRB Annual Meeting is that a relatively small number of reviewers review papers from a relatively small number of authors every year. This offers the opportunity to find patterns and extract useful information from a time series of data in a relatively short period of time.
 
 ## Fake Data
 The reviews of academic papers contain sensitive information. To facilitate testing and exploration of the Project's tools, we have created a time series of fake data (see the `/examples` directory) based on open databases of names, affiliations, and sentences. Any resemblance to real people or reviews is unintentional.
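The Pydantic-based data model referenced in the Data Model section can be sketched roughly as below. The class and field names here are illustrative assumptions, not the project's actual `data_model_pydantic.py`.

```python
from enum import IntEnum
from typing import List

from pydantic import BaseModel


class PresentationRecEnum(IntEnum):
    # Hypothetical codes; the real enum lives in data_model_pydantic.py.
    PRESENTATION_REC_NONE = 0
    PRESENTATION_REC_ACCEPT = 1
    PRESENTATION_REC_REJECT = 2


class Affiliation(BaseModel):
    name: str = ""


class Human(BaseModel):
    name: str
    aliases: List[str] = []
    current_affiliation: Affiliation = Affiliation()


# Pydantic validates on construction: a wrong type raises ValidationError,
# and model_dump_json() serializes the validated instance.
human = Human(name="Jane Doe", aliases=["J. Doe"])
print(human.model_dump_json())
```

The point of the Pydantic layer is that validation and JSON serialization come for free on every model, which is what the PR swaps in for the old Protocol Buffer machinery.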
@@ -22,8 +22,9 @@ One challenge in assembling the data that powers potential analysis is that ther

 ## Contributing Authors
 The Project is being led by TRB's Standing Committee on Travel Demand Forecasting. [David Ory]([email protected]) is the current paper review chair of this committee and is responsible for the Project. Other team members contributing to the project are:
-* [Sijia Wang](https://github.com/i-am-sijia); and,
-* [Gayathri Shivaraman](https://github.com/gshivaraman).
+* [Sijia Wang](https://github.com/i-am-sijia);
+* [Gayathri Shivaraman](https://github.com/gshivaraman); and,
+* [David Hensle](https://github.com/dhensle).
 
 ## License
 [Apache 2.0](LICENSE.txt)
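The reviewer-score normalization idea described in the Prototype Software section can be sketched with pandas. The z-score approach below is one plausible normalization, not necessarily the one the project implements.

```python
import pandas as pd

# Reviewer A uses only the top of the 1-5 scale; Reviewer B only the bottom.
reviews = pd.DataFrame({
    "reviewer": ["A", "A", "A", "B", "B", "B"],
    "score":    [3, 4, 5, 1, 2, 3],
})

# Normalize each reviewer's scores to zero mean / unit variance so that
# "a 5 from Reviewer A" and "a 3 from Reviewer B" become comparable.
grouped = reviews.groupby("reviewer")["score"]
reviews["z_score"] = (reviews["score"] - grouped.transform("mean")) / grouped.transform("std")

print(reviews)
```

With this data, both reviewers' best, middle, and worst papers land on the same normalized values, which is exactly the fairness property the README argues for.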
73 changes: 50 additions & 23 deletions chandra_bot/chandra_bot.py
@@ -7,8 +7,10 @@
 
 import numpy as np
 import pandas as pd
+import json
 
-from . import data_model_pb2 as dm
+# from . import data_model_pb2 as dm
+from . import data_model_pydantic as dm
 
 
 class ChandraBot(object):
@@ -102,7 +104,7 @@ def __init__(
             self.review_df: pd.DataFrame = review_df
             self.human_df: pd.DataFrame = human_df
 
-            self.paper_book = dm.PaperBook()
+            self.paper_book = dm.PaperBook(paper=[])
         else:
             self.paper_book: dm.PaperBook = input_paper_book
@@ -112,54 +114,61 @@ def _attribute_paper(self, paper: dm.Paper, row: list) -> None:
         paper.year = int(row["year"])
 
         if row["committee_presentation_decision"].lower() == "reject":
-            paper.committee_presentation_decision = dm.PRESENTATION_REC_REJECT
+            paper.committee_presentation_decision = dm.PresentationRecEnum.PRESENTATION_REC_REJECT
         elif row["committee_presentation_decision"].lower() == "accept":
-            paper.committee_presentation_decision = dm.PRESENTATION_REC_ACCEPT
+            paper.committee_presentation_decision = dm.PresentationRecEnum.PRESENTATION_REC_ACCEPT
         else:
-            paper.committee_presentation_decision = dm.PRESENTATION_REC_NONE
+            paper.committee_presentation_decision = dm.PresentationRecEnum.PRESENTATION_REC_NONE
 
         if row["committee_publication_decision"].lower() == "reject":
-            paper.committee_publication_decision = dm.PUBLICATION_REC_REJECT
+            paper.committee_publication_decision = dm.PublicationRecEnum.PUBLICATION_REC_REJECT
         elif row["committee_publication_decision"].lower() == "accept":
-            paper.committee_publication_decision = dm.PUBLICATION_REC_ACCEPT
+            paper.committee_publication_decision = dm.PublicationRecEnum.PUBLICATION_REC_ACCEPT
         elif row["committee_publication_decision"].lower() == "accept_correct":
-            paper.committee_publication_decision = dm.PUBLICATION_REC_ACCEPT_CORRECT
+            paper.committee_publication_decision = dm.PublicationRecEnum.PUBLICATION_REC_ACCEPT_CORRECT
         else:
-            paper.committee_publication_decision = dm.PUBLICATION_REC_NONE
+            paper.committee_publication_decision = dm.PublicationRecEnum.PUBLICATION_REC_NONE
 
+        paper.abstract = dm.Content.model_construct()
         if "abstract" in row:
             paper.abstract.text = row["abstract"]
         else:
             paper.abstract.text = "Missing"
 
+        paper.body = dm.Content.model_construct()
         if "body" in row:
             paper.body.text = str(row["body"])
         else:
             paper.body.text = "Missing"
 
     def _attribute_author(self, author: dm.Author, row: list):
+        author.human = dm.Human.model_construct()
         author.human.name = row["name"].values[0]
 
         if not pd.isnull(row["aliases"].values[0]):
             for alias in row["aliases"].values[0].split(","):
                 author.human.aliases.append(alias)
 
         author.human.hash_id = row["hash_id"].values[0]
+        author.human.current_affiliation = dm.Affiliation.model_construct()
         if not pd.isnull(row["current_affiliation"].values[0]):
             author.human.current_affiliation.name = row["current_affiliation"].values[0]
         else:
             author.human.current_affiliation.name = ""
 
+        author.human.last_degree_affiliation = dm.Affiliation.model_construct()
         author.human.last_degree_affiliation.name = str(
             row["last_degree_affiliation"].values[0]
         )
 
+        author.human.previous_affiliation = []
         if not pd.isnull(row["previous_affiliation"].values[0]):
             affil_list = row["previous_affiliation"].values[0].split(",")
             if len(affil_list) > 0:
                 for affil in affil_list:
-                    affiliation = author.human.previous_affiliation.add()
+                    affiliation = dm.Affiliation.model_construct()
                     affiliation.name = affil
+                    author.human.previous_affiliation.append(affiliation)
 
         if not pd.isnull(row["orcid_url"].values[0]):
             author.human.orcid_url = str(row["orcid_url"].values[0])
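The repeated if/elif chains that map decision strings onto enum members could be collapsed into a small lookup helper. `PresentationRecEnum` below is a stand-in re-declared for illustration; the project's real enum is in `data_model_pydantic.py`.

```python
from enum import IntEnum


class PresentationRecEnum(IntEnum):
    # Stand-in for the project's dm.PresentationRecEnum.
    PRESENTATION_REC_NONE = 0
    PRESENTATION_REC_ACCEPT = 1
    PRESENTATION_REC_REJECT = 2


# One table replaces the if/elif chain; unknown strings fall through to NONE.
_PRESENTATION_DECISIONS = {
    "accept": PresentationRecEnum.PRESENTATION_REC_ACCEPT,
    "reject": PresentationRecEnum.PRESENTATION_REC_REJECT,
}


def presentation_decision(value: str) -> PresentationRecEnum:
    """Map a free-text committee decision onto an enum, defaulting to NONE."""
    return _PRESENTATION_DECISIONS.get(
        value.strip().lower(), PresentationRecEnum.PRESENTATION_REC_NONE
    )


print(presentation_decision("Accept").name)
```

A table-driven mapping also makes copy-paste slips, such as assigning a presentation value to a publication field, much harder to make.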
@@ -174,35 +183,39 @@ def _attribute_author(self, author: dm.Author, row: list):
     def _attribute_review(self, review: dm.Review, row: list):
         review.presentation_score = row["presentation_score"]
 
+        review.commentary_to_author = dm.Content.model_construct()
         if not pd.isnull(row["commentary_to_author"]):
             review.commentary_to_author.text = row["commentary_to_author"]
         else:
             review.commentary_to_author.text = ""
 
+        review.commentary_to_chair = dm.Content.model_construct()
         if not pd.isnull(row["commentary_to_chair"]):
             review.commentary_to_chair.text = row["commentary_to_chair"]
         else:
             review.commentary_to_chair.text = ""
 
         if row["presentation_recommendation"].lower() == "reject":
-            review.presentation_recommend = dm.PRESENTATION_REC_REJECT
+            review.presentation_recommend = dm.PresentationRecEnum.PRESENTATION_REC_REJECT
         elif row["presentation_recommendation"].lower() == "accept":
-            review.presentation_recommend = dm.PRESENTATION_REC_ACCEPT
+            review.presentation_recommend = dm.PresentationRecEnum.PRESENTATION_REC_ACCEPT
         else:
-            review.presentation_recommend = dm.PRESENTATION_REC_NONE
+            review.presentation_recommend = dm.PresentationRecEnum.PRESENTATION_REC_NONE
 
         if row["publication_recommendation"].lower() == "reject":
-            review.publication_recommend = dm.PUBLICATION_REC_REJECT
+            review.publication_recommend = dm.PublicationRecEnum.PUBLICATION_REC_REJECT
         elif row["publication_recommendation"].lower() == "accept":
-            review.publication_recommend = dm.PUBLICATION_REC_ACCEPT
+            review.publication_recommend = dm.PublicationRecEnum.PUBLICATION_REC_ACCEPT
         else:
-            review.publication_recommend = dm.PRESENTATION_REC_NONE
+            review.publication_recommend = dm.PublicationRecEnum.PUBLICATION_REC_NONE
 
     def _attribute_reviewer(self, review: dm.Review, row: list):
+        review.reviewer = dm.Reviewer.model_construct()
 
         if row.empty:
             return
 
+        review.reviewer.human = dm.Human.model_construct()
         if not pd.isnull(row["name"].values[0]):
             review.reviewer.human.name = row["name"].values[0]
         else:
@@ -218,24 +231,28 @@ def _attribute_reviewer(self, review: dm.Review, row: list):
         else:
             review.reviewer.human.hash_id = ""
 
+        review.reviewer.human.current_affiliation = dm.Affiliation.model_construct()
         if not pd.isnull(row["current_affiliation"].values[0]):
             review.reviewer.human.current_affiliation.name = row[
                 "current_affiliation"
             ].values[0]
         else:
             review.reviewer.human.current_affiliation.name = ""
 
+        review.reviewer.human.last_degree_affiliation = dm.Affiliation.model_construct()
         if not pd.isnull(row["last_degree_affiliation"].values[0]):
             review.reviewer.human.last_degree_affiliation.name = str(
                 row["last_degree_affiliation"].values[0]
             )
         else:
             review.reviewer.human.last_degree_affiliation.name = ""
 
+        review.reviewer.human.previous_affiliation = []
         if not pd.isnull(row["previous_affiliation"].values[0]):
             for affil_name in row["previous_affiliation"].values[0].split(","):
-                affiliation = review.reviewer.human.previous_affiliation.add()
+                affiliation = dm.Affiliation.model_construct()
                 affiliation.name = affil_name
+                review.reviewer.human.previous_affiliation.append(affiliation)
 
         if not pd.isnull(row["orcid_url"].values[0]):
             review.reviewer.human.orcid_url = str(row["orcid_url"].values[0])
@@ -265,30 +282,39 @@ def assemble_paper_book(self):
             None
         """
         for paper_id in self.paper_df.index:
-            paper = self.paper_book.paper.add()
+            paper = dm.Paper.model_construct()
             paper.number = paper_id
             paper_row = self.paper_df.loc[paper_id]
             self._attribute_paper(paper, paper_row)
 
+            paper.authors = []
             if "author_ids" in self.paper_df.columns:
                 if not pd.isnull(paper_row.author_ids):
                     for author_id in paper_row.author_ids.split(","):
                         if self.human_df["author_id"].eq(author_id).any():
                             human_row = self.human_df.loc[
                                 self.human_df["author_id"] == author_id
                             ]
-                            self._attribute_author(paper.authors.add(), human_row)
+                            author = dm.Author.model_construct()
+                            self._attribute_author(author, human_row)
+                            paper.authors.append(author)
 
             paper_review_df = self.review_df.loc[self.review_df["paper_id"] == paper_id]
             paper_review_df.set_index("reviewer_human_hash_id")
 
+            paper.reviews = []
             for hash_id in paper_review_df.index:
                 review_row = paper_review_df.loc[hash_id]
                 reviewer_hash = review_row["reviewer_human_hash_id"]
                 human_row = self.human_df.loc[self.human_df["hash_id"] == reviewer_hash]
-                review = paper.reviews.add()
+                review = dm.Review.model_construct()
                 self._attribute_review(review, review_row)
                 self._attribute_reviewer(review, human_row)
+                paper.reviews.append(review)
+
+            # validate and add paper to paper book
+            dm.Paper.model_validate(paper)
+            self.paper_book.paper.append(paper)
 
     @staticmethod
     def create_bot(paper_file: str, review_file: str, human_file: str):
@@ -322,10 +348,11 @@ def read_paper_book(input_file: str):
         """
         read_paper_book
         """
-        paper_book = dm.PaperBook()
        try:
+            # data = json.load(input_file)
+            # paper_book = pd.PaperBook.model_validate_json(data, strict=False)
             with open(input_file, "rb") as file_pointer:
-                paper_book.ParseFromString(file_pointer.read())
+                paper_book = dm.PaperBook.model_validate_json(file_pointer.read(), strict=False)
         except IOError:
             print(input_file + ": File not found.")
 
@@ -341,7 +368,7 @@ def write_paper_book(self, output_file: str):
         write_paper_book
         """
         with open(output_file, "wb") as file_pointer:
-            file_pointer.write(self.paper_book.SerializeToString())
+            file_pointer.write(self.paper_book.model_dump_json().encode())
 
     def _compute_normalized_scores(self, min_number_reviews: int):
         scores_df = pd.DataFrame()
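The new read/write pattern in this file (`model_dump_json()` on write, `model_validate_json()` on read) can be exercised in isolation. The `Paper` and `PaperBook` models below are minimal stand-ins for the project's real classes, with illustrative fields only.

```python
from typing import List

from pydantic import BaseModel


class Paper(BaseModel):
    number: int
    title: str = ""


class PaperBook(BaseModel):
    # Named `paper` (singular) to mirror the diff's `paper_book.paper` accesses.
    paper: List[Paper] = []


book = PaperBook(paper=[Paper(number=1, title="Example")])

# Write side: serialize to JSON bytes, as write_paper_book now does.
payload = book.model_dump_json().encode()

# Read side: validate the JSON back into a model, as read_paper_book now does.
restored = PaperBook.model_validate_json(payload, strict=False)
print(restored.paper[0].title)
```

Unlike the old `ParseFromString`/`SerializeToString` Protocol Buffer round trip, the payload here is human-readable JSON, and `model_validate_json` re-runs full validation on load.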
81 changes: 0 additions & 81 deletions chandra_bot/data_model.proto

This file was deleted.