
Commit 2ba9ce0

enh: update editors script enhancement + docs (#137)
* enh: update editors script enhancement
* rename script
* Update scripts/get-editors.py
* fix: remove script but update data
* enh: describe scripts used in the repo
1 parent 89be3c8 commit 2ba9ce0

File tree

4 files changed: +67, -12 lines changed

.github/workflows/update-pr-data.yml

Lines changed: 3 additions & 0 deletions
@@ -49,6 +49,9 @@ jobs:
         run: python scripts/get-review-contributors.py
       - name: get-package-data
         run: python scripts/get-package-data.py
+      # Commenting this out because right now it's not running in CI as it should
+      # - name: get-editors
+      #   run: python scripts/get-editors.py
       - name: Cache metrics
         uses: actions/upload-artifact@v4
         with:

CONTRIBUTING.md

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+# CONTRIBUTING to our metrics repository
+
+Welcome to the metrics repository. We welcome contributions of all kinds,
+large, small, and in between.
+
+To get started contributing, be sure to fork and clone this repository.
+
+## About the data in this repository
+
+The data in the `_data/` directory of this repo contains contributor data
+for the pyOpenSci organization. This data includes:
+
+* Contributor pull request and issue data
+* Contributor data collected and parsed using the all-contributors bot
+* Peer review data collected from our software-submission repository
+* Editorial team data collected from our GitHub editorial team (some data, such as domain data, are added manually)
+
+## About the scripts in this repository
+
+The `scripts/` directory contains utility scripts for data collection, parsing, and analysis:
+
+* **get-editors.py**: Updates the editorial team CSV file with current editors by merging manually curated domain data and GitHub team membership (via the GraphQL API). Output: `_data/editorial_team_domains.csv`.
+* **get-package-data.py**: Retrieves package data from GitHub repositories using the GitHub API. Returns a dictionary of package information.
+* **get-prs.py**: Parses all active pyOpenSci repositories to collect contributor activity (issues and PRs) for the current year, excluding bots. Outputs a CSV for tracking contribution growth.
+* **get-review-contributors.py**: Extracts and stores review contributor data (editors/reviewers) from peer review YAML files, including location if available. Output: `review_contribs.csv`.
+* **get-reviews.py**: Parses all pyOpenSci reviews (presubmissions, closed submissions, etc.) to compile activity stats over time. Uses pyosMeta utilities for processing.
+* **get-sprint-data.py**: Collects and processes sprint-related issue and pull request data from our GitHub sprint project board using the GraphQL and REST APIs, with support for environment variables and progress tracking.
+
+## How the scripts are used
+
+The scripts above are run via a CI cron job, with the exception of the get-editors.py script, which currently
+does not run successfully in CI. Because our editorial team rotates slowly, it is fine to run that script locally and update the data manually for the time being.
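For convenience, here is a minimal sketch (not part of the commit) of how the `_data/editorial_team_domains.csv` file produced by get-editors.py might be loaded for downstream analysis. The path and column names come from the CSV shown further down this page; the filtering step is only illustrative.

```python
# Minimal sketch of loading the csv written by get-editors.py.
from pathlib import Path

import pandas as pd

# Path taken from the description above.
editors = pd.read_csv(Path("_data") / "editorial_team_domains.csv")

# Keep only editors currently marked active; the "active" column uses "yes"
# for current editors, as in the csv shown below.
active = editors[editors["active"] == "yes"]
print(active[["gh_username", "Domain_areas"]].head())
```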

_data/editorial_team_domains.csv

Lines changed: 5 additions & 2 deletions
@@ -1,17 +1,20 @@
 gh_username,active,first_name,last_name,country,state,OS,Domain_areas,Description,technical_areas
 Batalex,yes,Alexandre,Batisse,France,,"Windows, Mac, Linux","Statistics, ML, AI, Computer sciences, Bioinformatics","I work as a Data Scientist on health care data. I conduct epidemiology studies and maintain private packages (analytics, dataviz).","Data visualization, Data extraction & retrieval, Data munging, Python package structure, Documentation quality, Unit Testing, Continuous Integration, Object oriented programming, Web API's, Docker"
+HaoZeke,,,,,,,,,
 JuliMillan,yes,Julieta,Millan,Argentina,Buenos Aires,Mac,"Statistics, ML, AI, Ecology / Biology","Biology, neuroscience, industry data science","Data visualization, Data extraction & retrieval, Python package structure, Documentation quality, Object oriented programming, Tool usability / accessibility"
 NimaSarajpoor,yes,Nima,Sarajpoor,Canada,Ontario,"Mac, Linux","Mathematics, Statistics, ML, AI, Computer sciences, Energy",[PhD research] Analyzing renewable energy data to discover patterns using shape-aware unsupervised learning algorithms [Current work] Using SQL and Python to develop tools for our Fraud Detection applications for different business lines (insurance and finance),"Data visualization, Data extraction & retrieval, Data munging, Data deposition, Python package structure, Documentation quality, Unit Testing, Continuous Integration, Object oriented programming"
+SimonMolinsky,,,,,,,,,
 ab93,yes,Avik,Basu,USA,California,"Mac, Linux","NLP, text analysis, Linguistics, Mathematics, Statistics, ML, AI, Computer sciences, Education","Deep Learning, time series, industry data science, deep unsupervised learning, ML in finance ","Data visualization, Data munging, Python package structure, Documentation quality, Unit Testing, Continuous Integration, Object oriented programming, Web API's, Docker, Tool usability / accessibility, Python best practices"
 banesullivan,yes,Bane,Sullivan,United States,California,Mac,"Spatial data, spatial analysis, GIS, Geosciences / earth science, 3D visualization","Remote sensing of the environment and subsurface, developer advocacy, data science, 3D visualization","Data visualization, Python package structure, Documentation quality, Unit Testing, Continuous Integration, Object oriented programming, Web API's, Docker, Tool usability / accessibility"
 cmarmo,yes,Chiara,Marmo,USA,Hawaii,Linux,"Spatial data, spatial analysis, GIS, Space sciences, Geosciences / earth science, Astronomy","Data processing in Astronomy, Planetary Sciences, Geospatial data. Standard development, interoperability.","Data extraction & retrieval, Data munging, Data deposition, Documentation quality, Continuous Integration"
 coatless,yes,James,Balamuta,United States,California,"Mac, Linux","NLP, text analysis, Spatial data, spatial analysis, GIS, Mathematics, Statistics, ML, AI, Computer sciences, Bioinformatics, Education","Latent variable modeling, restricted latent class models, deep learning, computational statistics, psychometrics, item response theory, biostatistics, genomics","Data visualization, Data extraction & retrieval, Data munging, Data deposition, Python package structure, Documentation quality, Unit Testing, Continuous Integration, Object oriented programming, Web API's, Web scraping, Security, Docker, Tool usability / accessibility"
+crhea93,,,,,,,,,
 ctb,,,,,,,,,
 dhomeier,yes,Derek,Homeier,Germany,Lower Saxony,"Mac, Linux","Physics, Atmospheric sciences, Space sciences, Astrophysics","scientific data analysis, spectroscopy, atmospheric simulations","Data visualization, Data extraction & retrieval, Python package structure, Documentation quality, Unit Testing"
 eliotwrobson,yes,Eliot,Robson,United States,Illinois,"Windows, Linux","Mathematics, Computer sciences, Education","Algorithms, specifically involving randomness, geometry, and graph theory.","Data visualization, Data extraction & retrieval, Data munging, Python package structure, Documentation quality, Unit Testing, Continuous Integration, Object oriented programming"
 hamogu,yes,Hans Moritz,Günther,USA,MA,"Mac, Linux","Physics, Astronomy","Astronomy wit ha focus on star formation and high-energy observations, also instrument development","Data visualization, Data extraction & retrieval, Data munging, Data deposition, Python package structure, Documentation quality, Unit Testing, Continuous Integration, Object oriented programming"
-isabelizimm,yes,Isabel,Zimmerman,USA,Florida,Mac,"Statistics, ML, AI, Computer sciences","building IDEs and MLOps Python frameworks","Data visualization, Data extraction & retrieval, Python package structure, Documentation quality, Continuous Integration"
-
+isabelizimm,yes,Isabel,Zimmerman,USA,Florida,Mac,"Statistics, ML, AI, Computer sciences",building IDEs and MLOps Python frameworks,"Data visualization, Data extraction & retrieval, Python package structure, Documentation quality, Continuous Integration"
+jonas-eschle,,,,,,,,,
 mjhajharia,,,,,,,,,
 shirubana,yes,Silvana,Ovaitt,United States,Colorado,"Windows, Linux","Energy, Solar Energy, Spectra, Light","Photovoltaic optical and electrical modeling, and other renewable energy models","Data munging, Python package structure, Documentation quality, Tool usability / accessibility"
 slobentanzer,,,,,,,,,

scripts/get_editors.py renamed to scripts/get-editors.py

Lines changed: 27 additions & 10 deletions
@@ -1,16 +1,24 @@
-"""This script parses a largely manually created list of editors
-grabbed from our onboarding form. It then merges this data with the editorial
-board GitHub team grabbed using graphQL.
+"""This script updates our editorial team csv file with the most current editors.
 
-This provides a table with github username and technical and scinece domain areas
-that we can use.
+1. It parses a partially manually created list of editors found in the csv
+file: `_data/editorial_team_domains`. This csv was initially created by
+manually adding editor names to the file with domain areas from our google sheet.
+The (private) google sheet collects what domains they can support when they
+apply to be an editor.
+2. It then hits the GitHub API to return the list of gh usernames from the editorial team on GitHub.
+When we onboard a new editor, we add them to that team so they have proper permissions in repos in our org.
+The GitHub team data are grabbed using graphQL.
 
-I then pulled domains and gh_usernames to allow us to create a table with
-editors and associated domains
+3. Finally, this script merges the data parsed from the team with the csv file.
+
+The output is a csv file called _data/editorial_team_domains.csv that can be
+used to parse editor data.
 
 TODO:
 * it would be good to find a more automated way to get the domain data from our
-google sheet. one way to do this would be to use the airtable api if we move to airtable.
+google sheet. One way to do this would be to create a new spreadsheet that
+pulls from our editor signup but only contains gh username and then the domain areas.
+
 """
 
 import os
@@ -28,6 +36,10 @@


 def get_team_members():
+    """A function that hits the GH graphQL API and pulls down members
+    from our editorial team. This list should be the most current list of
+    pyOpenSci editors."""
+
     query = """
     {
       organization(login: "pyOpenSci") {
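For context, the snippet below is a hypothetical sketch of how a GraphQL team-membership query like the one started above could be sent to the GitHub GraphQL API with `requests`. The team slug, the token variable name, and the response handling are assumptions; the script's actual request code is not shown in this hunk.

```python
# Hypothetical sketch: posting a GraphQL query to api.github.com/graphql.
# The real script may use a different client, query shape, or team slug.
import os

import requests

QUERY = """
{
  organization(login: "pyOpenSci") {
    team(slug: "editorial-board") {
      members(first: 100) {
        nodes {
          login
        }
      }
    }
  }
}
"""

# GITHUB_TOKEN is an assumed environment variable holding a personal access token.
headers = {"Authorization": f"bearer {os.environ['GITHUB_TOKEN']}"}
response = requests.post(
    "https://api.github.com/graphql", json={"query": QUERY}, headers=headers
)
response.raise_for_status()
members = [
    node["login"]
    for node in response.json()["data"]["organization"]["team"]["members"]["nodes"]
]
print(members)
```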
@@ -66,16 +78,20 @@ def filter_members(members, exclude):


 if __name__ == "__main__":
+
+    # Pull down the list of gh usernames from our editorial team
     members = get_team_members()
     exclude = [
         "lwasser",
         "chayadecacao",
         "xuanxu",
     ]
+    # Exclude members who are administrative but don't actually lead reviews
     editorial_team_gh = filter_members(members, exclude)

+    # Open the csv file that contains domain info for editors and a list of gh usernames
     data_dir = Path("_data")
-    editor_domains = pd.read_csv(data_dir / "editor-domains.csv")
+    editor_domains = pd.read_csv(data_dir / "editorial_team_domains.csv")
     editor_domains["gh_username"] = editor_domains["gh_username"].str.replace(
         "@", ""
     )
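The body of `filter_members` is not part of this diff; judging from the call site above, a minimal sketch might look like the following (the actual implementation in the script may differ).

```python
def filter_members(members, exclude):
    """Drop administrative accounts that don't lead reviews.

    Sketch inferred from the call site above; the real implementation
    may differ.
    """
    return [member for member in members if member not in exclude]
```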
@@ -84,7 +100,8 @@ def filter_members(members, exclude):
     editorial_team_df = pd.DataFrame(
         editorial_team_gh, columns=["gh_username"]
     )
-
+    # Merge the graphQL data with the github team data
+    # This will result in empty domain data but an accurate list of current editors.
     all_editors = pd.merge(
         editorial_team_df, editor_domains, on="gh_username", how="outer"
    )
0 commit comments
