
Commit 2ba9ce0

enh: update editors script enhancement + docs (#137)
* enh: update editors script enhancement
* rename script
* Update scripts/get-editors.py
* fix: remove script but update data
* enh: describe scripts used in the repo
1 parent 89be3c8 commit 2ba9ce0

File tree

4 files changed: +67, -12 lines changed

.github/workflows/update-pr-data.yml

Lines changed: 3 additions & 0 deletions
@@ -49,6 +49,9 @@ jobs:
         run: python scripts/get-review-contributors.py
       - name: get-package-data
         run: python scripts/get-package-data.py
+      # Commenting this out because right now it's not running in CI as it should
+      # - name: get-editors
+      #   run: python scripts/get-editors.py
       - name: Cache metrics
         uses: actions/upload-artifact@v4
         with:

CONTRIBUTING.md

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+# CONTRIBUTING to our metrics repository
+
+Welcome to the metrics repository. We welcome contributions of all kinds,
+large, small, and in between.
+
+To get started contributing, be sure to fork and clone this repository.
+
+## About the data in this repository
+
+The data in the `_data/` directory of this repo contains contributor data
+for the pyOpenSci organization. This data includes:
+
+* Contributor pull request and issue data
+* Contributor data collected and parsed using the all-contributors bot
+* Peer review data collected from our software-submission repository
+* Editorial team data collected from our GitHub editorial team (some data, such as domain data, are added manually)
+
+## About the scripts in this repository
+
+The `scripts/` directory contains utility scripts for data collection, parsing, and analysis:
+
+* **get-editors.py**: Updates the editorial team CSV file with current editors by merging manually curated domain data and GitHub team membership (via the GraphQL API). Output: `_data/editorial_team_domains.csv`.
+* **get-package-data.py**: Retrieves package data from GitHub repositories using the GitHub API. Returns a dictionary of package information.
+* **get-prs.py**: Parses all active pyOpenSci repositories to collect contributor activity (issues and PRs) for the current year, excluding bots. Outputs a CSV for tracking contribution growth.
+* **get-review-contributors.py**: Extracts and stores review contributor data (editors/reviewers) from peer review YAML files, including location if available. Output: `review_contribs.csv`.
+* **get-reviews.py**: Parses all pyOpenSci reviews (presubmissions, closed submissions, etc.) to compile activity stats over time. Uses pyosMeta utilities for processing.
+* **get-sprint-data.py**: Collects and processes sprint-related issue and pull request data from our GitHub sprint project board using the GraphQL and REST APIs, with support for environment variables and progress tracking.
+
+## How the scripts are used
+
+The scripts above are run via a CI cron job, with the exception of the get-editors.py script, which currently
+does not run successfully in CI. Because our editorial team rotates slowly, it is fine to run that script locally and update the data manually for the time being.
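For convenience, here is a minimal sketch (not part of the commit) of how the `_data/editorial_team_domains.csv` file produced by get-editors.py might be loaded for downstream analysis. The path and column names come from the CSV shown further down this page; the filtering step is only illustrative.

```python
# Minimal sketch of loading the csv written by get-editors.py.
from pathlib import Path

import pandas as pd

# Path taken from the description above.
editors = pd.read_csv(Path("_data") / "editorial_team_domains.csv")

# Keep only editors currently marked active; the "active" column uses "yes"
# for current editors, as in the csv shown below.
active = editors[editors["active"] == "yes"]
print(active[["gh_username", "Domain_areas"]].head())
```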

_data/editorial_team_domains.csv

Lines changed: 5 additions & 2 deletions
@@ -1,17 +1,20 @@
 gh_username,active,first_name,last_name,country,state,OS,Domain_areas,Description,technical_areas
 Batalex,yes,Alexandre,Batisse,France,,"Windows, Mac, Linux","Statistics, ML, AI, Computer sciences, Bioinformatics","I work as a Data Scientist on health care data. I conduct epidemiology studies and maintain private packages (analytics, dataviz).","Data visualization, Data extraction & retrieval, Data munging, Python package structure, Documentation quality, Unit Testing, Continuous Integration, Object oriented programming, Web API's, Docker"
+HaoZeke,,,,,,,,,
 JuliMillan,yes,Julieta,Millan,Argentina,Buenos Aires,Mac,"Statistics, ML, AI, Ecology / Biology","Biology, neuroscience, industry data science","Data visualization, Data extraction & retrieval, Python package structure, Documentation quality, Object oriented programming, Tool usability / accessibility"
 NimaSarajpoor,yes,Nima,Sarajpoor,Canada,Ontario,"Mac, Linux","Mathematics, Statistics, ML, AI, Computer sciences, Energy",[PhD research] Analyzing renewable energy data to discover patterns using shape-aware unsupervised learning algorithms [Current work] Using SQL and Python to develop tools for our Fraud Detection applications for different business lines (insurance and finance),"Data visualization, Data extraction & retrieval, Data munging, Data deposition, Python package structure, Documentation quality, Unit Testing, Continuous Integration, Object oriented programming"
+SimonMolinsky,,,,,,,,,
 ab93,yes,Avik,Basu,USA,California,"Mac, Linux","NLP, text analysis, Linguistics, Mathematics, Statistics, ML, AI, Computer sciences, Education","Deep Learning, time series, industry data science, deep unsupervised learning, ML in finance ","Data visualization, Data munging, Python package structure, Documentation quality, Unit Testing, Continuous Integration, Object oriented programming, Web API's, Docker, Tool usability / accessibility, Python best practices"
 banesullivan,yes,Bane,Sullivan,United States,California,Mac,"Spatial data, spatial analysis, GIS, Geosciences / earth science, 3D visualization","Remote sensing of the environment and subsurface, developer advocacy, data science, 3D visualization","Data visualization, Python package structure, Documentation quality, Unit Testing, Continuous Integration, Object oriented programming, Web API's, Docker, Tool usability / accessibility"
 cmarmo,yes,Chiara,Marmo,USA,Hawaii,Linux,"Spatial data, spatial analysis, GIS, Space sciences, Geosciences / earth science, Astronomy","Data processing in Astronomy, Planetary Sciences, Geospatial data. Standard development, interoperability.","Data extraction & retrieval, Data munging, Data deposition, Documentation quality, Continuous Integration"
 coatless,yes,James,Balamuta,United States,California,"Mac, Linux","NLP, text analysis, Spatial data, spatial analysis, GIS, Mathematics, Statistics, ML, AI, Computer sciences, Bioinformatics, Education","Latent variable modeling, restricted latent class models, deep learning, computational statistics, psychometrics, item response theory, biostatistics, genomics","Data visualization, Data extraction & retrieval, Data munging, Data deposition, Python package structure, Documentation quality, Unit Testing, Continuous Integration, Object oriented programming, Web API's, Web scraping, Security, Docker, Tool usability / accessibility"
+crhea93,,,,,,,,,
 ctb,,,,,,,,,
 dhomeier,yes,Derek,Homeier,Germany,Lower Saxony,"Mac, Linux","Physics, Atmospheric sciences, Space sciences, Astrophysics","scientific data analysis, spectroscopy, atmospheric simulations","Data visualization, Data extraction & retrieval, Python package structure, Documentation quality, Unit Testing"
 eliotwrobson,yes,Eliot,Robson,United States,Illinois,"Windows, Linux","Mathematics, Computer sciences, Education","Algorithms, specifically involving randomness, geometry, and graph theory.","Data visualization, Data extraction & retrieval, Data munging, Python package structure, Documentation quality, Unit Testing, Continuous Integration, Object oriented programming"
 hamogu,yes,Hans Moritz,Günther,USA,MA,"Mac, Linux","Physics, Astronomy","Astronomy wit ha focus on star formation and high-energy observations, also instrument development","Data visualization, Data extraction & retrieval, Data munging, Data deposition, Python package structure, Documentation quality, Unit Testing, Continuous Integration, Object oriented programming"
-isabelizimm,yes,Isabel,Zimmerman,USA,Florida,Mac,"Statistics, ML, AI, Computer sciences","building IDEs and MLOps Python frameworks","Data visualization, Data extraction & retrieval, Python package structure, Documentation quality, Continuous Integration"
-
+isabelizimm,yes,Isabel,Zimmerman,USA,Florida,Mac,"Statistics, ML, AI, Computer sciences",building IDEs and MLOps Python frameworks,"Data visualization, Data extraction & retrieval, Python package structure, Documentation quality, Continuous Integration"
+jonas-eschle,,,,,,,,,
 mjhajharia,,,,,,,,,
 shirubana,yes,Silvana,Ovaitt,United States,Colorado,"Windows, Linux","Energy, Solar Energy, Spectra, Light","Photovoltaic optical and electrical modeling, and other renewable energy models","Data munging, Python package structure, Documentation quality, Tool usability / accessibility"
 slobentanzer,,,,,,,,,

scripts/get_editors.py renamed to scripts/get-editors.py

Lines changed: 27 additions & 10 deletions
@@ -1,16 +1,24 @@
-"""This script parses a largely manually created list of editors
-grabbed from our onboarding form. It then merges this data with the editorial
-board GitHub team grabbed using graphQL.
+"""This script updates our editorial team csv file with the most current editors.
 
-This provides a table with github username and technical and scinece domain areas
-that we can use.
+1. It parses a partially manually created list of editors found in the csv
+file: `_data/editorial_team_domains`. This csv was initially created by
+manually adding editor names to the file with domain areas from our google sheet.
+The (private) google sheet collects what domains they can support when they
+apply to be an editor.
+2. It then hits the GitHub API to return the list of gh usernames from the editorial team on GitHub.
+When we onboard a new editor, we add them to that team so they have proper permissions in repos in our org.
+The GitHub team data are grabbed using graphQL.
 
-I then pulled domains and gh_usernames to allow us to create a table with
-editors and associated domains
+3. Finally, this script merges the data parsed from the team with the csv file.
+
+The output is a csv file called _data/editorial_team_domains.csv that can be
+used to parse editor data.
 
 TODO:
 * it would be good to find a more automated way to get the domain data from our
-google sheet. one way to do this would be to use the airtable api if we move to airtable.
+google sheet. One way to do this would be to create a new spreadsheet that
+pulls from our editor signup but only contains gh username and then the domain areas.
+
 """
 
 import os
@@ -28,6 +36,10 @@


 def get_team_members():
+    """A function that hits the GH graphQL API and pulls down members
+    from our editorial team. This list should be the most current list of
+    pyOpenSci editors."""
+
     query = """
     {
       organization(login: "pyOpenSci") {
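For context, the snippet below is a hypothetical sketch of how a GraphQL team-membership query like the one started above could be sent to the GitHub GraphQL API with `requests`. The team slug, the token variable name, and the response handling are assumptions; the script's actual request code is not shown in this hunk.

```python
# Hypothetical sketch: posting a GraphQL query to api.github.com/graphql.
# The real script may use a different client, query shape, or team slug.
import os

import requests

QUERY = """
{
  organization(login: "pyOpenSci") {
    team(slug: "editorial-board") {
      members(first: 100) {
        nodes {
          login
        }
      }
    }
  }
}
"""

# GITHUB_TOKEN is an assumed environment variable holding a personal access token.
headers = {"Authorization": f"bearer {os.environ['GITHUB_TOKEN']}"}
response = requests.post(
    "https://api.github.com/graphql", json={"query": QUERY}, headers=headers
)
response.raise_for_status()
members = [
    node["login"]
    for node in response.json()["data"]["organization"]["team"]["members"]["nodes"]
]
print(members)
```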
@@ -66,16 +78,20 @@ def filter_members(members, exclude):


 if __name__ == "__main__":
+
+    # Pull down the list of gh usernames from our editorial team
     members = get_team_members()
     exclude = [
         "lwasser",
         "chayadecacao",
         "xuanxu",
     ]
+    # Exclude members who are administrative but don't actually lead reviews
     editorial_team_gh = filter_members(members, exclude)

+    # Open the csv file that contains domain info for editors and a list of gh usernames
     data_dir = Path("_data")
-    editor_domains = pd.read_csv(data_dir / "editor-domains.csv")
+    editor_domains = pd.read_csv(data_dir / "editorial_team_domains.csv")
     editor_domains["gh_username"] = editor_domains["gh_username"].str.replace(
         "@", ""
     )
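The body of `filter_members` is not part of this diff; judging from the call site above, a minimal sketch might look like the following (the actual implementation in the script may differ).

```python
def filter_members(members, exclude):
    """Drop administrative accounts that don't lead reviews.

    Sketch inferred from the call site above; the real implementation
    may differ.
    """
    return [member for member in members if member not in exclude]
```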
@@ -84,7 +100,8 @@ def filter_members(members, exclude):
     editorial_team_df = pd.DataFrame(
         editorial_team_gh, columns=["gh_username"]
     )
-
+    # Merge the graphQL data with the github team data
+    # This will result in empty domain data but an accurate list of current editors.
     all_editors = pd.merge(
         editorial_team_df, editor_domains, on="gh_username", how="outer"
    )
0 commit comments
