Open
Changes from 21 commits
29 commits
b10b809
Add scripts for PR metrics from github API
mpg Jun 30, 2020
c531cae
Update requirements & allow use from venv
mpg Sep 30, 2020
a715053
Add comments about Ubuntu 20.04
mpg Dec 24, 2020
bae9af3
Make get-pr-data 10x faster
mpg Dec 24, 2020
00d8499
Avoid potential better-than-reality lifetime figures
mpg Dec 30, 2020
28dffa7
Adjust pr_dates() to reduce risk of misuse
mpg Dec 30, 2020
3d7880c
Adapt detection of community PRs
mpg Apr 2, 2021
37844d4
Add warning about making this work on 16.04
mpg Apr 2, 2021
cf9e41d
Avoid repeating the start date in many places
mpg Apr 2, 2021
f06becf
Update outdated comment
mpg Apr 2, 2021
cc05d6a
Make first and last date configurable
mpg Apr 2, 2021
ed1adea
Fix flake8 warnings
mpg Apr 2, 2021
08c0b7c
Rotate labels for quarters
mpg Apr 2, 2021
b2ee775
Clarify community detection
mpg May 19, 2021
3feb297
Smarter handling of p.mergeable in get-pr-data
mpg May 20, 2021
1d58093
Update pending-mergeability
mpg May 20, 2021
5f6d268
We no longer use labels for community PRs
mpg Sep 30, 2022
94533e1
Update list of core contributors
mpg Oct 12, 2022
b7f7f76
Update Readme (PR last date)
mpg Oct 12, 2022
e69fb3a
Shift one month for quarterly PR lifetime
mpg Jan 11, 2023
cd9c1f6
Update list of team member
mpg Jan 11, 2023
4d58ba0
Revert "Shift one month for quarterly PR lifetime"
mpg Jan 11, 2023
45fa6ce
Use statistics.median
mpg Jan 11, 2023
f1b54e1
Handle uncertainty about lifetimes
mpg Jan 11, 2023
ac21a51
Update Readme about incomplete results
mpg Jan 12, 2023
e095fc3
Update team members with current reviewers
mpg Apr 6, 2023
ce08049
Draw error bars, don't skip uncertain quarters
mpg Apr 6, 2023
c86237c
New script pr-backlog.py
mpg Apr 6, 2023
b7a02f6
Cosmetic adjustments
mpg Apr 6, 2023
4 changes: 4 additions & 0 deletions pr-metrics/.gitignore
@@ -0,0 +1,4 @@
__pycache__
pr-data.p
*.png
*.csv
47 changes: 47 additions & 0 deletions pr-metrics/Readme.md
@@ -0,0 +1,47 @@
These scripts collect some metrics about mbed TLS PRs over time.

Usage
-----

1. `./get-pr-data.py` - this takes a long time and requires the environment
   variable `GITHUB_API_TOKEN` to be set to a valid [github API
   token](https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token)
   (unauthenticated access to the API has a limit on the number of requests
   that is too low for our number of PRs). It generates `pr-data.p` with
   pickled data.
2. `PR_LAST_DATE=20yy-mm-dd ./do.sh` - this works offline from the data in
`pr-data.p` and generates a bunch of png and csv files.

Requirements
------------

These scripts require:

- Python >= 3.6 (required by recent enough matplotlib)
- matplotlib >= 3.1 (3.0 doesn't work)
- PyGithub >= 1.43 (any version should work, that was just the oldest tested)

### Ubuntu 20.04 (and probaly 18.04)

Nit: probaly -> probably

Suggested change
### Ubuntu 20.04 (and probaly 18.04)
### Ubuntu 20.04 (and probably 18.04)


Also works on 22.04 (tested by me) in case you want to note that for the future's sake.


A simple `apt install python3-github python3-matplotlib` is enough.

### Ubuntu 16.04

On Ubuntu 16.04, by default only Python 3.5 is available, which doesn't
support a recent enough matplotlib to support those scripts, so the following
was used to run those scripts on 16.04:

    sudo add-apt-repository ppa:deadsnakes/ppa
    sudo apt update
    sudo apt install python3.6 python3.6-venv
    python3.6 -m venv 36env
    source 36env/bin/activate
    pip install --upgrade pip
    pip install matplotlib
    pip install pygithub

See `requirements.txt` for an example of a set of working versions.

Note: if you do this, I strongly recommend uninstalling python3.6,
python3.6-venv and all their dependencies, then removing the deadsnakes PPA
before any upgrade to 18.04. Failing to do so will result in
dependency-related headaches as some packages in 18.04 depend on a specific
version of python3.6 but the version from deadsnakes is higher, so apt won't
downgrade it and manual intervention will be required.
9 changes: 9 additions & 0 deletions pr-metrics/do.sh
@@ -0,0 +1,9 @@
#!/bin/sh

set -eu

for topic in created closed pending lifetime; do
    echo "PRs $topic..."
    rm -f prs-${topic}.png prs-${topic}.csv
    ./pr-${topic}.py > prs-${topic}.csv
done
41 changes: 41 additions & 0 deletions pr-metrics/get-pr-data.py
@@ -0,0 +1,41 @@
#!/usr/bin/env python3
# coding: utf-8

"""Get PR data from github and pickle it."""

General suggestion, not a blocker: all scripts should support --help, and should have most of their code inside functions or classes so that they can be called from another script bypassing the command line interface.
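
A minimal sketch of the shape this suggestion points at, not the structure the PR actually uses. The helper `fetch_prs()` and the `--repo` and `--output` options are hypothetical names introduced for illustration, and the GitHub query is stubbed out so that only the layout is shown:

```python
import argparse
import pickle


def fetch_prs(repo_name):
    """Hypothetical stand-in for the PyGithub loop in get-pr-data.py."""
    return []  # the real script would return the list of PR objects


def build_parser():
    """Build the command line parser; argparse adds --help automatically."""
    parser = argparse.ArgumentParser(
        description="Get PR data from GitHub and pickle it.")
    parser.add_argument("--repo", default="ARMMbed/mbedtls",
                        help="repository to query (default: %(default)s)")
    parser.add_argument("--output", default="pr-data.p",
                        help="pickle file to write (default: %(default)s)")
    return parser


def main(argv=None):
    """Command line entry point; other scripts can call fetch_prs() directly."""
    args = build_parser().parse_args(argv)
    with open(args.output, "wb") as f:
        pickle.dump(fetch_prs(args.repo), f)
    return 0
```

A script laid out this way would typically end with `if __name__ == "__main__": sys.exit(main())`, which keeps the module importable without side effects.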

import pickle
import os

from github import Github

if "GITHUB_API_TOKEN" in os.environ:

Suggestion for later: make the token API (command line options, file storage, …) compatible with the official command line tool gh. (This is not the markedly less popular python gh.) I'd like to make my own scripts compatible too — we (you, me, and I guess at least @bensze01 as well) should coordinate and write a python module for that, if there isn't already one. (I didn't find one in a cursory search but “python github gh” are not very specific search terms.)
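
One small step in that direction could be to accept the environment variables that `gh` itself reads. The sketch below assumes that `gh` honours `GH_TOKEN` and `GITHUB_TOKEN`; the function name `find_token` and the lookup order are hypothetical, and reading `gh`'s stored credentials is deliberately left out:

```python
import os


def find_token(environ=None):
    """Return the first non-empty token among the candidate variables.

    GITHUB_API_TOKEN is the variable used by get-pr-data.py; the other two
    are assumed to be the ones read by the gh command line tool.
    """
    if environ is None:
        environ = os.environ
    for name in ("GITHUB_API_TOKEN", "GH_TOKEN", "GITHUB_TOKEN"):
        token = environ.get(name)
        if token:
            return token
    return None
```

Passing the environment as a parameter keeps the lookup easy to exercise without touching the real process environment.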

    token = os.environ["GITHUB_API_TOKEN"]
else:
    # Stop here: without a token, Github(token) below would hit a NameError.
    raise SystemExit("You need to provide a GitHub API token")

g = Github(token)
r = g.get_repo("ARMMbed/mbedtls")

prs = list()
for p in r.get_pulls(state="all"):
    print(p.number)

I'd be tempted to make this more informative - maybe changing it from p.number to "Fetching #" + str(p.number)

Might not be necessary for such a simple script though

    # Accessing p.mergeable forces completion of PR data (by default, only
    # basic info such as status and dates is available) but makes things
    # slower (about 10x). Only do that for open PRs; we don't need the extra
    # info for old PRs (only the dates which are part of the basic info).
    if p.state == 'open':
        dummy = p.mergeable
    prs.append(p)

# After a branch has been updated, github doesn't immediately go and recompute
# potential conflicts for all open PRs against this branch; instead it does
# that when the info is requested and even then it's done asynchronously: the
# first request might return no data, but if we come back after we've done all
# the other PRs, the info should have become available in the meantime.
for p in prs:
    if p.state == 'open' and p.mergeable is None:
        print(p.number, 'update')
        p.update()

with open("pr-data.p", "wb") as f:
    pickle.dump(prs, f)
36 changes: 36 additions & 0 deletions pr-metrics/pending-mergeability.py
@@ -0,0 +1,36 @@
#!/usr/bin/env python3
# coding: utf-8

"""Produce summary of PRs pending per branch and their mergeability status."""

Please document the requirements to run this script (run get-pr-data.py to produce pr-data.p). Goes for the other scripts as well.


import pickle
from datetime import datetime
from collections import Counter

with open("pr-data.p", "rb") as f:
    prs = pickle.load(f)

c_open = Counter()
c_mergeable = Counter()
c_recent = Counter()
c_recent2 = Counter()

Non-blocker: is recent2 more recent than recent or less? Are they exclusive or staggered? Better names would help.

I'd prefer cnt_open or even count_open (etc), which I think help readability quite a bit


for p in prs:
    if p.state != "open":
        continue

    branch = p.base.ref
    c_open[branch] += 1
    if p.mergeable:
        c_mergeable[branch] += 1
    days = (datetime.now() - p.updated_at).days

Since now() is called on every element, if you run this script twice at the same time, you might get inconsistently different day counts because prs is traversed in a different order. Ok, this is not critical in this reporting script, but it would be better practice to call now() only once and work from that reference time.

Also applies to other scripts that call now in a loop.
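
A minimal sketch of that practice, using plain `(branch, updated_at)` tuples instead of PyGithub objects. The function name `count_recent` and the sample data are made up for illustration:

```python
from collections import Counter
from datetime import datetime, timedelta


def count_recent(updates, now, max_days):
    """Count, per branch, the entries updated fewer than max_days before now."""
    counts = Counter()
    for branch, updated_at in updates:
        if (now - updated_at).days < max_days:
            counts[branch] += 1
    return counts


# In a script this would be datetime.now(), called exactly once.
now = datetime(2023, 4, 6, 12, 0)
sample = [
    ("development", now - timedelta(days=3)),
    ("development", now - timedelta(days=20)),
    ("mbedtls-2.28", now - timedelta(days=40)),
]
recent_31 = count_recent(sample, now, 31)
recent_8 = count_recent(sample, now, 8)
```

Because `now` is an argument, every PR is measured against the same reference time and the counts no longer depend on traversal order.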

    if days < 31:
        c_recent[branch] += 1
    if days < 8:
        c_recent2[branch] += 1


print("              branch:       open,  mergeable,       <31d,        <8d")
for b in sorted(c_open, key=lambda b: c_open[b], reverse=True):

I'd traditionally define lambda functions as lambda x: etc, notable here since this makes it seem like there's some relationship between the lambda b and for b syntactically, which there isn't.
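
One way to sidestep the naming question altogether, offered as an alternative rather than as what the PR does, is `Counter.most_common()`, which already yields the keys by decreasing count. The counts below are sample values:

```python
from collections import Counter

c_open = Counter({"mbedtls-2.28": 6, "development": 177, "mbedtls-2.16": 1})

# most_common() returns (key, count) pairs, highest count first.
ordered = c_open.most_common()
for branch, n_open in ordered:
    print("{:>20}: {:10}".format(branch, n_open))
```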

    print("{:>20}: {: 10}, {: 10}, {: 10}, {:10}".format(

This produces hard to read tables if a branch name is > 20 chars long (see below). However, I can't immediately see a good/simple way to fix this, so it might just have to be accepted as a limitation

              branch:       open,  mergeable,       <31d,        <8d
         development:        177,         91,         51,         41
        mbedtls-2.28:          6,          6,          3,          3
dev/gilles-peskine-arm/psa-test-op-fail:          1,          1,          1,          1
        mbedtls-2.16:          1,          0,          0,          0
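
One possible approach, sketched here under the assumption that all rows are available before printing, is to size the first column from the longest branch name. The function name `format_table` is hypothetical; the rows are the ones from the output above:

```python
def format_table(header, rows):
    """Return table lines with the first column sized to the longest name."""
    table = [header] + list(rows)
    width = max(len(row[0]) for row in table)
    lines = []
    for name, *values in table:
        cells = ", ".join("{:>10}".format(v) for v in values)
        # The nested {w} field sets the width of the name column at run time.
        lines.append("{:>{w}}: {}".format(name, cells, w=width))
    return lines


header = ("branch", "open", "mergeable", "<31d", "<8d")
rows = [
    ("development", 177, 91, 51, 41),
    ("mbedtls-2.28", 6, 6, 3, 3),
    ("dev/gilles-peskine-arm/psa-test-op-fail", 1, 1, 1, 1),
    ("mbedtls-2.16", 1, 0, 0, 0),
]
lines = format_table(header, rows)
```

The trade-off is that one very long branch name widens every line of the table.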

        b, c_open[b], c_mergeable[b], c_recent[b], c_recent2[b]))
46 changes: 46 additions & 0 deletions pr-metrics/pr-closed.py
@@ -0,0 +1,46 @@
#!/usr/bin/env python3

Several scripts, especially pr-closed and pr-created, have a lot of code in common and should be factored into one with multiple outputs or options to choose between outputs.
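
A rough sketch of how the shared counting loop could be factored out. `prs.py` is not part of what is shown here, so `quarter()` below is a hypothetical stand-in that returns labels such as `23q1`, and `count_by_quarter()` is an invented name:

```python
from collections import Counter
from datetime import date


def quarter(d):
    """Hypothetical stand-in for prs.quarter(): label such as '23q1'."""
    return "{:02d}q{}".format(d.year % 100, (d.month + 2) // 3)


def count_by_quarter(dates, pick):
    """Return (all, community) Counters keyed by quarter label.

    dates: iterable of (beg, end, com) tuples, as yielded by pr_dates().
    pick: function of (beg, end) returning the quarter, or None to skip.
    """
    cnt_all, cnt_com = Counter(), Counter()
    for beg, end, com in dates:
        q = pick(beg, end)
        if q is None:
            continue
        cnt_all[q] += 1
        if com:
            cnt_com[q] += 1
    return cnt_all, cnt_com


sample = [
    (date(2023, 1, 5), date(2023, 4, 2), True),
    (date(2023, 2, 1), None, False),
    (date(2023, 2, 10), date(2023, 3, 1), False),
]
created_all, created_com = count_by_quarter(
    sample, lambda beg, end: quarter(beg))
closed_all, closed_com = count_by_quarter(
    sample, lambda beg, end: quarter(end) if end else None)
```

With this split, the created and closed graphs differ only in the `pick` function and in their labels.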

# coding: utf-8

"""Produce graph of PRs closed by time period."""

from prs import pr_dates, quarter, first, last

from collections import Counter

import matplotlib.pyplot as plt

first_q = quarter(first)
last_q = quarter(last)

cnt_all = Counter()
cnt_com = Counter()

I could guess that cnt stands for “count” or “counter” even without seeing = Counter(), but what's com? “Command”? “Communication”?

(Obviously by reading the rest of the code I can tell that it's “community”. But I wouldn't have guessed.)

And actually, as I wasn't familiar with collections.Counter, I first thought that these were counters, but they're actually dictionaries of counters, or multisets. (I consider this a bad choice of name in the Python standard library. It's fine for this script to follow this naming choice.) It would help to know what the keys are. This goes in other scripts that use Counter as well.
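
For readers in the same position, a short illustration of how `collections.Counter` behaves as a mapping from key to count. The quarter labels are sample values in an assumed format:

```python
from collections import Counter

cnt_all = Counter()  # maps quarter label -> number of PRs
for q in ["22q4", "23q1", "23q1"]:
    cnt_all[q] += 1  # a missing key counts as 0, so no KeyError here

seen = sorted(cnt_all)        # keys are the quarter labels
unseen = cnt_all["21q1"]      # looking up an absent key does not add it
```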


for beg, end, com in pr_dates():
    if end is None:
        continue
    q = quarter(end)
    cnt_all[q] += 1
    if com:
        cnt_com[q] += 1

quarters = tuple(sorted(q for q in cnt_all if first_q <= q <= last_q))

prs_com = tuple(cnt_com[q] for q in quarters)
prs_team = tuple(cnt_all[q] - cnt_com[q] for q in quarters)

width = 0.9
fig, ax = plt.subplots()
ax.bar(quarters, prs_com, width, label="community")
ax.bar(quarters, prs_team, width, label="core team", bottom=prs_com)
ax.legend(loc="upper left")
ax.grid(True)
ax.set_xlabel("quarter")
ax.set_ylabel("Number of PRs closed")
ax.tick_params(axis="x", labelrotation=90)
fig.suptitle("Number of PRs closed per quarter")
fig.set_size_inches(12.8, 7.2) # default 100 dpi -> 720p
fig.savefig("prs-closed.png")

Is the world ready for svg yet?


print("Quarter,community closed,total closed")
for q in quarters:
    print("{},{},{}".format(q, cnt_com[q], cnt_all[q]))
44 changes: 44 additions & 0 deletions pr-metrics/pr-created.py
@@ -0,0 +1,44 @@
#!/usr/bin/env python3
# coding: utf-8

"""Produce graph of PRs created by time period."""

from prs import pr_dates, quarter, first, last

from collections import Counter

import matplotlib.pyplot as plt

first_q = quarter(first)
last_q = quarter(last)

cnt_all = Counter()
cnt_com = Counter()

for beg, end, com in pr_dates():
    q = quarter(beg)
    cnt_all[q] += 1
    if com:
        cnt_com[q] += 1

quarters = tuple(sorted(q for q in cnt_all if first_q <= q <= last_q))

prs_com = tuple(cnt_com[q] for q in quarters)
prs_team = tuple(cnt_all[q] - cnt_com[q] for q in quarters)

width = 0.9
fig, ax = plt.subplots()
ax.bar(quarters, prs_com, width, label="community")
ax.bar(quarters, prs_team, width, label="core team", bottom=prs_com)
ax.legend(loc="upper left")
ax.grid(True)
ax.set_xlabel("quarter")
ax.set_ylabel("Number of PRs created")
ax.tick_params(axis="x", labelrotation=90)
fig.suptitle("Number of PRs created per quarter")
fig.set_size_inches(12.8, 7.2) # default 100 dpi -> 720p
fig.savefig("prs-created.png")

print("Quarter,community created,total created")
for q in quarters:
    print("{},{},{}".format(q, cnt_com[q], cnt_all[q]))
80 changes: 80 additions & 0 deletions pr-metrics/pr-lifetime.py
@@ -0,0 +1,80 @@
#!/usr/bin/env python3
# coding: utf-8

"""Produce graph of lifetime of PRs over time."""

from prs import pr_dates, quarter, first, last

from collections import defaultdict

import matplotlib.pyplot as plt
from datetime import datetime, timedelta

first_q = quarter(first)
last_q = quarter(last)

lifetimes_all = defaultdict(list)
lifetimes_com = defaultdict(list)

for beg, end, com in pr_dates():
    # If the PR is still open and it's recent, assign an arbitrarily large
    # lifetime. (The exact value doesn't matter for computing the median, as
    # long as it's greater than the median - that is, as long as we've closed
    # at least half the PRs created that quarter. Otherwise the large value
    # will make that pretty visible.)
    if end is None:
        today = datetime.now().date()
        lt_so_far = (today - beg).days
        lt = max(365, lt_so_far)

Non-closed PR are considered N days old. The graph caps ages at C so on the graph, C is equivalent to >C. This is consistent only when N≥C, which is currently the case since N=C=365. It would be better for N and C to be the same global constant in the script.
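
A small sketch of what sharing that constant could look like. The names `MAX_LIFETIME_DAYS`, `lifetime()` and `y_limit()` are hypothetical and do not appear in the PR:

```python
from datetime import date

MAX_LIFETIME_DAYS = 365  # hypothetical name: used for both N and C


def lifetime(beg, end, today):
    """Lifetime in days; an open PR counts as at least MAX_LIFETIME_DAYS."""
    if end is None:
        return max(MAX_LIFETIME_DAYS, (today - beg).days)
    return (end - beg).days


def y_limit(top):
    """Upper bound of the y axis, capped with the same constant."""
    return min(MAX_LIFETIME_DAYS, top)


today = date(2023, 4, 6)
still_open = lifetime(date(2023, 3, 1), None, today)
closed = lifetime(date(2023, 3, 1), date(2023, 3, 11), today)
```

With one constant, an open PR can never be plotted below the cap by accident if the value is changed later.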

    else:
        lt = (end - beg).days

    # Shift one month (that is, for q2 count March to May, not April to June).
    # This is because we want to measure this at the end of each quarter, but
    # including PRs raised too recently skews the results. Shifting one month
    # means we had time to look at the PR by the time we generate quarterly
    # metrics.
    # Adding 30 days moves PRs created in March to May into q2, as described.
    q = quarter(beg + timedelta(days=30))
    lifetimes_all[q].append(lt)
    if com:
        lifetimes_com[q].append(lt)

quarters = tuple(sorted(q for q in lifetimes_all if first_q <= q <= last_q))

for q in quarters:
    lifetimes_all[q].sort()
    lifetimes_com[q].sort()


def median(sl):
    """Return the median value of a sorted list of numbers (0 if empty)."""

Why 0 if empty?
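
For comparison, the standard library's `statistics.median` refuses to invent a value for an empty input and raises `StatisticsError`. A sketch of a wrapper that makes the "no data" case explicit follows; the name `median_or_none` is hypothetical, and how a caller should plot `None` is left open:

```python
import statistics


def median_or_none(values):
    """Median of values, or None when there is nothing to measure."""
    try:
        return statistics.median(values)
    except statistics.StatisticsError:
        return None


empty = median_or_none([])
even = median_or_none([1, 3])
odd = median_or_none([5, 1, 3])  # the input does not need to be sorted
```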

    index = (len(sl) - 1) / 2
    if index < 0:
        return 0
    if int(index) == index:
        return sl[int(index)]

    i, j = int(index - 0.5), int(index + 0.5)
    return (sl[i] + sl[j]) / 2


med_all = tuple(median(lifetimes_all[q]) for q in quarters)
med_com = tuple(median(lifetimes_com[q]) for q in quarters)

fig, ax = plt.subplots()
ax.plot(quarters, med_all, "b-", label="median overall")
ax.plot(quarters, med_com, "r-", label="median community")
ax.legend(loc="upper right")
ax.grid(True)
ax.set_xlabel("quarter")
ax.set_ylabel("median lifetime in days of PRs created that quarter (shifted 1 month)")
ax.tick_params(axis="x", labelrotation=90)
bot, top = ax.set_ylim()
ax.set_ylim(0, min(365, top)) # we don't care about values over 1 year

Is there a reason why we don't care about values over a year? Several quarters (for example 16q1) exceed this value

fig.suptitle("Median lifetime of PRs per quarter (shifted 1 month) (less is better)")
fig.set_size_inches(12.8, 7.2) # default 100 dpi -> 720p
fig.savefig("prs-lifetime.png")

print("Quarter,median overall,median community")
for q, a, c in zip(quarters, med_all, med_com):
    print("{},{},{}".format(q, int(a), int(c)))
54 changes: 54 additions & 0 deletions pr-metrics/pr-pending.py
@@ -0,0 +1,54 @@
#!/usr/bin/env python3
# coding: utf-8

"""Produce graph of PRs pending over time."""

from prs import pr_dates, first, last

from datetime import datetime, timedelta

What's your opinion on flake8 vs pylint?

from collections import Counter

import matplotlib.pyplot as plt

cnt_tot = Counter()

So cnt_tot[date] is the number of open PR at date, right?

cnt_com = Counter()

for beg, end, com in pr_dates():
    if end is None:
        tomorrow = datetime.now().date() + timedelta(days=1)
        n_days = (tomorrow - beg).days
Comment on lines +18 to +19

Is this different from

n_days = (datetime.now().date() - beg).days + 1

?
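
A quick check with fixed sample dates suggests the two forms agree, since adding one day before or after the subtraction gives the same whole number of days. Here `today` stands in for `datetime.now().date()`:

```python
from datetime import date, timedelta

beg = date(2023, 4, 1)
today = date(2023, 4, 6)  # stands in for datetime.now().date()

tomorrow = today + timedelta(days=1)
n_days_pr = (tomorrow - beg).days        # form used in the PR
n_days_alt = (today - beg).days + 1      # form proposed in the review
```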

    else:
        n_days = (end - beg).days
    dates = Counter(beg + timedelta(days=i) for i in range(n_days))
    cnt_tot.update(dates)
    if com:
        cnt_com.update(dates)

dates = tuple(sorted(d for d in cnt_tot.keys() if first <= d <= last))


def avg(cnt, date):
    """Average number of open PRs over a week."""

Suggested change
"""Average number of open PRs over a week."""
"""Average number of open PRs over a week ending at date."""
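
A small worked example of why "ending at date" is the accurate wording: the window covers the given day and the six days before it. The parameter is renamed `day` here only to avoid shadowing the imported `date` class, and the counts are made up:

```python
from collections import Counter
from datetime import date, timedelta


def avg(cnt, day):
    """Average of cnt over the 7 days ending at day (inclusive)."""
    return sum(cnt[day - timedelta(days=i)] for i in range(7)) / 7


# Seven open PRs on each day from April 1 to April 7, none otherwise.
cnt = Counter({date(2023, 4, d): 7 for d in range(1, 8)})

full_week = avg(cnt, date(2023, 4, 7))   # window: April 1 to 7
one_late = avg(cnt, date(2023, 4, 8))    # window: April 2 to 8
first_day = avg(cnt, date(2023, 4, 1))   # window: March 26 to April 1
```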

    return sum(cnt[date - timedelta(days=i)] for i in range(7)) / 7


nb_tot = tuple(avg(cnt_tot, d) for d in dates)
nb_com = tuple(avg(cnt_com, d) for d in dates)
nb_team = tuple(tot - com for tot, com in zip(nb_tot, nb_com))

fig, ax = plt.subplots()
ax.plot(dates, nb_tot, "b-", label="total")
ax.plot(dates, nb_team, "c-", label="core team")
ax.plot(dates, nb_com, "r-", label="community")
ax.legend(loc="upper left")
ax.grid(True)
ax.set_xlabel("date")
ax.set_ylabel("number of open PRs (sliding average over a week)")
fig.suptitle("Number of PRs pending over time (less is better)")
fig.set_size_inches(12.8, 7.2) # default 100 dpi -> 720p
fig.savefig("prs-pending.png")

print("date,pending total, pending community")
for d in dates:
    tot, com = cnt_tot[d], cnt_com[d]
    print("{},{},{}".format(d, tot, com))