README.md: 16 additions & 11 deletions
@@ -1,5 +1,11 @@
-# PyMetrics
+<div align="center">
+<br/>
+<p align="center">
+<i>This repository is part of <a href="https://sdv.dev">The Synthetic Data Vault Project</a>, a project from <a href="https://datacebo.com">DataCebo</a>.</i>
+</p>
+<div align="left">
 
+# PyMetrics
 The PyMetrics project allows you to extract download metrics for Python libraries published on [PyPI](https://pypi.org/) and [Anaconda](https://www.anaconda.com/).
 
 The DataCebo team uses these scripts to report download counts for the libraries in the [SDV ecosystem](https://sdv.dev/) and other libraries.
@@ -13,8 +19,8 @@ engagement metrics.
 Currently, the download data is collected from the following distributions:
 * [PyPI](https://pypi.org/): Information about the project downloads from [PyPI](https://pypi.org/)
   obtained from the public BigQuery dataset, equivalent to the information shown on
-  [pepy.tech](https://pepy.tech) and [ClickPy](https://clickpy.clickhouse.com/)
-  - More information about the BigQuery dataset can be found on the [official PyPI documentation](https://packaging.python.org/en/latest/guides/analyzing-pypi-package-downloads/)
+  [pepy.tech](https://pepy.tech), [ClickPy](https://clickpy.clickhouse.com/) or [pypistats](https://pypistats.org/).
+  - More information about the BigQuery dataset can be found on the [official PyPI documentation](https://packaging.python.org/en/latest/guides/analyzing-pypi-package-downloads/).
 
 * [Anaconda](https://www.anaconda.com/): Information about conda package downloads for default and select Anaconda channels.
   - The conda package download data is provided by Anaconda, Inc. It includes package download counts
@@ -24,7 +30,6 @@ Currently, the download data is collected from the following distributions:
   - Replace `{username}` with the Anaconda channel (`conda-forge`)
   - Replace `{package_name}` with the specific package (`sdv`) in the Anaconda channel
   - For each file returned by the API endpoint, the current number of downloads is saved. Over time, a historical download recording can be built.
-  - Both of these sources were used to track Anaconda downloads because the package data for Anaconda does not match the download count on the website. This is due to missing download data. See: https://github.com/anaconda/anaconda-package-data/issues/45
 
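As a rough illustration of the snapshot approach described in this hunk, the sketch below turns one API response into dated rows and derives the downloads gained between two collection dates. The payload keys (`files`, `basename`, `version`, `ndownloads`), the function names, and the sample values are assumptions made for this example and are not confirmed by the README. No network call is made.

```python
from datetime import date


def snapshot_downloads(payload, collected_on):
    """Turn one (assumed) API payload into dated rows, one per file."""
    rows = []
    for file_info in payload.get("files", []):
        rows.append({
            "date": collected_on.isoformat(),
            "file": file_info["basename"],
            "version": file_info["version"],
            # Cumulative download count at collection time.
            "downloads": file_info["ndownloads"],
        })
    return rows


def downloads_between(history, start, end):
    """Downloads gained per file between two snapshot dates (ISO strings)."""
    totals = {}
    for row in history:
        totals.setdefault(row["file"], {})[row["date"]] = row["downloads"]
    return {
        name: by_date.get(end, 0) - by_date.get(start, 0)
        for name, by_date in totals.items()
    }


# Hypothetical payloads collected on two consecutive days.
day_one = {"files": [{"basename": "noarch/sdv-1.0.0.tar.bz2", "version": "1.0.0", "ndownloads": 100}]}
day_two = {"files": [{"basename": "noarch/sdv-1.0.0.tar.bz2", "version": "1.0.0", "ndownloads": 130}]}

history = []
history += snapshot_downloads(day_one, date(2024, 1, 1))
history += snapshot_downloads(day_two, date(2024, 1, 2))

gained = downloads_between(history, "2024-01-01", "2024-01-02")
```

Because each snapshot stores a cumulative count, the difference between two dates gives the downloads in that interval, which is how a history can be built from repeated daily collection.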
 ### Future Data Sources
 In the future, we may expand the source distributions to include:
@@ -33,31 +38,28 @@ In the future, we may expand the source distributions to include:
 ## Workflows
 
 ### Daily Collection
-On a daily basis, this workflow collects download data from PyPI and Anaconda. The data is then published to Google Drive in CSV format (`pypi.csv`). In addition, it computes metrics for the PyPI downloads (see below).
+On a daily basis, this workflow collects download data from PyPI and Anaconda. The data is then published in CSV format (`pypi.csv`). In addition, it computes metrics for the PyPI downloads (see below).
 
 #### Metrics
 The PyPI download metrics are computed along several dimensions:
 
 - **By Month**: The number of downloads per month.
 - **By Version**: The number of downloads per version of the software, as determined by the software maintainers.
 - **By Python Version**: The number of downloads per minor Python version (e.g. 3.8).
-- **By Full Python Version**: The number of downloads per full Python version (e.g. 3.9.1).
 - **And more!**
 
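A minimal sketch of how the per-dimension counts listed above could be computed, assuming one row per download. The column names (`timestamp`, `version`, `python_version`) and the sample rows are hypothetical; the actual `pypi.csv` schema may differ.

```python
from collections import Counter

# Hypothetical download rows; the real `pypi.csv` columns may differ.
rows = [
    {"timestamp": "2024-01-03", "version": "1.9.0", "python_version": "3.8.10"},
    {"timestamp": "2024-01-17", "version": "1.9.0", "python_version": "3.9.1"},
    {"timestamp": "2024-02-02", "version": "1.10.0", "python_version": "3.9.7"},
]


def minor_python_version(full_version):
    """Reduce a full version such as '3.9.1' to its minor form '3.9'."""
    return ".".join(full_version.split(".")[:2])


def count_by(rows, key_func):
    """Count downloads (one per row) grouped by the value of key_func."""
    return Counter(key_func(row) for row in rows)


# "YYYY-MM" prefix of the ISO date identifies the month.
by_month = count_by(rows, lambda row: row["timestamp"][:7])
by_version = count_by(rows, lambda row: row["version"])
by_python = count_by(rows, lambda row: minor_python_version(row["python_version"]))
```

Each dimension is just a different grouping key over the same rows, so further breakdowns can be added by passing another key function.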
 ### Daily Summarize
 
-On a daily basis, this workflow summarizes the PyPI download data from `pypi.csv` and calculates downloads for libraries.
-
-The summarized data is uploaded to a GitHub repo:
+On a daily basis, this workflow summarizes the PyPI download data from `pypi.csv` and calculates downloads for libraries. The summarized data is published to a GitHub repo:
 Installing the main SDV library also installs all the other libraries as dependencies. To calculate SDV downloads, we use an exclusive download methodology:
 
 1. Get download counts for `sdgym` and `sdv`.
-2. Adjust `sdv` downloads by subtracting `sdgym` downloads (since sdgym depends on sdv).
+2. Adjust `sdv` downloads by subtracting `sdgym` downloads (since `sdgym` depends on `sdv`).
 3. Get download counts for direct SDV dependencies: `rdt`, `copulas`, `ctgan`, `deepecho`, `sdmetrics`.
-4. Adjust downloads for each dependency by subtracting the `sdv` download count.
+4. Adjust downloads for each dependency by subtracting the `sdv` download count (since `sdv` depends directly on each of them).
 5. Ensure no download count goes negative using `max(0, adjusted_count)` for each library.
 
 This methodology prevents double-counting downloads while providing an accurate representation of SDV usage.
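The five steps of the exclusive download methodology can be sketched as follows. This is an illustration under stated assumptions, not the project's actual implementation: the function name and sample counts are hypothetical, and the sketch assumes step 4 subtracts the raw (unadjusted) `sdv` count, which the README does not specify.

```python
def exclusive_downloads(raw_counts):
    """Apply the five-step exclusive download adjustment to raw counts."""
    dependencies = ["rdt", "copulas", "ctgan", "deepecho", "sdmetrics"]

    # Step 1: raw download counts for sdgym and sdv.
    sdgym = raw_counts.get("sdgym", 0)
    sdv = raw_counts.get("sdv", 0)

    # Step 2: remove sdv downloads triggered by sdgym installs.
    # Step 5 (clamping at zero) is applied here as well.
    adjusted = {"sdgym": sdgym, "sdv": max(0, sdv - sdgym)}

    # Steps 3-5: each direct dependency loses the downloads triggered by
    # sdv installs (assumption: the raw sdv count), clamped at zero.
    for name in dependencies:
        adjusted[name] = max(0, raw_counts.get(name, 0) - sdv)

    return adjusted


# Hypothetical raw counts for one reporting period.
raw = {
    "sdgym": 100,
    "sdv": 1000,
    "rdt": 1500,
    "copulas": 1200,
    "ctgan": 900,
    "deepecho": 1000,
    "sdmetrics": 2000,
}
result = exclusive_downloads(raw)
```

With these sample counts, `ctgan` would come out negative (900 - 1000) without the clamp in step 5, so it is reported as 0 instead.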
@@ -72,6 +74,9 @@ For more information about the configuration, workflows, and metrics, see the re
 | :floppy_disk: | [COLLECTED DATA](docs/COLLECTED_DATA.md) | Explanation about the data that is being collected. |
 
 
+## Known Issues
+1. The conda package download data for Anaconda does not match the download count shown on the website because some download data is missing from the dataset. See: https://github.com/anaconda/anaconda-package-data/issues/45