Commit 8e79402

Add support to mine pypi Package-URLs #662 (#677)
* Add support to mine pypi packageURLs
  Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
* Refactor and fix pypi miners
  Reference: #662
  Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
* Fix code style issues
  Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
---------
Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
1 parent dba43e6 commit 8e79402

15 files changed, 586 additions and 0 deletions
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+name: Build minecode-pipeline Python distributions and publish on PyPI
+
+on:
+  workflow_dispatch:
+  push:
+    tags:
+      - "minecode-pipeline/*"
+
+jobs:
+  build-and-publish:
+    name: Build and publish library to PyPI
+    runs-on: ubuntu-22.04
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: 3.11
+
+      - name: Install flot
+        run: python -m pip install flot --user
+
+      - name: Build binary wheel and source tarball
+        run: python -m flot --pyproject pyproject-minecode_pipeline.toml --sdist --wheel --output-dir dist/
+
+      - name: Publish to PyPI
+        if: startsWith(github.ref, 'refs/tags')
+        uses: pypa/gh-action-pypi-publish@release/v1
+        with:
+          password: ${{ secrets.PYPI_API_TOKEN_MINECODE_PIPELINE }}
+
+      - name: Upload built archives
+        uses: actions/upload-artifact@v4
+        with:
+          name: pypi_archives
+          path: dist/*
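
The workflow has two gates: the `minecode-pipeline/*` tag filter decides which tag pushes start a run, and the `startsWith(github.ref, 'refs/tags')` condition decides whether the publish step runs (a manual `workflow_dispatch` run from a branch would build but skip publishing). The sketch below approximates both checks in Python for illustration only. It is not how GitHub evaluates workflows; `tag_matches` and `would_publish` are hypothetical names, and the pattern translation handles only `*`, assuming it matches any run of characters except `/`.

```python
import re

TAG_PATTERN = "minecode-pipeline/*"


def tag_matches(tag, pattern=TAG_PATTERN):
    """Approximate the tag filter: '*' matches any characters except '/'."""
    regex = "".join("[^/]*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.fullmatch(regex, tag) is not None


def would_publish(ref):
    """Mirror the step condition startsWith(github.ref, 'refs/tags')."""
    return ref.startswith("refs/tags")


for tag in ("minecode-pipeline/v0.1.0", "v0.1.0", "minecode-pipeline/a/b"):
    print(tag, tag_matches(tag))

for ref in ("refs/tags/minecode-pipeline/v0.1.0", "refs/heads/main"):
    print(ref, would_publish(ref))
```

Under these assumptions, only tags with a single path segment after `minecode-pipeline/` start the workflow on push, and only runs whose ref is a tag reach the publish step.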

Makefile

Lines changed: 3 additions & 0 deletions
@@ -139,6 +139,9 @@ test_matchcode:
 
 test: test_purldb test_matchcode test_toolkit test_clearcode
 
+test_minecode:
+	${ACTIVATE} ${PYTHON_EXE} -m pytest -vvs minecode_pipeline
+
 shell:
 	${MANAGE} shell
 
minecode_pipeline/README.rst

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+minecode-pipeline
+===================
+
+minecode-pipeline is an add-on library working with scancode.io to define pipelines to mine
+packageURLs and package metadata from ecosystem repositories and APIs.
+
+Installation
+------------
+
+Requirements
+############
+
+* install minecode-pipeline dependencies
+* `pip install minecode-pipeline`
+
+
+Funding
+-------
+
+This project was funded through the NGI Assure Fund https://nlnet.nl/assure, a
+fund established by NLnet https://nlnet.nl/ with financial support from the
+European Commission's Next Generation Internet programme, under the aegis of DG
+Communications Networks, Content and Technology under grant agreement No 957073.
+
+This project is also funded through grants from the Google Summer of Code
+program, continuing support and sponsoring from nexB Inc. and generous
+donations from multiple sponsors.
+
+
+License
+-------
+
+Copyright (c) nexB Inc. and others. All rights reserved.
+
+purldb is a trademark of nexB Inc.
+
+SPDX-License-Identifier: Apache-2.0
+
+minecode-pipeline is licensed under the Apache License version 2.0.
+
+See https://www.apache.org/licenses/LICENSE-2.0 for the license text.
+See https://github.com/aboutcode-org/purldb for support or download.
+See https://aboutcode.org for more information about nexB OSS projects.

minecode_pipeline/__init__.py

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
#
2+
# Copyright (c) nexB Inc. and others. All rights reserved.
3+
# purldb is a trademark of nexB Inc.
4+
# SPDX-License-Identifier: Apache-2.0
5+
# See http://www.apache.org/licenses/LICENSE-2.0 for the license text.
6+
# See https://github.com/aboutcode-org/purldb for support or download.
7+
# See https://aboutcode.org for more information about nexB OSS projects.
8+
#
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+#
+# Copyright (c) nexB Inc. and others. All rights reserved.
+# purldb is a trademark of nexB Inc.
+# SPDX-License-Identifier: Apache-2.0
+# See http://www.apache.org/licenses/LICENSE-2.0 for the license text.
+# See https://github.com/aboutcode-org/purldb for support or download.
+# See https://aboutcode.org for more information about nexB OSS projects.
+#

minecode_pipeline/miners/pypi.py

Lines changed: 86 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,86 @@
1+
#
2+
# Copyright (c) nexB Inc. and others. All rights reserved.
3+
# purldb is a trademark of nexB Inc.
4+
# SPDX-License-Identifier: Apache-2.0
5+
# See http://www.apache.org/licenses/LICENSE-2.0 for the license text.
6+
# See https://github.com/aboutcode-org/purldb for support or download.
7+
# See https://aboutcode.org for more information about nexB OSS projects.
8+
#
9+
10+
11+
import json
12+
import requests
13+
14+
from packageurl import PackageURL
15+
16+
from minecode_pipeline.utils import get_temp_file
17+
18+
"""
19+
Visitors for Pypi and Pypi-like Python package repositories.
20+
21+
We have this hierarchy in Pypi simple/ index:
22+
pypi projects (JSON/HTML) -> project versions (JSON/HTML) -> download urls
23+
24+
https://pypi.org/simple/
25+
Pypi serves a main index via JSON/HTML API that contains a list of package names
26+
and some info on when a package was updated by releasing a new version.
27+
See https://docs.pypi.org/api/index-api/ for more details.
28+
This index also has a list of versions and download URLs of all
29+
uploaded/available package archives and some basic metadata.
30+
31+
https://pypi.org/pypi/{name}/json
32+
For each package, a JSON contains details including the list of all releases
33+
and archives, their URLs, and some metadata for each release.
34+
For each release, a JSON contains details for the released version and all the
35+
downloads available for this release.
36+
"""
37+
38+
39+
pypi_json_headers = {"Accept": "application/vnd.pypi.simple.v1+json"}
40+
41+
42+
PYPI_REPO = "https://pypi.org/simple/"
43+
PYPI_TYPE = "pypi"
44+
45+
46+
def get_pypi_packages(pypi_repo, logger=None):
47+
response = requests.get(pypi_repo, headers=pypi_json_headers)
48+
if not response.ok:
49+
return
50+
51+
packages = response.json()
52+
temp_file = get_temp_file("PypiPackagesJSON")
53+
with open(temp_file, "w", encoding="utf-8") as f:
54+
json.dump(packages, f, indent=4)
55+
56+
return temp_file
57+
58+
59+
def get_pypi_packageurls(name):
60+
packageurls = []
61+
62+
project_index_api_url = PYPI_REPO + name
63+
response = requests.get(project_index_api_url, headers=pypi_json_headers)
64+
if not response.ok:
65+
return packageurls
66+
67+
project_data = response.json()
68+
for version in project_data.get("versions"):
69+
purl = PackageURL(
70+
type=PYPI_TYPE,
71+
name=name,
72+
version=version,
73+
)
74+
packageurls.append(purl.to_string())
75+
76+
return packageurls
77+
78+
79+
def load_pypi_packages(packages):
80+
with open(packages) as f:
81+
packages_data = json.load(f)
82+
83+
last_serial = packages_data.get("meta").get("_last-serial")
84+
packages = packages_data.get("projects")
85+
86+
return last_serial, packages
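
The two response shapes that `load_pypi_packages` and `get_pypi_packageurls` read can be illustrated offline. The sketch below is not part of the commit. The sample payloads are made up, `split_index` and `purls_from_project_data` are hypothetical names, and the PURL string is assembled by hand (lowercasing the name and replacing underscores with dashes, as the purl spec describes for the pypi type) instead of through `PackageURL`, so it only approximates what `purl.to_string()` returns. It also falls back to empty values when a key is missing, which the committed functions do not do.

```python
import json
from urllib.parse import quote

# Made-up sample of the simple index listing (not real PyPI data).
SAMPLE_INDEX = json.loads("""
{
  "meta": {"_last-serial": 12345, "api-version": "1.1"},
  "projects": [
    {"name": "example-pkg", "_last-serial": 12000},
    {"name": "Other_Pkg", "_last-serial": 12345}
  ]
}
""")

# Made-up sample of a single project page from the same API.
SAMPLE_PROJECT = json.loads("""
{
  "meta": {"api-version": "1.1"},
  "name": "other-pkg",
  "versions": ["1.0", "2.0.1"],
  "files": []
}
""")


def split_index(index_data):
    """Return (last_serial, projects), tolerating missing keys."""
    meta = index_data.get("meta") or {}
    return meta.get("_last-serial"), index_data.get("projects") or []


def purls_from_project_data(name, project_data):
    """Build one pypi PURL string per listed version (hand-rolled stand-in)."""
    normalized = quote(name.lower().replace("_", "-"), safe="")
    versions = project_data.get("versions") or []
    return [
        f"pkg:pypi/{normalized}@{quote(version, safe='')}" for version in versions
    ]


last_serial, projects = split_index(SAMPLE_INDEX)
purls = purls_from_project_data("Other_Pkg", SAMPLE_PROJECT)
print(last_serial, [project["name"] for project in projects])
print(purls)
```

Keeping the parsing separate from the HTTP call, as done here, also makes it possible to test against saved responses without network access.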
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+#
+# Copyright (c) nexB Inc. and others. All rights reserved.
+# purldb is a trademark of nexB Inc.
+# SPDX-License-Identifier: Apache-2.0
+# See http://www.apache.org/licenses/LICENSE-2.0 for the license text.
+# See https://github.com/aboutcode-org/purldb for support or download.
+# See https://aboutcode.org for more information about nexB OSS projects.
+#
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+# SPDX-License-Identifier: Apache-2.0
+#
+# http://nexb.com and https://github.com/aboutcode-org/scancode.io
+# The ScanCode.io software is licensed under the Apache License version 2.0.
+# Data generated with ScanCode.io is provided as-is without warranties.
+# ScanCode is a trademark of nexB Inc.
+#
+# You may not use this software except in compliance with the License.
+# You may obtain a copy of the License at: http://apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software distributed
+# under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
+# CONDITIONS OF ANY KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations under the License.
+#
+# Data Generated with ScanCode.io is provided on an "AS IS" BASIS, WITHOUT WARRANTIES
+# OR CONDITIONS OF ANY KIND, either express or implied. No content created from
+# ScanCode.io should be considered or used as legal advice. Consult an Attorney
+# for any legal advice.
+#
+# ScanCode.io is a free software code scanning tool from nexB Inc. and others.
+# Visit https://github.com/aboutcode-org/scancode.io for support and download.
+
+from scanpipe.pipelines import Pipeline
+from scanpipe.pipes import federatedcode
+
+from minecode_pipeline.pipes import pypi
+
+
+class MineandPublishPypiPURLs(Pipeline):
+    """
+    Mine all packageURLs from a pypi index and publish them to
+    a FederatedCode repo.
+    """
+
+    @classmethod
+    def steps(cls):
+        return (
+            cls.check_federatedcode_eligibility,
+            cls.mine_pypi_packages,
+            cls.mine_and_publish_pypi_packageurls,
+        )
+
+    def check_federatedcode_eligibility(self):
+        """
+        Check if the project fulfills the following criteria for
+        pushing the project result to FederatedCode.
+        """
+        federatedcode.check_federatedcode_configured_and_available()
+
+    def mine_pypi_packages(self):
+        """Mine pypi package names from pypi indexes."""
+        self.pypi_packages = pypi.mine_pypi_packages(logger=self.log)
+
+    def mine_and_publish_pypi_packageurls(self):
+        """Get pypi packageURLs for all mined pypi package names."""
+        pypi.mine_and_publish_pypi_packageurls(packages=self.pypi_packages, logger=self.log)
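
The pipeline declares its steps as an ordered tuple of methods and passes state from one step to the next through instance attributes (`self.pypi_packages`). The sketch below imitates that pattern without scanpipe so it can run on its own. `MiniPipeline`, `DemoMiner`, and every method name here are hypothetical; the real `scanpipe.pipelines.Pipeline` does considerably more (logging, run tracking, error handling), so treat this as an illustration of the step ordering only.

```python
class MiniPipeline:
    """Hypothetical stand-in for scanpipe's Pipeline: runs steps in order."""

    @classmethod
    def steps(cls):
        raise NotImplementedError

    def log(self, message):
        self.messages.append(message)

    def execute(self):
        self.messages = []
        for step in self.steps():
            # Steps are plain functions looked up on the class, so the
            # instance is passed explicitly.
            self.log(f"running {step.__name__}")
            step(self)
        return self.messages


class DemoMiner(MiniPipeline):
    @classmethod
    def steps(cls):
        return (
            cls.check_configured,
            cls.mine_names,
            cls.publish_purls,
        )

    def check_configured(self):
        # Placeholder for an eligibility check that would raise on failure.
        self.configured = True

    def mine_names(self):
        # Placeholder data; the real step reads the PyPI index.
        self.names = ["example-pkg", "other-pkg"]

    def publish_purls(self):
        # Uses the attribute set by the previous step.
        self.published = [f"pkg:pypi/{name}" for name in self.names]


pipeline = DemoMiner()
messages = pipeline.execute()
print(messages)
print(pipeline.published)
```

Because later steps depend on attributes set by earlier ones, the order returned by `steps()` is significant: swapping the last two entries in this sketch would raise an `AttributeError`.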
Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
#
2+
# Copyright (c) nexB Inc. and others. All rights reserved.
3+
# purldb is a trademark of nexB Inc.
4+
# SPDX-License-Identifier: Apache-2.0
5+
# See http://www.apache.org/licenses/LICENSE-2.0 for the license text.
6+
# See https://github.com/aboutcode-org/purldb for support or download.
7+
# See https://aboutcode.org for more information about nexB OSS projects.
8+
#
