| 1 | +===================================================== |
| 2 | +Adding Ability to Store and Query Downloaded Packages |
| 3 | +===================================================== |
| 4 | + |
| 5 | +**Organization:** `AboutCode <https://aboutcode.org>`__ |
| 6 | + |
| 7 | +**Project:** `ScanCode.io <https://github.com/aboutcode-org/scancode.io>`__ |
| 8 | + |
| 9 | +| **Contributor:** Varsha U N |
| 10 | +| **GitHub:** `VarshaUN <https://github.com/VarshaUN>`__ |
| 11 | +| **LinkedIn:** `Varsha U N <https://www.linkedin.com/in/varsha-un/>`__ |
| 12 | + |
| 13 | +| **Mentors:** |
| 14 | +| `Philippe Ombredanne <https://github.com/pombredanne>`__ |
| 15 | +| `Ayan Sinha Mahapatra <https://github.com/AyanSinhaMahapatra>`__ |
| 16 | + |
| 17 | +Overview |
| 18 | +-------- |
| 19 | + |
| 20 | +ScanCode.io currently stores scanned packages on disk without a centralized index, |
| 21 | +which leads to duplicate storage, data siloed within individual projects, and |
| 22 | +potential data loss when inputs are deleted. This project adds structured package |
| 23 | +storage and querying to ScanCode.io, enabling indexing, reuse across projects, and |
| 24 | +reliable preservation. |
| 25 | + |
| 26 | +Implementation |
| 27 | +-------------- |
| 28 | + |
| 29 | +.. figure:: /_static/gsoc2025/scancodeio_varsha/project_flow.png |
| 30 | +   :alt: Project Flow Diagram |
| 31 | +   :align: center |
| 32 | +   :width: 70% |
| 33 | + |
| 34 | +This project addresses the limitations of ScanCode.io's unstructured package |
| 35 | +storage by adding a system to index, reuse, and preserve packages reliably. |
| 36 | + |
| 37 | +Storage System Development: |
| 38 | + |
| 39 | +- Created a ``DownloadStore`` abstract base class in ``archiving.py`` to |
| 40 | +  define the interface for managing package content and metadata |
| 41 | +  storage. |
| 42 | + |
| 43 | +- Built the ``LocalFilesystemProvider`` class to store downloads on the |
| 44 | +  local filesystem, using a SHA256-based nested directory structure. |
| 45 | + |
| 46 | +- Implemented methods for storing (``put``), retrieving (``get``), listing |
| 47 | +  (``list``), and searching (``find``) downloads, with metadata saved in |
| 48 | +  ``origin-<hash>.json`` files. |
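
The storage design described in the list above can be illustrated with a minimal sketch. It is not the code from ``archiving.py``: the method signatures, the ``content`` file name, the two-level directory split, and the way the origin hash is derived are all assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List


class DownloadStore(ABC):
    """Hypothetical interface; the real signatures in archiving.py may differ."""

    @abstractmethod
    def put(self, content: bytes, download_url: str, download_date: str, filename: str) -> str:
        """Store content with its origin metadata and return the content SHA256."""

    @abstractmethod
    def get(self, sha256: str) -> bytes:
        """Return the stored content for a SHA256 digest."""

    @abstractmethod
    def find(self, download_url: str) -> List[str]:
        """Return the digests of downloads recorded for a URL."""

    @abstractmethod
    def list(self) -> List[str]:
        """Return the digests of all stored downloads."""


class LocalFilesystemProvider(DownloadStore):
    def __init__(self, root):
        self.root = Path(root)

    def _dir_for(self, sha256: str) -> Path:
        # Assumed nesting: <root>/<chars 0-1>/<chars 2-3>/<remaining chars>
        return self.root / sha256[:2] / sha256[2:4] / sha256[4:]

    def put(self, content, download_url, download_date, filename):
        sha256 = hashlib.sha256(content).hexdigest()
        directory = self._dir_for(sha256)
        directory.mkdir(parents=True, exist_ok=True)
        content_path = directory / "content"  # assumed file name
        if not content_path.exists():  # identical bytes are written only once
            content_path.write_bytes(content)
        origin = {
            "download_url": download_url,
            "download_date": download_date,
            "filename": filename,
        }
        # Assumption: the origin file name is a hash of the URL and filename.
        origin_hash = hashlib.sha256(f"{download_url}|{filename}".encode()).hexdigest()
        (directory / f"origin-{origin_hash}.json").write_text(json.dumps(origin))
        return sha256

    def get(self, sha256):
        return (self._dir_for(sha256) / "content").read_bytes()

    def _origins(self):
        # Yield (content digest, origin metadata) for every origin file.
        for path in self.root.glob("*/*/*/origin-*.json"):
            digest = "".join(path.parent.relative_to(self.root).parts)
            yield digest, json.loads(path.read_text())

    def find(self, download_url):
        return sorted({d for d, o in self._origins() if o["download_url"] == download_url})

    def list(self):
        return sorted({d for d, _ in self._origins()})


# Usage: store one download in a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    store = LocalFilesystemProvider(tmp)
    digest = store.put(b"example", "https://example.com/pkg.tar.gz", "2025-01-01", "pkg.tar.gz")
    print(digest, store.get(digest) == b"example")
```

Keying the directory on the content hash means the same bytes downloaded from two URLs occupy one content file with two origin records, which is one way to get the deduplication and cross-project reuse the project aims for.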
| 49 | + |
| 50 | +Integration with ScanCode.io: |
| 51 | + |
| 52 | +- Updated ``pipelines/__init__.py`` to incorporate the archiving system into |
| 53 | +  ScanCode.io’s pipeline workflow, ensuring downloaded packages are |
| 54 | +  stored during execution. |
| 55 | + |
| 56 | +- Revised ``input.py`` to process package download inputs, passing |
| 57 | +  content, ``download_url``, ``download_date``, and ``filename`` to the |
| 58 | +  archiving system. |
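
As a rough sketch of the hand-off described above, the snippet below forwards a fetched input and its origin metadata to a store. The ``Download`` dataclass, the ``RecordingStore`` stand-in, and the ``archive_download`` helper are hypothetical names for illustration only; they are not ScanCode.io's actual types or functions, and the ``put`` keyword arguments simply mirror the four fields named in the list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Download:
    """Hypothetical stand-in for a fetched input; not ScanCode.io's actual type."""

    content: bytes
    download_url: str
    filename: str


class RecordingStore:
    """Minimal in-memory stand-in that only records calls to put()."""

    def __init__(self):
        self.calls = []

    def put(self, content, download_url, download_date, filename):
        self.calls.append(
            {
                "content": content,
                "download_url": download_url,
                "download_date": download_date,
                "filename": filename,
            }
        )
        return len(self.calls)


def archive_download(download, store, now=None):
    """Forward a download and its origin metadata to the archiving store."""
    # Assumption: the download date is recorded as an ISO 8601 UTC timestamp.
    download_date = (now or datetime.now(timezone.utc)).isoformat()
    return store.put(
        content=download.content,
        download_url=download.download_url,
        download_date=download_date,
        filename=download.filename,
    )


store = RecordingStore()
archive_download(Download(b"data", "https://example.com/pkg.tar.gz", "pkg.tar.gz"), store)
print(store.calls[0]["download_url"], store.calls[0]["filename"])
```

Passing the store in as an argument keeps the input-handling code independent of where packages end up, so a different backend could be swapped in without touching the call site.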
| 59 | + |
| 60 | +User Interface Enhancements: |
| 61 | + |
| 62 | +- Modified the project resource view to display stored package |
| 63 | +  information, including download URLs and dates. |
| 64 | + |
| 65 | +Validation and Testing: |
| 66 | + |
| 67 | +- Wrote unit tests in ``test_archiving.py`` to verify |
| 68 | +  ``LocalFilesystemProvider`` functionality (``put``, ``get``, ``list``, |
| 69 | +  ``find``), testing normal cases, edge cases (e.g., empty files), and |
| 70 | +  errors (e.g., duplicate origins). |
| 71 | + |
| 72 | +Linked Pull Requests |
| 73 | +-------------------- |
| 74 | + |
| 75 | +.. list-table:: |
| 76 | +   :widths: 10 40 20 |
| 77 | +   :header-rows: 1 |
| 78 | + |
| 79 | +   * - Sr. No |
| 80 | +     - Name |
| 81 | +     - Link |
| 82 | +   * - 1 |
| 83 | +     - Add download archiving system |
| 84 | +     - `scancode.io#1815 <https://github.com/aboutcode-org/scancode.io/pull/1815>`__ |
| 85 | +   * - 2 |
| 86 | +     - Support local package storage |
| 87 | +     - `scancode.io#1685 <https://github.com/aboutcode-org/scancode.io/pull/1685>`__ |
| 88 | + |
| 89 | +Related Issues |
| 90 | +-------------- |
| 91 | + |
| 92 | +.. list-table:: |
| 93 | +   :widths: 10 40 20 |
| 94 | +   :header-rows: 1 |
| 95 | + |
| 96 | +   * - Sr. No |
| 97 | +     - Name |
| 98 | +     - Link |
| 99 | +   * - 1 |
| 100 | +     - Store and retrieve scanned packages |
| 101 | +     - `#1063 <https://github.com/aboutcode-org/scancode.io/issues/1063>`__ |
| 102 | +   * - 2 |
| 103 | +     - Support local package storage |
| 104 | +     - `#1683 <https://github.com/aboutcode-org/scancode.io/issues/1683>`__ |
| 105 | + |
| 106 | +Pre-GSoC Work |
| 107 | +------------- |
| 108 | + |
| 109 | +Here are some PRs submitted before GSoC: |
| 110 | + |
| 111 | +- `Add bluefin-container image support <https://github.com/aboutcode-org/scancode.io/pull/1620>`__ |
| 112 | +- `Tag whitedout files <https://github.com/aboutcode-org/scancode.io/pull/1529>`__ |
| 113 | +- `Support python-private-classifier <https://github.com/aboutcode-org/scancode-toolkit/pull/4075>`__ |
| 114 | +- `Parse labels in Dockerfile <https://github.com/aboutcode-org/scancode-toolkit/pull/3987>`__ |
| 115 | +- `Add OCI labels to Dockerfile <https://github.com/aboutcode-org/scancode-toolkit/pull/3987>`__ |
| 116 | +- `Extract LibreOffice documents <https://github.com/aboutcode-org/extractcode/pull/67>`__ |
| 117 | + |
| 118 | +Links |
| 119 | +----- |
| 120 | + |
| 121 | +- **Project Idea:** `GSoC 2025 Idea <https://github.com/aboutcode-org/aboutcode/wiki/GSOC-2025-project-ideas#scancodeio-add-ability-to-storequery-downloaded-packages>`__ |
| 122 | +- **GSoC Project Page:** `GSoC 2025 <https://summerofcode.withgoogle.com/programs/2025/projects/x7sA6uN6>`__ |
| 123 | +- **Proposal:** `Project Proposal <https://docs.google.com/document/d/1LfTGfatLfg9RB-OyLhlS4_h0-Tc9Q8QU1ObsCVDV_sM/edit?usp=sharing>`__ |
| 124 | + |
| 125 | +Future Work |
| 126 | +----------- |
| 127 | + |
| 128 | +Future enhancements include a web UI for the ``LocalFilesystemProvider`` that |
| 129 | +supports package uploads, searches, listings, and retrievals in ScanCode.io, built |
| 130 | +with Django views, templates, and URL routes and backed by comprehensive tests. In |
| 131 | +addition, an external cloud storage option (e.g., AWS S3) would extend the |
| 132 | +``DownloadStore`` interface alongside the local filesystem provider, offering |
| 133 | +scalable, remote storage. |
| 134 | + |
| 135 | +Closing Note |
| 136 | +------------ |
| 137 | + |
| 138 | +During GSoC 2025, my mentors and I held weekly meetings to discuss progress, |
| 139 | +challenges, and next steps. I am deeply grateful to my mentors for their guidance |
| 140 | +and support, which greatly enriched my learning experience. |