Skip to content

Conversation

@aayushkdev
Copy link
Collaborator

@aayushkdev aayushkdev commented Jun 13, 2025

fix #1687

Summary/Goals of this change
This PR introduces an parent_directory_path field to the CodebaseResource model to optimize fetching children of a directory. The existing .children() method which was intended to fetch children uses regex filtering on path, which is slow for large codebases.
Additionally, this PR also ensures that top-level paths are stored during the resource creation step which is essential for rendering the root level nodes in the file tree

Breaking Changes
The file tree will rely on the newly introduced ancestor field and the presence of top-level paths. As a result, existing projects scanned before this PR will need to be re-scanned to work with the file tree view.

Open Questions
Should we deprecate or remove the following methods now that ancestor handles relationship resolution more efficiently?

  • parent_path()
  • has_parent()
  • parent()

Would love feedback on this!

Copy link
Contributor

@tdruez tdruez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR introduces an ancestor field to the CodebaseResource model to optimize fetching children of a directory. The existing .children() method which was intended to fetch children uses regex filtering on path, which is slow for large codebases.
Additionally, this PR also ensures that top-level paths are stored during the resource creation step which is essential for rendering the root level nodes in the file tree

Could you explain and document, with data and examples, how storing the parent_path on the resource, would be used in the rendering context?
No need to add code, just as discussion in the PR as a start would be useful.


Should we deprecate or remove the following methods now that ancestor handles relationship resolution more efficiently?

Let's wait a bit to ensure we have a working system before deprecating existing code ;)


See my comment about the code.
It is missing unit test to ensure the new field is properly set (with regular path, and with root).
Also a unit test to cover the changes made to rootfs.py will be appreciated.

@aayushkdev
Copy link
Collaborator Author

Use of parent_directory_path in the Rendering Context

  • Fetching children of a directory is now a simple filter on the parent_directory_path field, which is indexed and fast. where previously the .children() was used method that used regex based filtering which was slow and inefficient

  • Adding the parent_directory_path field also makes determining whether a directory has children easier as this can now be handled via a single annotated query that counts children per resource. This eliminates the need for per-directory .exists() checks, which gets very expensive with hundreds or thousands of folders.

  • Top-level resources (those with parent_directory_path=None) can be queried directly and efficiently, making the initial rendering of the tree straightforward and performant.

Step by step flow

README.md
src/
  └── main.py
  └── utils/
       └── helpers.py
data/
  └── input.txt
bin/

1. Initial Load:

  • The codebase fetches the top-level directories/files using:
CodebaseResource.objects.filter(project=project, parent_directory_path=None)
  • This fetches: README.md, src/, data/, and bin/, as they have their parent_directory_path field set to None.

2. Determining Expandability

  • Another query will be fired using parent_directory_path=None for fetching and annotating an extra field has_children.
  • This will indicate whether a directory can be expanded in the UI.
  • In this case:
  • bin/ and README.md have no children => has_children = False
  • src/ and data/ have children => has_children = True

3. Expanding a Folder (e.g., src/):

  • Clicking on the src/ folder sends a request with:
CodebaseResource.objects.filter(project=project, parent_directory_path="src")
  • This fetches its children: main.py and utils/, as their parent is src/ .

4. Determining Children of Fetched Items:

  • Like before another query is fired to check if these items have children.
  • In this case:
  • main.py => has_children = False
  • utils/ => has_children = True

@aayushkdev aayushkdev requested a review from tdruez June 14, 2025 19:05
tdruez added a commit that referenced this pull request Jun 16, 2025
In preparation of adding parent_path as a field #1691

Signed-off-by: tdruez <[email protected]>
tdruez added a commit that referenced this pull request Jun 16, 2025
In preparation of adding parent_path as a field #1691

Signed-off-by: tdruez <[email protected]>
@tdruez
Copy link
Contributor

tdruez commented Jun 16, 2025

@aayushkdev Thanks for all the implementation details!

I've renamed the old parent_path function to parent_directory in #1694 so it's not in the way of the new field.
Could you merge those into your branch and define parent_directory_path as parent_path before the merge?

@aayushkdev
Copy link
Collaborator Author

Hey @tdruez I have merged your changes and renamed the required fields.

Copy link
Contributor

@tdruez tdruez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aayushkdev Looking good, see my suggestion for improvements.

@tdruez
Copy link
Contributor

tdruez commented Jun 17, 2025

Breaking Changes
The file tree will rely on the newly introduced ancestor field and the presence of top-level paths. As a result, existing projects scanned before this PR will need to be re-scanned to work with the file tree view.

A data migration could solve this. The question is how fast could we migrate the data on existing databases that include millions of CodebaseResource records.

@aayushkdev
Copy link
Collaborator Author

Yes, iterating through all the resources in CodebaseResource to add the parent_path value to each resource would be quite slow, especially with multiple projects. If you'd like, I can prepare a migration script to test this out and see how it affects performance.

@tdruez
Copy link
Contributor

tdruez commented Jun 17, 2025

I don't think a "script" is the right approach here. We need to look into a raw SQL approach for acceptable performances.

@aayushkdev
Copy link
Collaborator Author

Yeah, I think this will be doable. I’ll explore this further and test it out.

@aayushkdev
Copy link
Collaborator Author

aayushkdev commented Jun 20, 2025

Hey @tdruez, I’ve got the migration for setting parent_path working. However, I could use your input on the migration for inserting the missing root directories. I initially assumed I could just insert the path, project_id, and parent_path fields only which would be doable through sql, but it turns out there are other non-nullable fields like sha1 and md5 that typically require a scan to populate which isnt possible with pure sql. One workaround could be to use dummy values for these fields. What do you think?

@tdruez
Copy link
Contributor

tdruez commented Jun 20, 2025

I initially assumed I could just insert the path, project_id, and parent_path fields only which would be doable through sql

Why would you have to insert those field? We are talking about generating a value for parent_path on existing CodebaseResource. Those existing records already have a path and a project_id.

I’ve got the migration for setting parent_path working.

Can you paste your progress on this migration in a comment, it'll be easier to provide you feedback.

@aayushkdev
Copy link
Collaborator Author

aayushkdev commented Jun 20, 2025

Why would you have to insert those field? We are talking about generating a value for parent_path on existing CodebaseResource. Those existing records already have a path and a project_id.

This migration has two parts first, adding the parent_path field to each resource, which I’ve already completed and second, inserting new resources with top-level root paths into CodebaseResource for each project. This is necessary because pipelines like map_deploy_to_develop don’t store root-level directories like from and to in the CodebaseResource, so we need to add that with the migration.

Can you paste your progress on this migration in a comment, it'll be easier to provide you feedback.

from django.db import migrations

class Migration(migrations.Migration):

    dependencies = [
        ('scanpipe', '0073_codebaseresource_parent_path'),
    ]

    operations = [
        migrations.RunSQL(
            """
            UPDATE scanpipe_codebaseresource
            SET parent_path = regexp_replace(path, '/[^/]+$', '')
            WHERE path LIKE '%/%';
            """
        ),
    ]

@aayushkdev
Copy link
Collaborator Author

Hey @tdruez @pombredanne , I explored different ways to store trees (like MPTT, Treebeard and ltree), but they all involve more complex integration to scancode.io, introduce even more additional fields, or require us to change the way Resources are created which slows down creation of resources, all while offering a minimal performance benefit over the parent_path field approach. Introducing the parent_path field provides a more lightweight and efficient solution that integrates cleanly with our existing path logic.

I also tested the parent_path field approach on deeply nested trees, and performance remains strong even with thousands of resources.

So I don't think its necessary to change our approach on how we handle the storage of trees.

VarshaUN pushed a commit to VarshaUN/scancode.io that referenced this pull request Jul 1, 2025
@aayushkdev aayushkdev force-pushed the 1687-codebaseresource-parent-and-top-paths branch from 7928f3c to 2b5b2f7 Compare July 12, 2025 16:59
@tdruez
Copy link
Contributor

tdruez commented Jul 14, 2025

@aayushkdev There's an issue when running the scan_single_package, as the make_codebase_resource logic is not called, as the data is loaded from the scancode-toolkit results.
We end up with the codebase value for root resource parent_path, and the codebase/ prefix for all other resource. Which then breaks the tree UI.

The scanpipe.pipes.scancode.create_codebase_resources needs to be adapted to provide the proper value for parent_path. Also make sure to add a unit test to cover this case.

@aayushkdev aayushkdev force-pushed the 1687-codebaseresource-parent-and-top-paths branch from b4a75ff to 05ee44d Compare July 14, 2025 19:04
@aayushkdev aayushkdev requested a review from tdruez July 15, 2025 09:39
@aayushkdev aayushkdev force-pushed the 1687-codebaseresource-parent-and-top-paths branch from 605e900 to 6fde4a6 Compare July 15, 2025 10:07
@aayushkdev
Copy link
Collaborator Author

@tdruez I have done the requested changes

@tdruez
Copy link
Contributor

tdruez commented Jul 16, 2025

@aayushkdev Make sure to fix the failing tests first ;)

Signed-off-by: Aayush Kumar <[email protected]>
Signed-off-by: Aayush Kumar <[email protected]>
Signed-off-by: Aayush Kumar <[email protected]>
Signed-off-by: Aayush Kumar <[email protected]>
Signed-off-by: Aayush Kumar <[email protected]>
…one to align with the code format

Signed-off-by: Aayush Kumar <[email protected]>
Signed-off-by: Aayush Kumar <[email protected]>
Signed-off-by: Aayush Kumar <[email protected]>
@aayushkdev aayushkdev force-pushed the 1687-codebaseresource-parent-and-top-paths branch from 58ac499 to 5b218df Compare July 21, 2025 18:44
Signed-off-by: Aayush Kumar <[email protected]>
@aayushkdev
Copy link
Collaborator Author

hey @tdruez there is a code formatting issue in scancode.py where the create_codebase_resources function is too complex and I am not really sure how to handle this should I split the function into helper functions?

@tdruez
Copy link
Contributor

tdruez commented Jul 23, 2025

the create_codebase_resources function is too complex and I am not really sure how to handle this should I split the function into helper functions?

Refactoring the complexity is a possibility, but in this case the code can be simplified. For example:

if field.name == "path":
    continue
if field.name == "parent_path":
    continue

Can be simplified as

if field.name in ["path", "parent_path"]:
    continue

Also, the main loop could be refactored as:

def create_codebase_resource(project, scanned_resource):
    """Create a CodebaseResource entry from ScanCode scanned data."""
    resource_data = {}
    [...]


def create_codebase_resources(project, scanned_codebase):
    """
    Save the resources of a ScanCode `scanned_codebase` scancode.resource.Codebase
    object to the database as a CodebaseResource of the `project`.
    This function can be used to expend an existing `project` Codebase with new
    CodebaseResource objects as the existing objects (based on the `path`) will be
    skipped.
    """
    for scanned_resource in scanned_codebase.walk(skip_root=True):
        create_codebase_resource(project, scanned_resource)

@tdruez tdruez merged commit 42be421 into aboutcode-org:main Jul 25, 2025
10 checks passed
tdruez added a commit that referenced this pull request Jul 25, 2025
tdruez added a commit that referenced this pull request Jul 25, 2025
Signed-off-by: tdruez <[email protected]>
tdruez added a commit that referenced this pull request Jul 25, 2025
* Leverage the parent_path field when available #1691

Signed-off-by: tdruez <[email protected]>

* Fix failing tests #1691

Signed-off-by: tdruez <[email protected]>

---------

Signed-off-by: tdruez <[email protected]>
@aayushkdev aayushkdev deleted the 1687-codebaseresource-parent-and-top-paths branch September 8, 2025 17:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

Add support for tracking parent of CodebaseResource entries and ensure top level paths are stored

2 participants