Skip to content

Conversation

@p1gc0rn
Copy link
Collaborator

@p1gc0rn p1gc0rn commented Aug 1, 2025

The new feature helps determine the reachability of imported Python libraries in a Python project. This is part of the under developing project for imported Python libraries' reachability.

This PR includes the support for imported Python libraries defined in poetry.lock file.

@p1gc0rn p1gc0rn requested a review from cuixq August 1, 2025 01:18
@p1gc0rn p1gc0rn changed the title Add the main file for package reachability level #feat Add the main file for package reachability level Aug 1, 2025
@p1gc0rn p1gc0rn changed the title #feat Add the main file for package reachability level feat: Add the main file for package reachability level Aug 1, 2025
@cuixq
Copy link
Contributor

cuixq commented Aug 1, 2025

can you add some description for this PR?

Copy link
Contributor

@cuixq cuixq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also let's move this to experimental/pythonreach folder and add a README file.

@p1gc0rn p1gc0rn changed the title feat: Add the main file for package reachability level feat: Add the main file for determine package reachability level of Python project Aug 17, 2025
@@ -0,0 +1,415 @@
# This file is automatically @generated by Poetry 2.1.3 and should not be changed by hand.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see this example project is being referenced anywhere in test?

Dependencies []string // Direct dependencies declared in library's metadata
}

// Constants for terminal output formatting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment should be removed?


// 6. Comparison between the collected imported libraries and the PYPI dependencies of the libraries
// to find the reachability of the PYPI dependencies.
for _, library := range importedLibraries {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make a struct for the output so that we can test the reachability result properly as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/google/osv-scalibr/blob/main/enricher/reachability/java/java.go can be a reference about what the results should look like.


for _, tc := range testCases {
t.Run(tc.name, func(t *testing.T) {
ctx := context.Background()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you may use t.Context() for context in testing

}
}

// To compare slices, we need a canonical order.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the main entry point will be a single string so we probably don't need to sort it.

response, _ := reg.GetIndex(ctx, libraryInfo.Name)
downloadURL := ""
fileName := strings.ToLower(fmt.Sprintf(`%s-%s.tar.gz`, libraryInfo.Name, libraryInfo.Version))
for _, file := range response.Files {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}
}

downloadFileSource, err := downloadPackageSource(downloadURL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for the link!

@@ -0,0 +1 @@
#!/bin/bash No newline at end of file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any particular reason that we want to keep this shell script?

@@ -0,0 +1,3 @@
print('hello')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add newline to end of these test files if that won't bring any trouble

@cuixq
Copy link
Contributor

cuixq commented Sep 24, 2025

we probably need go.mod as well

@github-actions
Copy link

This pull request has not had any activity for 60 days and will be automatically closed in two weeks

@github-actions github-actions bot added the stale The issue or PR is stale and pending automated closure label Dec 13, 2025
@github-actions
Copy link

Automatically closing stale pull request

@github-actions github-actions bot added the autoclosed Closed by automation label Dec 27, 2025
@github-actions github-actions bot closed this Dec 27, 2025
@cuixq cuixq removed autoclosed Closed by automation stale The issue or PR is stale and pending automated closure labels Dec 31, 2025
@cuixq cuixq reopened this Dec 31, 2025
@cuixq
Copy link
Contributor

cuixq commented Jan 19, 2026

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This PR introduces a new tool for analyzing Python dependency reachability, which is a great feature. The implementation follows a logical flow: finding entry points, parsing manifests, analyzing imports, and then diving into library source code. However, the current implementation has several critical and high-severity issues that need to be addressed. The tool downloads and extracts package source code into the current working directory, which is unsafe. There are cases where errors are ignored, which can lead to panics. The logic for identifying manifest files and parsing Python code has some flaws that will cause the tool to fail on valid projects or produce incorrect results. I've left specific comments on these points. Addressing these issues will significantly improve the tool's reliability and usability.

Comment on lines +200 to +202
} else if slices.Contains(unsupportedManifests, fileName) {
return nil, fmt.Errorf("unsupported manifest file found: %s", fileName)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The logic for finding manifest files will incorrectly fail for most Poetry projects. pyproject.toml is a standard file in projects using Poetry and will almost always be present alongside poetry.lock. The current implementation considers pyproject.toml as an "unsupported" manifest and returns an error, which will prevent the tool from running on valid projects. You should remove pyproject.toml from the unsupportedManifests list. A better approach would be to ignore files that are not in supportedManifests instead of erroring out.

Suggested change
} else if slices.Contains(unsupportedManifests, fileName) {
return nil, fmt.Errorf("unsupported manifest file found: %s", fileName)
}
} else if slices.Contains(unsupportedManifests, fileName) {
return nil, fmt.Errorf("unsupported manifest file found: %s", fileName)
}

// collect dependencies of the imported library.
func retrieveSourceAndCollectDependencies(ctx context.Context, libraryInfo *LibraryInfo) error {
reg := datasource.NewPyPIRegistryAPIClient("")
response, _ := reg.GetIndex(ctx, libraryInfo.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The error returned from reg.GetIndex(ctx, libraryInfo.Name) is being ignored. If this API call fails (e.g., due to network issues or the package not being found), response will be nil, and the subsequent code will cause a nil pointer dereference when accessing response.Files. You must handle this error.

response, err := reg.GetIndex(ctx, libraryInfo.Name)
	if err != nil {
		return fmt.Errorf("failed to get PyPI index for %s: %w", libraryInfo.Name, err)
	}

importRegex = regexp.MustCompile(`^\s*import\s+([a-zA-Z0-9_.]+)(?:\s+as\s+([a-zA-Z0-9_]+))?`)
fromImportRegex = regexp.MustCompile(`^\s*from\s+([a-zA-Z0-9_.]+)\s+import\s+(.+)`)
importItemRegex = regexp.MustCompile(`([a-zA-Z0-9_.*]+)(?:\s+as\s+([a-zA-Z0-9_]+))?`)
memberImportRegex = regexp.MustCompile(`import (\w+)\.(\w+)`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The memberImportRegex is not anchored with ^, which means it can match an import statement anywhere in a line, including inside a string literal or comment (e.g., print("import foo.bar")). This will lead to incorrect parsing of imported libraries.

Furthermore, its placement in the if-else if chain in findImportedLibraries makes it unlikely to be reached for valid import statements, as the broader importRegex will match them first.

To fix this, you should anchor the regex and reconsider its purpose or placement in the parsing logic.

Suggested change
memberImportRegex = regexp.MustCompile(`import (\w+)\.(\w+)`)
memberImportRegex = regexp.MustCompile(`^\s*import (\w+)\.(\w+)`)

// downloadPackageSource downloads the source code of a package from PyPI.
func downloadPackageSource(downloadLink string) (string, error) {
filename := filepath.Base(downloadLink)
tempFile, err := os.CreateTemp(".", filename)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The function downloadPackageSource downloads files to the current working directory by using os.CreateTemp(".", filename). This is risky as it can pollute the user's project directory. It's strongly recommended to download files into a dedicated temporary directory created using os.MkdirTemp to avoid side effects.

if err != nil {
return fmt.Errorf("failed to read tar header: %w", err)
}
targetPath := filepath.Join(".", header.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The function extractCompressedPackageSource extracts the package source code into the current working directory using filepath.Join(".", header.Name). This is dangerous as it can pollute the user's project directory and potentially overwrite existing files with the same names. It's strongly recommended to extract the contents into a dedicated temporary directory created using os.MkdirTemp.

// TODO: Find alternative ways for these regexes.
mainEntryRegex = regexp.MustCompile(`^\s*if\s+__name__\s*==\s*['"]__main__['"]\s*:`)
importRegex = regexp.MustCompile(`^\s*import\s+([a-zA-Z0-9_.]+)(?:\s+as\s+([a-zA-Z0-9_]+))?`)
fromImportRegex = regexp.MustCompile(`^\s*from\s+([a-zA-Z0-9_.]+)\s+import\s+(.+)`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The regular expression for from ... import ... statements does not handle multi-line imports that use parentheses, such as:

from my_library import (
    module1,
    module2,
)

The current regex ^\s*from\s+([a-zA-Z0-9_.]+)\s+import\s+(.+) will only capture the first line. To handle this, you would need a more complex parsing logic that can read across multiple lines when an opening parenthesis is detected.

if err != nil {
return err
}
if d.IsDir() && strings.Contains(d.Name(), folderName) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The findFolder function uses strings.Contains(d.Name(), folderName) to find the library's source directory. This can be inaccurate. For example, if searching for foo-1.0, it could incorrectly match a directory named my-foo-1.0-extra. It would be more reliable to check if the directory name is the folder name, or at least starts with it, e.g., d.Name() == folderName or strings.HasPrefix(d.Name(), folderName). Since extracted tarballs usually create a directory like packagename-version, checking for a prefix would be safer.


return scanFile(file, func(line string) error {
for _, module := range libraryInfo.Modules {
searchTerm := fmt.Sprintf("def %s(", module.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The method for finding where an imported item is defined is based on a simple string search: searchTerm := fmt.Sprintf("def %s(", module.Name). This is not very robust and can lead to both false positives and false negatives.

  • False positives: It can match function names in comments or strings.
  • False negatives: It won't match functions with different spacing (e.g., def my_func(...)), functions defined in classes (methods), or other imported symbols like classes or variables.
    Consider using regular expressions or, for a more robust solution, a Python AST parser to accurately locate definitions.

}

for _, module := range library.Modules {
if module.SourceDefinedPaths == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic here assumes that if the source definition path for a module is not found (module.SourceDefinedPaths == nil), all of the library's dependencies are reachable. This is a risky assumption and can lead to incorrect reachability analysis. It would be better to report the reachability as "unknown" or "undetermined" in this case, rather than defaulting to "reachable". This makes the tool's output more accurate about its limitations.

},
{
name: "Malformed poetry.lock - Parser error",
fpathInTestDir: "./testdata/tmultifileswithentrypoint/poetry.lock",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There seems to be a typo in the file path for this test case. The directory is named multifileswithentrypoint, but the path used is ./testdata/tmultifileswithentrypoint/poetry.lock (with an extra 't'). This will cause the test to pass because os.Open will fail to find the file, which matches expectError: true, but it's not testing what's intended (parsing a malformed file).

Suggested change
fpathInTestDir: "./testdata/tmultifileswithentrypoint/poetry.lock",
fpathInTestDir: "./testdata/multifileswithentrypoint/poetry.lock",

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants