-
Notifications
You must be signed in to change notification settings - Fork 518
feat: Add the main file for determine package reachability level of Python project #2131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: python-reach
Are you sure you want to change the base?
Conversation
|
can you add some description for this PR? |
cuixq
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also let's move this to experimental/pythonreach folder and add a README file.
| @@ -0,0 +1,415 @@ | |||
| # This file is automatically @generated by Poetry 2.1.3 and should not be changed by hand. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see this example project is being referenced anywhere in test?
| Dependencies []string // Direct dependencies declared in library's metadata | ||
| } | ||
|
|
||
| // Constants for terminal output formatting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment should be removed?
|
|
||
| // 6. Comparison between the collected imported libraries and the PYPI dependencies of the libraries | ||
| // to find the reachability of the PYPI dependencies. | ||
| for _, library := range importedLibraries { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's make a struct for the output so that we can test the reachability result properly as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/google/osv-scalibr/blob/main/enricher/reachability/java/java.go can be a reference about what the results should look like.
|
|
||
| for _, tc := range testCases { | ||
| t.Run(tc.name, func(t *testing.T) { | ||
| ctx := context.Background() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you may use t.Context() for context in testing
| } | ||
| } | ||
|
|
||
| // To compare slices, we need a canonical order. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume the main entry point will be a single string so we probably don't need to sort it.
| response, _ := reg.GetIndex(ctx, libraryInfo.Name) | ||
| downloadURL := "" | ||
| fileName := strings.ToLower(fmt.Sprintf(`%s-%s.tar.gz`, libraryInfo.Name, libraryInfo.Version)) | ||
| for _, file := range response.Files { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you may use pypi.SdistVersion to check what is the version of the file name, see https://github.com/google/osv-scalibr/blob/629500d6fb6a1392afe40e0c9327c48a787358d8/clients/resolution/pypi_registry_client.go#L95
| } | ||
| } | ||
|
|
||
| downloadFileSource, err := downloadPackageSource(downloadURL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for the link!
| @@ -0,0 +1 @@ | |||
| #!/bin/bash No newline at end of file | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there any particular reason that we want to keep this shell script?
| @@ -0,0 +1,3 @@ | |||
| print('hello') | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add newline to end of these test files if that won't bring any trouble
|
we probably need |
|
This pull request has not had any activity for 60 days and will be automatically closed in two weeks |
|
Automatically closing stale pull request |
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This PR introduces a new tool for analyzing Python dependency reachability, which is a great feature. The implementation follows a logical flow: finding entry points, parsing manifests, analyzing imports, and then diving into library source code. However, the current implementation has several critical and high-severity issues that need to be addressed. The tool downloads and extracts package source code into the current working directory, which is unsafe. There are cases where errors are ignored, which can lead to panics. The logic for identifying manifest files and parsing Python code has some flaws that will cause the tool to fail on valid projects or produce incorrect results. I've left specific comments on these points. Addressing these issues will significantly improve the tool's reliability and usability.
| } else if slices.Contains(unsupportedManifests, fileName) { | ||
| return nil, fmt.Errorf("unsupported manifest file found: %s", fileName) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic for finding manifest files will incorrectly fail for most Poetry projects. pyproject.toml is a standard file in projects using Poetry and will almost always be present alongside poetry.lock. The current implementation considers pyproject.toml as an "unsupported" manifest and returns an error, which will prevent the tool from running on valid projects. You should remove pyproject.toml from the unsupportedManifests list. A better approach would be to ignore files that are not in supportedManifests instead of erroring out.
| } else if slices.Contains(unsupportedManifests, fileName) { | |
| return nil, fmt.Errorf("unsupported manifest file found: %s", fileName) | |
| } | |
| } else if slices.Contains(unsupportedManifests, fileName) { | |
| return nil, fmt.Errorf("unsupported manifest file found: %s", fileName) | |
| } |
| // collect dependencies of the imported library. | ||
| func retrieveSourceAndCollectDependencies(ctx context.Context, libraryInfo *LibraryInfo) error { | ||
| reg := datasource.NewPyPIRegistryAPIClient("") | ||
| response, _ := reg.GetIndex(ctx, libraryInfo.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error returned from reg.GetIndex(ctx, libraryInfo.Name) is being ignored. If this API call fails (e.g., due to network issues or the package not being found), response will be nil, and the subsequent code will cause a nil pointer dereference when accessing response.Files. You must handle this error.
response, err := reg.GetIndex(ctx, libraryInfo.Name)
if err != nil {
return fmt.Errorf("failed to get PyPI index for %s: %w", libraryInfo.Name, err)
}| importRegex = regexp.MustCompile(`^\s*import\s+([a-zA-Z0-9_.]+)(?:\s+as\s+([a-zA-Z0-9_]+))?`) | ||
| fromImportRegex = regexp.MustCompile(`^\s*from\s+([a-zA-Z0-9_.]+)\s+import\s+(.+)`) | ||
| importItemRegex = regexp.MustCompile(`([a-zA-Z0-9_.*]+)(?:\s+as\s+([a-zA-Z0-9_]+))?`) | ||
| memberImportRegex = regexp.MustCompile(`import (\w+)\.(\w+)`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The memberImportRegex is not anchored with ^, which means it can match an import statement anywhere in a line, including inside a string literal or comment (e.g., print("import foo.bar")). This will lead to incorrect parsing of imported libraries.
Furthermore, its placement in the if-else if chain in findImportedLibraries makes it unlikely to be reached for valid import statements, as the broader importRegex will match them first.
To fix this, you should anchor the regex and reconsider its purpose or placement in the parsing logic.
| memberImportRegex = regexp.MustCompile(`import (\w+)\.(\w+)`) | |
| memberImportRegex = regexp.MustCompile(`^\s*import (\w+)\.(\w+)`) |
| // downloadPackageSource downloads the source code of a package from PyPI. | ||
| func downloadPackageSource(downloadLink string) (string, error) { | ||
| filename := filepath.Base(downloadLink) | ||
| tempFile, err := os.CreateTemp(".", filename) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function downloadPackageSource downloads files to the current working directory by using os.CreateTemp(".", filename). This is risky as it can pollute the user's project directory. It's strongly recommended to download files into a dedicated temporary directory created using os.MkdirTemp to avoid side effects.
| if err != nil { | ||
| return fmt.Errorf("failed to read tar header: %w", err) | ||
| } | ||
| targetPath := filepath.Join(".", header.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function extractCompressedPackageSource extracts the package source code into the current working directory using filepath.Join(".", header.Name). This is dangerous as it can pollute the user's project directory and potentially overwrite existing files with the same names. It's strongly recommended to extract the contents into a dedicated temporary directory created using os.MkdirTemp.
| // TODO: Find alternative ways for these regexes. | ||
| mainEntryRegex = regexp.MustCompile(`^\s*if\s+__name__\s*==\s*['"]__main__['"]\s*:`) | ||
| importRegex = regexp.MustCompile(`^\s*import\s+([a-zA-Z0-9_.]+)(?:\s+as\s+([a-zA-Z0-9_]+))?`) | ||
| fromImportRegex = regexp.MustCompile(`^\s*from\s+([a-zA-Z0-9_.]+)\s+import\s+(.+)`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regular expression for from ... import ... statements does not handle multi-line imports that use parentheses, such as:
from my_library import (
module1,
module2,
)The current regex ^\s*from\s+([a-zA-Z0-9_.]+)\s+import\s+(.+) will only capture the first line. To handle this, you would need a more complex parsing logic that can read across multiple lines when an opening parenthesis is detected.
| if err != nil { | ||
| return err | ||
| } | ||
| if d.IsDir() && strings.Contains(d.Name(), folderName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The findFolder function uses strings.Contains(d.Name(), folderName) to find the library's source directory. This can be inaccurate. For example, if searching for foo-1.0, it could incorrectly match a directory named my-foo-1.0-extra. It would be more reliable to check if the directory name is the folder name, or at least starts with it, e.g., d.Name() == folderName or strings.HasPrefix(d.Name(), folderName). Since extracted tarballs usually create a directory like packagename-version, checking for a prefix would be safer.
|
|
||
| return scanFile(file, func(line string) error { | ||
| for _, module := range libraryInfo.Modules { | ||
| searchTerm := fmt.Sprintf("def %s(", module.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method for finding where an imported item is defined is based on a simple string search: searchTerm := fmt.Sprintf("def %s(", module.Name). This is not very robust and can lead to both false positives and false negatives.
- False positives: It can match function names in comments or strings.
- False negatives: It won't match functions with different spacing (e.g.,
def my_func(...)), functions defined in classes (methods), or other imported symbols like classes or variables.
Consider using regular expressions or, for a more robust solution, a Python AST parser to accurately locate definitions.
| } | ||
|
|
||
| for _, module := range library.Modules { | ||
| if module.SourceDefinedPaths == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic here assumes that if the source definition path for a module is not found (module.SourceDefinedPaths == nil), all of the library's dependencies are reachable. This is a risky assumption and can lead to incorrect reachability analysis. It would be better to report the reachability as "unknown" or "undetermined" in this case, rather than defaulting to "reachable". This makes the tool's output more accurate about its limitations.
| }, | ||
| { | ||
| name: "Malformed poetry.lock - Parser error", | ||
| fpathInTestDir: "./testdata/tmultifileswithentrypoint/poetry.lock", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be a typo in the file path for this test case. The directory is named multifileswithentrypoint, but the path used is ./testdata/tmultifileswithentrypoint/poetry.lock (with an extra 't'). This will cause the test to pass because os.Open will fail to find the file, which matches expectError: true, but it's not testing what's intended (parsing a malformed file).
| fpathInTestDir: "./testdata/tmultifileswithentrypoint/poetry.lock", | |
| fpathInTestDir: "./testdata/multifileswithentrypoint/poetry.lock", |
The new feature helps determine the reachability of imported Python libraries in a Python project. This is part of the under developing project for imported Python libraries' reachability.
This PR includes the support for imported Python libraries defined in poetry.lock file.