-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Migrate JobFilesAPIController to FastAPI (excluding TUS uploads)
#20235
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
|
Pulsar uses these endpoints, so before merging this, it is critical that it is passes all tests from test/integration/test_job_files.py and test_job_files_tus.py ( |
|
Locally, @jmchilton I could use some help from you (I see that you wrote test/integration/test_job_files_tus.py). Do you think the failures are related to the changes from this PR? Do you have any clue of what's failing before I look further into it? I would be quite grateful if you could have a look after the CI finishes running the tests. |
|
|
||
| @router.post( | ||
| "/api/jobs/{job_id}/files", | ||
| summary="Populate an output file.", |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
user-provided value
| responses={ | ||
| 200: {"description": "An okay message.", "content": {"application/json": {"example": {"message": "ok"}}}}, | ||
| }, | ||
| ) |
Check failure
Code scanning / CodeQL
Polynomial regular expression used on uncontrolled data High
regular expression
user-provided value
| if os.path.exists(path) and (path.endswith("tool_stdout") or path.endswith("tool_stderr")): | ||
| with open(path, "ab") as destination: | ||
| shutil.copyfileobj(open(input_file.name, "rb"), destination) | ||
| if os.path.exists(path) and (path.endswith("tool_stdout") or path.endswith("tool_stderr")): |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
user-provided value
| with open(path, "ab") as destination: | ||
| shutil.copyfileobj(open(input_file.name, "rb"), destination) | ||
| if os.path.exists(path) and (path.endswith("tool_stdout") or path.endswith("tool_stderr")): | ||
| with open(path, "ab") as destination: |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
user-provided value
| if os.path.exists(path) and (path.endswith("tool_stdout") or path.endswith("tool_stderr")): | ||
| with open(path, "ab") as destination: | ||
| if input_file_path: | ||
| with open(input_file_path, "rb") as input_file_handle: |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
user-provided value
This path depends on a
user-provided value
| # (https://docs.python.org/3/library/tempfile.html#tempfile.SpooledTemporaryFile), so now there is not even | ||
| # a path where uploaded files can be accessed on disk | ||
| if input_file_path: | ||
| shutil.move(input_file_path, path) |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
user-provided value
This path depends on a
user-provided value
| # (https://docs.python.org/3/library/tempfile.html#tempfile.SpooledTemporaryFile), so now there is not even | ||
| # a path where uploaded files can be accessed on disk | ||
| if input_file_path: | ||
| shutil.move(input_file_path, path) |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
user-provided value
| # tempfile has moved and Python wants to delete it. | ||
| pass | ||
| return {"message": "ok"} | ||
| with open(path, "wb") as destination: |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
user-provided value
|
About the CodeQL issues, although there is One solution is to add an exception. Another to delay the merge to see if there is some time remaining to fix them after the ARC integration has been worked through. |
1a8b10b to
d17002d
Compare
|
|
||
| job = self.__authorize_job_access(trans, job_id, path=path, job_key=job_key) | ||
|
|
||
| if not os.path.exists(path): |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression
| if os.path.exists(path) and (path.endswith("tool_stdout") or path.endswith("tool_stderr")): | ||
| with open(path, "ab") as destination: | ||
| shutil.copyfileobj(open(input_file.name, "rb"), destination) | ||
| if os.path.exists(path) and (path.endswith("tool_stdout") or path.endswith("tool_stderr")): |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression
| with open(path, "ab") as destination: | ||
| shutil.copyfileobj(open(input_file.name, "rb"), destination) | ||
| if os.path.exists(path) and (path.endswith("tool_stdout") or path.endswith("tool_stderr")): | ||
| with open(path, "ab") as destination: |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression
| if os.path.exists(path) and (path.endswith("tool_stdout") or path.endswith("tool_stderr")): | ||
| with open(path, "ab") as destination: | ||
| if input_file_path: | ||
| with open(input_file_path, "rb") as input_file_handle: |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression
| # (https://docs.python.org/3/library/tempfile.html#tempfile.SpooledTemporaryFile), so now there is not even | ||
| # a path where uploaded files can be accessed on disk | ||
| if input_file_path: | ||
| shutil.move(input_file_path, path) |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression
| # (https://docs.python.org/3/library/tempfile.html#tempfile.SpooledTemporaryFile), so now there is not even | ||
| # a path where uploaded files can be accessed on disk | ||
| if input_file_path: | ||
| shutil.move(input_file_path, path) |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression
| See more discussion of checking upload access, but we shouldn't need the | ||
| API key and session stuff the user upload tusd server should be configured with. | ||
| with open(path, "wb") as destination: |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression
It was because the endpoint |
|
@maikenp FYI |
`FastAPIJobFiles` is the new, FastAPI version of `JobFilesAPIController`. The endpoints that have been migrated should exhibit exactly the same behavior as the old ones from `FastAPIJobFiles`. Something to keep in mind is that while FastAPI has some extra built-in features that the legacy WSGI system did not have, such as answering HEAD requests, those do not work because of the way legacy WSGI endpoints are injected into the FastAPI app (using `app.mount("/", wsgi_handler)`), meaning that for example, HEAD requests are passed to the `wsgi_handler` sub-application.
Endpoints dedicated to TUS uploads work in tandem with the WSGI middleware `TusMiddleware` from the `tuswsgi` package. As explained above, WSGI middlewares and endpoints are injected into the FastAPI app after FastAPI routes as a single sub-application `wsgi_handler` using `app.mount("/", wsgi_handler)`, meaning that requests are passed to the `wsgi_handler` sub-application (and thus to `TusMiddleware`) only if there was no FastAPI endpoint defined to handle them. Therefore, they cannot be migrated to FastAPI unless `TusMiddleware` is also migrated to ASGI.
Work around a bug in FastAPI (fastapi/fastapi#13175) that assigns the same operation id to both request methods GET and HEAD of the endpoint `/api/jobs/{job_id}/files` when using the `@router.api_route()` decorator with `methods=["GET", "HEAD"]` as keyword argument.
…T requests to `/api/jobs/{job_id}/files`
Pulsar formats the `path` and `job_key` parameters as query parameters when submitting POST requests to `/api/jobs/{job_id}/files`. However, many Galaxy tests format them as form parameters. The only way to keep the endpoint working as it should (as it worked before the migration to FastAPI) is to accept both query and form parameters.
… requests to `/api/jobs/{job_id}/files`
FastAPI will not use the parameter aliases of form parameters in the OpenAPI docs, but the name of their Python variables. Therefore, the API docs show `path_form` and `job_key_form`. Rename them so that the API docs show the correct parameter names.
071d2d5 to
7a11519
Compare
Change docstrings of `job_key_query_or_form()` and `path_query_or_form()` to reference the correct HTTP error code (400).
There was a small merge conflict with 7f89e0e, make sure changes are compliant with said commit.
mvdbeek
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here's the (incomplete) review i never submitted, maybe you can take a look at this and I'll come back to this next week
|
|
||
| :rtype: binary | ||
| :returns: contents of file | ||
| # FastAPI answers HEAD requests automatically for GET endpoints. However, because of the way legacy WSGI endpoints |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where's that info coming from ? This has been discussed but was never implemented (last attempt fastapi/fastapi#13509).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you return FileResponse, then it's true fastapi/fastapi#1773 (comment). I incorrectly assumed it was generally true, because I just tried it with this endpoint.
I changed the comments so that they are precise 8def3a7. Thanks for letting me know about this.
| @router.get( | ||
| # simplify me (remove `_args` and `_kwargs` defined using the walrus operator) when ALL endpoints have been | ||
| # migrated to FastAPI, this is a workaround for FastAPI bug https://github.com/fastapi/fastapi/issues/13175 | ||
| *(_args := ["/api/jobs/{job_id}/files"]), | ||
| **( | ||
| _kwargs := dict( | ||
| summary="Get a file required to staging a job.", | ||
| responses={ | ||
| 200: { | ||
| "description": "Contents of file.", | ||
| "content": {"application/json": None, "application/octet-stream": {"example": None}}, | ||
| }, | ||
| 400: { | ||
| "description": ( | ||
| "File not found, path does not refer to a file, or input dataset(s) for job have been purged." | ||
| ) | ||
| }, | ||
| }, | ||
| ) | ||
| ), | ||
| ) | ||
| @router.head(*_args, **_kwargs) # type: ignore[name-defined] | ||
| # remove `@router.head(...)` when ALL endpoints have been migrated to FastAPI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you follow the ordinary way of setting up fastapi endpoints, i.e. no argument unpacking and I don't think the explicit responses and summary keyword are needed either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the result of adding the keywords.
...
I don't think there harm in neatly documenting the endpoint, especially because the cause of HTTP 400 errors is hard to guess.
Regarding argument unpacking, refer to this answer.
| router = Router( | ||
| # keep the endpoint in the undocumented section of the API docs `/api/docs`, as all endpoints from `FastAPIJobFiles` | ||
| # are certainly not meant to represent part of Galaxy's stable, user facing API | ||
| tags=["undocumented"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's meant for the stuff that hasn't been ported. There's no harm documenting these and making them available for client libraries as long as these are marked as internal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assigned "jobs" tag (c5c7d34).
| # | ||
| @router.get( | ||
| # simplify me (remove `_args` and `_kwargs` defined using the walrus operator) when ALL endpoints have been | ||
| # migrated to FastAPI, this is a workaround for FastAPI bug https://github.com/fastapi/fastapi/issues/13175 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's maybe a little excessive, you can also pass in an explicit operation_id if you did this to avoid the warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point of having both decorators @router.get() and @router.head() is indeed avoiding the warning.
If I have only one decorator @router.api_route(), then passing an explicit operation_id does not solve anything (both GET and HEAD get the same operation id).
If I use both decorators, then I never have to pass an explicit operation_id, because FastAPI will generate two different ones for me.
Regarding argument unpacking, the idea is that the same method index() has to be reused both for GET and HEAD. I am forced to use the two-decorator approach (to avoid the warning) and by definition, both the GET and HEAD endpoints must be defined in exactly the same way. I use argument unpacking to avoid repeating the arguments of the first decorator in the second decorator.
I realize however, that including both GET and HEAD in the API schema is redundant. Thus, I added include_in_schema=False to the @router.head() decorator (bed78be).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't think my comment is addressed. Please prefer a bit of repetitive code over disabling type checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I set include_in_schema=False, @router.head() can now skip the kwargs and there would then be little repetition (only the path). Seems ok if it's just that what's repeated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in b53fa35.
| This method does not force the client to provide the parameter, it could simply not submit the parameter in either | ||
| format. To force the client to provide the parameter, coerce the output of the method to a string, i.e. | ||
| `job_key: str = Depends(job_key_query_or_form)` so that FastAPI responds with status code 500 when the parameter is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should be a validation error, not an internal server error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct, fixed in 62d3499·
Thank you, complete or not, that's some steam to get the engine running. |
Including both GET and HEAD in the API schema for the same endpoint is redundant.
…ally It is only true for endpoints returning `FileResponse`, see fastapi/fastapi#1773 (comment).
49ac9c3 to
8def3a7
Compare
Group job files API endpoints together with jobs API endpoints.
I addressed the comments, let's continue next week. Thanks again for moving things forward. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a pretty important and performance sensitive endpoint. Have you tested all of the variants? Can you run a performance comparison (for speed and memory consumption) between the old and new endpoints ?
Thanks for coming back to this again.
Regarding test coverage I think we are safe. I added a couple tests to cover the code paths that were originally not covered. Luckily, besides the tests from test_job_files.py, there also exist tests that caught a nasty problem that would've been overlooked (see this comment).
You're right this is important. It can be the the next step to get this merged 👍. There's also this potential performance issue that's not simple to solve at all. |
| with open(path, "wb") as destination: | ||
| shutil.copyfileobj(cast(IO[bytes], input_file), destination) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the potential performance mentioned in this comment. For readability I just copied the new code below (the comment above the code explains the problem).
# prior to migrating to FastAPI, this operation was done more efficiently for all cases using
# `shutil.move(input_file_path, path)`, but FastAPI stores the uploaded file as
# `tempfile.SpooledTemporaryFile`
# (https://docs.python.org/3/library/tempfile.html#tempfile.SpooledTemporaryFile), so now there is not even
# a path where uploaded files can be accessed on disk
if input_file_path:
shutil.move(input_file_path, path)
else:
with open(path, "wb") as destination:
shutil.copyfileobj(cast(IO[bytes], input_file), destination)It's tricky because the limitation is in FastAPI's design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should be able to call rollover on the file and move it then ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanksss, that's just what I was looking for!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm there is a problem with this. Besides storing the data in memory until the file is rolled over to disk, tempfile.SpooledTemporaryFile operates exactly as tempfile.TemporaryFile. So even if rollover() is called, there is no path for that file (and there is nothing to call shutil.move() on).
Under Unix, the directory entry for the file is either not created at all or is removed immediately after the file is created. Other platforms do not support this; your code should not rely on a temporary file created using this function having or not having a visible name in the file system.
Definitely writing the file to disk is twice suboptiomal... 🤔
The point of having both decorators `@router.get()` and `@router.head()` is working around FastAPI issue fastapi/fastapi#13175. If there is only one decorator `@router.api_route()`, then both GET and HEAD get the same operation id. If both decorators are used, then FastAPI generates two different operation ids. The idea behind argument unpacking is that the same method `index()` has to be reused both for GET and HEAD. The two-decorator approach is needed to avoid the duplicate operation id warning and by definition, both the GET and HEAD endpoints must be defined in exactly the same way. Argument unpacking allows to avoid repeating the arguments of the first decorator in the second decorator. However, including both GET and HEAD in the API schema is redundant. Commit bed78be adds `include_in_schema=False` to the `@router.head()`. That implies that there is now not so much benefit in the argument unpacking trick, because the kwargs are no longer needed and there would then be little repetition (only the path). Thus, declare `@router.head()` explicitly.
As part of the development of an integration of Galaxy with ARC (Advanced Resource Connector) as a Pulsar job runner, tweaks to the
JobFilesAPIControllerare needed. None of such tweaks are included in this PR, but it makes sense to implement them building upon a FastAPI endpoint rather than a legacy WSGI one; and the first step to do that is migrating the controller to FastAPI.FastAPIJobFilesis the new, FastAPI version ofJobFilesAPIController. The endpoints that have been migrated should exhibit exactly the same behavior as the old ones fromFastAPIJobFiles.Endpoints dedicated to TUS uploads work in tandem with the WSGI middleware
TusMiddlewarefrom thetuswsgipackage. WSGI middlewares and endpoints are injected into the FastAPI app after FastAPI routes as a single sub-applicationwsgi_handlerusingapp.mount("/", wsgi_handler), meaning that requests are passed to thewsgi_handlersub-application (and thus toTusMiddleware) only if there was no FastAPI endpoint defined to handle them. Therefore, they cannot be migrated to FastAPI unlessTusMiddlewareis also migrated to ASGI. I am postponing that migration, because the ARC integration needs to be delivered soonish.I also included three new tests for existing functionality: writing from uploads done with the nginx_upload_module, from TUS uploads and using the parameter
__file.How to test the changes?
(Select all options that apply)
License