Skip to content

[SYNPY-1351, SYNPY-1613] Implement 'Wiki2' model into OOP #1206

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 42 commits into
base: develop
Choose a base branch
from

Conversation

danlu1
Copy link
Contributor

@danlu1 danlu1 commented Jun 11, 2025

Problem:

  1. WikiPage Services 2 is not implemented in the client.
  2. No asynchronous get_paginated function in client module.
  3. download_from_url and download_from_url_multi_threaded are not supporting download directly via presigned url.

Solution:

  1. Add a new Wiki2 service to Python Client
  2. Add rest_get_paginated_async function to client module.
  3. Update download_from_url and download_from_url_multi_threaded functions to allow file download directly via presigned url.

Testing:

(wip)

@danlu1 danlu1 requested a review from a team as a code owner June 11, 2025 20:55
@danlu1 danlu1 marked this pull request as draft June 11, 2025 20:56
@danlu1
Copy link
Contributor Author

danlu1 commented Jun 11, 2025

@BryanFauble @thomasyu888 I put this draft PR to show tentative interface of the wiki2 service. Protocol, tutorials and test scripts are wip. Feel free to leave comments if you envision it differently.

@BryanFauble
Copy link
Member

Copy link
Member

@BryanFauble BryanFauble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start to the changes!

Feel free to reach out to me with questions, and re-request review when you'd like me to take a look at this PR again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great addition of the protocols

@danlu1 danlu1 changed the title [SYNPY-1351] Implement 'Wiki2' model into OOP [SYNPY-1351, SYNPY-1613] Implement 'Wiki2' model into OOP Jun 13, 2025
markdown: Optional[str] = None
"""The markdown content of this page."""

attachments: List[Dict[str, Any]] = field(default_factory=list)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of returning a Dict here, could we instead return back a WikiAttachment dataclass? If one doesn't exist you'd need to create it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the attachments just a list of file path. For attachment-related endpoints, only url is given. Do you still think this is needed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the Rest API Docs for the V2WikiPage: https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/v2/wiki/V2WikiPage.html

On these lines you are setting:

  1. markdown
  2. attachments
  3. owner_id
  4. wiki_version

But - These fields don't exist on the Rest API docs. When you call the fill_from_dict, these would always be wiped out and replaced with None or [], since Synapse would never return these values.

I understand how these fields can be used during the store operations - But, how would a user read for these values?

Copy link
Contributor Author

@danlu1 danlu1 Jul 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, V2WikiPage doesn't have the attributes you listed having them in the fill_from_dict would not be needed. I was using these attributes to assist user to list out parameters that can be used for api call, such as GET /entity/{ownerId}/wiki2, the user specifies owner_id when constructing the wikipage object and call get to get the wikipage for the owner_id.

file_path = self._to_gzip_file(
wiki_content=attachment, synapse_client=client
)
file_handle = await upload_file_handle(
Copy link
Member

@BryanFauble BryanFauble Jun 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an improvement that can be made here. Suppose you have 10 attachments. This code as written will upload all 10 attachments sequentially one after the other.

Using asyncio.as_completed, you can execute the upload process in parallel for all 10 attachments. A significant performance gain.

There are examples In the code to show how it can be used.

for completed_task in asyncio.as_completed(tasks):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added gather function to run tasks in parallel since I would like to get the file handle the same order as the given file path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, using the gather should work fine. My only request is that you wrap the call to upload_file_handle in this context manager:

            async with synapse_client._get_parallel_file_transfer_semaphore(
                asyncio_event_loop=asyncio.get_running_loop()
            ):
                    file_handle = await upload_file_handle(
                        syn=synapse_client,
                        parent_entity_id=self.owner_id,
                        path=file_path,
                    )

The reason why can be found in the method docstring:

def _get_parallel_file_transfer_semaphore(
self, asyncio_event_loop: asyncio.AbstractEventLoop
) -> asyncio.Semaphore:
"""
Retrieve the semaphore for the Synapse client. Or create a new one if it does
not exist. This semaphore is used to limit the number of files that can actively
enter the uploading/downloading process.
This is expected to be called from within an AsyncIO loop.
By default the number of files that can enter the "uploading" state will be
limited to 2 * max_threads. This is to ensure that the files that are entering
into the "uploading" state will have priority to finish. Additionally, it means
that there should be a good spread of files getting up to the "uploading"
state, entering the "uploading" state, and finishing the "uploading" state.
If we break these states down into large components they would look like:
- Before "uploading" state: HTTP rest calls to retrieve what data Synapse has
- Entering "uploading" state: MD5 calculation and HTTP rest calls to determine
how/where to upload a file to.
- During "uploading" state: Uploading the file to a storage provider.
- After "uploading" state: HTTP rest calls to finalize the upload.
This has not yet been applied to parallel file downloads. That will take place
later on.
"""
if (
hasattr(self, "_parallel_file_transfer_semaphore")
and asyncio_event_loop in self._parallel_file_transfer_semaphore
and self._parallel_file_transfer_semaphore[asyncio_event_loop] is not None
):
return self._parallel_file_transfer_semaphore[asyncio_event_loop]
self._parallel_file_transfer_semaphore.update(
{asyncio_event_loop: asyncio.Semaphore(max(self.max_threads * 2, 1))}
)
return self._parallel_file_transfer_semaphore[asyncio_event_loop]

The created wiki page.

Raises:
ValueError: If owner_id is not provided or if required fields are missing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each of the method for both the async and non-async functions should have at least 1 example to show how to use and call the function, along with the required imports. Ideally one could copy/paste the code, minimally change something, and run it.

raise ValueError("Must provide download_location to download a file.")

# construct PresignedUrlInfo for downloading
presigned_url_info = PresignedUrlInfo(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem like you need to construct this object because you turn around and directly use the attributes off the object to pass to subsequent function calls. You could just directly pass those values to the function calls.

Copy link
Member

@BryanFauble BryanFauble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great progress! Please re-request review when you are ready for another round.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants