Delete files associated with anonymous msgs#258
Conversation
|
|
||
| # We currently want anonymous users' messages to expire after 1 day | ||
| message_expiration_time = datetime.now(UTC) + timedelta(days=1) if agent.is_anonymous_user else None | ||
| # We currently want anonymous users' messages to expire after 30 days |
There was a problem hiding this comment.
Is this true? I haven't heard of the requirements around anonymous messages changing.
There was a problem hiding this comment.
Ahhh I thought anonymous msgs have the same time period of the term that logged-in user msgs aren't removable after 30 days. Let me change it back
There was a problem hiding this comment.
Ah, the 1 day thing only exists because we need to store things in the DB to show them to our users. If we could show threads without saving to the DB we would've done that instead! 1 day is just easier with what we have now.
|
Assuming we're now allowed to keep anonymous messages for 30 days, how do you think this affects future maintenance of this application? If we're allowed to keep anonymous messages forever, how many places do we need to update to make sure they keep working properly? If the requirements for anonymous messages change, are we able to make that same adjustment in all those places? Is there a way to have GCS handle deletion that doesn't involve moving things between folders? |
We just need to update the auto-deleting rule in the GCS bucket.
We just need to delete the auto-deleting rule in the GCS bucket. Files related to anonymous messages could still be saved under
If the change is about the deletion time after X days, we can update the auto-deleting rule in the GCS bucket.
Yes. GCS can auto-delete any objects in the bucket. However, we have to differentiate what files are created by anonymous messages. Having an |
@yensung does GCS have custom metadata we can use instead of having a separate folder? I'm not a fan of having logic in a separate system not maintained in GitHub, but if we can drive the logic from this application I'd be happier with it. If we could set a time to delete a message that GCS uses to delete the message I think that'd work well. If you're able to get something like that to work, please add some documentation about it in comments or in a readme somewhere. |
src/dao/message.py
Outdated
| if (expiration_time is not None): | ||
| expiration_time_stmt = "expiration_time = NULL," if type(expiration_time) is int else "expiration_time = COALESCE(%(expiration_time)s)," |
There was a problem hiding this comment.
Was this causing problems? Why won't just expiration_time=%(expiration_time)s work?
There was a problem hiding this comment.
I didn't see the expiration time got reset to NULL when expiration_time is None
src/dao/message.py
Outdated
| "finish_reason": finish_reason, | ||
| "harmful": harmful, | ||
| "file_urls": file_urls, | ||
| # assign an integer to reset expiration time to Null |
There was a problem hiding this comment.
I figure this is related to the expiration_time_stmt above?
There was a problem hiding this comment.
I just found your PR that updates expiration time for anonymous msgs #255.
I can remove the changes to this function once the PR is merged.
src/message/GoogleCloudStorage.py
Outdated
| "duration_ms": (end_ns - start_ns) / 1_000_000, | ||
| }) | ||
|
|
||
| def update_file_custom_time(self, filename: str, new_time: datetime): |
There was a problem hiding this comment.
I would maybe call this something like "update_file_deletion_time" so it's more clear what this is doing for our usecase?
There was a problem hiding this comment.
Was naming custom_time as it aligns with the attribute name. No strong opinion here tho
There was a problem hiding this comment.
custom_time makes sense if it were a lower-level function, but I think it's good to be specific in our implementation here!
src/message/GoogleCloudStorage.py
Outdated
| "duration_ms": (end_ns - start_ns) / 1_000_000, | ||
| }) | ||
|
|
||
| def move_file(self, old_name: str, new_name: str): |
src/message/GoogleCloudStorage.py
Outdated
| current_app.logger.info( | ||
| f"Migrating {filename} from anonymous to normal in the bucket:{self.bucket.name} on GoogleCloudStorage", | ||
| ) | ||
| self.update_file_custom_time(filename, GCS_MAX_DATETIME_LIMIT) No newline at end of file |
There was a problem hiding this comment.
It'd be good to have a comment here explaining what the custom time is doing for us
| self.update_file_custom_time(filename, GCS_MAX_DATETIME_LIMIT) | |
| # We're using the file's custom time to have GCS automatically delete files | |
| # We're not able to un-set the custom time, so instead we're setting it to the furthest time possible | |
| self.update_file_custom_time(filename, GCS_MAX_DATETIME_LIMIT) |
| filename = url.split('/')[-1] | ||
| whole_name = f"{msg.root}/{filename}" |
There was a problem hiding this comment.
We should probably use Python's URL parsing here. We can also get the full path from the URL!
Maybe something like this?
| filename = url.split('/')[-1] | |
| whole_name = f"{msg.root}/{filename}" | |
| whole_name = urlparse(url).path |
There was a problem hiding this comment.
The url includes the bucket name. I don't think we want it here..
There was a problem hiding this comment.
Ahh, right, because we're doing this within the context of a bucket. This is fine then!
Closes: https://github.com/allenai/playground-issues-repo/issues/218
This PR changes a couple of things:
anonymous/folder on Google Cloud Storage. The main reason is that Google Cloud Storage allows us to add auto-deletion rules for objects in the storage. I've added a rule to automatically delete objects if they are over 30 days under theanonymousfolder so that we don't have to update/maintain the existing script to do that.The migration flow for anonymous user messages has been updated as below:
a. Moving out files associated with anonymous user msgs from the
anonymous/folder on google cloud storageb. Set
expiration_timeto null, update oldfile_urlswith new urls, and setprivateto falsec. Update
creatorfrom anonymous user ID to logged-in auth ID for bothMessageandLabeltablesFix the
delete_multiple_files_by_urlfunction in the GoogleCloudStorage fileAnonymous msg
Mar-24-2025.20-46-14.mp4
Migration
Mar-24-2025.20-46-20.mp4
Deletion
Mar-24-2025.20-46-25.mp4