Skip to content

Delete files associated with anonymous msgs#258

Merged
yensung merged 11 commits intomainfrom
218DeleteAssociatedAnonymousFiles
Mar 31, 2025
Merged

Delete files associated with anonymous msgs#258
yensung merged 11 commits intomainfrom
218DeleteAssociatedAnonymousFiles

Conversation

@yensung
Copy link
Copy Markdown
Contributor

@yensung yensung commented Mar 25, 2025

Closes: https://github.com/allenai/playground-issues-repo/issues/218

This PR changes a couple of things:

  1. The files associated with anonymous user msgs would be saved under the anonymous/ folder on Google Cloud Storage. The main reason is that Google Cloud Storage allows us to add auto-deletion rules for objects in the storage. I've added a rule to automatically delete objects if they are over 30 days under the anonymous folder so that we don't have to update/maintain the existing script to do that.
Screenshot 2025-03-24 at 20 33 40
  1. The migration flow for anonymous user messages has been updated as below:
    a. Moving out files associated with anonymous user msgs from the anonymous/ folder on google cloud storage
    b. Set expiration_time to null, update old file_urls with new urls, and set private to false
    c. Update creator from anonymous user ID to logged-in auth ID for both Message and Label tables

  2. Fix the delete_multiple_files_by_url function in the GoogleCloudStorage file

Anonymous msg

Mar-24-2025.20-46-14.mp4

Migration

Mar-24-2025.20-46-20.mp4

Deletion

Mar-24-2025.20-46-25.mp4

@yensung yensung requested review from arnavic and mtblanton March 25, 2025 03:49

# We currently want anonymous users' messages to expire after 1 day
message_expiration_time = datetime.now(UTC) + timedelta(days=1) if agent.is_anonymous_user else None
# We currently want anonymous users' messages to expire after 30 days
Copy link
Copy Markdown
Contributor

@mtblanton mtblanton Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true? I haven't heard of the requirements around anonymous messages changing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh I thought anonymous msgs have the same time period of the term that logged-in user msgs aren't removable after 30 days. Let me change it back

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the 1 day thing only exists because we need to store things in the DB to show them to our users. If we could show threads without saving to the DB we would've done that instead! 1 day is just easier with what we have now.

@mtblanton
Copy link
Copy Markdown
Contributor

mtblanton commented Mar 25, 2025

Assuming we're now allowed to keep anonymous messages for 30 days, how do you think this affects future maintenance of this application?

If we're allowed to keep anonymous messages forever, how many places do we need to update to make sure they keep working properly?

If the requirements for anonymous messages change, are we able to make that same adjustment in all those places?

Is there a way to have GCS handle deletion that doesn't involve moving things between folders?

@yensung
Copy link
Copy Markdown
Contributor Author

yensung commented Mar 25, 2025

Assuming we're now allowed to keep anonymous messages for 30 days, how do you think this affects future maintenance of this application?

We just need to update the auto-deleting rule in the GCS bucket.

If we're allowed to keep anonymous messages forever, how many places do we need to update to make sure they keep working properly?

We just need to delete the auto-deleting rule in the GCS bucket. Files related to anonymous messages could still be saved under anonymous/ folder in the GCS bucket. They'll be migrated once users log in in the playground.

If the requirements for anonymous messages change, are we able to make that same adjustment in all those places?

If the change is about the deletion time after X days, we can update the auto-deleting rule in the GCS bucket.
If the change is about not allowing anonymous users anymore, we don't have to worry about it as no files would be saved under anonymous/ folder.

Is there a way to have GCS handle deletion that doesn't involve moving things between folders?

Yes. GCS can auto-delete any objects in the bucket. However, we have to differentiate what files are created by anonymous messages. Having an anonymous/ folder and putting related files under it is the easiest way to differentiate.

@mtblanton
Copy link
Copy Markdown
Contributor

mtblanton commented Mar 25, 2025

Yes. GCS can auto-delete any objects in the bucket. However, we have to differentiate what files are created by anonymous messages. Having an anonymous/ folder and putting related files under it is the easiest way to differentiate.

@yensung does GCS have custom metadata we can use instead of having a separate folder?

I'm not a fan of having logic in a separate system not maintained in GitHub, but if we can drive the logic from this application I'd be happier with it. If we could set a time to delete a message that GCS uses to delete the message I think that'd work well.

If you're able to get something like that to work, please add some documentation about it in comments or in a readme somewhere.

Comment on lines +537 to +538
if (expiration_time is not None):
expiration_time_stmt = "expiration_time = NULL," if type(expiration_time) is int else "expiration_time = COALESCE(%(expiration_time)s),"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this causing problems? Why won't just expiration_time=%(expiration_time)s work?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see the expiration time got reset to NULL when expiration_time is None

"finish_reason": finish_reason,
"harmful": harmful,
"file_urls": file_urls,
# assign an integer to reset expiration time to Null
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I figure this is related to the expiration_time_stmt above?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just found your PR that updates expiration time for anonymous msgs #255.
I can remove the changes to this function once the PR is merged.

"duration_ms": (end_ns - start_ns) / 1_000_000,
})

def update_file_custom_time(self, filename: str, new_time: datetime):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would maybe call this something like "update_file_deletion_time" so it's more clear what this is doing for our usecase?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was naming custom_time as it aligns with the attribute name. No strong opinion here tho

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

custom_time makes sense if it were a lower-level function, but I think it's good to be specific in our implementation here!

"duration_ms": (end_ns - start_ns) / 1_000_000,
})

def move_file(self, old_name: str, new_name: str):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

current_app.logger.info(
f"Migrating {filename} from anonymous to normal in the bucket:{self.bucket.name} on GoogleCloudStorage",
)
self.update_file_custom_time(filename, GCS_MAX_DATETIME_LIMIT) No newline at end of file
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be good to have a comment here explaining what the custom time is doing for us

Suggested change
self.update_file_custom_time(filename, GCS_MAX_DATETIME_LIMIT)
# We're using the file's custom time to have GCS automatically delete files
# We're not able to un-set the custom time, so instead we're setting it to the furthest time possible
self.update_file_custom_time(filename, GCS_MAX_DATETIME_LIMIT)

Comment on lines +94 to +95
filename = url.split('/')[-1]
whole_name = f"{msg.root}/{filename}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably use Python's URL parsing here. We can also get the full path from the URL!

Maybe something like this?

Suggested change
filename = url.split('/')[-1]
whole_name = f"{msg.root}/{filename}"
whole_name = urlparse(url).path

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The url includes the bucket name. I don't think we want it here..

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, right, because we're doing this within the context of a bucket. This is fine then!

Copy link
Copy Markdown
Contributor

@mtblanton mtblanton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great work!

@yensung yensung merged commit b4dd4f1 into main Mar 31, 2025
3 checks passed
@yensung yensung deleted the 218DeleteAssociatedAnonymousFiles branch March 31, 2025 20:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants