[HMA] Add Cache to banks to keep Content Type Counts#1816
juanmrad wants to merge 4 commits into facebook:main
Dcallies
left a comment
Might be easier to talk through live - let me know if you want to chat on Discord. I think this metric is useful, but I'm not yet sure about the implementation.
Top level comments:
- I have an impression based on the lifecycle of the equivalent tool at Meta that we are much better off explaining the contents of the bank in terms of photos/videos/text/etc than signal counts (though it's possible both are separately useful). How do you feel about doing a rollup on content type in addition to or instead of signal count?
- Should we look into calculating the contents of the bank using Postgres counts? I'm hesitant about this approach since live counts can drift from what the index actually contains until the next indexer run, though at least we could be confident that we avoid the other issues that come with maintaining a counter.
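For the second point, a rough sketch of what "counts straight from the database" could look like. This is only an illustration: it uses an in-memory SQLite database as a stand-in for Postgres, and the table and column names (`bank_content`, `content_signal`, `signal_type`) are hypothetical, not the actual HMA schema.

```python
import sqlite3

# Hypothetical, simplified schema; the real HMA tables and columns may differ.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE bank_content (id INTEGER PRIMARY KEY, bank_name TEXT NOT NULL);
    CREATE TABLE content_signal (
        content_id INTEGER NOT NULL REFERENCES bank_content(id),
        signal_type TEXT NOT NULL
    );
    INSERT INTO bank_content (id, bank_name)
        VALUES (1, 'bank_a'), (2, 'bank_a'), (3, 'bank_b');
    INSERT INTO content_signal (content_id, signal_type)
        VALUES (1, 'pdq'), (2, 'pdq'), (2, 'video_md5'), (3, 'pdq');
    """
)


def signal_counts(conn, bank_name):
    """Count signals per signal type for one bank with a single GROUP BY."""
    rows = conn.execute(
        """
        SELECT s.signal_type, COUNT(*)
        FROM content_signal AS s
        JOIN bank_content AS c ON c.id = s.content_id
        WHERE c.bank_name = ?
        GROUP BY s.signal_type
        """,
        (bank_name,),
    ).fetchall()
    return dict(rows)


counts_a = signal_counts(conn, "bank_a")
```

The trade-off is the one noted above: a query like this reflects what is in the tables right now, which may be ahead of what the last index build actually contains.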
| <th scope="row" style="background-color: transparent;">{{ loop.index }}</th> | ||
| <td style="background-color: transparent;">{{ bank['name'] }}</td> | ||
| <td style="background-color: transparent;"> | ||
| {% if bank['content_type_counts'] %} |
While the signal counts are interesting to engineers, I think we might want to express this in terms of the underlying content that these signals can match.
E.g. rather than "100 pdq", I'm suggesting "100 photos".
This isn't 100% analogous, but I think it will better map to people's mental model, especially non-technical folks. We can add a tooltip to explain that this is an analogue on hover if we wanted to.
If you do have a strong feeling that you want to keep this to signal counts, then we should use "Signal" instead of "Content"
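To make the rollup idea concrete, here is a minimal sketch. The mapping from signal type to label is an assumption for illustration (`other_photo_hash` is a made-up second photo signal), and it counts distinct content IDs so that one photo with two photo signals still shows up as one photo.

```python
from collections import defaultdict

# Assumed mapping from signal type name to a user-facing content label.
# "other_photo_hash" is a hypothetical signal type used only for this example.
SIGNAL_TO_CONTENT = {
    "pdq": "photos",
    "other_photo_hash": "photos",
    "video_md5": "videos",
}


def rollup_by_content_type(signals):
    """Roll (bank_content_id, signal_type) pairs up into content-type counts.

    Counts distinct content IDs per label, so a piece of content that has
    several signals of the same content type is only counted once.
    """
    ids_by_label = defaultdict(set)
    for content_id, signal_type in signals:
        label = SIGNAL_TO_CONTENT.get(signal_type, "other")
        ids_by_label[label].add(content_id)
    return {label: len(ids) for label, ids in ids_by_label.items()}


example = [(1, "pdq"), (1, "other_photo_hash"), (2, "pdq"), (3, "video_md5")]
content_counts = rollup_by_content_type(example)
```

Summing raw signal counts would report three photos for this example instead of two, which is the "not 100% analogous" gap mentioned above.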
There was a problem hiding this comment.
My thought was that signals and content may differ. You may enable a signal type later, so older content may not have every signal type.
I do think content makes sense, but that was the reasoning behind using signal over content type. I made the change after thinking over the name, so I definitely need to clean up the naming lol.
| if not bank: | ||
| abort(404, f"bank '{bank_name}' not found") | ||
| return jsonify(bank) | ||
| return {"name": bank.name, "matching_enabled_ratio": bank.matching_enabled_ratio} |
Why not also return your new signal counts? They are cheap to fetch, right?
| enabled_ratio = 1.0 if flask_utils.str_to_bool(data["enabled"]) else 0.0 | ||
| return jsonify(bank_create_impl(name, enabled_ratio)), 201 | ||
| bank = bank_create_impl(name, enabled_ratio) | ||
| return { |
(applies to the other edited APIs) Consider explicitly typing the return, similar to what other endpoints are doing - or at least documenting what we expect this to return in the docstring
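One possible shape for this, as a sketch only: a `TypedDict` describing the response plus a small helper that builds it. `BankResponse`, `bank_to_response`, and the stand-in `Bank` dataclass are hypothetical names, not existing HMA code, and the `content_type_counts` field name is taken from the template in this diff and may change with the naming discussion.

```python
from dataclasses import dataclass, field
from typing import Dict, TypedDict


class BankResponse(TypedDict):
    """Documented JSON shape for the bank endpoints (hypothetical name)."""

    name: str
    matching_enabled_ratio: float
    content_type_counts: Dict[str, int]


@dataclass
class Bank:
    """Stand-in for the real bank config object, for illustration only."""

    name: str
    matching_enabled_ratio: float
    content_type_counts: Dict[str, int] = field(default_factory=dict)


def bank_to_response(bank: Bank) -> BankResponse:
    """Build the typed response in one place so every endpoint returns the same keys."""
    return {
        "name": bank.name,
        "matching_enabled_ratio": bank.matching_enabled_ratio,
        "content_type_counts": dict(bank.content_type_counts),
    }


response = bank_to_response(Bank("TEST_BANK", 1.0, {"pdq": 3}))
```

Routing the get, create, and update endpoints through one helper like this would also cover the earlier question about returning the counts from every bank API.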
| # 0.0-1.0 - what percentage of contents should be | ||
| # considered a match? Seeded by target content | ||
| matching_enabled_ratio: float | ||
| # Cache of content type counts |
blocking: This is currently implemented as signal type counts
| signal_val: str | ||
| bank_content_id: int | ||
| bank_content_timestamp: int | ||
| bank_name: str |
How expensive is it to fetch this in order to add it to the output?
| .join(database.BankContent) | ||
| .join(database.Bank) |
(answer to previous question) - two additional joins - does this appreciably impact the iteration speed?
There was a problem hiding this comment.
We have indexes on the tables, but there's definitely an impact. I'll have to measure it at 10M records tho 🤔
Summary
I'm adding a new attribute to banks that tracks how much content they contain, broken down by content type.
The cache is updated after each index operation, so the counts are refreshed as the index is built. This avoids race conditions when new content is added.
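As a rough illustration of the idea (not the code in this PR), the counts can be tallied from the same rows the indexer iterates over and then written to each bank once the build finishes. The function name and the `(bank_name, signal_type)` row shape are assumptions for this sketch.

```python
from collections import Counter, defaultdict


def count_signals_per_bank(signal_rows):
    """Tally signal types per bank from (bank_name, signal_type) rows.

    The counts are recomputed from scratch on every index build rather than
    incremented on each add, so concurrent adds cannot leave the cache out of
    sync with the index that was just built.
    """
    counts = defaultdict(Counter)
    for bank_name, signal_type in signal_rows:
        counts[bank_name][signal_type] += 1
    return {bank: dict(per_type) for bank, per_type in counts.items()}


rows = [
    ("bank_a", "pdq"),
    ("bank_a", "pdq"),
    ("bank_a", "video_md5"),
    ("bank_b", "pdq"),
]
cache = count_signals_per_bank(rows)
```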
Test Plan
Tested locally, and added the counts to the hash bank page.