Conversation
Add new constructor parameters to IabClient: - excludeUseragents: custom patterns to classify as spiders/robots - includeUseragents: custom patterns to classify as browsers Precedence order (highest first): 1. Custom includeUseragents -> browser 2. Custom excludeUseragents -> spider/robot 3. IAB IP file check 4. IAB include file check 5. IAB exclude file check Matching is case-insensitive substring matching.
| } | ||
|
|
||
| static IabResponse customExcludeCheckFailed() { | ||
| return createForSpiderOrRobot(ACTIVE_SPIDER_OR_ROBOT, FAILED_UA_EXCLUDE, PAGE_AND_AD_IMPRESSIONS); |
There was a problem hiding this comment.
Why did you choose ACTIVE_SPIDER_OR_ROBOT as the category? And why did you choose PAGE_AND_AD_IMPRESSIONS as the primary impact?
Am I right that we don't actually know the correct category/impact to put in the response in this scenario? Because our customers are not providing that information via the enrichment config; they simply add a useragent string with no state reason.
Did the acceptance criteria say what to put into the category/impact field in this scenario? And if no.... do you think we need to ask Product, or are you confident you have picked good sensible values?
There was a problem hiding this comment.
I talked to @stanch and these are the default he picked until further configurability is asked by users.
There was a problem hiding this comment.
Update based on the conversation in JIRA: let’s use SPIDER_OR_ROBOT and UNKNOWN. I have a slight preference to use PAGE_AND_AD_IMPRESSIONS because I feel that if a user went through the trouble of adding the bot by hand, they would want to remove it from everywhere. However, I am also almost sure that nobody uses this field. So let’s pick a prudent default (i.e. UNKNOWN) and wait for someone to complain :)
|
|
||
| private static boolean matchesAny(String userAgentLower, List<String> patterns) { | ||
| for (String pattern : patterns) { | ||
| if (StringUtils.contains(userAgentLower, pattern)) { |
There was a problem hiding this comment.
Why did you chose .contains not .startsWith?
This library has an existing feature to include/exclude useragents, and it switches between .contains or .startsWith depending on the row of the database file.
Did the acceptance criteria say to use .contains? Do you think we need clear acceptance criteria, or are you confident you have picked the correct type of string matcher for our new feature?
There was a problem hiding this comment.
I talked to @stanch and he told me that matching behavior should be contains always
There was a problem hiding this comment.
I think we don’t want to add too much flexibility here, and most bot identifiers are inside of the user agent string.
| if (userAgentLower != null && matchesAny(userAgentLower, customIncludeUseragents)) { | ||
| return IabResponse.identifiedAsBrowser(); | ||
| } | ||
|
|
||
| if (userAgentLower != null && matchesAny(userAgentLower, customExcludeUseragents)) { | ||
| return IabResponse.customExcludeCheckFailed(); | ||
| } |
There was a problem hiding this comment.
How sure are you that these two new checks go before the ipRanges check? Did you consider whether it makes more sense for the enrichment to check ip range first before checking the user agent overrides?
There was a problem hiding this comment.
yeah, product-wise it makes more sense to have ip range check first, I'll update in a new commit
| public IabResponse checkAt(String userAgent, InetAddress ipAddress, Date accurateAt) { | ||
| assertCheckAtArguments(userAgent, ipAddress, accurateAt); | ||
|
|
||
| String userAgentLower = IabFile.toLowerCase(userAgent); |
There was a problem hiding this comment.
Here you covert the user agent string to lower case. Which is.... kinda ok, but not completely free.
And later the code calls includeUserAgents.present(userAgent, accurateAt) which again converts the string to lower case.
Here's a suggestion for a different way to solve the problem, which means you can avoid adding in an extra toLowerCase. Instead of adding two new fields customIncludeUseragents and customExludeUseragents to this class... how about instead you re-use the existing fields includeUserAgents and excludeUserAgents to hold the custom overrides. Then the implementation of checkAt can remain unchanged, there is literally nothing to change about checkAt.
There was a problem hiding this comment.
I see the point but we lose override semantics of custom fields if we merge custom into existing fields. An example:
includeUseragents = ["TrustedBot"]
excludeUseragents = ["Bot"]
userAgent = "TrustedBot/1.0"
include matches, then exclude also matches, returning bot with reason FAILED_UA_EXCLUDE whereas user would expect category BROWSER, if custom field is expected to override iab files. Does that align with expected semantics?
I shared another way to get rid of duplicate conversion in a new commit
There was a problem hiding this comment.
Thanks for the example - that helps clarify the semantic difference.
But I want to push back on whether the current PR's semantics are actually the right design. You've implemented "custom include wins outright" - if a UA matches a custom include pattern, it's immediately classified as a browser, bypassing all other checks including custom exclude.
Consider this scenario with the current design:
customIncludeUseragents = ["Chrome"]
customExcludeUseragents = ["BadBot"]
userAgent = "BadBot/1.0 (compatible; Chrome)"
Result: browser (because "Chrome" matches first and returns immediately)
That seems problematic - the user explicitly wanted to block "BadBot", but it slips through.
The existing IAB system doesn't work this way. A UA can match the include list (known browsers) but still be blocked by the exclude list (known bots). The exclude list acts as a safeguard. That's more conservative and arguably safer.
If we merged custom patterns into the existing fields (my original suggestion), we'd get the same semantics as the IAB files: must match include AND not match exclude. Your counter-example:
includeUseragents = ["TrustedBot"]
excludeUseragents = ["Bot"]
userAgent = "TrustedBot/1.0" → blocked
I'd argue this isn't a bug - it's the system correctly handling conflicting patterns by erring on the side of caution. If the customer wants "TrustedBot" allowed, they shouldn't also add "Bot" as a broad exclude pattern.
So I think there are two questions here:
- Product question: Should custom include "win outright", or should it follow the same semantics as the existing IAB system? (Might need PM input on this)
- Performance question: Regardless of which semantics we choose, can we avoid the duplicate
toLowerCasecall?
What do you think?
There was a problem hiding this comment.
Let me tag @stanch for the product question. For the performance question, I pushed a new commit to avoid multiple calls, does that address your comment as expected?
There was a problem hiding this comment.
One of the stated use cases for customizing this enrichment is that our server-side trackers are being flagged as bots and not all of them allow configuring the user agent string... So the user should be able to override that.
I think if we clearly document the new options as overrides, the logic of them trumping the file-based lists is easy to follow. I don’t really expect the include override to be used for anything other than the use case above, so I don’t think someone will put Chrome there.
istreeter
left a comment
There was a problem hiding this comment.
This is starting to look good @oguzhanunlu. I have one minor criticism and one question.
The minor criticism is the lower-case a in parameters like excludeUseragents. It seems like this library has an established style which is to upper-case Agents e.g. excludeUserAgent. So it stands out as unusual that you have have added parameters into this library which disregard the style.
And my question is about the relative order of IP check and custom includes/excludes check. Previously I questioned if you had them in the right order, so then you switched the order. But my question then was only a question. Are you sure now they are in the correct order, or is this something that deserves a discussion?
Add new constructor parameters to IabClient:
Precedence order (highest first):
Matching is case-insensitive substring matching.
ref: https://snplow.atlassian.net/browse/PDP-2425