Skip to content

Add support for custom user agent lists#22

Open
oguzhanunlu wants to merge 8 commits intodevelopfrom
custom-useragent-lists
Open

Add support for custom user agent lists#22
oguzhanunlu wants to merge 8 commits intodevelopfrom
custom-useragent-lists

Conversation

@oguzhanunlu
Copy link
Member

@oguzhanunlu oguzhanunlu commented Feb 3, 2026

Add new constructor parameters to IabClient:

  • excludeUseragents: custom patterns to classify as spiders/robots
  • includeUseragents: custom patterns to classify as browsers

Precedence order (highest first):

  1. Custom includeUseragents -> browser
  2. Custom excludeUseragents -> spider/robot
  3. IAB IP file check
  4. IAB include file check
  5. IAB exclude file check

Matching is case-insensitive substring matching.

ref: https://snplow.atlassian.net/browse/PDP-2425

Add new constructor parameters to IabClient:
- excludeUseragents: custom patterns to classify as spiders/robots
- includeUseragents: custom patterns to classify as browsers

Precedence order (highest first):
1. Custom includeUseragents -> browser
2. Custom excludeUseragents -> spider/robot
3. IAB IP file check
4. IAB include file check
5. IAB exclude file check

Matching is case-insensitive substring matching.
}

static IabResponse customExcludeCheckFailed() {
return createForSpiderOrRobot(ACTIVE_SPIDER_OR_ROBOT, FAILED_UA_EXCLUDE, PAGE_AND_AD_IMPRESSIONS);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you choose ACTIVE_SPIDER_OR_ROBOT as the category? And why did you choose PAGE_AND_AD_IMPRESSIONS as the primary impact?

Am I right that we don't actually know the correct category/impact to put in the response in this scenario? Because our customers are not providing that information via the enrichment config; they simply add a useragent string with no state reason.

Did the acceptance criteria say what to put into the category/impact field in this scenario? And if no.... do you think we need to ask Product, or are you confident you have picked good sensible values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I talked to @stanch and these are the default he picked until further configurability is asked by users.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update based on the conversation in JIRA: let’s use SPIDER_OR_ROBOT and UNKNOWN. I have a slight preference to use PAGE_AND_AD_IMPRESSIONS because I feel that if a user went through the trouble of adding the bot by hand, they would want to remove it from everywhere. However, I am also almost sure that nobody uses this field. So let’s pick a prudent default (i.e. UNKNOWN) and wait for someone to complain :)


private static boolean matchesAny(String userAgentLower, List<String> patterns) {
for (String pattern : patterns) {
if (StringUtils.contains(userAgentLower, pattern)) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you chose .contains not .startsWith?

This library has an existing feature to include/exclude useragents, and it switches between .contains or .startsWith depending on the row of the database file.

Did the acceptance criteria say to use .contains? Do you think we need clear acceptance criteria, or are you confident you have picked the correct type of string matcher for our new feature?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I talked to @stanch and he told me that matching behavior should be contains always

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don’t want to add too much flexibility here, and most bot identifiers are inside of the user agent string.

Comment on lines 126 to 132
if (userAgentLower != null && matchesAny(userAgentLower, customIncludeUseragents)) {
return IabResponse.identifiedAsBrowser();
}

if (userAgentLower != null && matchesAny(userAgentLower, customExcludeUseragents)) {
return IabResponse.customExcludeCheckFailed();
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How sure are you that these two new checks go before the ipRanges check? Did you consider whether it makes more sense for the enrichment to check ip range first before checking the user agent overrides?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, product-wise it makes more sense to have ip range check first, I'll update in a new commit

public IabResponse checkAt(String userAgent, InetAddress ipAddress, Date accurateAt) {
assertCheckAtArguments(userAgent, ipAddress, accurateAt);

String userAgentLower = IabFile.toLowerCase(userAgent);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you covert the user agent string to lower case. Which is.... kinda ok, but not completely free.

And later the code calls includeUserAgents.present(userAgent, accurateAt) which again converts the string to lower case.

Here's a suggestion for a different way to solve the problem, which means you can avoid adding in an extra toLowerCase. Instead of adding two new fields customIncludeUseragents and customExludeUseragents to this class... how about instead you re-use the existing fields includeUserAgents and excludeUserAgents to hold the custom overrides. Then the implementation of checkAt can remain unchanged, there is literally nothing to change about checkAt.

Copy link
Member Author

@oguzhanunlu oguzhanunlu Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the point but we lose override semantics of custom fields if we merge custom into existing fields. An example:

includeUseragents = ["TrustedBot"]
excludeUseragents = ["Bot"]
userAgent = "TrustedBot/1.0"

include matches, then exclude also matches, returning bot with reason FAILED_UA_EXCLUDE whereas user would expect category BROWSER, if custom field is expected to override iab files. Does that align with expected semantics?

I shared another way to get rid of duplicate conversion in a new commit

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the example - that helps clarify the semantic difference.

But I want to push back on whether the current PR's semantics are actually the right design. You've implemented "custom include wins outright" - if a UA matches a custom include pattern, it's immediately classified as a browser, bypassing all other checks including custom exclude.

Consider this scenario with the current design:

customIncludeUseragents = ["Chrome"]
customExcludeUseragents = ["BadBot"]
userAgent = "BadBot/1.0 (compatible; Chrome)"

Result: browser (because "Chrome" matches first and returns immediately)

That seems problematic - the user explicitly wanted to block "BadBot", but it slips through.

The existing IAB system doesn't work this way. A UA can match the include list (known browsers) but still be blocked by the exclude list (known bots). The exclude list acts as a safeguard. That's more conservative and arguably safer.

If we merged custom patterns into the existing fields (my original suggestion), we'd get the same semantics as the IAB files: must match include AND not match exclude. Your counter-example:

includeUseragents = ["TrustedBot"]
excludeUseragents = ["Bot"]
userAgent = "TrustedBot/1.0" → blocked

I'd argue this isn't a bug - it's the system correctly handling conflicting patterns by erring on the side of caution. If the customer wants "TrustedBot" allowed, they shouldn't also add "Bot" as a broad exclude pattern.

So I think there are two questions here:

  1. Product question: Should custom include "win outright", or should it follow the same semantics as the existing IAB system? (Might need PM input on this)
  2. Performance question: Regardless of which semantics we choose, can we avoid the duplicate toLowerCase call?

What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me tag @stanch for the product question. For the performance question, I pushed a new commit to avoid multiple calls, does that address your comment as expected?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the stated use cases for customizing this enrichment is that our server-side trackers are being flagged as bots and not all of them allow configuring the user agent string... So the user should be able to override that.

I think if we clearly document the new options as overrides, the logic of them trumping the file-based lists is easy to follow. I don’t really expect the include override to be used for anything other than the use case above, so I don’t think someone will put Chrome there.

@oguzhanunlu oguzhanunlu requested a review from istreeter February 6, 2026 09:53
Copy link

@istreeter istreeter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is starting to look good @oguzhanunlu. I have one minor criticism and one question.

The minor criticism is the lower-case a in parameters like excludeUseragents. It seems like this library has an established style which is to upper-case Agents e.g. excludeUserAgent. So it stands out as unusual that you have have added parameters into this library which disregard the style.

And my question is about the relative order of IP check and custom includes/excludes check. Previously I questioned if you had them in the right order, so then you switched the order. But my question then was only a question. Are you sure now they are in the correct order, or is this something that deserves a discussion?

@oguzhanunlu oguzhanunlu requested a review from istreeter February 6, 2026 12:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants