user-agent week #9

johnspurlock-skymethod · 2022-10-31T23:08:56Z

johnspurlock-skymethod
Oct 31, 2022
Maintainer

Time to evaluate our approach for breaking down the raw user-agent strings in each request. Ideally we'd leverage an existing open library for this.

Two known libraries are:

the opawg User agent list: https://github.com/opawg/user-agents
buzzsprout's Podcast Agents gem https://github.com/buzzsprout/podcast-agent

Today I took an OP3 "core sample" (every OP3 hit for a single UTC day) and started running it through the unmodified opawg patterns to see how well they fit OP3's data, prioritized by the number of downloads.

Detailed results are in a public spreadsheet here:
https://docs.google.com/spreadsheets/d/15ze9SdEoNcHPJUpaaPnhA9ykCiz55QbBxBJRQS05avc/edit?usp=sharing

The opawg pattern set is pretty comprehensive, but we'd want to fix the cases where it returns multiple matches for a single UA, and add the UAs that have no matches at all. One big missing dimension is web apps, which the opawg model does not include at all. Also, the percentage of downloads that match opawg results with no defined os or device dimension seems less than ideal.

Update: Wednesday, 2022-11-02: Performed a similar analysis on the same data using Buzzsprout's library, and added new tabs to the spreadsheet to show the differences. As with opawg, there are changes that could improve completeness, and I also found a UA that triggered multiple matches, something that the Buzzsprout lib explicitly tries to avoid.

One big downside to Buzzsprout's lib is that it's in Ruby. That is not a problem for the main Buzzsprout logic (easily ported), but it relies on the third-party useragent gem to determine "browserness". That part is unfortunately not as simple to port to other environments (it's a hairball of logic, not data) and would have to be replaced completely.

jamescridland · 2022-11-01T07:13:30Z

jamescridland
Nov 1, 2022

As the admin of OPAWG's list... I'd be cautious about criticism that looks like it's criticising the OPAWG list but is actually just criticising the state of podcast useragents...! I look forward to seeing how the Buzzsprout list behaves.

There are disappointingly many podcast apps that don't make a difference between Android and iOS versions, for example. Even Apple Podcasts doesn't make a difference between macOS and iOS.

@markSteadman's original idea for the OPAWG list was to ONLY enter stuff we were sure about. That means that an Android device will never be "phone" or "tablet", since it's both.

The OPAWG list doesn't include referrers. I'd like to, though they're vanishingly small.

The un-attributed foreign alphabet translations of "Podcasts" are all Apple Podcasts, which for some reason translates its useragent. They should be in OPAWG: is that a character encoding issue?

0 replies

johnspurlock-skymethod · 2022-11-01T17:23:53Z

johnspurlock-skymethod
Nov 1, 2022
Maintainer Author

The un-attributed foreign alphabet translations of "Podcasts" are all Apple Podcasts, which for some reason translates its useragent. They should be in OPAWG: is that a character encoding issue?

Taking a quick look at the first one on the list, maybe just bad regex patterns? They all have d$ at the end, which means "ends with the letter d". That isn't correct, even for the examples in that record.
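A minimal sketch of why a trailing d$ fails to match. The pattern and the UA string below are made up for illustration and are not taken from the OPAWG record. One possible explanation (only a guess) is that the pattern was meant to end in \d$, "ends with a digit", and the backslash was lost somewhere along the way.

```python
import re

# Hypothetical pattern and sample UA, for illustration only.
broken = re.compile(r"^Podcasts/.*d$")    # requires a literal trailing "d"
guessed = re.compile(r"^Podcasts/.*\d$")  # requires a trailing digit (assumed intent)

sample_ua = "Podcasts/1650.1 CFNetwork/1333.0.4 Darwin/21.5.0"

print(bool(broken.search(sample_ua)))   # False: the UA does not end with "d"
print(bool(guessed.search(sample_ua)))  # True: the UA ends with a digit
```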

0 replies

johnspurlock-skymethod · 2022-11-01T17:47:23Z

johnspurlock-skymethod
Nov 1, 2022
Maintainer Author

There are disappointingly many podcast apps that don't make a difference between Android and iOS versions

Fair enough for Pocket Casts, Overcast, Player FM, etc, but there are certainly many UAs currently categorized as missing device or os that could be improved. Samsung, Google, and many other Android OEM devices send down model number in many cases!

I broke these (missing os and missing device) down into two new tabs in the spreadsheet so you can see what I mean. Especially for missing os, many/most of those can be IDed with some work (GSA is android, see other UAs with explicit OSes in there as well)

0 replies

johnspurlock-skymethod · 2022-11-01T20:23:59Z

johnspurlock-skymethod
Nov 1, 2022
Maintainer Author

The OPAWG list doesn't include referrers. I'd like to, though they're vanishingly small.

Added a new "referers-for-known-opawg-browsers" spreadsheet tab that lists out the most common referers where opawg returned a single match and it was a known pure browser (non-inappbrowsers or electron).

Overall browser downloads with referers represent 6.47%, or about the same share as Overcast.

The largest referers are show pages; filtering down to just dedicated web apps would be smaller still. I'd still like to track them, though, as this is just going to grow over time, and I consider them an interesting and potentially important future part of the podcast world.

1 reply

jamescridland Nov 1, 2022

I'm going to spend some time in the OPAWG list today to rectify some of these issues. I could do with a better test than my "grab the last 1,000" script.

If I were to add a referer field, that should work for web apps, I hope.

johnspurlock-skymethod · 2022-11-02T23:42:12Z

johnspurlock-skymethod
Nov 2, 2022
Maintainer Author

Performed a similar analysis on the same data using Buzzsprout's library, and added new tabs to the spreadsheet to show the differences. As with opawg, there are changes that could improve completeness, and I also found a UA that triggered multiple matches, something that the Buzzsprout lib explicitly tries to avoid.

One big downside to Buzzsprout's lib is that it's in Ruby. That is not a problem for the main Buzzsprout logic (easily ported), but it relies on the third-party useragent gem to determine "browserness". That part is unfortunately not as simple to port to other environments (it's a hairball of logic, not data) and would have to be replaced completely.

0 replies

jamescridland · 2022-11-03T00:39:17Z

jamescridland
Nov 3, 2022

Interesting to see how the two compare. I can't help but see this as friendly competition. :) My comments...

OPAWG has spotted ten times as many bots as Buzzsprout. This could mean that Buzzsprout's figures are 4% inflated by not correctly matching bots. Or, it could mean the OPAWG data is shit. (Or a mix). I'm quite confident that the bots that I list in OPAWG are bots.
OPAWG has 558 unknown UAs; Buzzsprout has 745 of them. I'm quite pleased about that.
Buzzsprout uses "mobile" rather than "phone". That allows them to move tablets into that category, which is sensible, and therefore means that "android" = "mobile" (which isn't 100% true, but probably close enough). I like this as an idea. Might be worth implementing that.

I believe that OPAWG should no longer trigger multiple matches (I spent some time tightening regex patterns since your work). But, nobody is in charge of the input data, so that won't be the case for long.

One other comment: Buzzsprout's list is theirs, for their own purposes (but great that they open-source it). I would suspect that some changes would require you to make a fork, and maintain a separate list - especially if we get into debate about what a bot is, and what it isn't. The OPAWG list was originally worked-on by Mark Steadman who is no longer involved; I'm notionally the only person really looking after this list (which I gather is used by a number of people including some podcast hosting companies). I would be totally happy for you, and the OP3 project, to own this list going forward.

Thanks for this work. Interesting.

1 reply

johnspurlock-skymethod Nov 3, 2022
Maintainer Author

OPAWG has spotted ten times as many bots as Buzzsprout

Yes! This is provisional though, I have a question out to @tomrossi7 about it - he says that they also filter out obvious bots (like "bot" in the UA string) above and beyond those declared in the data file (and browsers, which are always bot=false).

I can't find this logic in the buzzsprout code itself, so I bet they filter them out before they even hit the library.

tomrossi7 · 2022-11-03T13:06:56Z

tomrossi7
Nov 3, 2022

The way Buzzsprout processes user agents starts with the Cloudflare WAF. From there we block abusive user agents (e.g. invalid ones). These user agents will never be able to download an episode. Then we apply a very top-level regular expression to determine whether a user agent should even be considered: (/bot|spider|crawl|slurp|scan|scrap|archiver|^curl|^Wget|^ruby|^python|^java|httpclient|http-client|BrandVerity|wordpress|httrack/i). User agents that match it are able to download episodes, but those downloads do not affect stats. If a user agent makes it that far, we then use the Podcast Agent code to determine the corresponding podcast agent. We could move that logic into the Podcast Agent code, but then it gets hairy, since you would have a generic entry that you would have to make sure does not match alongside any other regular expression. Does that make sense?

If people want to use the code, we are happy to open source and help maintain it.
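For anyone porting this outside Ruby, here is a rough sketch of that first-pass check in Python. The expression is copied from the comment above; the function name is an assumption for illustration, and this is not Buzzsprout's actual code.

```python
import re

# First-pass filter copied from the comment above; Ruby's /i becomes re.IGNORECASE.
# For single-line UA strings, ^ behaves the same in Ruby and in Python.
FIRST_PASS = re.compile(
    r"bot|spider|crawl|slurp|scan|scrap|archiver|^curl|^Wget|^ruby|^python|^java"
    r"|httpclient|http-client|BrandVerity|wordpress|httrack",
    re.IGNORECASE,
)

def is_filtered(user_agent: str) -> bool:
    """Return True if the UA should be set aside before podcast-agent matching."""
    return FIRST_PASS.search(user_agent) is not None

print(is_filtered("curl/7.79.1"))  # True: starts with "curl"
print(is_filtered("GoldenPod/0.8.4 (GNU/Linux; podcatcher; Using curl)"))  # False
```

Note that ^curl only matches at the start of the string, so a UA that merely mentions curl later on passes through to the main matcher.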

5 replies

johnspurlock-skymethod Nov 3, 2022
Maintainer Author

Thanks Tom - I'll add that first-pass regex to my port and rerun against the OP3 sample again

Btw I did find one UA that matches two of your regexes, it's in the spreadsheet:
GoldenPod/0.8.4 (GNU/Linux; podcatcher; Using curl) ({"name":"GoldenPod"}, {"name":"Curl","bot":true})

This one wouldn't be filtered by your bot regex either, so might be something you want to fix to regain 100% uniqueness in your patterns!
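A small sketch of how a double match like this can be detected. The GoldenPod pattern is the one from the Buzzsprout data file quoted in this thread; the Curl pattern is a hypothetical stand-in, since the real one is not shown here.

```python
import re

# "user_agent_match" is the key used in the Buzzsprout data quoted in this thread.
# The Curl pattern below is an assumption made for illustration.
entries = [
    {"name": "GoldenPod", "user_agent_match": r"\AGoldenPod\/"},
    {"name": "Curl", "user_agent_match": r"curl", "bot": True},
]

def all_matches(user_agent, entries):
    """Return the names of every entry whose pattern matches, not just the first."""
    return [e["name"] for e in entries if re.search(e["user_agent_match"], user_agent)]

ua = "GoldenPod/0.8.4 (GNU/Linux; podcatcher; Using curl)"
print(all_matches(ua, entries))  # ['GoldenPod', 'Curl']
```

Running a check like this over a day of real UAs and flagging any result with more than one name is one way to find overlapping patterns.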

tomrossi7 Nov 3, 2022

We've never seen that UA before. We've only seen GoldenPod/0.8.4 (GNU/Linux; podcatcher; Using LWP) libwwwperl.

If we had, I would update the samples and then refine the regular expressions. We also don't have that marked as a bot?

johnspurlock-skymethod Nov 3, 2022
Maintainer Author

{
    "name": "GoldenPod",
    "user_agent_match": "\\AGoldenPod\\/"
},

GoldenPod actually isn't a bot, it's a command-line podcast client written in perl!

https://www.zerodogg.org/goldenpod/

johnspurlock-skymethod Nov 3, 2022
Maintainer Author

Alright, reran and updated all of the buzzsprout breakdowns including this new first-pass bot regex.

Tom, do you also exclude atc/1.0 at a higher-level before it hits the library? (maybe due to iab?) They are a large portion of the 'Unknown Device' chunk, it's my understanding those can all be safely categorized as watch downloads.

tomrossi7 Nov 3, 2022

@johnspurlock-skymethod sorry, I just updated the code. We used to have the Apple Watch detection in the initial bot regex, but recently moved it into the podcast_agents.yml. Unfortunately, we still don't count these downloads per IAB.

jamescridland · 2022-11-03T23:09:26Z

jamescridland
Nov 3, 2022

Apple Podcasts automatically downloads a copy to the watch if you a) have a watch, and b) use Apple Podcasts. These should be treated as a bot (and I think everyone does) - they're literally duplicate downloads.

However there are other watchOS apps that are valid downloads.

0 replies

johnspurlock-skymethod · 2022-11-12T23:53:55Z

johnspurlock-skymethod
Nov 12, 2022
Maintainer Author

Alright, I started with the OPAWG list, and refactored/transformed it quite a bit, then added additional data and examples from the Buzzsprout gem. The end result is in a new user-agents-v2 repo, currently at skymethod/user-agents-v2. Details about the approach are outlined there in the readme.

@jamescridland, I ended up creating a new v2 repo instead of adding new files to the existing repo, since the files are different enough, and the original repo has many other artifacts that correspond with that file and format - it would just clutter it up and make for a strange combination of artifacts. There is nothing OP3-specific about the v2 repo patterns, and I'd be happy to donate the entire repo and host/maintain it under OPAWG, alongside the original one. If that's fine with you, I'll move it there, and delete the skymethod one.

@tomrossi7, these new patterns perform similarly to the Buzzsprout ones. I added a bunch of new entries and imported many of your test UAs. The devices file is basically your Ruby logic in data form, and the referrers file IDs web apps and host players, similar to yours. The patterns use broadly compatible regular expression syntax, so they should be trivial to run in a Ruby environment, and as a bonus, there's no dep on a 3rd party UA gem to ID browsers!

It's easy to fork and/or simply run your own custom entries file at the beginning of the matching process to add any company-specific logic. Since it's data only, it should be easy for folks to contribute to the JSON files. Every pull request and git push runs automated tests in a GH action that verifies the data and the embedded examples.
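As a rough illustration of the "custom entries first" idea, here is a sketch in Python. All entry names, patterns, and UA strings are hypothetical; only the name and pattern properties come from the entry format discussed in this thread.

```python
import re

def first_match(user_agent, entries):
    """Walk the entries in order and return the name of the first match, if any."""
    for entry in entries:
        if re.search(entry["pattern"], user_agent):
            return entry["name"]
    return None

# Hypothetical entries, for illustration only.
shared_entries = [{"name": "Example Player", "pattern": r"^ExamplePlayer/"}]
custom_entries = [
    {"name": "Internal Monitor", "pattern": r"^ExamplePlayer/.*internal-monitor"},
]

# Company-specific entries go first, so they win over the shared list.
entries = custom_entries + shared_entries

print(first_match("ExamplePlayer/1.0 internal-monitor", entries))  # Internal Monitor
print(first_match("ExamplePlayer/1.0", entries))                   # Example Player
```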

I ran this evolved set of patterns against the same OP3 sample, and added new spreadsheet tabs in that same spreadsheet above (starting with the ua2 tab) to see the results. It was a painstaking process, but I'm really happy with the outcome!

0 replies

jamescridland · 2022-11-14T07:13:37Z

jamescridland
Nov 14, 2022

Thanks, John.

I'm happy for this to be in a new v2 repo in OPAWG if that would be helpful. Benefits are that it is going to be seen as slightly more independent there and not connected to OP3. However, OPAWG itself may have its own concerns from others, so I'd be keen to understand if it would be better separated. What would you recommend, @tomrossi7 ? (Or others reading this?)

We also need to work out how to communicate the deprecation of the existing OPAWG useragent list: it looks doubtful that we'd be able to back-work this format into the existing list, since this new plan is an ordered one (exit as soon as you find a match).

I'd also mention the PRX user agent list which appears regularly updated too, if that adds any useful extras; but given the wealth of data you already have access to, I'm doubtful it'll add much.

0 replies

tomrossi7 · 2022-11-15T18:28:42Z

tomrossi7
Nov 15, 2022

I like what you've done @johnspurlock-skymethod! Would you be open to separating out the examples from the actual regex patterns?

3 replies

johnspurlock-skymethod Nov 15, 2022
Maintainer Author

Any reason why? Keeping the examples next to the pattern itself is a great way to ensure the pattern makes sense. Caught a few bugs in the original patterns this way.

Also if they were external, it would be one more thing that could possibly go wrong / go stale - linking between the two. Presumably the names (and possibly types) would need to be copied over there.

tomrossi7 Nov 15, 2022

The patterns are separate from the data used to test those patterns. The tests ensure that the test data matches the production patterns. It's just our opinion, but we like to load up the samples with plenty of examples and not clutter up our production code with test data.

johnspurlock-skymethod Nov 15, 2022
Maintainer Author

Hmm, I guess one could make the same argument about the other human-focused fields like description, urls, and comments - not really needed for direct use in production, but useful to include in the entry itself.

I actually view the data file more as source, and not a runtime artifact - as it's not exactly efficient to reparse all of the pattern strings into compiled regexes on every user agent evaluation anyway. e.g. for OP3, I generate a single user_agents.ts file (which actually looks surprisingly small to me in this form) based on the source json to be as efficient as possible for that context.
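To illustrate the "source vs. runtime" point in general terms, here is a sketch that compiles the pattern strings once at load time rather than on every evaluation. This is not OP3's generated user_agents.ts; the entries and function name are hypothetical.

```python
import re

# Hypothetical source entries, as they might look after loading the JSON.
SOURCE_ENTRIES = [
    {"name": "Example Player", "pattern": r"^ExamplePlayer/"},
    {"name": "Example Bot", "pattern": r"^ExampleBot/"},
]

# Compile every pattern once, up front, preserving the file order.
COMPILED = [(re.compile(e["pattern"]), e["name"]) for e in SOURCE_ENTRIES]

def match_name(user_agent):
    """Return the name of the first entry whose precompiled regex matches."""
    for regex, name in COMPILED:
        if regex.search(user_agent):
            return name
    return None

print(match_name("ExamplePlayer/2.1 (iOS 16.0)"))  # Example Player
```

How much this saves depends on the language (Python's re module, for instance, keeps its own internal cache of compiled patterns), but the structure carries over to most environments.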

Perhaps we could auto-generate "runtime" JSON entries files in a separate folder using a GH action on every checkin that removed all of the human-focused fields? Is that something you would use?

tomrossi7 · 2022-11-15T21:24:55Z

tomrossi7
Nov 15, 2022

I guess one could make the same argument about the other human-focused fields like description, urls, and comments - not really needed for direct use in production, but useful to include in the entry itself.

I totally agree! That's why we don't include that in ours.

Perhaps we could auto-generate "runtime" JSON entries files in a separate folder using a GH action on every checkin that removed all of the human-focused fields? Is that something you would use?

If we had one file for the patterns and one file for the test data, then we could incorporate it into our code base.

1 reply

johnspurlock-skymethod Nov 15, 2022
Maintainer Author

Curious if you find the separation of the app/bots/libraries/browsers entries into separate files more of a help or a hindrance? I could smash those four together and include a "type" field on every one. But didn't know if having them separate was clearer (kind of clearer to me, also nice to have the uniqueness constraint there) and easier to customize for folks.

But again we could just autogenerate a smashed version as well (without the examples). You don't actually need the test data as a separate file at runtime, right? Just the rest of each entry: name, type, and optional category.

tomrossi7 · 2022-11-15T21:44:43Z

tomrossi7
Nov 15, 2022

Curious if you find the separation of the app/bots/libraries/browsers entries into separate files more of a help or a hindrance?

I really like it! The only one that I would consider smashing together would be the app/libraries, but that may be really unique to Buzzsprout. Do your tests ensure that a user agent can't match multiple times even across multiple files (e.g. it can't match both a library and an app?)

But again we could just autogenerate a smashed version as well (without the examples). You don't actually need the test data as a separate file at runtime, right?

I'm not exactly sure how to best incorporate it into Buzzsprout. Ideally, I could just point the production Ruby code to the files with the patterns and the test Ruby code to the samples. That would be the closest to our current approach.

1 reply

johnspurlock-skymethod Nov 15, 2022
Maintainer Author

Do your tests ensure that a user agent can't match multiple times

I tried to call this out in the quick start, but since these patterns are meant to be super broadly compatible, they don't contain any advanced regex features like lookaheads that aren't available in some programming environments - so they are intended to be iterated in order and just return the first match. Ordering in the files is therefore important in a few cases, I've added comments where that's the case.

The automated tests actually run every single example against that deterministic match algorithm and make sure it lands back on the containing entry.
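The check described above can be sketched roughly like this. It is an illustration, not the repo's actual test code; the entries are hypothetical and assume the name, pattern, and examples properties.

```python
import re

# Hypothetical entries. Order matters: the more specific entry comes first.
entries = [
    {"name": "Example Player Watch", "pattern": r"^ExamplePlayer/.*watchOS",
     "examples": ["ExamplePlayer/2.1 (watchOS 9.0)"]},
    {"name": "Example Player", "pattern": r"^ExamplePlayer/",
     "examples": ["ExamplePlayer/2.1 (iOS 16.0)"]},
]

def first_match(user_agent, entries):
    """Iterate in file order and return the name of the first matching entry."""
    for entry in entries:
        if re.search(entry["pattern"], user_agent):
            return entry["name"]
    return None

def misplaced_examples(entries):
    """Return (example, expected, actual) for examples that land on the wrong entry."""
    failures = []
    for entry in entries:
        for example in entry.get("examples", []):
            actual = first_match(example, entries)
            if actual != entry["name"]:
                failures.append((example, entry["name"], actual))
    return failures

print(misplaced_examples(entries))                  # []
print(misplaced_examples(list(reversed(entries))))  # one failure: wrong order
```

With the order reversed, the watch example is captured by the more general entry, which is exactly the kind of ordering mistake this test is meant to catch.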

johnspurlock-skymethod · 2022-11-15T23:27:21Z

johnspurlock-skymethod
Nov 15, 2022
Maintainer Author

Alright @tomrossi7, just renamed the patterns directory to src in the repo, and added a new build dir. The build dir is generated automatically from the src dir on every push, so no human ever needs to check anything in there.

https://github.com/skymethod/user-agents-v2/tree/master/build

Currently, I'm generating two build artifacts for every source JSON:

a "runtime" file with only the entry properties: name, pattern, and optional category
an "examples" file with only the entry properties: name and examples
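Conceptually, the split looks something like the sketch below. This is an illustration and not the actual GH action code; the sample entry is hypothetical, and the property names are the ones listed above plus a description field standing in for the human-focused ones.

```python
import json

# Properties kept in the "runtime" artifact, per the list above.
RUNTIME_KEYS = ("name", "pattern", "category")

def split_entry(entry):
    """Split one source entry into its runtime part and its examples part."""
    runtime = {key: entry[key] for key in RUNTIME_KEYS if key in entry}
    examples = {"name": entry["name"], "examples": entry.get("examples", [])}
    return runtime, examples

# Hypothetical source entry, for illustration only.
source_entry = {
    "name": "Example Player",
    "pattern": "^ExamplePlayer/",
    "description": "Hypothetical app used for illustration",
    "examples": ["ExamplePlayer/2.1 (iOS 16.0)"],
}

runtime, examples = split_entry(source_entry)
print(json.dumps(runtime))   # {"name": "Example Player", "pattern": "^ExamplePlayer/"}
print(json.dumps(examples))  # {"name": "Example Player", "examples": ["ExamplePlayer/2.1 (iOS 16.0)"]}
```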

Does that work for you?

0 replies

jamescridland · 2022-11-17T11:26:30Z

jamescridland
Nov 17, 2022

I think it best to remove OPAWG's user-agent list and to point those users to this repo - with the hope that we also get some code samples worked out (which is what I like about Buzzsprout and PRX's lists). Would this be the right thing to do? Is there any best-practice on how to gently suggest people move away?

3 replies

johnspurlock-skymethod Nov 17, 2022
Maintainer Author

What are your thoughts on moving this to the OPAWG GitHub org (i.e. opawg/user-agents-v2)? I think it's likely to get more contributions there, and would benefit from a stable spot. If I were to keep it here, I'd probably rename it to something like op3-user-agents. Happy to move it over to OPAWG if you're fine with the structure of it.

Re: actual code, I'd want to tread carefully. The Buzzsprout repo only has a Ruby example (since it's a Ruby gem), and the PRX repo is tied to Node. The nice thing about a data-only repo is that it works the same in any programming context, whereas code examples get stale fairly quickly.

I also worry about people copying and pasting code from an official examples page that is ok for basic ad hoc exploration, but really slow when run many times against an actual dataset, so any sort of official examples page would have to lay that out.

jamescridland Nov 17, 2022

Totally happy to move it there if you'd like. Just aware that the two will begin to diverge if we're not careful, and I'd like yours to be the one that people use.

johnspurlock-skymethod Nov 17, 2022
Maintainer Author

Ok great - I'll create a new opawg/user-agents-v2 over there and migrate to it. Since I'm monitoring all changes to the opawg repo, I can make sure any changes to opawg/user-agents are reflected in v2 as well, for as long as the original one is still around (the entry format is very similar).

johnspurlock-skymethod · 2022-11-18T00:38:40Z

johnspurlock-skymethod
Nov 18, 2022
Maintainer Author

Ok, migrated the repo to a new opawg/user-agents-v2 repo!

I cleared out the old repo and added a big readme pointer to the final location.

0 replies

Replies: 16 comments · 15 replies
