user-agent week #9
Replies: 16 comments 15 replies
-
As the admin of OPAWG's list... I'd be cautious about criticism that looks like it's aimed at the OPAWG list but is actually aimed at the state of podcast useragents...! I look forward to seeing how the Buzzsprout list behaves. Disappointingly many podcast apps don't distinguish between their Android and iOS versions, for example. Even Apple Podcasts doesn't distinguish between macOS and iOS. @markSteadman's original idea for the OPAWG list was to ONLY enter stuff we were sure about. That means that an Android device will never be "phone" or "tablet", since it's both. The OPAWG list doesn't include referrers. I'd like it to, though they're vanishingly small. The un-attributed foreign alphabet translations of "Podcasts" are all Apple Podcasts, which for some reason translates its useragent. They should be in OPAWG: is that a character encoding issue?
-
Taking a quick look at the first one on the list, maybe just bad regex patterns? They all have |
-
Fair enough for Pocket Casts, Overcast, Player FM, etc, but there are certainly many UAs currently categorized as missing device or os that could be improved. Samsung, Google, and many other Android OEM devices send the model number in many cases! I broke these (missing os and missing device) down into two new tabs in the spreadsheet so you can see what I mean. Especially for missing os, many/most of those can be IDed with some work (GSA is Android, and other UAs in there include an explicit OS as well).
-
Added a new "referers-for-known-opawg-browsers" spreadsheet tab that lists the most common referers where opawg returned a single match and it was a known pure browser (not an in-app browser or Electron). Overall, browser downloads with referers represent 6.47%, or about the same share as Overcast. The largest referers are show pages. Filtering down to just dedicated webapps would be smaller still, although I'd still like to track them, as this is just going to grow over time, and I consider them an interesting and potentially important future part of the podcast world.
-
Performed a similar analysis on the same data using Buzzsprout's library, and added new tabs to the spreadsheet to see the differences. As with opawg, there are improvements that could be made to improve completeness, and I also found a UA that triggered multiple matches! Something that the Buzzsprout lib explicitly tries to avoid. One big downside to Buzzsprout's lib is that it's in Ruby. That is not a problem for the main Buzzsprout logic (easily ported), but it relies on the 3p useragent gem for determining "browserness". This is unfortunately not as simple to port to other environments (it's a hairball of logic, not data), and would have to be replaced completely.
-
Interesting to see how the two compare. I can't help see this as friendly competition. :) My comments...
I believe that OPAWG should no longer trigger multiple matches (I spent some time tightening regex patterns since your work). But nobody is in charge of the input data, so that won't be the case for long. One other comment: Buzzsprout's list is theirs, for their own purposes (but great that they open-source it). I would suspect that some changes would require you to make a fork and maintain a separate list, especially if we get into debate about what a bot is, and what it isn't. The OPAWG list was originally worked on by Mark Steadman, who is no longer involved; I'm notionally the only person really looking after this list (which I gather is used by a number of people, including some podcast hosting companies). I would be totally happy for you, and the OP3 project, to own this list going forward. Thanks for this work. Interesting.
-
The way Buzzsprout processes user agents starts with the Cloudflare WAF. From there we block abusive user agents (e.g. invalid). These user agents will never be able to download an episode. Then we apply a very top-level regular expression to determine if it should even be considered: ( If people want to use the code, we are happy to open source and help maintain it. |
-
Apple Podcasts automatically downloads a copy to the watch if you a) have a watch, and b) use Apple Podcasts. These should be treated as a bot (and I think everyone does) - they're literally duplicate downloads. However there are other watchOS apps that are valid downloads. |
-
Alright, I started with the OPAWG list, and refactored/transformed it quite a bit, then added additional data and examples from the Buzzsprout gem. The end result is in a new @jamescridland, I ended up creating a new v2 repo instead of adding new files to the existing repo, since the files are different enough, and the original repo has many other artifacts that correspond with that file and format - it would just clutter it up and make for a strange combination of artifacts. There is nothing OP3-specific about the v2 repo patterns, and I'd be happy to donate the entire repo and host/maintain it under OPAWG, alongside the original one. If that's fine with you, I'll move it there, and delete the skymethod one. @tomrossi7, these new patterns perform similarly to the Buzzsprout ones, added a bunch of new entries, and imported many of your test UAs. The It's easy to fork and/or simply run your own custom entries file at the beginning of the matching process to add any company-specific logic. Since it's data only, should be easy for folks to contribute to the JSON files, every pull request and I ran this evolved set of patterns against the same OP3 sample, and added new spreadsheet tabs in that same spreadsheet above (starting with the ua2 tab) to see the results. It was a painstaking process, but I'm really happy with the outcome! |
-
Thanks, John. I'm happy for this to be in a new v2 repo in OPAWG if that would be helpful. Benefits are that it is going to be seen as slightly more independent there and not connected to OP3. However, OPAWG itself may have its own concerns from others, so I'd be keen to understand if it would be better separated. What would you recommend, @tomrossi7 ? (Or others reading this?) We also need to work out how to communicate the deprecation of the existing OPAWG useragent list: it looks doubtful that we'd be able to back-work this format into the existing list, since this new plan is an ordered one (exit as soon as you find a match). I'd also mention the PRX user agent list which appears regularly updated too, if that adds any useful extras; but given the wealth of data you already have access to, I'm doubtful it'll add much. |
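To make the "ordered" point concrete, here is a minimal Python sketch of first-match-wins classification, with a company-specific entries list placed ahead of the shared one as described earlier in the thread. The entry names, patterns, and the `classify` helper are hypothetical illustrations, not the actual v2 file format or contents.

```python
import re
from typing import Optional

# Hypothetical entries, for illustration only. They are not copied from
# the OPAWG, Buzzsprout, or v2 pattern files.
SHARED_ENTRIES = [
    {"name": "ExampleApp", "pattern": r"ExampleApp/"},
    {"name": "Generic browser", "pattern": r"^Mozilla/"},
]

# Hypothetical company-specific entries, checked before the shared list.
CUSTOM_ENTRIES = [
    {"name": "Internal monitor", "pattern": r"^ExampleCo-Monitor/"},
]


def first_match(user_agent: str, entries: list) -> Optional[str]:
    """Return the name of the first entry whose pattern matches, else None."""
    for entry in entries:
        if re.search(entry["pattern"], user_agent):
            return entry["name"]  # exit as soon as a match is found
    return None


def classify(user_agent: str) -> Optional[str]:
    """Check custom entries first, then the shared list, in order."""
    return first_match(user_agent, CUSTOM_ENTRIES + SHARED_ENTRIES)
```

With this approach a UA such as `Mozilla/5.0 (Linux; Android 13) ExampleApp/2.1` resolves to the app entry only because that entry is listed before the generic browser entry. An unordered list would report both matches for the same string, which is why the two formats are hard to reconcile.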
-
I like what you've done @johnspurlock-skymethod! Would you be open to separating out the examples from the actual regex patterns? |
-
I totally agree! That's why we don't include that in ours.
If we had one file for the patterns and one file for the test data, then we could incorporate it into our code base.
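A rough Python sketch of what that split could look like: one JSON document holding only patterns, another holding only example UAs with the name each is expected to resolve to. The JSON shapes, field names, and entries here are assumptions made up for illustration, not the real files.

```python
import json
import re

# Hypothetical patterns file: data only, no examples.
PATTERNS_JSON = """
{"entries": [
  {"name": "ExampleApp", "pattern": "^ExampleApp/"},
  {"name": "ExampleBot", "pattern": "ExampleBot"}
]}
"""

# Hypothetical examples file: used by tests only, never loaded in production.
EXAMPLES_JSON = """
{"examples": [
  {"user_agent": "ExampleApp/1.2 (iPhone)", "expected": "ExampleApp"},
  {"user_agent": "Mozilla/5.0 (compatible; ExampleBot/0.1)", "expected": "ExampleBot"}
]}
"""


def load_patterns(text):
    """Compile each entry's pattern once, keeping the file order."""
    return [(e["name"], re.compile(e["pattern"])) for e in json.loads(text)["entries"]]


def failing_examples(patterns, examples_text):
    """Return (user_agent, expected, actual) for every example that misclassifies."""
    failures = []
    for example in json.loads(examples_text)["examples"]:
        ua = example["user_agent"]
        actual = next((name for name, rx in patterns if rx.search(ua)), None)
        if actual != example["expected"]:
            failures.append((ua, example["expected"], actual))
    return failures
```

Production code would only need `load_patterns`, while a test suite would call `failing_examples` and expect an empty list.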
-
I really like it! The only one that I would consider smashing together would be the app/libraries, but that may be really unique to Buzzsprout. Do your tests ensure that a user agent can't match multiple times even across multiple files (e.g. it can't match both a library and an app?)
I'm not exactly sure how to best incorporate it into Buzzsprout. Ideally, I could just point the production Ruby code to the files with the patterns and the test Ruby code to the samples. That would be the closest to our current approach. |
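One way such a cross-file check could be written, sketched in Python under the assumption that each pattern file is loaded as a list of name/pattern entries. The file names, entries, and helper functions are hypothetical and say nothing about how the repo's own tests work.

```python
import re

# Hypothetical pattern files keyed by file name, for illustration only.
PATTERN_FILES = {
    "apps": [{"name": "ExampleApp", "pattern": r"^ExampleApp/"}],
    "libraries": [{"name": "ExampleHTTP", "pattern": r"ExampleHTTP/"}],
}


def all_matches(user_agent, pattern_files):
    """Return every (file_name, entry_name) pair that matches the UA."""
    return [
        (file_name, entry["name"])
        for file_name, entries in pattern_files.items()
        for entry in entries
        if re.search(entry["pattern"], user_agent)
    ]


def find_conflicts(sample_uas, pattern_files):
    """Map each UA that matches more than one entry to its list of matches."""
    conflicts = {}
    for ua in sample_uas:
        matches = all_matches(ua, pattern_files)
        if len(matches) > 1:
            conflicts[ua] = matches
    return conflicts
```

Run against the sample UAs, a test would then assert that `find_conflicts` returns an empty dict. In this toy data, `ExampleApp/1.0 ExampleHTTP/2.3` would be reported because it matches both an app and a library.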
-
Alright @tomrossi7, just renamed the https://github.com/skymethod/user-agents-v2/tree/master/build Currently, I'm generating two build artifacts for every source JSON:
Does that work for you? |
-
I think it best to remove OPAWG's user-agent list and to point those users to this repo - with the hope that we also get some code samples worked out (which is what I like about Buzzsprout and PRX's lists). Would this be the right thing to do? Is there any best-practice on how to gently suggest people move away? |
-
Ok, migrated the repo to a new I cleared out the old repo and added a big readme pointer to the final location. |
-
Time to evaluate our approach for breaking down the raw user-agent strings in each request. Ideally we'd leverage an existing open library for this.
Two known libraries are:
Today I took an OP3 "core sample" (every OP3 hit for a single UTC day), and started running it through the unmodified opawg patterns to see how well they fit OP3's data, prioritized by the number of downloads.
Detailed results are in a public spreadsheet here:
https://docs.google.com/spreadsheets/d/15ze9SdEoNcHPJUpaaPnhA9ykCiz55QbBxBJRQS05avc/edit?usp=sharing
The opawg pattern set is pretty comprehensive, but we'd want to fix the cases where it returns multiple matches for a single UA, and add the UAs that have no matches at all. One big missing dimension is web apps, not something included in the opawg model at all. Also the percentage of downloads that match opawg results with no defined "os" or "device" dimension seems less than ideal.
Update: Wednesday, 2022-11-02: Performed a similar analysis on the same data using buzzsprout's library, and added new tabs to the spreadsheet to see the differences. As with opawg, there are improvements that could be made to improve completeness, and also found a UA that triggered multiple matches! Something that the buzzsprout lib explicitly tries to avoid.
One big downside to Buzzsprout's lib is that it's in Ruby. That is not a problem for the main Buzzsprout logic (easily ported), but it relies on the 3p useragent gem for determining "browserness". This is unfortunately not as simple to port to other environments (it's a hairball of logic, not data), and would have to be replaced completely.
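For anyone wanting to repeat this kind of analysis on their own logs, here is a rough Python sketch of the tallying step: count downloads per UA, bucket each UA by how many patterns it matches (none, single, multiple), and list the problem UAs in descending download order. The patterns and sample hits are invented for illustration, and this is not the code used to produce the spreadsheet.

```python
import re
from collections import Counter

# Hypothetical unordered pattern list; the second pattern is deliberately
# loose so that one sample UA matches twice.
PATTERNS = [
    ("ExampleApp", re.compile(r"^ExampleApp/")),
    ("ExampleLib", re.compile(r"ExampleLib")),
]


def match_names(user_agent):
    """Return the names of all patterns that match the UA."""
    return [name for name, rx in PATTERNS if rx.search(user_agent)]


def summarize(hits):
    """hits: one UA string per download. Returns (bucket totals, UAs needing work)."""
    downloads_per_ua = Counter(hits)
    buckets = Counter()
    needs_work = []
    # most_common() walks UAs from most to fewest downloads.
    for ua, downloads in downloads_per_ua.most_common():
        count = len(match_names(ua))
        bucket = "none" if count == 0 else "single" if count == 1 else "multiple"
        buckets[bucket] += downloads
        if bucket != "single":
            needs_work.append((ua, downloads, bucket))
    return buckets, needs_work
```

Weighting by downloads rather than by distinct UA strings keeps the focus on the patterns that affect the most traffic.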