Collect simple, anonymous, constructive analytics #4

@niloc132

When configured as documented, logs are collected by nginx-proxy, with the standard NCSA log details plus the hostname of the vhost (real log line from production, with some specifics redacted):

nginx.1    | www.gwtproject.org 1.2.3.4 - - [04/Mar/2023:14:00:00 +0000] "GET /javadoc/latest/com/google/gwt/core/ext/Linker.html HTTP/1.1" 200 25683 "-" "User agent string logged here"

To better serve the GWT project while respecting the privacy of users, we don't need all of this information, but we should still collect at least some parts of it. In the interest of transparency, we probably want to publish at least coarse details of the project's traffic, so that we know which resources are being used, the volume of requests, where we're seeing 404s or redirects, etc. Optimizing for traffic isn't the goal; these statistics should be used for spotting bugs, ensuring resources are not abused, curiosity, etc.

Breaking down the example log line from above, and considering what would be helpful and how:

  • nginx.1 | - this prefix is added by the forego process. For better or worse, there are terminal control characters that add color here, which need to be accounted for. The nginx-proxy container also logs docker-gen details whenever changes require the nginx configuration to be updated and nginx restarted. The output should be filtered to only include nginx contents.
  • www.gwtproject.org - the Host/:authority headers in the request, indicating which host has a request coming to it. Should be filtered to only include gwtproject domain names, and eventual output should be grouped by the same domain name.
  • 1.2.3.4 - client's IP address. We should probably remove this entirely, I can't think of a purpose it would serve, except to link pages loaded within a session by a user, to see how they explore, or what page was linking to a 404 resource. I'm inclined to err on the side of skipping it entirely until we have a better reason to keep it and sufficiently anonymize it.
  • - - RFC 1413 user identity; can be ignored, since we won't collect this.
  • - - user id of the client; can be ignored, since we won't collect this.
  • [04/Mar/2023:14:00:00 +0000] - timestamp of the request. My first thought is that we likely only need to bucket these into (for example) 1/5/60 minute intervals rather than publish/record exact timestamps.
  • "GET /javadoc/latest/com/google/gwt/core/ext/Linker.html HTTP/1.1" - the normalized request line, indicating the HTTP method, the requested path, and the HTTP version. This is all probably important to keep.
  • 200 - the HTTP status code of the response. Important to keep.
  • 25683 - size in bytes of the response body. In our case, with no dynamic resources, this probably is not very informative, as the path itself should make it clear what the size was, but there also doesn't seem to be a downside to providing it (and it lets any analysis avoid actually needing to join against the current size of all files).
  • "-" - the referrer to this resource. Browsers are increasingly strict about when this is sent, but we may still want to filter so that only other gwtproject domains are listed, so that we can work backwards to find out where 404s or the like are coming from. In theory there could be value in seeing what is linking to our docs, in practice many browsers already omit this, and it is easy to "game".
  • "User agent string logged here" - user agent string (omitted here, obviously). It might be constructive to either flag or filter out "bot" user agents, not because we imagine that this will make our measurements actually accurate, but at least to avoid confusing "poor bot behavior" with "actual issues that users are encountering". It might also be good to strip this out entirely from the final output, or to null out this value unless the user-agent string is (for example) in the top 5% of observed strings. Very unscientific analysis:
    • For reference, of the last 100000 hits (about 26 hrs of data) to gwtproject, there were 1201 unique user agent strings
    • 625 calls (27th most popular) had no user agent defined
    • Of the top 10, three were explicitly bots, two were different Java 17 versions (eclipse plugin updates, dtd checks, and terms.html), three were on windows, one mac, and one linux.
    • I put very little stock in this, as the 11th most popular was IE8 on windows 7.
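
The field-by-field breakdown above can be sketched as a small filter. This is only a rough illustration, not a proposed implementation: it assumes the log layout matches the example line exactly, that a 5-minute bucket is acceptable, and that a substring check for "bot" is a good enough first pass at flagging bot user agents. The names `sanitize`, `is_gwtproject_host`, and `BUCKET_MINUTES` are made up for this sketch.

```python
import re
from datetime import datetime, timezone
from urllib.parse import urlsplit

# Terminal color sequences (e.g. "\x1b[32m") wrapped around the forego prefix.
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")

# Assumed layout, taken from the example line:
# nginx.N | host ip ident user [time] "request" status bytes "referrer" "agent"
LINE_RE = re.compile(
    r'^nginx\.\d+\s*\|\s*'
    r'(?P<host>\S+) (?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)

BUCKET_MINUTES = 5  # assumption: one of the 1/5/60 minute options


def is_gwtproject_host(host):
    host = (host or "").lower()
    return host == "gwtproject.org" or host.endswith(".gwtproject.org")


def referrer_host(referrer):
    try:
        return urlsplit(referrer).hostname
    except ValueError:  # malformed URL in the Referer header
        return None


def sanitize(raw_line):
    """Return a reduced record for one log line, or None if it is dropped."""
    line = ANSI_RE.sub("", raw_line).strip()
    match = LINE_RE.match(line)
    if match is None:
        return None  # docker-gen output or anything else that is not an access log line
    if not is_gwtproject_host(match["host"]):
        return None
    parts = match["request"].split(" ")
    if len(parts) != 3:
        return None  # not a well-formed "METHOD path HTTP/x" request line
    method, path, http_version = parts

    # Note: %b relies on English month abbreviations in the current locale.
    ts = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
    ts = ts.astimezone(timezone.utc)
    bucket = ts.replace(minute=ts.minute - ts.minute % BUCKET_MINUTES,
                        second=0, microsecond=0)

    referrer = match["referrer"]
    if not is_gwtproject_host(referrer_host(referrer)):
        referrer = None  # only keep referrers from gwtproject domains

    # The client IP, ident, user id, and raw user agent are deliberately not copied.
    return {
        "host": match["host"].lower(),
        "bucket": bucket.isoformat(),
        "method": method,
        "path": path,
        "http_version": http_version,
        "status": int(match["status"]),
        "size": int(match["size"]) if match["size"].isdigit() else None,
        "referrer": referrer,
        "is_bot": "bot" in match["agent"].lower(),  # crude heuristic, assumption
    }
```

Dropping the IP and raw user agent at this stage means nothing downstream can leak them, at the cost of not being able to revisit that decision for already-processed logs.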

The first step is probably to replace the NCSA log directive currently in use with something more specific (removing fields we don't want to use), then put some filtering/batching/bucketing downstream, then publish the results.
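
As a rough idea of what the downstream batching step could look like, the sketch below counts hits per (bucket, path, status, user agent) and groups them by host. It assumes each record is a dict with `host`, `bucket`, `path`, `status`, and `agent` keys, which is a made-up shape for illustration. It also reads "top 5% of observed strings" as "the most frequent 5% of distinct user agent strings", which is only one possible interpretation. The function names are hypothetical.

```python
import math
from collections import Counter


def common_agents(records, top_fraction=0.05):
    """Return the most frequent `top_fraction` of distinct user agent strings."""
    hits = Counter(r["agent"] for r in records if r["agent"])
    keep = math.ceil(len(hits) * top_fraction)
    return {agent for agent, _ in hits.most_common(keep)}


def summarize(records, top_fraction=0.05):
    """Group hit counts by host, nulling out uncommon user agent strings."""
    records = list(records)
    keep = common_agents(records, top_fraction)

    counts = Counter()
    for r in records:
        # Rare (or missing) user agents are replaced with None before counting.
        agent = r["agent"] if r["agent"] in keep else None
        counts[(r["host"], r["bucket"], r["path"], r["status"], agent)] += 1

    grouped = {}
    # most_common() yields the busiest rows first.
    for (host, bucket, path, status, agent), n in counts.most_common():
        grouped.setdefault(host, []).append((bucket, path, status, agent, n))
    return grouped
```

Only the grouped counts would be candidates for publishing; the per-request records would stay internal or be discarded once summarized.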
