|
| 1 | +# RFC 138: support subsuites in wptrunner |
| 2 | + |
| 3 | +## Summary |
| 4 | + |
| 5 | +Support running tests in multiple different browser configurations in |
| 6 | +a single wptrunner invocation. |
| 7 | + |
| 8 | +## Details |
| 9 | + |
| 10 | +It is common for browser vendors to want to run web platform tests |
| 11 | +with particular configurations of the browser. For example when |
| 12 | +developing a major new feature, or refactoring existing code, they may |
| 13 | +want to run with non-default configuration to enable the new |
| 14 | +code paths. Often, for cost and efficiency reasons, these test runs |
| 15 | +will initially be limited to just the tests most likely to be affected |
| 16 | +by the changes e.g. a change to session history is highly likely to |
| 17 | +affect tests involving navigation, but highly unlikely to affect CSS |
| 18 | +layout. |
| 19 | + |
| 20 | +Typically vendors wanting to enable multiple configurations had to |
| 21 | +define those configurations externally, and invoke wptrunner multiple |
| 22 | +times, once per configuration. This has some advantages; in particular |
| 23 | +it keeps the vendor-specific issue of browser configurations cleanly |
| 24 | +separated from the actual business of running tests. However there are |
| 25 | +also a couple of disadvantages: |
| 26 | + |
| 27 | +* When configurations are handled externally, they are either written |
| 28 | + into a CI configuration, which is then difficult to run locally, or |
| 29 | + the vendor is responsible for writing a substantial frontend on top |
| 30 | + of wptrunner which handles invoking it for each of the different |
| 31 | + configurations. |
| 32 | + |
| 33 | +* Even with a vendor frontend it's difficult to schedule the tests |
| 34 | + optimally across multiple cores. This is because you essentially |
| 35 | + have two choices on how to handle the different |
| 36 | + configurations. Option one is to run each configuration in serial, |
| 37 | + with each wptrunner process using all available cores. This can be |
| 38 | + wasteful when you have many configurations, some of which only run a |
| 39 | + small number of tests, because those configurations only use a small |
| 40 | + fraction of the available compute, but can't be scheduled in |
| 41 | + parallel. Option two is to invoke many wptrunner in parallel processes, each |
| 42 | + responsible for running a small fraction of the total tests in each |
| 43 | + configuration. However wptrunner does a lot of global setup (reading |
| 44 | + the test manifest, starting wptserve), and repeating this work per |
| 45 | + process would be very inefficient. |
| 46 | + |
| 47 | +So in order to meet this use case there are two obvious options: |
| 48 | + |
| 49 | +1. Refactor wptrunner to allow it to be driven by an external test |
| 50 | + scheduler without having to redo all the global setup for each |
| 51 | + group of tests. |
| 52 | + |
| 53 | +1. Push the notion of multiple configurations down into wptrunner |
| 54 | + itself, and update the existing test scheduling logic to handle |
| 55 | + multiple browser configurations. |
| 56 | + |
| 57 | +In practice people are currently using wptrunner as a monolith, with |
| 58 | +test selection and browser setup provided by command line options and |
| 59 | +configuration files, so although the architecture would probably allow |
| 60 | +adopting a more service-based model, it's more in keeping with existing |
| 61 | +usage patterns to implement support for multiple configurations |
| 62 | +directly in wptrunner. |
| 63 | + |
| 64 | +### Implementation Strategy |
| 65 | + |
| 66 | +From an implementation point of view, the main constraint is around |
| 67 | +how we associate test results with a specific configuration. |
| 68 | + |
| 69 | +Our test data flow is built around mozlog. This groups tests into |
| 70 | +"suites". Each suite is started with a `suite_start` event ends with a |
| 71 | +`suite_end` event. The `suite_start` event contains information, |
| 72 | +particularly in the `run_info` key about the browser being tested. That |
| 73 | +`run_info` data is used in wptrunner metadata files to determine which |
| 74 | +metadata applies to a specific test run (e.g. to allow different |
| 75 | +expected results on different platforms). Specific test results are |
| 76 | +associated to a suite implicitly by sequence i.e. any `test_start` log |
| 77 | +event coming after a specific `suite_start` is grouped within that |
| 78 | +suite. This implicitly means that only a single suite can be running |
| 79 | +in the context of a specific logger at a specific time. That is |
| 80 | +incompatible with the requirement to run multiple configurations in |
| 81 | +parallel. |
| 82 | + |
| 83 | +There are three obvious approaches we could use to address this: |
| 84 | + |
| 85 | +1. Use multiple loggers, one per configuration. In this case each |
| 86 | + configuration would produce a separate stream of log events that |
| 87 | + could be written to a separate file. However mozlog is really |
| 88 | + designed around the assumption of a single top-level logger per |
| 89 | + process, which is reflected in the way output is configured with |
| 90 | + command line options. Since we'd presumably still want a single |
| 91 | + top-level output to stdout, we'd still need to solve the problem of |
| 92 | + combining multiple log streams into one top-level stream, which |
| 93 | + would require a mechanism to relax the one-suite-at-a-time |
| 94 | + constraint. We would also need to work out how to pass specific |
| 95 | + configuration (e.g. output filenames) down to specific loggers. |
| 96 | + |
| 97 | +1. Put configuration information inside the test id, so that from the |
| 98 | + point of view of consumers, the same test run in different |
| 99 | + configurations look like different tests. This is how the Chromium |
| 100 | + CI system handles this problem with the concept of "virtual" |
| 101 | + testsuites. It has the advantage that it requires minimal changes |
| 102 | + to data structures; tests get an id like `/<prefix>/<configuration |
| 103 | + name>/<actual test id>`, so from the point of view of consumers |
| 104 | + these are different tests. However this is also a disadvantage; |
| 105 | + every consumer that wants to process the results (e.g. to compare |
| 106 | + results between configurations) has to know about the details of |
| 107 | + how to parse the test id to extract the "real" test and |
| 108 | + configuration name. This goes directly against the mozlog design |
| 109 | + principle of providing data in a structured form to minimise the |
| 110 | + amount of error-prone parsing required in consumers. It also makes |
| 111 | + these configurations work in a way that's markedly different to |
| 112 | + other kinds of configuration data (e.g. platform name, whether |
| 113 | + we're running a debug build, etc.); that data is represented in the |
| 114 | + `run_info` rather than as part of the test id. |
| 115 | + |
| 116 | +1. Add the concept of multiple "subsuites" directly in the mozlog data |
| 117 | + model and explicitly link other log actions which depend on the |
| 118 | + configuration to the subsuite. This has the disadvantage that it |
| 119 | + requires updates to consumers which care about subsuite membership, |
| 120 | + but the advantage that over the longer term there's a clearly |
| 121 | + defined data model which doesn't require additional parsing to |
| 122 | + understand test identity. It also allows us to treat subsuite |
| 123 | + configuration like other kinds of configuration data, by |
| 124 | + associating each subsuite with particular `run_info` data. |
| 125 | + |
| 126 | +The proposal is to adopt the third approach, and update mozlog to |
| 127 | +allow it to represent the concept of subsuites, and wptrunner to allow |
| 128 | +scheduling tests from multiple subsuites in parallel. |
| 129 | + |
| 130 | +### Mozlog Changes |
| 131 | + |
| 132 | +For mozlog, we propose allowing each top level test suite (represented |
| 133 | +by a `suite_start` / `suite_end` event pair) to also have associated |
| 134 | +subsuites. Each subsuite will define updates to the top-level suite |
| 135 | +`run_info` data for the configuration in that subsuite. Log actions |
| 136 | +which are associated with a specific test or set of tests will get a |
| 137 | +`subsuite` property that explicitly names the subsuite they're |
| 138 | +associated with. |
| 139 | + |
| 140 | +In detail this means one extra log event: |
| 141 | + |
| 142 | +```py |
| 143 | +add_subsuite(name: str, run_info: Mapping[str, Any]) |
| 144 | +``` |
| 145 | + |
| 146 | +`name` is required and must be unique within a given suite. `run_info` |
| 147 | +is optional and represents an update to the `run_info` of the |
| 148 | +suite. In addition the logger adds a `subsuite` key to the `run_info`, |
| 149 | +set to the name of the `subsuite` (TODO: should this be implicit, or |
| 150 | +should consumers get freedom in how to handle this?). |
| 151 | + |
| 152 | +Plus `subsuite` keys added to the following log events: |
| 153 | + |
| 154 | +* `test_start` |
| 155 | +* `test_end` |
| 156 | +* `test_status` |
| 157 | +* `assertion_count` |
| 158 | +* `lsan_leak` |
| 159 | +* `lsan_summary` |
| 160 | +* `mozleak_object` |
| 161 | +* `mozleak_total` |
| 162 | + |
| 163 | +In all cases this key is optional, and when missing it's assumed that |
| 164 | +the event is associated with the default (unnamed) subsuite. Mozlog |
| 165 | +will enforce the property that the subsuite name corresponds to a |
| 166 | +subsuite that has previously been added with `add_subsuite`. |
| 167 | + |
| 168 | +### wptrunner changes |
| 169 | + |
| 170 | +wptrunner already has the concept of test groups. These are sets of |
| 171 | +tests which run in a single browser process, between which the browser |
| 172 | +is restarted. Parallel test running is enabled by scheduling multiple |
| 173 | +test groups, each of which is associated with a single browser |
| 174 | +instance. |
| 175 | + |
| 176 | +It also has the concept of multiple test types (e.g. testharness, |
| 177 | +reftest), each of which has a different implementation. Each test |
| 178 | +group is associated with a single type of test (i.e. a single group |
| 179 | +can contain either testharness tests or reftests, but not both) |
| 180 | +i.e. we can form a mapping from a test type to a list of test |
| 181 | +groups. Groups of different test types can be run in parallel. |
| 182 | + |
| 183 | +For subsuites the different browser configurations in each subsuite |
| 184 | +mean that it's generally desirable not to reuse browser instances |
| 185 | +between different subsuites either by necessity (e.g. because the |
| 186 | +configuration of the browser for the subsuite has to happen at browser |
| 187 | +startup) or because of isolation (so that we don't accidentally get |
| 188 | +shared browser state between subsuites affecting the results of |
| 189 | +later-run tests). Therefore we can assume that each test group should |
| 190 | +be associated with exactly one subsuite. This leaves a small amount of |
| 191 | +parallelism on the table in cases where it would be possible to run |
| 192 | +multiple subsuites in the same group, but the resulting model is |
| 193 | +relatively easy to understand and satisfies the main use cases where |
| 194 | +changing configuration at runtime is not possible. |
| 195 | + |
| 196 | +Therefore the implementation will work as follows: |
| 197 | + |
| 198 | +* Test groups are associated with a pair `(subsuite, test_type)` |
| 199 | + rather than a test type alone. |
| 200 | + |
| 201 | +* Each `(subsuite, test_type)` combination has a unique |
| 202 | + `TestImplementation`, which corresponds to a unique instance of the |
| 203 | + product-specific `Browser` and `Executor` classes. |
| 204 | + |
| 205 | +* Each product implementation is responsible for configuring the |
| 206 | + browser and executor classes as required for the subsuite. To |
| 207 | + facilitate this the subsuite details are passed to the product |
| 208 | + specific functions for getting the browser and executor kwargs. |
| 209 | + |
| 210 | +* Any global state (e.g. reftest screenshot cache) that depends on the |
| 211 | + test id is instead keyed on `(subsuite, test.id)`. |
| 212 | + |
| 213 | +#### Subsuite definition |
| 214 | + |
| 215 | +The definition of subsuites contains the following properties: |
| 216 | + |
| 217 | +* `name` (required) - The name of the subsuite. |
| 218 | + |
| 219 | +* `config` (required) - Product-specific configuration options for the |
| 220 | + subsuite. This is used to e.g. set the command line arguments used |
| 221 | + when launching the browser, or the prefs set in the browser profile. |
| 222 | + |
| 223 | +* `run_info` (optional) - Any updates to the `run_info` associated with the |
| 224 | + subsuite, other than its name (the `run_info["subsuite"] = name` key |
| 225 | + is set implicitly). |
| 226 | + |
| 227 | +* `include` (optional) - A list of test id prefixes to include when |
| 228 | + running the subsuite. |
| 229 | + |
| 230 | +* `tags` (optional) - A list of test tags (as defined in the metadata |
| 231 | + files) to include when running the subsuite. |
| 232 | + |
| 233 | +* `expires` (optional) - A date after which the subsuite is no longer |
| 234 | + run (required for compatibility with Chromium's existing |
| 235 | + infrastructure). |
| 236 | + |
| 237 | +In the first instance the subsuite definitions will be provided in a |
| 238 | +JSON file of the form |
| 239 | + |
| 240 | +```json |
| 241 | +{ |
| 242 | + subsuite_name: {properties} |
| 243 | +} |
| 244 | +``` |
| 245 | + |
| 246 | +This will be passed to wptrunner using a new `--subsuite-file` command |
| 247 | +line argument. |
| 248 | + |
| 249 | +#### Test selection |
| 250 | + |
| 251 | +The test selection mechanism in wptrunner is complex, and tries to |
| 252 | +meet many use cases. Adding subsuites into the mix inevitably |
| 253 | +increases that complexity. |
| 254 | + |
| 255 | +Typically in wptrunner test loading happens in multiple stages: |
| 256 | + |
| 257 | +1. One or more `MANIFEST.json` files are loaded, providing a complete list of |
| 258 | + all the tests in the suite (multiple `MANIFEST.json` files are used |
| 259 | + to allow vendors to add vendor-specific tests to the suite). |
| 260 | + |
| 261 | +1. The entries of each manifest are iterated. These are filtered |
| 262 | + according to a test id filter built from the `--include`, |
| 263 | + `--exclude`, `--include-file` and `--include-manifest` command line |
| 264 | + arguments. |
| 265 | + |
| 266 | +1. For each included test, the corresponding metadata is parsed. |
| 267 | + |
| 268 | +1. Further filters based on the metadata values are applied. The |
| 269 | + `--tag` command line argument, and `--skip-implementation-status` |
| 270 | + command line arguments are appleid at this stage, since they depend |
| 271 | + on metadata values. |
| 272 | + |
| 273 | +1. The tests are split into test groups, according to the |
| 274 | + configuration. |
| 275 | + |
| 276 | +The `--test-groups` command line argument changes the behaviour here |
| 277 | +slightly; it directly provides a list of tests that will run, and |
| 278 | +names groups for the tests in the format `{group_name: |
| 279 | +[tests_ids]}`. This can be used in combination with the `--include` |
| 280 | +command line argument to subset the tests in the file, but trying to |
| 281 | +specify more tests than in the file will cause an error. |
| 282 | + |
| 283 | +The current metadata processing means that we need to read the |
| 284 | +metadata files once per subsuite (since we resolve the metadata |
| 285 | +properties when reading the file, and the metadata can be different |
| 286 | +for different subsuites). |
| 287 | + |
| 288 | +The simplest solution here is to read the `MANIFEST.json` files once, |
| 289 | +apply the top-level filters that apply to all subsuites (e.g. those |
| 290 | +from `--include`), and then run the remaining steps for each subsuite, |
| 291 | +adding the subsuite include filters after parsing the metadata. This |
| 292 | +isn't the optimal solution, since we can end up reading metadata |
| 293 | +multiple times even though it's not used. However further |
| 294 | +optimisations are possible (e.g. separating out the parse and |
| 295 | +evaluation phase of reading metadata, as we do with the update tool, |
| 296 | +making the top-level filters per-subsuite to avoid reading metadata |
| 297 | +files for tests won't be included in the subsuite). However these |
| 298 | +optimistations should be possible without backwards-incompatible |
| 299 | +changes. |
| 300 | + |
| 301 | +Since the `--test-groups` file is directly providing a list of tests |
| 302 | +that run, rather than a filter, the format of that file is modified in |
| 303 | +a backwards compatible way so that instead of the keys being just |
| 304 | +group name, they are of the form `<subsuite_name>:<group_name>`, with |
| 305 | +the part up to and including the colon omitted for the default subsuite. |
| 306 | + |
| 307 | +Running individual subsuites will be possible with the `--subsuite` |
| 308 | +command line argument. This can be provided multiple times to run |
| 309 | +multiple subsuites. |
| 310 | + |
| 311 | +#### wptreport.json |
| 312 | + |
| 313 | +The wptreport.json format needs to be updated to allow specifying |
| 314 | +which subsuite each test belongs to. To do this two principal changes |
| 315 | +are proposed: |
| 316 | + |
| 317 | +1. A new top-level key `subsuites` which contains a mapping from |
| 318 | + subsuite name to the `run_info` data specific to that subsuite. |
| 319 | + |
| 320 | +1. Entries in the `results`, `asserts`, `lsan_leaks`, and `mozleak` |
| 321 | + top-level fields get a `subsuite` property which is the name of a |
| 322 | + subsuite. |
| 323 | + |
| 324 | +#### Metadata update |
| 325 | + |
| 326 | +The automatic metadata update works by collecting test results from |
| 327 | +multiple files, grouped by their `run_info` data. This can naturally |
| 328 | +be extended to support subsuites, just by extending the processing to |
| 329 | +create a `run_info` entry per subsuite rather than assuming the |
| 330 | +`run_info` is the same for all tests in a file. |
| 331 | + |
| 332 | +The main difficulty here is that the metadata expression language |
| 333 | +doesn't have a concept of `null`. So we need to represent the default |
| 334 | +subsuite name as the empty string in `run_info`, or add a `null` |
| 335 | +concept to allow specifying metadata specific to the default subsuite. |
| 336 | + |
| 337 | +## Risks |
| 338 | + |
| 339 | +The main trade-offs of the proposed approach have been discussed in the |
| 340 | +Details section. |
| 341 | + |
| 342 | +The main risk is that by making changes to the underlying data model |
| 343 | +we need to update tooling that consumes the generated files. However, |
| 344 | +because the changes are purely additive, and the use-cases for |
| 345 | +subsuites are limited (e.g. we don't currently have any plans to use |
| 346 | +these on the GitHub infrastructure), it should be possible to upgrade |
| 347 | +consumers as required. In particular no changes proposed here should |
| 348 | +affect wpt.fyi in the foreseeable future. |
0 commit comments