@@ -371,9 +371,119 @@ Reporting causes of resets
Apart from propagating the reset through the stack so apps can recover, it's
really useful for driver developers to learn more about what caused the reset in
- the first place. DRM devices should make use of devcoredump to store relevant
- information about the reset, so this information can be added to user bug
- reports.
+ the first place. For this, drivers can make use of devcoredump to store relevant
+ information about the reset and send a device wedged event with the ``none``
+ recovery method (as explained in the "Device Wedging" chapter) to notify
+ userspace, so this information can be collected and added to user bug reports.
+
+ Device Wedging
+ ==============
+
+ Drivers can optionally make use of the device wedged event (implemented as
+ drm_dev_wedged_event() in the DRM subsystem), which notifies userspace of the
+ 'wedged' (hung/unusable) state of the DRM device through a uevent. This is
+ especially useful in cases where the device is no longer operating as expected
+ and has become unrecoverable from driver context. The purpose of this
+ implementation is to provide drivers with a generic way to recover the device
+ with the help of userspace intervention, without taking any drastic measures in
+ the driver (like resetting or re-enumerating the full bus on which the
+ underlying physical device sits).
+
+ A 'wedged' device is basically a device that is declared dead by the driver
+ after exhausting all possible attempts to recover it from driver context. The
+ uevent is the notification that is sent to userspace along with a hint about
+ what could possibly be attempted to recover the device from userspace and bring
+ it back to a usable state. Different drivers may have different ideas of a
+ 'wedged' device depending on the hardware implementation of the underlying
+ physical device, hence the vendor agnostic nature of the event. It is up to the
+ drivers to decide when they see the need for device recovery and which of the
+ available methods they want to use.
+
+ Driver prerequisites
+ --------------------
+
+ The driver, before opting for recovery, needs to make sure that the 'wedged'
+ device doesn't harm the system as a whole by taking care of the prerequisites.
+ Necessary actions must include disabling DMA to system memory as well as any
+ communication channels with other devices. Further, the driver must ensure
+ that all dma_fences are signalled and any device state that the core kernel
+ might depend on is cleaned up. All existing mmaps should be invalidated and
+ page faults should be redirected to a dummy page. Once the event is sent, the
+ device must be kept in 'wedged' state until the recovery is performed. New
+ accesses to the device (IOCTLs) should be rejected, preferably with an error
+ code that reflects the type of failure the device has encountered. This will
+ signify the reason for wedging, which can be reported to the application if
+ needed.
+
+ Recovery
+ --------
+
+ The current implementation defines three recovery methods, of which drivers
+ can use any one, multiple or none. The method(s) of choice will be sent in the
+ uevent environment as ``WEDGED=<method1>[,..,<methodN>]``, ordered from fewer to
+ more side-effects. If the driver is unsure about recovery or the method is
+ unknown (like soft/hard system reboot, firmware flashing, physical device
+ replacement or any other procedure which can't be attempted on the fly),
+ ``WEDGED=unknown`` will be sent instead.
+
+ Userspace consumers can parse this event and attempt recovery as per the
+ following expectations.
+
+     =============== ========================================
+     Recovery method Consumer expectations
+     =============== ========================================
+     none            optional telemetry collection
+     rebind          unbind + bind driver
+     bus-reset       unbind + bus reset/re-enumeration + bind
+     unknown         consumer policy
+     =============== ========================================
+
+ The only exception to this is ``WEDGED=none``, which signifies that the device
+ was temporarily 'wedged' at some point but was recovered from driver context
+ using device specific methods like reset. No explicit recovery is expected from
+ the consumer in this case, but it can still take additional steps like gathering
+ telemetry information (devcoredump, syslog). This is useful because the first
+ hang is usually the most critical one, which can result in consequential hangs
+ or complete wedging.
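
A consumer might split the ``WEDGED`` value along the following lines. This is only a minimal sketch and not part of the patch: the function names are hypothetical, and the only things it relies on are the comma-separated format and the fewer-to-more side-effects ordering described above.

```shell
#!/bin/sh
# Hypothetical helpers for a uevent consumer; names are illustrative only.

# Print each advertised method on its own line,
# e.g. "rebind,bus-reset" becomes two lines.
wedged_methods() {
    printf '%s\n' "$1" | tr ',' '\n'
}

# Print the first method in the list. Because the list is ordered from
# fewer to more side-effects, this is the least intrusive option.
wedged_first_method() {
    # ${1%%,*} strips everything from the first comma onwards.
    printf '%s\n' "${1%%,*}"
}

# Return 0 when the consumer is expected to act on the event.
# "none" means the driver already recovered the device on its own.
wedged_needs_recovery() {
    case "$1" in
        none|"") return 1 ;;
        *)       return 0 ;;
    esac
}
```

In a udev ``RUN`` helper, the value would typically come from the ``WEDGED`` environment variable, for example ``wedged_first_method "$WEDGED"``.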
448
+
449
+ Consumer prerequisites
450
+ ----------------------
451
+
452
+ It is the responsibility of the consumer to make sure that the device or its
453
+ resources are not in use by any process before attempting recovery. With IOCTLs
454
+ erroring out, all device memory should be unmapped and file descriptors should
455
+ be closed to prevent leaks or undefined behaviour. The idea here is to clear the
456
+ device of all user context beforehand and set the stage for a clean recovery.
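
One possible way for a consumer to check whether a device node is still held open is to scan ``/proc``. The sketch below is an assumption about how such a check could look, not something the patch prescribes: ``device_in_use`` is a hypothetical name, it only sees processes the caller is allowed to inspect (so it would normally run as root), and it looks at open file descriptors only, not at remaining memory mappings.

```shell
#!/bin/sh
# Hypothetical check: is any visible process holding this path open?
# Returns 0 if an open file descriptor refers to it, 1 if none was found,
# 2 if the path could not be resolved.
device_in_use() {
    target=$(readlink -f "$1") || return 2
    for fd in /proc/[0-9]*/fd/*; do
        # Entries that vanish or are unreadable are skipped silently.
        [ "$(readlink "$fd" 2>/dev/null)" = "$target" ] && return 0
    done
    return 1
}
```

A recovery script could call, for instance, ``device_in_use /dev/dri/card0`` and postpone the unbind until it returns non-zero.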
+
+ Example
+ -------
+
+ Udev rule::
+
+     SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]", \
+             RUN+="/path/to/rebind.sh $env{DEVPATH}"
+
+ Recovery script::
+
+     #!/bin/sh
+
+     # $1 is the DEVPATH of the DRM card, as passed by the udev rule.
+     DEVPATH=$(readlink -f "/sys/$1/device")
+     DEVICE=$(basename "$DEVPATH")
+     DRIVER=$(readlink -f "$DEVPATH/driver")
+
+     # Unbind the device from its driver, then bind it again.
+     echo -n "$DEVICE" > "$DRIVER/unbind"
+     echo -n "$DEVICE" > "$DRIVER/bind"
+
+ Customization
+ -------------
+
+ Although basic recovery is possible with a simple script, consumers can define
+ custom policies around recovery. For example, if the driver supports multiple
+ recovery methods, consumers can opt for the suitable one depending on scenarios
+ like repeat offences or vendor specific failures. Consumers can also choose to
+ have the device available for debugging or telemetry collection and base their
+ recovery decision on the findings. This is especially useful when the driver is
+ unsure about recovery or the method is unknown.
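
As an illustration of such a policy, a consumer could escalate through the advertised methods on repeat offences. The sketch below is hypothetical: ``choose_method`` and the notion of a failure counter are assumptions made for the example, and how the counter is stored between events is left to the consumer.

```shell
#!/bin/sh
# Hypothetical escalation policy.
#   $1: WEDGED value, e.g. "rebind,bus-reset" (fewer to more side-effects)
#   $2: number of earlier recoveries of this device that did not hold
# Prints the method to try next; returns 1 if the list is empty.
choose_method() {
    methods=$1
    failures=$2

    # Split the list on commas into the positional parameters.
    old_ifs=$IFS
    IFS=,
    set -- $methods
    IFS=$old_ifs

    [ "$#" -gt 0 ] || return 1

    # First offence picks the first method, each repeat picks the next
    # one, and the most intrusive method is the upper bound.
    index=$(($failures + 1))
    [ "$index" -le "$#" ] || index=$#

    shift $(($index - 1))
    printf '%s\n' "$1"
}
```

With ``WEDGED=rebind,bus-reset``, a first event would select ``rebind`` and a repeat would select ``bus-reset``. Values such as ``unknown`` pass through unchanged, so the caller still has to apply its own policy to them.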

.. _drm_driver_ioctl: