There are a few gaps in our testing that aren't addressed well by typical coverage statistics.
First, a few parts of the code are prone to error; for example, the functions in io.py can break when C MuJoCo data structures change. We should beef up testing for io.py, and possibly constraint.py, since both depend on shared fields that are being actively changed right now on the C MuJoCo side. A field-consistency test like the sketch below would catch such breakages early.
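As a hedged illustration (the `mjx.put_data` entry point and the dataclass layout of its result are assumptions, not confirmed details of io.py), a test along these lines would flag fields that drift out of sync with the C structs:

```python
# Minimal sketch of a field-consistency test. Assumptions: io.py exposes
# put_data(), and its result is a dataclass mirroring mujoco.MjData fields.
import dataclasses

import mujoco
from mujoco import mjx  # assumption: io.py lives in the mjx package


def test_data_fields_track_c_structs():
  model = mujoco.MjModel.from_xml_string('<mujoco><worldbody/></mujoco>')
  data = mujoco.MjData(model)
  mjx_data = mjx.put_data(model, data)  # assumed io.py entry point

  for field in dataclasses.fields(mjx_data):
    name = field.name
    if not hasattr(data, name):
      continue  # derived fields with no C-side counterpart
    c_val, our_val = getattr(data, name), getattr(mjx_data, name)
    # Compare shapes where both sides are arrays; a real test would
    # whitelist fields whose shapes legitimately differ.
    if hasattr(c_val, 'shape') and hasattr(our_val, 'shape'):
      assert c_val.shape == our_val.shape, name
```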
Second, we should introduce more test fixture models that exercise parts of the code that are currently not well tested, e.g. some collision functions and a variety of constraint types in both dense and sparse configurations. A sketch of one such fixture follows.
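For instance (hypothetical contents, not an existing file), an MJCF fixture combining a joint limit, contact with a plane, and a connect equality constraint, loadable with either dense or sparse Jacobians:

```python
# Sketch of a constraint-heavy test fixture (hypothetical; adjust to taste).
import mujoco

_FIXTURE_XML = """
<mujoco>
  <option jacobian="dense"/>
  <worldbody>
    <geom type="plane" size="1 1 .1"/>
    <body pos="0 0 .3">
      <joint type="hinge" limited="true" range="-45 45"/>
      <geom type="capsule" size=".05" fromto="0 0 0 .2 0 0"/>
    </body>
    <body name="anchor" pos=".4 0 .3">
      <joint type="free"/>
      <geom type="box" size=".05 .05 .05"/>
    </body>
  </worldbody>
  <equality>
    <connect body1="anchor" anchor="0 0 0"/>
  </equality>
</mujoco>
"""


def make_fixture(jacobian: str = 'dense') -> mujoco.MjModel:
  """Loads the fixture with the requested jacobian mode ('dense' or 'sparse')."""
  xml = _FIXTURE_XML.replace('jacobian="dense"', f'jacobian="{jacobian}"')
  return mujoco.MjModel.from_xml_string(xml)
```

Parametrizing tests over `make_fixture('dense')` and `make_fixture('sparse')` would exercise both configurations of the constraint code with the same model.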