After #40, the safe 1024-bit prime generation benchmark for BoxedUint is about 30% slower than that for Uint. What are the contributing factors here?
One possible avenue for improvement would be to use mutating methods to reduce the number of allocations. This is especially relevant in lucas_test(), where we call clone() a lot, but it may be useful in other places as well.
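
To illustrate the kind of change meant here, below is a minimal sketch contrasting an allocating method with a mutating one. `HeapUint`, `wrapping_add`, `wrapping_add_assign`, `sum_allocating` and `sum_in_place` are hypothetical names made up for this example; this is not the actual `BoxedUint` API or the code in lucas_test(), and the arithmetic is not written to be constant-time.

```rust
/// Hypothetical stand-in for a heap-allocated big integer.
/// This is NOT the real `BoxedUint`; it only models "limbs live in a Vec".
#[derive(Clone, Debug, PartialEq)]
struct HeapUint {
    limbs: Vec<u64>, // little-endian limbs
}

impl HeapUint {
    /// Builds a value with `nlimbs` limbs (must be at least 1).
    fn from_u64(value: u64, nlimbs: usize) -> Self {
        assert!(nlimbs >= 1, "need at least one limb");
        let mut limbs = vec![0u64; nlimbs];
        limbs[0] = value;
        Self { limbs }
    }

    /// Allocating style: every call clones `self`, i.e. one heap allocation.
    fn wrapping_add(&self, rhs: &Self) -> Self {
        let mut out = self.clone();
        out.wrapping_add_assign(rhs);
        out
    }

    /// Mutating style: the sum is written into `self`'s existing buffer.
    fn wrapping_add_assign(&mut self, rhs: &Self) {
        assert_eq!(self.limbs.len(), rhs.limbs.len());
        let mut carry = false;
        for (a, b) in self.limbs.iter_mut().zip(&rhs.limbs) {
            let (sum, c1) = a.overflowing_add(*b);
            let (sum, c2) = sum.overflowing_add(carry as u64);
            *a = sum;
            carry = c1 || c2;
        }
    }
}

/// Clone-heavy loop: allocates a new buffer on every iteration.
fn sum_allocating(start: &HeapUint, step: &HeapUint, rounds: usize) -> HeapUint {
    let mut acc = start.clone();
    for _ in 0..rounds {
        acc = acc.wrapping_add(step);
    }
    acc
}

/// Same computation with a single up-front allocation.
fn sum_in_place(start: &HeapUint, step: &HeapUint, rounds: usize) -> HeapUint {
    let mut acc = start.clone();
    for _ in 0..rounds {
        acc.wrapping_add_assign(step);
    }
    acc
}
```

Both loops compute the same result; the second one just reuses the accumulator's buffer instead of allocating per iteration. Whether this accounts for a meaningful part of the 30% gap would need to be confirmed by profiling.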