This lets us encode and decode directly from the pulse vector without an
intermediate transformation.
This change breaks bitstream compatibility: old streams can no longer be
decoded.
Additionally, ncwrs_u32() has been sped up for large N by using the sliding
recurrence from Mohorko et al.
ncwrs_u64() could be sped up in a similar manner, but doing so would require a
larger table of multiplicative inverses (or several 32x32->64 bit multiplies).
Note that U(N,M) is now everywhere 1/2 the value it used to be.
This eliminates O(nm) extra lookups on decode and reduces the cost of rate
control from O(nm^2) to O(nm), in addition to eliminating O(m) lookups on both
encode and decode.
Although the interface is slightly more complex, the internal code is
simpler.
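
For readers unfamiliar with the quantity being counted, here is a minimal
sketch of the textbook codeword-count recurrence
V(N,M) = V(N-1,M) + V(N,M-1) + V(N-1,M-1), assuming ncwrs_u32() returns the
number of integer vectors of dimension N whose absolute values sum to M.
The name pvq_count_u32() is hypothetical and is not part of the code base.
This is the plain O(nm) row-by-row evaluation, not the sliding recurrence
from Mohorko et al., and it does not use the U(N,M) table described above.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not the real ncwrs_u32()): counts the integer
   vectors of dimension n whose absolute values sum to m, using
     V(n,m) = V(n-1,m) + V(n,m-1) + V(n-1,m-1),
     V(n,0) = 1, V(0,m) = 0 for m > 0.
   The caller must ensure the result fits in 32 bits.
   Returns 0 if the scratch row cannot be allocated. */
static uint32_t pvq_count_u32(int n, int m)
{
    uint32_t *row;
    uint32_t result;
    int i, k;

    if (m == 0) return 1;
    if (n == 0) return 0;

    row = malloc((size_t)(m + 1) * sizeof(*row));
    if (row == NULL) return 0;

    /* Start with the row for dimension 0: V(0,0) = 1, V(0,k) = 0. */
    row[0] = 1;
    for (k = 1; k <= m; k++) row[k] = 0;

    /* Update the row in place, one dimension at a time. */
    for (i = 1; i <= n; i++) {
        uint32_t diag = row[0];      /* V(i-1,k-1) */
        for (k = 1; k <= m; k++) {
            uint32_t up = row[k];    /* V(i-1,k) */
            /* row[k-1] already holds V(i,k-1) at this point. */
            row[k] = up + row[k - 1] + diag;
            diag = up;
        }
    }

    result = row[m];
    free(row);
    return result;
}
```

As a sanity check, in two dimensions with one pulse the vectors are
(+1,0), (-1,0), (0,+1) and (0,-1), so pvq_count_u32(2, 1) yields 4.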