Accuracy for rsqrt, rcp, cos, and log2 is now at the level of truncation error
for the current output resolution of these functions.
sqrt and exp2 still have non-trivial algebraic error, but this cannot be
reduced much further using the current method without additional computation.
Also updates the fast float approximations for log2 and exp2 with coefficients
that give slightly lower maximum relative error.
Patch modified by Jean-Marc Valin to leave the cos approximation as is and
leave the check for x<-15 in exp2 as is.
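As a rough illustration of the general shape of such fast float approximations, the exponent is split off and the mantissa is approximated with a low-order polynomial. Here is a minimal C sketch. It assumes IEEE-754 single-precision floats. The names fast_log2() and fast_exp2() are hypothetical, and the coefficients are simple hand-derived quadratics that are exact at the interval endpoints, not the tuned coefficients from this patch. Returning 0 for x<-15 is only an assumption about what that check does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fast log2 for positive, normal IEEE-754 floats.
   Splits x = 2^e * m with m in [1,2) and approximates log2(m). */
static float fast_log2(float x)
{
    uint32_t bits;
    int e;
    float m, t;
    assert(x > 0.0f);
    memcpy(&bits, &x, sizeof bits);
    e = (int)((bits >> 23) & 0xFF) - 127;        /* unbiased exponent */
    bits = (bits & 0x007FFFFFu) | 0x3F800000u;   /* keep mantissa, set exponent to 0 */
    memcpy(&m, &bits, sizeof m);                 /* m in [1,2) */
    t = m - 1.0f;
    /* log2(1+t) ~= t*(4/3 - t/3): exact at t=0 and t=1, max error about 0.01. */
    return (float)e + t * (4.0f / 3.0f - t * (1.0f / 3.0f));
}

/* Hypothetical fast exp2, valid for x < 127. */
static float fast_exp2(float x)
{
    int i;
    float f, p;
    uint32_t bits;
    if (x < -15.0f)
        return 0.0f;          /* assumed behavior of the x<-15 guard */
    assert(x < 127.0f);
    i = (int)x;               /* truncates toward zero... */
    if ((float)i > x)
        i--;                  /* ...so step down to get floor(x) for negative x */
    f = x - (float)i;         /* fractional part in [0,1) */
    /* 2^f ~= 1 + f*(2/3 + f/3): exact at f=0 and f=1, under 0.4% relative error. */
    p = 1.0f + f * (2.0f / 3.0f + f * (1.0f / 3.0f));
    memcpy(&bits, &p, sizeof bits);
    bits += (uint32_t)i << 23;   /* multiply by 2^i by adding to the exponent field */
    memcpy(&p, &bits, sizeof p);
    return p;
}
```

These quadratics have about 0.01 absolute error for log2 and under 0.4% relative error for exp2. Higher-order polynomials with coefficients fitted to minimize the maximum relative error, which is the kind of tuning the patch describes, do considerably better.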
Makes it so that all the information encoded directly with ec_enc_bits() is
stored at the end of the stream, without going through the range coder. This
should be faster and should also reduce the effects of bit errors.
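The following is a minimal sketch of the idea under stated assumptions. The range coder would write its bytes from the front of the buffer (not shown). Raw bits are packed least-significant-bit first into bytes taken from the back, so the two streams never have to be interleaved. The raw_writer/raw_reader types and functions are hypothetical and are not libcelt's ec_enc_bits() implementation. The real bit order and flushing rules may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical raw-bit writer: bits are packed LSB-first into bytes taken from
   the END of buf, leaving the front free for range-coder output (not shown). */
typedef struct {
    unsigned char *buf;
    size_t size;
    size_t end_bytes;   /* bytes already written at the end of buf */
    uint32_t window;    /* pending bits not yet written */
    int nbits;          /* number of valid bits in window */
} raw_writer;

typedef struct {
    const unsigned char *buf;
    size_t size;
    size_t end_bytes;   /* bytes already consumed from the end of buf */
    uint32_t window;
    int nbits;
} raw_reader;

static void raw_write_bits(raw_writer *w, uint32_t val, int n)
{
    assert(n >= 0 && n <= 24);
    w->window |= (val & ((1u << n) - 1u)) << w->nbits;
    w->nbits += n;
    while (w->nbits >= 8) {
        assert(w->end_bytes < w->size);
        w->buf[w->size - 1 - w->end_bytes++] = (unsigned char)(w->window & 0xFF);
        w->window >>= 8;
        w->nbits -= 8;
    }
}

/* Write out any partial final byte (its unused high bits are zero). */
static void raw_flush(raw_writer *w)
{
    if (w->nbits > 0) {
        assert(w->end_bytes < w->size);
        w->buf[w->size - 1 - w->end_bytes++] = (unsigned char)(w->window & 0xFF);
        w->window = 0;
        w->nbits = 0;
    }
}

static uint32_t raw_read_bits(raw_reader *r, int n)
{
    uint32_t v;
    assert(n >= 0 && n <= 24);
    while (r->nbits < n) {
        /* Past the available bytes, read zeros instead of running off the buffer. */
        uint32_t byte = r->end_bytes < r->size ? r->buf[r->size - 1 - r->end_bytes++] : 0;
        r->window |= byte << r->nbits;
        r->nbits += 8;
    }
    v = r->window & ((1u << n) - 1u);
    r->window >>= n;
    r->nbits -= n;
    return v;
}

/* Write three fields, then read them back from the end of the buffer. */
static int raw_roundtrip_ok(void)
{
    unsigned char buf[16] = {0};
    raw_writer w = {buf, sizeof buf, 0, 0, 0};
    raw_reader r = {buf, sizeof buf, 0, 0, 0};
    raw_write_bits(&w, 5, 3);
    raw_write_bits(&w, 0xABC, 12);
    raw_write_bits(&w, 1, 1);
    raw_flush(&w);                    /* 16 bits: exactly two bytes at the end */
    if (w.end_bytes != 2 || buf[0] != 0 || buf[15] != 0xE5)
        return 0;
    return raw_read_bits(&r, 3) == 5 && raw_read_bits(&r, 12) == 0xABC
        && raw_read_bits(&r, 1) == 1;
}
```

These bits never pass through the range coder. A flipped bit among them should therefore only corrupt the field it belongs to, which is presumably the robustness benefit the message mentions.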
Adds specialized O(N*log(K)) versions of cwrsi() and O(N) versions of icwrs()
for N={3,4,5}, which allows them to operate all the way up to the theoretical
pulse limit without serious performance degradation.
Also substantially reduces the computation time and stack usage of
get_required_bits().
On x86-64, this gives a 2% speed-up for 256 sample frames, and almost a 16%
speed-up for 64 sample frames.
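Small N lends itself to specialized routines partly because the number of pulse vectors V(N,K) is a low-degree polynomial in K for fixed N. V(N,K) counts the integer vectors of length N whose absolute values sum to K. The following C sketch checks closed forms for N=3, 4, 5 against the generic recurrence V(N,K) = V(N-1,K) + V(N,K-1) + V(N-1,K-1). The function names are hypothetical. The sketch does not reproduce the commit's cwrsi()/icwrs() code, only counts of the kind those routines work with.

```c
#include <assert.h>
#include <stdint.h>

#define MAXK 64

/* Reference: V(n,k) from the generic recurrence, filled as a small 2-D table. */
static uint64_t v_generic(int n, int k)
{
    uint64_t t[6][MAXK + 1];
    assert(n >= 0 && n <= 5 && k >= 0 && k <= MAXK);
    for (int i = 0; i <= n; i++)
        for (int j = 0; j <= k; j++)
            t[i][j] = (j == 0) ? 1 : (i == 0) ? 0
                    : t[i - 1][j] + t[i][j - 1] + t[i - 1][j - 1];
    return t[n][k];
}

/* Closed forms for small n (valid for k > 0); V(n, 0) == 1 for every n.
   The divisions by 3 are exact for every integer k. */
static uint64_t v_closed(int n, uint64_t k)
{
    if (k == 0)
        return 1;
    switch (n) {
    case 3: return 4 * k * k + 2;
    case 4: return (8 * k * k * k + 16 * k) / 3;
    case 5: return (4 * k * k * k * k + 20 * k * k + 6) / 3;
    default: assert(0); return 0;
    }
}

/* Compare both methods for n = 3..5 and k = 0..kmax. */
static int closed_forms_match(int kmax)
{
    for (int n = 3; n <= 5; n++)
        for (int k = 0; k <= kmax; k++)
            if (v_closed(n, (uint64_t)k) != v_generic(n, k))
                return 0;
    return 1;
}
```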
When I removed the special case for EC_ILOG(0) in commit
06390d082d, it broke ec_dec_uint() with _ft=1
(which should encode the value 0 using 0 bits).
This feature was tested by ectest.c, but not actually used by libcelt.
An assert has been added to ec_dec_uint() to ensure that we don't try to use
this feature by accident.
ec_enc_uint() was actually correct, but support for this feature has been
removed there as well, with the same assert in its place.
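To make the edge case concrete: a uniform value in [0, _ft) spans ILOG(_ft-1) bits, so _ft=1 relies on ILOG(0) being 0. Below is a minimal C sketch with hypothetical names ilog32() and uint_bits(); these are not the libcelt macros. A loop-based ilog handles 0 naturally. Versions built on count-leading-zeros need an explicit special case, since GCC's __builtin_clz() is undefined for 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for EC_ILOG(): number of significant bits in x,
   with ilog32(0) == 0 by construction. */
static int ilog32(uint32_t x)
{
    int n = 0;
    while (x != 0) {
        n++;
        x >>= 1;
    }
    return n;
}

/* Hypothetical helper: bits spanned by a uniform value in [0, ft).
   ft == 1 would need 0 bits; as in the commit, it is rejected explicitly. */
static int uint_bits(uint32_t ft)
{
    assert(ft > 1);
    return ilog32(ft - 1);
}
```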
This lets us encode and decode directly from the pulse vector without an
intermediate transformation.
This makes old streams undecodable.
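Encoding "directly from the pulse vector" means the index is computed from the signed pulse values themselves. Below is a minimal C sketch of one such bijection between pulse vectors (length n, absolute values summing to k) and integers in [0, V(n,k)). It uses a generic ordering of 0, +1, -1, +2, -2, ... for each coordinate. This is not necessarily the ordering or code used by cwrsi()/icwrs(). The pvq_encode()/pvq_decode() functions and the V table are hypothetical, and the size limits (n, k <= 8) are arbitrary.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAXN 8
#define MAXK 8

/* V[n][k]: number of integer vectors of length n whose absolute values sum to k. */
static uint32_t V[MAXN + 1][MAXK + 1];

static void init_v(void)
{
    for (int n = 0; n <= MAXN; n++)
        for (int k = 0; k <= MAXK; k++)
            V[n][k] = (k == 0) ? 1 : (n == 0) ? 0
                    : V[n - 1][k] + V[n][k - 1] + V[n - 1][k - 1];
}

/* Map a pulse vector y (sum of |y[i]| == k) to an index in [0, V[n][k]).
   The index counts every vector that precedes y in the per-coordinate
   order 0, +1, -1, +2, -2, ... */
static uint32_t pvq_encode(const int *y, int n, int k)
{
    uint32_t idx = 0;
    for (int i = 0; i < n; i++) {
        int rem = n - i - 1;          /* coordinates still to come */
        int j = abs(y[i]);
        if (j > 0) {
            idx += V[rem][k];         /* skip vectors with y[i] == 0 */
            for (int m = 1; m < j; m++)
                idx += 2 * V[rem][k - m];   /* skip y[i] == +m and y[i] == -m */
            if (y[i] < 0)
                idx += V[rem][k - j]; /* skip y[i] == +j */
        }
        k -= j;
    }
    return idx;
}

/* Inverse of pvq_encode(): rebuild y from its index. */
static void pvq_decode(uint32_t idx, int n, int k, int *y)
{
    for (int i = 0; i < n; i++) {
        int rem = n - i - 1;
        if (idx < V[rem][k]) {
            y[i] = 0;
            continue;
        }
        idx -= V[rem][k];
        for (int j = 1; ; j++) {
            uint32_t p;
            assert(j <= k);           /* holds for any valid index */
            p = V[rem][k - j];
            if (idx < p) { y[i] = j; k -= j; break; }
            idx -= p;
            if (idx < p) { y[i] = -j; k -= j; break; }
            idx -= p;
        }
    }
}

/* Check that decode followed by encode is the identity on every index. */
static int pvq_roundtrip_ok(int n, int k)
{
    int y[MAXN];
    assert(n >= 1 && n <= MAXN && k >= 0 && k <= MAXK);
    init_v();
    for (uint32_t idx = 0; idx < V[n][k]; idx++) {
        int sum = 0;
        pvq_decode(idx, n, k, y);
        for (int i = 0; i < n; i++)
            sum += abs(y[i]);
        if (sum != k || pvq_encode(y, n, k) != idx)
            return 0;
    }
    return 1;
}
```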
Additionally, ncwrs_u32() has been sped up for large N by using the sliding
recurrence from Mohorko et al.
ncwrs_u64() could be sped up in a similar manner, but would require a larger
table of multiplicative inverses (or several 32x32->64 bit multiplies).
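The table of multiplicative inverses likely refers to a standard trick. When an odd d is known to divide n exactly, n/d equals n times the inverse of d modulo 2^32, which replaces a division with a multiply. The minimal C sketch below computes the inverse on the fly with Newton's iteration instead of reading it from a table. The function names are hypothetical, and this is not the recurrence from Mohorko et al. itself.

```c
#include <assert.h>
#include <stdint.h>

/* Inverse of odd d modulo 2^32 via Newton's iteration. d*d == 1 (mod 8) for
   any odd d, so x = d starts with 3 correct low bits, and each step doubles that. */
static uint32_t inv32(uint32_t d)
{
    uint32_t x = d;
    assert(d & 1u);
    x *= 2u - d * x;     /* 6 bits */
    x *= 2u - d * x;     /* 12 bits */
    x *= 2u - d * x;     /* 24 bits */
    x *= 2u - d * x;     /* 48 >= 32 bits */
    return x;
}

/* Exact division: correct only when d is odd and divides n exactly. */
static uint32_t exact_div_odd(uint32_t n, uint32_t d)
{
    return n * inv32(d);
}
```

An even divisor 2^s * d' can be handled by shifting n right by s first and then multiplying by the inverse of the odd part d'.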
Note that U(N,M) is now everywhere 1/2 the value it used to be.
This eliminates O(nm) extra lookups on decode and reduces the rate control
from O(nm^2) to O(nm), in addition to eliminating O(m) lookups on both encode
and decode.
Although the interface is slightly more complex, the internal code is also
simpler.
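The complexity difference is easier to see with the counts laid out as a table. As an illustration only, a single rolling row of the recurrence V(n,k) = V(n-1,k) + V(n,k-1) + V(n-1,k-1) yields V(n,k) for every pulse count k up to a limit in O(n*k) time and O(k) memory. This is the kind of cost that suits a rate control needing all pulse counts at once. The sketch works with V, the number of pulse vectors, rather than the U(N,M) table the text refers to, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define KMAX 64

/* Fill row[0..kmax] with V(n, k), the number of length-n integer vectors whose
   absolute values sum to k, using one rolling row: O(n*kmax) time,
   O(kmax) memory, and every k at once. */
static void pvq_count_row(int n, int kmax, uint64_t *row)
{
    row[0] = 1;                               /* V(i, 0) == 1 for every i */
    for (int k = 1; k <= kmax; k++)
        row[k] = 0;                           /* V(0, k) == 0 for k > 0 */
    for (int i = 1; i <= n; i++) {
        uint64_t diag = row[0];               /* V(i-1, k-1) for k == 1 */
        for (int k = 1; k <= kmax; k++) {
            uint64_t up = row[k];             /* V(i-1, k) */
            /* row[k-1] already holds V(i, k-1). */
            row[k] = up + row[k - 1] + diag;
            diag = up;
        }
    }
}

/* Number of bits needed to send an index in [0, v), i.e. ILOG(v - 1). */
static int index_bits(uint64_t v)
{
    int b = 0;
    assert(v > 0);
    for (v -= 1; v != 0; v >>= 1)
        b++;
    return b;
}

/* Convenience wrapper: V(n, k) for a single k. */
static uint64_t pvq_count(int n, int k)
{
    uint64_t row[KMAX + 1];
    assert(k >= 0 && k <= KMAX);
    pvq_count_row(n, k, row);
    return row[k];
}
```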