In most cases these will use __builtin_clz().
In a follow-up, we should audit usage of silk_CLZ32() and convert
the places where its argument must be non-zero to use EC_ILOG()
directly to avoid the test for zero (which is necessary on x86).
Original patch by Aurélien Zanelli <aurelien.zanelli@parrot.com>:
http://lists.xiph.org/pipermail/opus/2013-May/002078.html
Revised version:
- Add autconf detection (ported from libtheora).
- Rename ARM5E to ARMv5E (an ARM5 is not the same thing as ARMv5!).
- Use actual macros so they can still be selectively overridden.
- Split out ARMv4 parts and add a few more ARMv4 macros.
- Label blocks to make them easy to find in generated assembly.
- Fix MULT16_32_Q15() so we can pass make check.
The MDCT test passes in values larger than 2**30 for b.
The new version should be just as fast (or faster, since it's
easier to merge the shift with following instructions), and
there's no appreciable impact on accuracy (FFT/MDCT SNR actually
goes up in most cases).
- Fix register constraints.
We were using early-clobber flags in a bunch of places that
didn't need them, and commutative-pair flags in a bunch of
places that weren't actually commutative.
This was Jean-Marc's fault (the original code came from Speex).
- Simplify silk_CLZ16().
- Port over iFFT C_MULC asm by Andree Buschmann
<AndreeBuschmann@t-online.de> from Rockbox.
- Speed up the C_MULC asm by using LDRD, allowing more flexible
addressing, re-ordering instructions to avoid some stalls,
allowing more flexible register allocation, and getting things
out of the inline asm block so the compiler can schedule them
better.
- Add C_MUL and C_MUL4 asm for the FFT to the encoder based, on the
new C_MULC.
In total, this patch gives a 22.3% speed-up on test_opus_encoder on
a 600 MHz Cortex A8 using gcc 4.2.1,
When restricted to ARMv4 optimizations, it gives a 9.6% speed-up
on the same processor/compiler.
On the conformance test vectors:
Average mono quality is 97.0583 %
Average stereo quality is 97.775 %
This makes all remaining large stack allocations use the vararray
macros.
This continues the work of 6f2d9f50 to allow compiling with
NONTHREADSAFE_PSEUDOSTACK to move the memory for large buffers
off the stack for devices where it is very limited.
It also does this for some additional large buffers used by the
PLC in the decoder.
This changes the saturation test to ensure that it relies on the
unsigned overflow behaviour (which is allowed) rather than the signed
overflow behaviour (which is undefined).
decoder:
- fixed incorrect scaling of filter states for the smallest quantization
step sizes
- NLSF2A now limits the prediction gain of LPC filters
encoder:
- increased damping of LTP coefficients in LTP analysis
- increased white noise fraction in noise shaping LPC analysis
- introduced maximum total prediction gain. Used by Burg's method to
exit early if prediction gain is exceeded. This improves packet
loss robustness and numerical robustness in Burg's method
- Prefiltered signal is now in int32 Q10 domain, from int16 Q0
- Increased max number of iterations in CBR gain control loop from 5 to 6
- Removed useless code from LTP scaling control
- Optimization: smarter LPC loop unrolling
- Switched default win32 compile mode to be floating-point
resampler:
- made resampler have constant delay of 0.75 ms; removed delay
compensation from silk code.
- removed obsolete table entries (~850 Bytes)
- increased downsampling filter order from 16 to 18/24/36 (depending on
frequency ratio)
- reoptimized filter coefficients
C reserves identifiers of the from _[A-Z]+ and we have a number of
those in the code. This patch renames the various function arguments,
MACROS and preprocessor symbols to avoid the reserved form.
It also removes the CHANNELS() macro altogether. This was a
minor optimization for TI DSP to force a mono-only build,
as were the associated local 'const' versions. Since stereo
support is manditory, it wasn't worth keeping.
Thanks to John Ridges for raising the issue, and Jean-Marc Valin
and Greg Maxwell for reviewing the changes.