I've done some editing for clarity, but more needs to be done.
The language needs clean-up, we should forward-reference the LPC
Extrapolation section, and we need a reference for actually
computing linear prediction coefficients.
Reorder register usage to take advantage of early termination on
multiplications and reorder a load instruction to hide its
latency on ARM9.
Speeds up decoding of a 64 kbps test file by 0.1 MHz on an ARM7TDMI
and 0.2 MHz on an ARM9TDMI.
Signed-off-by: Timothy B. Terriberry <tterribe@xiph.org>
Uses a C implementation with a 32*32 => 64 multiplication, which
ARM has.
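A minimal sketch of what a 32*32 => 64 multiplication looks like in portable
C. The helper name mul32x32_q31 and the Q31 shift are assumptions made for
illustration and are not taken from the patch. The point is only that widening
to a 64-bit type before multiplying lets the compiler emit a single long
multiply (SMULL on ARM) rather than a sequence of 16-bit partial products.

```c
#include <stdint.h>

/* Hypothetical helper (not from the patch): full 32x32 => 64-bit signed
 * multiply, returning the product scaled back down by 31 bits (Q31).
 * Assumes >> on a negative int64_t is an arithmetic shift, which holds
 * for gcc and clang but is implementation-defined in ISO C. */
static int32_t mul32x32_q31(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a * (int64_t)b; /* cannot overflow 64 bits */
    return (int32_t)(wide >> 31);
}
```

For example, multiplying 0.5 by 0.5 in Q31 (0x40000000 by 0x40000000) yields
0.25 in Q31 (0x20000000).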
Speeds up decoding of a 64 kbps test file by 0.5 MHz on an ARM7TDMI
and 1.0 MHz on an ARM9TDMI.
0.2% speedup on a 96 kbps enc+dec test on a Cortex A8.
Signed-off-by: Timothy B. Terriberry <tterribe@xiph.org>
This splits out the non-arch-specific portions of a patch written
by Aurélien Zanelli <aurelien.zanelli@parrot.com>:
http://lists.xiph.org/pipermail/opus/2013-May/002088.html
I also added support for odd n, for custom modes.
0.25% speedup on 96 kbps stereo encode+decode on a Cortex A8.
58.4% speedup (2.4x faster) on test_unit_cwrs32 (no custom modes).
Gives a 3.2% speedup on
./opus_demo restricted-lowdelay 48000 2 96000 comp48-stereo.sw /dev/null
on a 600 MHz Cortex A8.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0068b/CIHBJEHG.html
says that "Rd cannot be the same as Rm."
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0068b/CIHBJEHG.html
says that "RdLo, RdHi, and Rm must all be different registers."
This means that some of the early clobbers I removed really should
have been there (to prevent aliasing Rd, RdLo, or RdHi with Rm).
It also means that we should reverse some of the operands in the
FFT's complex multiplies.
This should only affect the ARMv4 optimizations.
Thanks to Nils Wallménius for the report.
While we're here, audit the commutative pair flags again, since I
screwed up at least one of them, and eliminate some dead code.
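The register restriction above can be sketched with GCC-style inline asm. The
helper name mul_lo32 is hypothetical and this is not the code from the patch.
It only illustrates how an early-clobber output ("=&r") keeps the compiler
from assigning the destination register to the same register as an input. The
portable fallback is there so the sketch also builds on non-ARM targets.

```c
#include <stdint.h>

/* Hypothetical helper (not from the patch): low 32 bits of a*b. */
static int32_t mul_lo32(int32_t a, int32_t b)
{
#if defined(__GNUC__) && defined(__arm__) && !defined(__thumb__)
    int32_t rd;
    /* "=&r" marks rd as early-clobber, so it is never allocated to the
     * same register as either input. That satisfies the rule that Rd
     * must differ from Rm on ARMv4. */
    __asm__("mul %0, %1, %2" : "=&r"(rd) : "r"(a), "r"(b));
    return rd;
#else
    /* Portable fallback; unsigned arithmetic avoids signed-overflow UB. */
    return (int32_t)((uint32_t)a * (uint32_t)b);
#endif
}
```

Without the "&", the compiler is free to reuse an input register for the
output, which is fine for most instructions but not for MUL or the long
multiplies on this architecture version.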
Needed by commit 972a34ec2c.
Use autoreconf in autogen.sh instead of the handwritten version.
It's simpler, and it also updates things that we weren't handling.
Drop the hand-written INSTALL file. Its information content was
~zero, and autotools wants to overwrite it with its own version,
so don't fight that, just .gitignore it.
In most cases these will use __builtin_clz().
In a follow-up, we should audit usage of silk_CLZ32() and convert
the places where its argument must be non-zero to use EC_ILOG()
directly to avoid the test for zero (which is necessary on x86).
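A minimal sketch of the relationship described above, using hypothetical names
(ilog_nonzero and clz32_sketch) rather than the real EC_ILOG() and
silk_CLZ32() definitions. It assumes a 32-bit unsigned int and a compiler that
provides __builtin_clz(), with a plain loop as a fallback elsewhere.

```c
#include <stdint.h>

/* Hypothetical stand-in for an integer log: number of bits needed to
 * represent v. The argument must be non-zero, because the result of
 * __builtin_clz(0) is undefined. */
static int ilog_nonzero(uint32_t v)
{
#if defined(__GNUC__)
    return 32 - __builtin_clz((unsigned)v); /* assumes 32-bit unsigned */
#else
    int n = 0;
    while (v) { n++; v >>= 1; }
    return n;
#endif
}

/* Hypothetical stand-in for a CLZ that must accept zero: the extra test
 * is what callers with a known non-zero argument could skip. */
static int clz32_sketch(uint32_t v)
{
    return v ? 32 - ilog_nonzero(v) : 32;
}
```

A caller that can prove its argument is non-zero could use the first function
directly and drop the branch.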
Original patch by Aurélien Zanelli <aurelien.zanelli@parrot.com>:
http://lists.xiph.org/pipermail/opus/2013-May/002078.html
Revised version:
- Add autoconf detection (ported from libtheora).
- Rename ARM5E to ARMv5E (an ARM5 is not the same thing as ARMv5!).
- Use actual macros so they can still be selectively overridden.
- Split out ARMv4 parts and add a few more ARMv4 macros.
- Label blocks to make them easy to find in generated assembly.
- Fix MULT16_32_Q15() so we can pass make check.
The MDCT test passes in values larger than 2**30 for b.
The new version should be just as fast (or faster, since it's
easier to merge the shift with following instructions), and
there's no appreciable impact on accuracy (FFT/MDCT SNR actually
goes up in most cases).
- Fix register constraints.
We were using early-clobber flags in a bunch of places that
didn't need them, and commutative-pair flags in a bunch of
places that weren't actually commutative.
This was Jean-Marc's fault (the original code came from Speex).
- Simplify silk_CLZ16().
- Port over iFFT C_MULC asm by Andree Buschmann
<AndreeBuschmann@t-online.de> from Rockbox.
- Speed up the C_MULC asm by using LDRD, allowing more flexible
addressing, re-ordering instructions to avoid some stalls,
allowing more flexible register allocation, and getting things
out of the inline asm block so the compiler can schedule them
better.
- Add C_MUL and C_MUL4 asm for the FFT to the encoder, based on the
new C_MULC.
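For the MULT16_32_Q15() item above, a portable reference for the intended
result can be sketched as follows. The name mult16_32_q15_ref is hypothetical
and this is not the ARM asm from the patch. It is only a 64-bit reference that
stays correct when b exceeds 2**30, which is the range the MDCT test is said
to exercise.

```c
#include <stdint.h>

/* Hypothetical reference (not the patch's asm): (a * b) >> 15 for a
 * 16-bit a and a 32-bit b, computed in 64 bits so that no intermediate
 * value can overflow, whatever the magnitude of b.
 * Assumes >> on a negative int64_t is an arithmetic shift. */
static int32_t mult16_32_q15_ref(int16_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> 15);
}
```

With a = 16384 (0.5 in Q15) and b = 0x50000000, which is larger than 2**30,
the reference returns 0x28000000. An optimized version can be checked against
it over the same inputs.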
In total, this patch gives a 22.3% speed-up on test_opus_encoder on
a 600 MHz Cortex A8 using gcc 4.2.1.
When restricted to ARMv4 optimizations, it gives a 9.6% speed-up
on the same processor/compiler.
On the conformance test vectors:
Average mono quality is 97.0583 %
Average stereo quality is 97.775 %
We shouldn't ever have any trailing newlines that need trimming here,
and the _s version wasn't added to m4sugar.m4 until autoconf 2.63b,
so this will let it work with 2.13 again.
There's currently at least one way that people can legitimately get a
tarball that doesn't include it, via the gitweb snapshots, so create
it rather than considering that an error to be manually fixed.
Drop some unneeded CINCLUDES.
Drop the VPATH stuff altogether. It's entirely unused here, and some of
the paths in it don't even exist and apparently never have in this tree.
Drop the 'default' rule; without it, 'all' is already the default.
Drop $(TARGET) from 'all'; it already includes 'lib', which is $(TARGET).
Declare phony targets PHONY.
This one meets or exceeds the following requirements:
- Version is checked/updated for every build action when in the git repo.
Does not require the user to re- ./configure to get the correct version.
- Version is not updated automatically when using exported tarball source.
Avoids accidentally getting a wrong version from some other git repo in
a parent directory of the source, and allows setting the correct version
for distro package exports.
- Automatic updating can be manually suppressed.
For developers doing lots of change/rebuild cycles they don't plan to
release, when they don't want a full rebuild triggered for every commit,
and again for every change made immediately after a commit.
The version will still always be updated if they do a `make dist`.
- Does not require any manual updating of versions in the mainline git
repo for each release aside from normal tagging. The version is
recorded in one file only, that is automatically generated and will
never need to be committed.
- Does not require gnu-make features for the autoconf builds.
It does not currently:
- Keep a checksum of every source file in tarball releases to mangle the
version if people modify the tarball source. In such cases, though,
responsible people can easily update the version manually.
The version.mk file is now only used by the VC project files. Once they
are updated to use the package_version file too, then it can be deleted
from the repository.
We stop the Schur recursion before any reflection coefficient
goes outside of ]-1,1[ and we force reporting a residual energy
of at least 1.
Assertion was:
Fatal (internal) error in ../silk/fixed/noise_shape_analysis_FIX.c, line 290: assertion failed: nrg >= 0
triggered by:
opus_demo voip 16000 1 12500 -bandwidth WB -complexity 10 pl04f087.stp-crash out.pcm
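The guard described above can be sketched in floating point as follows. This
is a simplified illustration under stated assumptions, not the fixed-point
code the fix touches: the name schur_sketch, the SKETCH_MAX_ORDER bound, and
the use of double are all hypothetical. It stops the recursion as soon as the
next reflection coefficient would reach magnitude 1 or more, zeroes the
remaining coefficients, and never reports a residual energy below 1.

```c
#define SKETCH_MAX_ORDER 16 /* hypothetical bound; order must not exceed it */

/* Hypothetical sketch: Schur recursion on autocorrelations ac[0..order],
 * writing reflection coefficients to rc[0..order-1] and returning the
 * residual energy, floored at 1. */
static double schur_sketch(double rc[], const double ac[], int order)
{
    double C[SKETCH_MAX_ORDER + 1][2];
    int k, n;

    for (k = 0; k <= order; k++) {
        C[k][0] = C[k][1] = ac[k];
    }
    for (k = 0; k < order; k++) {
        double num = C[k + 1][0];
        double r;
        if (num < 0) num = -num;
        /* Stop before |rc| would reach 1; this also catches a residual
         * energy that has dropped to zero or below. */
        if (num >= C[0][1]) {
            break;
        }
        r = -C[k + 1][0] / C[0][1];
        rc[k] = r;
        /* Update the two correlation columns. */
        for (n = 0; n < order - k; n++) {
            double t1 = C[n + k + 1][0];
            double t2 = C[n][1];
            C[n + k + 1][0] = t1 + t2 * r;
            C[n][1] = t2 + t1 * r;
        }
    }
    /* Any coefficients not computed are reported as zero. */
    for (; k < order; k++) {
        rc[k] = 0.0;
    }
    return C[0][1] < 1.0 ? 1.0 : C[0][1];
}
```

With this shape, a caller that asserts on a non-negative residual energy
cannot be handed a negative value, even for ill-conditioned autocorrelations.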
silk_setup_resamplers() was using x_bufFIX for two purposes, and I
only allocated enough space for one of them.
This patch also switches to slightly more descriptive variable
names than nSamples_temp and computes the resampler input/output
sizes in a way that more obviously avoids issues with fractional
samples (and replaces a divide with a variable divisor by one with
a constant divisor).