The existing code in vec_avx.h produced the following warning with
gcc 6.4.0:
  warning: dereferencing type-punned pointer will break
  strict-aliasing rules
We already had a macro to work around this within the rules of the
C standard, but using it here does not get optimized
into a single MOVD as we had hoped.
Replacing it with memcpy() instead does get optimized correctly,
but requires switching from a macro to an inline function in order
to be able to declare a local variable and return a value.
We already have such an inline function in NSQ_del_dec_avx2.c, so
hoist that out and use it everywhere, and then convert vec_avx.h
to use it also.
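As a rough illustration of the memcpy() approach, the sketch below
shows why an inline function is needed. The name load_int_unaligned()
is hypothetical; it is not the helper hoisted out of
NSQ_del_dec_avx2.c, and whether the copy becomes a single instruction
depends on the compiler and optimization level.

```c
#include <string.h>

/* Hypothetical helper: read an int from an arbitrarily typed and
   aligned address without dereferencing a type-punned pointer. */
static inline int load_int_unaligned(const void *p)
{
    int v;                     /* the local a macro could not declare */
    memcpy(&v, p, sizeof(v));  /* typically lowered to one plain load */
    return v;                  /* the value a macro could not return */
}
```

The result can then be handed to whatever consumes the 32-bit value,
without any pointer cast at the call site.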
Since any value of dQ > 0 will cause the initial quantizer to
degrade to the format-implied maximum (15) with a sufficient
number of DRED frames, allow signaling a maximum smaller than 15.
This allows encoders to improve the minimum quality of long DRED
sequences (at the expense of bitrate) without requiring a constant
quantizer for all frames (dQ == 0).
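To make the effect concrete, here is a minimal sketch that assumes a
quantizer growing linearly with the frame index and clamped to a
signaled maximum. The function name, the slope expressed in
sixteenths per frame, and the rounding are illustrative assumptions,
not the DRED bitstream definition.

```c
/* Hypothetical model: quantizer for DRED frame i, starting at q0 and
   rising by slope_x16/16 per frame, clamped to q_max. */
static int dred_quantizer_sketch(int q0, int slope_x16, int q_max, int i)
{
    int q = q0 + (slope_x16 * i + 8) / 16;  /* rounded linear ramp */
    return q > q_max ? q_max : q;           /* never exceed the maximum */
}
```

In this model, any nonzero slope eventually reaches q_max, so a
q_max below 15 bounds how coarse the oldest frames can get, while a
slope of 0 keeps the quantizer at q0 for every frame.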
Commit 735c40706f added uses of intrinsics that require at least
gcc 9.0 (cf. <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78782>),
even though AVX2 support may appear to be available in earlier gcc
versions.
We were not testing for this.
Update the compiler test in configure.ac to use these intrinsics
explicitly, so it will error out and disable AVX2 if they are not
available.
gcc produced
  warning: expression does not compute the number of elements in this array
It appears to assume that a division of two sizeof expressions is
meant to compute the number of elements in our array, which is not
what we want here. It then suggests adding parentheses to silence
the warning.
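A minimal sketch of the pattern, using a hypothetical array and
function name rather than the code touched by this change:

```c
#include <stddef.h>

/* Hypothetical table of int values whose size we want in units of
   short, not in elements. */
const int example_table[4] = {1, 2, 3, 4};

size_t example_table_size_in_shorts(void)
{
    /* sizeof(example_table) / sizeof(short) would draw the warning,
       because short is not the element type of example_table[].
       Parenthesizing the divisor marks the division as intentional. */
    return sizeof(example_table) / (sizeof(short));
}
```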
xcorr_kernel_neon_fixed() read one more sample from y[] in the
main loop than it needed, in order to allow the use of vector loads.
However, unlike the native asm in celt_pitch_xcorr_arm.s, the loop
condition did not exit early enough to prevent this from overrunning
the end of the array.
Additionally, the tail loop _always_ read one value beyond what it
needed.
This patch fixes the loop condition on the main loop.
This makes the tail section run even for lengths that are a
multiple of 8 (e.g., for fully half the multiplies in usages like
celt_fir() or celt_iir() with an order of 16, which is common), so
its speed matters.
Rather than try to fix the tail loop, we replace it with a
non-looping adaptation of the native asm, which continues to use
vector loads as much as possible for the remaining elements (and
also does not read ahead past the end of the y[] array).
Overall slowdown of test_opus_encode on a Raspberry Pi 5 Model B
Rev 1.0 is 0.12% vs. 0.13% for fixing the existing tail loop.
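The bounds argument can be sketched in scalar C. This is a model
under stated assumptions, not the NEON code: it assumes each main
loop iteration consumes 8 x[] samples and that whole-vector loads
fetch y[j..j+11] even though only y[j..j+10] is needed. The function
name and the scalar tail loop are stand-ins for illustration.

```c
/* Sketch: four-lag cross-correlation, sum[k] += x[j]*y[j+k] for
   k = 0..3. y[] must hold len+3 samples. Returns the highest y[]
   index the modeled loads would touch. */
static int xcorr4_blocked_sketch(const short *x, const short *y,
                                 int sum[4], int len)
{
    int j = 0, i, k;
    int max_read = -1;
    /* Main loop: only entered when strictly more than 8 samples
       remain, so the assumed load of y[j+11] stays <= y[len+2]. */
    while (len - j > 8) {
        max_read = j + 11;
        for (i = 0; i < 8; i++)
            for (k = 0; k < 4; k++)
                sum[k] += x[j + i] * y[j + i + k];
        j += 8;
    }
    /* Tail: scalar stand-in for the non-looping vector tail; it reads
       exactly y[j..len+2] and nothing beyond. */
    for (; j < len; j++) {
        for (k = 0; k < 4; k++)
            sum[k] += x[j] * y[j + k];
        max_read = j + 3;
    }
    return max_read;
}
```

With len = 16, the main loop runs once and the tail handles the
remaining 8 samples, which is the "half the multiplies" case above.
A condition of len - j >= 8 would instead let a second main loop
iteration fetch y[19], one past the last valid sample y[18].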
Signed-off-by: Jean-Marc Valin <jmvalin@jmvalin.ca>
Compare the output of xcorr_kernel() against the results of
xcorr_kernel_c() when configured with --enable-check-asm.
Currently this is only checked in fixed point, as a float check
requires more sophisticated error analysis and may need to be
customized for each vector implementation.
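The shape of such a check might look like the sketch below. All
names are hypothetical and the function-pointer wrapper is only an
illustration; it does not reproduce the macros used with
--enable-check-asm.

```c
#include <assert.h>
#include <string.h>

typedef void (*xcorr4_fn)(const short *x, const short *y,
                          int sum[4], int len);

/* Plain C reference: sum[k] += x[j]*y[j+k] for k = 0..3. */
static void xcorr4_ref(const short *x, const short *y,
                       int sum[4], int len)
{
    int j, k;
    for (j = 0; j < len; j++)
        for (k = 0; k < 4; k++)
            sum[k] += x[j] * y[j + k];
}

/* Hypothetical wrapper: run the reference on a copy of the initial
   sums, run the implementation under test, and require bit-exact
   agreement, which is only meaningful for integer arithmetic. */
static void xcorr4_checked(xcorr4_fn impl, const short *x,
                           const short *y, int sum[4], int len)
{
    int ref[4];
    memcpy(ref, sum, sizeof(ref));
    xcorr4_ref(x, y, ref, len);
    impl(x, y, sum, len);
    assert(memcmp(sum, ref, sizeof(ref)) == 0);
}
```

An exact comparison like this works in fixed point because integer
sums do not depend on accumulation order. A float build would need a
tolerance instead, since a vector implementation may add terms in a
different order than the C code.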
Signed-off-by: Jean-Marc Valin <jmvalin@jmvalin.ca>