From 828e33f304e99321a7990752aeb97616399bfbcc Mon Sep 17 00:00:00 2001
From: "Timothy B. Terriberry"
Date: Mon, 26 Sep 2011 20:53:26 -0700
Subject: [PATCH] Draft clean-ups and additions.

---
 doc/build_draft.sh | 5 +-
 doc/draft-ietf-codec-opus.xml | 751 +++++++++++++++++++++++-----------
 2 files changed, 522 insertions(+), 234 deletions(-)

diff --git a/doc/build_draft.sh b/doc/build_draft.sh
index 91474555..75edd5ac 100755
--- a/doc/build_draft.sh
+++ b/doc/build_draft.sh
@@ -35,7 +35,10 @@ cp -a "${toplevel}"/COPYING "${destdir}"/COPYING
 tar czf opus_source.tar.gz "${destdir}"
 echo building base64 version
-cat opus_source.tar.gz| base64 | tr -d '\n' | fold -w 64 | sed 's/^/###/' > opus_source.base64
+cat opus_source.tar.gz| base64 | tr -d '\n' | fold -w 64 | \
+ sed -e 's/^/\###/' -e 's/$/\<\/spanx\>\/' > \
+ opus_source.base64
+
 #echo '
' > opus_compare_escaped.c #echo '' >> opus_compare_escaped.c diff --git a/doc/draft-ietf-codec-opus.xml b/doc/draft-ietf-codec-opus.xml index 910b3d40..ce00e5c8 100644 --- a/doc/draft-ietf-codec-opus.xml +++ b/doc/draft-ietf-codec-opus.xml @@ -38,7 +38,7 @@ - + Mozilla Corporation
@@ -1127,9 +1127,9 @@ However, this error is bounded, and periodic calls to ec_tell() or ec_tell_frac() at precisely defined points in the decoding process prevent it from accumulating. For a range coder symbol that requires a whole number of bits (i.e., - ft/(fh[k]-fl[k]) is a power of two), where there are at least p 1/8th bits - available, decoding the symbol will never advance the decoder past the end of - the frame ("bust the budget"). + for which ft/(fh[k]-fl[k]) is a power of two), where there are at least p + 1/8th bits available, decoding the symbol will never cause ec_tell() or + ec_tell_frac() to exceed the size of the frame ("bust the budget"). In this case the return value of ec_tell_frac() will only advance by more than p 1/8th bits if there was an additional, fractional number of bits remaining, and it will never advance beyond the next whole-bit boundary, which is safe, @@ -1172,13 +1172,38 @@ ec_tell_frac() estimates the number of bits buffered in rng to fractional precision. Since rng must be greater than 2**23 after renormalization, l must be at least 24. -Let r = rng>>(l-16), so that 32768 <= r < 65536, an unsigned Q15 - value representing the fractional part of rng. +Let +
+
+r_Q15 = rng >> (l-16) ,
+
+ so that 32768 <= r_Q15 < 65536, an unsigned Q15 value representing the + fractional part of rng. Then the following procedure can be used to add one bit of precision to l. -First, update r = r*r>>15. -Then add the 16th bit of r to l via l = 2*l + (r>>16). -Finally, if this bit was a 1, reduce r by a factor of two via r = r>>1, - so that it once again lies in the range 32768 <= r < 65536. +First, update +
+
+r_Q15 = (r_Q15*r_Q15) >> 15 .
+
+Then add the 16th bit of r_Q15 to l via +
+
+l = 2*l + (r_Q15 >> 16) .
+
+Finally, if this bit was a 1, reduce r_Q15 by a factor of two via +
+
+r_Q15 = r_Q15 >> 1 ,
+
+ so that it once again lies in the range 32768 <= r_Q15 < 65536. This procedure is repeated three times to extend l to 1/8th bit precision. @@ -1199,6 +1224,73 @@ It runs in NB, MB, and WB modes internally. When used in a hybrid frame in SWB or FB mode, the LP layer itself still only runs in WB mode. + +
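The three refinement rounds that ec_tell_frac() applies to l can be sketched in C as follows. This is a minimal illustration rather than the reference implementation: the function names ilog32() and rng_log2_q3() are hypothetical, and the sketch assumes that l starts out as the number of bits needed to represent rng (at least 24 after renormalization), which the excerpt implies but does not state outright.

```c
#include <stdint.h>

/* Hypothetical helper: number of bits needed to represent x (0 for x == 0). */
static int ilog32(uint32_t x)
{
    int n = 0;
    while (x) {
        n++;
        x >>= 1;
    }
    return n;
}

/* Hypothetical name. Estimates the size of rng in 1/8th-bit units.
   Assumes rng >= 2**23, so that l >= 24 and the shift below is valid. */
static uint32_t rng_log2_q3(uint32_t rng)
{
    uint32_t l = (uint32_t)ilog32(rng);
    /* Unsigned Q15 value with 32768 <= r_Q15 < 65536. */
    uint32_t r_Q15 = rng >> (l - 16);
    int i;
    for (i = 0; i < 3; i++) {
        uint32_t b;
        /* Squaring doubles the logarithm; the product fits in 32 bits. */
        r_Q15 = (r_Q15 * r_Q15) >> 15;
        /* The 16th bit is the next fractional bit of the logarithm. */
        b = r_Q15 >> 16;
        l = 2 * l + b;
        /* Renormalize back into [32768, 65536) if that bit was set. */
        r_Q15 >>= b;
    }
    return l;
}
```

For a power of two such as rng = 2**31 every fractional bit is zero, so the result is simply 8 times the initial l.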
+ +An overview of the decoder is given in . + +
+
+   +---------+    +------------+
+-->| Range   |--->| Decode     |---------------------------+
+ 1 | Decoder | 2  | Parameters |----------+       5        |
+   +---------+    +------------+     4    |                |
+                       3 |                |                |
+                        \/               \/               \/
+                  +------------+   +------------+   +------------+
+                  | Generate   |-->| LTP        |-->| LPC        |
+                  | Excitation |   | Synthesis  |   | Synthesis  |
+                  +------------+   +------------+   +------------+
+                                                           |
+                      +------------------------------------+
+                      |                                   6
+                      |   +------------+   +------------+
+                      +-->| Stereo     |-->| Resampling |-->
+                        8 | Unmixing   | 7 |            | 8
+                          +------------+   +------------+
+
+1: Range encoded bitstream
+2: Coded parameters
+3: Pulses and gains
+4: Pitch lags and LTP coefficients
+5: LPC coefficients
+6: Decoded signal (mono or mid-side stereo)
+7: Unmixed signal (mono or left-right stereo)
+8: Resampled signal
+
+Decoder block diagram.
+ + + +The decoder feeds the bitstream (1) to the range decoder from + , and then decodes the parameters in it (2) + using the procedures detailed in + Sections  + through . +These parameters (3, 4, 5) are used to generate an excitation signal (see + ), which is fed to an optional + long-term prediction (LTP) filter (voiced frames only, see + ) and then a short-term prediction filter + (see ), producing the decoded signal (6). +For stereo streams, the mid-side representation is converted to separate left + and right channels (7). +The result is finally resampled to the desired output sample rate (e.g., + 48 kHz) so that the resampled signal (8) can be mixed with the CELT + layer. + + +
+ + + +
+ Internally, the LP layer of a single Opus frame is composed of either a single 10 ms regular SILK frame or between one and three 20 ms regular SILK @@ -1216,9 +1308,12 @@ This draft uses "SILK frame" to refer to either one and "regular SILK frame" if it needs to draw a distinction between the two. -Each SILK frame is in turn composed of either two or four 5 ms subframes. +Logically, each SILK frame is in turn composed of either two or four 5 ms + subframes. Various parameters, such as the quantization gain of the excitation and the pitch lag and filter coefficients can vary on a subframe-by-subframe basis. +Physically, the parameters for each subframe are interleaved in the bitstream, + as described in the relevant sections for each parameter. All of these frames and subframes are decoded from the same range coder, with @@ -1239,6 +1334,15 @@ It would be required to do so anyway for hybrid Opus frames, or to support decoding individual 20 ms frames. + + summarizes the overall grouping of the contents of + the LP layer. +Figures  + and  illustrate + the ordering of the various SILK frames for a 60 ms Opus frame according + to the rules described, for mono and stereo, respectively. + + Symbol(s) PDF(s) @@ -1269,126 +1373,102 @@ Organization of the SILK layer of an Opus frame. -
- -An overview of the decoder is given in . - -
-
-   +---------+    +------------+
--->| Range   |--->| Decode     |---------------------------+
- 1 | Decoder | 2  | Parameters |----------+       5        |
-   +---------+    +------------+     4    |                |
-                       3 |                |                |
-                        \/               \/               \/
-                  +------------+   +------------+   +------------+
-                  | Generate   |-->| LTP        |-->| LPC        |-->
-                  | Excitation |   | Synthesis  |   | Synthesis  | 6
-                  +------------+   +------------+   +------------+
-
-1: Range encoded bitstream
-2: Coded parameters
-3: Pulses and gains
-4: Pitch lags and LTP coefficients
-5: LPC coefficients
-6: Decoded signal
-
-Decoder block diagram.
+
-
- - The range decoder decodes the encoded parameters from the received bitstream. Output from this function includes the pulses and gains for generating the excitation signal, as well as LTP and LSF codebook indices, which are needed for decoding LTP and LPC coefficients needed for LTP and LPC synthesis filtering the excitation signal, respectively. - -
+
+ +
-
- - Pulses and gains are decoded from the parameters that were decoded by the range decoder. - +
- - When a voiced frame is decoded and LTP codebook selection and indices are received, LTP coefficients are decoded using the selected codebook by choosing the vector that corresponds to the given codebook index in that codebook. This is done for each of the four subframes. - The LPC coefficients are decoded from the LSF codebook by first adding the chosen LSF vector and the decoded LSF residual signal. The resulting LSF vector is stabilized using the same method that was used in the encoder; see - . The LSF coefficients are then converted to LPC coefficients, and passed on to the LPC synthesis filter. - -
- -
- - The pulses signal is multiplied with the quantization gain to create the excitation signal. - -
- -
- - For voiced speech, the excitation signal e(n) is input to an LTP synthesis filter that recreates the long-term correlation removed in the LTP analysis filter and generates an LPC excitation signal e_LPC(n), according to -
- - - -
- using the pitch lag L, and the decoded LTP coefficients b_i. - The number of LTP coefficients is 5, and thus d = 2. - - For unvoiced speech, the output signal is simply a copy of the excitation signal, i.e., e_LPC(n) = e(n). -
-
- -
- - In a similar manner, the short-term correlation that was removed in the LPC analysis filter is recreated in the LPC synthesis filter. The LPC excitation signal e_LPC(n) is filtered using the LTP coefficients a_i, according to -
- - - -
- where d_LPC is the LPC synthesis filter order, and y(n) is the decoded output signal. -
-
-
- - - -
+
The LP layer begins with two to eight header bits, decoded in silk_Decode() - (silk_dec_API.c). + (dec_API.c). These consist of one Voice Activity Detection (VAD) bit per frame (up to 3), followed by a single flag indicating the presence of LBRR frames. -For a stereo packet, these flags correspond to the mid channel, and a second - set of flags is included for the side channel. +For a stereo packet, these first flags correspond to the mid channel, and a + second set of flags is included for the side channel. -Because these are the first symbols decoded by the range coder, they can be - extracted directly from the upper bits of the first byte of compressed data. +Because these are the first symbols decoded by the range coder and because they + are coded as binary values with uniform probability, they can be extracted + directly from the most significant bits of the first byte of compressed data. Thus, a receiver can determine if an Opus frame contains any active SILK frames without the overhead of using the range decoder.
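Because the header flags sit in the most significant bits of the first byte, a receiver could peek at them with plain bit operations. The sketch below illustrates that idea under the ordering the text gives (per-frame VAD bits, then the LBRR flag, with the mid channel before the side channel). The type silk_header_flags and the function peek_header_flags() are hypothetical names, not part of the reference decoder.

```c
#include <stdint.h>

/* Hypothetical container for the flags read from the first byte. */
typedef struct {
    int vad[2][3]; /* vad[channel][frame]; channel 0 is mid (or mono), 1 is side */
    int lbrr[2];   /* global LBRR flag for each channel */
} silk_header_flags;

/* Reads the header flags straight from the first byte of compressed data.
   channels must be 1 or 2 and frames must be 1, 2, or 3, so at most
   eight bits are consumed. */
static void peek_header_flags(uint8_t first_byte, int channels, int frames,
                              silk_header_flags *out)
{
    int bit = 7; /* start at the most significant bit */
    int c;
    int f;
    for (c = 0; c < channels; c++) {
        /* One VAD bit per SILK frame... */
        for (f = 0; f < frames; f++) {
            out->vad[c][f] = (first_byte >> bit) & 1;
            bit--;
        }
        /* ...followed by a single LBRR flag for the channel. */
        out->lbrr[c] = (first_byte >> bit) & 1;
        bit--;
    }
}
```

A receiver that only needs to know whether any SILK frame is active could OR the VAD entries together and skip the range decoder entirely.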
-
+
-For Opus frames longer than 20 ms, a set of per-frame LBRR flags is +For Opus frames longer than 20 ms, a set of LBRR flags is decoded for each channel that has its LBRR flag set. -For 40 ms Opus frames the 2-frame LBRR flag PDF from - is used, and for 60 ms Opus frames - the 3-frame LBRR flag PDF is used. +Each set contains one flag per 20 ms SILK frame. +40 ms Opus frames use the 2-frame LBRR flag PDF from + , and 60 ms Opus frames use the + 3-frame LBRR flag PDF. For each channel, the resulting 2- or 3-bit integer contains the corresponding LBRR flag for each frame, packed in order from the LSb to the MSb. @@ -1400,12 +1480,19 @@ For each channel, the resulting 2- or 3-bit integer contains the corresponding 60 ms {0, 41, 20, 29, 41, 15, 28, 82}/256 + +A 10 or 20 ms Opus frame does not contain any per-frame LBRR flags, + as there may be at most one LBRR frame per channel. +The global LBRR flag in the header bits (see ) + is already sufficient to indicate the presence of that single LBRR frame. + +
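The LSb-to-MSb packing of the per-frame LBRR flags can be illustrated with a short C sketch. The function name unpack_lbrr_flags() is hypothetical; the input is assumed to be the 2- or 3-bit integer already decoded with the appropriate PDF.

```c
/* Hypothetical helper: splits the decoded LBRR symbol into one flag per
   20 ms SILK frame. nframes is 2 for 40 ms and 3 for 60 ms Opus frames.
   Frame 0 is stored in the least significant bit. */
static void unpack_lbrr_flags(int symbol, int nframes, int flags[3])
{
    int f;
    for (f = 0; f < 3; f++) {
        flags[f] = (f < nframes) ? ((symbol >> f) & 1) : 0;
    }
}
```

For example, a decoded value of 5 in a 60 ms Opus frame means that LBRR frames are present for the first and third 20 ms intervals but not the second.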
-The LBRR frames, if present, immediately follow, one per set LBRR flag, and - prior to any regular SILK frames. +The LBRR frames, if present, immediately follow, as indicated by the LBRR + flags, and prior to any regular SILK frames. describes their exact contents. LBRR frames do not include their own separate VAD flags. LBRR frames are only meant to be transmitted for active speech, thus all LBRR @@ -1413,12 +1500,13 @@ LBRR frames are only meant to be transmitted for active speech, thus all LBRR -In a stereo Opus frame longer than 20 ms, although all the per-frame LBRR - flags for the mid channel are coded before the per-frame LBRR flags for the - side channel, the LBRR frames themselves are interleaved. -The LBRR frame for the mid channel of a given 20 ms interval (if present) - is immediately followed by the corresponding LBRR frame for the side channel - (if present). +In a stereo Opus frame longer than 20 ms, although the per-frame LBRR + flags for the mid channel are coded as a unit before the per-frame LBRR flags + for the side channel, the LBRR frames themselves are interleaved. +The decoder parses an LBRR frame for the mid channel of a given 20 ms + interval (if present) and then immediately parses the corresponding LBRR + frame for the side channel (if present), before proceeding to the next + 20 ms interval.
@@ -1428,8 +1516,9 @@ The regular SILK frame(s) follow the LBRR frames (if any). describes their contents, as well. Unlike the LBRR frames, a regular SILK frame is always coded for each time interval in an Opus frame, even if the corresponding VAD flag is unset. -Like the LBRR frames, in stereo Opus frames longer than 20 ms, the mid and - side frames are interleaved for each 20 ms interval. +For stereo Opus frames longer than 20 ms, the regular mid and side SILK + frames for each 20 ms interval are interleaved, just as with the LBRR + frames. The side frame may be skipped by coding an appropriate flag, as detailed in . @@ -1437,11 +1526,20 @@ The side frame may be skipped by coding an appropriate flag, as detailed in
-Each SILK frame includes a set of side information that encodes the frame type, - quantization type and gains, short-term prediction filter coefficients, an LSF - interpolation weight, long-term prediction filter lags and gains, and a - linear congruential generator (LCG) seed. -The quantized excitation signal follows these at the end of the frame. +Each SILK frame includes a set of side information that encodes + +The frame type and quantization type (), +Quantization gains (), +Short-term prediction filter coefficients (), +An LSF interpolation weight (), + +Long-term prediction filter lags and gains (), + and + +A linear congruential generator (LCG) seed (). + +The quantized excitation signal (see ) follows + these at the end of the frame. details the overall organization of a SILK frame. @@ -1544,9 +1642,17 @@ In that case, the previous weights are used, again substituting in zeros if no previous weights are available since the last decoder reset. + +To summarize, these weights are coded if and only if + +This is a stereo Opus frame (), and +The current SILK frame corresponds to the mid channel. + + + The prediction weights are coded in three separate pieces, which are decoded - by silk_stereo_decode_pred() (silk_decode_stereo_pred.c). + by silk_stereo_decode_pred() (decode_stereo_pred.c). The first piece jointly codes the high-order part of a table index for both weights. The second piece codes the low-order part of each table index. @@ -1603,6 +1709,7 @@ w0_Q13 = w_Q13[wi0] - w1_Q13 ]]>
+N.b., w1_Q13 is computed first here, because w0_Q13 depends on it. A flag appears after the stereo prediction weights that indicates if only the mid channel is coded for this time interval. -It is omitted when there are no stereo weights, i.e., unless the SILK frame - corresponds to the mid channel of a stereo Opus frame, and it is also omitted - for an LBRR frame when the corresponding LBRR flags indicate the side channel - is present. -When present, the decoder reads a single value using the PDF in +It appears only when + +This is a stereo Opus frame (see ), +The current SILK frame corresponds to the mid channel, and +Either + +This is a regular SILK frame, or + +This is an LBRR frame where the corresponding LBRR flags + (see and ) + indicate the side channel is not coded. + + + + +It is omitted when there are no stereo weights, and it is also omitted for an + LBRR frame when the corresponding LBRR flags indicate the side channel is + coded. +When the flag is present, the decoder reads a single value using the PDF in , as implemented in - silk_stereo_decode_mid_only() (silk_decode_stereo_pred.c). + silk_stereo_decode_mid_only() (decode_stereo_pred.c). If the flag is set, then there is no corresponding SILK frame for the side channel, the entire decoding process for the side channel is skipped, and zeros are used during the stereo unmixing process. @@ -1707,17 +1828,51 @@ The quantization gains are themselves uniformly quantized to 6 bits on a of approximately 1.94 dB to 88.21 dB. -For the first LBRR frame, an LBRR frame where the previous LBRR frame in the - same channel is not coded, or the first regular SILK frame in the current - channel of an Opus frame, the first subframe uses an independent coding - method. -In a stereo Opus frame, the mid-only flag (from - ) may cause the first regular SILK frame in - the side channel to occur in a later time interval than the first regular SILK - frame in the mid channel. 
-The 3 most significant bits of the quantization gain are decoded using a PDF - selected from based on the - decoded signal type. +The subframe gains are either coded independently, or relative to the gain from + the most recent coded subframe in the same channel. +Independent coding is used if and only if + + +This is the first subframe in the current SILK frame, and + +Either + +This is the first LBRR frame for this channel in the current Opus frame, + +This is an LBRR frame where the LBRR flags (see + and ) + indicate the previous LBRR frame in the same channel is not coded, or + + +This is the first regular SILK frame for this channel in the current Opus + frame. + + + + + + +There are a few subtle points here that may benefit from some clarification. +The rules for uncoded LBRR frames are very different from the rules for regular + SILK frames for the side channel of a stereo Opus frame. +Both allow gaps in the sequence of coded frames for a channel, the former based + on the LBRR flags, and the latter on the mid-only flag (from + ). +LBRR frames do not use relative coding to predict across these gaps, while + regular SILK frames in the side channel do. +In particular, in a 60 ms stereo Opus frame, if the first and third + regular SILK frames in the side channel are coded, but the second is not, the + first subframe of the third frame is still coded relative to the last subframe + in the first frame. +In contrast, in a similar situation with LBRR frames, the first subframe of the + third frame would use independent coding, even if the mid-only flag for the + second frame was 0. + + +In an independently coded subframe gain, the 3 most significant bits of the + quantization gain are decoded using a PDF selected from + based on the decoded signal + type (see ). 
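The rule for choosing between independent and relative gain coding can be written as a small predicate. The following C sketch is only an illustration of the conditions listed above. The structure silk_frame_ctx, its fields, and the function gain_is_independent() are hypothetical names, and the caller is assumed to have derived the fields from the LBRR flags and the frame position.

```c
#include <stdbool.h>

/* Hypothetical description of the SILK frame being decoded. */
typedef struct {
    bool is_lbrr;          /* true for an LBRR frame, false for a regular one */
    bool first_in_channel; /* first LBRR (or first regular) frame for this
                              channel in the current Opus frame */
    bool prev_lbrr_coded;  /* LBRR frames only: the LBRR flags say the previous
                              LBRR frame in this channel is coded */
} silk_frame_ctx;

/* Returns true if the gain of the given subframe is coded independently. */
static bool gain_is_independent(const silk_frame_ctx *ctx, int subframe)
{
    /* Only the first subframe of a SILK frame can be independent. */
    if (subframe != 0) {
        return false;
    }
    if (ctx->first_in_channel) {
        return true;
    }
    /* LBRR frames never predict across a gap of uncoded LBRR frames.
       Regular side-channel frames skipped by the mid-only flag do not
       trigger this case, so they keep using relative coding. */
    return ctx->is_lbrr && !ctx->prev_lbrr_coded;
}
```

With this predicate, the third regular side frame of a 60 ms stereo Opus frame stays relative even when the second one was skipped, while the third LBRR frame becomes independent when the second LBRR frame is not coded.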
-For all other subframes (including the first subframe of frames not listed as - using independent coding above), the quantization gain is coded relative to - the gain from the previous subframe (in the same channel). -In particular, unlike an LBRR frame where the previous frame is not coded, in a - 60 ms stereo Opus frame, if the first and third regular SILK frames - in the side channel are coded, but the second is not, the first subframe of - the third frame is still coded relative to the last subframe in the first - frame. +For subframes which do not have an independent gain (including the first + subframe of frames not listed as using independent coding above), the + quantization gain is coded relative to the gain from the previous subframe (in + the same channel). The PDF in yields a delta gain index between 0 and 40, inclusive. @@ -1770,8 +1921,8 @@ log_gain = min(max(2*gain_index - 16, ]]> -silk_gains_dequant() (silk_gain_quant.c) dequantizes the gain for the - k'th subframe and converts it into a linear Q16 scale factor via +silk_gains_dequant() (gain_quant.c) dequantizes the gain for the k'th subframe + and converts it into a linear Q16 scale factor via
gain_Q16[k] = silk_log2lin((0x1D1C71*log_gain>>16) + 2090)
@@ -1779,32 +1930,26 @@ gain_Q16[k] = silk_log2lin((0x1D1C71*log_gain>>16) + 2090)
-The function silk_log2lin() (silk_log2lin.c) computes an approximation of - of 2**(inLog_Q7/128.0), where inLog_Q7 is its Q7 input. +The function silk_log2lin() (log2lin.c) computes an approximation of + 2**(inLog_Q7/128.0), where inLog_Q7 is its Q7 input. Let i = inLog_Q7>>7 be the integer part of inLog_Q7 and f = inLog_Q7&127 be the fractional part. -Then, if i < 16, then +Then
>16)+f)>>7)*(1<>16)) << (i - 7) ]]>
yields the approximate exponential. -Otherwise, silk_log2lin uses -
->16)+f)*((1<>7) . -]]> -
-Normalized Line Spectral Frequency (LSF) coefficients follow the quantization - gains in the bitstream, and represent the Linear Predictive Coding (LPC) - coefficients for the current SILK frame. +A set of normalized Line Spectral Frequency (LSF) coefficients follow the + quantization gains in the bitstream, and represent the Linear Predictive + Coding (LPC) coefficients for the current SILK frame. Once decoded, the normalized LSFs form an increasing list of Q15 values between 0 and 1. These represent the interleaved zeros on the unit circle between 0 and pi @@ -2088,7 +2233,7 @@ This gives the index, I2[k], a total range of -10 to 10, inclusive. The decoded indices from both stages are translated back into normalized LSF - coefficients in silk_NLSF_decode() (silk_NLSF_decode.c). + coefficients in silk_NLSF_decode() (NLSF_decode.c). The stage-2 indices represent residuals after both the first stage of the VQ and a separate backwards-prediction step. The backwards prediction process in the encoder subtracts a prediction from @@ -2126,7 +2271,7 @@ There are two lists for NB and MB, and another two lists for WB, giving two The prediction is undone using the procedure implemented in - silk_NLSF_residual_dequant() (silk_NLSF_decode.c), which is as follows. + silk_NLSF_residual_dequant() (NLSF_decode.c), which is as follows. Each coefficient selects its prediction weight from one of the two lists based on the stage-1 index, I1. gives the selections for each @@ -2335,7 +2480,7 @@ The cb1_Q8[] vector completely determines these weights, and they may be inclusive) to avoid computing them when decoding. 
The reference implementation already requires code to compute these weights on unquantized coefficients in the encoder, in silk_NLSF_VQ_weights_laroia() - (silk_NLSF_VQ_weights_laroia.c) and its callers, so it reuses that code in the + (NLSF_VQ_weights_laroia.c) and its callers, so it reuses that code in the decoder instead of using a pre-computed table to reduce the amount of ROM required. @@ -2506,9 +2651,10 @@ The next section describes a stabilization procedure used to make these
+ The normalized LSF stabilization procedure is implemented in - silk_NLSF_stabilize() (silk_NLSF_stabilize.c). + silk_NLSF_stabilize() (NLSF_stabilize.c). This process ensures that consecutive values of the normalized LSF coefficients, NLSF_Q15[], are spaced some minimum distance apart (predetermined to be the 0.01 percentile of a large training set). @@ -2615,7 +2761,7 @@ For 20 ms SILK frames, the first half of the frame (i.e., the first two current frame. A Q2 interpolation factor follows the LSF coefficient indices in the bitstream, which is decoded using the PDF in . -This happens in silk_decode_indices() (silk_decode_indices.c). +This happens in silk_decode_indices() (decode_indices.c). For the first frame after a decoder reset, when no prior LSF coefficients are available, the decoder still decodes this factor, but ignores its value and always uses 4 instead. @@ -2640,7 +2786,7 @@ n1_Q15[k] = n0_Q15[k] + (w_Q2*(n2_Q15[k] - n0_Q15[k]) >> 2) . ]]> This interpolation is performed in silk_decode_parameters() - (silk_decode_parameters.c). + (decode_parameters.c).
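The interpolation formula for the first half of a 20 ms SILK frame can be sketched in C as shown here. This is an illustration, not the code from decode_parameters.c: the function name interpolate_nlsf() is hypothetical, and the sketch assumes that n0_Q15 holds the previous frame's normalized LSFs and n2_Q15 the current frame's, which is how the surrounding text reads. It also assumes that right-shifting a negative value is an arithmetic shift, as it is on common compilers.

```c
#include <stdint.h>

/* Hypothetical helper implementing
     n1_Q15[k] = n0_Q15[k] + (w_Q2*(n2_Q15[k] - n0_Q15[k]) >> 2)
   for each of the order coefficients. w_Q2 ranges from 0 to 4. */
static void interpolate_nlsf(int16_t n1_Q15[], const int16_t n0_Q15[],
                             const int16_t n2_Q15[], int w_Q2, int order)
{
    int k;
    for (k = 0; k < order; k++) {
        /* Widen before subtracting so the difference cannot overflow. */
        int32_t diff = (int32_t)n2_Q15[k] - (int32_t)n0_Q15[k];
        /* Assumes an arithmetic right shift for negative products. */
        n1_Q15[k] = (int16_t)(n0_Q15[k] + ((w_Q2 * diff) >> 2));
    }
}
```

A factor of 4 reproduces the current frame's coefficients exactly, which matches the rule that the first frame after a decoder reset always uses 4.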
@@ -2692,7 +2838,7 @@ Q(z) = (1 - z ) * | | (1 - 2*cos(pi*n[2*k+1])*z + z ) However, SILK performs this reconstruction using a fixed-point approximation so that all decoders can reproduce it in a bit-exact manner to avoid prediction drift. -The function silk_NLSF2A() (silk_NLSF2A.c) implements this procedure. +The function silk_NLSF2A() (NLSF2A.c) implements this procedure. To start, it approximates cos(pi*n[k]) using a table lookup with linear @@ -2792,7 +2938,7 @@ c_Q17[k] = (cos_Q13[i]*256 + (cos_Q13[i+1]-cos_Q13[i])*f + 8) >> 4 ,
-Given the list of cosine values, silk_NLSF2A_find_poly() (silk_NLSF2A.c) +Given the list of cosine values, silk_NLSF2A_find_poly() (NLSF2A.c) computes the coefficients of P and Q, described here via a simple recurrence. Let p_Q16[k][j] and q_Q16[k][j] be the coefficients of the products of the first (k+1) root pairs for P and Q, with j indexing the coefficient number. @@ -2881,9 +3027,8 @@ This is an approximation of the chirp factor needed to reduce the target too large. -silk_bwexpander_32() (silk_bwexpander_32.c) performs the bandwidth expansion - (again, only when maxabs_Q12 is greater than 32767) using the following - recurrence: +silk_bwexpander_32() (bwexpander_32.c) performs the bandwidth expansion (again, + only when maxabs_Q12 is greater than 32767) using the following recurrence:
> 16 @@ -2920,14 +3065,16 @@ This saturation is not performed if maxabs_Q12 drops to 32767 or less prior to
+The prediction gain of an LPC synthesis filter is the square-root of the output + energy when the filter is excited by a unit-energy impulse. Even if the Q12 coefficients would fit, the resulting filter may still have a significant gain (especially for voiced sounds), making the filter unstable. silk_NLSF2A() applies up to 18 additional rounds of bandwidth expansion to limit the prediction gain. Instead of controlling the amount of bandwidth expansion using the prediction gain itself (which may diverge to infinity for an unstable filter), - silk_NLSF2A() uses LPC_inverse_pred_gain_QA() (silk_LPC_inv_pred_gain.c) - to compute the reflection coefficients associated with the filter. + silk_NLSF2A() uses LPC_inverse_pred_gain_QA() (LPC_inv_pred_gain.c) to + compute the reflection coefficients associated with the filter. The filter is stable if and only if the magnitude of these coefficients is sufficiently less than one. The reflection coefficients, rc[k], can be computed using a simple Levinson @@ -3012,7 +3159,7 @@ If abs(a32_Q16[k][k]) <= 65520 for On round i, 1 <= i <= 18, if the filter passes this stability check, then this procedure stops, and the final LPC coefficients to - use for reconstruction are + use for reconstruction in are
> 5 . @@ -3047,11 +3194,26 @@ Each subframe also gets its own prediction gain coefficient. The primary lag index is coded either relative to the primary lag of the prior frame or as an absolute index. -Like the quantization gains, the first LBRR frame, an LBRR frame where the - previous LBRR frame was not coded, and the first regular SILK frame in each - channel of an Opus frame all code the pitch lag as an absolute index. -When the most recent coded frame in the current channel was not voiced, this - also forces absolute coding. +Like the quantization gains, the primary pitch lag is coded either as an + absolute index, or relative to the most recent coded frame in the same + channel. +Absolute coding is used if and only if + +This is the first LBRR frame for this channel in the current Opus frame, + +This is an LBRR frame where the LBRR flags (see + and ) + indicate the previous LBRR frame in the same channel is not coded, + + +This is the first regular SILK frame for this channel in the current Opus + frame, or + + +The most recently coded frame in the current channel was not voiced + (see ). + + In particular, unlike an LBRR frame where the previous frame is not coded, in a 60 ms stereo Opus frame, if the first and third regular SILK frames in the side channel are coded, voiced frames, but the second is not coded, the @@ -3135,10 +3297,10 @@ After the primary pitch lag, a "pitch contour", stored as a single entry from The codebook index is decoded using one of the PDFs in depending on the current frame size and audio bandwidth. - through - give the corresponding offsets - to apply to the primary pitch lag for each subframe given the decoded codebook - index. +Tables through + give the + corresponding offsets to apply to the primary pitch lag for each subframe + given the decoded codebook index. The final pitch lag for each subframe is assembled in silk_decode_pitch() - (silk_decode_pitch.c). + (decode_pitch.c). 
Let lag be the primary pitch lag for the current SILK frame, contour_index be index of the VQ codebook, and lag_cb[contour_index][k] be the corresponding entry of the codebook from the appropriate table given above for the k'th @@ -3300,9 +3462,9 @@ This immediately follows the subframe pitch lags, and is coded using the The index of the filter to use for each subframe follows. They are all coded using the PDF from corresponding to the periodicity index. - through - contain the corresponding filter taps - as signed Q7 integers. +Tables through + contain the + corresponding filter taps as signed Q7 integers. @@ -3453,14 +3615,27 @@ They are all coded using the PDF from
-In some circumstances an LTP scaling parameter appears after the LTP filter - coefficients. +An LTP scaling parameter appears after the LTP filter coefficients if and only + if + +This is a voiced frame (see ), and +Either + +This is the first LBRR frame for this channel in the current Opus frame, + +This is an LBRR frame where the LBRR flags (see + and ) + indicate the previous LBRR frame in the same channel is not coded, or + + +This is the first regular SILK frame for this channel in the current Opus + frame. + + + + This allows the encoder to trade off the prediction gain between packets against the recovery time after packet loss. -Like the quantization gains, only the first LBRR frame in an Opus frame, - an LBRR frame where the prior LBRR frame was not coded, and the first regular - SILK frame in each channel of an Opus frame include this field, and, like all - of the other LTP parameters, only for frames that are also voiced. Unlike absolute-coding for pitch lags, a regular SILK frame other than the first one in a channel will not include this field even if the prior frame was not voiced. @@ -3531,7 +3706,7 @@ SILK also handles large codebooks by coding the least significant bits (LSb's) of each coefficient directly. This adds a small coding efficiency loss, but greatly reduces the computation time and ROM size required for decoding, as implemented in - silk_decode_pulses() (silk_decode_pulses.c). + silk_decode_pulses() (decode_pulses.c). @@ -3648,8 +3823,8 @@ The cumulative distribution for rate level 10 is just a shifted version of
-The locations of the pulses in each shell block follows the pulse counts, - as decoded by silk_shell_decoder() (silk_shell_coder.c). +The locations of the pulses in each shell block follow the pulse counts, + as decoded by silk_shell_decoder() (shell_coder.c). As with the pulse counts, these locations are coded for all the shell blocks before any of the remaining information for each block. Unlike many other codecs, SILK places no restriction on the distribution of @@ -3666,9 +3841,9 @@ The process then recurses into the left half, and after that returns, the right half (preorder traversal). The PDF to use is chosen by the size of the current partition (16, 8, 4, or 2) and the number of pulses in the partition (1 to 16, inclusive). - through - list the PDFs used for each partition - size and pulse count. +Tables through + list the PDFs used for + each partition size and pulse count. This process skips partitions without any pulses, i.e., where the initial pulse count from was zero, or where the split in the prior level indicated that all of the pulses fell on the other side. @@ -3805,9 +3980,12 @@ The decoder chooses the PDF for the sign based on the signal type and quantization offset type (from ) and the number of pulses in the block (from ). The number of pulses in the block does not take into account any LSb's. -If a block has no pulses, even if it has some LSb's (and thus may have some - non-zero coefficients), then no signs are decoded. -In that case, any non-zero coefficients use a positive sign. +Most PDFs are skewed towards negative signs because of the quantization offset, + but the PDFs for zero pulses are highly skewed towards positive signs. +If a block contains many positive coefficients, it is sometimes beneficial to + code it solely using LSb's (i.e., with zero pulses), since the encoder may be + able to save enough bits on the signs to justify the less efficient + coefficient magnitude encoding. 
Quantization Offset Type Pulse Count PDF +Inactive Low 0 {2, 254}/256 Inactive Low 1 {207, 49}/256 Inactive Low 2 {189, 67}/256 Inactive Low 3 {179, 77}/256 Inactive Low 4 {174, 82}/256 Inactive Low 5 {163, 93}/256 Inactive Low 6 or more {157, 99}/256 +Inactive High 0 {58, 198}/256 Inactive High 1 {245, 11}/256 Inactive High 2 {238, 18}/256 Inactive High 3 {232, 24}/256 Inactive High 4 {225, 31}/256 Inactive High 5 {220, 36}/256 Inactive High 6 or more {211, 45}/256 +Unvoiced Low 0 {1, 255}/256 Unvoiced Low 1 {210, 46}/256 Unvoiced Low 2 {190, 66}/256 Unvoiced Low 3 {178, 78}/256 Unvoiced Low 4 {169, 87}/256 Unvoiced Low 5 {162, 94}/256 Unvoiced Low 6 or more {152, 104}/256 +Unvoiced High 0 {48, 208}/256 Unvoiced High 1 {242, 14}/256 Unvoiced High 2 {235, 21}/256 Unvoiced High 3 {224, 32}/256 Unvoiced High 4 {214, 42}/256 Unvoiced High 5 {205, 51}/256 Unvoiced High 6 or more {190, 66}/256 +Voiced Low 0 {1, 255}/256 Voiced Low 1 {162, 94}/256 Voiced Low 2 {152, 104}/256 Voiced Low 3 {147, 109}/256 Voiced Low 4 {144, 112}/256 Voiced Low 5 {141, 115}/256 Voiced Low 6 or more {138, 118}/256 +Voiced High 0 {8, 248}/256 Voiced High 1 {203, 53}/256 Voiced High 2 {187, 69}/256 Voiced High 3 {176, 80}/256 @@ -3856,6 +4040,97 @@ In that case, any non-zero coefficients use a positive sign.
+
+ + +After the signs have been read, there is enough information to reconstruct the + complete excitation signal. +This requires adding a constant quantization offset to each non-zero sample, + and then pseudorandomly inverting and offsetting every sample. +The constant quantization offset varies depending on the signal type and + quantization offset type (see ). + + + +Signal Type +Quantization Offset Type +Quantization Offset (Q10) +Inactive Low 100 +Inactive High 240 +Unvoiced Low 100 +Unvoiced High 240 +Voiced Low 32 +Voiced High 100 + + + +Let e_raw[i] be the raw excitation value at position i, with a magnitude + composed of the pulses at that location (see + ) combined with any additional LSb's (see + ), and with the corresponding sign decoded in + . +Additionally, let seed be the current pseudorandom seed, which is initialized + to the value decoded from for the first sample in + the current SILK frame, and updated for each subsequent sample according to + the procedure below. +Finally, let offset_Q10 be the quantization offset from + . +Then the following procedure produces the final reconstructed excitation value, + e_Q10[i]: +
+ +
+When e_raw[i] is zero, sign() returns 0 by the definition in + , implying that no quantization offset gets added. +The final e_Q10[i] value may require more than 16 bits per sample, but will not + require more than 32. +
+ +
+ +
+ +
+ +
+ +For voiced speech, the excitation signal e(n) is input to an LTP synthesis filter that recreates the long-term correlation removed in the LTP analysis filter and generates an LPC excitation signal e_LPC(n), according to +
+ +
+ using the pitch lag L, and the decoded LTP coefficients b_i. +The number of LTP coefficients is 5, and thus d = 2. +For unvoiced speech, the output signal is simply a copy of the excitation signal, i.e., e_LPC(n) = e(n). +
+
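The LTP synthesis step described above can be illustrated with a minimal floating-point sketch. It assumes the usual recursive (all-pole) form, e_LPC(n) = e(n) + sum over i = -d..d of b_i * e_LPC(n - L - i), with the five taps stored as b[0..4]. The function name, the buffer layout (a block of past output followed by the new samples), and the use of double precision are assumptions made for illustration only; they are not taken from the reference implementation, which uses fixed-point arithmetic.

```c
#include <stddef.h>

#define LTP_TAPS 5  /* number of LTP coefficients b_i, per the text */
#define LTP_D    2  /* d = (LTP_TAPS - 1) / 2 */

/*
 * Illustrative LTP synthesis sketch; the name and signature are hypothetical.
 *
 * e       : n excitation samples e(0..n-1) for the current block
 * e_lpc   : buffer of history + n samples; the first `history` entries hold
 *           previously synthesized output, the remainder is written here
 * lag     : pitch lag L, in samples
 * b       : taps b_{-d}..b_{d}, stored as b[0..LTP_TAPS-1]
 * voiced  : non-zero for voiced frames
 *
 * Returns 0 on success, or -1 if the lag is too small or the history too
 * short for every tap to refer to an already-available sample.
 */
int ltp_synthesis_sketch(const double *e, double *e_lpc, size_t history,
                         size_t n, size_t lag, const double b[LTP_TAPS],
                         int voiced)
{
    size_t k;
    int i;

    if (voiced && (lag <= LTP_D || history < lag + LTP_D))
        return -1;

    for (k = 0; k < n; k++) {
        double acc = e[k];
        if (voiced) {
            for (i = -LTP_D; i <= LTP_D; i++) {
                /* Position of e_LPC(n - L - i) inside the buffer. */
                ptrdiff_t idx = (ptrdiff_t)(history + k)
                              - (ptrdiff_t)lag - i;
                acc += b[i + LTP_D] * e_lpc[idx];
            }
        }
        /* For unvoiced frames this is a plain copy: e_LPC(n) = e(n). */
        e_lpc[history + k] = acc;
    }
    return 0;
}
```

With only the center tap set to 0.5 and a lag of 4, a unit impulse at the input reappears every 4 samples at half the previous amplitude, which is the long-term correlation the filter is meant to restore.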
+ +
+ +In a similar manner, the short-term correlation that was removed in the LPC analysis filter is recreated in the LPC synthesis filter. The LPC excitation signal e_LPC(n) is filtered using the LPC coefficients a_i, according to +
+ +
+ where d_LPC is the LPC synthesis filter order, and y(n) is the decoded output signal. +
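As a rough illustration of the LPC synthesis step, the sketch below assumes the conventional all-pole recursion y(n) = e_LPC(n) + sum over i = 1..d_LPC of a_i * y(n - i), with a_i stored as a[i-1] and with output samples before the start of the block treated as zero. The function name, the sign convention, the zero initial state, and the use of double precision are all assumptions for illustration; the reference decoder carries filter state across frames and works in fixed point.

```c
#include <stddef.h>

/*
 * Illustrative LPC synthesis sketch; the name and signature are hypothetical.
 *
 * e_lpc : n LPC excitation samples e_LPC(0..n-1)
 * y     : n output samples, written by this function
 * a     : d_lpc prediction coefficients, a[i-1] holding a_i
 * d_lpc : LPC synthesis filter order
 *
 * Returns 0 on success, or -1 on a NULL argument.
 */
int lpc_synthesis_sketch(const double *e_lpc, double *y, size_t n,
                         const double *a, size_t d_lpc)
{
    size_t k, i;

    if (!e_lpc || !y || (d_lpc > 0 && !a))
        return -1;

    for (k = 0; k < n; k++) {
        double acc = e_lpc[k];
        /* Only past outputs inside this block contribute (zero state). */
        for (i = 1; i <= d_lpc && i <= k; i++)
            acc += a[i - 1] * y[k - i];
        y[k] = acc;
    }
    return 0;
}
```

For a first-order filter with a_1 = 0.5, a unit impulse produces a geometrically decaying output, showing how the short-term correlation is reintroduced.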
@@ -3901,10 +4176,10 @@ An overview of the decoder is given in . The decoder is based on the following symbols and sets of symbols: - -Symbol(s) -PDF -Condition + +Symbol(s) +PDF +Condition silence {32767, 1}/32768 post-filter {1, 1}/2 octave uniform (6)post-filter @@ -4558,7 +4833,7 @@ signal and repeats the windowed waveform using the pitch offset. The windowed waveform is overlapped in such a way as to preserve the time-domain aliasing cancellation with the previous frame and the next frame. This is implemented in celt_decode_lost() (mdct.c). In SILK mode, the PLC uses LPC extrapolation -from the previous frame, implemented in silk_PLC() (silk_PLC.c). +from the previous frame, implemented in silk_PLC() (PLC.c).
@@ -5424,31 +5699,41 @@ However, on certain CPU architectures where denormalized floating-point Denormals can be introduced by reordering operations in the compiler and depend on the target architecture, so it is difficult to guarantee that an implementation avoids them. -For architectures on which denormals are problematic, it is RECOMMENDED to -add very small floating-point offsets to the affected signals -to prevent significant numbers of denormalized - operations. Alternatively, it is often possible to configure the hardware to treat +For architectures on which denormals are problematic, adding very small + floating-point offsets to the affected signals to prevent significant numbers + of denormalized operations is RECOMMENDED. +Alternatively, it is often possible to configure the hardware to treat denormals as zero (DAZ). No such issue exists for the fixed-point reference implementation. The reference implementation was validated in the following conditions: -Sending the decoder valid packets generated by the reference encoder and -verifying that the decoder's final range coder state matches that of the encoder. -Sending the decoder packets generated by the reference encoder, after random corruption. -Sending the decoder random packets to the decoder. -Altering the encoder to make random coding decisions (internal fuzzing), including -mode switching and verifying that the range coder final states match. + +Sending the decoder valid packets generated by the reference encoder and + verifying that the decoder's final range coder state matches that of the + encoder. + + +Sending the decoder packets generated by the reference encoder and then + subjected to random corruption. + +Sending the decoder random packets. + +Sending the decoder packets generated by a version of the reference encoder + modified to make random coding decisions (internal fuzzing), including mode + switching, and verifying that the range coder final states match. 
+ -In all of the conditions above, both the encoder and the decoder were run inside -the Valgrind memory debugger, which tracks reads and writes to invalid memory -regions, as well as use of uninitialized memory. There were no error reported -on any of the tested conditions. +In all of the conditions above, both the encoder and the decoder were run + inside the Valgrind memory + debugger, which tracks reads and writes to invalid memory regions as well as + the use of uninitialized memory. +There were no errors reported on any of the tested conditions.
-
+
This document has no actions for IANA. @@ -5549,7 +5834,7 @@ Kat Walsh, for their feedback on the draft. Constrained-Energy Lapped Transform (CELT) Codec - +