`Case 1:11—cv—00797-RGA Document 113-1
`Filed 01/24/12 Page 1 of 18 Page|D #: 733
`
`Ex. 1
`
`
`
`Case 1:11-cv-00797-RGA Document 113-1 Filed 01/24/12 Page 2 of 18 PageID #: 734
`Case 1:11—cv—00797-RGA Document 113-1
`Filed 01/24/12 Page 2 of 18 Page|D #: 734
`
`ITU - Telecommunications Standardization Sector
`STUDY GROUP 16 Question 6
`Video Coding Experts Group (VCEG)
`
`Document VCEG-M1§
`Filename: VCEG-M.16.doc
`Generated: 27 March ’01
`
`Thirteenth Meeting: Austin, Texas, USA, 2-4 April, 2001
`
`Question: Q.6/SG16(VCEG)
`
`Source:
`
`Jie Liang (j|iang@fastvdo.com),
`Trac Tran (tran@fastvdo.com),
`Pankaj Topiwala (pnt@fastvdo.com)
`
`Tel:
`Fax:
`Email:
`
`410-730-9191
`410-730-2957
`pnt@fastvdo.com
`
`5565 Sterrett Pl., #322
`Columbia, MD 21044 USA
`
`Title:
`
`A 16-bit architecture for H.26L, treating DCT transforms and quantization
`
`Purpose:
`
`Proposal.
`
`Abstract
`
`For a variety of applications, especially for use in wireless devices with limited power, memory, and CPU
`capacity, it is desirable to have a fixed, low bit-depth integer codec implementation.
`Integer-only codecs
`are also needed if a fully lossless mode is required. A key obstacle to achieving these goals for
`transform-based video codecs is the nature of the transform; all others aspects of video coding can
`generally be maintained to satisfy a fixed low bit-depth (e.g., 16 bits). Most transforms require floating
`point operations, e.g., the true discrete cosine transform, DCT, is a floating-point transform. A variety of
`integer approximations to the DCT have been proposed in the literature, and are used in various
`formulations of existing standards (e.g., JPEG onwards). These approximations differ in the complexity of
`implementation, accuracy of approximation to the true DCT, and compression performance.
`In this
`document, we propose a new approximation to the DCT, named binDCT, which is fully invertible for
`lossless coding, requires no multiplications (only bit-shifts and adds), yet offers operating points within the
`accuracy, complexity, and performance trade space which were previously unavailable. Furthermore, a
`family of increasingly sophisticated solutions is available which offers performance/complexity scalability.
`Our DCT solution maintains 13 bits for 9-bit input, and thus offers the opportunity to build the entire H.26L
`codec in a 16-bit architecture. The quantization of transform coefficients is also treated in this document.
`
`This document mainly addresses the 4x4 transform that is at the heart of the existing H.26L codec.
`However, methods such as adaptive block transforms call for other block transform sizes such as 8x8 and
`16x16. We also have binDCT solutions for both of these sizes. The 8x8 binDCT preserves 15 bits for a
`9-bit input, and the 16x16 binDCT preserves 17 bits for a 9-bit input. The latter can also be made to
`preserve 16 bits by applying a variety of available methods such as scaling, statistical sampling, and
`approximation. Our solutions have complexity, accuracy and performance advantages over other
`approximate DCT solutions recently proposed for these applications [6 - 8], which will be explored
`elsewhere. For now, we briefly present our 8x8 and 16x16 solutions diagrammatically without further
`analysis.
`
`Introduction
`
`A lifting-based fast approximation of the 4-point DCT, named the binDCT, is proposed for the H.26L
`standard, which can be implemented with only addition and right-shift operations. The binDCT can be well
`fitted into a 16-bit architecture and has the same performance as that of the integer transform used in the
`current H.26L test model. Most of the quantization can also be implemented with 16-bit operations,
`except the RD-optimized quantization. The proposed binDCT is implemented within TML 5.2, and its
`performance is demonstrated.
`
`1. Review of the integer DCT in the current H.26L test model
`
`File:VCEG—M16—Fast:VDO—4.01-KEY
`
`Page:
`
`1
`
`Date Printed: 1/13/12
`
`
`
`Case 1:11-cv-00797-RGA Document 113-1 Filed 01/24/12 Page 3 of 18 PageID #: 735
`Case 1:11—cv—00797-RGA Document 113-1
`Filed 01/24/12 Page 3 of 18 Page|D #: 735
`
`1.1. Integer DCT in TML 5.2
`This section summarizes the integer transform and the corresponding quantization strategy used in the
`H.26L test model 5.2 [1].
`
`The forward integer transform used in the current H.26L test model is as follows:
`
`A=13a+13b+13c+13d,
`B=17a+ 7b- 7c-17d,
`C=13a-13b-13c+13d,
`D= 7a-17b+17c- 7d,
`
`where a, b, c. and d are 4 input pixels. The inverse transformation of transform coefficients A,B,C,D into 4
`pixels a'.b‘,c'.d‘ is defined by:
`
`a'=13A+17B+13C+ 7D,
`b'=13A+ 7B-13C-17D,
`c'=13A- 7B-13C+17D,
`d'=13A-17B+13C- 7D.
`
`This method is a scaled version of the Chen-Wang factorization of the DCT matrix [2] [3]. The relation
`between a and a' is: a' = 676a. Besides, after 2-D transform, the integer transform coefficients are about
`676 times the true DCT coefficients. This scaling is then compensated in various quantization and
`dequantization steps.
`
`1.2 Quantization in TML 5.2
`
`There are 32 different QP values used in the current H.26L. The total range of step size from smallest to
`largest is about the same as for H.263. The QP signalled in the bitstream applies for luma quantization /
`dequantization. This is called QP.u,,,a. For chroma quantization / dequantization a different value -
`QPchmma - is used. The relationship between the two is:
`
`QP.um.,, 012345678910111213141516171819202122232425262728293031
`
`QPch,om,,012345678910111213141516171718192020212222232324242525
`
`When QP is used in the following it means QP..,ma or QPch,oma depending on what is appropriate.
`
`Two arrays of numbers are used for quantization/dequantization2
`
`A(QP=0,..,31) =
`(620,553,492,439,391,348,310,276,246.219,195,174,155,138,123.110,98,87,78,69,62,55,49,44,39,35,
`31 ,27,24,22,19,17},
`
`B(QP=0,..,31) =
`
`{3881,4351.4890,5481,6154.6914.7761,8718,9781,10987,12339,13828,15523.17435,19561.21873,2455
`2,27656,30847,34870,38807.43747,49103,54683,61694,68745.77615,89113,100253,109366,126635,14
`1533}.
`
`The relation between A() and B() is: A(QP)xB(QP)x6762 = 24°.
`
`Given this notation, a transform coefficient K is quantized in the following way:
`
`LEVEL = (KxA(QP) + fx22°)/22°;
`
`|f| is 1/3 for intra and 1/6 for inter blocks and f has the same sign as K.
`
`Given the quantized result LEVEL, the dequantization is performed by:
`
`K’ = LEVELxB(QP)
`
`In summary, let Y be the 4x4 input pixels. and K0 be one of its true 2-D DCT coefficients, then the
`relationship between K0 and the corresponding 2-D integertransform coefficient K is approximately:
`
`K 1: 676 K0
`
`After quantization:
`
`File:\/CEG-M16-FastVDO-4.01-KEY
`
`Page:
`
`2
`
`Date Printed: 1/13/12
`
`
`
`Case 1:11-cv-00797-RGA Document 113-1 Filed 01/24/12 Page 4 of 18 PageID #: 736
`Case 1:11—cv—00797-RGA Document 113-1
`Filed 01/24/12 Page 4 of 18 Page|D #: 736
`
`LEVEL 4 K x A(QP) / 22°: 676 x K0 x A(QP) / 22°-
`
`After dequantization, the reconstructed signal for K is:
`
`K‘ .9 LEVEL x B(QP) .9 676 x K0 x A(QP) x B(QP) / 22°-
`
`The 2-D inverse transform introduces another factor of 676, therefore the reconstructed pixel is:
`
`Y" =1 6762 x Y‘ x A(QP) x B(QP) / 22°-
`
`Where Y‘ is the reconstructed pixel by the true DCT. Since the arrays A and B are designed such that
`A(QP) x B(QP) x 6762 = 24°‘ we therefore have
`
`Y" == Y‘ x 22°-
`
`Thus a down-shift of 20 bits (with rounding) is needed.
`
`It can be seen from the above that the equivalent quantization stepsizes for the true DCT coefficients are:
`
`22° / (676 x A(QP)).
`
`When QP changes from O to 31, the stepsizes are:
`
`DCTQ(QP=0,..,31) =
`
`[2.5o19,
`
`2.8050,
`
`3.1527,
`
`3.5334,
`
`3.9671,
`
`4.4573,
`
`5.0037,
`
`5.6201,
`
`6.3055,
`
`7.0829,
`
`7.9546,
`
`8.9146,
`
`10.0074, 11.2402, 12.6110, 14.1013,
`
`15.8280,
`
`17.8293, 19.8865, 22.4804, 25.0185, 28.2027, 31.6561, 35.2534,
`
`39.7730, 44.3185, 50.0370, 57.4499, 64.6312, 70.5067, 81.6394, 91.2440];
`
`There is an increase of about 12% from one stepsize to the next one.
`
`2. The binDCT
`
`2.1 Chen-Wang Algorithm of the DCT and the Lifting Scheme
`The proposed binDCT is also based on the well-known Chen-Wang plane rotation-based factorization of
`the DCT matrix [2] [3]. For 4-point DCT, their factorization is shown in Fig. 1.
`
`Xt0I
`
`.
`
`T 99‘
`\
`3/
`X111
`._
`,«_____;D
`3/
`\>/'
`\-.
`IA
`,/419
`\
`
`M21
`mi.
`
`-
`
`_
`
`‘-.;-1
`
`3121..
`
`,
`
`L
`*3 F3 ,.
`)1‘ "$1.1
`_/__@f_ _ Y[2]
`-C1:/4
`L
`Y[1]
`c3..67:_1@_
`xg S31:/8
`_ .§;§3"’3 ti.-*1 .
`
`C3rttE%
`
`Fig 1. Plane rotation-based factorization of the 4-point DCT.
`
`where c0 and s6 mean cos(6) and sin(6), respectively. Note that a uniform scaling factor of 1/sqrt(2) is
`required at the end of the signal flow to get the true DCT coefficient.
`
`File:VCE‘.G-M16-FastVDO—4.0l—KEY
`
`Page:
`
`3
`
`Date Printed: 1/13/12
`
`
`
`Case 1:11-cv-00797-RGA Document 113-1 Filed 01/24/12 Page 5 of 18 PageID #: 737
`Case 1:11—cv—00797-RGA Document 113-1
`Filed 01/24/12 Page 5 of 18 Page|D #: 737
`
`The plane rotations in this factorization are the main difficulty for the 16-bit implementation of the 4-point
`DCT. In this proposal, the lifting scheme [4] will be used to obtain a fast approximation of the DCT. Fig. 2
`illustrates the decomposition of the plane rotations in Fig. ‘l by three lifting steps, as shown in [4]:
`
`
`
`(b)
`(a)
`Fig 2. (a) Representation of a plane rotation by 3 lifting steps. (b) Inverse transform.
`
`Each lifting step is a biorthogonal transform, and its inverse also has a simple lifting structure, as shown
`in Fig. 2(b). This means that to inverse a lifting step, we simply need to subtract out what was added in at
`the forward transform. Hence the original signal can still be perfectly reconstructed even if the floating-
`point multiplication results in the lifting steps are rounded to integers, as long as the same procedure is
`applied to both the forward and inverse transform. This is the basis for many lifting-based lossless
`transforms.
`
`To obtain fast implementation, we can further approximate the floating-point lifting coefficients by
`hardware-friendly dyadic values (i.e., rational numbers in the format of k/2"‘, where k and m are integers),
`which can be implemented by only shift and addition operations. This also enables the transform to be
`implemented with a narrower data bus than the original method.
`
`2.2 The scaled Lifting Structure
`To further reduce the complexity of the lifting-based fast DCT, we use a scaled lifting structure to
`represent the plane rotation, as proposed in [5]. The general structure is given in Fig. 3 (a) and (b), where
`a general butterfly (not necessarily an orthogonal plane rotation) is represented by 2 lifting steps and 2
`scaling factors. The two scaling factors can be absorbed in the quantization stage, thus only 2 lifting steps
`are left in the transform, making it more efficient than the representation in Fig. 2.
`
`The parameters in Fig. 3(b) are given by:
`
`P = r12 / T11.
`U = F11 X F21/(V11 X F22 — F21 X T12).
`K1=F11.
`K2 = (F11 X |‘22- F21 X F12) / V11.
`
`If the parameters in Fig. 3(a) are orthogonal plane rotation, as shown in Fig. 3(c), its scaled lifting
`structure can be represented as in Fig. 3(d). As noted in [5], for rotation angles close to (k+1/2)n, this
`scaled lifting structure is sensitive to finite-length approximations, and it also increases the dynamic range
`of the intermediate results. However, it can be shown that in this case, a permuted version of the scaled
`lifting structure will lead to a more robust result, and the dynamic range is also minimized. The permuted
`rotation and its corresponding scaled lifting structure are illustrated in Fig. 3(e) and (f).
`
`E‘ile:VCEG-M16-E‘astVDO—4.01-KEY
`
`Page: 4
`
`Date Printed: 1/13/12
`
`
`
`Case 1:11-cv-00797-RGA Document 113-1 Filed 01/24/12 Page 6 of 18 PageID #: 738
`Case 1:11—cv—00797—RGA Document 113-1
`Filed 01/24/12 Page 6 of 18 Page|D #: 738
`
`"E19
`
`l "El“' Y1
`
`Lg ..Ei_. Y2
`
`.
`
`-69 —L
`g3
`3;
`.
`‘\ / sina
`tuna. —uu(rx:om
`.
`/\
`_
`.
`5|
`x‘-£11111
`X2 4 —ta— Y2 x2—A~ee . pl. —~Y2
`
`cow
`V
`mm
`
`com ]—- Y1
`
`(c)
`—sina
`X1 T-\\
`‘@ "' Y2
`fa
`com
`I.
`x_\ com
`_® ____ Y1
`
`X2
`
`..v’
`
`.
`
`/5‘
`
`(d)
`
`X1 W -SiI'l(1|
`_
`L sinorcosucl
`rand
`..
`V .
`
`1
`flfét
`
`
`
`X2,
`
`* Y2
`
`0-Y1
`
`(E)
`(e)
`(c) An orthogonal plane rotation; (d)
`Fig. 3. (a) A general rotation; (b) The scaled lifting structure for (a);
`The scaled lifting structure for (c); (e) The permuted version of (c); (f) The scaled lifting structure for (e).
`
`2.3 General Structure of the binDCT
`
`From the above results. we can obtain the general structure for the lifting-based 4-point DCT, denoted by
`binDCT. The forward and inverse transforms are given in Fig. 4. The rotations of 7!:/4 and 3::/8 are both
`represented by 2 lifting steps. Note that the permuted scaled lifting structure is used for 37:/8. and
`because of the butterflies. a scaling factor of 2 is introduced after the inverse transform, which can be
`compensated by a right-shift.
`
`x[0]H
`E
`
`‘
`
`£9
`-’
`
`-+J
`in?
`
`r — - — ---
`-r.
`%___I
`
`.——~-~
`- Y[0] —~:__2__;
`
`
`
`2x[0]
`
`i_
`
`I
`
`2x[1]
`
`V
`
`,
`
`1421
`
`\>\;,
`"
`
`.
`_
`'_ _ _ _
`,_'_s_ir1—31_t/t;"_,
`
`_ _
`
`_
`
`4
`
`'\\
`.39
`
`"R
`
`__
`
`{X91
`
`2x.[3]
`
`(a)
`
`(b)
`
`Fig. 4 (a) Forward transform of the 4-point binDCT; (b) Inverse transform of the 4-point binDCT.
`
`File:VCE1G—M16-E‘astVDO-4.01-KEY
`
`Page:
`
`5
`
`Date Printed: 1/13/12
`
`
`
`Case 1:11-cv-00797-RGA Document 113-1 Filed 01/24/12 Page 7 of 18 PageID #: 739
`Case 1:11—cv—00797-RGA Document 113-1
`Filed 01/24/12 Page 7 of 18 Page|D #: 739
`
`
`
`
`3/8
`
`3/8
`
`5
`
`1/2
`
`3/8
`
`4
`
`1/2
`
`1/2
`
`3
`
`10
`
`9
`
`8
`
`3/8
`
`5
`
`10
`
`
`
`
`
`Table 1 tabulates the floating-point values of the parameters p and u in Fig. 4, as well as some of their
`dyadic approximations, which allow for fast implementations with only shift and addition operations. The
`coding gains of the floating-point 4-point DCT and various binDCT configurations are also presented in
`Table 1. showing that the binDCT is very close to the true DCT in terms of the energy compaction
`capability.
`
`The first configuration in Table 1 with p = 7/16 and u = 3/8 will be used in the following tests. which
`requires 5 shifts and 10 additions. As a comparison, the integer transform in the current TML 5.2 needs 6
`fixed-point multiplications and 8 additions. Note that 7/16x can be implemented as 1/2x — 1/16x, and 3/8x
`can be obtained by 1/2x — 1/8x. That is, only right-shifts are involved, which minimizes the dynamic
`ranges of the intermediate results. The forward and inverse transforms for this configuration are (without
`considering the scaling factors):
`
`Forward transform=
`
`1
`
`1
`
`1
`
`1
`
`107/128
`
`3/8
`
`-3/8 -107/128
`
`1/2
`
`-1/2 -1/2
`
`1/2
`
`I.
`
`7/16
`
`-1
`
`1
`
`-7/16
`
`.
`
`Inverse transform =
`
`1/2
`
`1/2
`
`1
`
`1
`
`3/8
`
`7/16
`
`-1 -107/128
`
`1/2 -7/16 -1
`
`107/128
`
`'
`
`1/2
`
`-1
`
`1
`
`-3/8
`
`2.4. Scaling Factors and Quantization
`
`The scaling factors in Fig. 4 are incorporated into the quantization. The 1-D scaling vector is:
`
`81 = [0.50000O0000O0O0 0.765366864723018 1.00000000000000 0.653281482413819],
`
`and the corresponding 2-D binDCT scaling matrix is
`
`s2 = S1Tx s1 =
`0.25
`
`0.3826834323 6509
`
`0.5
`
`0.326640741219119"
`
`0.383268343236509 0.58578643762690 0.76536686473018
`0.5
`0.76536686473018
`1.0
`
`0.5
`0.653281482438|9 ‘
`
`0.32664074121909
`
`0.5
`
`0.65328148243819
`
`(}.426'/7669529664
`
`where S1T is the transpose of S1 .
`
`File:VCEG—Ml6—FastVDO-4.0l—KEY
`
`Page:
`
`6
`
`Date Printed: 1/13/12
`
`binDCT Confi
`
`.
`
`P
`
`U
`
`0.41421356237310
`
`7/16
`
`0.135355339059327
`
`Number of Shifts
`
`-
`
`-
`
`Numberof Adds
`
`
`
`
`
`
`7.5485 Codirl Gain dB 7.5701 true DCT 7.5697 7.5566 7.5493
`Table 1. Some configurations of the binDCT and their performances.
`
`
`
`Case 1:11-cv-00797-RGA Document 113-1 Filed 01/24/12 Page 8 of 18 PageID #: 740
`Case 1:11—cv—00797-RGA Document 113-1
`Filed 01/24/12 Page 8 of 18 Page|D #: 740
`
`It can be verified that if X is a given 4x4 image block, and Y is its true 2-D DCT transform, the relationship
`between the binDCT transform coefficient Yb and Y is:
`
`where 69 represents entry-wise multiplications between two matrices.
`
`Y = Yb (9 82,
`
`In the current H.26L test model, the integer transform has a uniform scaling factor for all transform
`coefficients, which is no longer true in the binDCT, because of the 2-D scaling matrix S2. Hence, to
`maintain the compatibility of the binDCT and the true DCT, a quantization matrix Qb should be applied to
`the binDCT results, whose elements are defined as:
`
`Qb(i, j) = DCTQ(QP) / S2(i, j),
`
`i,j = 1 to 4.
`
`where DCTQ is the equivalent quantization stepsize for the true DCT coefficients, as derived in the first
`section.
`
`Note that the quantization matrix is determined by both the binDCT scaling matrix and the QP value.
`Since the first frame could have different QP value from other frames, and QP.um.,, and QPch,oma might be
`different, two quantization matrices should be defined for each frame --- Qbchmma for the chroma
`quantization and Qbluma for the luma part. These two matrices should be updated after the processing of
`the first frame, if the QP of the remaining frames is different from the first frame.
`
`The vector DCTQ and the binDCT scaling matrix 82 have floating-point entries. In implementation, two
`methods can be used to obtain integer quantization matrix. In the first method, floating-point operations
`can be used to obtain the floating-point values of Qb from DCTQ and S2, these results can then be
`rounded into integers.
`
`In order to generate the quantization matrix with 16-bit fixed-point operations, we can scale DCTQ by a
`factor of 8, and 1 /S2(i,j) by a factor of 16, for i,j = 1
`4. The results are then rounded to integer and
`stored as look-up tables. Denote the integer results of the scaled DCTQ and 1/S2 by SDCTQ and S82,
`the integer binDCT quantization table can be computed by 16-bit operations as:
`Qb(i, j) = (SDCTQ(QP) * SS2(i, j) + 64)/ 128,
`i,j = 1 to 4.
`Both Qbchmma and Qbluma can be performed in this way.
`
`This scaling approach introduces some round—off errors for the binDCT quantization table, compared with
`the quantization stepsizes used by the current TML, especially for smaller QP. However, experiments
`show that the effect is negligible for QP >= 8.
`
`After the quantization matrix is obtained, the quantization can be performed in a similar way to the current
`integer transform. That is,
`
`LEVEL(i, j) = ( Yb(i, j) + fx Qb(i, j) )/ Qb(i, j);
`
`4 can be calculated in advance, after the
`To speed up the quantization, the fx Qb(i, j), i, j = 1
`quantization matrix Qb is available. In the current H.26L test model, f can be one of 3 values: 1/3 for lntra
`blocks, 1/6 for inter blocks, and 0.45 for RD-optimized quantizations. As a result, 8 4x4 look-up tables can
`be defined for the quantization of each frame, storing Qbch,°ma_ 1/3Qbch,oma, 1/6Qbch,oma, 0.45Qbch,°ma,
`Qb.uma,1/3Qb.uma, 1/6Qb.u,,,a, and 0.45Qb.uma, respectively, where the 0.45Qb can be obtained by Qb/2 —
`Qb/20.
`
`Since B-frames are introduced in the latest TML 5.9, more quantization tables are needed to take care
`the QP value for B frames.
`
`The dequantization is simply performed by
`
`Yb‘(i, j) = LEVEL(i, j) * Qb(i, j).
`
`E‘ile:\/CEG-M16-l:‘astVDO-4.0l—KEY
`
`Page:
`
`7
`
`Date Printed: 1/13/12
`
`
`
`Case 1:11-cv-00797-RGA Document 113-1 Filed 01/24/12 Page 9 of 18 PageID #: 741
`Case 1:11-cv-00797-RGA Document 113-1
`Filed 01/24/12 Page 9 of 18 Page|D #: 741
`
`2.5 Dynamic Range Analysis
`The elimination of the floating-point and fixed-point multiplications enables the binDCT to be implemented
`with narrower data bus than other fast algorithms.
`
`After the prediction, the input data to the binDCT are 9-bit signed integers, ranging from -255 to 255. To
`check the dynamic range of the binDCT, we examine the signs of the binDCT coefficients and find out the
`set of input data which would generate the maximum or minimum outputs in different binDCT subbands.
`For example, the second row of the forward binDCT used in this proposal is:
`
`[107/128
`
`3/8
`
`-3/8
`
`-107/128],
`
`thus the input x = [255. 255, -255, -255] will generate the maximum output for the second binDCT
`coefficient, and x = [-255, -255, 255, 255] will give its minimum.
`
`it may appear that intermediate calculations, which amount to multiplying numbers such as 107x255,
`would lead us outside a desired low-bit range (e.g., beyond 16-bits). However. given our method of
`calculating these transforms in terms of cascaded lifting steps, these multiplications never take place. We
`are able to verify that the entire calculation, beginning to end. stays within 13 bits, as shown in figure 5. In
`general, as all lifting parameters in the binDCT are less than 1, they can be implemented with addition
`and right-shift operations, which can minimize the intermediate dynamic range. Furthermore, since the
`first row has the maximum absolute sum, the dynamic range of the binDCT is therefore determined by the
`DC subband.
`
`Xw]
`
`[—255,255]
`[—255,255]
`
`(9)_
`(9)
`
`\
`\
`1
`
`[—255,255]
`[ 255 255]
`—
`
`(9)
`(9)
`
`I
`
`x[2]
`
`Km
`
`1
`
`>1
`1'
`
`I
`1/ la
`1
`J/\\ /_'_‘é_)
`E9
`_
`
`[—510_ 510] (10)
`[—510,510]
`(10)
`
`[—S10,S10]
`
`(10)
`
`I-='>10,5101<10>
`
`[-1020, 1020] (11)
`[—510,510] (11)
`
`I
`
`.
`]1/2]
`
`[—10201020] (111 W0]
`[—510, 510]
`(_11_; Ym
`
`__
`3
`-7/1'6"‘
`I
`
`[—733,733] (11)
`
`[—733,733] (I I)
`
`-
`
`1
`l3/=81
`1-5«1<x5il1>..(.{.,+I-61zL%1U12Y[,]
`
`Figure 5. Dynamic range of intermediate calculations for our 1-D 4-point binDCT. The numbers in
`brackets give the worst case range, and numbers in parentheses give the bit depth. The 1D transform
`thus maps 9-bit integers to 11-bit integers. The 2D transform takes 9-bits to 13-bits.
`
`It can be verified that given the input range of [-255, 255], the 1-D binDCT outputs are within [-1020,
`1020], which requires 11 bits to represent. The range of the 2-D binDCT output is [-4080, 4080], which
`needs 13 bits. If all the lifting parameters are implemented with only right shifts, the intermediate results
`are less than the final results in the worst cases. Therefore, the binDCT transform can be well fitted into
`16-bit architecture. The implementation of figure 5 in C is shown below.
`
`transform */
`/* Horizontal
`for
`(i=O;i<4;i++)
`
`{ M
`
`5[0]= M0[i1[ii][0][JJ] + M0[1][ii][3][jj],
`M5[3]= M0[i][ii1[0l[jJ] - M0[11[ii][3]ljj],
`M5[l]= M0[i][ii][1llJJ] + M0[1][ii][2][jj],
`M5[21= M0[i][ii1[1lijj] - M0[i][ii1[21[jj1,
`M0[i1[ii][0][jj] = M510]
`+ M5[l1;
`— M5111;
`M0[i][ii][21[jj] =
`(M0[i1[ii1[0][jj] >> 1)
`MO[i][ii][3][jj] = ((M5[3] >> 1)
`-
`(M5[3] >> 4))
`- M5[2];
`M0[i][ii][11[jj] = M513]
`- ((M0[i][ii][3][jj] >> 1)
`- (M0[i][ii][3][jj] >>
`3));
`
`File:VCEG—M16—Fast:VDO—4.01-KEY
`
`Page:
`
`8
`
`Date Printed: 1/13/12
`
`
`
`Case 1:11-cv-00797-RGA Document 113-1 Filed 01/24/12 Page 10 of 18 PageID #: 742
`Case 1:11—cv—00797-RGA Document 113-1
`Filed 01/24/12 Page 10 of 18 Page|D #: 742
`
`2.6 Rate Distortion constrained quantization (RDQ)
`
`The current RD-constrained quantization function needs some modifications to be used in the binDCT
`framework. The main reason is that in the current integer transform, the quantization stepsize is uniform
`for all coefficients; therefore the RDQ function doesn't need to know the coordinates of each coefficient
`after the zigzag scanning. However, to use the binDCT, we have to pass the scanning pattern information
`to the RDQ functions, so that it can apply the correct quantization stepsizes for different binDCT
`coefficients. This can be done by simply passing to the RDQ function the pointer to the scanning pattern
`table, which could be single scan or double scan, depends on the current modes. For the quantization of
`the 2x2 chroma DC, this pointer is not used.
`
`Within the RDQ function, the program needs to check the intra or inter mode to decide whether to use
`Qbmma or Qbchmma (including 1/3Qb or 1/6Qb). Besides, the pointer to the Qb table should be returned to
`the binDCT routine for the dequantization processing.
`
`The quantization decisions in the new RDQ functions are similar to the one in the current test model. The
`only difference is in the calculation of the RD_gain. When the program scans through all coefficients and
`accumulates the 'RD_gain' by representing a coefficient by a nonzero reconstruction, a factor of 64 is
`introduced so that integer calculations may be used. The formula is:
`
`RD_gain = 64x(Bits + Ax((K‘ — Level)2 — K'2)) = 64x(Bits + }»x(Leve|2 — 2xK'xLevel) )
`
`= 64xBits + }»xLevelx(64xLevel — 128xK') = 64xBits + >»xLevelx(64xLeve|-KxA(QP)/213)
`
`where K is a transform coefficient, and K’ is a floating-point representation of the quantized result of K.
`
`in binDCT, this is simply
`
`64xBits + 7»xLevelx(64xLevel — (128 x K) / bquant)
`
`where bquant is the quantization stepsize corresponding to K, obtained from the current Qb table.
`
`This is the only part that couldn't be implemented with 16-bit operations. If RD-constrained quantization is
`not desired, the whole DCT and quantization can be implemented with 16-bit architecture.
`
`2. 7 Alternative binDCT structures:
`
`There are some other alternatives to the aforementioned binDCT structures. In Fig. 6 (a), the rotation
`
`angle of 3n/8 is implemented as 3 lifting steps. Compared with Fig. 4 (a), this structure needs a little more
`operations, but has nicer scaling factors. Some configurations of this kind of binDCT are shown in Table
`2.
`
`In Fig. 6 (b), we replace the butterflies by lifting steps, which allow easier integer to integer mapping. The
`inverse transform is simplified, because in the previous structures, the inverse transform has to take care
`of the downscaling caused by the butterflies.
`
`xlll
`
`x12]
`
`x[0l\
`
`1"
`'~__;
`‘\\f._
`‘Wu’
`A
`5
`..=’
`‘
`'\
`1-*9
`/at
`,
`3 /
`‘U- __@_
`
`Wt!
`
`_
`
`—
`“ti 1
`I
`P1_l|U| T'3_
`4;, 1-55
`
`(a)
`
`v_: %.::_.Y[0]
`
`xmI,Q
`
`__: ","‘._¥I21
`._ _
`I
`.
`'
`--. Ytll
`
`x[_11
`
`.i2i
`
`_@ _
`I
`I
`
`3}
`
`_
`
`-.
`
`YB,
`L2 2.
`_ _
`4
`
`.
`
`_
`
`_
`-‘I-*_l._.__.@.
`
`YL01
`
`'_®J‘
`
`l"?
`_@
`: Y[2]
`r"|'
`--
`'-
`__
`_
`_
`;----- -~, Y[3]
`17 "L"?‘_"“_31’?:"
`lL_’___ll_‘.‘_
`Y,
`i._..(,5: _.; w_v;?m;
`.'-.l
`.
`_ _ _ __
`
`([3)
`
`Fig. 6. Two alternative structures for the binDCT.
`
`File:VCEG—M16-Fast:VDO-4.01-KEY
`
`Page:
`
`9
`
`Date Printed: 1/13/12
`
`
`
`Case 1:11-cv-00797-RGA Document 113-1 Filed 01/24/12 Page 11 of 18 PageID #: 743
`Case 1:11—cv—00797—RGA Document 113-1
`Filed 01/24/12 Page 11 of 18 Page|D #: 743
`
`-5/8
`-O66817863791930
`
`0.923879353251129
`
`
`P2
`-0.668178637919130
`-21/32
`-5/8
`-5/8
`-3/4
`
`Number of Shifts
`
`-
`
`6
`
`5
`
`6
`
`5
`
`
`
`Numberof Adds
`«
`12
`11
`12
`11
`7.5030 7.5685 7.5649
`7.5560
`
`Codin Gain (dB
`Table 2. Some configurations of the binDCT in Fig. 4 (a) and their performances.
`
`3. BinDCT for sizes 8x8 and 16x16
`
`We have fast, multiplierless solutions for the sizes 8x8 and 16x16 as well, which are briefly summarized
`in the following figures 7 and 8. As mentioned, these can be used if DCTs of these sizes are required.
`
`x[0]_
`
`.
`
`xm
`
`\
`
`«ea
`
`x[2] . _- ._,_
`
`X13] >1
`
`x[4]_._
`
`.. __
`
`x[5] -
`
`X16]
`
`x[7]
`
`'
`
`_-
`
`Figure 7. An 8x8 binDCT example, from [9]. All special constants are dyadic rationals.
`
`File:VCEG—Ml6—Fast:VDO-4.01-KEY
`
`Page:
`
`10
`
`Date Printed: 1/13/12
`
`
`
`Case 1:11-cv-00797-RGA Document 113-1 Filed 01/24/12 Page 12 of 18 PageID #: 744
`Case 1:11—cv—00797—RGA Document 113-1
`Filed 01/24/12 Page 12 of 18 Page|D #: 744
`
`x[O]
`‘l
`1
`""1 V
`,
`'-.
`x[2]
`.\I-'.|I‘||z"f@
`l\||\
`I
`mi —--1--fix
`/
`1
`X141
`\ fl
`‘[51
`~\‘\'-.-
`x[6]
`\
`‘
`».[7]_--
`X181 —"-397::
`
`'1
`
`*
`
`.
`
`\.
`
`_E
`411121!/4 :_, km]
`~.’._:?>
`W.‘ * xm
`.''I
`-‘A
`If!
`‘.I\
`._."(B
`XLG]
`«,1.
`mini]
`.“
`I,»
`-~7-__.®~’— --ea Ll —é>" — e X121
`"I.-1&3
`-:
`__
`xm
`
`‘=1.
`Xi“
`15:19.1 l&*fli1»'4l
`/
`X[5]
`J)
`..F
`-2"
`xm
`'59
`X1131
`
`,_
`
`.__ _:
`.._
`N
`fa‘;
`.
`
`_
`
`_:§/3?1!?"+"E'11i«‘_5'+'1
`
`ii __,_
`-319. __1..
`4%,)
`
`_
`
`/
`
`'.\
`.39
`
`..
`.4;
`
`-
`
`I
`“'“§‘i
`
`-9 _
`
`ff)
`/'
`
`\
`'\
`
`V
`
`_+
`
`["37
`49
`
`I
`
`_
`
`'
`
`'
`
`‘
`
`',~.'\
`
`
`
`fr/"1
`
`11*
`
`'11-
`
`-'1
`
`1.
`
`=
`
`‘_—rI'a"?:i;l:_i|'—'.\%II§§\'§‘-t3
`"_,’E,'/1.'“1-1:-“G3
`x[1U|
`xl111'—',"./1 "‘ 111‘-‘\‘*"'|"'
`~:[:21—
`‘ 1'
`r}:
`xnsi
`/7 ‘
`1?
`1
`.
`x[14|
`"
`‘H3
`'1
`I
`l
`x[15:
`-— —‘
`
`-
`_
`41.3
`
`11/32.1.5-‘£5
`
`.1_1/32. 1.9-8; M W32 .2/_-12.15/16 2/_321.m=:.
`
`.1__:.2‘-?ra.=
`
`.‘-:»*'‘-.‘~
`
`1/
`
`'
`
`'-
`
`I‘.
`
`_-'i""j.--
`
`1
`
`'d‘\._
`
`;-
`_
`-6.53
`
`--
`
`-
`
`'
`—-——
`
`—
`
`.
`'{‘)_
`L?‘ _ _ —
`'
`_
`:"_-."\_‘_\_J_;;@ "\\
`‘(ll
`_ __—
`'/-13%:
`"r_$1'?_
`
`
`I‘ J _| ‘Tl.
`_ _'_?—,‘t3——|i J_. "5"T'1i="—'-15-_:—-?f**¢*Tt_._|
`l
`ll
`1 \
`9 .,_1i:’1l‘,;€er
`1
`«:3
`
`J7
`_.v;?\:‘'-_
`1 _ '-
`" "$9" "'1'
`119
`\
`J;
`1
`-_ --ta--—' ‘ --1&3
`
`I
`
`1&1
`
`«laJn3"f_31J'“ X[3]
`_-~ mi
`*;
`1".-.“ X115]
`-:__y2__; »x[1]
`I- xm
`I. —i——~‘
`‘-5233? I
`-X151
`1
`;
`4./__
`. _
`|i/15 3'15 |
`1‘ -é:"- ~;,,s,,,;,,,,;.'- ~ x[11]
`
`F“; ‘.
`
`-
`
`,
`
`_
`
`-- — — ‘W
`
`Figure 9. A 16x16 binDCT example. from [9].
`
`4. Experimental Results
`
`The attached results are obtained with 100 frames of various test sequences, running on a PC with
`Pentium III 550MHz CPU and Windows 2000 OS.
`
`In the implementation of the inverse binDCT transform, if all the AC coefficients are zero, the inverse
`transform is bypassed and all the outputs are set to one half of the DC coefficients. This can make the
`inverse transform a slightly faster.
`
`In the following tests, we encode and decode a sequence with two different methods, and measure their
`difference in term of PSNR Y and the compressed bit rate.
`
`First, we compare the two ways of obtaining the quantization table for the binDCT. The first method use
`floating-point operations to get the floating values of the entries in the quantization table, then round them
`to integer. In the second method. we start from the scaled true DCT quantization stepsize DCTQ and the
`binDCT scaling matrix 82, and compute the quantization matrix with 16-bit operations.
`
`The result with the Foreman sequence shows that for Q>=8, the maximum difference is 0.212% for PSNR
`and 2.213% for compressed bit rate, which are quite small. Therefore. the 16-bit implementation of the
`quantization matrix is used for the following tests.
`
`File:VCEG—Ml6—FastVDO-4.0l—KEY
`
`Page:
`
`11
`
`Date Printed: 1/13/12
`
`
`
`Case 1:11-cv-00797-RGA Document 113-1 Filed 01/24/12 Page 13 of 18 PageID #: 745
`Case 1:11—cv—00797-RGA Document 113-1
`Filed 01/24/12 Page 13 of 18 Page|D #: 745
`
`The comparison between binDCT and the integer transform for 4 sequences show that for QP >= 8, the
`maximum difference between the two methods is 0.35% for PSNR and 3.6% in bit rate. When QP >= 16,
`most of the bit rate changes are less than 1.5%. Therefore they have very similar performance. This
`agrees well with our design target of providing compatible outputs after quantization.
`
`4. Conclusion
`
`A lifting-based fast approximation of the 4-point DCT, named the binDCT, is proposed, which can be
`implemented with only addition and right-shift operations. The binDCT can be well fitted into 16-bit
`architecture. Most of the quantization can also be implemented with 16-bit operations, except the RD-
`optimized quantization. Experimental results show that the binDCT has the same performance as the
`integer transform used in TML 5.2.
`
`References
`
`[4]
`
`[1] H.26L Document Q15-K-59: H.26L Test Model Long Term Number 5 (TML-5) Draft 0.
`[2] W. Chen, C. Harrison, and S. Fralick, A fast computational algorithm for the discrete cosine transform,
`IEEE Trans. Com., Vol. COM-25 (9), pp. 1004-1011, Sept. 1977.
`[3] Z. Wang, Fast algorithm for the discrete W transform and for the discrete Fourier transform, IEEE
`Trans. ASSP, Vol. ASSP-32 (4), pp. 803-816, Aug. 1984.
`l. Daubechies and W. Sweldens, Factoring wavelet transforms into lifting steps, Journal of Fourier
`Anal. Appl., Vol. 4 (3), pp. 247-269, 1998.
`[5] J. Liang and T. D. Tran, Fast multiplierless approximations of the DCT with the lifting scheme,
`submitted to IEEE Trans. on Signal Processing. Website for download: http://thanglong.ece.'hu.edu
`[6] M. Wien et al, "|nteger Transforms for H.26L using Adaptive Block Transforms," lTU VCEG
`q15K24.doc, Portland, OR, USA, Aug, 2000.
`[7] L. Kerofsky and S. Sun, “Results of H.26L Core Experiment on Adaptive Block Transforms," ITU VCEG-
`L11, Eibsee, Germany, January, 2001.
`[8] M. Wien and S. Sun, "ICT Comparison for Adaptive Block Transforms," ITU VCEG-L12, Eibsee,
`Germany, January, 2001.
`[9] J. Liang and T. Tran, "Fast Multiplierless Approximations of the DCT with the Lifting Scheme," SPIE,
`Appl. Dig. lm. Proc. XXll|, vol. 4115, Aug., 2000, pp. 384-395.
`
`File:VCEG—Ml6-Fasl:VDO-4.01-KEY
`
`Page:
`
`12
`
`Date Printed: 1/13/12
`
`
`
`Case 1:11-cv-00797-RGA Document 113-1 Filed 01/24/12 Page 14 of 18 PageID #: 746
`Case 1:11—cv—00797—RGA Document 113-1
`Filed 01/24/12 Page 14 of 18 Page|D #: 746
`
`Experimental Result 1. Comparison of two methods of obtaining
`quantization table for binDCT
`
`Bit Rate
`(bpp)
`Foreman.qcif QP PSNR Y PSNR U PSNR V
`BinDCT_intQ
`8
`42.55
`44.32
`46.29 569439.31
`16
`36.69
`39.88
`41.77 156044.41
`20
`34.05
`38.57
`40.17
`89662.50
`24
`31.42
`37.37
`38.50
`55437.60
`28
`28.44
`36.56
`37.34
`38102.10
`
`Change of Change of
`PSNR Y (%) Bit Rate (%)
`-
`-
`-
`-
`-
`-
`-
`—
`-
`—
`
`BinDCT__f|tQ
`
`8
`16
`20
`24
`28
`
`42.46
`36.72
`34.03
`31.39
`28.48
`
`44.24
`39.91
`38.58
`37.35
`36.51
`
`46.27 557352.63
`41.80 158942.70
`40.18
`88812.00
`38.47
`55594.50
`37.31
`3865