Efficient Test Pattern Compression Techniques Based on Complementary Huffman Coding (2009)


High Quality DXT Compression Using CUDA (2009 whitepaper)

March 2009

High Quality DXT Compression using OpenCL for CUDA
Ignacio Castaño
*******************

Document Change History
Version | Date | Responsible | Reason for Change
0.1 | 02/01/2007 | Ignacio Castaño | First draft
0.2 | 19/03/2009 | Timo Stich | OpenCL version

Abstract

DXT is a fixed-ratio compression format designed for real-time hardware decompression of textures. While it is also possible to encode DXT textures in real time, the quality of the resulting images is far from optimal. In this white paper we give an overview of a more expensive compression algorithm that produces high quality results, and we show how to implement it using CUDA to obtain much higher performance than the equivalent CPU implementation.

Motivation

With the increasing number of assets and texture sizes in recent games, the time required to process those assets is growing dramatically. DXT texture compression takes a large portion of this time. High quality DXT compression algorithms are very expensive, and while there are faster alternatives [1][9], the resulting quality of those simplified methods is not very high. The brute-force nature of these compression algorithms makes them suitable to be parallelized and adapted to the GPU. Cheaper compression algorithms have already been implemented [2] on the GPU using traditional GPGPU approaches. However, with the traditional GPGPU programming model it is not possible to implement more complex algorithms where threads need to share data and synchronize.

How Does It Work?

In this paper we show how to use CUDA to implement a high quality DXT1 texture compression algorithm in parallel. The algorithm that we have chosen is the cluster fit algorithm as described by Simon Brown [3]. We first provide a brief overview of the algorithm and then describe how we parallelized and implemented it in CUDA.

DXT1 Format

DXT1 is a fixed-ratio compression scheme that partitions the image into 4x4 blocks. Each block is encoded with two 16-bit colors in RGB 5-6-5 format and a 4x4 bitmap with 2 bits per pixel. Figure 1 shows the layout of the block.

Figure 1. DXT1 block layout (two 16-bit colors followed by the 2-bit-per-pixel bitmap).

The block colors are reconstructed by interpolating one or two additional colors between the given ones and indexing these and the original colors with the bitmap bits. The number of interpolated colors is chosen depending on whether the value of 'Color 0' is lower or greater than 'Color 1'. The total size of the block is 64 bits, which means that this scheme achieves a 6:1 compression ratio. For more details on the DXT1 format see the specification of the OpenGL S3TC extension [4].

Cluster Fit

In general, finding the best two points that minimize the error of a DXT encoded block is a highly discontinuous optimization problem. However, if we assume that the indices of the block are known, the problem becomes a linear optimization problem instead: minimize the distance from each color of the block to the corresponding color of the palette. Unfortunately, the indices are not known in advance. We would have to test them all to find the best solution. Simon Brown [3] suggested pruning the search space by considering only the indices that preserve the order of the points along the least squares line. Doing that allows us to reduce the number of indices for which we have to optimize the endpoints. Simon Brown provided a library [5] that implements this algorithm. We use this library as a reference to compare the correctness and performance of our CUDA implementation.
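Before moving to the implementation, the following is a minimal sketch (our own code, with our own names, not taken from the whitepaper or its sample) of the DXT1 block layout and palette reconstruction described above, assuming the endpoint colors have already been decoded to floats:

```c
#include <stdint.h>

/* Sketch of the 64-bit DXT1 block described above; field names are ours. */
typedef struct {
    uint16_t color0;   /* endpoint 0, RGB 5-6-5 */
    uint16_t color1;   /* endpoint 1, RGB 5-6-5 */
    uint32_t bitmap;   /* 16 pixels x 2 bits, each selecting a palette entry */
} DXT1Block;           /* 4x4 pixels in 64 bits -> 6:1 ratio vs. 24-bit RGB */

/* Build the 4-entry palette: the two endpoints plus one or two interpolated
   colors, depending on how Color 0 compares with Color 1. */
static void build_palette(float palette[4][3], const float c0[3], const float c1[3],
                          int four_color_mode /* nonzero when color0 > color1 */) {
    for (int i = 0; i < 3; ++i) {
        palette[0][i] = c0[i];
        palette[1][i] = c1[i];
        if (four_color_mode) {
            palette[2][i] = (2.0f * c0[i] + c1[i]) / 3.0f;
            palette[3][i] = (c0[i] + 2.0f * c1[i]) / 3.0f;
        } else {
            palette[2][i] = 0.5f * (c0[i] + c1[i]);
            palette[3][i] = 0.0f; /* 3-color mode: last entry unused here */
        }
    }
}
```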
The next section goes over the implementation details.

OpenCL Implementation

Partitioning the Problem

We have chosen to use a single work group to compress each 4x4 color block. Work items that process a single block need to cooperate with each other, but DXT blocks are independent and do not need synchronization or communication. For this reason the number of work groups is equal to the number of blocks in the image. We also parameterize the problem so that we can change the number of work items per block to determine what configuration provides better performance. For now, we will just say that the number of work items is N, and later we will discuss what the best configuration is. During the first part of the algorithm, only 16 work items out of N are active. These work items start reading the input colors and loading them into local memory.

Finding the Best Fit Line

Finding a line that best approximates a set of points is a well known regression problem. The colors of the block form a cloud of points in 3D space. The problem can be solved by computing the largest eigenvector of the covariance matrix; this vector gives us the direction of the line. Each element of the covariance matrix is just the sum of the products of different color components. We implement these sums using parallel reductions.

Once we have the covariance matrix we just need to compute its first eigenvector. We haven't found an efficient way of doing this step in parallel. Instead, we use a very cheap sequential method that doesn't add much to the overall execution time of the group. Since we only need the dominant eigenvector, we can compute it directly using the Power Method [6]. This iterative method returns the largest eigenvector and only requires a single matrix-vector product per iteration. Our tests indicate that in most cases 8 iterations are more than enough to obtain an accurate result.

Once we have the direction of the best fit line we project the colors onto it and sort them along the line using a brute force parallel sort. This is achieved by comparing all the elements against each other as follows:

cmp[tid] = (values[0] < values[tid]);
cmp[tid] += (values[1] < values[tid]);
cmp[tid] += (values[2] < values[tid]);
cmp[tid] += (values[3] < values[tid]);
cmp[tid] += (values[4] < values[tid]);
cmp[tid] += (values[5] < values[tid]);
cmp[tid] += (values[6] < values[tid]);
cmp[tid] += (values[7] < values[tid]);
cmp[tid] += (values[8] < values[tid]);
cmp[tid] += (values[9] < values[tid]);
cmp[tid] += (values[10] < values[tid]);
cmp[tid] += (values[11] < values[tid]);
cmp[tid] += (values[12] < values[tid]);
cmp[tid] += (values[13] < values[tid]);
cmp[tid] += (values[14] < values[tid]);
cmp[tid] += (values[15] < values[tid]);

The result of this search is an index array that references the sorted values. However, this algorithm has a flaw: if two colors are equal or are projected to the same location on the line, the indices of these two colors will end up with the same value.
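The whitepaper does not list the power-iteration step itself; the following is a minimal sequential sketch of the Power Method on the 3x3 color covariance matrix, with our own names and a fixed iteration count based on the observation above that about 8 iterations usually suffice:

```c
#include <math.h>

/* Power Method sketch: repeatedly multiply by the covariance matrix and
   renormalize; the vector converges to the dominant eigenvector, i.e. the
   direction of the best fit line. */
static void dominant_eigenvector(const float cov[3][3], float v[3]) {
    v[0] = 1.0f; v[1] = 1.0f; v[2] = 1.0f;         /* arbitrary starting vector */
    for (int it = 0; it < 8; ++it) {
        float w[3];
        for (int r = 0; r < 3; ++r)                 /* one matrix-vector product */
            w[r] = cov[r][0] * v[0] + cov[r][1] * v[1] + cov[r][2] * v[2];
        float m = fabsf(w[0]);                      /* normalize by largest component */
        if (fabsf(w[1]) > m) m = fabsf(w[1]);
        if (fabsf(w[2]) > m) m = fabsf(w[2]);
        if (m == 0.0f) return;                      /* degenerate block: all colors equal */
        v[0] = w[0] / m; v[1] = w[1] / m; v[2] = w[2] / m;
    }
}
```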
We solve this problem by comparing all the indices against each other and incrementing one of them if they are equal:

if (tid > 0 && cmp[tid] == cmp[0]) ++cmp[tid];
if (tid > 1 && cmp[tid] == cmp[1]) ++cmp[tid];
if (tid > 2 && cmp[tid] == cmp[2]) ++cmp[tid];
if (tid > 3 && cmp[tid] == cmp[3]) ++cmp[tid];
if (tid > 4 && cmp[tid] == cmp[4]) ++cmp[tid];
if (tid > 5 && cmp[tid] == cmp[5]) ++cmp[tid];
if (tid > 6 && cmp[tid] == cmp[6]) ++cmp[tid];
if (tid > 7 && cmp[tid] == cmp[7]) ++cmp[tid];
if (tid > 8 && cmp[tid] == cmp[8]) ++cmp[tid];
if (tid > 9 && cmp[tid] == cmp[9]) ++cmp[tid];
if (tid > 10 && cmp[tid] == cmp[10]) ++cmp[tid];
if (tid > 11 && cmp[tid] == cmp[11]) ++cmp[tid];
if (tid > 12 && cmp[tid] == cmp[12]) ++cmp[tid];
if (tid > 13 && cmp[tid] == cmp[13]) ++cmp[tid];
if (tid > 14 && cmp[tid] == cmp[14]) ++cmp[tid];

During all these steps only 16 work items are being used. For this reason, it's not necessary to synchronize them: all computations are done in parallel and in the same time step, because 16 is less than the warp size on NVIDIA GPUs.

Index Evaluation

All the possible ways in which colors can be clustered while preserving the order on the line are known in advance, and for each clustering there's a corresponding index. For 4 clusters there are 975 indices that need to be tested, while for 3 clusters there are only 151. We pre-compute these indices and store them in global memory. We have to test all these indices and determine which one produces the lowest error. In general there are more indices than work items, so we partition the total number of indices by the number of work items and each work item loops over the set of indices assigned to it. It's tempting to store the indices in constant memory, but since the indices are used only once for each work group, and since each work item accesses a different element, coalesced global memory loads perform better than constant loads.

Solving the Least Squares Problem

For each index we have to solve an optimization problem: we have to find the two endpoints that produce the lowest error. For each input color we know which index it is assigned to, so we have 16 equations like this:

\alpha_i a + \beta_i b = x_i

where \{\alpha_i, \beta_i\} is \{1, 0\}, \{2/3, 1/3\}, \{1/2, 1/2\}, \{1/3, 2/3\} or \{0, 1\} depending on the index and the interpolation mode. We look for the colors a and b that minimize the least squares error of these equations. The solution of that least squares problem is the following:

\begin{pmatrix} a \\ b \end{pmatrix} =
\begin{pmatrix} \sum_i \alpha_i^2 & \sum_i \alpha_i \beta_i \\ \sum_i \alpha_i \beta_i & \sum_i \beta_i^2 \end{pmatrix}^{-1}
\begin{pmatrix} \sum_i \alpha_i x_i \\ \sum_i \beta_i x_i \end{pmatrix}

Note: The matrix inverse is constant for each index set, but it's cheaper to compute it every time in the kernel than to load it from global memory. That's not the case for the CPU implementation.

Computing the Error

Once we have a potential solution we have to compute its error. However, colors aren't stored with full precision in the DXT block, so we have to quantize them to 5-6-5 to estimate the error accurately. In addition, we also have to keep in mind that the hardware expands the quantized color components to 8 bits by replicating the highest bits in the lower part of the byte as follows:

R = (R << 3) | (R >> 2);
G = (G << 2) | (G >> 4);
B = (B << 3) | (B >> 2);

Converting the floating point colors to integers, clamping, bit expanding and converting them back to float can be time consuming. Instead of that, we clamp the color components, round the floats to integers and approximate the bit expansion using a multiplication.
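As a concrete illustration of the least squares step above, here is a sketch (our own code, for one color channel) that accumulates the sums from the alpha/beta weights of the current index assignment and solves the resulting 2x2 system:

```c
/* Solve the normal equations for one channel: alpha[i], beta[i] are the
   palette weights assigned to color i by the current index, x[i] is the
   color value; a and b are the endpoint values being optimized. */
static void solve_endpoints(const float alpha[16], const float beta[16],
                            const float x[16], float *a, float *b) {
    float aa = 0.0f, bb = 0.0f, ab = 0.0f, ax = 0.0f, bx = 0.0f;
    for (int i = 0; i < 16; ++i) {
        aa += alpha[i] * alpha[i];
        bb += beta[i]  * beta[i];
        ab += alpha[i] * beta[i];
        ax += alpha[i] * x[i];
        bx += beta[i]  * x[i];
    }
    float det = aa * bb - ab * ab;                  /* invert [aa ab; ab bb] */
    if (det == 0.0f) { *a = *b = x[0]; return; }    /* degenerate index set */
    float inv = 1.0f / det;
    *a = ( bb * ax - ab * bx) * inv;
    *b = (-ab * ax + aa * bx) * inv;
}
```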
We found the factors that produce the lowest error using an offline optimization that minimized the average error:

r = round(clamp(r, 0.0f, 1.0f) * 31.0f);
g = round(clamp(g, 0.0f, 1.0f) * 63.0f);
b = round(clamp(b, 0.0f, 1.0f) * 31.0f);
r *= 0.03227752766457f;
g *= 0.01583151765563f;
b *= 0.03227752766457f;

Our experiments show that in most cases the approximation produces the same solution as the accurate computation.

Selecting the Best Solution

Finally, each work item has evaluated the error of a few indices and has a candidate solution. To determine which work item has the solution that produces the lowest error, we store the errors in local memory and use a parallel reduction to find the minimum. The winning work item writes the endpoints and indices of the DXT block back to global memory.
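The reduction itself is not listed in the whitepaper text; the following is a minimal OpenCL-style sketch (our own names, the shipped DXTCompression.cl kernel may differ) of a local-memory minimum reduction of the per-work-item errors, assuming the number of work items n is a power of two:

```c
/* Each work item owns one slot of 'err' (its best candidate error) and 'idx'
   (which candidate produced it). After the loop, slot 0 holds the minimum
   error and the identity of the winning work item. */
void min_reduce(__local float *err, __local int *idx, int tid, int n)
{
    for (int stride = n / 2; stride > 0; stride /= 2) {
        if (tid < stride && err[tid + stride] < err[tid]) {
            err[tid] = err[tid + stride];
            idx[tid] = idx[tid + stride];
        }
        barrier(CLK_LOCAL_MEM_FENCE);   /* wait for all pairs at this level */
    }
}
```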
Implementation Details

The source code is divided into the following files:
• DXTCompression.cl: contains the OpenCL implementation of the algorithm described here.
• permutations.h: contains the code used to precompute the indices.
• dds.h: contains the DDS file header definition.

Performance

We have measured the performance of the algorithm on different GPUs and CPUs compressing the standard Lena image. The design of the algorithm makes it insensitive to the actual content of the image, so the performance depends only on the size of the image.

Figure 2. Standard picture used for our tests.

As shown in Table 1, the GPU compressor is at least 10x faster than our best CPU implementation. The version of the compressor that runs on the CPU uses an SSE2 optimized implementation of the cluster fit algorithm. This implementation pre-computes the factors that are necessary to solve the least squares problem, while the GPU implementation computes them on the fly. Without this CPU optimization the difference between the CPU and GPU versions is even larger.

Table 1. Performance Results
Image | Tesla C1060 | GeForce 8800 GTX | Intel Core 2 X6800 | AMD Athlon 64 Dual Core 4400
Lena 512x512 | 83.35 ms | 208.69 ms | 563.0 ms | 1,251.0 ms

We also experimented with different numbers of work items, and as indicated in Table 2 we found that the algorithm performed better with the minimum number.

Table 2. Number of Work Items
64 | 128 | 256
54.66 ms | 86.39 ms | 96.13 ms

The reason the algorithm runs faster with a low number of work items is that during the first and last sections of the code only a small subset of the work items is active. A future improvement would be to reorganize the code to eliminate or minimize these stages of the algorithm. This could be achieved by loading multiple color blocks and processing them in parallel inside the same work group.

Conclusion

We have shown how it is possible to use OpenCL to implement an existing CPU algorithm in parallel to run on the GPU, and obtain an order of magnitude performance improvement. We hope this will encourage developers to attempt to accelerate other computationally-intensive offline processing using the GPU.

References

[1] "Real-Time DXT Compression", J.M.P. van Waveren. /cd/ids/developer/asmo-na/eng/324337.htm
[2] "Compressing Dynamically Generated Textures on the GPU", Oskar Alexandersson, Christoffer Gurell, Tomas Akenine-Möller. http://graphics.cs.lth.se/research/papers/gputc2006/
[3] "DXT Compression Techniques", Simon Brown. /?article=dxt
[4] "OpenGL S3TC extension spec", Pat Brown. /registry/specs/EXT/texture_compression_s3tc.txt
[5] "Squish – DXT Compression Library", Simon Brown. /?code=squish
[6] "Eigenvalues and Eigenvectors", Dr. E. Garcia. /information-retrieval-tutorial/matrix-tutorial-3-eigenvalues-eigenvectors.html
[7] "An Experimental Analysis of Parallel Sorting Algorithms", Guy E. Blelloch, C. Greg Plaxton, Charles E. Leiserson, Stephen J. Smith. /blelloch98experimental.html
[8] "NVIDIA CUDA Compute Unified Device Architecture Programming Guide".
[9] NVIDIA OpenGL SDK 10 "Compress DXT" sample. /SDK/10/opengl/samples.html#compress_DXT

Training Process ESEC 2009 SSI Level 3 (Chinese edition)

Machine working principle 机器工作原理
Features (特点): single or dual wire mode; fully motorized in X, Y, Z; solder wire diameter range
Welcome to the World of Soft Solder Process Training 2009 SSI
Soft Solder Technology
Time Planning
09:00 – 09:30, 09:30 – 10:00, 10:30 – 10:45, 10:45 – 11:15, 11:15 – 12:00, 13:30 – 14:30, 14:45 – 15:00, 15:00 – 17:00
Machine Overview
Flow of Material 材料流程
Leadframe 框架
Solder wire 焊锡丝
Wafer 晶元
Die-bonded leadframes loaded in magazine 粘有芯片的框架装载入料盒中

Electric Field Gradient Theory with Surface Effect for Nano-Dielectrics

Shuling Hu 1, Shengping Shen 1,2
1 State Key Laboratory for Strength and Vibration, School of Aerospace, Xi'an Jiaotong University, 28 West Xianning Road, Xi'an, Shaanxi 710049, P.R. China
2 Corresponding author. Tel & Fax: 86-29-82660977; E-mail: sshen@
Copyright © 2009 Tech Science Press, CMES, vol.1389, no.1
Abstract: The electric field gradient effect is very strong for nanoscale dielectrics. In addition, neither the surface effect nor the electrostatic force can be ignored. In this paper, the electric Gibbs free energy variational principle for nanosized dielectrics is established with the strain/electric field gradient effects, as well as the effects of surface and electrostatic force. As regards the surface effects, both the surface stress and the surface polarization are considered. From this variational principle, the governing equations and the generalized electromechanical Young-Laplace equations, which take into account the effects of strain/electric field gradient, surface and electrostatic force, are derived. The generalized bulk and surface electrostatic stresses are obtained naturally from the variational principle; their forms are different from those derived from the flexoelectric theory. Based on the present theory, the size-dependent electromechanical phenomena in nano-dielectrics can be predicted.

1 - Rob Walker - EN ISPE-09

ISPE-CCPIE CHINA CONFERENCE 2009
EU GMP Chapter 5 Production
• 5.18 Contamination of a starting material or of a product by another material or product must be avoided.
• 5.19e Using cleaning and decontamination procedures of known effectiveness, as ineffective cleaning of equipment is a common source of cross-contamination.
• 5.19g Testing for residues and use of cleaning status labels on equipment.
PI 006 version 2
What requires validation: Normally only cleaning procedures for product contact surfaces of the equipment need to be validated. Consideration should be given to non-contact parts into which product may migrate, for example seals, flanges, mixing shafts, fans of ovens, heating elements, etc. Generally, in the case of batch-to-batch production it is not necessary to clean after each batch; however, cleaning intervals and methods should be determined. (PIC/S PI 006, July 2004)

Recent Developments in Spatial Panel Data Models


(2010b) investigate the asymptotic properties of the quasi-maximum likelihood estimators (QMLEs) for spatial panel data models with spatial lags, fixed effects and SAR disturbances. Mutl and Pfaffermayr (2008) consider the estimation of spatial panel data models with spatial lags under both fixed and random effects specifications, and propose a Hausman type specification test. These spatial panel data models have a wide range of applications. They can be applied to agricultural economics (Druska and Horrace, 2004), transportation research (Frazier and Kockelman, 2005), public economics (Egger et al., 2005), and good demand (Baltagi and Li, 2006), to name a few. The above panel models are static ones which do not incorporate time lagged dependent variables in the regression equation. By allowing dynamic features in the spatial panel data models, Anselin (2001) and Anselin et al. (2008) divide spatial dynamic models into four categories, namely, “pure space recursive” if only a spatial time lag is included; “time–space recursive” if an individual time lag and a spatial time lag are included; “time–space simultaneous” if an individual time lag and a contemporaneous spatial lag are specified; and “time–space dynamic” if all forms of lags are included. Korniotis (forthcoming) investigates a time–space recursive model with fixed effects, and the model is applied to the growth of consumption in each state in the United States. As a recursive model, the parameters, including the fixed effects, can be estimated by OLS. Korniotis (forthcoming) has also considered a bias adjusted within estimator, which generalizes Hahn and Kuersteiner (2002). For a
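In our own notation (not reproduced from the survey), the four categories described above can be written schematically as follows, with $y_t$ the vector of outcomes at time $t$, $W$ a spatial weights matrix, $X_t$ exogenous regressors, and $\varepsilon_t$ a disturbance:

```latex
\begin{align*}
\text{pure space recursive:}      \quad & y_t = \rho W y_{t-1} + X_t\beta + \varepsilon_t \\
\text{time-space recursive:}      \quad & y_t = \gamma y_{t-1} + \rho W y_{t-1} + X_t\beta + \varepsilon_t \\
\text{time-space simultaneous:}   \quad & y_t = \gamma y_{t-1} + \lambda W y_t + X_t\beta + \varepsilon_t \\
\text{time-space dynamic:}        \quad & y_t = \gamma y_{t-1} + \rho W y_{t-1} + \lambda W y_t + X_t\beta + \varepsilon_t
\end{align*}
```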

Enhanced Artificial Bee Colony Optimization (2009, Taiwan)
International Journal of Innovative Computing, Information and Control, Volume 5, Number 12, December 2009
ISSN 1349-4198, © 2009 ICIC International, pp. 1– (ISII08-247)
Pei-Wei Tsai, Jeng-Shyang Pan, Bin-Yih Liao, and Shu-Chuan Chu
the efficiency of the ABC is compared with the Differential Evolution (DE), the PSO and the Evolutionary Algorithm (EA) for multi-dimensional numeric problems [15]. By observing the operation and the structure of the ABC algorithm, we notice that the agents, i.e. the artificial bees, can only move straight to one of the nectar sources discovered by the employed bees. This characteristic may narrow down the zones that the bees can explore and may become a drawback of the ABC. Hence, we propose an interactive strategy in this paper that considers the universal gravitation between the artificial bees in order to remedy this disadvantage. To verify the advantages gained by the proposed method, a series of experiments is executed and compared with the original ABC and the PSO. The experimental results show that the IABC performs best in solving numerical optimization problems.

2. The Artificial Bee Colony Optimization Algorithm.

The ABC algorithm was proposed by Karaboga [16] in 2005, and the performance of ABC was analyzed in 2007 [15]. The ABC algorithm was developed by observing the behaviors of real bees in finding food sources (the nectar) and sharing the information about food sources with the bees in the nest. In the ABC, the artificial agents are defined and classified into three types, namely the employed bee, the onlooker bee, and the scout. Each of them plays a different role in the process: the employed bee stays on a food source and keeps the neighborhood of the source in its memory; the onlooker gets the information about food sources from the employed bees in the hive and selects one of the food sources to gather nectar; and the scout is responsible for finding new food sources (new nectar). The process of the ABC algorithm is as follows:

Step 1. Initialization: Spray a fixed percentage of the population into the solution space randomly, and then calculate their fitness values, which are called the nectar amounts; this percentage represents the ratio of employed bees to the total population. Once these populations are positioned in the solution space, they are called the employed bees.
Step 2. Move the onlookers: Calculate the probability of selecting a food source by equation (1), select a food source for each onlooker bee to move to, and then determine its nectar amount. The movement of the onlookers follows equation (2).
Step 3. Move the scouts: If the fitness values of the employed bees are not improved within a predetermined number of consecutive iterations, called the "limit", those food sources are abandoned and the corresponding employed bees become scouts. The scouts are moved by equation (3).
Step 4. Update the best food source found so far: Memorize the best fitness value and the position found by the bees.
Step 5. Termination checking: Check whether the number of iterations satisfies the termination condition. If so, terminate the program and output the results; otherwise go back to Step 2.

P_i = \frac{F(\theta_i)}{\sum_{k=1}^{S} F(\theta_k)}    (1)

where F(\theta_k) denotes the fitness (nectar amount) of the k-th food source.
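Equation (1) is a fitness-proportional selection rule; the following is a minimal sketch (our own code and names, not from the paper) of how an onlooker could pick an employed bee's food source with that probability:

```c
#include <stdlib.h>

/* Select food source i with probability fitness[i] / sum(fitness),
   i.e. roulette-wheel selection over the nectar amounts. */
static int select_food_source(const double *fitness, int num_sources) {
    double total = 0.0;
    for (int i = 0; i < num_sources; ++i)
        total += fitness[i];

    double r = ((double)rand() / (double)RAND_MAX) * total;  /* spin the wheel */
    double acc = 0.0;
    for (int i = 0; i < num_sources; ++i) {
        acc += fitness[i];
        if (r <= acc)
            return i;
    }
    return num_sources - 1;  /* guard against floating point round-off */
}
```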

Microsoft PowerPoint - EMC_Asia_Fall_2009


EMC Tests
EMC Standards
• Acceptance tests: car-level
Agenda
• Basic and Market Trends
• EMC Tests
• Design Flow
  – IC Examples
• Case Studies
  – LDO
  – SMPS
  – Drivers
  – IVN
• General PCB Guidelines
IEC 62132-4
Board Example
IEC 62132-4
DPI Results
[DPI test setup diagram: a wattmeter feeds the DUT on a printed circuit board; an oscilloscope and a monitoring PC observe Vout for a good or failure signal.]

TPS54060


[Pin voltage/current rating and schematic residue: input voltage on VIN, EN, BOOT, VSENSE, COMP, PWRGD, SS/TR, RT/CLK; voltage difference BOOT-PH and PAD to GND; output voltage and 10-ns transient on PH; source current on EN, BOOT, VSENSE, PH, RT/CLK.]
APPLICATIONS
• 12-V, 24-V and 48-V Industrial and Commercial Low Power Systems
• Aftermarket Auto Accessories: Video, GPS, Entertainment
DESCRIPTION
The TPS54060 device is a 60-V, 0.5-A step-down regulator with an integrated high-side MOSFET. Current mode control provides simple external compensation and flexible component selection. A low-ripple pulse-skip mode reduces the no-load, regulated output supply current to 116 μA. Using the enable pin, the shutdown supply current is reduced to 1.3 μA when the enable pin is low.
The TPS54060 is available in a 10-pin thermally enhanced MSOP and a 10-pin 3-mm x 3-mm SON PowerPad™ package.

Efficient Test Pattern Compression Techniques Based on Complementary Huffman Coding

Shyue-Kung Lu, Hei-Ming Chuang, Guan-Ying Lai, Bi-Ting Lai, and Ya-Chen Huang
Department of Electronic Engineering, Fu-Jen Catholic University, Taipei, Taiwan, 242, R.O.C.
sklu@.tw

Abstract - In this paper, complementary Huffman encoding techniques are proposed for test data compression of complex SOC designs during manufacturing testing. The correlations of blocks of bits in a test data set are exploited such that more blocks can share the same codeword. Therefore, besides the compatible blocks used in previous works, the complementary property between blocks can also be used. Based on this property, two methods are proposed for Huffman encoding. With this technique, more blocks share the same codeword and the size of the Huffman tree can be reduced. This not only reduces the area overhead of the decoding circuitry but also substantially increases the compression ratio. In order to facilitate the proposed complementary encoding techniques, a don't-care assignment algorithm is also proposed. According to experimental results, the area overhead of the decompression circuit is lower than that of the full Huffman coding technique. Moreover, the compression ratio is higher than that of the selective and optimal selective Huffman coding techniques.

I. INTRODUCTION

Due to the ever-increasing complexity of SOC designs, a large amount of test patterns has to be sent from the ATEs to test embedded cores through the ATE channels. However, since the bandwidth of ATE channels and the capacity of ATE memory are limited, the cost of testing and the test application time increase considerably. To alleviate these problems, test compression techniques are usually adopted in today's SOC era. There are three kinds of test compression techniques: the linear-decompression-based scheme, the broadcast-scan-based scheme [1], and the code-based scheme [2-10].

The code-based scheme is appropriate for compressing a pre-computed test set without requiring any structural information of the IP cores. In [3, 4], new dictionary-based compression schemes are proposed. Due to the regularity of the schemes, the advantages of a multiple-scan architecture can be preserved, and very low test time can be achieved. In [2], the RLE (run-length encoding) techniques used in JPEG still image compression systems are used to reduce test data and to make the area overhead almost negligible. Test compression techniques based on full Huffman encoding are proposed in [5, 6, 7, 8]. However, the size of the decoder increases exponentially as the number of encoded symbols increases [5]. To address this problem, the selective Huffman encoding technique was proposed in [6]. It only encodes the symbols with higher occurrence frequencies, so the size of the decompression circuit can be reduced substantially. The technique was further improved in [7] by the optimal selective Huffman coding technique, which regards all non-encoded symbols as one symbol.

In this paper, complementary Huffman encoding techniques are proposed. The correlations between blocks of bits in the test set are exploited to further reduce the size of the Huffman tree. That is, more blocks that are compatible or complementary compatible can be merged and encoded with the same codeword. Two methods are proposed to encode compatible and complementary compatible blocks: Method 1 and Method 2. In order to maximize the merging process, a don't-care assignment algorithm is also proposed.
The size of the Huffman tree can be reduced significantly. Moreover, according to experimental results on the ISCAS89 benchmark circuits, the compression ratio reaches 45% to 87% with Method 1, and 66% to 93% with Method 2. The area overhead for implementing our decompression circuit is lower than that of the full Huffman technique.

II. OVERVIEW OF THE COMPLEMENTARY HUFFMAN CODING TECHNIQUES

The Huffman coding technique is basically a statistical, fixed-to-variable coding scheme. It represents data blocks of fixed length by variable-length codewords. Symbols that occur more frequently have shorter codewords than symbols that occur less frequently. This reduces the average number of bits per codeword, and therefore compression is achieved. An example of deriving the Huffman code is shown in Fig. 1 [6]. In this example, the test set is divided into 4-bit blocks. Table 1 [6] shows the frequency of occurrence of each of the possible blocks. The compression ratio is determined by how skewed the frequency of occurrence is.

0010 0100 0010 0110 0000 0010 1011 0100 0010 0100 0110 0010
0010 0100 0010 0110 0000 0110 0010 0100 0110 0010 0010 0000
0010 0110 0010 0010 0010 0100 0100 0110 0010 0010 1000 0101
0001 0100 0010 0111 0010 0010 0111 0111 0100 0100 1000 0101
1100 0100 0100 0111 0010 0010 0111 1101 0010 0100 1111 0011

Fig. 1: The example test set divided into 4-bit blocks.

Table 1: Huffman Coding of the example [6]
Symbol | Freq. | Block | Huffman Code
S0  | 22 | 0010 | 10
S1  | 13 | 0100 | 00
S2  |  7 | 0110 | 110
S3  |  5 | 0111 | 010
S4  |  3 | 0000 | 0110
S5  |  2 | 1000 | 0111
S6  |  2 | 0101 | 11100
S7  |  1 | 1011 | 111010
S8  |  1 | 1100 | 111011
S9  |  1 | 0001 | 111100
S10 |  1 | 1101 | 111101
S11 |  1 | 1111 | 111110
S12 |  1 | 0011 | 111111
S13 |  0 | 1110 | -
S14 |  0 | 1010 | -
S15 |  0 | 1001 | -

In this example, 13 symbols are used to construct the Huffman tree, and the correlations between symbols are not exploited. We say that two symbols Si and Sj are complementary compatible if the corresponding bits of these two symbols are bit-wise complementary; they can then be merged as a new symbol Si,j. For example, in Table 1, S1 (0100) and S7 (1011) are complementary compatible and can be merged as a new symbol S1,7. Similarly, S0 (0010) and S10 (1101) are complementary compatible and can be merged as S0,10. The merged results of the symbols in Table 1 are shown in Table 2. The "Joint Freq." column denotes the accumulated frequency of symbols Si and Sj. From this table we can see that instead of the original 16 symbols, we only need to encode the 8 merged symbols. Therefore, the size of the Huffman tree can be reduced significantly. Moreover, the skewing of the occurrence probabilities is helpful for the compression of test data: the average codeword length can also be reduced significantly, which in turn increases the compression ratio.

Table 2: Merged Symbols of Table 1
Symbol | Joint Freq. | Pattern
S0,10 | 23 | 0010
S1,7  | 14 | 0100
S2,15 |  7 | 0110
S3,5  |  7 | 0111
S4,11 |  4 | 0000
S6,14 |  2 | 0101
S8,12 |  2 | 1100
S9,13 |  1 | 0001
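As a concrete illustration of the merging idea in Section II, the following sketch (our own code, not from the paper) maps each block and its bit-wise complement to one canonical pattern and accumulates the joint frequencies used to build the reduced Huffman tree:

```c
#include <stdint.h>

#define BLOCK_BITS 4
#define NUM_BLOCKS (1u << BLOCK_BITS)

/* A block and its bit-wise complement map to the same canonical pattern,
   so they end up sharing one joint symbol (and hence one codeword). */
static uint32_t canonical(uint32_t block) {
    uint32_t comp = (~block) & (NUM_BLOCKS - 1u);
    return block < comp ? block : comp;
}

static void joint_frequencies(const uint32_t *blocks, int n,
                              uint32_t freq[NUM_BLOCKS]) {
    for (uint32_t i = 0; i < NUM_BLOCKS; ++i) freq[i] = 0;
    for (int i = 0; i < n; ++i)
        ++freq[canonical(blocks[i])];   /* at most 8 distinct joint symbols for 4-bit blocks */
}
```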
III. DON'T-CARE ASSIGNMENT TECHNIQUES

In general, test patterns generated by commercial ATPG tools contain massive numbers of unspecified bits (don't-care bits) that can be assigned 1's and 0's in a way that skews the frequency distribution, which helps maximize the compression efficiency. In [6], two don't-care assignment techniques are proposed; however, only the compatibility property is used in their techniques. Two blocks are compatible if there is no conflict in any bit position of the two blocks. For our complementary coding techniques, besides compatible blocks, complementary blocks are also searched. Therefore, the Alg_comp algorithm is proposed in this paper to perform don't-care assignment.

In Alg_comp, we first search for the most frequently occurring unspecified block F1. Thereafter, it is compared with the next most frequently occurring unspecified block F2. If there is no conflict in any bit position, these two blocks are merged by specifying every bit position that is specified in either block. All F2 blocks in the test set are appended with a 0 to indicate that F2 is merged with the F1 block. For example, if block F1 00XX is merged with block F2 0X1X, the result is 001X, and all F2 blocks in the test set are changed to 0001X.

Alternatively, if there is a conflict, F1 is compared with the complement of F2. If there is no conflict in any bit position, then they are merged, and all F2 blocks in the test set are appended with a 1 to indicate that F2 is merged with the F1 block in its complementary form. For example, if F1 is 00XX and F2 is 1X0X, block 00XX is merged with block 0X1X, which is the complement of block 1X0X. The resulting block is 001X, and all F2 blocks in the test set are changed to 1001X. Thereafter, F1 is compared with all the other unspecified blocks in decreasing order of frequency and merged where possible. This pass finishes when there is no more block that can be merged with F1. The process is then repeated until there are no more blocks that can be merged in the test set. Any remaining X's can be randomly assigned 0's or 1's, since this has no effect on the compression ratio.

Consider the example test set shown in Fig. 2(a). The test set contains five unique unspecified blocks: 1X01, 10X1, 01XX, 01X1, and 101X. Let B_uniq denote the set of unique blocks. In the test set, the frequency of occurrence of block 10X1 is 3, that of blocks 1X01 and 01XX is 2, and that of the other blocks is 1. After applying the first step of Alg_comp, the most frequently occurring unspecified block 10X1 is found. It is compared with 1X01, which is the next most frequently occurring unspecified block. Since there are no conflicts, these two blocks can be merged and the result is 1001. All 1X01 blocks in the test set are appended with a 0. The merged block 1001 is then compared with the other unspecified blocks. It is merged with 10XX, which is the complement of the unspecified block 01XX. All 01XX blocks in the test set are appended with a 1. At this time, B_uniq = {1001, 01X1, 101X} and blocks 10X1, 1X01, and 01XX are transformed to 01001, 01001, and 11001, respectively. This process is repeated and 01X1 is merged with 010X, which is the complement of the unspecified block 101X. Since there are no more unspecified blocks that can be merged in the test set, the process is finished. The blocks 01X1 and 101X are changed to 00101 and 10101, respectively. Finally, the set B_uniq is {1001, 0101}. The final test set is shown in Fig. 2(b).

(a) 1X01 10X1 1X01      (b) 01001 01001 01001
    10X1 01XX 01X1          01001 11001 00101
    10X1 101X 01XX          01001 10101 11001

Fig. 2: (a) Test set before don't-care assignment, and (b) after applying Alg_comp.
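Alg_comp is described only in prose above; the following sketch (our own representation, using separate care/value bit masks) shows the core test it relies on, namely whether a candidate block can be merged with F1 directly (flag 0) or in complemented form (flag 1):

```c
#include <stdbool.h>
#include <stdint.h>

/* A block with don't-cares: 'care' has a 1 where the bit is specified,
   'value' holds the specified bits (0 elsewhere). Representation is ours. */
typedef struct { uint32_t care, value; } XBlock;

/* Two blocks conflict only where both are specified and the bits differ. */
static bool compatible(XBlock a, XBlock b) {
    uint32_t both = a.care & b.care;
    return ((a.value ^ b.value) & both) == 0;
}

/* Try to merge cand into f1, directly or complemented, as Alg_comp does.
   'mask' covers the block width, e.g. 0xF for 4-bit blocks. */
static bool merge_into(XBlock *f1, XBlock cand, uint32_t mask, int *flag) {
    XBlock comp = { cand.care, (~cand.value) & cand.care & mask };
    if (compatible(*f1, cand))      { *flag = 0; }
    else if (compatible(*f1, comp)) { *flag = 1; cand = comp; }
    else return false;
    f1->value |= cand.value;   /* specify every bit specified in either block */
    f1->care  |= cand.care;
    return true;
}
```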
IV. COMPLEMENTARY HUFFMAN ENCODING TECHNIQUES

When all don't-care bits of the unspecified blocks have been assigned by the Alg_comp algorithm, the first bit of each merged block denotes whether the block is complementary merged or not. In this section, two methods, Method 1 and Method 2, are proposed for encoding the specified blocks. Test volume and the area overhead of the decoding circuitry can be reduced significantly by using these two methods.

Method 1: One Control Bit Added

After using the Alg_comp algorithm to fill the don't-care bits, blocks with a complementary relationship are regarded as one symbol to reduce the number of symbols, and hence the size of the Huffman tree. However, in order to distinguish the two members of a complementary pair, one control bit C is appended to the codeword of the complementary pair. This control bit tells the decoder whether the decoded block should be complemented or not.

An example is shown in Fig. 3 and Table 3. The blocks 0110 and 1001 are a complementary pair; therefore, they share the same symbol S0. Similarly, blocks 1010 and 0101 share the symbol S1. The occurrence frequencies of all symbols are listed in Table 3. All symbols are encoded with the conventional Huffman coding technique, but one control bit is appended to each codeword. For example, the original Huffman code is 0 for blocks 0110 and 1001; a "0" is appended to form the final codeword for block 0110 and a "1" is appended to form the final codeword for block 1001. In the decoding circuitry, this control bit is used to determine whether the decoded block should be complemented or not.

00110 00110 10110 00110 00110 00110 10110 10110 00110 00110
00110 00110 00110 00110 10110 00110 00110 00110 00110 00110
00110 00110 01010 01010 00110 00110 00110 00110 00110 00110
10110 00110 00110 00110 00110 00110 01110 00110 00110 00111
11010 00111 10110 00110 00110 00110 00011 10110 10111 00110
11010 00110 00110 00110 00110 11010 01010 10110 00110 10110

Fig. 3: The test set after don't-care assignment (block size = 4).

Table 3: Complementary Encoding (columns: Symbol, Freq., Block, and the Huffman code and final code under Method 1 and Method 2; for example, S0 is block 0110 with frequency 49 and Huffman code 0, and S_group has frequency 13 with Method 2 Huffman code 10).

Method 2: Regarding All Complementary Blocks as One Symbol

In the Method 1 encoding technique, one control bit is required for each codeword, which impacts the compression ratio. Therefore, in order to substantially improve the compression ratio and reduce the area overhead of Huffman decoding, Method 2 treats all complementary blocks as a single symbol (S_group). For example, in Table 3, the complementary blocks 1001, 0101, and 1100 are treated as the symbol S_group. The occurrence frequency of S_group is 13. This symbol is then encoded along with the other symbols using the conventional Huffman coding technique; the Huffman code for S_group is 10, as shown in the table. The Huffman code of a block is then appended to this control code to form the final code of a complementary block. For example, the Huffman code for block 0110 is 0; appending it to the control code "10" gives the final code 100 for block 1001.
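The two encoding rules can be summarized with the following sketch (our own code, not from the paper), which forms the final codeword for a block that was merged with its complement; the complement information comes from the flag bit that Alg_comp attached to each merged block:

```c
#include <stdio.h>
#include <string.h>

/* Method 1: append one control bit to the joint symbol's codeword. */
static void method1_code(const char *huff, int complemented, char *out) {
    sprintf(out, "%s%c", huff, complemented ? '1' : '0');
}

/* Method 2: non-complemented blocks keep their own codeword; complemented
   blocks are prefixed by the codeword of S_group (e.g. "10" in Table 3). */
static void method2_code(const char *huff, int complemented,
                         const char *sgroup_code, char *out) {
    if (complemented)
        sprintf(out, "%s%s", sgroup_code, huff);   /* e.g. "10" + "0" -> "100" */
    else
        strcpy(out, huff);
}
```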
V. EXPERIMENTAL RESULTS

Experimental results based on the ISCAS89 benchmark circuits are shown in Table 4. The block size can be 4, 8, 12, or 16 bits. The results of selective Huffman coding, optimal selective Huffman coding, and the two methods proposed in this paper are compared in the table.

Table 4: Comparisons of compression ratio (%). For selective and optimal selective Huffman coding, results are given for 8 and 16 encoded distinct blocks; the sub-columns are the block sizes.

Circuit | Selective Huffman, 8 enc. dist. blocks (4/8/12/16) | Selective Huffman, 16 enc. dist. blocks (8/12/16) | Optimal Selective Huffman, 8 enc. dist. blocks (4/8/12/16) | Optimal Selective Huffman, 16 enc. dist. blocks (8/12/16) | Method 1 (4/8/12/16) | Method 2 (4/8/12/16)
S5378  | 28.9 / 50.1 / 53.0 / 53.0 | 50.2 / 55.1 / 53.8 | 47.1 / 54.7 / 54.9 / 51.1 | 55.8 / 57.5 / 55.0 | 46.69 / 67.67 / 66.40 / 57.81 | 66.15 / 75.54 / 70.87 / 60.46
S9234  | 30.0 / 50.4 / 50.7 / 46.1 | 50.2 / 54.2 / 51.0 | 51.4 / 55.3 / 51.8 / 46.1 | 57.7 / 56.8 / 51.9 | 49.27 / 73.48 / 80.07 / 82.11 | 70.41 / 82.65 / 85.44 / 85.36
S13207 | 45.6 / 69.2 / 76.6 / 78.6 | 69.2 / 77.1 / 79.7 | 69.9 / 80.2 / 83.4 / 83.1 | 80.6 / 84.3 / 84.5 | 49.93 / 74.84 / 83.15 / 87.12 | 74.53 / 86.92 / 91.12 / 93.12
S15850 | 38.8 / 60.0 / 63.6 / 61.6 | 59.9 / 65.6 / 64.8 | 62.1 / 67.9 / 67.3 / 63.6 | 69.3 / 70.4 / 67.2 | 49.88 / 74.40 / 82.34 / 85.79 | 73.49 / 85.00 / 88.99 / 90.36
S38417 | 34.9 / 55.3 / 56.9 / 54.4 | 55.5 / 58.9 / 57.3 | 55.9 / 60.9 / 59.3 / 55.2 | 62.8 / 61.8 / 58.6 | 50.00 / 74.84 / 78.97 / 80.08 | 74.65 / 86.99 / 87.31 / 86.29
S38584 | 37.8 / 58.5 / 62.2 / 61.7 | 58.5 / 63.9 / 64.1 | 60.4 / 65.4 / 65.2 / 63.3 | 66.9 / 68.0 / 66.1 | 49.94 / 74.89 / 82.71 / 86.15 | 72.31 / 84.82 / 88.73 / 90.3

We can see that the compression ratio of Method 2 is better than that of Method 1. Moreover, the two proposed methods usually achieve a higher compression ratio than the selective and optimal selective Huffman coding techniques.

The hardware overhead of the decoder is shown in Table 5. Since Method 1 and Method 2 consider the complementary relationship between blocks, the number of leaf nodes is reduced, and therefore the number of states in the decoder is reduced. From this table, we can see that the area overhead of Method 1 and Method 2 is smaller than that of the full Huffman encoding technique. If the block size increases, the compression efficiency improves, but the hardware overhead increases; therefore, the block size provides an easy way to trade off area overhead against compression efficiency. Although the area overhead of the proposed complementary Huffman coding technique is larger than that of the selective Huffman coding technique, it achieves a 10%~30% improvement in compression ratio.

Table 5: Comparisons of area overhead
Circuit | Full Huff. | Selective Huff. | Method 1 | Method 2
S5378  | 79.2 | 10.4 | 38.57 | 41.83
S9234  | 51.1 |  8.6 | 16.8  | 17.78
S13207 | 23.6 |  3.7 |  6.96 |  7.14
S15850 | 26.1 |  4.5 |  8.22 |  8.90
S38417 | 21.3 |  1.3 |  1.3  |  1.52
S38584 | 19.7 |  1.3 |  1.75 |  1.88

VI. CONCLUSIONS

In this paper, complementary Huffman coding techniques are proposed for test compression of SOC designs. Instead of exploiting only compatible blocks as the conventional techniques do, complementary compatible blocks are also searched to further increase the compression ratio. Two methods are proposed for the encoding of blocks, and the Alg_comp algorithm is proposed for don't-care assignment. According to experimental results, the area overhead of the decoding circuitry is lower than that of the full Huffman coding technique. Moreover, the compression ratio is higher than that of the selective and optimal selective Huffman coding techniques. Although the area overhead is larger than that of [6], the compression ratio is much better.

REFERENCES

[1] K. J. Lee, J. J. Chen, and C. H. Huang, "Broadcasting test patterns to multiple circuits," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 18, no. 12, pp. 1793-1802, Dec. 1999.
[2] H. Ichihara, Y. Setohara, Y. Nakashima, and T. Inoue, "Test compression/decompression based on JPEG VLC algorithm," in Proc. Asian Test Symposium, Oct. 2007, pp. 87-90.
[3] A. Wurtenberger, C. S. Tautermann, and S. Hellebrand, "Data compression for multiple scan chains using dictionaries with corrections," in Proc. Int'l Test Conf. (ITC 2004), Oct. 2004, pp. 926-935.
[4] F. G. Wolff and C. Papachristou, "Multiscan-based test compression and hardware decompression using LZ77," in Proc. Int'l Test Conf., Oct. 2002, pp. 331-339.
[5] V. Iyengar, K. Chakrabarty, and B. T. Murray, "Huffman encoding of test sets for sequential circuits," IEEE Trans. Instrumentation and Measurement, vol. 47, no. 1, pp. 21-25, Feb. 1998.
[6] A. Jas, J. Ghosh-Dastidar, M. E. Ng, and N. A. Touba, "An efficient test vector compression scheme using selective Huffman coding," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 22, no. 6, pp. 797-806, Jun. 2003.
[7] X. Kavousianos, E. Kalligeros, and D. Nikolos, "Optimal selective Huffman coding for test-data compression," IEEE Trans. Computers, vol. 56, no. 8, pp. 1146-1152, Aug. 2007.
[8] X. Kavousianos, E. Kalligeros, and D. Nikolos, "Test data compression based on variable-to-variable Huffman encoding with codeword reusability," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 27, no. 7, pp. 1333-1338, Jul. 2008.
[9] L. Lingappan, S. Ravi, A. Raghunathan, N. K. Jha, and S. T. Chakradhar, "Test-volume reduction in systems-on-a-chip using heterogeneous and multilevel compression techniques," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 25, no. 10, pp. 2193-2206, Oct. 2006.
[10] M. H. Tehranipour, M. Nourani, K. Arabi, and A. Afzali-Kusha, "Mixed RL-Huffman encoding for power reduction and data compression in scan test," in Proc. IEEE Symp. Circuits and Systems, vol. 2, pp. 681-684, May 2004.
