An Implementation of a Pipelined Parallel Processing System for a Multi-Access Memory System

QAM is a widely used multilevel modulation technique with a variety of applications in data radio communication systems. Most existing implementations of QAM-based systems use high levels of modulation in order to meet the high data rate constraints of emerging applications. This work presents the architecture of a highly parallel QAM modulator, using an MPSoC-based design flow and design methodology, which offers multirate modulation. The proposed MPSoC architecture is modular and provides dynamic reconfiguration of the QAM utilizing on-chip interconnection networks, offering high data rates (more than 1 Gbps) even at low modulation levels (16-QAM). Furthermore, the proposed QAM implementation integrates a hardware-based resource allocation algorithm that can provide better throughput and fault tolerance, depending on the on-chip interconnection network congestion and run-time faults. Preliminary results from this work have been published in the Proceedings of the 18th IEEE/IFIP International Conference on VLSI and System-on-Chip (VLSI-SoC 2010). The current version of the work includes a detailed description of the proposed system architecture, extends the results significantly using more test cases, and investigates the impact of various design parameters. Furthermore, this work investigates the use of the hardware resource allocation algorithm as a graceful degradation mechanism, providing simulation results about the performance of the QAM in the presence of faulty components.

Quadrature Amplitude Modulation (QAM) is a popular modulation scheme, widely used in various communication protocols such as Wi-Fi and Digital Video Broadcasting (DVB). The architecture of a digital QAM modulator/demodulator is typically constrained by several, often conflicting, requirements. Such requirements may include demanding throughput, high immunity to noise, flexibility for various communication standards, and low on-chip power. The majority of existing QAM implementations follow a sequential implementation approach and rely on high modulation levels in order to meet the emerging high data rate constraints. These techniques, however, are vulnerable to noise at a given transmission power, which reduces the reliable communication distance. The problem is addressed by increasing the number of modulators in a system, through emerging Software-Defined Radio (SDR) systems, which are mapped on MPSoCs in an effort to boost parallelism. These works, however, treat the QAM modulator as an individual system task, whereas it is a task that can be further optimized and designed with further parallelism in order to achieve high data rates, even at low modulation levels.

Designing the QAM modulator in a parallel manner can be beneficial in many ways. Firstly, the resulting parallel (modulated) streams can be combined at the output, resulting in a system whose majority of logic runs at lower clock frequencies, while allowing for high throughput even at low modulation levels. This is particularly important as lower modulation levels are less susceptible to multipath distortion, provide power efficiency, and achieve a low bit error rate (BER). Furthermore, a parallel modulation architecture can benefit multiple-input multiple-output (MIMO) communication systems, where information is sent and received over two or more antennas often shared among many users. Using multiple antennas at both transmitter and receiver offers significant capacity enhancement in many modern applications, including IEEE 802.11n, 3GPP LTE, and mobile WiMAX systems, providing increased throughput at the same channel
bandwidth and transmit power. In order to achieve the benefits of MIMO systems, appropriate design aspects of the modulation and demodulation architectures have to be taken into consideration. It is obvious that transmitter architectures with multiple output ports, and the more complicated receiver architectures with multiple input ports, are mainly required. However, the demodulation architecture is beyond the scope of this work and is part of future work.

This work presents an MPSoC implementation of the QAM modulator that can provide a modular and reconfigurable architecture to facilitate integration of the different processing units involved in QAM modulation. The work attempts to investigate how the performance of a sequential QAM modulator can be improved by exploiting parallelism in two forms: first, by developing a simple, pipelined version of the conventional QAM modulator, and second, by using design methodologies employed in present-day MPSoCs in order to map multiple QAM modulators on an underlying MPSoC interconnected via a packet-based network-on-chip (NoC). Furthermore, this work presents a hardware-based resource allocation algorithm, enabling the system to further gain performance through dynamic load balancing. The resource allocation algorithm can also act as a graceful degradation mechanism, limiting the influence of run-time faults on the average system throughput. Additionally, the proposed MPSoC-based system can adopt variable data rates and protocols simultaneously, taking advantage of resource sharing mechanisms. The proposed system architecture was simulated using a high-level simulator and implemented/evaluated on an FPGA platform. Moreover, although this work currently targets QAM-based modulation scenarios, the methodology and reconfiguration mechanisms can target QAM-based demodulation scenarios as well. However, the design and implementation of an MPSoC-based demodulator was left as future work.

While an MPSoC implementation of the QAM modulator is beneficial in terms of throughput, there are overheads associated with the on-chip network. As such, the MPSoC-based modulator was compared to a straightforward implementation featuring multiple QAM modulators, in an effort to identify the conditions that favor the MPSoC implementation.
The comparison was carried out under variable incoming rates, system configurations, and fault conditions, and simulation results showed on average double throughput rates during normal operation and roughly 25% less throughput degradation in the presence of faulty components, at the cost of approximately 35% more area, obtained from FPGA implementation and synthesis results. The hardware overheads, which stem from the NoC and the resource allocation algorithm, are well within the typical values for NoC-based systems and are adequately balanced by the high throughput rates obtained.

Most of the existing hardware implementations involving QAM modulation/demodulation follow a sequential approach and simply consider the QAM as an individual module. There has been limited design exploration, and most works allow limited reconfiguration, offering inadequate data rates when using low modulation levels. The latter has been addressed through emerging SDR implementations mapped on MPSoCs, which also treat the QAM modulation as an individual system task, integrated as part of the system, rather than focusing on optimizing the performance of the modulator. Related works use a specific modulation type; they can, however, be extended to use higher modulation levels in order to increase the resulting data rate. Higher modulation levels, though, involve more divisions of both amplitude and phase and can potentially introduce decoding errors at the receiver, as the symbols are very close together (for a given transmission power level) and one level of amplitude may be confused (due to the effect of noise) with a higher level, thus distorting the received signal. In order to avoid this, it is necessary to allow for wide margins, and this can be done by increasing the available amplitude range through power amplification of the RF signal at the transmitter (to effectively spread the symbols out more); otherwise, data bits may be decoded incorrectly at the receiver, resulting in an increased bit error rate (BER).
However, increasing the amplitude range will operate the RF amplifiers well within their nonlinear (compression) region, causing distortion. Alternative QAM implementations try to avoid the use of multipliers and sine/cosine memories by using the CORDIC algorithm; they still, however, follow a sequential approach. Software-based solutions lie in designing SDR systems mapped on general-purpose processors and/or digital signal processors (DSPs), where the QAM modulator is usually considered a system task to be scheduled on an available processing unit. Related works utilize the MPSoC design methodology to implement SDR systems, treating the modulator as an individual system task. Published results show that the problem with this approach is that several competing tasks running in parallel with QAM may hurt the performance of the modulation, making this approach inadequate for demanding wireless communications in terms of throughput and energy efficiency. Another particular issue is the efficiency of the allocation algorithm. The allocation algorithm is implemented on a processor, which makes allocation slow. Moreover, the policies used to allocate tasks to processors (random allocation and distance-based allocation) may lead to on-chip contention and unbalanced loads at each processor, since the utilization of each processor is not taken into account. In related work, a hardware unit called CoreManager for run-time scheduling of tasks is used, which aims at speeding up the allocation algorithm. These conclusions motivate moving more tasks, such as reconfiguration and resource allocation, into hardware rather than using software running on dedicated CPUs, in an effort to reduce power consumption and improve the flexibility of the system.

This work presents a reconfigurable QAM modulator using MPSoC design methodologies and an on-chip network, with an integrated hardware resource allocation mechanism for dynamic reconfiguration. The allocation algorithm takes into consideration not only the distance between partitioned blocks (hop count) but also the utilization of each block, in an attempt to make the proposed MPSoC-based QAM modulator able to achieve robust performance under different incoming rates of data streams and different modulation levels.
Moreover, the allocation algorithm inherently acts as a graceful degradation mechanism, limiting the influence of run-time faults on the average system throughput. We used MPSoC design methodologies to map the QAM modulator onto an MPSoC architecture, which uses an on-chip, packet-based NoC. This allows a modular, "plug-and-play" approach that permits the integration of heterogeneous processing elements, in an attempt to create a reconfigurable QAM modulator. By partitioning the QAM modulator into different stand-alone tasks mapped on Processing Elements (PEs), we [...] own SURF. This would require a content-addressable memory search and would expand the hardware logic of each sender PE's NIRA. Since one of our objectives is scalability, we integrated the hop count inside each destination PE's packet. The source PE polls its host NI for incoming control packets, which are stored in an internal FIFO queue. During each interval T, when the source PE receives the first control packet, a second timer is activated for a specified number of clock cycles, W. When this timer expires, the polling is halted and a heuristic algorithm based on the received conditions is run, in order to decide the next destination PE. In the case where a control packet is not received from a source PE in the specified time interval W, this PE is not included in the algorithm. This is a key feature of the proposed MPSoC-based QAM modulator; at extremely loaded conditions, it attempts to maintain a stable data rate by finding alternative PEs which are less busy.
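To make the selection step concrete, the following is a minimal software sketch of a utilization- and hop-count-aware destination choice in the spirit of the heuristic described above. The field names, the weighting of the two criteria, and the timeout handling are illustrative assumptions, not the exact hardware algorithm implemented in the NIRA.

```python
# Hypothetical sketch of destination-PE selection based on reported
# utilization and hop count. Weights and fields are assumptions.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ControlPacket:
    pe_id: int          # candidate destination PE
    utilization: float  # reported PE/buffer occupancy in [0, 1]
    hop_count: int      # network distance from the source PE

def select_destination(received: List[ControlPacket],
                       max_hops: int,
                       w_util: float = 0.7,
                       w_dist: float = 0.3) -> Optional[int]:
    """Pick the least-loaded, closest PE among those that answered within
    the polling window W; PEs that did not answer are simply absent from
    `received` and are therefore never considered."""
    if not received:
        return None  # nothing answered in time; keep the current PE
    def cost(p: ControlPacket) -> float:
        return w_util * p.utilization + w_dist * (p.hop_count / max_hops)
    return min(received, key=cost).pe_id

# Example: PE 3 is closer but far busier than PE 5.
candidates = [ControlPacket(3, 0.9, 1), ControlPacket(5, 0.2, 3)]
print(select_destination(candidates, max_hops=4))   # -> 5
```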
Xilinx Virtex-5 Divider IP Manual

Radix-2 Solution
Radix-2 Feature Summary
• Provides quotient with integer or fractional remainder
Applications
Division is the most complex of the four basic arithmetic operations. Because hardware solutions are correspondingly larger and more complex than the solutions for other operations, it is best to minimize the number of divisions in any algorithm. There are many forms of division implementation, each of which may offer the optimal solution in different circumstances. The divider generator core provides two division algorithms, offering solutions targeted at small operands and large operands. The Radix-2 non-restoring algorithm solves one bit of the quotient per cycle using addition and subtraction. The design is fully pipelined, and can achieve a throughput of one division per clock cycle. If the throughput required is smaller, the divisions per clock parameter allows compromises of throughput and resource use. This algorithm naturally generates a remainder, so is the choice for applications requiring integer remainders or modulus results. The High Radix with prescaling algorithm resolves multiple bits of the quotient at a time. It is implemented by reusing the quotient estimation block, and so throughput is a function of the number of iterations required. The operands must be conditioned in preparation for the iterative operation. This overhead makes this algorithm less suitable for smaller operands. Although the iterative calculation is more complex than for Radix-2, taking more cycles to perform, the number of bits of quotient resolved per iteration and its use of XtremeDSP slices makes this the preferred option for larger operand widths.
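To make the Radix-2 behaviour concrete, here is a small software model of a non-restoring division that resolves one quotient bit per iteration using only additions and subtractions, as the datasheet describes. This is a bit-serial illustration of the algorithm class, not the core's RTL; the operand width and the final remainder-restore convention are assumptions.

```python
def nonrestoring_divide(dividend: int, divisor: int, width: int):
    """Radix-2 non-restoring division: one quotient bit per iteration,
    using add/subtract only. Returns (quotient, remainder)."""
    assert divisor > 0 and 0 <= dividend < (1 << width)
    remainder, quotient = 0, 0
    for i in range(width - 1, -1, -1):
        bit = (dividend >> i) & 1
        if remainder >= 0:
            remainder = (remainder << 1) + bit - divisor   # subtract
        else:
            remainder = (remainder << 1) + bit + divisor   # add back
        quotient = (quotient << 1) | (1 if remainder >= 0 else 0)
    if remainder < 0:              # final correction ("restore") step
        remainder += divisor
    return quotient, remainder

print(nonrestoring_divide(13, 3, width=4))   # -> (4, 1)
```

In the hardware core each loop iteration corresponds to one pipeline stage (or one clock, depending on the divisions-per-clock setting), which is why throughput can reach one division per cycle while latency grows with the operand width.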
Pipeline approach used for recognition of dynamic meshes

Figure 1: Screenshot of MapEditor, the GUI front end of MVE-2, showing a simple convolution.

2.5 Advanced pipeline examples

The execution mechanism implemented in the core can do more than simple pipeline execution. We support multiple execution, module-driven execution, and cycles. Any map can be run N times. A module can run a sub-branch to provide its data more than once. It is also possible to create cycles in the module-map graph. Sinus is an example of sub-branch construction. Execution of the Sinus module is controlled by the GenerateGraph module. In this particular case the module map runs only once while the Sinus module runs 100 times (see Figure 2). Counter is an example of DelayModule usage. The DelayModule acts as a single-place memory with initialization. It returns data from the previous (N-1) step. In the first step it returns data from the initialization port, allowing cycles in the module-map graph. This example counts from zero to the number of runs minus one. The delay modules can be chained (see Figure 3).

2.6 Module creation

Simple creation of modules is one of the most interesting features of our system. By inheriting a new class from the MveCore.Module abstract class, a fully functional module is created. There are only two methods that have to be overridden. The first one is the constructor, which creates ports and defines their names and accepted data types. The second one is the Execute method, which represents the activity of the module. We are using features of the .NET system to provide comfort to module authors. For example, any public property of a module is automatically displayed in a module setup dialog and saved/restored with the module map. Adding a user-editable parameter is therefore a matter of exposing it using the property mechanism. There is a set of advanced methods that can be called and a set of events that can be handled by a module. These additional methods make it possible to create a module with advanced features, such as immediate reaction to incoming data, advanced module GUI creation, execution of a sub-branch, etc.

2.7 Documenting MVE-2

Documentation for MVE-2 module libraries can be generated automatically using the MMDOC utility that is part of the system. It uses attribute classes and comments that describe modules, ports and data types, and generates electronic documentation in a number of formats (html, chm, ...). It can also be used to generate a list of uncommented entities (methods, modules, etc.), thus enforcing careful commenting.

Figure 2: (left) Simple example of module-driven sub-branch execution. Figure 3: (right) Simple example of a loop with a delay module.

3 Application of MVE-2 for AI and CG

The recognition task was one of the main topics of AI research in the past decades. Many efforts appeared in the fields of voice recognition, image recognition and mesh recognition, as this task is crucial for understanding the environment. Recognition algorithms allow the AI to reduce the amount of information to be processed; they allow it to understand the relations in the environment and to make correct decisions quickly. The task we are addressing using MVE-2 is recognition of dynamic meshes, i.e. animations in surface representation. Our goal is to provide not only static information (e.g. "the object in front of the camera is a human"), but also dynamic information ("the object in front of the camera is a human, who is jumping").
This will not only allow the system to better analyze the scene at the current time, but it will also help the system to predict future states of its environment. The task of dynamic mesh extraction is one of the state-of-the-art problems being investigated by many recent papers ([3], [4]), but for our purposes we can assume that the extraction has already been performed. Our input is therefore a dynamic mesh M that consists of n static meshes. Our task is to qualitatively evaluate the dynamic mesh and to produce information about it that will help an AI system to plan its actions. Our approach is based on template comparison. We suppose that there is a library of dynamic meshes that represent actions known to the system. The information we are extracting is the correspondence of the given dynamic mesh M to the meshes present in the library. Namely, we want to create a metric in the space of dynamic meshes that will tell us which of the known animations is most similar to the one extracted from the environment of the system. Our method is based on the approach used for static mesh comparison ([1], [2]), i.e. using the Hausdorff distance of two objects. We represent each dynamic mesh in E3 by a static tetrahedral mesh in 4D, subsequently we compute the approximate Hausdorff distance of the given mesh to each of the library meshes, and finally we pick the one with the smallest distance. Following this scheme, however, requires addressing some non-trivial issues, which will be briefly discussed in the following paragraphs.

3.1 Dynamic mesh representation

Our approach is to represent a dynamic triangle mesh by a static tetrahedral mesh in space-time. This can be easily done for meshes of constant connectivity (i.e. where each triangle corresponds to exactly one triangle in any frame of the animation). In such a case we can see the evolution of a triangle between two frames as a prism in 4D. We can now break this prism into three tetrahedra. If implemented carefully, this approach leads to a consistent tetrahedral mesh representation, even though the faces of the 4D prism are non-planar. Another issue to be addressed is the units used. We must use consistent units for all the meshes, and we must define the relation between time and space units. In order to unify space units we have decided to use relative lengths only, i.e. all sizes and positions are expressed as fractions of the body diagonal of the object. This allows us to measure spatial difference consistently for all meshes. On the other hand, time can be measured absolutely and should never be scaled. The only thing that needs to be done is to relate the time units to the spatial units in order to produce the Euclidean metric in space-time that will be needed for the Hausdorff distance computation. The purpose of the space-time representation is to find how similar two animations are. In other words, distance in the space should represent the difference between meshes. Therefore the distance represented by a single unit in each direction should represent an equal difference.
We wish to find a constant that will relate time (measured absolutely in seconds) and space (measured as a fraction of the main diagonal). We don't know the value of this constant, but we can make the following considerations in order to estimate its value:

1. A time span of 1/100 s is almost unrecognizable for a human observer, while a spatial shift of 10% is on the limit of acceptability; therefore we expect the constant to be larger than 0.01/0.1 = 0.1.
2. Time spans of units of seconds are on the limit of acceptability, while a spatial shift of 0.1% is almost unrecognizable; therefore we expect the constant to be smaller than 1/0.001 = 1000.

Having said that, we can guess the value of the relation coefficient to be about 10, i.e. a time span of 100 ms is equal to a spatial shift of 1%.

3.2 Implementation

We have implemented the proposed method in a set of MVE-2 modules. First, we have debugged a simple module for computation of the distance from a point to a tetrahedron in 4D. Constructing a module that composes a set of triangular meshes into a space-time tetrahedral mesh was very easy thanks to the generality of the data structures provided by the Visualization library. It is also easy to use a variety of input formats. In order to speed the computation up we have also constructed a module called AnimationDistanceEvaluator that encapsulates the distance evaluation from each vertex of one mesh to each tetrahedron of the other. This module provides a significant speedup of the process by using advanced acceleration techniques (spatial subdivision etc.), while it preserves reusability of code, because it calls public methods of the PointToTetrahedronDistanceEvaluator module. A typical map may consist of two loops that compose two space-time tetrahedral meshes. For each of them a new point attribute is computed using the PointToTetrahedronDistanceEvaluator module that represents the distance from each point to the other mesh. A general AttributeMax module can then be used for computing the one-way mesh distance, and a general ScalarMax module finally produces the symmetric estimate of the Hausdorff distance. The resulting point attribute can also be used in other ways. We may display its value distribution with the standard AttributeHistogram and CurveRenderer modules. Such visualization helps when considering the similarity of animations. We can also transform this attribute into a color attribute and display it using some MVE-2 renderer. This allows us to see exactly where and when the two animations are similar or distinct. Such information can also be very useful in many AI tasks, for example machine learning, where a trainee can see how precisely she follows some pattern.

4 Conclusions

We have shown a method for comparing dynamic meshes. This method can be used for a variety of AI applications, from animation recognition to automated learning or teaching. The implementation in the MVE-2 environment allows easy experimenting with the method in various setups and algorithms. The current implementation is still not fast enough to compare moderately complex animations in real time, but we are still working on speeding the method up. We believe that the performance of the optimized [...]
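As a rough illustration of the metric described above, the following sketch stacks animation frames into 4D point clouds (space normalized by the body diagonal, time scaled by the relation constant of about 10) and computes a symmetric Hausdorff distance between the two point sets. It samples vertices rather than evaluating point-to-tetrahedron distances, so it approximates the idea of the method, not the accelerated MVE-2 modules themselves; the constant and the toy data are assumptions.

```python
import numpy as np

def spacetime_points(frames, frame_times, diag, c=10.0):
    """Stack animation frames (each an (n, 3) vertex array) into 4D points.
    Space is divided by the body diagonal `diag`; time in seconds is
    scaled by the relation constant c (~10, per the estimate above)."""
    pts = []
    for verts, t in zip(frames, frame_times):
        v = np.asarray(verts, dtype=float) / diag
        tcol = np.full((v.shape[0], 1), c * t)
        pts.append(np.hstack([v, tcol]))
    return np.vstack(pts)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two 4D point sets
    (a vertex-sampled approximation of the mesh-to-mesh distance)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Toy usage: two one-triangle "animations" of two frames each.
A = spacetime_points([[(0, 0, 0), (1, 0, 0), (0, 1, 0)],
                      [(0, 0, 0.1), (1, 0, 0.1), (0, 1, 0.1)]],
                     [0.0, 0.1], diag=1.0)
B = spacetime_points([[(0, 0, 0), (1, 0, 0), (0, 1, 0)],
                      [(0, 0, 0.2), (1, 0, 0.2), (0, 1, 0.2)]],
                     [0.0, 0.1], diag=1.0)
print(hausdorff(A, B))
```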
I-Shou University, Computer Organization: Chapter 4, The Processor

How to Design a Processor?
- Analyze the instruction set (datapath requirements)
- Select a set of datapath components and establish a clocking methodology
- Build a datapath meeting the requirements
- Analyze the implementation of each instruction to determine the setting of the control points effecting the register transfers

§ 4.3 Building a Datapath

Building a Datapath
- Datapath elements include the PC, an extender for zero- or sign-extension, and adders to add 4 or the extended immediate to the PC
- R-Type/Load/Store Datapath
- R-type instructions: read two register operands, perform the arithmetic/logical operation, write the register result (a toy software model of these steps is sketched below)
- Use multiplexers where alternate data sources are used for different instructions
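The following is a toy register-transfer model of the R-type execution steps listed above. The register file, the ALU operations, and the tuple-style "instruction" are illustrative simplifications, not a real MIPS encoding or control implementation.

```python
# Toy register-transfer model of the R-type steps: read two register
# operands, perform the ALU operation, write the register result.
regs = [0] * 32

ALU = {
    "add": lambda a, b: (a + b) & 0xFFFFFFFF,
    "sub": lambda a, b: (a - b) & 0xFFFFFFFF,
    "and": lambda a, b: a & b,
    "or":  lambda a, b: a | b,
}

def execute_rtype(op, rd, rs, rt):
    a = regs[rs]               # 1. read two register operands
    b = regs[rt]
    result = ALU[op](a, b)     # 2. perform the arithmetic/logical operation
    if rd != 0:                # 3. write the register result (r0 stays 0)
        regs[rd] = result

regs[8], regs[9] = 5, 7
execute_rtype("add", 10, 8, 9)
print(regs[10])   # -> 12
```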
NLFM Pulse Compression and Its Time-Domain FPGA Implementation

Journal of Hubei University (Natural Science), Vol. 40, No. 4, July 2018. Received: 2017-09-04. Supported by the National Natural Science Foundation of China (61601175). About the authors: LU Cong (b. 1993), male, master's student; WANG Xuguang, corresponding author, Ph.D., lecturer, master's supervisor, E-mail: 109278484@. Article ID: 1000-2375(2018)04-0384-06. CLC number: TN713; Document code: A; DOI: 10.3969/j.issn.1000-2375.2018.04.013

NLFM Pulse Compression and Its Time-Domain FPGA Implementation
LU Cong, YANG Weiming, WANG Xuguang, ZENG Zhangfan (School of Computer & Information Engineering, Hubei University, Wuhan 430062, China)

Abstract: The generation principle of the nonlinear frequency modulated (NLFM) signal and the design of a matched filter realizing pulse compression are analyzed in this paper. The NLFM signal and the radar echo signal are generated with MATLAB tools, and a distributed-arithmetic FIR matched filter realizing pulse compression is designed on the FPGA device EP2C35F672C8, which processes the sampled and quantized echo signals. Finally, the pulse-compressed echo is simulated with Modelsim to verify the design of the matched filter. The whole filter circuit uses a fully pipelined, parallel structure and occupies 2468 logic elements, 2073 registers, and 25 KB of RAM. By using the BRAM and LABs of the FPGA chip instead of multiplier IP, the limitation that hardware resources place on the filter length is removed.
Key words: NLFM signal; time-domain pulse compression; FPGA; matched filter; distributed filtering algorithm

0 Introduction
Modern radars usually adopt pulse compression to improve the velocity resolution and range resolution of the system [1]. Pulse compression is the process in which the wide frequency-modulated pulse transmitted by the radar is processed by a digital matched filter at the receiver to obtain a narrow-pulse echo signal. The compressed signal has both large time width and large bandwidth, which guarantees the radar's detection range and target resolution [2]. LFM and NLFM signals are the two basic signals commonly used in pulse compression. The LFM signal is easy to generate and widely used, but when the LFM echo passes directly through the matched filter the compressed signal has high sidelobes; a window function is usually applied to the compressed output to suppress them, which broadens the main lobe to varying degrees. The NLFM signal is generally designed on the basis of a window function [3]; its advantage is that if its echo is matched-filtered directly, a signal with very low sidelobes is obtained, eliminating the weighting stage.

Pulse compression can be implemented in the frequency domain or in the time domain [4]. The frequency-domain method is fast, but it requires repeated fast Fourier transforms (FFT) and inverse FFTs (IFFT), so its hardware overhead is large; the time-domain method has a simple circuit structure but is slower. In this paper a distributed-arithmetic FIR matched filter [5-6] with a fully pipelined, parallel structure is designed, and time-domain pulse compression of the NLFM signal is implemented on an FPGA, which both saves hardware and increases the processing speed.

1 NLFM signal generation and pulse compression
1.1 Generation of the NLFM signal
The generation of NLFM signals is relatively complex; there are many mathematical models and no unified standard, and approximate methods are used in practice. A classical approach generates the NLFM signal from the principle of stationary phase: the weighting window of an LFM signal is turned into a spectral function, so that the designed NLFM signal has an approximately window-shaped spectrum. When such a signal is pulse-compressed, the weighting stage needed for LFM is avoided, giving better sidelobe suppression and a steeper transition band. Taking the Hamming window as an example (other window functions are designed similarly), the design principle is as follows [7]:

The Hamming window function is
W(f) = 0.54 + 0.46 cos(2πf/B),  -B/2 ≤ f ≤ B/2   (1)

The group delay based on the window function is
T(f) = K_T ∫_{-∞}^{f} W(y) dy   (2)

where the constant K_T = (T/B)/0.54. Substituting (1) into (2) gives
T(f) = (T/B) f + (0.426 T/π) sin(2πf/B),  -B/2 ≤ f ≤ B/2   (3)

Taking the inverse of T gives
f(T) = T^(-1)(f)   (4)

For clarity, t is used in place of T, i.e. f(t) is the NLFM signal designed from the Hamming window. For simple group-delay functions the inverse can be obtained directly with MATLAB's built-in functions, but when the group-delay function is more complex, numerical methods must be used to derive it. The NLFM signal can be generated by direct digital synthesis (DDS) or with MATLAB; in this paper the radar transmit signal and echo signal are generated by numerical analysis in MATLAB.
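The following NumPy sketch mirrors the MATLAB flow described above: it evaluates the Hamming-window group delay of Eq. (3), numerically inverts it to obtain the instantaneous frequency of Eq. (4), integrates the phase to form the complex NLFM pulse, and checks the matched-filter output. The sampling rate and parameters follow the values given later in Section 3.2 (B = 5 MHz, T = 5 µs, fs = 2.5·B); the grid size and normalization are illustrative assumptions.

```python
import numpy as np

B, T = 5e6, 5e-6          # bandwidth and pulse width (Section 3.2)
fs = 2.5 * B              # sampling rate
t = np.arange(0, T, 1 / fs)

# Group delay of Eq. (3): frequency -> time, for the Hamming window.
f_grid = np.linspace(-B / 2, B / 2, 4096)
tau = (T / B) * f_grid + (0.426 * T / np.pi) * np.sin(2 * np.pi * f_grid / B)
tau -= tau[0]             # shift so the delay spans [0, T]

# Numerically invert T(f): instantaneous frequency f(t), Eq. (4).
f_inst = np.interp(t, tau, f_grid)

# The phase is the integral of the instantaneous frequency.
phase = 2 * np.pi * np.cumsum(f_inst) / fs
s = np.exp(1j * phase)    # complex NLFM transmit pulse

# Matched filter = time-reversed complex conjugate of the transmit pulse
# (Eq. (6) with K = 1, t0 = 0); its output shows the compressed pulse.
h = np.conj(s[::-1])
compressed = np.convolve(s, h)
print(len(s), np.abs(compressed).max())
```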
1.2 Implementation of pulse compression
The principle of pulse compression is to compress the wide-pulse echo received by the radar, reducing its time width and raising the peak of the compressed signal, so that the time-bandwidth product of the signal is much larger than 1. A radar system using pulse compression obtains both velocity resolution and range resolution, and a matched filter implemented on an FPGA is currently the mainstream way of realizing pulse compression. Mathematically, the processing is as follows [8]:

Time-domain pulse compression is the linear convolution of the matched filter's impulse response h(t) with the radar echo signal s(t):
y(t) = s(t) * h(t) = ∫_{-∞}^{t} s(τ) h(t - τ) dτ   (5)

According to the matched-filter criterion, when the output signal-to-noise ratio is maximized the impulse response of the matched filter is
h(t) = K s*(t0 - t)   (6)

where K is a constant, t0 is a delay, and s*(t) denotes the complex conjugate. With K = 1 and t0 = 0, the filter's impulse response is the complex conjugate of the echo signal. Considering the variety of noise carried by the echo and the uncertainty of the target information, an approximation is used when designing the time-domain matched filter: the complex conjugate of the transmitted signal, which is known, is used as the filter's impulse response, which greatly simplifies the design. In addition, compared with the windowed weighting used to suppress sidelobes in LFM pulse compression, using the NLFM signal directly as the radar transmit pulse makes the circuit design simpler and more effective.

2 Design and implementation of the matched filter
2.1 Analysis of the FIR filter structure
The structure of a conventional FIR matched filter is shown in Figure 1. The output of the matched filter is
y = Σ_{i=0}^{N-1} x(N-1-i) h(i)   (7)

Figure 1: Structure of the conventional FIR matched filter.

From (7), an N-tap conventional FIR matched filter needs N multipliers and N-1 adders; since both the echo signal and the filter's impulse response are complex, an N-tap matched filter actually requires 4N multipliers and 4N-1 adders. When N is large, the embedded IP resources of the FPGA cannot satisfy the design, and the multiplications are complex and slow. Using a distributed-arithmetic approach, with BRAM and LABs replacing the multipliers, removes the constraint that multiplier resources place on the filter design while maintaining the filter's processing speed.

2.2 Principle of the distributed-arithmetic filter
A distributed-arithmetic filter uses the BRAM embedded in the FPGA and its abundant LUTs, replacing the multipliers in the convolution with data storage and address translation. All possible values of the N-tap partial products are first precomputed and stored in a RAM block; the input data are then converted into addresses for this storage, the RAM is looked up, and the outputs are shifted and summed to obtain the convolution result. The algorithm converts multiplications into memory and register operations, making full use of the FPGA's resources and saving hardware cost.

The principle is as follows. The echo x(t) is sampled to obtain the filter input x(n), whose binary representation is
x(n) = Σ_{k=0}^{b-1} x_k(n) 2^k   (8)

where x_k(n) is the k-th bit of x(n) and b is the word length of the sampled data. The output of an N-tap matched filter is then
y(n) = Σ_{n=0}^{N-1} x(n) h(N-1-n) = Σ_{n=0}^{N-1} h(N-1-n) Σ_{k=0}^{b-1} x_k(n) 2^k = Σ_{k=0}^{b-1} 2^k Σ_{n=0}^{N-1} h(N-1-n) x_k(n)   (9)

From (9), the k-th bit of each input sample (1 or 0) is first ANDed with the filter coefficients and summed; the partial sum is then shifted left by k bits (multiplication by 2^k) and accumulated, finally giving the convolution sum y(n). The distributed algorithm thus turns the convolution from an accumulation of products into a shift-and-add process [9]. As long as the first-stage partial sums are known, shifting and adding yields the time-domain convolution; therefore, in the circuit, all possible values of the first-stage sums are stored in RAM beforehand, the input data are converted into memory addresses, and the memory outputs are shifted and summed. This is the principle of distributed arithmetic.

2.3 FPGA implementation of the distributed filter
From (9), the filter designed here has 48 taps, so the term Σ_{n=0}^{N-1} h(N-1-n) x_k(n) has 2^48 possible values; taking the complex multiplication into account, a direct ROM table would need 2^2 × 2^48 storage locations, which is impossible on any existing FPGA chip. For long filters, the overall pipeline can therefore be cut into several pipelines working in parallel, which reduces the amount of storage. With six parallel pipelines, each pipeline is an 8-tap FIR filter whose table holds 2^8 entries, so the RAM needed for the convolution drops to 6 × 2^8 × 2^2 locations, which a typical FPGA can easily provide; the number of pipelines can be increased further to cut the storage even more. The block diagram of the distributed filter is shown in Figure 2. In Figure 2, k denotes the k-th bit of the input data; each ROM table stores all 2^8 possible partial products of an 8-tap distributed-arithmetic FIR section and is addressed by the converted input data, and the ROM outputs are shifted left by k bits and summed. The complete design needs four such pipeline structures, whose outputs are the values I1, I2, Q1 and Q2 in Figure 3: because the filter input is complex, complex multiplication requires four of the pipelines of Figure 2. The overall structure of the distributed filter is shown in Figure 3.

As Figure 3 shows, the real and imaginary parts of the echo are first separated and then sampled and quantized; this step is carried out with MATLAB. The filter finally has to output the magnitude of the signal, but the traditional way of computing the magnitude still uses multipliers and a square-root operation, which is complex and has high latency, so a simple magnitude-estimation method with low latency is needed. Let the magnitude of the signal be Y; the estimation algorithm is [10]
Y = MAX{ MAX(|I|, |Q|), (7/8) MAX(|I|, |Q|) + (1/2) MIN(|I|, |Q|) }   (10)

According to statistics, this complex-magnitude formula loses no more than 0.13 dB of the signal, and the term (7/8) MAX(|I|, |Q|) can be realized with a combination of shift registers and adders. The overall structure of the matched filter is thus complete; the whole design uses only register and adder resources, so in theory a filter of any length can be designed as long as the FPGA has enough ROM and adder resources.
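The following is a bit-level software model of the distributed-arithmetic lookup of Eq. (9), with the magnitude approximation of Eq. (10) added as a small helper. One real-valued 8-tap section with unsigned samples is shown; the tap values, word length, and test data are illustrative assumptions, and the real design uses four complex pipelines rather than one.

```python
import numpy as np

def build_da_lut(h):
    """ROM contents of Eq. (9): the partial sum over the selected taps
    for every possible bit pattern of the 8 delay-line samples."""
    n = len(h)
    return np.array([sum(h[j] for j in range(n) if (p >> j) & 1)
                     for p in range(1 << n)])

def da_fir_output(window, h, lut, bits=8):
    """One output sample by shift-and-add over the bit planes of the
    (unsigned) samples currently held in the delay line."""
    acc = 0.0
    for k in range(bits):                         # bit plane k
        addr = 0
        for j in range(len(h)):                   # build the LUT address
            addr |= ((int(window[j]) >> k) & 1) << j
        acc += lut[addr] * (1 << k)               # weight by 2^k
    return acc

def approx_magnitude(i, q):
    """Multiplier-free magnitude estimate of Eq. (10)."""
    big, small = max(abs(i), abs(q)), min(abs(i), abs(q))
    return max(big, 7 * big / 8 + small / 2)

h = np.array([1.0, -2.0, 3.0, 0.5, 0.0, 1.5, -1.0, 2.0])   # one 8-tap section
lut = build_da_lut(h)
window = np.array([10, 0, 255, 3, 7, 19, 128, 1])          # 8-bit samples
assert np.isclose(da_fir_output(window, h, lut), float(np.dot(window, h)))
print(approx_magnitude(1.0, 1.0))   # ~1.375 vs the exact 1.414
```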
Figure 4: Schematic of the distributed filter circuit.

3 FPGA implementation and test of pulse compression
3.1 FPGA hardware circuit design
The hardware implementation of the distributed filter uses a fully pipelined, parallel structure, which is fast but resource-hungry. The ALTERA FPGA device EP2C35F672C8 was selected for the circuit design; the hardware schematic is shown in Figure 4. The real and imaginary parts of the echo signal, after sampling and quantization, are stored in the on-chip memory blocks ROM_real and ROM_imag; driven by the clock signal CLK and addressed by the counter module, they serve as the input data of the matched filter. The address module converts the k-th bit of the input data into the addresses of the ROM tables, extracting the partial products; this step is the core of the distributed algorithm, converting the convolution multiplications into table lookups. Finally, the ROM outputs are shifted and summed to obtain the real part I and imaginary part Q of the pulse-compressed echo.

The distributed filter outputs the real and imaginary parts of the compressed echo, from which the magnitude must still be computed. From Eq. (10), this approximation can be realized entirely with adders and shift registers; the circuit is shown in Figure 5. An xor2 module computes the absolute values of the real and imaginary parts (the input data are XORed bit by bit with their sign bit), and comparator-based selection plus adder accumulation complete the complex-magnitude computation; the output data4[18:0] is the approximate magnitude of the pulse-compressed echo. This completes the FPGA hardware design of the matched filter; a Test Bench driver is then written to verify the filter's performance.

Figure 5: Schematic of the magnitude circuit.

3.2 Modelsim simulation
The NLFM signal is designed with MATLAB [11], and the radar echo is generated as the input to the filter. The design parameters are bandwidth B = 5 MHz and pulse width T = 5 μs. According to the Nyquist sampling theorem, the sampling rate must be at least twice the highest frequency of the signal, otherwise aliasing occurs; the echo sampling rate is set to f_p = 2.5·B. From the radar range-resolution formula δ = c/2B, where c is the speed of light, the theoretical resolution is δ = 30 m; although the measured value differs from the theoretical one because of the sampling precision and the filter length, modern radars keep pursuing this theoretical value. The spacing of the two targets is set to 45 m. The MATLAB simulation results are shown in Figure 6.

Figure 6: MATLAB simulation of the pulse compression.

As Figure 6 shows, the pulse transmitted by the radar is an NLFM signal; two targets 45 m apart are set, and after some time the echo is received at the radar receiver. From the echo waveform alone, the number of targets and their spacing cannot be determined, but after pulse-compression processing the sidelobes of the waveform are suppressed and the two main lobes representing the targets become much clearer, concentrating the signal energy in the main lobes and reducing the energy loss. The echo is then sampled and quantized as the input data of the FPGA matched filter in order to test the effect of implementing pulse compression with the distributed algorithm; the result is shown in Figure 7. Compared with Figure 6, the sampled and quantized echo processed by the FPGA matched filter achieves the pulse-compression effect very well, which shows that the distributed algorithm can fully replace the multiplications of the linear convolution. From the measured data, the spacing of the main lobes is Δt = 12 ns, and the measured distance between the two targets is 45 m, consistent with the set value. Considering the influence of the sampling rate and quantization precision, increasing the filter length can further improve the resolution of the target spacing. The verification shows that with the window-based NLFM signal as the radar transmit pulse, the pulse-compressed echo has good sidelobe suppression and a steep transition band, giving strong target-discrimination capability. The FPGA results show that the whole circuit occupies 2468 logic elements, 2073 registers and 25 KB of RAM; the fully pipelined implementation of the distributed algorithm demands considerable resources, but as process technology advances and chips integrate more basic resources, distributed-arithmetic pulse compression will find ever wider application.

Figure 7: Waveform of the echo signal after processing by the FPGA circuit.

4 Conclusion
In this paper a distributed-arithmetic FIR matched filter is used to implement pulse compression of the NLFM signal. The registers, adders and ROM resources of the FPGA replace the multipliers of a conventional filter and the square-root operation of the magnitude computation, greatly reducing the hardware overhead, and the fully pipelined, parallel structure maintains the speed of the time-domain pulse compression. Comparing the simulation results of NLFM and LFM pulse compression shows that with the NLFM signal as the radar transmit pulse, the receiver obtains an echo waveform with low sidelobes and a steep transition band, reducing the energy loss of the radar signal within the effective bandwidth and providing strong target resolution. For the NLFM signal, the filter length N has to approach or even equal f_p·D/B, so when the time-bandwidth product D is large, the cost of time-domain pulse compression is also large.

5 References
[1] Pan L. Research and implementation of a radar pulse compression system based on FPGA [D]. Shanghai: Shanghai Jiao Tong University, 2008.
[2] Liang L. Design of a radar signal processing system based on FPGA [D]. Nanjing: Nanjing University of Science and Technology, 2006.
[3] Ruan L T. Waveform design and pulse compression of nonlinear frequency modulated signals [D]. Xi'an: Xidian University, 2009.
[4] Wang K. Research and implementation of a pulse compression system based on FPGA [D]. Wuhan: Huazhong University of Science and Technology, 2009.
[5] Cheng Y D, Zheng J X. A high-order distributed FIR filter for digital down-conversion and its FPGA implementation [J]. Application of Electronic Technique, 2011, 37(2): 57-59.
[6] Li S H, Zeng Y C. High-order FIR filters based on distributed arithmetic and their FPGA implementation [J]. Computer Engineering and Applications, 2010, 46(12): 136-138.
[7] Xu F. FPGA implementation of pulse compression for nonlinear frequency modulated signals [D]. Xi'an: Xidian University, 2014.
[8] Sun B P. Design and implementation of radar signal processing algorithms based on FPGA [D]. Beijing: Beijing Institute of Technology, 2014.
[9] Cui Y Q, Gao X D, He S X. Filter design based on FPGA distributed arithmetic [J]. Modern Electronics Technique, 2010, 33(16): 117-119.
[10] Yang W M. A signal magnitude computation method based on EPLD technology [J]. Journal of Hubei University (Natural Science), 1999, 11(2): 138-141.
[11] Du Y. MATLAB and FPGA implementation of digital filters [M]. 2nd ed. Beijing: Publishing House of Electronics Industry, 2015.
Computer Organization and Design (Fifth Edition) — Answers

Computer Organization and Design is a book published by China Machine Press in 2010; its author is David A. Patterson.
The book uses a MIPS processor to demonstrate computer hardware technology, pipelining, the memory hierarchy, I/O, and other basic functions.
In addition, the book includes an introduction to the x86 architecture.
Summary: This best-selling computer organization textbook has been thoroughly updated to focus on the revolutionary change taking place in computer architecture today: the move from uniprocessors to multicore microprocessors.
The ARM edition of the book was published to emphasize the importance of embedded systems to the computing industry across Asia, and it uses the ARM processor to discuss the instruction set and arithmetic of real computers.
ARM is the most popular instruction set architecture for embedded devices, and roughly four billion embedded devices are sold worldwide each year.
It adopts ARMv6 (the ARM 11 family) as the main architecture to present the basics of instruction sets and computer arithmetic.
It covers the revolutionary change from sequential to parallel computing, with a new chapter on parallelism and sections in every chapter highlighting parallel hardware and software topics.
It includes a new appendix, written by NVIDIA's chief scientist and head of architecture, introducing the emergence and importance of the modern GPU and describing in detail, for the first time, this highly parallel, multithreaded, multicore processor optimized for visual computing.
It describes a unique method for measuring multicore performance, the Roofline model, and uses it to benchmark and analyze the AMD Opteron X4, Intel Xeon 5000, Sun UltraSPARC T2, and IBM Cell.
It covers new material on flash memory and virtual machines.
It provides a wealth of stimulating exercises, more than 200 pages of them.
It uses the AMD Opteron X4 and Intel Nehalem as running examples throughout Computer Organization and Design: The Hardware/Software Interface (English edition, 4th edition, ARM edition).
All processor performance examples have been updated with the SPEC CPU2006 suite.
Contents:
1 Computer Abstractions and Technology: 1.1 Introduction; 1.2 Below Your Program; 1.3 Under the Covers; 1.4 Performance; 1.5 The Power Wall; 1.6 The Sea Change: The Switch from Uniprocessors to Multiprocessors; 1.7 Real Stuff: Manufacturing and Benchmarking the AMD Opteron X4; 1.8 Fallacies and Pitfalls; 1.9 Concluding Remarks; 1.10 Historical Perspective and Further Reading; 1.11 Exercises
2 Instructions: Language of the Computer: 2.1 Introduction; 2.2 Operations of the Computer Hardware; 2.3 Operands of the Computer Hardware; 2.4 Signed and Unsigned Numbers; 2.5 Representing Instructions in the Computer; 2.6 Logical Operations; 2.7 Instructions for Making Decisions; 2.8 Supporting Procedures in Computer Hardware; 2.9 Communicating with People; 2.10 ARM Addressing for 32-Bit Immediates and More Complex Addressing Modes; 2.11 Parallelism and Instructions: Synchronization; 2.12 Translating and Starting a Program; 2.13 A C Sort Example to Put It All Together; 2.14 Arrays versus Pointers; 2.15 Advanced Material: Compiling C and Interpreting Java; 2.16 Real Stuff: MIPS Instructions; 2.17 Real Stuff: x86 Instructions; 2.18 Fallacies and Pitfalls; 2.19 Concluding Remarks; 2.20 Historical Perspective and Further Reading; 2.21 Exercises
3 Arithmetic for Computers: 3.1 Introduction; 3.2 Addition and Subtraction; 3.3 Multiplication; 3.4 Division; 3.5 Floating Point; 3.6 Parallelism and Computer Arithmetic: Associativity; 3.7 Real Stuff: Floating Point in the x86; 3.8 Fallacies and Pitfalls; 3.9 Concluding Remarks; 3.10 Historical Perspective and Further Reading; 3.11 Exercises
4 The Processor: 4.1 Introduction; 4.2 Logic Design Conventions; 4.3 Building a Datapath; 4.4 A Simple Implementation Scheme; 4.5 An Overview of Pipelining; 4.6 Pipelined Datapath and Control; 4.7 Data Hazards: Forwarding versus Stalling; 4.8 Control Hazards; 4.9 Exceptions; 4.10 Parallelism and Advanced Instruction-Level Parallelism; 4.11 Real Stuff: the AMD Opteron X4 (Barcelona) Pipeline; 4.12 Advanced Topic: an Introduction to Digital Design Using a Hardware Design Language to Describe and Model a Pipeline, and More Pipelining Illustrations; 4.13 Fallacies and Pitfalls; 4.14 Concluding Remarks; 4.15 Historical Perspective and Further Reading; 4.16 Exercises
5 Large and Fast: Exploiting Memory Hierarchy: 5.1 Introduction; 5.2 The Basics of Caches; 5.3 Measuring and Improving Cache Performance; 5.4 Virtual Memory; 5.5 A Common Framework for Memory Hierarchies; 5.6 Virtual Machines; 5.7 Using a Finite-State Machine to Control a Simple Cache; 5.8 Parallelism and Memory Hierarchies: Cache Coherence; 5.9 Advanced Material: Implementing Cache Controllers; 5.10 Real Stuff: the AMD Opteron X4 (Barcelona) and Intel Nehalem Memory Hierarchies; 5.11 Fallacies and Pitfalls; 5.12 Concluding Remarks; 5.13 Historical Perspective and Further Reading; 5.14 Exercises
6 Storage and Other I/O Topics: 6.1 Introduction; 6.2 Dependability, Reliability, and Availability; 6.3 Disk Storage; 6.4 Flash Storage; 6.5 Connecting Processors, Memory, and I/O Devices; 6.6 Interfacing I/O Devices to the Processor, Memory, and Operating System; 6.7 I/O Performance Measures: Examples from Disk and File Systems; 6.8 Designing an I/O System; 6.9 Parallelism and I/O: Redundant Arrays of Inexpensive Disks; 6.10 Real Stuff: Sun Fire x4150 Server; 6.11 Advanced Topics: Networks; 6.12 Fallacies and Pitfalls; 6.13 Concluding Remarks; 6.14 Historical Perspective and Further Reading; 6.15 Exercises
7 Multicores, Multiprocessors, and Clusters: 7.1 Introduction; 7.2 The Difficulty of Creating Parallel Processing Programs; 7.3 Shared Memory Multiprocessors; 7.4 Clusters and Other Message-Passing Multiprocessors; 7.5 Hardware Multithreading; 7.6 SISD, MIMD, SIMD, SPMD, and Vector; 7.7 Introduction to Graphics Processing Units; 7.8 Introduction to Multiprocessor Network Topologies; 7.9 Multiprocessor Benchmarks; 7.10 Roofline: A Simple Performance Model; 7.11 Real Stuff: Benchmarking Four Multicores Using the Roofline Model; 7.12 Fallacies and Pitfalls; 7.13 Concluding Remarks; 7.14 Historical Perspective and Further Reading; 7.15 Exercises
Index
CD-ROM CONTENT
A Graphics and Computing GPUs: A.1 Introduction; A.2 GPU System Architectures; A.3 Scalable Parallelism-Programming GPUs; A.4 Multithreaded Multiprocessor Architecture; A.5 Parallel Memory System; A.6 Floating Point Arithmetic; A.7 Real Stuff: The NVIDIA GeForce 8800; A.8 Real Stuff: Mapping Applications to GPUs; A.9 Fallacies and Pitfalls; A.10 Concluding Remarks; A.11 Historical Perspective and Further Reading
B1 ARM and Thumb Assembler Instructions: B1.1 Using This Appendix; B1.2 Syntax; B1.3 Alphabetical List of ARM and Thumb Instructions; B1.4 ARM Assembler Quick Reference; B1.5 GNU Assembler Quick Reference
B2 ARM and Thumb Instruction Encodings
B3 Instruction Cycle Timings
C The Basics of Logic Design
D Mapping Control to Hardware
ADVANCED CONTENT; HISTORICAL PERSPECTIVES & FURTHER READING; TUTORIALS; SOFTWARE

About the author: David A. Patterson is a professor in the Computer Science Division at the University of California, Berkeley.
Xian Yang — English Translation

Shenyang University of Technology, Undergraduate Graduation Project (Thesis), English Translation. Project title: A High-speed DES Implementation for Network Applications. School: School of Information Science and Engineering. Major and class: Computer Science and Technology, Class 0704. Student: Xian Yang. Student ID: 070405127. Supervisor: Liu Ge.

A High-speed DES Implementation for Network Applications

Abstract
A high-speed data encryption chip implementing the Data Encryption Standard (DES) has been developed. The DES modes of operation supported are Electronic Code Book and Cipher Block Chaining. The chip is based on a gallium arsenide (GaAs) gate array containing 50K transistors. At a clock frequency of 250 MHz, data can be encrypted or decrypted at a rate of 1 GBit/second, making this the fastest single-chip implementation reported to date. High performance and high density have been achieved by using custom-designed circuits to implement the core of the DES algorithm. These circuits employ precharged logic, a methodology novel to the design of GaAs devices. A pipelined flow-through architecture and an efficient key exchange mechanism make this chip suitable for low-latency network controllers.

1. Introduction
Networking and secure distributed systems are major research areas at the Digital Equipment Corporation's Systems Research Center. A prototype network called Autonet with 100 MBit/s links has been in service there since early 1990 [14]. We are currently working on a follow-on network with link data rates of 1 GBit/s. The work described here was motivated by the need for data encryption hardware for this new high-speed network. Secure transmission over a network requires encryption hardware that operates at link speed. Encryption will become an integral part of future high-speed networks. We have chosen the Data Encryption Standard (DES) since it is widely used in commercial applications and allows for efficient hardware implementations. Several single-chip implementations of the DES algorithm exist or have been announced. Commercial products include the AmZ8068/Am9518 [1] with an encryption rate of 14 MBit/s and the recently announced VM007 with a throughput of 192 MBit/s [18].

An encryption rate of 1 GBit/s can be achieved by using a fast VLSI technology. Possible candidates are GaAs direct-coupled field-effect transistor logic (DCFL) and silicon emitter-coupled logic (ECL). As a semiconductor material, GaAs is attractive because of its high electron mobility, which makes GaAs circuits twice as fast as silicon circuits. In addition, electrons reach maximum velocity in GaAs at a lower voltage than in silicon, allowing for lower internal operating voltages, which decreases power consumption. These properties position GaAs favorably with respect to silicon, in particular for high-speed applications. The disadvantage of GaAs technology is its immaturity compared with silicon technology. GaAs has been recognized as a possible alternative to silicon for over twenty years, but only recently have the difficulties with manufacturing been overcome. GaAs is becoming a viable contender for VLSI designs [8, 10], and this motivated us to explore the feasibility of GaAs for our design.

In this paper, we will describe a new implementation of the DES algorithm with a GaAs gate array. We will show how high performance can be obtained even with the limited flexibility of a semi-custom design. Our approach was to use custom-designed circuits to implement the core of the DES algorithm and an unconventional chip layout that optimizes the data paths. Further, we will describe how encryption can be incorporated into network controllers without compromising network throughput or latency.
We will show that low latency can be achieved with a fully pipelined DES chip architecture and hardware support for a key exchange mechanism that allows for selecting the key on the fly. Section 2 of this paper outlines the DES algorithm. Section 3 describes the GaAs gate array that we used for implementing the DES algorithm. Section 4 provides a detailed description of our DES implementation. Section 5 shows how the chip can be used for network applications and the features that make it suitable for building low-latency network controllers. This section also includes a short analysis of the economics of breaking DES-enciphered data. Finally, Section 6 contains some concluding remarks.

2. DES Algorithm
The DES algorithm was issued by the National Bureau of Standards (NBS) in 1977. A detailed description of the algorithm can be found in [11, 13]. The DES algorithm enciphers 64-bit data blocks using a 56-bit secret key (not including parity bits, which are part of the 64-bit key block). The algorithm employs three different types of operations: permutations, rotations, and substitutions. The exact choices for these transformations, i.e. the permutation and substitution tables, are not important to this paper. They are described in [11]. As shown in Fig. 1, a block to be enciphered is first subjected to an initial permutation (IP), then to 16 iterations, or rounds, of a complex key-dependent computation, and finally to the inverse initial permutation (IP-1). The key schedule transforms the 56-bit key into sixteen 48-bit partial keys by using each of the key bits several times. Fig. 1 shows an expanded version of the 16 DES iterations for encryption. The inputs to the 16 rounds are the output of IP and sixteen 48-bit keys K1..K16 that are derived from the supplied 56-bit key. First, the 64-bit output data block of IP is divided into two halves, L0 and R0, each consisting of 32 bits. Decryption and encryption use the same data path and differ only in the order in which the key bits are presented to function f. That is, for decryption K16 is used in the first iteration, K15 in the second, and so on, with K1 used in the 16th iteration. The order is reversed simply by changing the direction of the rotate operation performed on C0.

For enciphering data streams that are longer than 64 bits, the obvious solution is to cut the stream into 64-bit blocks and encipher each of them independently. This method is known as Electronic Code Book (ECB) mode [12]. Since for a given key and a given plaintext block the resulting ciphertext block will always be the same, frequency analysis could be used to retrieve the original data. There exist alternatives to the ECB mode that use the concept of diffusion so that each ciphertext block depends on all previous plaintext blocks. These modes are called Cipher Block Chaining (CBC) mode, Cipher Feedback (CFB) mode, and Output Feedback (OFB) mode.
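As a structural illustration of the data flow just described (not the real cipher: the initial permutation, expansion, S-boxes and key schedule are omitted and replaced by placeholders), the following sketch shows the 16-round Feistel iteration, why decryption is simply the same network with the subkey order reversed, and the CBC feedback that ties each block to the previous ciphertext.

```python
# Structural sketch of the 16-round Feistel iteration and of CBC chaining.
# The round function `f` and the subkeys are placeholders, so this
# illustrates the data flow of Fig. 1, not the actual DES transformation.

def f(half32, subkey48):
    # stand-in for expansion + key mixing + S-boxes + permutation
    return (half32 ^ (subkey48 & 0xFFFFFFFF)) & 0xFFFFFFFF

def feistel16(block64, subkeys, decrypt=False):
    """Run the 16 rounds; decryption only reverses the subkey order."""
    keys = list(reversed(subkeys)) if decrypt else list(subkeys)
    left = (block64 >> 32) & 0xFFFFFFFF
    right = block64 & 0xFFFFFFFF
    for k in keys:
        left, right = right, left ^ f(right, k)
    # final swap of the halves, as in DES, so decryption mirrors encryption
    return (right << 32) | left

def cbc_encrypt(blocks, subkeys, iv):
    """Each ciphertext block is XORed into the next plaintext block,
    which is what prevents processing two blocks in parallel."""
    out, prev = [], iv
    for p in blocks:
        c = feistel16(p ^ prev, subkeys)
        out.append(c)
        prev = c
    return out

subkeys = list(range(1, 17))                 # 16 dummy 48-bit subkeys
pt = 0x0123456789ABCDEF
ct = feistel16(pt, subkeys)
assert feistel16(ct, subkeys, decrypt=True) == pt
```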
3. DES Chip Implementation
This section describes how we implemented the DES algorithm.

3.1. Organization
There are two ways to improve an algorithm's performance. One can choose a dense but slow technology such as silicon CMOS and increase performance by parallelizing the algorithm or flattening the logic. Alternatively, one can choose a fast but low-density technology such as silicon ECL or GaAs DCFL. The DES algorithm imposes limits on the former approach. The CBC mode of operation combines the result obtained by encrypting a block with the next input block. Since the result has to be available before the next block can be processed, it is impossible to parallelize the algorithm and operate on more than one block at a time. It is, however, possible to unroll the 16 rounds of Fig. 1 and implement all 16 iterations in sequence. Flattening the design in this manner saves the time needed to latch the intermediate results in a register on every iteration. Even though the density of CMOS chips is sufficient for doing this, the speed requirements of a 1 GBit/s CMOS implementation might still be challenging. Since we wanted to use GaAs technology, we had to choose a different approach. The limited density of GaAs gate arrays forced us to implement only one of the 16 rounds and reuse it for all 16 iterations. Even without unrolling the 16 rounds, fitting the implementation into the available space and meeting the speed requirements was a major challenge. In order to achieve a data rate of 1 GBit/s, each block has to be processed in 64 ns, which corresponds to 4 ns per iteration or a clock rate of 250 MHz. The register-level block diagrams for encryption and decryption are shown in Figures 5 and 6. The DES chip realizes a rigid 3-stage pipeline; that is, a block is first written into the input register I, is then moved into register LR, where it undergoes the 16 iterations of the cipher function f, and is finally written into the output register O.

3.2. Implementation Characteristics
The implementation of the DES chip contains 480 flip-flops, 2580 gates, and 8 PLAs. There are up to ten logic levels that have to be passed during the 4 ns clock period. The chip uses 84% of the transistors available in the VSC15K gate array. The high utilization is the result of a fully manual placement. Timing constraints further forced us to lay out signal wires partially by hand. The chip's interface is completely asynchronous. The data ports are 8, 16, or 32 bits wide. A separate 7-bit wide port is available for loading the master key. Of the 211 available pins, 144 are used for signals and 45 are used for power and ground. With the exception of the 250 MHz clock, which is ECL compatible, all input and output signals are TTL compatible. The chip requires power supply voltages of -2 V for the GaAs logic and 5 V for the TTL-compatible output drivers. The maximum power consumption is 8 W.

3.3. Asynchronous Interface
Asynchronous ports are provided in order to avoid synchronization with the 250 MHz clock. The data input and output registers are controlled by two-way handshake signals which determine when the registers can be written or read. The data ports are 8, 16, or 32 bits wide. The variable width allows for reducing the width of the external data path at lower operating speeds. With the 32-bit wide port, a new data word must be loaded every 32 ns in order to achieve an encryption rate of 1 GBit/s. The master key register is loaded through a separate, also fully asynchronous, 7-bit wide port. Our implementation does not check the byte parity bits included in the 64-bit key. The low speed of the data and key ports makes it possible to use TTL levels for all signals except for the 250 MHz clock, which is a differential ECL-compatible signal. Thanks to the fully asynchronous chip interface, the chip manufacturer was able to do at-speed testing even without being able to supply test vectors at full speed.
For this purpose, the 250 MHz clock was generated by a separate generator, while the test vectors were supplied asynchronously by a tester running at only 40 MHz. At-speed testing was essential, particularly in testing the precharged logic, which will be described in the following section.

4. Applications
Our implementation of the DES algorithm is tailored for high-speed network applications. This requires not only encryption hardware operating at link speed but also support for low-latency controllers. Operating at link data rates of 1 GBit/s requires a completely pipelined controller structure. Low latency can be achieved by buffering data in the controller as little as possible and by avoiding protocol processing in the controller. In this respect, the main features of the DES chip are a pipelined flow-through design and an efficient key exchange mechanism. As described in the previous section, the chip is implemented as a rigid 3-stage pipeline with separate input and output ports. Each 64-bit data block is entered into the pipeline together with a command word. While the data block flows through the pipeline, the accompanying command instructs the pipeline stages which operations to apply to the data block. On a block-by-block basis it is possible to enable or disable encryption, to choose ECB or CBC mode, and to select the master key in MK or the key in CD. None of these commands causes the pipeline to stall. It is further possible to instruct the pipeline to load a block from the output register O into register CD. Typical usage of this feature is as follows: a data block is decrypted with the master key, is loaded into CD, and is then used for encrypting or decrypting subsequent data blocks. This operation requires a one-cycle delay slot; that is, the new key in CD cannot be applied to the data block immediately following.

5. Conclusions
We began designing the DES chip in early 1989 and received the first prototypes at the beginning of 1991. The parts were logically functional but exhibited electrical problems and failed at high temperature. A minor design change fixed this problem. In the fall of 1991, we received 25 fully functional parts that we plan to use in future high-speed network controllers. With an encryption rate of 1 GBit/s, the design presented in this paper is the fastest DES implementation reported to date. Both ECB and CBC modes of operation are supported at full speed. This data rate is based on a worst-case timing analysis and a clock frequency of 250 MHz. The fastest chips we tested run at 350 MHz, or 1.4 GBit/s. We have shown that a high-speed implementation of the DES algorithm is possible even with the limited flexibility of a semi-custom design. An efficient implementation of the S-boxes offering both high performance and high density has been achieved with a novel approach to designing PLA structures in GaAs. An unconventional floorplan has been presented that eliminates long wires caused by permuted data bits in the critical path. The architecture of the DES chip makes it possible to build very low-latency network controllers. A pipelined design together with separate, fully asynchronous input and output ports allows for easy integration into controllers with a flow-through architecture. ECL levels are required only for the 250 MHz clock; TTL levels are used for all the data and control pins, thus providing a cost-effective interface even at data rates of 1 GBit/s.
The provision of a data path for loading the key from the data stream allows for selecting the encryption or decryption key on the fly. These features make it possible to use encryption hardware for network applications with very little overhead.

References
1. Advanced Micro Devices: AmZ8068/Am9518 Data Ciphering Processor. Datasheet, July 1984.
2. National Bureau of Standards: Data Encryption Standard. Federal Information Processing Standards Publication.
3. National Bureau of Standards: DES Modes of Operation. Federal Information Processing Standards Publication FIPS PUB 81, December 1980.
4. Diffie, W., Hellman, M.: Exhaustive cryptanalysis of the NBS Data Encryption Standard. Computer, vol. 10, no. 6, June 1977, pp. 74-84.
5. McCluskey, E.: Logic Design Principles. Prentice-Hall, 1986.
6. National Bureau of Standards: Guidelines for Implementing and Using the NBS Data Encryption Standard. Federal Information Processing Standards Publication FIPS PUB 74, April 1981.
7. VLSI Technology: VM007 Data Encryption Processor. Datasheet, October 1991.
8. Brassard, G.: Modern Cryptology. Lecture Notes in Computer Science, no. 325, Springer-Verlag, 1988.
English Course for Electronic and Information Engineering Majors (5th Edition) — Question Bank

Question Bank for the English Course for Electronic and Information Engineering Majors (5th Edition)

Contents: Section A Term Translation; Section B Paragraph Translation; Section C Reading Comprehension Materials (C.1 History of Tablets; C.2 A Brief History of Satellite Communication; C.3 Smartphones; C.4 Analog, Digital and HDTV; C.5 SoC)

Section A Term Translation

Section B Paragraph Translation

Section C Reading Comprehension Materials

C.1 History of Tablets

The idea of the tablet computer isn't new. Back in 1968, a computer scientist named Alan Kay proposed that with advances in flat-panel display technology, user interfaces, miniaturization of computer components and some experimental work in WiFi technology, you could develop an all-in-one computing device. He developed the idea further, suggesting that such a device would be perfect as an educational tool for schoolchildren. In 1972, he published a paper about the device and called it the Dynabook.

The sketches of the Dynabook show a device very similar to the tablet computers we have today, with a couple of exceptions. The Dynabook had both a screen and a keyboard all on the same plane. But Kay's vision went even further. He predicted that with the right touch-screen technology, you could do away with the physical keyboard and display a virtual keyboard in any configuration on the screen itself.

Kay was ahead of his time. It would take nearly four decades before a tablet similar to the one he imagined took the public by storm. But that doesn't mean there were no tablet computers on the market between the Dynabook concept and Apple's famed iPad.

One early tablet was the GRiDPad. First produced in 1989, the GRiDPad included a monochromatic capacitance touch screen and a wired stylus. It weighed just under 5 pounds (2.26 kilograms). Compared to today's tablets, the GRiDPad was bulky and heavy, with a short battery life of only three hours. The man behind the GRiDPad was Jeff Hawkins, who later founded Palm.

Other pen-based tablet computers followed, but none received much support from the public. Apple first entered the tablet battlefield with the Newton, a device that's received equal amounts of love and ridicule over the years. Much of the criticism for the Newton focuses on its handwriting-recognition software.

It really wasn't until Steve Jobs revealed the first iPad to an eager crowd that tablet computers became a viable consumer product. Today, companies like Apple, Google, Microsoft and HP are trying to predict consumer needs while designing the next generation of tablet devices.

C.2 A Brief History of Satellite Communication

In an article in Wireless World in 1945, Arthur C. Clarke proposed the idea of placing satellites in geostationary orbit around Earth such that three equally spaced satellites could provide worldwide coverage. However, it was not until 1957 that the Soviet Union launched the first satellite, Sputnik 1, which was followed in early 1958 by the U.S. Army's Explorer 1. Both Sputnik and Explorer transmitted telemetry information.

The first communications satellite, the Signal Communicating Orbit Repeater Experiment (SCORE), was launched in 1958 by the U.S. Air Force. SCORE was a delayed-repeater satellite, which received signals from Earth at 150 MHz and stored them on tape for later retransmission. A further experimental communication satellite, Echo 1, was launched on August 12, 1960 and placed into inclined orbit at about 1500 km above Earth. Echo 1 was an aluminized plastic balloon with a diameter of 30 m and a weight of 75.3 kg. Echo 1 successfully demonstrated the first two-way voice communications by satellite.

On October 4, 1960, the U.S.
Department of Defense launched Courier into an elliptical orbit between 956 and 1240 km, with a period of 107 min. Although Courier lasted only 17 days, it was used for real-time voice, data, and facsimile transmission. The satellite also had five tape recorders onboard; four were used for delayed repetition of digital information, and the other for delayed repetition of analog messages.

Direct-repeated satellite transmission began with the launch of Telstar I on July 10, 1962. Telstar I was an 87-cm, 80-kg sphere placed in low-Earth orbit between 960 and 6140 km, with an orbital period of 158 min. Telstar I was the first satellite to be able to transmit and receive simultaneously and was used for experimental telephone, image, and television transmission. However, on February 21, 1963, Telstar I suffered damage caused by the newly discovered Van Allen belts.

Telstar II was made more radiation resistant and was launched on May 7, 1963. Telstar II was a straight repeater with a 6.5-GHz uplink and a 4.1-GHz downlink. The satellite power amplifier used a specially developed 2-W traveling wave tube. Along with its other capabilities, the broadband amplifier was able to relay color TV transmissions. The first successful trans-Atlantic transmission of video was accomplished with Telstar II, which also incorporated radiation measurements and experiments that exposed semiconductor components to space radiation.

The first satellites placed in geostationary orbit were the synchronous communication (SYNCOM) satellites launched by NASA in 1963. SYNCOM I failed on injection into orbit. However, SYNCOM II was successfully launched on July 26, 1964 and provided telephone, teletype, and facsimile transmission. SYNCOM III was launched on August 19, 1964 and transmitted TV pictures from the Tokyo Olympics. The International Telecommunications by Satellite (INTELSAT) consortium was founded in July 1964 with the charter to design, construct, establish, and maintain the operation of a global commercial communications system on a nondiscriminatory basis. The INTELSAT network started with the launch on April 6, 1965, of INTELSAT I, also called Early Bird. On June 28, 1965, INTELSAT I began providing 240 commercial international telephone channels as well as TV transmission between the United States and Europe.

In 1979, INMARSAT established a third global system. In 1995, the INMARSAT name was changed to the International Mobile Satellite Organization to reflect the fact that the organization had evolved to become the only provider of global mobile satellite communications at sea, in the air, and on the land.

Early telecommunication satellites were mainly used for long-distance continental and intercontinental broadband, narrowband, and TV transmission. With the advent of broadband optical fiber transmission, satellite services shifted focus to TV distribution, and to point-to-multipoint and very small aperture terminal (VSAT) applications. Satellite transmission is currently undergoing further significant growth with the introduction of mobile satellite systems for personal communications and fixed satellite systems for broadband data transmission.

C.3 Smartphones

Think of a daily task, any daily task, and it's likely there's a specialized, pocket-sized device designed to help you accomplish it. You can get a separate, tiny and powerful machine to make phone calls, keep your calendar and address book, entertain you, play your music, give directions, take pictures, check your e-mail, and do countless other things.
But how many pockets do you have? Handheld devices become as clunky as a room-sized supercomputer when you have to carry four of them around with you every day.

A smartphone is one device that can take care of all of your handheld computing and communication needs in a single, small package. It's not so much a distinct class of products as it is a different set of standards for cell phones to live up to.

Unlike many traditional cell phones, smartphones allow individual users to install, configure and run applications of their choosing. A smartphone offers the ability to conform the device to your particular way of doing things. Most standard cell-phone software offers only limited choices for re-configuration, forcing you to adapt to the way it's set up. On a standard phone, whether or not you like the built-in calendar application, you are stuck with it except for a few minor tweaks. If that phone were a smartphone, you could install any compatible calendar application you like.

Here's a list of some of the things smartphones can do:
• Send and receive mobile phone calls
• Personal Information Management (PIM), including notes, calendar and to-do list
• Communication with laptop or desktop computers
• Data synchronization with applications like Microsoft Outlook
• E-mail
• Instant messaging
• Applications such as word processing programs or video games
• Play audio and video files in some standard formats

C.4 Analog, Digital and HDTV

For years, watching TV has involved analog signals and cathode ray tube (CRT) sets. The signal is made of continually varying radio waves that the TV translates into a picture and sound. An analog signal can reach a person's TV over the air, through a cable or via satellite. Digital signals, like the ones from DVD players, are converted to analog when played on traditional TVs.

This system has worked pretty well for a long time, but it has some limitations:
• Conventional CRT sets display around 480 visible lines of pixels. Broadcasters have been sending signals that work well with this resolution for years, and they can't fit enough resolution to fill a huge television into the analog signal.
• Analog pictures are interlaced - a CRT's electron gun paints only half the lines for each pass down the screen. On some TVs, interlacing makes the picture flicker.
• Converting video to analog format lowers its quality.

United States broadcasting is currently changing to digital television (DTV). A digital signal transmits the information for video and sound as ones and zeros instead of as a wave. For over-the-air broadcasting, DTV will generally use the UHF portion of the radio spectrum with a 6 MHz bandwidth, just like analog TV signals do.

DTV has several advantages:
• The picture, even when displayed on a small TV, is better quality.
• A digital signal can support a higher resolution, so the picture will still look good when shown on a larger TV screen.
• The video can be progressive rather than interlaced - the screen shows the entire picture for every frame instead of every other line of pixels.
• TV stations can broadcast several signals using the same bandwidth. This is called multicasting.
• If broadcasters choose to, they can include interactive content or additional information with the DTV signal.
• It can support high-definition (HDTV) broadcasts.

DTV also has one really big disadvantage: Analog TVs can't decode and display digital signals.
When analog broadcasting ends, you'll only be able to watch TV on your trusty old set if you have cable or satellite service transmitting analog signals or if you have a set-top digital converter.

C.5 SoC

The semiconductor industry has continued to make impressive improvements in the achievable density of very large-scale integrated (VLSI) circuits. In order to keep pace with the levels of integration available, design engineers have developed new methodologies and techniques to manage the increased complexity inherent in these large chips. One such emerging methodology is system-on-chip (SoC) design, wherein pre-designed and pre-verified blocks, often called intellectual property (IP) blocks, IP cores, or virtual components, are obtained from internal sources or third parties and combined on a single chip.

These reusable IP cores may include embedded processors, memory blocks, interface blocks, analog blocks, and components that handle application-specific processing functions. Corresponding software components are also provided in a reusable form and may include real-time operating systems and kernels, library functions, and device drivers.

Large productivity gains can be achieved using this SoC/IP approach. In fact, rather than implementing each of these components separately, the role of the SoC designer is to integrate them onto a chip to implement complex functions in a relatively short amount of time.

The integration process involves connecting the IP blocks to the communication network, implementing design-for-test (DFT) techniques and using methodologies to verify and validate the overall system-level design. Even larger productivity gains are possible if the system is architected as a platform in such a way that derivative designs can be generated quickly.

In the past, the concept of SoC simply implied higher and higher levels of integration. That is, it was viewed as migrating a multichip system-on-board (SoB) to a single chip containing digital logic, memory, analog/mixed signal, and RF blocks. The primary drivers for this direction were the reduction of power, smaller form factor, and lower overall cost. It is important to recognize that integrating more and more functionality on a chip has always existed as a trend by virtue of Moore's Law, which predicts that the number of transistors on a chip will double every 18-24 months. The challenge is to increase designer productivity to keep pace with Moore's Law. Therefore, today's notion of SoC is defined in terms of overall productivity gains through reusable design and integration of components.
An Implementation of Pipelined Parallel Processing System for Multi-Access Memory System

Hyung Lee (1), Hyeon-Koo Cho (2), Dae-Sang You (1), and Jong-Won Park (1)

(1) Department of Information & Communications Engineering, Chungnam National University, 220 Gung-Dong Yusung-Gu, Daejeon, 305-764, KOREA
Tel.: +82-42-821-7793, Fax: +82-42-825-7792
(2) Virtual I Tech. Inc., Room 503, Engineering Building 3, 220 Gung-Dong Yusung-Gu, Daejeon, 305-764, KOREA
e-mail: {hyung, dsyou, hkcho, jwpark}@u.ac.kr

Abstract: We have been developing a variety of parallel processing systems in order to improve the processing speed of visual media applications. These systems use a multi-access memory system (MAMS) as a parallel memory system, which provides simultaneous access to the image points of a line segment at an arbitrary angle, a capability required by many low-level image processing operations such as edge or line detection in a particular direction. However, these systems did not deliver the expected speed, because the MAMS and the processing elements operate asynchronously. To improve the processing speed of these systems, we have investigated a pipelined parallel processing system using MAMS. Although the system, like the earlier systems, is of the single-instruction multiple-data (SIMD) type, it ran about 2.5 times faster.

1. Introduction

There already exists a variety of machines capable of performing high-speed image applications. In general, these machines can be divided into two classes [1]. The first basically comprises two-dimensional (2-D) array processors that operate on an entire image or subimage in a set of parallel processes. Examples of this type of machine are CLIP, MPP, and PIXIE-5000. In general, all of these machines can also be considered to be of the single-instruction multiple-data (SIMD) type. The main drawback of 2-D array processors is their cost. In addition, due to the inherent serial nature of the input-image data, full utilization of the processors may not be attained.

The second class of machines, local-window processors, scans an image and performs operations on a small neighborhood window. Examples of such machines include MITE, PIPE, and Cytocomputer. Note that, with this type of processor, an increase in image size requires a quadratic increase in processor speed in order to maintain a constant processing speed.

Most of the above local-window processors are general purpose in nature, in that they are programmable. Although these general-purpose cellular machines are flexible because of their programmability, they do not provide simultaneous access to the image points of a line segment at an arbitrary angle, which is required in many low-level image processing operations such as edge or line detection in a particular direction. We have been developing a parallel processing system of the pipelined SIMD type built around a multi-access memory system (MAMS) that provides exactly this kind of simultaneous access to image points.

In this paper, we propose a 5-stage pipelined parallel processing system involving MAMS, which simultaneously accesses data elements in three access types with a constant interval. Each processing element (PE) is designed with two execution states, one for memory-access instructions and one for general instructions.
Although the two states operate in parallel, the cycle time of the general instructions depends on that of the memory-access instructions, because memory-access instructions must access data through the MAMS.

The remainder of this paper is organized as follows. Section 2 introduces the multi-access memory system that was redesigned for the proposed system, and the proposed pipelined parallel processing system is described in Section 3. Section 4 presents experimental results obtained from simulations. Finally, we conclude the paper in Section 5, followed by the references.

2. Multi-Access Memory System

For a parallel processing system with PEs, it is necessary to use an MAMS [2,3] to reduce the memory access time. The memory system also has the important goal of providing efficient utilization of the n (= pq) PEs of the proposed pipelined parallel processing system, where p and q are design parameters. The goals are as follows: various access types with a constant interval between the data elements, simultaneous access with no restriction on the location, simple and fast address calculation and routing circuitry, and a small number of memory modules.

The memory system consists of memory module selection circuitry, data routing circuitry for WRITE, address calculation and routing circuitry, the memory modules, and data routing circuitry for READ. In order to distribute the data elements of the M×N array I(*,*) among m (= pq + 1) memory modules, a memory module assignment function must place array elements that are to be accessed simultaneously in distinct memory modules. In addition, an address assignment function must allocate different addresses to array elements assigned to the same memory module.

The redesigned MAMS is implemented as a three-stage pipeline, because the parallel processing system introduced in Section 3 is designed as a pipelined architecture. For sequential memory operations, therefore, the memory access time is reduced compared with that of the original MAMS. The block diagram of the multi-access memory system is presented in Figure 1.

Figure 1. The Multi-Access Memory System.
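To make the module and address assignment functions above more concrete, the following C sketch shows a simple skewed-storage mapping. It is an illustrative assumption only, not the actual MAMS functions of [2,3]: the skewing function module(i, j) = (i + j) mod m, the array sizes, and the choice p = 1, q = 8 (so m = pq + 1 = 9) are all hypothetical, and the three access types are here assumed to be a horizontal, vertical, and forward-diagonal segment. The sketch merely demonstrates the stated requirement that the q points of such a segment fall into q distinct memory modules, each with its own in-module address.

```c
/* Minimal skewed-storage sketch (assumed, not the actual MAMS mapping). */
#include <stdio.h>

#define M_ROWS 512   /* M: image height (assumed)                    */
#define N_COLS 512   /* N: image width (assumed)                     */
#define Q      8     /* q: number of points accessed simultaneously  */
#define MODS   9     /* m = pq + 1 with p = 1, q = 8 (odd)           */

static int module_of(int i, int j) { return (i + j) % MODS; }

static int address_of(int i, int j)
{
    /* Elements that share a module get different in-module addresses:
     * each row is laid out in chunks of ceil(N/m) words.             */
    return i * ((N_COLS + MODS - 1) / MODS) + j / MODS;
}

/* Check that a q-point segment starting at (i, j) with per-step
 * increments (di, dj) touches q distinct memory modules.             */
static int conflict_free(int i, int j, int di, int dj)
{
    int used[MODS] = {0};
    for (int k = 0; k < Q; k++) {
        int mo = module_of(i + k * di, j + k * dj);
        if (used[mo]) return 0;
        used[mo] = 1;
    }
    return 1;
}

int main(void)
{
    /* Horizontal, vertical, and forward-diagonal segments at (100, 37). */
    printf("row      segment conflict-free: %d\n", conflict_free(100, 37, 0, 1));
    printf("column   segment conflict-free: %d\n", conflict_free(100, 37, 1, 0));
    printf("diagonal segment conflict-free: %d\n", conflict_free(100, 37, 1, 1));
    printf("module/address of (100, 37): %d / %d\n",
           module_of(100, 37), address_of(100, 37));
    return 0;
}
```

With m odd and larger than q, this simple linear skewing keeps all three assumed segment types conflict-free; the MAMS described above additionally has to support constant intervals other than one while keeping the address calculation and routing circuitry small, which is what the dedicated assignment functions of [2,3] provide.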
3. Pipelined Parallel Processing System

To perform high-speed visual applications, we had developed a variety of parallel processing systems using MAMS. However, these systems did not deliver the expected speed, because the two modules, namely the MAMS module and the modules containing the processing elements, operate asynchronously [4,5]. That is, when a memory instruction was followed by a general instruction in the earlier parallel processing systems, the general instruction had to wait until the previous memory instruction had completed. Although these systems efficiently provide simultaneous access to data in a logical 2-D memory array, their memory access time is longer than that of a general memory system without such parallelism, because it includes the time spent going through the MAMS. This was the main drawback that kept the processing speed at the limit predicted by Amdahl's law.

To improve the processing speed of these systems, we have investigated a pipelined parallel processing system together with MAMS.

The proposed pipelined parallel processing system consists of: a processing unit made up of a processor module (Motorola MPC860P) for global processing and for controlling all devices, and a PCI controller (PLX9056) for transferring data to/from the host computer; a local memory that stores instructions and common data to support programmability; a DMA controller with a set of registers, which fetches instructions from the local memory and issues them to the n PEs; the PEs, which interpret and execute instructions synchronously; a multi-access memory controller (MAMC), which provides n data elements to the n PEs simultaneously; and m external memory modules, which store the n data elements to be manipulated.

The processing unit (PU) controls the system and communicates with the host computer via the PCI bus. The DMA controller fetches an instruction and stores it in a register pool; it also synchronously transfers data or instructions to the n PEs and controls them. Each PE is designed as an ALU capable of interpreting an issued instruction. To run an application, the DMA controller steals bus cycles until the end of the application.

A PE can execute two kinds of instructions: memory-reference instructions, which access the m external memory modules via the MAMC, and 16 general instructions, including register-reference and I/O instructions. An application to be processed on the system is therefore compiled into operation codes drawn from 18 instructions. When two instructions belong to different instruction sets, they are executed at the same time; that is, one memory-reference instruction and one general instruction are executed simultaneously. Hence, a memory-reference instruction followed by a general instruction executes within a single memory-access cycle, and vice versa. This reduces the processing time of some kernels, for example the convolution mask operators frequently used in the spatial domain.

The system provides system programmers with a logically two-dimensional addressing scheme, as used in the (r,c)-based image domain, because the MAMS removes the semantic gap. With this feature, most spatial-domain image processing can be performed with sufficient parallel processing power. The block diagram of the system is presented in Figure 2.

Figure 2. The Pipelined Parallel Processing System.

4. Experiments

To verify the performance of the proposed system, we chose a convolution mask operator, which is used frequently in image processing but is also one of its more time-consuming tasks. We transformed the code of the operator into code adapted to the proposed system in order to obtain a parallel version of the operation. The code for the operation is presented in Code 1. Notice that two instructions, one a memory-reference instruction and the other a general instruction, are executed simultaneously.

Code 1.
  Read (1,0,0)    ValTran AC 0
  Read (1,1,0)    Add rd_reg
  Read (1,2,0)    Add rd_reg
  ...
  NOP             Add rd_reg
  NOP             Div rd_reg
  Write (1,1,1)   NOP
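For reference, the sequential operation behind Code 1 is a neighborhood mask applied in the spatial domain; judging from its Read/Add/Div/Write pattern, a plain averaging (mean) mask is a plausible reading, though the actual mask used by the authors may differ. The following C sketch of such a sequential 3x3 mean filter is an illustrative assumption, not code taken from the system; it is shown only to indicate the per-pixel work that the proposed system overlaps with memory accesses.

```c
/* Illustrative sequential 3x3 averaging mask (assumed interpretation of Code 1). */
#include <stdio.h>

static void mean3x3(const unsigned char *src, unsigned char *dst, int rows, int cols)
{
    for (int r = 1; r < rows - 1; r++) {
        for (int c = 1; c < cols - 1; c++) {
            int sum = 0;
            /* Read the nine neighbours and accumulate (the Read/Add pairs). */
            for (int dr = -1; dr <= 1; dr++)
                for (int dc = -1; dc <= 1; dc++)
                    sum += src[(r + dr) * cols + (c + dc)];
            /* Divide by the mask size and write the result back (Div/Write). */
            dst[r * cols + c] = (unsigned char)(sum / 9);
        }
    }
}

int main(void)
{
    unsigned char src[5 * 5], dst[5 * 5] = {0};
    for (int i = 0; i < 25; i++)
        src[i] = (unsigned char)(i * 10);   /* small synthetic test image */
    mean3x3(src, dst, 5, 5);
    printf("dst[2][2] = %d\n", dst[2 * 5 + 2]);   /* centre output pixel */
    return 0;
}
```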
The dedicated parallel processing system was described in Verilog-HDL and simulated with the CADENCE Verilog-XL hardware simulation package in order to verify its functionality. The system was then compiled and fitted into an EPF10K200EGC599-1 device using MAX+PLUS II. The final simulation was performed after obtaining the delay files and performing back-annotation. The waveform generated during the simulation is illustrated in Figure 3.

Figure 3. A waveform obtained through post-layout simulation.

The waveform shows that the fetching of an instruction, a memory-access instruction, and an arithmetic-logic instruction were overlapped. It also shows that the execution time of an application running on the proposed system depends on the memory access time. In the simulation results, the proposed system ran about 2.5 times faster than the earlier systems [4,5].

Unfortunately, we cannot present real measured values, because the manufactured circuit board has not yet been verified. This work is in progress, and the test environment is depicted in Figure 4.

Figure 4. The test board used to verify the system.

5. Conclusions

The demands for processing multimedia data in real time on a unified and scalable architecture are ever increasing with the proliferation of multimedia applications. We had been developing a variety of parallel processing systems in order to improve the processing speed of visual media applications. These systems used MAMS as a parallel memory system, which provides simultaneous access to the image points of a line segment at an arbitrary angle. However, these systems did not deliver the expected speed because the MAMS and the processing elements operate asynchronously.

To improve the processing speed of these systems, we have investigated a pipelined parallel processing system using MAMS. Although the system, like the earlier systems, is of the SIMD type, it ran about 2.5 times faster. Although these comparison values were obtained through simulations and speedups were achieved for each application run on the system, they are only estimates, because the proposed system has not yet been verified on the manufactured circuit board.

Unfortunately, some problems still occur in transferring large numbers of data elements from the host to the system, and vice versa, while an application is being processed on the system. That is, the time spent transferring data exceeds the processing time. To solve this, the bus bandwidth needs to be improved on the system side and, on the method side, new methods specific to the system need to be developed.

References

[1] Alexander C. P. Loui et al., "Flexible Architecture for Morphological Image Processing and Analysis," IEEE Transactions on Circuits and Systems for Video Technology, vol. 2, no. 1, Mar. 1992.
[2] J. W. Park, "An Efficient Memory System for Image Processing," IEEE Transactions on Computers, vol. C-35, no. 7, pp. 33-39, 1986.
[3] J. W. Park and D. T. Harper III, "An Efficient Memory System for SIMD Construction of a Gaussian Pyramid," IEEE Transactions on Parallel and Distributed Systems, vol. 7, no. 7, July 1996.
[4] Hyung Lee, K. A. Moon, and J. W. Park, "Design of Parallel Processing System for Facial Image Retrieval," 4th International ACPC'99, Salzburg, Austria, Feb. 1999.
[5] Hyung Lee and J. W. Park, "A Study on Parallel Processing System for Automatic Segmentation of Moving Object in Image Sequences," ITS-CSCC 2000, vol. 2, pp. 429-432, July 2000.