An Introduction to GPU Architecture and the Graphics Pipeline



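A Brief Introduction to the GPU

I. Definition and principles of GPGPU

GPU stands for Graphics Processing Unit.

GPU computing, or GPGPU, is the use of a graphics processing unit (GPU) for general-purpose scientific and engineering computing.

GPUs are specialized for problems that can be expressed as data-parallel computations: the same program is executed on many data elements in parallel, with very high arithmetic intensity (the ratio of arithmetic operations to memory operations).

The GPU computing model uses a CPU and a GPU together in a heterogeneous co-processing model.

The sequential part of the application runs on the CPU, while the computationally intensive part is accelerated by the GPU.

From the user's perspective, the application simply runs faster, because it draws on the GPU's high throughput.

In the past few years, GPU floating-point performance has risen to the teraflop level.

Much of the success of GPGPU is owed to how easy the associated CUDA parallel programming model is to program.

In this programming model, the application developer modifies the application to identify its compute-intensive kernels and maps them to the GPU, which then processes them.

The rest of the application remains on the CPU.

Mapping a function to the GPU requires the developer to rewrite it so that it exposes its parallelism, and to add "C" keywords that move data to and from the GPU.

The developer's task is to launch tens of thousands of threads simultaneously.

The GPU hardware manages the threads and schedules them.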
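
The division of labour described above (the developer writes a per-element kernel and asks for many threads, while the hardware schedules them) can be sketched in plain Python. This is only an emulation under stated assumptions: `saxpy_kernel` and `launch` are hypothetical names chosen for illustration, not part of CUDA or any GPU API, and the loop runs sequentially where a real GPU would run the invocations concurrently.

```python
# Hypothetical emulation of a data-parallel kernel launch; this is not real CUDA.

def saxpy_kernel(thread_id, a, x, y, out):
    """Work done by one logical thread: it handles exactly one element."""
    if thread_id < len(x):  # guard, because more threads than elements may be launched
        out[thread_id] = a * x[thread_id] + y[thread_id]


def launch(kernel, num_threads, *args):
    """Stand-in for the GPU scheduler: invoke the kernel once per thread index.

    On a real GPU these invocations would be scheduled in hardware and run
    concurrently; here they simply run one after another.
    """
    for thread_id in range(num_threads):
        kernel(thread_id, *args)


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)

# Launch 8 logical threads for 4 elements; the guard makes the extra threads idle.
launch(saxpy_kernel, 8, 2.0, x, y, out)
print(out)
```

The point of the sketch is that the kernel never loops over the data itself. Each invocation only knows its own index, which is what lets the hardware spread the invocations across many execution units.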

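The NVIDIA Tesla 20-series GPUs are based on the "Fermi" architecture, which was the latest NVIDIA CUDA architecture at the time of writing.

Fermi is optimized for scientific applications and has a number of key features: more than 500 gigaflops of IEEE-standard double-precision floating-point hardware, L1 and L2 caches, ECC memory error protection, local user-managed data caches in the form of shared memory distributed throughout the GPU, coalesced memory accesses, and more.

GPUs have matured to the point where they readily run real-world applications, and run them far faster than multi-core systems do.

Future computing architectures will be hybrid systems in which parallel-core GPUs work in tandem with multi-core CPUs.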

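A Detailed Look at GPU Structure and How It Works

GPU stands for Graphics Processing Unit. Its main job is to perform the computations needed to draw computer graphics, including vertex setup, lighting and shading, and pixel operations.

A GPU is in effect a collection of graphics functions implemented in hardware, used mainly for coordinate transformation and lighting as objects move in 3D games.

Long ago, this work was done by the CPU together with dedicated software. As images grew more complex, doing it on the CPU alone put a load on the CPU far beyond what it could normally sustain. A dedicated component was needed to carry the burden of graphics processing, and that is when the GPU was born.

According to the GPU block diagram, a typical GPU consists mainly of general-purpose compute units, controllers and registers. Do these modules look much like the internals of a CPU?
The two do have much in common internally, but because the GPU has a highly parallel structure, it processes graphics data and complex algorithms more efficiently than a CPU.

The figure above shows the structural differences between a GPU and a CPU: most of a CPU's area is devoted to control logic and registers, whereas a GPU devotes more of its area to ALUs (arithmetic logic units) for data processing rather than to data caches and flow control.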

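GPU

A Brief Introduction to the GPU and Its Basic Working Principle

GPU stands for Graphics Processing Unit.

The GPU is a concept defined relative to the CPU: because graphics processing has become increasingly important in modern computers (especially home systems and for gaming enthusiasts), a dedicated graphics processor is needed.

The role of the GPU

The GPU is the "brain" of a graphics card. It determines the card's class and most of its performance, and it is also what distinguishes 2D graphics cards from 3D graphics cards.

A 2D display chip relies mainly on the CPU when processing 3D images and effects, which is known as "software acceleration".

A 3D display chip concentrates 3D image and effects processing in the display chip itself, which is the so-called "hardware acceleration".

The display chip is usually the largest chip on the graphics card (and the one with the most pins).

Most graphics cards on the market today use graphics chips from one of two companies, NVIDIA and AMD-ATI.

Today the GPU is no longer limited to 3D graphics. General-purpose GPU computing has attracted considerable industry attention, and in practice GPUs can deliver tens or even hundreds of times the performance of a CPU on some workloads, such as floating-point and parallel computation. A rising star this strong was bound to make Intel, the leading CPU vendor, nervous about the future, and NVIDIA and Intel have often sparred publicly over whether the CPU or the GPU matters more.

Current standards for general-purpose GPU computing include OpenCL, CUDA and ATI Stream.

Of these, OpenCL (Open Computing Language) is the first open, royalty-free standard for general-purpose parallel programming of heterogeneous systems. It is also a unified programming environment that lets software developers write efficient, portable code for high-performance computing servers, desktop systems and handheld devices, and it applies broadly to multi-core CPUs, GPUs, Cell-type architectures and other parallel processors such as digital signal processors (DSPs). It has broad prospects in gaming, entertainment, scientific research, medicine and other fields, and current products from both AMD-ATI and NVIDIA support OpenCL.

ATI was founded on August 20, 1985. In October of that year it used ASIC technology to develop its first graphics chip and graphics card. In April 1992 it released the Mach32 graphics card with integrated graphics acceleration, and in April 1998 IDC named ATI the market leader of the graphics chip industry. At that time such chips were not yet called GPUs. For a long time ATI called its graphics processors VPUs, and only after AMD acquired ATI did its graphics chips formally adopt the name GPU.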

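Understanding the Different Models of Computer Graphics Processors

The graphics processing unit (GPU) is an important computer component for processing graphics and images.

It accelerates image and video processing, renders complex 3D graphics and performs high-performance computing.

As technology has advanced, many different GPU models have come to market. This article introduces several common ones to help readers understand and choose among them.

I. NVIDIA GeForce series

NVIDIA GeForce is one of the best-known and most widely used GPU families on the market.

It is known for strong performance and reliability and suits a range of uses, including gaming, design and scientific computing.

The GeForce family is divided by performance and features into several lines, such as GeForce RTX, GeForce GTX and GeForce MX.

1. GeForce RTX series

GeForce RTX is NVIDIA's flagship line, with strong real-time ray tracing and AI compute capabilities.

Introduced with the Turing architecture and hardware ray tracing, it delivers realistic graphics and higher rendering performance.

It suits gamers and professional designers with demanding graphics requirements.

2. GeForce GTX series

GeForce GTX is representative of mid-range to high-end gaming graphics cards. It is stable and reliable and suits most games and design applications.

GTX cards offer strong processing power and frame rates, giving players a smooth gaming experience.

The GTX line also supports virtual reality (VR), letting users immerse themselves in virtual game worlds.

3. GeForce MX series

GeForce MX is an entry-level GPU line designed for thin-and-light laptops.

It combines low power consumption with good efficiency, provides basic graphics capability and meets everyday needs.

The MX line suits office work, web browsing and light graphics use, and is a cost-effective choice.

II. AMD Radeon series

Radeon is the GPU family from AMD, the other well-known GPU vendor.

It competes fiercely with NVIDIA GeForce and is popular with consumers.

Radeon likewise offers a wide range of models and lines for different uses.

1. Radeon RX series

Radeon RX is AMD's high-performance GPU line, aimed mainly at gaming and virtual reality.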


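Understanding the Different Types of Computer Graphics Processors and How to Choose One

The graphics processing unit (GPU) is an important computer component. It handles graphics-related computation and image rendering, and matters especially to users in gaming, graphic design and video editing.

However, many different types of GPU are on the market, so choosing one that fits your needs is an important question.

This article describes the different types of GPU and discusses how to choose a suitable one.

I. Integrated graphics

Integrated graphics means the GPU is built into the motherboard chipset or the processor. It is common in low-end office computers and portable devices.

Because integrated graphics shares memory and compute resources with the processor, its performance is relatively low.

However, integrated graphics is inexpensive and draws little power, and it is sufficient for ordinary office work and entertainment.

If you are a light computer user, integrated graphics can meet your needs.

II. Dedicated graphics

A dedicated graphics card is a graphics processor with its own video memory and processor. It is common in the computers of gamers and graphic designers.

Dedicated graphics cards are more powerful and can meet the requirements of most games and graphics software.

They render images better and at higher frame rates, providing smooth gameplay and high-quality graphic design work.

If you are a gamer or work in graphic design, a dedicated graphics card is your first choice.

III. Professional GPUs

Professional graphics cards are used mainly in professional graphic design, computer-aided design (CAD), virtual reality and similar fields.

They have high compute and image processing performance and can handle large, complex computations and 3D rendering.

For professionals, a professional graphics card is an essential tool.

However, professional graphics cards are expensive, and generally only users with professional needs should consider buying one.

IV. Choosing a suitable GPU

Choosing a suitable GPU depends on your needs and budget.

Here are some factors to consider when buying a GPU:

1. Budget: First settle your budget range. GPUs at different prices differ in performance and features, so let the budget narrow the choice.

2. Usage: Consider what you will use it for: everyday office work and entertainment, or demanding games and graphic design.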

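Good News for Gamers: A GPU Buying Guide for PC Tech Enthusiasts

As the video game industry has grown, more and more people have become gamers, and tech-obsessed homebodies keep chasing a better gaming experience.

The graphics processing unit (GPU), the core component responsible for image processing in a computer, is crucial for gamers.

This guide offers advice to PC tech enthusiasts choosing a GPU.

I. What is a graphics processing unit (GPU)?

A GPU is a chip dedicated to computer graphics processing, with applications spanning games, animation, video editing and more.

In PC games, the GPU renders 3D scenes and largely determines frame rate and image quality.

When choosing a GPU, gamers should therefore consider their needs and budget.

II. Choosing among GPU vendors

There are currently two major GPU vendors on the market, NVIDIA and AMD.

The key is to choose a GPU that fits your situation.

1. NVIDIA

NVIDIA is one of the leading GPU vendors. Its products perform well and suit demanding games and professional applications.

NVIDIA GPUs are known for stability and compatibility and are widely used in high-end gaming PCs and workstations.

However, NVIDIA products are relatively expensive, so players on a limited budget need to weigh performance against cost.

2. AMD

AMD is the other major GPU vendor. Its products combine high performance with more affordable prices.

For gamers on a limited budget, AMD GPUs are often the better choice.

AMD products are also widely used in home and office computers, and in some areas they outperform NVIDIA's.

Players can therefore choose an AMD GPU that suits their needs and budget.

III. How GPU performance affects the gaming experience

Understanding a GPU's performance specifications is very important for gamers choosing one.

1. Rendering speed

Rendering speed is the GPU's ability to process images and present them on screen.

Higher rendering speed gives smoother visuals and can reduce stutter and screen tearing.

Players chasing high frame rates are therefore wise to choose a GPU with high rendering speed.

2. Video memory capacity

Video memory capacity directly determines how much image data the computer can hold at once while running a game.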


GPUs - Graphics Processing Units

Minh Tri Do Dinh
Minh.Do-Dinh@student.uibk.ac.at
Vertiefungsseminar Architektur von Prozessoren, SS 2008
Institute of Computer Science, University of Innsbruck
July 7, 2008

This paper is meant to provide a closer look at modern Graphics Processing Units. It explores their architecture and underlying design principles, using chips from Nvidia's "Geforce" series as examples.

1 Introduction

Before we dive into the architectural details of some example GPUs, we'll have a look at some basic concepts of graphics processing and 3D graphics, which will make it easier for us to understand the functionality of GPUs.

1.1 What is a GPU?

A GPU (Graphics Processing Unit) is essentially a dedicated hardware device that is responsible for translating data into a 2D image formed by pixels. In this paper, we will focus on 3D graphics, since that is what modern GPUs are mainly designed for.

1.2 The anatomy of a 3D scene

Figure 1: A 3D scene

3D scene: A collection of 3D objects and lights.

Figure 2: Object, triangle and vertices

3D objects: Arbitrary objects whose geometry consists of triangular polygons. Polygons are composed of vertices.

Vertex: A point with spatial coordinates and other information such as color and texture coordinates.

Figure 3: A cube with a checkerboard texture

Texture: An image that is mapped onto the surface of a 3D object, which creates the illusion of an object consisting of a certain material. The vertices of an object store the so-called texture coordinates (2-dimensional vectors) that specify how a texture is mapped onto any given surface.

Figure 4: Texture coordinates of a triangle with a brick texture

In order to translate such a 3D scene to a 2D image, the data has to go through several stages of a "Graphics Pipeline".

1.3 The Graphics Pipeline

Figure 5: The 3D Graphics Pipeline

First, among some other operations, we have to translate the data that is provided by the application from 3D to 2D.

1.3.1 Geometry Stage

This stage is also referred to as the "Transform and Lighting" stage. In order to
translate the scene from 3D to 2D, all the objects of a scene need to be transformed to various spaces, each with its own coordinate system, before the 3D image can be projected onto a 2D plane. These transformations are applied on a per-vertex basis.

Mathematical Principles

A point in 3D space usually has 3 coordinates specifying its position. If we keep using 3-dimensional vectors for the transformation calculations, we run into the problem that different transformations require different operations (e.g., translating a vertex requires addition with a vector, while rotating a vertex requires multiplication with a 3x3 matrix). We circumvent this problem simply by extending the 3-dimensional vector by another coordinate (the w-coordinate), thus getting what is called homogeneous coordinates. This way, every transformation can be applied by multiplying the vector with a specific 4x4 matrix, making calculations much easier.

Figure 6: Transformation matrices for translation, rotation and scaling

Lighting, the other major part of this pipeline stage, is calculated using the normal vectors of the surfaces of an object. In combination with the position of the camera and the position of the light source, one can compute the lighting properties of a given vertex.

Figure 7: Calculating lighting

For transformation, we start out in the model space, where each object (model) has its own coordinate system, which facilitates geometric transformations such as translation, rotation and scaling.

After that, we move on to the world space, where all objects within the scene have a unified coordinate system.

Figure 8: World space coordinates

The next step is the transformation into view space, which locates a camera in the world space and then transforms the scene such that the camera is at the origin of the view space, looking straight into the positive z-direction. Now we can define a view volume, the so-called view frustum, which will be used to decide what actually is visible and needs to be rendered.

Figure 9: The camera/eye, the view
frustum and its clipping planes

After that, the vertices are transformed into clip space and assembled into primitives (triangles or lines), which sets up the so-called clipping process. While objects that are outside of the frustum don't need to be rendered and can be discarded, objects that are partially inside the frustum need to be clipped (hence the name), and new vertices with proper texture and color coordinates need to be created.

A perspective divide is then performed, which transforms the frustum into a cube with normalized coordinates (x and y between -1 and 1, z between 0 and 1) while the objects inside the frustum are scaled accordingly. Having this normalized cube facilitates clipping operations and sets up the projection into 2D space (the cube simply needs to be "flattened").

Figure 10: Transforming into clip space

Finally, we can move into screen space, where x and y coordinates are transformed for proper 2D display (in a given window). (Note that the z-coordinate of a vertex is retained for later depth operations.)

Figure 11: From view space to screen space

Note that the texture coordinates need to be transformed as well. In addition to clipping, surfaces that aren't visible (e.g. the back side of a cube) are also removed (so-called back-face culling). The result is a 2D image of the 3D scene, and we can move on to the next stage.

1.3.2 Rasterization Stage

Next in the pipeline is the Rasterization stage. The GPU needs to traverse the 2D image and convert the data into a number of "pixel candidates", so-called fragments, which may later become pixels of the final image. A fragment is a data structure that contains attributes such as position, color, depth, texture coordinates, etc., and is generated by checking which part of any given primitive intersects with which pixel of the screen. If a fragment intersects with a primitive, but not any of its vertices, the attributes of that fragment have to be additionally calculated by interpolating the attributes between the
vertices.

Figure 12: Rasterizing a triangle and interpolating its color values

After that, further steps can be made to obtain the final pixels. Colors are calculated by combining textures with other attributes such as color and lighting, or by combining a fragment with either another translucent fragment (so-called alpha blending) or optional fog (another graphical effect).

Visibility checks are performed, such as:
• Scissor test (checking visibility against a rectangular mask)
• Stencil test (similar to the scissor test, only against arbitrary pixel masks in a buffer)
• Depth test (comparing the z-coordinate of fragments, discarding those which are further away)
• Alpha test (checking visibility against translucent fragments)

Additional procedures like anti-aliasing can be applied before we achieve the final result: a number of pixels that can be written into memory for later display.

This concludes our short tour through the graphics pipeline, which hopefully gives us a better idea of what kind of functionality will be required of a GPU.

2 Evolution of the GPU

Some historical key points in the development of the GPU:
• Efforts for real-time graphics have been made as early as 1944 (MIT's project "Whirlwind")
• In the 1980s, hardware similar to modern GPUs began to show up in the research community ("Pixel-Planes", a parallel system for rasterizing and texture-mapping 3D geometry)
• Graphics chips in the early 1980s were very limited in their functionality
• In the late 1980s and early 1990s, high-speed, general-purpose microprocessors became popular for implementing high-end GPUs (e.g. Texas Instruments' TMS340)
• 1985: The first mass-market graphics accelerator was included in the Commodore Amiga
• 1991: S3 introduced the first single-chip 2D accelerator, the S3 86C911
• 1995: Nvidia releases one of the first 3D accelerators, the NV1
• 1999: Nvidia's Geforce 256 is the first GPU to implement Transform and Lighting in hardware
• 2001: Nvidia implements the first programmable shader units with the Geforce 3
• 2005: ATI develops the first GPU with unified shader architecture
with the ATI Xenos for the Xbox 360
• 2006: Nvidia launches the first unified shader GPU for the PC with the Geforce 8800

3 From Theory to Practice - the Geforce 6800

3.1 Overview

Modern GPUs closely follow the layout of the graphics pipeline described in the first section. Using Nvidia's Geforce 6800 as an example, we will have a closer look at the architecture of modern day GPUs.

Since being founded in 1993, the company NVIDIA has become one of the biggest manufacturers of GPUs (besides ATI), having released important chips such as the Geforce 256 and the Geforce 3.

Launched in 2004, the Geforce 6800 belongs to the Geforce 6 series, Nvidia's sixth generation of graphics chipsets and the fourth generation that featured programmability (more on that later).

The following image shows a schematic view of the Geforce 6800 and its functional units.

Figure 13: Schematic view of the Geforce 6800

You can already see how each of the functional units corresponds to the stages of the graphics pipeline.

We start with six parallel vertex processors that receive data from the host (the CPU) and perform operations such as transformation and lighting.

Next, the output goes into the triangle setup stage, which takes care of primitive assembly, culling and clipping, and then into the rasterizer, which produces the fragments. The Geforce 6800 has an additional Z-cull unit, which allows an early fragment visibility check based on depth, further improving efficiency.

We then move on to the sixteen fragment processors, which operate in 4 parallel units and compute the output colors of each fragment.

The fragment crossbar is a linking element that is basically responsible for directing output pixels to any available pixel engine (also called ROP, short for Raster Operator), thus avoiding pipeline stalls.

The 16 pixel engines are the final stage of processing, and perform operations such as alpha blending, depth tests, etc., before delivering the final pixel to the frame buffer.

3.2 In Detail

Figure 14: A more detailed view of the Geforce 6800

While most parts of
the GPU are fixed-function units, the vertex and fragment processors of the Geforce 6800 offer programmability, which was first introduced to the Geforce chipset line with the Geforce 3 (2001). We'll have a more detailed look at the units in the following sections.

3.2.1 Vertex Processor

Figure 15: A vertex processor

The vertex processors are the programmable units responsible for all the vertex transformations and attribute calculations. They operate with 4-dimensional data vectors corresponding to the aforementioned homogeneous coordinates of a vertex, using 32 bits per coordinate (hence the 128 bits of a register). Instructions are 123 bits long and are stored in the Instruction RAM.

The data path of a vertex processor consists of:
• A multiply-add unit for 4-dimensional vectors
• A scalar special function unit
• A texture unit

Instruction set:
Some notable instructions for the vertex processor include:

dp4 dst, src0, src1    Computes the four-component dot product of the source registers
exp dst, src           Provides exponential 2^x
dst dest, src0, src1   Calculates a distance vector
nrm dst, src           Normalizes a 3D vector
rsq dst, src           Computes the reciprocal square root (positive only) of the source scalar

Registers in the vertex processor instructions can be modified (with few exceptions):
• Negate the register value
• Take the absolute value of the register
• Swizzling (copy any source register component to any temporary register component)
• Mask destination register components

Other technical details:
• Vertex processors are MIMD units (Multiple Instruction Multiple Data)
• They use VLIW (Very Long Instruction Words)
• They operate with 32-bit floating point precision
• Each vertex processor runs up to 3 threads to hide latency
• Each vertex processor can perform a four-wide MAD (Multiply-Add) and a special function in one cycle

3.2.2 Fragment Processor

Figure 16: A fragment processor

The Geforce 6800 has 16 fragment processors. They are grouped into 4 bigger units, which operate simultaneously on 4 fragments each (a so-called quad). They can take position, color, depth, fog as
well as other arbitrary 4-dimensional attributes as input. The data path consists of:
• An interpolation block for attributes
• 2 vector math (shader) units, each with slightly different functionality
• A fragment texture unit

Superscalarity:
A fragment processor works with 4-vectors (vector-oriented instruction set), where sometimes components of the vector need to be treated separately (e.g. color, alpha). Thus, the fragment processor supports co-issuing of the data, which means splitting the vector into 2 parts and executing different operations on them in the same clock. It supports 3-1 and 2-2 splitting (2-2 co-issue wasn't possible earlier). Additionally, it also features dual issue, which means executing different operations on the 2 vector math units in the same clock.

Texture Unit:
The texture unit is a floating-point texture processor which fetches and filters the texture data. It is connected to a level 1 texture cache (which stores parts of the textures that are used).

Shader units 1 and 2:
Each shader unit is limited in its abilities, offering complete functionality when used together.

Figure 17: Block diagram of Shader Unit 1 and 2

Shader Unit 1:
• Green: A crossbar which distributes the input coming either from the rasterizer or from the loopback
• Red: Interpolators
• Yellow: A special function unit (for functions such as Reciprocal, Reciprocal Square Root, etc.)
• Cyan: MUL channels
• Orange: A unit for texture operations (not the fragment texture unit)

The shader unit can perform 2 operations per clock: a MUL on a 3-dimensional vector and a special function, a special function and a texture operation, or 2 MULs.

The output of the special function unit can go into the MUL channels.

The texture operation unit gets input from the MUL unit and does LOD (Level Of Detail) calculations before passing the data to the actual fragment texture unit.

The fragment texture unit then performs the actual sampling and writes the data into registers for the second shader unit.

The shader unit can simply pass data as well.

Shader Unit 2:
• Red: A crossbar
• Cyan: 4 MUL channels
• Gray: 4 ADD
channels
• Yellow: 1 special function unit

The crossbar splits the input onto 5 channels (4 components, 1 channel stays free).

The ADD units are additionally connected, allowing advanced operations such as a dot product in one clock.

Again, the shader unit can handle 2 independent operations per cycle, or it can simply pass data. If no special function is used, the MAD unit can perform up to 2 operations from this list: MUL, ADD, MAD, DP, or any other instruction based on these operations.

Instruction set:
Some notable instructions for the fragment processor include:

cmp dst, src0, src1, src2             Chooses src1 if src0 >= 0, otherwise chooses src2. The comparison is done per channel
dsx dst, src                          Computes the rate of change in the render target's x-direction
dsy dst, src                          Computes the rate of change in the render target's y-direction
sincos dst.{x|y|xy}, src0.{x|y|z|w}   Computes sine and cosine, in radians
texld dst, src0, src1                 Samples a texture at a particular sampler, using provided texture coordinates

Registers in the fragment processor instructions can be modified (with few exceptions):
• Negate the register value
• Take the absolute value of the register
• Mask destination register components

Other technical details:
• The fragment processors can perform operations with 16-bit or 32-bit floating point precision (e.g. the fog unit uses only 16-bit precision for its calculations since that is sufficient)
• The quads operate as SIMD units
• They use VLIW
• They run up to 100s of threads to hide texture fetch latency (~256 per quad)
• A fragment processor can perform up to 8 operations per cycle, or 4 math operations if there's a texture fetch in shader 1

Figure 18: Possible operations per cycle

• The fragment processors have a 2-level texture cache
• The fog unit can perform fog blending on the final pass without performance penalty. It is implemented with fixed-point precision since that's sufficient for fog and saves performance. The equation: out = FogColor * fogFraction + SrcColor * (1 - fogFraction)
• There's support for multiple render targets; the pixel processor can output to up to four separate
buffers (4x4 values, color + depth)

3.2.3 Pixel Engine

Figure 19: A pixel engine

Last in the pipeline are the 16 pixel engines (raster operators). Each pixel engine connects to a specific memory partition of the GPU. After the lossless color and depth compression, the depth and color units perform depth, color and stencil operations before writing the final pixel. When activated, the pixel engines also perform multisample antialiasing.

3.2.4 Memory

From "GPU Gems 2, Chapter 30: The GeForce 6 Series GPU Architecture":

"The memory system is partitioned into up to four independent memory partitions, each with its own dynamic random-access memories (DRAMs). GPUs use standard DRAM modules rather than custom RAM technologies to take advantage of market economies and thereby reduce cost. Having smaller, independent memory partitions allows the memory subsystem to operate efficiently regardless of whether large or small blocks of data are transferred. All rendered surfaces are stored in the DRAMs, while textures and input data can be stored in the DRAMs or in system memory. The four independent memory partitions give the GPU a wide (256 bits), flexible memory subsystem, allowing for streaming of relatively small (32-byte) memory accesses at near the 35 GB/sec physical limit."

3.3 Performance

• 425 MHz internal graphics clock
• 550 MHz memory clock
• 256 MB memory size
• 35.2 GByte/second memory bandwidth
• 600 million vertices/second
• 6.4 billion texels/second
• 12.8 billion pixels/second, rendering z/stencil-only (useful for shadow volumes and shadow buffers)
• 6 four-wide fp32 vector MADs per clock cycle in the vertex shader, plus one scalar multifunction operation (a complex math operation, such as a sine or reciprocal square root)
• 16 four-wide fp32 vector MADs per clock cycle in the fragment processor, plus 16 four-wide fp32 multiplies per clock cycle
• 64 pixels per clock cycle early z-cull (reject rate)
• 120+ Gflops peak (equal to six 5-GHz Pentium 4 processors)
• Up to 120 W energy consumption (the card has two additional power connectors; the power sources are recommended to
be no less than 480 W)

4 Computational Principles

Stream Processing:
Typical CPUs (the von Neumann architecture) suffer from memory bottlenecks when processing. GPUs are very sensitive to such bottlenecks and therefore need a different architecture; they are essentially special-purpose stream processors.

A stream processor is a processor that works with so-called streams and kernels. A stream is a set of data and a kernel is a small program. In stream processors, every kernel takes one or more streams as input and outputs one or more streams, while it executes its operations on every single element of the input streams.

In stream processors you can achieve several levels of parallelism:
• Instruction-level parallelism: kernels perform hundreds of instructions on every stream element; you achieve parallelism by performing independent instructions in parallel
• Data-level parallelism: kernels perform the same instructions on each stream element; you achieve parallelism by performing one instruction on many stream elements at a time
• Task-level parallelism: have multiple stream processors divide the work from one kernel

Stream processors do not use caching the same way traditional processors do, since the input datasets are usually much larger than most caches and the data is barely reused. With GPUs, for example, the data is usually rendered and then discarded.

We know GPUs have to work with large amounts of data, and the computations are simpler but need to be fast and parallel, so it becomes clear that the stream processor architecture is very well suited for GPUs.
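
The stream and kernel vocabulary above can be made concrete with a small sketch. It is an illustration under assumptions rather than a model of any real GPU: `run_kernel`, `scale_kernel` and `clamp_kernel` are hypothetical names, streams are plain Python lists, and the elements are processed one after another where a stream processor would handle many at once.

```python
def run_kernel(kernel, *input_streams):
    """Apply `kernel` independently to each element position of the input streams.

    Because no element depends on another, a stream processor is free to
    process many positions at the same time (data-level parallelism).
    """
    return [kernel(*elements) for elements in zip(*input_streams)]


def scale_kernel(value, factor):
    """Kernel with two input streams and one output stream."""
    return value * factor


def clamp_kernel(value):
    """Kernel that limits each element to the range [0, 1]."""
    return max(0.0, min(1.0, value))


intensities = [0.25, 0.5, 0.75, 1.5]
factors = [2.0, 2.0, 0.5, 4.0]

# The output stream of one kernel becomes the input stream of the next.
scaled = run_kernel(scale_kernel, intensities, factors)
final = run_kernel(clamp_kernel, scaled)
print(final)
```

Chaining the two calls mirrors how data moves through a GPU: each stage consumes a stream, produces a new one, and the intermediate values are not revisited.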
Continuing these ideas, GPUs employ the following strategies to increase output:

Pipelining: Pipelining describes the idea of breaking down a job into multiple components that each perform a single task. GPUs are pipelined, which means that instead of performing complete processing of a pixel before moving on to the next, you fill the pipeline like an assembly line where each component performs a task on the data before passing it to the next stage. So while processing a pixel may take multiple clock cycles, you still achieve an output of one pixel per clock since you fill up the whole pipe.

Parallelism: Due to the nature of the data (parallelism can be applied on a per-vertex or per-pixel basis) and the type of processing (highly repetitive), GPUs are very suitable for parallelism. You could have an unlimited number of pipelines next to each other, as long as the CPU is able to keep them busy.

Other GPU characteristics:
• GPUs can afford large amounts of floating point computational power since they have lower control overhead
• They use dedicated functional units for specialized tasks to increase speed
• GPU memory struggles with bandwidth limitations, and therefore aims for maximum bandwidth usage, employing strategies like data compression, multiple threads to cope with latency, scheduling of DRAM cycles to minimize idle data-bus time, etc.
• Caches are designed to support effective streaming with local reuse of data. Rather than implementing a cache that achieves 99% hit rates (which isn't feasible), GPU cache designs assume a 90% hit rate with many misses in flight
• GPUs have many different performance regimes, all with different characteristics, and need to be designed accordingly

4.1 The Geforce 6800 as a general processor

You can see the Geforce 6800 as a general processor with a lot of floating-point horsepower and high memory bandwidth that can be used for other applications as well.

Figure 20: A general view of the Geforce 6800 architecture

Looking at the GPU that way, we get:
• 2 serially running programmable blocks with
fp32 capability.
• The rasterizer can be seen as a unit that expands the data into interpolated values (from one data "point" to multiple "fragments").
• With MRT (Multiple Render Targets), the fragment processor can output up to 16 scalar floating-point values at a time.
• Several possibilities to control the data flow by using the visibility checks of the pixel engines or the Z-cull unit

5 The next step: the Geforce 8800

After the Geforce 7 series, which was a continuation of the Geforce 6800 architecture, Nvidia introduced the Geforce 8800 in 2006. Driven by the desire to increase performance, improve image quality and facilitate programming, the Geforce 8800 presented a significant evolution of past designs: a unified shader architecture (note that ATI already used this architecture in 2005 with the Xbox 360 GPU).

Figure 21: From dedicated to unified architecture

Figure 22: A schematic view of the Geforce 8800

The unified shader architecture of the Geforce 8800 essentially boils down to the fact that all the different shader stages become one single stage that can handle all the different shaders.

As you can see in Figure 22, instead of different dedicated units we now have a single streaming processor array. We have familiar units such as the raster operators (blue, at the bottom) and the triangle setup, rasterization and z-cull unit. Besides these units we now have several managing units that prepare and manage the data as it flows in the loop (vertex, geometry and pixel thread issue, input assembler and thread processor).

Figure 23: The streaming processor array

The streaming processor array consists of 8 texture processor clusters. Each texture processor cluster in turn consists of 2 streaming multiprocessors and 1 texture pipe.

A streaming multiprocessor has 8 streaming processors and 2 special function units. The streaming processors work with 32-bit scalar data, based on the idea that shader programs are becoming more and more scalar, making a vector architecture more inefficient. They are driven by a high-speed clock that is separate from the
core clock and can perform a dual-issued MUL and MAD at each cycle. Each multiprocessor can have 768 hardware-scheduled threads, grouped together into 24 SIMD "warps" (a warp is a group of 32 threads).

The texture pipe consists of 4 texture addressing and 8 texture filtering units. It performs texture prefetching and filtering without consuming ALU resources, further increasing efficiency.

It is apparent that we gain a number of advantages with such a new architecture. For example, the old problem of constantly changing workload and one shader stage becoming a processing bottleneck is solved since the units can adapt dynamically, now that they are unified.

Figure 24: Workload balancing with both architectures

With a single instruction set and the support of fp32 throughout the whole pipeline, as well as the support of new data types (integer calculations), programming the GPU now becomes easier as well.

6 General Purpose Programming on the GPU - an example

We use the bitonic merge sort algorithm as an example for efficiently implementing algorithms on a GPU.

Bitonic merge sort:
Bitonic merge sort works by repeatedly building bitonic lists out of a set of elements and sorting them. A bitonic list is a concatenation of two monotonic lists, one ascending and one descending.

E.g.:
List A = (3, 4, 7, 8)
List B = (6, 5, 2, 1)
List AB = (3, 4, 7, 8, 6, 5, 2, 1) is a bitonic list
List BA = (6, 5, 2, 1, 3, 4, 7, 8) is also a bitonic list

Bitonic lists have a special property that is used in bitonic merge sort: Suppose you have a bitonic list of length 2n. You can rearrange the list so that you get two halves with n elements where each element (i) of the first half is less than or equal to each corresponding element (i+n) in the second half (or greater than or equal, if the list descends first and then ascends), and each of the two halves is again a bitonic list. This happens by comparing the corresponding elements and switching them if necessary. This procedure is called a bitonic merge.
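
The compare-and-switch step just described can be tried on the example list AB. The sketch below is a plain Python illustration, not GPU code, and `bitonic_split` is a hypothetical helper name. It compares element i with element i+n and keeps the smaller one in the first half.

```python
def bitonic_split(seq):
    """One compare-and-switch step on a bitonic list of even length.

    Element i is compared with element i + n (n = half the length). The smaller
    value goes to position i and the larger to position i + n.
    """
    n = len(seq) // 2
    lower = [min(seq[i], seq[i + n]) for i in range(n)]
    upper = [max(seq[i], seq[i + n]) for i in range(n)]
    return lower + upper


list_ab = [3, 4, 7, 8, 6, 5, 2, 1]
result = bitonic_split(list_ab)
print(result)  # [3, 4, 2, 1, 6, 5, 7, 8]
```

In this example every element of the first half (3, 4, 2, 1) is no larger than every element of the second half (6, 5, 7, 8), and each half is itself bitonic, so the same step can be applied to the two halves independently. All n comparisons are independent of one another, which is what makes the step attractive for parallel hardware.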
Bitonic merge sort works by recursively creating and merging bitonic lists that increase in their size until we reach the maximum size and the complete list is sorted. Figure 25 illustrates the process:

Figure 25: The different stages of the algorithm

The sorting process has a certain number of stages where comparison passes are performed. Each new stage increases the number of passes by one. This results in bitonic merge sort having a complexity of O(n log^2(n) + log(n)), which is worse than quicksort on average, but the algorithm has no worst-case scenario (where quicksort reaches O(n^2)).

The algorithm is very well suited for a GPU. Many of the operations can be performed in parallel and the length stays constant, given a certain number of elements.

Now when implementing this algorithm on the GPU, we want to make use of as many resources as possible (both in parallel as well as vertically along the pipeline), especially considering that the GPU has shortcomings as well, such as the lack of support for binary or integer operations. For example, simply letting the fragment processor stage handle all the calculations might work, but leaving all the other units unused is a waste of precious resources.

A possible solution looks like this:

In this algorithm, we have groups of elements (fragments) that have the same sorting conditions, while groups next to each other operate in opposite directions. Now if we draw a vertex quad over two adjacent groups and set appropriate flags at each corner, we can easily determine group membership by using the rasterizer. For example, if we set the left corners to -1 and the right corners to +1, we can check where a fragment belongs by simply looking at its sign (the interpolation process takes care of that).

Figure 26: Using vertex flags and the interpolator to determine compare operations

Next, we need to determine which compare operation to use and we need to locate the partner item to compare. Both can again be accomplished by using the flag value. Setting the compare operation to less-than and multiplying
with the flag value implicitly flips the operation to greater-equal halfway across the quad. Locating the partner item happens by multiplying the sign of the flag with an absolute value that specifies the distance between the items.

In order to sort elements of an array, we store them in a 2D texture. Each row is a set of elements and becomes its own bitonic sequence that needs to be sorted. If we extend the quads over the rows of the 2D texture and use the interpolation, we can modulate the comparison so the rows get sorted either up or down according to their row number. This way, pairs of rows become bitonic sequences again, which can be sorted in the same way we sorted the columns of the single rows, simply by transposing the quads.

As a final optimization, we reduce texture fetching by packing two neighbouring key pairs into one fragment, since the shader operates on 4-vectors.

Performance comparison:

std::sort: 16-bit data, Pentium 4 3.0 GHz
N        Full sorts/sec   Sorted keys/sec
256^2    82.5             5.4M
512^2    20.6             5.4M
1024^2   4.7              5.0M

Bitonic merge sort: 16-bit float data, NVIDIA Geforce 6800 Ultra
N        Passes   Full sorts/sec   Sorted keys/sec
256^2    120      90.07            6.1M
512^2    153      18.3             4.8M
1024^2   190      3.6              3.8M
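
To make the pass structure of bitonic merge sort easier to follow, here is a CPU-side sketch in Python. It is not the shader implementation described above: it uses the textbook index arithmetic (partner = i XOR distance) instead of vertex flags and interpolation, it assumes the number of keys is a power of two, and `bitonic_sort` is a hypothetical name. Each pass reads from one list and writes to another, loosely mirroring a fragment program that reads a texture and writes a render target.

```python
def bitonic_sort(keys):
    """Sort `keys` with a bitonic sorting network and return (sorted_list, passes).

    Assumes len(keys) is a power of two. Every pass treats each index as an
    independent work item, the way one fragment would be processed on a GPU.
    """
    n = len(keys)
    assert n & (n - 1) == 0, "length must be a power of two"
    data = list(keys)
    passes = 0
    block = 2  # size of the bitonic sequences being merged in this stage
    while block <= n:
        distance = block // 2  # distance between an item and its partner
        while distance >= 1:
            out = list(data)  # write target, separate from the read source
            for i in range(n):  # each i corresponds to one fragment
                partner = i ^ distance
                ascending = (i & block) == 0  # neighbouring blocks sort in opposite directions
                keep_min = (i < partner) == ascending
                a, b = data[i], data[partner]
                out[i] = min(a, b) if keep_min else max(a, b)
            data = out
            passes += 1
            distance //= 2
        block *= 2
    return data, passes


sorted_keys, num_passes = bitonic_sort([3, 4, 7, 8, 6, 5, 2, 1])
print(sorted_keys, num_passes)  # [1, 2, 3, 4, 5, 6, 7, 8] 6
```

For 8 keys the network needs 1 + 2 + 3 = 6 passes, which matches the statement that each new stage adds one pass. The number of passes depends only on the number of keys and never on their values, which is why the algorithm has no input-dependent worst case.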