Solutions to Exercises for Computer Architecture: A Quantitative Approach, 5th Edition



Computer Networks (5th Edition): Complete Answers to the Exercises

Answers to the end-of-chapter exercises for Computer Networks. Chapter 1: Overview.

1-1 What services can a computer network provide to its users?
Answer: The two most important things a computer network offers its users are connectivity and sharing.

1-2 Briefly describe the characteristics of packet switching.
Answer: Packet switching is essentially an outgrowth of the store-and-forward principle, and it combines the advantages of circuit switching and message switching. On each link, packet switching uses dynamic multiplexing to carry data that has been divided into many small segments of a fixed length, called packets. Once each packet has been labeled with a header, many packets can be carried over one physical line at the same time by dynamic multiplexing. Data from the sending user is first held in a switch's memory and then forwarded within the network. At the receiving end the packet headers are removed and the data fields are reassembled, in order, into the complete message. Compared with circuit switching, packet switching uses the lines more efficiently; compared with message switching, its transmission delay is lower and its interactivity is better.
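The delay advantage that the answer above attributes to packet switching can be made concrete with a small calculation. The C sketch below compares sending one message across several store-and-forward hops either as a single block or as a stream of packets, counting transmission delay only (propagation, queuing, and header overhead are ignored); the message size, packet size, link rate, and hop count are illustrative values I have assumed, not figures from the textbook.

    #include <stdio.h>

    /* Transmission-only delay across 'hops' store-and-forward links at 'rate_bps'.
     * Whole message: every hop must receive the entire message before forwarding it.
     * Packets: once the first packet reaches a hop, forwarding overlaps with the
     * arrival of later packets (pipelining), which is the benefit described above. */
    static double message_switching_s(double msg_bits, double rate_bps, int hops) {
        return hops * (msg_bits / rate_bps);
    }

    static double packet_switching_s(double msg_bits, double pkt_bits, double rate_bps, int hops) {
        double packets = msg_bits / pkt_bits;            /* assume it divides evenly */
        return (hops + packets - 1.0) * (pkt_bits / rate_bps);
    }

    int main(void) {
        double msg = 8e6, pkt = 8e3, rate = 1e6;         /* 1MB message, 1KB packets, 1Mb/s links */
        int hops = 3;
        printf("message switching: %.2f s\n", message_switching_s(msg, rate, hops));     /* 24.00 s */
        printf("packet switching : %.2f s\n", packet_switching_s(msg, pkt, rate, hops)); /* ~8.02 s */
        return 0;
    }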

1-3 Compare the main advantages and disadvantages of circuit switching, message switching, and packet switching from several angles.
Answer: (1) Circuit switching. When computer terminals communicate by circuit switching, one side places a call and a physical line is occupied exclusively for the connection. Once the switch completes the connection and the other side receives the caller's signal, the two sides can communicate, and they occupy that circuit for the whole duration of the call. Its strengths are good real-time behavior, low delay, and relatively low switching-equipment cost; its drawbacks are low line utilization, long circuit set-up time, low communication efficiency, and the inability of different types of terminals to communicate with one another. Circuit switching is best suited to communication between fixed users who communicate frequently and exchange large volumes of data or long messages.
(2) Message switching. The user's message is stored in the switch's memory; when the required output circuit becomes free, the message is sent on to the next switch or to the terminal, so data travels through the network in store-and-forward fashion. Its advantages are high utilization of trunk circuits, the ability for many users to share one line, and interworking between terminals with different speeds and procedures. Its disadvantages are equally clear: because whole messages are stored and forwarded, network delay is large, a great deal of switch memory and external storage is consumed, and users with strict real-time requirements cannot be satisfied. Message switching suits users whose messages are short and whose real-time requirements are low, such as the public telegraph network.
(3) Packet switching. Packet switching is essentially an outgrowth of store-and-forward and combines the advantages of circuit switching and message switching. On each link it uses dynamic multiplexing to carry data divided into many small segments of a fixed length, the packets.

Computer Networks (5th Edition): Answers to End-of-Chapter Exercises

Computer Networks (5th Edition) answers. Chapter 1: Overview.

1-01 What services can a computer network provide to its users? Answer: Connectivity and sharing.

1-02 Briefly describe the key points of packet switching. Answer: (1) The message is divided into packets and a header is added to each; (2) the packets are stored and forwarded by routers; (3) the packets are reassembled at the destination.

1-03 Compare the main advantages and disadvantages of circuit switching, message switching, and packet switching. Answer: (1) Circuit switching: because communication resources are reserved for the connection, end-to-end quality is reliably guaranteed, and it is efficient for continuously transferring large amounts of data. (2) Message switching: no transmission bandwidth has to be reserved in advance; bandwidth is used dynamically hop by hop, so it handles bursty traffic efficiently and quickly. (3) Packet switching: it keeps the efficiency and speed of message switching, and because each packet is small, routing is flexible and the network is more survivable.

1-04 Why is the Internet said to be the greatest change in human communication since printing? Answer: It merges other communication networks and plays the central role in informatization; it provides the best connectivity and information sharing, and for the first time it offers real-time interaction in every media form.

1-05 Into roughly what stages can the development of the Internet be divided, and what are the main features of each? Answer: The evolution from the single network ARPANET toward an internet, during which the TCP/IP protocols took their initial shape; the construction of a three-level Internet made up of backbone, regional, and campus networks; and the formation of a multi-level, ISP-based Internet, with ISPs appearing for the first time.

1-06 Briefly describe the stages through which an Internet standard passes. Answer: (1) Internet Draft: at this stage it is not yet an RFC document. (2) Proposed Standard: from this stage on it is an RFC document. (3) Draft Standard. (4) Internet Standard.

1-07 What is the important difference in meaning between the names internet (lowercase) and Internet (capitalized)? Answer: (1) internet (an internetwork) is a generic noun for any network formed by interconnecting multiple computer networks; no particular protocol is implied. (2) Internet is a proper noun for the specific worldwide internetwork that uses the TCP/IP protocol suite. The relationship is that the latter is a particular instance of the former.

1-08 What categories of computer networks are there, and what characterizes each category? Answer: By coverage: (1) wide area networks (WANs): long range, high speed, the core of the Internet.
Computer Architecture: A Quantitative Approach
Throughout the book the quantitative approach runs from beginning to end, covering data collection, model building, and data analysis. Using this approach, the authors examine every aspect of computer architecture in depth, giving readers concrete and detailed material to study.

Judging from the table of contents, the book is organized into four parts. The first part, "Introduction", presents the basic concepts of computer architecture and the importance of the quantitative approach. The second part, "Quantitative Research Methods", explains each step of the quantitative approach in detail, including data collection, simulation, and performance evaluation. The third part, "Elements of Computer Architecture", analyzes the components of a computer system, including the processor, memory, and the I/O system. The final part, "Optimizing Computer Architecture", shows how the quantitative approach can be used to optimize an architecture and improve system performance.

In this book the authors not only cover the fundamentals of computer architecture but also explore leading-edge topics such as parallel computing, pipelining, and superscalar techniques. The book also offers many case studies and real application scenarios that help readers understand and apply the theory.
Reading Impressions

The authors also pose a number of challenging questions that push the reader to think further and investigate on their own. While reading the book I kept recalling the difficulties I had when studying computer organization. That course described each hardware component of a computer and how the components relate and connect, but it offered no coherent theory of how to configure a processor's registers and related resources. Computer Architecture: A Quantitative Approach fills that gap: it gives us a method for planning the characteristics of each functional block sensibly according to the application scenario. It is an impressive work that made me look at the field of computer architecture afresh. The depth and breadth of the book, together with the authors' expertise and insight, make it a valuable resource for study and research. I believe it is worth reading not only for computer science students and researchers but for computer enthusiasts in general; it is a classic worth keeping.

Computer Architecture: A Quantitative Approach is one of the most inspiring computer science books I have read. With its systematic, in-depth perspective it presents every aspect of computer architecture, including fundamentals of design, memory hierarchy design, instruction-level parallelism and its exploitation, data-level parallelism, GPU architectures, thread-level parallelism, and warehouse-scale computers. Reading it gave me a deeper understanding of computer architecture and taught me some practical quantitative research methods.


Computer Networks (5th Edition): Complete Answers to End-of-Chapter Exercises


1-5 Into what main stages can the development of the Internet be divided, and what are the most important characteristics of each stage?
Answer: The first stage was the evolution from the single network ARPANET toward an internet. The original packet-switched network ARPANET was a single network: every host that joined it had to connect directly to a nearby packet switch. It later evolved so that any computer using the TCP/IP protocols could communicate over the internet. The second stage, from 1985 to 1993, is characterized by the construction of a three-level Internet. The third stage, from 1993 to the present, is characterized by the gradual formation of a multi-level, ISP-based Internet.

1-24 Describe the main points of the five-layer protocol architecture, including the main function of each layer.
Answer: The five-layer architecture is a teaching model, obtained by combining the OSI seven-layer model with the four-layer TCP/IP model, that makes the principles of computer networks easier to study. The five layers, from top to bottom, are the application layer, the transport layer, the network layer, the data link layer, and the physical layer (the original answer refers to its Figure 1-1 for the layer diagram). The main functions of the layers are:
(1) Application layer: determines the nature of the communication between processes so as to meet the users' needs. It not only supplies the information exchange and remote operations that application processes require, but also acts as the user agent of the interacting application processes, carrying out the functions needed for semantically meaningful information exchange.
(2) Transport layer: responsible for communication between two processes running in different hosts. The Internet transport layer can use two different protocols: the connection-oriented Transmission Control Protocol (TCP) and the connectionless User Datagram Protocol (UDP). The connection-oriented service can provide reliable delivery; the connectionless service cannot.

University Computer Fundamentals (5th Edition): Reference Answers to the Exercises

Answers to the exercises in University Computer Fundamentals (March 2017). Notes: 1. Some discussion questions have no standard answer; students should consult the textbook, the campus network, or the Internet. 2. For discussion questions, a self-consistent answer is rated "fair", a well-reasoned answer "good", and an answer supported by examples "excellent". 3. Discussion and short-answer questions should be answered with brief bullet points; however many points there are, three correct points count as fully correct.

Chapter 1: Computation and Computational Thinking

1-1 Briefly describe the three historical stages in the development of computing devices. Answer: (1) ancient calculating tools; (2) medieval calculating machines; (3) modern computers.

1-2 Briefly describe the advantages of the multiplication-table (nine-nines) algorithm. Answer: (1) it establishes a complete set of algorithmic rules; (2) it provides temporary storage, allowing continuous calculation; (3) it introduced a base-five grouping; (4) it is simple to make and easy to carry.

1-3 Briefly describe the characteristics of a computer cluster. Answer: (1) multiple computers are joined by a network into a single cluster; (2) the cluster is managed as a single system; (3) it computes in parallel; (4) it provides high-performance, non-stop service; (5) its aggregate computing power is very high; (6) it has good fault tolerance.

1-4 Briefly describe the main characteristics of each class of computer. Answer: (1) Mainframes offer high performance. (2) Microcomputers have a huge base of application software, excellent compatibility, and high performance at low cost. (3) Embedded computers must above all be reliable.

1-5 Briefly explain the significance of the Turing machine. Answer: (1) the Turing machine established the theory of universal computation; (2) it introduced the concepts of reading and writing, algorithms, programs, and artificial intelligence; (3) complex theoretical problems can be reduced to a Turing machine for analysis; (4) the Turing machine lets us analyze what is computable and what is not.

1-6 Briefly explain the importance of von Neumann's stored-program idea. Answer: (1) it provided the theoretical basis for program-controlled computers; (2) it unified programs and data; (3) it made program-controlled computers practical; (4) it improved computational efficiency; (5) it laid the theoretical foundation for programming as a profession.

1-7 Briefly explain what computational thinking is. Answer: According to Jeannette Wing, computational thinking uses the fundamental concepts of computer science to solve problems, design systems, and understand human behavior; it covers a whole range of mental activities drawn from computer science.

1-8 Give examples of applications of computational thinking. Answer: (1) Complexity analysis, as in the analysis of warfare, economics, or algorithms. (2) Abstraction, as in data types and mathematical formulas.


Contents: Chapter 1 Solutions, Chapter 2 Solutions, Chapter 3 Solutions, Chapter 4 Solutions, Chapter 5 Solutions, Chapter 6 Solutions, Appendix A Solutions, Appendix B Solutions, Appendix C Solutions.

Solutions to Case Studies and Exercises

Chapter 1 Solutions

Case Study 1: Chip Fabrication Cost

1.1 a. Yield = (1 + (0.30 × 3.89)/4.0)^(-4) = 0.36
b. It is fabricated in a larger technology, which is an older plant. As plants age, their process gets tuned, and the defect rate decreases.

1.2 a. Dies per wafer = π × (30/2)^2 / 1.5 - π × 30 / sqrt(2 × 1.5) = 471 - 54.4 = 416
Yield = (1 + (0.30 × 1.5)/4.0)^(-4) = 0.65
Profit = 416 × 0.65 × $20 = $5408
b. Dies per wafer = π × (30/2)^2 / 2.5 - π × 30 / sqrt(2 × 2.5) = 283 - 42.1 = 240
Yield = (1 + (0.30 × 2.5)/4.0)^(-4) = 0.50
Profit = 240 × 0.50 × $25 = $3000
c. The Woods chip.
d. Woods chips: 50,000/416 = 120.2 wafers needed
Markon chips: 25,000/240 = 104.2 wafers needed
Therefore, the most lucrative split is 120 Woods wafers, 30 Markon wafers.

1.3 a. Defect-free single core = (1 + (0.75 × 1.99/2)/4.0)^(-4) = 0.28
No defects = 0.28^2 = 0.08
One defect = 0.28 × 0.72 × 2 = 0.40
No more than one defect = 0.08 + 0.40 = 0.48
b. $20 = Wafer size / (old dpw × 0.28)
x = Wafer size / (1/2 × old dpw × 0.48) = $20 × 0.28 / (1/2 × 0.48) = $23.33

Case Study 2: Power Consumption in Computer Systems

1.4 a. 0.80x = 66 + 2 × 2.3 + 7.9; x = 99
b. 0.6 × 4 W + 0.4 × 7.9 W = 5.56 W
c. Solve the following four equations:
seek7200 = 0.75 × seek5400
seek7200 + idle7200 = 100
seek5400 + idle5400 = 100
seek7200 × 7.9 + idle7200 × 4 = seek5400 × 7 + idle5400 × 2.9
idle7200 = 29.8%

1.5 a. 14 KW / (66 W + 2.3 W + 7.9 W) = 183
b. 14 KW / (66 W + 2.3 W + 2 × 7.9 W) = 166
c. 200 W × 11 = 2200 W
2200 / 76.2 = 28 racks
Only 1 cooling door is required.

1.6 a. The IBM x346 could take less space, which would save money in real estate. The racks might be better laid out. It could also be much cheaper. In addition, if we were running applications that did not match the characteristics of these benchmarks, the IBM x346 might be faster. Finally, there are no reliability numbers shown. Although we do not know that the IBM x346 is better in any of these areas, we do not know it is worse, either.

1.7 a. (1 - 0.8) + 0.8/2 = 0.2 + 0.4 = 0.6
b. Power_new / Power_old = ((V × 0.60)^2 × (F × 0.60)) / (V^2 × F) = 0.60^3 = 0.216
c. 1/0.75 = 1/((1 - x) + x/2); x = 50%
d. Power_new / Power_old = ((V × 0.75)^2 × (F × 0.60)) / (V^2 × F) = 0.75^2 × 0.6 = 0.338
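The Case Study 1 answers above all come from two formulas: dies per wafer and die yield. The C sketch below (C, to match the code style used elsewhere in these solutions) plugs in the same values used in the worked answers: a 30 cm wafer, 0.30 defects per cm², alpha = 4, and die areas of 1.5 cm² and 2.5 cm². The function names and the program itself are illustrative, and the profit figures differ slightly from the text because the text rounds the yield to two digits before multiplying.

    #include <math.h>
    #include <stdio.h>

    static const double PI = 3.141592653589793;

    /* Dies per wafer and die yield, as used in the 1.1/1.2 answers above.
     * wafer_diam in cm, die_area in cm^2, defects in defects/cm^2. */
    static double dies_per_wafer(double wafer_diam, double die_area) {
        double r = wafer_diam / 2.0;
        return PI * r * r / die_area - PI * wafer_diam / sqrt(2.0 * die_area);
    }

    static double die_yield(double defects, double die_area, double alpha) {
        return pow(1.0 + defects * die_area / alpha, -alpha);
    }

    int main(void) {
        double dpw_woods  = floor(dies_per_wafer(30.0, 1.5));   /* 416 */
        double dpw_markon = floor(dies_per_wafer(30.0, 2.5));   /* 240 */
        double y_woods  = die_yield(0.30, 1.5, 4.0);             /* ~0.65 */
        double y_markon = die_yield(0.30, 2.5, 4.0);             /* ~0.50 */
        /* profit per wafer; close to the $5408 and $3000 in the text */
        printf("Woods : %.0f dies, yield %.2f, profit $%.0f\n", dpw_woods, y_woods, dpw_woods * y_woods * 20.0);
        printf("Markon: %.0f dies, yield %.2f, profit $%.0f\n", dpw_markon, y_markon, dpw_markon * y_markon * 25.0);
        return 0;   /* compile with -lm */
    }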
Exercises

1.8 a. (1.35)^10 = approximately 20
b. 3200 × (1.4)^12 = approximately 181,420
c. 3200 × (1.01)^12 = approximately 3605
d. Power density, which is the power consumed over the increasingly small area, has created too much heat for heat sinks to dissipate. This has limited the activity of the transistors on the chip. Instead of increasing the clock rate, manufacturers are placing multiple cores on the chip.
e. Anything in the 15-25% range would be a reasonable conclusion based on the decline in the rate over history. As the sudden stop in clock rate shows, though, even the declines do not always follow predictions.

1.9 a. 50%
b. Energy = 1/2 × load × V^2. Changing the frequency does not affect energy, only power. So the new energy is 1/2 × load × (1/2 × V)^2, reducing it to about 1/4 the old energy.

1.10 a. 60%
b. 0.4 + 0.6 × 0.2 = 0.58, which reduces the energy to 58% of the original energy.
c. newPower/oldPower = (1/2 × Capacitance × (Voltage × 0.8)^2 × (Frequency × 0.6)) / (1/2 × Capacitance × Voltage^2 × Frequency) = 0.8^2 × 0.6 = 0.256 of the original power.
d. 0.4 + 0.3 × 0.2 = 0.46, which reduces the energy to 46% of the original energy.

1.11 a. 10^9 / 100 = 10^7
b. 10^7 / (10^7 + 24) = 1
c. [need solution]

1.12 a. 35/10,000 × 3333 = 11.67 days
b. There are several correct answers. One would be that, with the current system, one computer fails approximately every 5 minutes. 5 minutes is unlikely to be enough time to isolate the computer, swap it out, and get the computer back on line again. 10 minutes, however, is much more likely. In any case, it would greatly extend the amount of time before 1/3 of the computers have failed at once. Because the cost of downtime is so huge, being able to extend this is very valuable.
c. $90,000 = (x + x + x + 2x)/4
$360,000 = 5x
$72,000 = x
4th quarter = $144,000/hr

1.13 a. Itanium, because it has a lower overall execution time.
b. Opteron: 0.6 × 0.92 + 0.2 × 1.03 + 0.2 × 0.65 = 0.888
c. 1/0.888 = 1.126

1.14 a. See Figure S.1.
b. 2 = 1/((1 - x) + x/10); x = 5/9 = 0.56, or 56%
c. 0.056/0.5 = 0.11, or 11%
d. Maximum speedup = 1/(1/10) = 10
5 = 1/((1 - x) + x/10); x = 8/9 = 0.89, or 89%
e. Current speedup: 1/(0.3 + 0.7/10) = 1/0.37 = 2.7
Speedup goal: 5.4 = 1/((1 - x) + x/10) gives x = 0.91
This means the percentage of vectorization would need to be 91%.

Figure S.1 Plot of the equation y = 100/((100 - x) + x/10).

1.15 a. Old execution time = 0.5 new + 0.5 × 10 new = 5.5 new
b. In the original code, the unenhanced part is equal in time to the enhanced part sped up by 10, therefore:
(1 - x) = x/10
10 - 10x = x
10 = 11x
x = 10/11 = 0.91

1.16 a. 1/(0.8 + 0.20/2) = 1.11
b. 1/(0.7 + 0.20/2 + 0.10 × 3/2) = 1.05
c. fp ops: 0.1/0.95 = 10.5%, cache: 0.15/0.95 = 15.8%

1.17 a. 1/(0.6 + 0.4/2) = 1.25
b. 1/(0.01 + 0.99/2) = 1.98
c. 1/(0.2 + 0.8 × 0.6 + 0.8 × 0.4/2) = 1/(0.2 + 0.48 + 0.16) = 1.19
d. 1/(0.8 + 0.2 × 0.01 + 0.2 × 0.99/2) = 1/(0.8 + 0.002 + 0.099) = 1.11

1.18 a. 1/(0.2 + 0.8/N)
b. 1/(0.2 + 8 × 0.005 + 0.8/8) = 2.94
c. 1/(0.2 + 3 × 0.005 + 0.8/8) = 3.17
d. 1/(0.2 + logN × 0.005 + 0.8/N)
e. d/dN (1/((1 - P) + logN × 0.005 + P/N)) = 0
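Exercises 1.14 through 1.18 are all applications of Amdahl's Law. The C sketch below evaluates the same speedup expressions; the parameter values are the ones quoted in the answers above (1.14e, 1.17a, 1.18b and 1.18c), and the helper names are mine.

    #include <math.h>
    #include <stdio.h>

    /* Amdahl's Law: speedup when a fraction f of the work is sped up by a factor s. */
    static double amdahl(double f, double s) {
        return 1.0 / ((1.0 - f) + f / s);
    }

    /* 1.18-style model: parallel fraction p on n processors, plus a per-processor
     * communication overhead added to the serial part. */
    static double speedup_with_overhead(double p, double n, double overhead_per_proc) {
        return 1.0 / ((1.0 - p) + n * overhead_per_proc + p / n);
    }

    int main(void) {
        printf("1.14e current speedup        : %.2f\n", amdahl(0.70, 10.0));                   /* ~2.70 */
        printf("1.17a speedup                : %.2f\n", amdahl(0.40, 2.0));                    /* 1.25  */
        printf("1.18b speedup, 8 processors  : %.2f\n", speedup_with_overhead(0.8, 8.0, 0.005)); /* ~2.94 */
        printf("1.18c speedup, log2(N) factor: %.2f\n",
               1.0 / (0.2 + log2(8.0) * 0.005 + 0.8 / 8.0));                                   /* ~3.17 */
        return 0;   /* compile with -lm */
    }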
Chapter 2 Solutions

Case Study 1: Optimizing Cache Performance via Advanced Techniques

2.1 a. Each element is 8B. Since a 64B cache line has 8 elements, and each column access will result in fetching a new line for the non-ideal matrix, we need a minimum of 8×8 (64 elements) for each matrix. Hence, the minimum cache size is 128 × 8B = 1KB.
b. The blocked version only has to fetch each input and output element once. The unblocked version will have one cache miss for every 64B/8B = 8 row elements. Each column requires 64B × 256 of storage, or 16KB. Thus, column elements will be replaced in the cache before they can be used again. Hence the unblocked version will have 9 misses (1 row and 8 columns) for every 2 in the blocked version.
c.
    for (i = 0; i < 256; i = i + B) {
        for (j = 0; j < 256; j = j + B) {
            for (m = 0; m < B; m++) {
                for (n = 0; n < B; n++) {
                    output[j + n][i + m] = input[i + m][j + n];
                }
            }
        }
    }
d. 2-way set associative. In a direct-mapped cache the blocks could be allocated so that they map to overlapping regions in the cache.
e. You should be able to determine the level-1 cache size by varying the block size. The ratio of the blocked and unblocked program speeds for arrays that do not fit in the cache, in comparison to blocks that do, is a function of the cache block size, whether the machine has out-of-order issue, and the bandwidth provided by the level-2 cache. You may have discrepancies if your machine has a write-through level-1 cache and the write buffer becomes a limiter of performance.

2.2 Since the unblocked version is too large to fit in the cache, processing eight 8B elements requires fetching one 64B row cache block and 8 column cache blocks. Since each iteration requires 2 cycles without misses, prefetches can be initiated every 2 cycles, and the number of prefetches per iteration is more than one, the memory system will be completely saturated with prefetches. Because the latency of a prefetch is 16 cycles, and one will start every 2 cycles, 16/2 = 8 will be outstanding at a time.

2.3 Open hands-on exercise, no fixed solution.

Case Study 2: Putting It All Together: Highly Parallel Memory Systems

2.4 a. The second-level cache is 1MB and has a 128B block size.
b. The miss penalty of the second-level cache is approximately 105ns.
c. The second-level cache is 8-way set associative.
d. The main memory is 512MB.
e. Walking through pages with a 16B stride takes 946ns per reference. With 250 such references per page, this works out to approximately 240ms per page.

2.5 a. Hint: This is visible in the graph above as a slight increase in L2 miss service time for large data sets, and is 4KB for the graph above.
b. Hint: Take independent strides by the page size and look for increases in latency not attributable to cache sizes. This may be hard to discern if the amount of memory mapped by the TLB is almost the same as the size of a cache level.
c. Hint: This is visible in the graph above as a slight increase in L2 miss service time for large data sets, and is 15ns in the graph above.
d. Hint: Take independent strides that are multiples of the page size to see if the TLB is fully associative or set associative. This may be hard to discern if the amount of memory mapped by the TLB is almost the same as the size of a cache level.

2.6 a. Hint: Look at the speed of programs that easily fit in the top-level cache as a function of the number of threads.
b. Hint: Compare the performance of independent references as a function of their placement in memory.

2.7 Open hands-on exercise, no fixed solution.
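Case Study 2 infers cache and TLB parameters from how access latency changes with working-set size and stride. The C sketch below is a minimal harness of that kind; the buffer sizes, stride, iteration count, and the simple strided loop are assumptions of mine, not the measurement code behind the case study, and a serious version would use a dependent pointer chase so that hardware prefetching cannot hide the misses.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Touch a buffer of 'size' bytes with a fixed stride many times and report the
     * average time per access. Plotting this against size (and stride) exposes cache
     * capacities and miss penalties in the spirit of 2.4 and 2.5. The buffer is
     * volatile so the loop is not optimized away. */
    static double ns_per_access(size_t size, size_t stride, size_t iters) {
        volatile uint8_t *buf = malloc(size);
        uint8_t sink = 0;
        struct timespec t0, t1;
        for (size_t i = 0; i < size; i++) buf[i] = (uint8_t)i;   /* fault every page in first */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < iters; i++)
            sink ^= buf[(i * stride) % size];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        free((void *)buf);
        (void)sink;
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        return ns / (double)iters;
    }

    int main(void) {
        for (size_t kb = 16; kb <= 64 * 1024; kb *= 2)           /* 16KB .. 64MB working sets */
            printf("%6zu KB : %.2f ns/access\n", kb, ns_per_access(kb * 1024, 64, 10000000));
        return 0;
    }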
Exercises

2.8 a. The access time of the direct-mapped cache is 0.86ns, while the 2-way and 4-way are 1.12ns and 1.37ns respectively. This makes the relative access times 1.12/0.86 = 1.30 or 30% more for the 2-way and 1.37/0.86 = 1.59 or 59% more for the 4-way.
b. The access time of the 16KB cache is 1.27ns, while the 32KB and 64KB are 1.35ns and 1.37ns respectively. This makes the relative access times 1.35/1.27 = 1.06 or 6% larger for the 32KB and 1.37/1.27 = 1.078 or 8% larger for the 64KB.
c. Avg. access time = hit% × hit time + miss% × miss penalty, where miss% = misses per instruction / references per instruction = 2.2% (DM), 1.2% (2-way), 0.33% (4-way), 0.09% (8-way).
Direct mapped access time = 0.86ns @ 0.5ns cycle time = 2 cycles
2-way set associative = 1.12ns @ 0.5ns cycle time = 3 cycles
4-way set associative = 1.37ns @ 0.83ns cycle time = 2 cycles
8-way set associative = 2.03ns @ 0.79ns cycle time = 3 cycles
Miss penalty = 10/0.5 = 20 cycles for DM and 2-way; 10/0.83 = 13 cycles for 4-way; 10/0.79 = 13 cycles for 8-way.
Direct mapped: (1 - 0.022) × 2 + 0.022 × 20 = 2.396 cycles, so 2.396 × 0.5 = 1.2ns
2-way: (1 - 0.012) × 3 + 0.012 × 20 = 3.2 cycles, so 3.2 × 0.5 = 1.6ns
4-way: (1 - 0.0033) × 2 + 0.0033 × 13 = 2.036 cycles, so 2.036 × 0.83 = 1.69ns
8-way: (1 - 0.0009) × 3 + 0.0009 × 13 = 3 cycles, so 3 × 0.79 = 2.37ns
The direct-mapped cache is the best.

2.9 a. The average memory access time of the current (4-way, 64KB) cache is 1.69ns.
64KB direct-mapped cache access time = 0.86ns @ 0.5ns cycle time = 2 cycles.
A way-predicted cache has cycle time and access time similar to a direct-mapped cache and a miss rate similar to a 4-way cache.
The AMAT of the way-predicted cache has three components: miss, hit with way prediction correct, and hit with way prediction mispredicted: 0.0033 × 20 + (0.80 × 2 + (1 - 0.80) × 3) × (1 - 0.0033) = 2.26 cycles = 1.13ns
b. The cycle time of the 64KB 4-way cache is 0.83ns, while the 64KB direct-mapped cache can be accessed in 0.5ns. This provides 0.83/0.5 = 1.66, or 66% faster cache access.
c. With a 1-cycle way-misprediction penalty, AMAT is 1.13ns (as per part a), but with a 15-cycle misprediction penalty the AMAT becomes: 0.0033 × 20 + (0.80 × 2 + (1 - 0.80) × 15) × (1 - 0.0033) = 4.65 cycles, or 2.3ns.
d. The serial access is 2.4ns/1.59ns = 1.509, or 51% slower.

2.10 a. The access time is 1.12ns, while the cycle time is 0.51ns, which could be potentially pipelined as finely as 1.12/0.51 = 2.2 pipestages.
b. The pipelined design (not including latch area and power) has an area of 1.19 mm^2 and energy per access of 0.16nJ. The banked cache has an area of 1.36 mm^2 and energy per access of 0.13nJ. The banked design uses slightly more area because it has more sense amps and other circuitry to support the two banks, while the pipelined design burns slightly more power because the memory arrays that are active are larger than in the banked case.

2.11 a. With critical word first, the miss service would require 120 cycles. Without critical word first, it would require 120 cycles for the first 16B and 16 cycles for each of the next 3 16B blocks, or 120 + (3 × 16) = 168 cycles.
b. It depends on the contribution to Average Memory Access Time (AMAT) of the level-1 and level-2 cache misses and the percent reduction in miss service times provided by critical word first and early restart. If the percentage reduction in miss service times provided by critical word first and early restart is roughly the same for both level-1 and level-2 miss service, then if level-1 misses contribute more to AMAT, critical word first would likely be more important for level-1 misses.
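The comparisons in 2.8 and 2.9 reduce to one average-memory-access-time formula. The C sketch below recomputes the 2.8c figures; the access times, cycle times, miss rates, and 10ns miss penalty are the numbers quoted above, and rounding hit and miss times up to whole cycles follows the same convention as the answer.

    #include <math.h>
    #include <stdio.h>

    /* AMAT in ns for one cache configuration, following the method of 2.8c:
     * hit time and miss penalty are first rounded up to whole cycles. */
    static double amat_ns(double access_ns, double cycle_ns, double miss_rate, double miss_penalty_ns) {
        double hit_cycles  = ceil(access_ns / cycle_ns);
        double miss_cycles = ceil(miss_penalty_ns / cycle_ns);
        return ((1.0 - miss_rate) * hit_cycles + miss_rate * miss_cycles) * cycle_ns;
    }

    int main(void) {
        /*                         access  cycle  miss rate (10ns miss penalty everywhere) */
        printf("direct mapped : %.2f ns\n", amat_ns(0.86, 0.50, 0.0220, 10.0)); /* ~1.20 ns */
        printf("2-way         : %.2f ns\n", amat_ns(1.12, 0.50, 0.0120, 10.0)); /* ~1.60 ns */
        printf("4-way         : %.2f ns\n", amat_ns(1.37, 0.83, 0.0033, 10.0)); /* ~1.69 ns */
        printf("8-way         : %.2f ns\n", amat_ns(2.03, 0.79, 0.0009, 10.0)); /* ~2.37 ns */
        return 0;   /* compile with -lm */
    }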
2.12 a. 16B, to match the level-2 data cache write path.
b. Assume merging write buffer entries are 16B wide. Since each store can write 8B, a merging write buffer entry would fill up in 2 cycles. The level-2 cache will take 4 cycles to write each entry. A non-merging write buffer would take 4 cycles to write the 8B result of each store. This means the merging write buffer would be 2 times faster.
c. With blocking caches, the presence of misses effectively freezes progress made by the machine, so whether there are misses or not doesn't change the required number of write buffer entries. With non-blocking caches, writes can be processed from the write buffer during misses, which may mean fewer entries are needed.

2.13 a. A 2GB DRAM with parity or ECC effectively has 9-bit bytes, and would require 18 1Gb DRAMs. To create 72 output bits, each one would have to output 72/18 = 4 bits.
b. A burst length of 4 reads out 32B.
c. The DDR-667 DIMM bandwidth is 667 × 8 = 5336 MB/s. The DDR-533 DIMM bandwidth is 533 × 8 = 4264 MB/s.

2.14 a. This is similar to the scenario given in the figure, but tRCD and CL are both 5. In addition, we are fetching two times the data in the figure. Thus it requires 5 + 5 + 4 × 2 = 18 cycles of a 333MHz clock, or 18 × (1/333MHz) = 54.0ns.
b. The read to an open bank requires 5 + 4 = 9 cycles of a 333MHz clock, or 27.0ns. In the case of a bank activate, this is 14 cycles, or 42.0ns. Including 20ns for miss processing on chip, this makes the two 42 + 20 = 61ns and 27.0 + 20 = 47ns. Including time on chip, the bank activate takes 61/47 = 1.30 or 30% longer.

2.15 The costs of the two systems are 2 × $130 + $800 = $1060 with the DDR2-667 DIMM and 2 × $100 + $800 = $1000 with the DDR2-533 DIMM. The latency to service a level-2 miss is 14 × (1/333MHz) = 42ns 80% of the time and 9 × (1/333MHz) = 27ns 20% of the time with the DDR2-667 DIMM. It is 12 × (1/266MHz) = 45ns (80% of the time) and 8 × (1/266MHz) = 30ns (20% of the time) with the DDR2-533 DIMM. The CPI added by the level-2 misses in the case of DDR2-667 is 0.00333 × 42 × 0.8 + 0.00333 × 27 × 0.2 = 0.130, giving a total of 1.5 + 0.130 = 1.63. Meanwhile the CPI added by the level-2 misses for DDR2-533 is 0.00333 × 45 × 0.8 + 0.00333 × 30 × 0.2 = 0.140, giving a total of 1.5 + 0.140 = 1.64. Thus the drop is only 1.64/1.63 = 1.006, or 0.6%, while the cost is $1060/$1000 = 1.06, or 6.0% greater. The cost-performance of the DDR2-667 system is 1.63 × 1060 = 1728 while the cost-performance of the DDR2-533 system is 1.64 × 1000 = 1640, so the DDR2-533 system is a better value.

2.16 The cores will be executing 8 cores × 3GHz / 2.0 CPI = 12 billion instructions per second. This will generate 12 × 0.00667 = 80 million level-2 misses per second. With the burst length of 8, this would be 80 × 32B = 2560MB/sec. If the memory bandwidth is sometimes 2X this, it would be 5120MB/sec. From Figure 2.14, this is just barely within the bandwidth provided by DDR2-667 DIMMs, so just one memory channel would suffice.

2.17 a. The system built from 1Gb DRAMs will have twice as many banks as the system built from 2Gb DRAMs. Thus the 1Gb-based system should provide higher performance, since it can have more banks simultaneously open.
b. The power required to drive the output lines is the same in both cases, but the system built with the x4 DRAMs would require activating banks on 18 DRAMs, versus only 9 DRAMs for the x8 parts. The page size activated on each x4 and x8 part is the same, and takes roughly the same activation energy. Thus, since there are fewer DRAMs being activated in the x8 design option, it would have lower power.
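Exercise 2.15 trades a small CPI difference against a small price difference. The C sketch below reproduces that comparison using the DIMM prices, miss latencies, miss rate, and base CPI quoted in the answer; the struct layout and the cost-times-CPI metric (lower is better) simply mirror the answer's bookkeeping.

    #include <stdio.h>

    struct dimm {
        const char *name;
        double system_cost;    /* two DIMMs plus the rest of the system, in dollars */
        double miss_ns_act;    /* level-2 miss latency when a bank activate is needed */
        double miss_ns_open;   /* level-2 miss latency on an already-open bank */
    };

    int main(void) {
        const double base_cpi = 1.5, misses_per_instr = 0.00333;
        const double frac_activate = 0.8, frac_open = 0.2;
        struct dimm options[2] = {
            { "DDR2-667", 2 * 130.0 + 800.0, 42.0, 27.0 },
            { "DDR2-533", 2 * 100.0 + 800.0, 45.0, 30.0 },
        };
        for (int i = 0; i < 2; i++) {
            double avg_ns = frac_activate * options[i].miss_ns_act + frac_open * options[i].miss_ns_open;
            double cpi = base_cpi + misses_per_instr * avg_ns;   /* added CPI, as computed in the answer */
            printf("%s: CPI %.2f, cost $%.0f, CPI x cost %.0f (lower is better)\n",
                   options[i].name, cpi, options[i].system_cost, cpi * options[i].system_cost);
        }
        return 0;
    }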
2.18 a. With policy 1:
Precharge delay Trp = 5 × (1/333 MHz) = 15ns
Activation delay Trcd = 5 × (1/333 MHz) = 15ns
Column select delay Tcas = 4 × (1/333 MHz) = 12ns
Access time when there is a row buffer hit: Th = (r/100) × (Tcas + Tddr)
Access time when there is a miss: Tm = ((100 - r)/100) × (Trp + Trcd + Tcas + Tddr)
With policy 2: Access time = Trcd + Tcas + Tddr
If A is the total number of accesses, the tip-off point will occur when the net access time with policy 1 is equal to the total access time with policy 2, i.e.,
(r/100) × (Tcas + Tddr) × A + ((100 - r)/100) × (Trp + Trcd + Tcas + Tddr) × A = (Trcd + Tcas + Tddr) × A
which gives r = 100 × Trp / (Trp + Trcd)
r = 100 × 15 / (15 + 15) = 50%
If r is less than 50%, then we have to proactively close a page to get the best performance; otherwise we can keep the page open.
b. The key benefit of closing a page is to hide the precharge delay Trp from the critical path. If the accesses are back to back, then this is not possible. This new constraint will not impact policy 1.
The new equations for policy 2:
Access time when we can hide the precharge delay = Trcd + Tcas + Tddr
Access time when the precharge delay is in the critical path = Trcd + Tcas + Trp + Tddr
The equation from part a will now become:
(r/100) × (Tcas + Tddr) × A + ((100 - r)/100) × (Trp + Trcd + Tcas + Tddr) × A = 0.9 × (Trcd + Tcas + Tddr) × A + 0.1 × (Trcd + Tcas + Trp + Tddr) × A
which gives r = 90 × Trp / (Trp + Trcd)
r = 90 × 15/30 = 45%
c. For any row buffer hit rate, policy 2 requires an additional r × (2 + 4) nJ per access. If r = 50%, then policy 2 requires 3nJ of additional energy.

2.19 Hibernating will be useful when the static energy saved in DRAM is at least equal to the energy required to copy from DRAM to Flash and then back to DRAM. DRAM dynamic energy to read/write is negligible compared to Flash and can be ignored.
Time = (8 × 10^9 × 2 × 2.56 × 10^-6) / (64 × 1.6) = 400 seconds
The factor 2 in the above equation is because, to hibernate and wake up, both Flash and DRAM have to be read and written once.

2.20 a. Yes. The application and production environment can be run on a VM hosted on a development machine.
b. Yes. Applications can be redeployed on the same environment on top of VMs running on different hardware. This is commonly called business continuity.
c. No. Depending on support in the architecture, virtualizing I/O may add significant or very significant performance overheads.
d. Yes. Applications running on different virtual machines are isolated from each other.
e. Yes. See "Devirtualizable virtual machines enabling general, single-node, online maintenance," David Lowell, Yasushi Saito, and Eileen Samberg, in the Proceedings of the 11th ASPLOS, 2004, pages 211-223.

2.21 a. Programs that do a lot of computation but have small memory working sets and do little I/O or other system calls.
b. The slowdown above was 60% for 10%, so 20% system time would run 120% slower.
c. The median slowdown using pure virtualization is 10.3, while for paravirtualization the median slowdown is 3.76.
d. The null call and null I/O call have the largest slowdown. These have no real work to outweigh the virtualization overhead of changing protection levels, so they have the largest slowdowns.
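The break-even row-buffer hit rate in 2.18a can be checked by evaluating both page policies directly. The C sketch below uses the 15ns, 15ns, and 12ns timings from the answer; the Tddr value is an assumption of mine and cancels out of the comparison, and the closed-form crossover printed at the end is the same r = 100 × Trp/(Trp + Trcd) expression derived above.

    #include <stdio.h>

    /* Average access time per request under the two DRAM page policies of 2.18a.
     * r is the row-buffer hit rate in percent; all times are in ns. */
    static double open_page_ns(double r, double trp, double trcd, double tcas, double tddr) {
        return (r / 100.0) * (tcas + tddr)
             + ((100.0 - r) / 100.0) * (trp + trcd + tcas + tddr);
    }

    static double closed_page_ns(double trcd, double tcas, double tddr) {
        return trcd + tcas + tddr;    /* the bank is always precharged ahead of time */
    }

    int main(void) {
        const double trp = 15.0, trcd = 15.0, tcas = 12.0, tddr = 3.0;  /* tddr assumed; it cancels out */
        for (double r = 30.0; r <= 70.0; r += 10.0)
            printf("hit rate %2.0f%% : open %.1f ns, closed %.1f ns\n",
                   r, open_page_ns(r, trp, trcd, tcas, tddr), closed_page_ns(trcd, tcas, tddr));
        printf("break-even hit rate: %.0f%%\n", 100.0 * trp / (trp + trcd));   /* 50% with these timings */
        return 0;
    }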
2.22 The virtual machine running on top of another virtual machine would have to emulate privilege levels as if it was running on a host without VT-x technology.

2.23 a. As of the date of the Computer paper, AMD-V adds more support for virtualizing virtual memory, so it could provide higher performance for memory-intensive applications with large memory footprints.
b. Both provide support for interrupt virtualization, but AMD's IOMMU also adds capabilities that allow secure virtual machine guest operating system access to selected devices.

2.24 Open hands-on exercise, no fixed solution.

2.25 a. These results are from experiments on a 3.3GHz Intel® Xeon® Processor X5680 with Nehalem architecture (Westmere at 32nm). The number of misses per 1K instructions of the L1 Dcache increases significantly, by more than 300X, when the input data size goes from 8KB to 64KB, and stays relatively constant around 300 per 1K instructions for all the larger data sets. Similar behavior, with different flattening points, is observed on the L2 and L3 caches.
b. The IPC decreases by 60%, 20%, and 66% when input data size goes from 8KB to 128KB, from 128KB to 4MB, and from 4MB to 32MB, respectively. This shows the importance of all caches. Among all three levels, the L1 and L3 caches are more important. This is because the L2 cache in the Intel® Xeon® Processor X5680 is relatively small and slow, with capacity being 256KB and latency being around 11 cycles.
c. For a recent Intel i7 processor (3.3GHz Intel® Xeon® Processor X5680), when the data set size is increased from 8KB to 128KB, the number of L1 Dcache misses per 1K instructions increases by around 300, and the number of L2 cache misses per 1K instructions remains negligible. With an 11-cycle miss penalty, this means that without prefetching or latency tolerance from out-of-order issue we would expect an extra 3300 cycles per 1K instructions due to L1 misses, which means an increase of 3.3 cycles per instruction on average. The measured CPI with the 8KB input data size is 1.37. Without any latency tolerance mechanisms we would expect the CPI of the 128KB case to be 1.37 + 3.3 = 4.67. However, the measured CPI of the 128KB case is 3.44. This means that memory latency hiding techniques such as OOO execution, prefetching, and non-blocking caches improve the performance by more than 26%.

Chapter 3 Solutions

Case Study 1: Exploring the Impact of Microarchitectural Techniques

3.1 The baseline performance (in cycles, per loop iteration) of the code sequence in Figure 3.48, if no new instruction's execution could be initiated until the previous instruction's execution had completed, is 40. See Figure S.2. Each instruction requires one clock cycle of execution (a clock cycle in which that instruction, and only that instruction, is occupying the execution units; since every instruction must execute, the loop will take at least that many clock cycles). To that base number, we add the extra latency cycles. Don't forget the branch shadow cycle.

Figure S.2 Baseline performance (in cycles, per loop iteration) of the code sequence in Figure 3.48:
    Loop: LD    F2,0(Rx)      1 + 4
          DIVD  F8,F2,F0      1 + 12
          MULTD F2,F6,F2      1 + 5
          LD    F4,0(Ry)      1 + 4
          ADDD  F4,F0,F4      1 + 1
          ADDD  F10,F8,F2     1 + 1
          ADDI  Rx,Rx,#8      1
          ADDI  Ry,Ry,#8      1
          SD    F4,0(Ry)      1 + 1
          SUB   R20,R4,Rx     1
          BNZ   R20,Loop      1 + 1
                              ----
          cycles per loop iter 40
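The 40-cycle figure in 3.1 is just one issue cycle plus the extra latency of each instruction in Figure S.2, summed. A small C tally of that, where the instruction list and latencies are copied from the figure and the struct layout is mine:

    #include <stdio.h>

    struct instr { const char *op; int extra_latency; };

    int main(void) {
        /* one entry per instruction in Figure S.2: 1 issue cycle + extra latency cycles */
        struct instr loop[] = {
            { "LD    F2,0(Rx)",   4 }, { "DIVD  F8,F2,F0", 12 }, { "MULTD F2,F6,F2", 5 },
            { "LD    F4,0(Ry)",   4 }, { "ADDD  F4,F0,F4",  1 }, { "ADDD  F10,F8,F2", 1 },
            { "ADDI  Rx,Rx,#8",   0 }, { "ADDI  Ry,Ry,#8",  0 }, { "SD    F4,0(Ry)",  1 },
            { "SUB   R20,R4,Rx",  0 }, { "BNZ   R20,Loop",  1 },   /* the +1 is the branch shadow */
        };
        int total = 0;
        for (unsigned i = 0; i < sizeof loop / sizeof loop[0]; i++)
            total += 1 + loop[i].extra_latency;
        printf("fully serialized cycles per loop iteration: %d\n", total);   /* 40 */
        return 0;
    }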
3.2 How many cycles would the loop body in the code sequence in Figure 3.48 require if the pipeline detected true data dependencies and only stalled on those, rather than blindly stalling everything just because one functional unit is busy? The answer is 25, as shown in Figure S.3. Remember, the point of the extra latency cycles is to allow an instruction to complete whatever actions it needs in order to produce its correct output. Until that output is ready, no dependent instructions can be executed. So the first LD must stall the next instruction for three clock cycles. The MULTD produces a result for its successor, and therefore must stall 4 more clocks, and so on.

Figure S.3 Number of cycles required by the loop body in the code sequence in Figure 3.48.

3.3 Consider a multiple-issue design. Suppose you have two execution pipelines, each capable of beginning execution of one instruction per cycle, and enough fetch/decode bandwidth in the front end so that it will not stall your execution. Assume results can be immediately forwarded from one execution unit to another, or to itself. Further assume that the only reason an execution pipeline would stall is to observe a true data dependency. Now how many cycles does the loop require? The answer is 22, as shown in Figure S.4. The LD goes first, as before, and the DIVD must wait for it through 4 extra latency cycles. After the DIVD comes the MULTD, which can run in the second pipe along with the DIVD, since there's no dependency between them. (Note that they both need the same input, F2, and they must both wait on F2's readiness, but there is no constraint between them.) The LD following the MULTD does not depend on the DIVD nor the MULTD, so had this been a superscalar-order-3 machine, that LD could conceivably have been executed concurrently with the DIVD and the MULTD. Since this problem posited a two-execution-pipe machine, the LD executes in the cycle following the DIVD/MULTD. The loop overhead instructions at the loop's bottom also exhibit some potential for concurrency because they do not depend on any long-latency instructions.

Figure S.4 Number of cycles required per loop:
    Execution pipe 0                ; Execution pipe 1
    Loop: LD   F2,0(Rx)             ; <nop>
          <stall for LD latency>    ; <nop>
          <stall for LD latency>    ; <nop>
          <stall for LD latency>    ; <nop>
          <stall for LD latency>    ; <nop>
          DIVD F8,F2,F0             ; MULTD F2,F6,F2
          LD   F4,0(Ry)             ; <nop>
          <stall for LD latency>    ; <nop>
          <stall for LD latency>    ; <nop>
          <stall for LD latency>    ; <nop>
          <stall for LD latency>    ; <nop>
          ADDD F4,F0,F4             ; <nop>
          <stall due to DIVD latency> ; <nop>
          <stall due to DIVD latency> ; <nop>
          <stall due to DIVD latency> ; <nop>
          <stall due to DIVD latency> ; <nop>
          <stall due to DIVD latency> ; <nop>
          <stall due to DIVD latency> ; <nop>
          ADDD F10,F8,F2            ; ADDI Rx,Rx,#8
          ADDI Ry,Ry,#8             ; SD F4,0(Ry)
          SUB  R20,R4,Rx            ; BNZ R20,Loop
          <nop>                     ; <stall due to BNZ>
    cycles per loop iter 22

3.4 Possible answers:
1. If an interrupt occurs between N and N + 1, then N + 1 must not have been allowed to write its results to any permanent architectural state. Alternatively, it might be permissible to delay the interrupt until N + 1 completes.
2. If N and N + 1 happen to target the same register or architectural state (say, memory), then allowing N to overwrite what N + 1 wrote would be wrong.
3. N might be a long floating-point op that eventually traps. N + 1 cannot be allowed to change architectural state in case N is to be retried.
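The 25-cycle answer in 3.2 can also be checked mechanically: with single issue in program order, each instruction starts one cycle after the previous one, or later if a source operand is not yet ready. The C sketch below models exactly that; the register names and latencies come from Figure S.2, while the bookkeeping (a result becomes usable 1 + extra-latency cycles after issue, and the final +1 is the branch shadow) is my reading of the answers above rather than the manual's own figure.

    #include <stdio.h>
    #include <string.h>

    struct instr { const char *dst, *src1, *src2; int extra; };

    /* Cycle at which 'reg' becomes usable, taking the latest-finishing earlier producer
     * (which, for this particular loop body, is also the most recent one). */
    static int ready_cycle(const char *reg, const struct instr *prog, const int *issue, int upto) {
        int r = 0;                              /* registers with no producer are ready at cycle 0 */
        for (int i = 0; i < upto; i++)
            if (prog[i].dst && strcmp(prog[i].dst, reg) == 0 && issue[i] + 1 + prog[i].extra > r)
                r = issue[i] + 1 + prog[i].extra;
        return r;
    }

    int main(void) {
        struct instr prog[] = {
            { "F2",  "Rx",  NULL, 4 },   /* LD    F2,0(Rx)  */
            { "F8",  "F2",  "F0", 12 },  /* DIVD  F8,F2,F0  */
            { "F2",  "F6",  "F2", 5 },   /* MULTD F2,F6,F2  */
            { "F4",  "Ry",  NULL, 4 },   /* LD    F4,0(Ry)  */
            { "F4",  "F0",  "F4", 1 },   /* ADDD  F4,F0,F4  */
            { "F10", "F8",  "F2", 1 },   /* ADDD  F10,F8,F2 */
            { "Rx",  "Rx",  NULL, 0 },   /* ADDI  Rx,Rx,#8  */
            { "Ry",  "Ry",  NULL, 0 },   /* ADDI  Ry,Ry,#8  */
            { NULL,  "F4",  "Ry", 1 },   /* SD    F4,0(Ry)  */
            { "R20", "R4",  "Rx", 0 },   /* SUB   R20,R4,Rx */
            { NULL,  "R20", NULL, 1 },   /* BNZ   R20,Loop  */
        };
        int n = sizeof prog / sizeof prog[0], issue[16];
        for (int i = 0; i < n; i++) {
            int c = (i == 0) ? 1 : issue[i - 1] + 1;          /* single issue, program order */
            if (prog[i].src1) { int r = ready_cycle(prog[i].src1, prog, issue, i); if (r > c) c = r; }
            if (prog[i].src2) { int r = ready_cycle(prog[i].src2, prog, issue, i); if (r > c) c = r; }
            issue[i] = c;
        }
        printf("cycles per loop iteration: %d\n", issue[n - 1] + 1);   /* +1 branch shadow -> 25 */
        return 0;
    }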
