Intel Memory: Key Technologies Explained

Independent Channel Mode

In Independent Channel Mode, the four channels can be populated in any order, with no matching requirements between channels. All channels must run at the same interface frequency, but individual channels may run with different DIMM timings (RAS latency, CAS latency, and so forth).

Lockstep Channel Mode

In Lockstep Channel Mode, each memory access is a 128-bit data access that spans Channel 0 and Channel 1, and Channel 2 and Channel 3. Lockstep Channel Mode is the only RAS mode that allows SDDC for x8 devices. It requires that Channel 0 and Channel 1, and Channel 2 and Channel 3, be populated identically with regard to size and organization. DIMM slot populations within a channel do not have to be identical, but the same DIMM slot location across Channel 0 and Channel 1, and across Channel 2 and Channel 3, must be populated the same.
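As a rough illustration of the 128-bit lockstep access described above, the sketch below (an assumed Python model, not controller logic) splits one 128-bit value into the two 64-bit halves that the paired channels would carry in the same cycle.

    # Assumed model of a lockstep transfer: a 128-bit access is split into two 64-bit
    # halves that are driven on the paired channels (0/1 or 2/3) in the same cycle.
    def lockstep_split(value_128):
        low_half = value_128 & ((1 << 64) - 1)   # half carried by Channel 0 (or Channel 2)
        high_half = value_128 >> 64              # half carried by Channel 1 (or Channel 3)
        return low_half, high_half

    def lockstep_join(low_half, high_half):
        return (high_half << 64) | low_half

    value = 0x0123456789ABCDEF_FEDCBA9876543210
    assert lockstep_join(*lockstep_split(value)) == value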

Mirrored Channel Mode

In Mirrored Channel Mode, the memory contents are mirrored between Channel 0 and Channel 2, and between Channel 1 and Channel 3. As a result of the mirroring, the total physical memory available to the system is half of what is populated. Mirrored Channel Mode requires that Channel 0 and Channel 2, and Channel 1 and Channel 3, be populated identically with regard to size and organization. DIMM slot populations within a channel do not have to be identical, but the same DIMM slot location across Channel 0 and Channel 2, and across Channel 1 and Channel 3, must be populated the same.
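The sketch below is a minimal Python model of this mirroring behavior; the MirroredPair class is an illustrative assumption, not chipset code. It shows why only half of the populated capacity is usable (every write goes to both channels of a pair) and how a read can fall back to the mirror.

    # Assumed model: Channel 0/Channel 2 form one mirror pair (likewise Channel 1/Channel 3).
    class MirroredPair:
        def __init__(self, words):
            self.primary = [0] * words      # e.g. Channel 0
            self.mirror = [0] * words       # e.g. Channel 2, identical size and organization

        def write(self, addr, value):
            self.primary[addr] = value
            self.mirror[addr] = value       # duplicated write: usable capacity is halved

        def read(self, addr, primary_ok=True):
            # primary_ok=False models an uncorrectable error on the primary channel
            return self.primary[addr] if primary_ok else self.mirror[addr]

    pair = MirroredPair(1024)
    pair.write(42, 0xABCD)
    print(hex(pair.read(42, primary_ok=False)))   # 0xabcd, served from the mirror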

Rank Sparing Mode

In Rank Sparing Mode, one rank serves as a spare for the other ranks on the same channel. The spare rank is held in reserve and is not available as system memory. The spare rank must have a memory capacity identical to or larger than all of the other ranks (the sparing source ranks) on the same channel. After sparing, the capacity of the sparing source rank is lost.

When memory sparing is enabled, the spare memory is not used during normal operation, which means the system cannot see that portion of the capacity. On each memory channel, one DIMM is left unused and reserved as the hot-spare memory. The chipset maintains a threshold for the number of corrected memory errors per unit of time. When the error count of the working memory reaches this threshold, the system begins double writes: one copy to the working memory and one to the spare. Once the system verifies that the two copies are consistent, the spare memory takes over for the working memory and the failing memory is disabled. In this way the spare replaces the failing memory and prevents the data loss or downtime that a memory failure would otherwise cause. The capacity of the spare must be greater than or equal to that of the largest DIMM on the same channel, so that it can absorb the full contents migrated from the failing memory.
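The Python sketch below models the sparing flow just described; the SPARE_THRESHOLD value and the SparedChannel class are assumptions made for illustration, not actual BIOS or chipset firmware.

    SPARE_THRESHOLD = 10    # hypothetical corrected-error count per unit time

    class SparedChannel:
        def __init__(self, ranks, spare_rank):
            self.ranks = ranks                      # rank_id -> list of words, visible to the OS
            self.spare = spare_rank                 # reserved rank, not counted as system memory
            self.error_counts = {r: 0 for r in ranks}

        def on_corrected_error(self, rank_id):
            """Called by the ECC logic each time an error on rank_id is corrected."""
            self.error_counts[rank_id] += 1
            if self.error_counts[rank_id] >= SPARE_THRESHOLD:
                self._fail_over(rank_id)

        def _fail_over(self, failing_rank):
            self.spare[:] = self.ranks[failing_rank]        # the "double write" copy phase
            assert self.spare == self.ranks[failing_rank]   # verify the two copies match
            self.ranks[failing_rank] = self.spare           # spare takes over
            print(f"rank {failing_rank} disabled; spare rank is now active")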

Memory Scrubbing

With typical ECC memory modules (as of 2008), single-bit errors can be corrected but multi-bit errors within the same word cannot. It is therefore important to check each memory location periodically, and often enough that a second bit error is unlikely to accumulate in a word before the first one has been corrected.

To avoid disturbing regular memory requests from the CPU and degrading performance, scrubbing is usually done only during idle periods. Because scrubbing consists of normal read and write operations, it increases memory power consumption compared with not scrubbing, so it is performed periodically rather than continuously. On many servers, the scrub period can be configured in the BIOS setup program.

Normal memory reads issued by the CPU or DMA devices are checked for ECC errors, but because of data locality they may be confined to a small range of addresses, leaving other memory locations untouched for very long periods. Those locations become vulnerable to accumulating more than one soft error, whereas scrubbing guarantees that the entire memory is checked within a bounded time.
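The Python sketch below models one patrol-scrub pass; the patrol_scrub function and its read_and_correct callback are illustrative assumptions rather than real memory-controller code. It simply visits every address once per scrub period so that no location goes unchecked for longer than that period.

    import time

    def patrol_scrub(addresses, read_and_correct, scrub_period_s=24 * 3600):
        """addresses: every ECC-protected location to cover in one pass.
        read_and_correct(addr): assumed callback that re-reads the word through the ECC
        logic and writes back the corrected value if a single-bit error was found."""
        interval = scrub_period_s / max(len(addresses), 1)  # spread the pass over the period
        for addr in addresses:
            read_and_correct(addr)
            time.sleep(interval)   # real hardware uses idle memory cycles instead of sleeping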

Key info: 1) Soft errors are an important reason for doing memory scrubbing.

2) Error detection and correction is the general theory that memory scrubbing relies on.

ECC Technology

In the early 1990s, memory subsystems used parity checking. Parity memory adds one extra bit to each byte (8 bits) for error detection: the monitoring routine in the BIOS adds up the data bits being written to memory and stores the result in the parity bit. For example, if a byte stores the value 10011110, the bits sum to an odd number (1+0+0+1+1+1+1+0 = 5), so a 1 is stored in the parity bit. When the CPU reads the data back, the monitoring routine adds the eight stored data bits again and compares the result with the parity bit. If the two differ, the system raises an error. Parity checking can only roughly detect memory errors; it has no ability to correct them.
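The short Python snippet below works through the same example; the parity_bit helper is named only for illustration.

    def parity_bit(byte):
        return bin(byte).count("1") % 2   # 1 when the data bits sum to an odd number

    data = 0b10011110                     # the byte from the example above: five 1 bits
    stored_parity = parity_bit(data)      # -> 1, as in the text

    # On read, the check is repeated; a mismatch signals an error but cannot locate or fix it.
    corrupted = data ^ 0b00000100         # flip a single bit
    print(parity_bit(corrupted) == stored_parity)   # False -> error detected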

Another memory error-correction technique is ECC (Error Correcting Code). It also works by adding extra bits to the original data bits; the added bits are used to reconstruct data that has gone bad. In the ECC scheme, if the data is N bytes wide, the number of added ECC bits is log2(N) + 5. For example, 64-bit (8-byte) data requires log2(8) + 5 = 8 ECC bits.
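As a quick check of this rule, the snippet below (illustrative Python only) evaluates log2(N) + 5 for a few common data widths.

    import math

    def ecc_bits(data_bytes):
        return int(math.log2(data_bytes)) + 5     # the rule quoted above: log2(N) + 5

    for width in (32, 64, 128):                   # data width in bits
        print(f"{width}-bit data -> {ecc_bits(width // 8)} ECC bits")
    # 32-bit -> 7, 64-bit -> 8 (the example in the text), 128-bit -> 9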

When a single stored bit is in error, the ECC scheme corrects it automatically. When two data bits are in error, the condition can be detected but not corrected; this behavior is usually called Single Error Correction / Double Error Detection (SEC/DED). If more than two data bits are wrong in one access, the SEC/DED scheme can no longer detect it and data integrity is compromised. In a memory subsystem built this way, when a multi-bit error is detected the system reports a fatal fault and then crashes.
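To make the SEC/DED behavior concrete, here is a small Python sketch of a Hamming-style SEC/DED code for 8 data bits (a 13-bit codeword). It is an illustrative toy, not the 72/64 code used by real memory controllers: a single-bit error is corrected, a double-bit error is detected but not corrected.

    DATA_POSITIONS = [p for p in range(1, 13) if p & (p - 1) != 0]  # non-power-of-two slots

    def secded_encode(data_bits):
        """Encode 8 data bits: Hamming parity at positions 1,2,4,8 plus an overall parity bit."""
        code = [0] * 13                           # index 0 = overall parity, 1..12 = Hamming slots
        for pos, bit in zip(DATA_POSITIONS, data_bits):
            code[pos] = bit
        for p in (1, 2, 4, 8):
            code[p] = sum(code[i] for i in range(1, 13) if i & p) % 2
        code[0] = sum(code[1:]) % 2               # overall parity enables double-error detection
        return code

    def secded_decode(code):
        """Return (status, data). status is 'ok', 'corrected' or 'double_error'."""
        code = list(code)
        syndrome = 0
        for p in (1, 2, 4, 8):
            if sum(code[i] for i in range(1, 13) if i & p) % 2:
                syndrome |= p
        overall_ok = sum(code) % 2 == 0
        if syndrome == 0 and overall_ok:
            status = "ok"
        elif not overall_ok:                      # exactly one bit flipped somewhere: correct it
            code[syndrome if syndrome else 0] ^= 1
            status = "corrected"
        else:                                     # syndrome set but overall parity matches: 2 errors
            status = "double_error"
        return status, [code[p] for p in DATA_POSITIONS]

    word = [1, 0, 0, 1, 1, 1, 1, 0]
    cw = secded_encode(word)
    cw[5] ^= 1                                    # single-bit error -> corrected
    print(secded_decode(cw))                      # ('corrected', [1, 0, 0, 1, 1, 1, 1, 0])
    cw[9] ^= 1                                    # second error -> detected, not corrected
    print(secded_decode(cw)[0])                   # 'double_error'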

X4/X8 SDDC (Single Device Data Correction)

As the density of RAM chips and the capacity of memory increase, the probability of memory errors increases as well. The SEC/DED memory scheme that was considered highly reliable a few years ago is no longer adequate, and memory architectures capable of correcting multi-bit errors have long been a goal pursued by many vendors.

The most severe RAM device failure is when all of its data bits go wrong at once. Correcting this kind of error must be addressed in the chip and system hardware architecture; it cannot be achieved through a software upgrade.

One ECC bit is added for every byte in memory to form ECC words. If the memory system's data width is 32 bytes (256 bits), the actual width of the stored data is 256 + 32 = 288 bits. At the same time, each data bit is placed into a separate ECC word.

Figure 1 shows how this approach works. The memory system is built from 4 DIMMs. The 32 bytes (256 bits) of data are divided into 4 ECC words, each containing 8 bytes (64 bits) of data plus 8 ECC bits. An ECC word is therefore 64 + 8 = 72 bits long, and the total stored width is 72 × 4 = 288 bits.

Figure 1: The principle of Chipkill memory error correction

The memory controller splits each ECC word into four 18-bit segments and stores them on the 4 DIMMs, one segment per DIMM. Each DIMM therefore holds 4 segments that come from different ECC words. The 18 bits of each segment are then stored in different RAM chips.

After this arrangement, each DRAM chip holds only one bit of any given ECC word. Even if a RAM chip fails and every bit it stores goes bad, that still causes at most a one-bit error in each ECC word. Because every ECC word has SEC/DED capability and can correct a single-bit error automatically, all of the data can be recovered.
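The Python sketch below checks one possible bit-to-chip mapping consistent with this description (the mapping itself is an assumption for illustration): with 4 ECC words of 72 bits spread over 4 DIMMs of 18 chips each, every chip ends up holding exactly one bit of each ECC word, so a whole-chip failure stays within SEC/DED's correction ability.

    from collections import defaultdict

    def chip_for_bit(word_index, bit_index):
        """Assumed layout: bit bit_index (0..71) of ECC word word_index (0..3) lands on
        DIMM (bit_index // 18), chip (bit_index % 18) within that DIMM."""
        return (bit_index // 18, bit_index % 18)

    words_on_chip = defaultdict(list)
    for word in range(4):                # 4 ECC words x 72 bits = 288 bits per access
        for bit in range(72):
            words_on_chip[chip_for_bit(word, bit)].append(word)

    # Every chip holds exactly 4 bits, one from each ECC word, so a dead chip corrupts
    # at most one bit per ECC word -- which SEC/DED can correct.
    assert all(sorted(words) == [0, 1, 2, 3] for words in words_on_chip.values())
    print(len(words_on_chip), "chips, each holding one bit of every ECC word")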

What is LR-DIMM, or LRDIMM?

Today, using RDIMMs, a typical server system can accommodate up to three quad-rank 16GB RDIMMs per processor. However, that same system can support up to nine quad-rank 16GB LRDIMMs per processor, pushing the memory capacity from 48GB to 144GB.

Load-reduced DIMM (LRDIMM) is another new memory technology in development. It replaces the register with a memory buffer chip (or chips) to minimize loading, with the goal of increasing overall server system memory capacity and speed.

(Picture: large rectangular memory buffer)

Before we dive deep into LR-DIMM, let's refresh some key facts about the DIMM. DIMM stands for Dual Inline Memory Module; it is the RAM we find in our desktop computers. It consists of a few black chips (ICs) on a small PCB and stores our files and data temporarily while the computer is on. "Dual inline" refers to the contact pins on both sides of the module, commonly called gold fingers. I used to put 4 of these sticks into my old 486 computer to reach the maximum allowable memory of 16MB. With the change of time and technology, I now find that I still have 4 sticks of memory in my new computer, but the total memory is 16GB instead. Not only has the memory capacity increased, the memory is now also about 200 times faster than the memory in my 486 computer.

Memory loading in a consumer computer

With the 16GB of DDR3-1333 memory in my computer, I can now play online games, stream movies, and render 3D graphics on the screen. It is just slick! Don't I need more? Yes: I would like to share my movies with 3 of my friends, pass picture files to my cousin in China, and watch the Space Shuttle launch. My appetite for memory never runs out. My next question is: why can't I put 6 sticks of memory into my computer?

Wait a minute! You just cannot keep adding memory into your computer without any penalty.

At 1333MHz (about 1.3GHz), noise becomes a problem, generally described as signal integrity or signal reflection issues. At some point, the accumulated noise in the system renders it inoperable.

At high frequency there is also something called the "loading factor". Each memory chip (IC) has input capacitance that tends to suppress high-frequency signals. Generally, each chip has about 3 to 5 pF of input capacitance. The more chips on the module, the more the accumulated capacitance weakens the signal, eventually to an inoperable state.

(Graphical illustration: Input capacitance increases the loading factor at high frequency)

To solve the loading-factor problem, PC designers introduced multiple memory channels: some dual channel and some triple channel, just to let you keep a large amount of memory in your computer.

Memory loading in a server computer

Server computers further complicate the issue. Since a server runs multiple applications simultaneously, it is best to keep all the necessary data active in memory all the time. That calls for a tremendous amount of memory running at high speed and high bandwidth. The question is how to achieve that.

Examining a regular memory module (unbuffered DIMM), we realize that different groups of input lines on the same module have different loading. For example, the address and control lines are connected to more chips in parallel than the data bus lines: the data lines usually connect to only 1 or 2 chips, while the address and control lines can connect to as many as 16 chips. The question, therefore, is whether we can add a logic driver (buffer) chip between the input and the address and control bus. Even better, this chip could also be used to line up all the address and control signals. A register chip is used to deliver exactly that function: it increases drive strength and keeps the bus signals aligned.

Registered DIMM: bandwidth and scalability

This register chip really does wonders. It keeps the signal strong and also synchronizes the timing between lines. Since the clock signal is repetitive, it can also be strengthened using a phase lock loop re-driver, also called a zero-delay clock driver, which regenerates the clock signal in time synchronization with the original clock. Using this method, several identical clock signals can be generated from the original clock source, multiplying the drive power of the original repetitive clock.

(Graphical illustration: Register DIMM Block Diagram)

Speed evolution limits the number of modules in a server system

The registered module kept the server industry going for years, until the increase in operating frequency once again limited system memory capacity (the number of modules per system).

There comes the FB-DIMM with serial input and parallel output

Intel invented the Fully Buffered DIMM to solve the above problem. It puts a big driver chip in the middle of the DIMM module. This buffer chip accepts a high-frequency serial signal input, converts it to parallel signals inside the chip, and re-drives the memory chips (DRAM) from there. Ideally, this approach reduces the number of physical signal lines at the input of the DIMM and therefore de-clutters the system wiring, while increasing the number of modules per system.

When FB-DIMM runs out of steam, LRDIMM comes into the picture.

While the Fully Buffered DIMM was originally a good idea, the industry soon found that it had implementation problems. First, the serial input frequency has to be 4 times the memory clock frequency, which pushes it into the microwave range and opens a whole new page of technical difficulties: the signal-weakening problem at high frequency becomes very hard to control. Besides, the higher serial input frequency also raises heat generation to an unacceptable level. Smart engineers soon announced an alternative approach, the LRDIMM.

(Graphical illustration: Block Diagram FB-DIMM vs LRDIMM)

LRDIMM is load-reduced, with high fan-out and bi-directional buffering: all lines are buffered. The LRDIMM (Load Reduced DIMM) works very similarly to a Registered DIMM. It buffers the address and control signals through register logic and re-drives the clock through a phase lock loop. The difference is that it also buffers the data lines through bi-directional drivers. This way, all the signal lines are truly "fully buffered", but in parallel fashion.

Pros and cons of LRDIMM, technical view

Through full buffering of all signals, the LRDIMM re-drive method lets you double the number of DIMMs in a system. With the addition of new quad-rank modules and dual-die chips, you can reach up to 16GB per channel with LRDIMM in today's systems. Together with the new 4-channel system construction, a total population of 8 modules and 64GB can be achieved. Since the serial approach is abandoned, the heat and power dissipation problems no longer exist.

However, I do see one shortfall of LRDIMM: data-line latency increases. Extra write/read turnaround time is required, and there will be systems with 2T and 3T practical latency. Luckily, most applications on a server perform consecutive reads and write infrequently, so this should not affect average system performance much.

(Picture of LRDIMM: LRDIMM enhances server system performance)

Cost and benefit analysis, financial view

Since LR-DIMM will be a JEDEC standard, it is widely supported by the industry. Cost will be driven down by volume production and multiple sources. Presently, many companies support the creation of the new standard, including DRAM manufacturers, logic buffer chip manufacturers, memory module designers, and infrastructure providers.

Conclusion

It looks like LR-DIMM will be the next-generation server module. Since it scales along the DRAM chip roadmap, it will cover the DDR3 generation and carry into DDR4. Beyond that, nobody knows whether bigger and better technology will surface. Micron is currently sampling an 8GB LRDIMM with select enablers; mass production of its 16GB LRDIMMs is expected to begin in 2010.

One thing is for sure: CST, Inc. will be here to support the testing with low-cost LRDIMM testers.
