mtcp协议栈

合集下载

什么是TCPIP协议栈？栈是什么意思？

什么是TCP/IP协议栈？栈是什么意思？TCP/IP协议叫做传输控制/网际协议，它是Internet国际互联网络的基础。

TCP/IP 是网络中使用的基本的通信协议。

虽然从名字上看TCP/IP包括两个协议，传输控制协议（TCP）和网际协议（IP），但TCP/IP实际上是一组协议，它包括上百个各种功能的协议，如：远程登录、文件传输和电子邮件等，而TCP协议和IP协议是保证数据完整传输的两个基本的重要协议。

通常说TCP/IP是Internet协议族，而不单单是TCP和IP。

TCP/IP协议的基本传输单位是数据包（datagram）,TCP协议负责把数据分成若干个数据包，并给每个数据包加上包头（就像给一封信加上信封），包头上有相应的编号，以保证在数据接收端能将数据还原为原来的格式，IP协议在每个包头上再加上接收端主机地址，这样数据找到自己要去的地方，如果传输过程中出现数据丢失、数据失真等情况，TCP协议会自动要求数据重新传输，并重新组包。

总之，IP协议保证数据的传输，TCP协议保证数据传输的质量。

TCP/IP 协议数据的传输基于TCP/IP协议的四层结构：应用层、传输层、网络层、接口层，数据在传输时每通过一层就要在数据上加个包头，其中的数据供接收端同一层协议使用，而在接收端，每经过一层要把用过的包头去掉，这样来保证传输数据的格式完全一致。

TCP/IP协议介绍TCP/IP的通讯协议这部分简要介绍一下TCP/IP的内部结构，为讨论与互联网有关的安全问题打下基础。

TCP/IP协议组之所以流行，部分原因是因为它可以用在各种各样的信道和底层协议（例如T1和X.25、以太网以及RS-232串行接口）之上。

确切地说，TCP/IP协议是一组包括TCP协议和IP协议，UDP（User Datagram Protocol）协议、ICMP（Internet Control Message Protocol）协议和其他一些协议的协议组。

MP-TCP

MP-TCP背景随着技术的发展许多设备具有了多个网络接口，而TCP依然是一个单线路的协议，在TCP的通信过程中发端和收端都不能随意变换地址。

我们可以利用多个网络接口的这一特性来改善性能和有效冗余。

例如：你的手机同时连接WIFI信号和3G信号的时候，如果WIFI 关掉，使用WIFI进行的TCP连接就会断开，而不能有效利用3G网络继续收发数据。

Multipath TCP可以在一条TCP链接中包含多条路径，避免上述问题出现。

MPTCP简介MPTCP允许在一条TCP链路中建立多个子通道。

当一条通道按照三次握手的方式建立起来后，可以按照三次握手的方式建立其他的子通道，这些通道以三次握手建立连接和四次握手解除连接。

这些通道都会绑定于MPTCP session，发送端的数据可以选择其中一条通道进行传输。

MP TCP是IETF（因特网工程项目组）正在标准化的一个工作。

它作为传统TCP 的一个扩展，是由IETF在2013年1月发布的一个规范。

具体可以参考rfc 6824.MPTCP的设计遵守以下两个原则：1.应用程序的兼容性，应用程序只要可以运行在TCP环境下，就可以在没有任何修改的情况下，运行于MPTCP环境。

2.网络的兼容性，MPTCP兼容其他协议。

MPTCP在协议栈中的位置如下所示：建立连接过程MPTCP的第一个子通道的建立遵守TCP的三次握手，唯一的区别是每次发送的报文段需要添加MP_CAPABLE的的TCP选项和一个安全用途的key。

左边客户端发送的第一个SYN包携带有客户端自身的KEY，右边发送SYN/ACK的时候携带了自身的KEY，而最后左边的客户端发送最后一个ACK的时候携带着双方的KEY。

MPTCP 中关于MP_CAPABLE的定义如下Subtype的定义如下：MPTCP在进行三次握手之后，客户端和服务端会进行地址信息的交换，让对方知道彼此未用的地址信息。

当客户端知道服务端的地址后就可以建立其他子路径。

轻量级嵌入式TCP／IP协议栈的设计

Ｉ转发；Ｐ支持ＩＭＰＣ协议；包括实验性扩展的ＵＰ包括阻Ｄ；塞控制，ＴＲＴ估算和快速恢复／快速转发的ＴＰ可选择的类Ｃ；似Ｂｒｅｙ的ｓｃｅＡＩ支持ＤＣ；ｅｌｋｅｏｋｔＰ；ＨＰ支持ＰＰＰ；以太网的
ＡＲＰ。
上一层无所不在的 “ 电子皮肤” 。要想实现嵌入式系统的
Ｉｔｎｔｎｒｅ网络化，ｅ就必须在嵌入式系统中实现ＴＰＩＣ倚介
ＴＰＩＣ／Ｐ协议层次结构如图１所示。
应用层
运输层网络层
ＩＭＰ、ＩＣＰ
这已经成为嵌入式应用领域和网络互联领域的研究热点。要实现嵌入式ＴＰＩ协议栈，最关键的一个问题就是，Ｃ／Ｐ
［ｂｔｃ］Ｔｅａｅｔｄｃｓｈｇｔｅｈｓｃ，ｓｒｅｅｔｄｒＴＰＰｔｋＩｇｅｐ￣ｓＯｈａｃｒｏＰＩａＡｓａｔｈｐｒｎｏｕｅｅｉ — ｉｔｔｋＩｄｃｂｓｈａａｄＣ￣ａ＋ｔｉｓｍｈｉｔｔｃｒｔｓｆＣ／ｓｃｒｐｉｒｔｌｈｗｇａｔｅｉｔｓｎｓｃｖｅｅｈａｅＴＰｔｋ
ｉｍｂｄｄｄｓｓｅａｄｄｓｇｔｏｓＡｏｍｐｒｓｎｂｔｅｔｎａｄＴＣＰ￣Ｐｓｃｎｇｔｗｅｇｔｎｅｅｅｙｔｍｎｅｉｎｍｅｈ．ｃｄａｉｏｅｗｅｎｓａｄｒｔｋａｄｌｈ— ｉｈａｉＴＣＰ／ＰｓｃｓｌｓｅＩｔｋｉｔｄ．ａｉ
总结了轻量级ＴＰＰ议栈和标准ＴＰＰＣ／协ＩＣｆ协议栈的区别。ｉ关蝴Ｉ：嵌入式系统；轻量级；ＴＰＰ协议栈Ｃ／Ｉ

计算机网络：TCPIP协议栈概述

计算机⽹络：TCPIP协议栈概述⽬录参考模型在⽹络刚刚被搞出来的年代，通常只有同⼀个⼚家⽣产的设备才能彼此通信，不同的⼚家的设备不能兼容。

这是因为没有统⼀的标准去要求不同的⼚家按照相同的⽅式进⾏通信，所以不同的⼚家都闭门造车。

为了解决这个问题，后来就产⽣出参考模型的概念。

参考模型是描述如何完成通信的概念模型，它指出了完成⾼效通信所需要的全部步骤，并将这些步骤划分为称之为“层”的逻辑组。

分层最⼤的优点是为上层隐藏下层的细节，即对于开发者来说，如果他们要开发或实现某⼀层的协议，则他们只需要考虑这⼀层的功能即可。

其它层都⽆需考虑，因为其它层的功能有其它层的协议来完成，上层只需要调⽤下层的接⼝即可。

参考模型的优点如下：1. 将⽹络通信过程划分为更⼩、更简单的组件，使得组件的开发、设计和排错更为⽅便；2. 通过标准化⽹络组件，让不同的⼚商能够协作开发；3. 定义了模型每层执⾏的功能，从⽽⿎励了⾏业标准化；4. 让不同类型的⽹络硬件和软件能够彼此通信；5. 避免让对⼀层的修改影响其它层，从⽽避免妨碍开发⼯作。

协议计算机⽹络中的数据交换必须遵守事先约定好的规则，这些规则明确规定了所交换的数据的格式以及有关的同步问题，⽹络协议 (network protocol)是为进⾏⽹络中的数据交换⽽建⽴的规则、标准或约定。

⽹络协议有 3 个要素：1. 语法：数据与控制信息的结构或格式；2. 语义：需要发出何种控制信息，完成何种动作以及做出何种响应；3. 同步：事件实现顺序的详细说明。

OSI 模型OSI 模型旨在以协议的形式帮助⼚商⽣产兼容的⽹络设备和软件，让不同⼚商的⽹络能够协同⼯作。

同时对于⽤户⽽⾔，OSI 能帮助不同的主机之间传输数据。

OSI 并⾮是具体的模型，⽽是⼀组指导原则，开发者以此为依据开发⽹络应⽤。

同时它也提供了框架，指导如何制定和实施⽹络标准、制造设备，以及制定⽹络互联的⽅案。

OSI 模型包含 7 层，上三层指定了终端中应⽤程序如何彼此通信，以及如何与⽤户交互，下四层指定了如何进⾏端到端数据传输。

协议栈是什么

协议栈是什么协议栈（Protocol Stack）是指一组按照特定顺序排列的通信协议的集合，它们按照层次结构组织，每一层负责特定的功能，从而实现数据在网络中的传输和交换。

在计算机网络中，协议栈是网络通信的基础，它定义了数据在网络中的传输格式、传输方式、错误检测和纠正等规则。

首先，协议栈通常由多个层次组成，每一层都有特定的功能和责任。

最常见的协议栈是TCP/IP协议栈，它由四个层次组成，应用层、传输层、网络层和数据链路层。

每一层都有自己的协议和规范，负责特定的功能。

应用层负责定义应用程序之间的通信规则，传输层负责端到端的数据传输，网络层负责数据在网络中的路由和转发，数据链路层负责数据在物理介质上传输。

其次，协议栈的设计遵循分层的原则，每一层的功能相对独立，各层之间通过接口进行通信，上层向下层提供服务，下层向上层提供支持。

这种设计使得协议栈具有良好的可扩展性和灵活性，可以根据实际需求对每一层进行修改和升级，而不会对整个系统造成影响。

另外，协议栈的工作方式是自底向上的。

当数据从应用程序发送出去时，经过每一层的处理和封装，最终在物理介质上传输；而当数据到达目的地后，经过每一层的解封装和处理，最终交给目标应用程序。

这种逐层处理的方式使得协议栈的工作更加清晰和有序，方便对每一层进行调试和排错。

最后，协议栈的作用是实现网络通信的可靠性和高效性。

通过协议栈的分层设计和逐层处理，可以保证数据在网络中的正确传输和交换，同时也能够提高网络的吞吐量和响应速度。

协议栈的标准化和普及，也为不同厂商的设备和系统之间的互联互通提供了基础。

总的来说，协议栈是网络通信的基础，它通过分层设计和逐层处理实现数据在网络中的传输和交换。

协议栈的设计遵循分层原则，具有良好的可扩展性和灵活性，工作方式是自底向上的，作用是实现网络通信的可靠性和高效性。

对于理解和应用计算机网络技术，掌握协议栈的原理和工作方式非常重要。

协议栈名词解释

协议栈名词解释
协议栈是指计算机网络中的一种通信体系结构，它将不同层级和功能
的通信协议分层处理，并进行相互协作，实现网络通信。

协议栈通常
由多层级协议组成，每一层级都拥有特定的用途和职责。

以下是协议
栈中常用的名词解释:
1. 物理层：物理层是协议栈的最低层级，它负责将比特流转化为信号，并通过传输介质在网络中传输。

2. 数据链路层：数据链路层在物理层之上，它的主要作用是将物理层
传输的比特流转化为数据帧，并进行数据帧的封装和解封装。

3. 网络层：网络层负责在多个数据链路层之间进行路由选择，并实现
数据包的传输。

4. 传输层：传输层是协议栈的核心层级，主要负责点到点的进程与进
程之间的通信，并实现可靠数据传输和数据流量控制。

5. 应用层：应用层是协议栈最高层级，它运行着应用程序，向用户提
供网络服务和应用服务。

6. TCP/IP协议栈：TCP/IP协议栈是因特网协议栈中最常用的协议栈，它包括四个层级：网络接口层、网络层、传输层和应用层。

7. OSI模型：OSI模型是国际标准化组织在1984年发布的一个网络通信体系结构标准。

它将通信协议分为七个层级：物理层、数据链路层、网络层、传输层、会话层、表示层和应用层。

协议栈的实现可以在硬件和软件两个层面进行，软件实现的协议栈通
常可以通过API提供给应用程序使用。

由于网络通信的复杂性和多样性，不同的协议栈应用于不同的场景中。

正确理解和熟练掌握协议栈
的概念和结构，对于网络通信的学习和实践具有非常重要的意义。

什么是协议栈

什么是协议栈协议栈是计算机网络中的一个重要概念，是网络通信中的关键组成部分。

它是一种软件设计模式，用于处理网络通信中的协议。

协议栈是一系列按照特定顺序组织的协议的集合，每个协议层负责特定的功能。

在计算机网络中，协议栈通常被分为七层，这是由国际标准化组织（ISO）制定的一种通信协议模型，称为OSI模型。

它包含以下七个层次：物理层、数据链路层、网络层、传输层、会话层、表示层和应用层。

物理层是协议栈的底层，负责传输比特流，包括硬件接口和电气信号传输。

数据链路层负责在两个相邻节点之间传输帧，包括对物理层传输的数据进行分组和检错等处理。

网络层负责在网络中传输数据包，包括网络互联和路由器之间的通信。

传输层负责在两个应用程序之间建立可靠的数据传输连接，包括错误纠正和数据包重新传送。

会话层负责建立和管理应用程序之间的会话，包括同步和数据分割等操作。

表示层负责将不同应用程序的数据格式转换为网络传输标准格式，包括数据加密和压缩等操作。

应用层是最高层，负责处理特定的网络应用，包括HTTP、FTP和SMTP等常见的应用协议。

协议栈的每一层都有特定的功能，并通过接口与上一层和下一层进行通信。

上层的数据会通过接口传输到下层，经过协议栈的每一层处理，最终到达目的地。

协议栈的设计目的是为了实现网络通信的高效性、可靠性和安全性。

通过分层的设计，可以将复杂的网络通信问题分解成较小的模块，使得网络通信更加可控和容易维护。

协议栈的实现方式有很多种，常见的有TCP/IP协议栈和UDP/IP协议栈。

其中，TCP/IP协议栈是Internet上最重要的协议栈，它由TCP和IP两个主要协议组成，负责在网络中跨越多个节点的数据传输。

总而言之，协议栈是计算机网络的基础设施，是实现网络通信的关键技术。

它通过分层的设计和不同层之间的协作，实现了高效、可靠和安全的数据传输。

在当前互联网时代，协议栈发挥着重要作用，为我们提供了快速的网络通信和丰富的网络应用。

计算机网络TCPIP协议栈概述

计算机网络TCPIP协议栈概述计算机网络是现代信息交流的重要基础，而协议则是实现网络通信的核心组成部分。

其中，TCPIP协议栈是目前最为广泛应用的网络协议栈之一。

本文将对TCPIP协议栈进行概述，介绍其基本结构和功能。

一、TCPIP协议栈简介TCPIP（Transmission Control Protocol/Internet Protocol）即传输控制协议/互联网协议，是互联网的核心协议。

其由四层构成，分别是网络接口层、网络层、传输层和应用层。

每一层都具有不同的功能和特点，协同工作以实现数据的传输和通信。

1.网络接口层网络接口层是TCPIP协议栈的最底层，负责处理物理连接。

它将数据按照帧的形式传输，并提供数据链路层的封装和解封装功能。

同时，网络接口层还包括网络接口卡（NIC）驱动程序和网卡等硬件设备。

2.网络层网络层是TCPIP协议栈的核心层，负责实现数据在网络中的传输。

它主要包括IP（Internet Protocol）协议，用于在互联网上定位和传输数据包。

网络层还包括路由功能，通过选择最佳路径将数据包从发送者传递到接收者。

3.传输层传输层是实现端到端通信的关键层，它为上层应用提供可靠的数据传输服务。

最常用的传输层协议是TCP（Transmission Control Protocol）和UDP（User Datagram Protocol）。

TCP提供可靠的连接服务，保证数据的顺序和完整性；而UDP则提供无连接服务，适用于实时通信和对传输可靠性要求不高的场景。

4.应用层应用层是TCPIP协议栈的最高层，它提供各种应用程序的服务。

常见的应用层协议有HTTP（Hypertext Transfer Protocol）用于网页浏览、FTP（File Transfer Protocol）用于文件传输、SMTP（Simple Mail Transfer Protocol）用于电子邮件传输等。

应用层协议是用户与网络交互的界面，它们通过调用传输层提供的服务实现数据的传输和通信。

1、下载文档前请自行甄别文档内容的完整性，平台不提供额外的编辑、内容补充、找答案等附加服务。
2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
3、如文档侵犯您的权益，请联系客服反馈,我们会尽快为您处理(人工客服工作时间：9:00-18:30)。

mTCP:A Highly Scalable User-level TCP Stack for Multicore SystemsEunYoung Jeong,Shinae Woo,Muhammad Jamshed,Haewon JeongSunghwan Ihm*,Dongsu Han,and KyoungSoo ParkKAIST*Princeton UniversityAbstractScaling the performance of short TCP connections on multicore systems is fundamentally challenging.Although many proposals have attempted to address various short-comings,inefﬁciency of the kernel implementation still persists.For example,even state-of-the-art designs spend 70%to80%of CPU cycles in handling TCP connections in the kernel,leaving only small room for innovation in the user-level program.This work presents mTCP,a high-performance user-level TCP stack for multicore systems.mTCP addresses the inefﬁciencies from the ground up—from packet I/O and TCP connection management to the application inter-face.In addition to adopting well-known techniques,our design(1)translates multiple expensive system calls into a single shared memory reference,(2)allows efﬁcientﬂow-level event aggregation,and(3)performs batched packet I/O for high I/O efﬁciency.Our evaluations on an8-core machine showed that mTCP improves the performance of small message transactions by a factor of25compared to the latest Linux TCP stack and a factor of3compared to the best-performing research system known so far.It also improves the performance of various popular applications by33%to320%compared to those on the Linux stack. 1IntroductionShort TCP connections are becoming widespread.While large content transfers(e.g.,high-resolution videos)con-sume the most bandwidth,short“transactions”1dominate the number of TCPﬂows.In a large cellular network,for example,over90%of TCPﬂows are smaller than32KB and more than half are less than4KB[45].Scaling the processing speed of these short connec-tions is important not only for popular user-facing on-line services[1,2,18]that process small messages.It is 1We refer to a request-response pair as a transaction.These transac-tions are typically small in size.also critical for backend systems(e.g.,memcached clus-ters[36])and middleboxes(e.g.,SSL proxies[32]and redundancy elimination[31])that must process TCP con-nections at high speed.Despite recent advances in soft-ware packet processing[4,7,21,27,39],supporting high TCP transaction rates remains very challenging.For exam-ple,Linux TCP transaction rates peak at about0.3million transactions per second(shown in Section5),whereas packet I/O can scale up to tens of millions packets per second[4,27,39].Prior studies attribute the inefﬁciency to either the high system call overhead of the operating system[28,40,43] or inefﬁcient implementations that cause resource con-tention on multicore systems[37].The former approach drastically changes the I/O abstraction(e.g.,socket API) to amortize the cost of system calls.The practical lim-itation of such an approach,however,is that it requires signiﬁcant modiﬁcations within the kernel and forces ex-isting applications to be re-written.The latter one typically makes incremental changes in existing implementations and,thus,falls short in fully addressing the inefﬁciencies. In this paper,we explore an alternative approach that de-livers high performance without requiring drastic changes to the existing code base.In particular,we take a clean-slate approach to assess the performance of an untethered design that divorces the limitation of the kernel implemen-tation.To this end,we build a user-level TCP stack from the ground up by leveraging high-performance packet I/O libraries that allow applications to directly access the packets.Our user-level stack,mTCP,is designed for three explicit goals:1.Multicore scalability of the TCP stack.2.Ease of use(i.e.,application portability to mTCP).3.Ease of deployment(i.e.,no kernel modiﬁcations). Implementing TCP in the user level provides many opportunities.In particular,it can eliminate the expen-sive system call overhead by translating syscalls into inter-process communication(IPC).However,it also in-Accept queueConn.Locality Socket API Event Handling Packet I/OApplication Mod-iﬁcation KernelModiﬁcation PSIO [12],No TCP stack BatchedDPDK [4],No interface for NoPF RING [7],transport layer (NIC driver)netmap [21]Linux-2.6Shared None BSD socket Syscalls Per packet Transparent No Linux-3.9Per-core None BSD socket Syscalls Per packet Add optionNo SO REUSEPORTAfﬁnity-Accept [37]Per-core Yes BSD socket Syscalls Per packet Transparent Yes MegaPipe [28]Per-core Yes lwsocket Batched syscalls Per packet Event model to Yes completion I/O FlexSC [40],Shared None BSD socket Batched syscalls Per packet Change to use Yes VOS [43]new APImTCPPer-coreYesUser-level socketBatched function callsBatchedSocket API to mTCP APINo(NIC driver)Table 1:Comparison of the beneﬁts of previous work and mTCP.troduces fundamental challenges that must be addressed—processing IPC messages,including shared memory mes-sages,involve context-switches that are typically much more expensive than the system calls themselves [3,29].Our key approach is to amortize the context-switch overhead over a batch of packet-level and socket-level events.While packet-level batching [27]and system-call batching [28,40,43](including socket-level events)have been explored individually,integrating the two requires a careful design of the networking stack that translates packet-level events to socket-level events and vice-versa.This paper makes two key contributions:First,we demonstrate that signiﬁcant performance gain can be obtained by integrating packet-and socket-level batching.In addition,we incorporate all known optimizations,such as per-core listen sockets and load balancing of concurrent ﬂows on multicore CPUs with receive-side scaling (RSS).The resulting TCP stack out-performs Linux and MegaPipe [28]by up to 25x (w/o SO_REUSEPORT )and 3x,respectively,in handling TCP transactions.This directly translates to application per-formance;mTCP increases existing applications’perfor-mance by 33%(SSLShader)to 320%(lighttpd).Second,unlike other designs [23,30],we show that such integration can be done purely at the user level in a way that ensures ease of porting without requiring signiﬁcant modiﬁcations to the kernel.mTCP provides BSD-like socket and epoll-like event-driven interfaces.Migrating existing event-driven applications is easy since one simply needs to replace the socket calls to their counterparts in mTCP (e.g.,accept()becomes mtcp_accept())and use the per-core listen socket.2Background and MotivationWe ﬁrst review the major inefﬁciencies in existing TCP implementations and proposed solutions.We then discuss our motivation towards a user-level TCP stack.2.1Limitations of the Kernel’s TCP StackRecent studies proposed various solutions to address four major inefﬁciencies in the Linux TCP stack:lack of con-nection locality,shared ﬁle descriptor space,inefﬁcient packet processing,and heavy system call overhead [28].Lack of connection locality:Many applications are multi-threaded to scale their performance on multicore systems.However,they typically share a listen socket that accepts incoming connections on a well-known port.As a result,multiple threads contend for a lock to access the socket’s accept queue,resulting in a signiﬁcant perfor-mance degradation.Also,the core that executes the kernel code for handling a TCP connection may be different from the one that runs the application code that actually sends and receives data.Such lack of connection locality intro-duces additional overhead due to increased CPU cache misses and cache-line sharing [37].Afﬁnity-Accept [37]and MegaPipe [28]address this issue by providing a local accept queue in each CPU core and ensuring ﬂow-level core afﬁnity across the kernel and application thread.Recent Linux kernel (3.9.4)also partly addresses this by introducing the SO_REUSEPORT [14]op-tion,which allows multiple threads/processes to bind to the same port number.Shared ﬁle descriptor space:In POSIX-compliant op-erating systems,the ﬁle descriptor (fd)space is shared within a process.For example,Linux searches for the min-imum available fd number when allocating a new socket.In a busy server that handles a large number of concurrent connections,this incurs signiﬁcant overhead due to lock contention between multiple threads [20].The use of ﬁle descriptors for sockets,in turn,creates extra overhead of going through the Linux Virtual File System (VFS),a pseudo-ﬁlesystem layer for supporting common ﬁle op-erations.MegaPipe eliminates this layer for sockets by explicitly partitioning the fd space for sockets and regular ﬁles [28].libraries[4,7,27,39]address these problems,these li-braries do not provide a full-ﬂedged TCP stack,and not all optimizations are incorporated into the kernel. System call overhead:The BSD socket API requires frequent user/kernel mode switching when there are many short-lived concurrent connections.As shown in FlexSC[40]and VOS[43],frequent system calls can result in processor state(e.g.,top-level caches,branch prediction table,etc.)pollution that causes performance penalties.Previous solutions propose system call batch-ing[28,43]or efﬁcient system call scheduling[40]to amortize the cost.However,it is difﬁcult to readily apply either approach to existing applications since they often require user and/or kernel code modiﬁcation due to the changes to the system call interface and/or its semantics. Table1summarizes the beneﬁts provided by previous work compared to a vanilla Linux kernel.Note that there is not a single system that provides all of the beneﬁts. 2.2Why User-level TCP?While many previous designs have tried to scale the per-formance of TCP in multicore systems,few of them truly overcame the aforementioned inefﬁciencies of the kernel. This is evidenced by the fact that even the best-performing system,MegaPipe,spends a dominant portion of CPU cycles(∼80%)inside the kernel.Even more alarming is the fact that these CPU cycles are not utilized efﬁciently; according to our own measurements,Linux spends more than4x the cycles(in the kernel and the TCP stack com-bined)than mTCP does while handling the same number of TCP transactions.To reveal the signiﬁcance of this problem,we proﬁle the server’s CPU usage when it is handling a large number of concurrent TCP transactions(8K to48K concurrent TCP connections).For this experiment,we use a simple web server(lighttpd v1.4.32[8])running on an8-core Intelcycle in the kernel(including TCP/IP and I/O)across four lighttpd versions.Xeon CPU(2.90GHz,E5-2690)with32GB of memory and a10Gbps NIC(Intel82599chipsets).Our clients use ab v2.3[15]to repeatedly download a64Bﬁle per connection.Multiple clients are used in our experiment to saturate the CPU utilization of the server.Figure1shows the breakdown of CPU usage comparing four versions of the lighttpd server:a multithreaded version that harnesses all8CPU cores on Linux2.6.32and3.10.122(Linux),a version ported to MegaPipe3(MegaPipe),and a version using mTCP,our user-level TCP stack,on Linux2.6.32 (mTCP).Note that MegaPipe adopts all recent optimiza-tions such as per-core accept queues andﬁle descriptor space,as well as user-level system call batching,but reuses the existing kernel for packet I/O and TCP/IP processing. Our results indicate that Linux and MegaPipe spend 80%to83%of CPU cycles in the kernel which leaves only a small portion of the CPU to user-level applications. Upon further investigation,weﬁnd that lock contention for shared in-kernel data structures,buffer management, and frequent mode switch are the main culprits.This implies that the kernel,including its stack,is the major bottleneck.Furthermore,the results in Figure2show that the CPU cycles are not spent efﬁciently in Linux and MegaPipe.The bars indicate the relative number of transactions processed per each CPU cycle inside the kernel and the TCP stack(e.g.,outside the application), normalized by the performance of Linux2.6.32.Weﬁnd that mTCP uses the CPU cycles4.3times more effectively than Linux.As a result,mTCP achieves3.1x and1.8x the performance of Linux2.6and MegaPipe,respectively, while using fewer CPU cycles in the kernel and the TCP stack.Now,the motivation of our work is clear.Can we de-sign a user-level TCP stack that incorporates all existing optimizations into a single system and achieve all beneﬁts that individual systems have provided in the past?How much of a performance improvement can we get if we build such a system?Can we bring the performance of existing packet I/O libraries to the TCP stack?2This is the latest Linux kernel version as of this writing.3We use Linux3.1.3for MegaPipe due to its patch availability.the user er-level TCP is attractive for many rea-sons.First,it allows us to easily depart from the kernel’s complexity.In particular,due to shared data structures and various semantics that the kernel has to support(e.g., POSIX and VFS),it is often difﬁcult to separate the TCP stack from the rest of the kernel.Furthermore,it allows us to directly take advantage of the existing optimiza-tions in the high-performance packet I/O library,such as netmap[39]and Intel DPDK[4].Second,it allows us to apply batch processing as theﬁrst principle,harnessing the ideas in FlexSC[40]and VOS[43]without extensive kernel modiﬁcations.In addition to performing batched packet I/O,the user-level TCP naturally collects multiple ﬂow-level events to and from the user application(e.g., connect()/accept()and read()/write()for differ-ent connections)without the overhead of frequent mode switching in system calls.Finally,it allows us to easily preserve the existing application programming interface. Our TCP stack is backward-compatible in that we provide a BSD-like socket interface.3DesignThe goal of mTCP is to achieve high scalability on mul-ticore systems while maintaining backward compatibil-ity to existing multi-threaded,event-driven applications. Figure3presents an overview of our system.At the high-est level,applications link to the mTCP library,which provides a socket API and an event-driven programming interface for backward compatibility.The two underlying components,user-level TCP stack and packet I/O library, are responsible for achieving high scalability.Our user-level TCP implementation runs as a thread on each CPU core within the same application process.The mTCP thread directly transmits and receives packets to and from the NIC using our custom packet I/O library.Existing user-level packet libraries only allow one application to access an NIC port.Thus,mTCP can only support one application per NIC port.However,we believe this can be addressed in the future using virtualized network inter-faces(more details in Section3.3).Applications can stillthe existing TCP stack,provided thatare not used by mTCP.ﬁrst present the design of mTCP’scomponents in Sections3.1the API and semantics thatapplications in Section3.3.Packet I/O Libraryallow high-speed packet I/Ofrom a user-level application[4,7,are not suitable for implementing atheir interface is mainly based onsigniﬁcantly waste precious CPU cy-cles that can potentially beneﬁt the applications.Further-more,our system requires efﬁcient multiplexing between TX and RX queues from multiple NICs.For example,we do not want to block a TX queue while sending a data packet when a control packet is waiting to be received. This is because if we block the TX queue,important con-trol packets,such as SYN or ACK,may be dropped,re-sulting in a signiﬁcant performance degradation due to retransmissions.To address these challenges,mTCP extends the Pack-etShader I/O engine(PSIO)[27]to support an efﬁcient event-driven packet I/O interface.PSIO offers high-speed packet I/O by utilizing RSS that distributes incoming pack-ets from multiple RX queues by theirﬂows,and provides ﬂow-level core afﬁnity to minimize the contention among the CPU cores.On top of PSIO’s high-speed packet I/O, the new event-driven interface allows an mTCP thread to efﬁciently wait for events from RX and TX queues from multiple NIC ports at a time.The new event-driven interface,ps_select(),works similarly to select()except that it operates on TX/RX queues of interested NIC ports for packet I/O.For exam-ple,mTCP speciﬁes the interested NIC interfaces for RX and/or TX events with a timeout in microseconds,and ps_select()returns immediately if any event of interest is available.If such an event is not detected,it enables the interrupts for the RX and/or TX queues and yields the thread context.Eventually,the interrupt handler in the driver wakes up the thread if an I/O event becomes available or the timeout expires.ps_select()is also similar to the select()/poll()interface supported by netmap[39].However,unlike netmap,we do not integrate this with the general-purpose event system in Linux to avoid its overhead.The use of PSIO brings the opportunity to amortize the overhead of system calls and context switches throughout the entire system,in addition to eliminating the per-packet memory allocation and DMA overhead.In PSIO,packets are received and transmitted in batches[27],amortizing the cost of expensive PCIe operations,such as DMA ad-dress mapping and IOMMU lookups.thread.This“zero-thread TCP”could potentially provide the best performance since this translates costly system calls into light-weight user-level function calls.However, the fundamental limitation of this approach is that the correctness of internal TCP processing depends on the timely invocation of TCP functions from the application. In mTCP,we choose to create a separate TCP thread to avoid such an issue and to minimize the porting effort for existing applications.Figure4shows how mTCP interacts with the application thread.The application uses mTCP library functions that communicate with the mTCP thread via shared buffers.The access to the shared buffers is granted only through the library functions,which allows safe sharing of the internal TCP data.When a library function needs to modify the shared data,it simply places a request(e.g.,write()request)to a job queue.This way, multiple requests from differentﬂows can be piled to the job queue at each loop,which are processed in batch when the mTCP thread regains the CPU.Flow events from the mTCP thread(e.g.,new the CPU core.Flow events from the mTCP thread(e.g.,new connections,new data arrival, etc.)are delivered in a similar wayThis,however,requires additional overhead of manag-ing concurrent data structures and context switch between the application and the mTCP thread.Such cost is un-fortunately not negligible,typically much larger than the system call overhead[29].One measurement on a recent Intel CPU shows that a thread context switch takes19 times the duration of a null system call[3].In this section,we describe how mTCP addresses these challenges and achieves high scalability with the user-level TCP stack.Weﬁrst start from how mTCP processes TCP packets in Section3.2.1,then present a set of key optimizations we employ to enhance its performance in Sections3.2.2,3.2.3,and3.2.4.3.2.1Basic TCP ProcessingWhen the mTCP thread reads a batch of packets from the NIC’s RX queue,mTCP passes them to the TCP packet ﬂow hash table.As in Figure5,if a server side receives an ACK for its SYN/ACK packet(1),the tcb for the new connection will be enqueued to an accept queue(2),and a read event is generated for the listening socket(3).If a new data packet arrives,mTCP copies the payload to the socket’s read buffer and enqueues a read event to an internal event queue.mTCP also generates an ACK packet and keeps it in the ACK list of a TX manager until it is written to a local TX queue.After processing a batch of received packets,mTCP ﬂushes the queued events to the application event queue (4)and wakes up the application by signaling it.When the application wakes up,it processes multiple events in a single event loop(5),and writes responses from multiple ﬂows without a context switch.Each socket’s write() call writes data to its send buffer(6),and enqueues its tcb to the write queue(7).Later,mTCP collects the tcb s that have data to send,and puts them into a send list(8). Finally,a batch of outgoing packets from the list will be sent by a packet I/O system call,transmitting them to the NIC’s TX queue.3.2.2Lock-free,Per-core Data StructuresTo minimize inter-core contention between the mTCP threads,we localize all resources(e.g.,ﬂow pool,socket buffers,etc.)in each core,in addition to using RSS for ﬂow-level core afﬁnity.Moreover,we completely elimi-nate locks by using lock-free data structures between the application and mTCP.On top of that,we also devise an efﬁcient way of managing TCP timer operations. Thread mapping andﬂow-level core afﬁnity:We pre-serveﬂow-level core afﬁnity in two stages.First,the packet I/O layer ensures to evenly distribute TCP con-nection workloads across available CPU cores with RSS. This essentially reduces the TCP scalability problem to each core.Second,mTCP spawns one TCP thread for each application thread and co-locates them in the same physical CPU core.This preserves the core afﬁnity ofpacket andﬂow processing,while allowing them tothe same CPU cache without cache-line sharing. Multi-core and cache-friendly data structures: keep most data structures,such as theﬂow hash socket id manager,and the pool of tcb and socket local to each TCP thread.This signiﬁcantly reduces sharing across threads and CPU cores,and achieves parallelism.When a data structure must be shared threads(e.g.,between mTCP and the applicationwe keep all data structures local to each core and lock-free data structures by using a single-producer single-consumer queue.We maintain write,connect, close queues,whose requests go from the mTCP,and an accept queue where new connections delivered from mTCP to the application.In addition,we keep the size of frequentlydata structures small to maximize the beneﬁt of the CPU cache,and make them aligned with the size of a CPU cache line to prevent any false sharing.For example,we divide tcb into two parts where theﬁrst-level structure holds64bytes of the most frequently-accessedﬁelds and two pointers to next-level structures that have128and192 bytes of receive/send-related variables,respectively. Lastly,to minimize the overhead of frequent memory allocation/deallocation,we allocate a per-core memory pool for tcb s and socket buffers.We also utilize huge pages to reduce the TLB misses when accessing the tcb s. Because their access pattern is essentially random,it often causes a large number of TLB misses.Putting the memory pool of tcb s and a hash table that indexes them into huge pages reduces the number of TLB misses.Efﬁcient TCP timer management:TCP requires timer operations for retransmission timeouts,connections in the TIME WAIT state,and connection keep-alive checks. mTCP provides two types of timers:one managed by a sorted list and another built with a hash table.For coarse-grained timers,such as managing connections in the TIME W AIT state and connection keep-alive check, we keep a list of tcb s sorted by their timeout values.Ev-ery second,we check the list and handle any tcb s whose timers have expired.Note that keeping the list sorted is trivial since a newly-added entry should have a strictly larger timeout than any of those that are already in the list.Forﬁne-grained retransmission timers,we use the remaining time(in milliseconds)as the hash table index, and process all tcb s in the same bucket when a timeout ex-pires for the bucket.Since retransmission timers are used by virtually all tcb s whenever a data(or SYN/FIN)packet is sent,keeping a sorted list would consume a signiﬁcant amount of CPU cycles.Suchﬁne-grained event batch processing with millisecond granularity greatly reduces the overhead.ets in batch,mTCP processes them to generate a batch ofﬂow-level events.These events are then passed up to the application,as illustrated in Figure6.The TX direc-tion works similarly,as the mTCP library transparently batches the write events into a write queue.While the idea of amortizing the system call overhead using batches is not new[28,43],we demonstrate that beneﬁts similar to that of batched syscalls can be effectively achieved in user-level TCP.In our experiments with8RX/TX queues per10Gbps port,the average number of events that an mTCP thread generates in a single scheduling period is about2,170 for both TX and RX directions(see Section5.1).This ensures that the cost of a context switch is amortized over a large number of events.Note the fact that the use of multiple queues does not decrease the number of the events processed in a batch.3.2.4Optimizing for Short-lived ConnectionsWe employ two optimizations for supporting many short-lived concurrent connections.Priority-based packet queueing:For short TCP con-nections,the control packets(e.g.,SYN and FIN)have a critical impact on the performance.Since the control pack-ets are mostly small-sized,they can often be delayed for a while when they contend for an output port with a large number of data packets.We prioritize control packets by keeping them in a separate list.We maintain three kinds of lists for TX as shown in Figure5.First,a control list contains the packets that are directly related to the state of a connection such as SYN,SYN/ACK,and ACK,or FIN and FIN/ACK.We then manage ACKs for incoming data packets in an ACK list.Finally,we keep a data list to send data in the socket buffers of TCPﬂows.When we put actual packets in a TX queue,weﬁrstﬁll the packets from a control list and an ACK list,and later queue the data packets.By doing this,we prioritize important packetsto prevent short connections from being delayed by other long connections.4Lightweight connection setup:In addition,weﬁnd that a large portion of connection setup cost is from allo-cating memory space for TCP control blocks and socket buffers.When many threads concurrently call malloc() or free(),the memory manager in the kernel can be eas-ily contended.To avoid this problem,we pre-allocate large memory pools and manage them at user level to sat-isfy memory(de)allocation requests locally in the same thread.3.3Application Programming Interface One of our primary design goals is to minimize the port-ing effort of existing applications so that they can easily beneﬁt from our user-level TCP stack.Therefore,our programming interface must preserve the most commonly used semantics and application interfaces as much as pos-sible.To this end,mTCP provides a socket API and an event-driven programming interface.User-level socket API:We provide a BSD-like socket interface;for each BSD socket function,we have a corresponding function call(e.g.,accept()becomes mtcp_accept()).In addition,we provide functionali-ties that are frequently used with sockets,e.g.,fcntl and ioctl,for setting the socket as nonblocking or getting/set-ting the socket buffer size.To support various applications that require inter-process communication using pipe(), we also provide mtcp_pipe().The socket descriptor space in mTCP(including the fds of pipe()and epoll())is local to each mTCP thread; each mTCP socket is associated with a thread context. This allows parallel socket creation from multiple threads by removing lock contention on the socket descriptor space.We also relax the semantics of socket()such that it returns any available socket descriptor instead of the minimum available fd.This reduces the overhead of ﬁnding the minimum available fd.User-level event system:We provide an epoll()-like event system.While our event system aggre-gates the events from multipleﬂows for batching ef-fects,we do not require any modiﬁcation in the event handling logic.Applications can fetch the events through mtcp_epoll_wait()and register events through mtcp_epoll_ctl(),which correspond to epoll_wait() and epoll_ctl()in Linux.Our current mtcp_epoll() implementation supports events from mTCP sockets(in-cluding listening sockets)and pipes.We plan to integrate other types of events(e.g.,timers)in the future.4This optimization can potentially make the system more vulnerable to attacks,such as SYNﬂooding.However,existing solutions,such as SYN cookies,can be used to mitigate the problem.Applications:mTCP integrates all techniques known at the time of this writing without requiring substantial kernel modiﬁcation while preserving the application inter-face.Thus,it allows applications to easily scale their performance without modifying their logic.We have ported many applications,including lighttpd,ab,and SSLShader to use mTCP.For most applications we ported, the number of lines changed were less than100(more de-tails in Section4).We also demonstrate in Section5that a variety of applications can directly enjoy the performance beneﬁt by using mTCP.However,this comes with a few trade-offs that appli-cations must consider.First,the use of shared memory space offers limited protection between the TCP stack and the application.While the application cannot directly ac-cess the shared buffers,bugs in the application can corrupt the TCP stack,which may result in an incorrect behavior. Although this may make debugging more difﬁcult,we believe this form of fate-sharing is acceptable since users face a similar issue in using other shared libraries such as dynamic memory allocation/deallocation.Second,appli-cations that rely on the existing socket fd semantics must change their logic.However,most applications rarely de-pend on the minimum available fd at socket(),and even if so,porting them will not require signiﬁcant code change. Third,moving the TCP stack will also bypass all existing kernel services,such as theﬁrewall and packet schedul-ing.However,these services can also be moved into the user-level and provided as application modules.Finally, our prototype currently only supports a single application due to the limitation of the user-level packet I/O system. We believe,however,that this is not a fundamental limita-tion of our approach;hardware-based isolation techniques such as VMDq[5]and SR-IOV[13]support multiple virtual guest stacks inside the same host using multiple RX/TX queues and hardware-based packet classiﬁcation. We believe such techniques can be leveraged to support multiple applications that share a NIC port.4ImplementationWe implement11,473lines of C code(LoC),including packet I/O,TCPﬂow management,user-level socket API and event system,and552lines of code to patch the PSIO library.5For threading and thread synchronization,we use pthread,the standard POSIX thread library[11].Our TCP implementation follows RFC793[17].It sup-ports basic TCP features such as connection management, reliable data transfer,ﬂow control,and congestion control. For reliable transfer,it implements cumulative acknowl-edgment,retransmission timeout,and fast retransmission. mTCP also implements popular options such as timestamp, Maximum Segment Size(MSS),and window scaling.For 5The number is counted by SLOCCount2.26.。