The complexity of decentralized control of Markov decision processes
Empirical processes of dependent random variables

Empirical processes that have been discussed include linear processes and Gaussian processes; see Dehling and Taqqu (1989) and Csörgő and Mielniczuk (1996) for long- and short-range dependent subordinated Gaussian processes, and Ho and Hsing (1996) and Wu (2003a) for long-range dependent linear processes. A collection of recent results is presented in Dehling, Mikosch and Sørensen (2002). In that collection Dedecker and Louhichi (2002) made an important generalization of Ossiander's (1987) result. Here we investigate the empirical central limit problem for dependent random variables from another angle that avoids strong mixing conditions. In particular, we apply a martingale method and establish a weak convergence theory for stationary, causal processes. Our results are comparable with the theory for independent random variables in that the imposed moment conditions are optimal or almost optimal. We show that, if the process is short-range dependent in a certain sense, then the limiting behavior is similar to that of iid random variables in that the limiting distribution is a Gaussian process and the norming sequence is √n. For long-range dependent linear processes, one needs to apply asymptotic expansions to obtain √n-norming limit theorems (Section 6.2.2).

The paper is structured as follows. In Section 2 we introduce some mathematical preliminaries necessary for the weak convergence theory and illustrate the essence of our approach. Two types of empirical central limit theorems are established. Empirical processes indexed by indicators of left half lines, absolutely continuous functions, and piecewise differentiable functions are discussed in Sections 3, 4 and 5, respectively. Applications to linear processes and iterated random functions are made in Section 6. Section 7 presents some integral and maximal inequalities that may be of independent interest. Some proofs are given in Sections 8 and 9.

2 Preliminaries

Let F and F_n denote the marginal and empirical distribution functions. Let G be a class of measurable functions from R to R. The centered G-indexed empirical process is given by

(P_n − P)g = (1/n) Σ_{i=1}^{n} [ g(X_i) − E g(X_i) ],   g ∈ G.
Sequential decision-making-oriented interaction problem processing method for perturbation contexts

Computer Integrated Manufacturing Systems, Vol. 26, No. 12, Dec. 2020. DOI: 10.13196/j.cims.2020.12.010

Sequential decision-making-oriented interaction problem processing method for perturbation contexts
AN Jingmin (1,2), LI Guanyu (2+), ZHANG Dongqing (1), JIANG Wei (2)
(1. Faculty of Computer and Software, Dalian Neusoft University of Information, Dalian 116023, China; 2. Network Information Center, Dalian Maritime University, Dalian 116026, China)

Abstract: Current research on sequential decision-making in ambient intelligence concentrates on multi-agent (Agent) interactive decision-making over uncertain contexts and does not address how an Agent should solve this problem in perturbation (abnormal) contexts. An Agent interactive decision-making mechanism for perturbation contexts is therefore proposed. First, based on an improved context ontology, the spatio-temporal states of the entities observed by the Agent in a context are acquired and computed. Second, a semantic reasoning algorithm combined with a metacognitive-loop structure is used to detect and evaluate the perturbation context, and the result is fed back to the user-serving Agent, which finally performs the action or response that meets the user's needs in the current context. Experiments in a smart home environment show that, on top of several representative machine learning methods, the proposed approach improves decision accuracy by more than 10% on average while increasing response time by only about 5%, and that it extends the applicable domains and enhances practicality.

Keywords: Agent; sequential decision-making; ambient intelligence; perturbation context; context ontology; spatio-temporal state; metacognitive loop
CLC number: TP18; Document code: A

0 Introduction
In recent years, with the rise of intelligent decision-making and service recommendation, research on multi-Agent systems (MAS) has become a hot and key topic in artificial intelligence [1], but the curse of dimensionality, which MAS finds difficult to overcome, has become a bottleneck to its development.
A decision-diagram-based symmetry reduction method for complex system models

Computer Engineering and Design, Vol. 34, No. 10, Oct. 2013

JI Ming-yu, WANG Hai-tao, CHEN Zhi-yuan, LI Yan-mei
(1. College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China; 2. College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China)

Abstract: To solve the state space explosion problem in the model checking of complicated stochastic systems, a symmetry reduction method for probabilistic models which supports the description of transition reward characteristics is proposed. The size of the state set of the original model is decreased by using the exclusive representation function of the state-set equivalence relations, and the traditional multi-terminal binary decision diagram is improved by adding the reward characteristics used to represent the transition relations of the probabilistic reward model; an efficient symmetry reduction algorithm based on the transition matrix is proposed to reduce the transition relations. The example results show the feasibility and validity of the method.

Keywords: formal verification; model checking; state space explosion; decision diagram; symmetry reduction; quotient model
Partially observable Markov decision processes (POMDP)

The partially observable Markov decision process (POMDP) model is an extension of the Markov decision process (MDP) model. An MDP makes decisions according to the current, exactly known state of the system, but in many situations the exact state is hard to obtain. For example, in complex mechanical systems the sensor signals used to measure the system state are often corrupted by noise, making the exact state difficult to determine. A POMDP assumes that the state of the system cannot be observed directly and is only partially known; it therefore models systems with incomplete state information and makes decisions based on the currently available incomplete state information.

POMDPs are applied in a very wide range of fields, including industry (machine maintenance, structural inspection, elevator control, fisheries, etc.), science (robot control, ecological behavior, machine vision, etc.), business (network fault detection and repair, distributed database queries, marketing, questionnaire design, group policy, etc.), the military (moving-target search, search and rescue, target identification, weapon allocation, etc.) and society (education, medical diagnosis, etc.) [1].

Current research on POMDP algorithms covers exact algorithms and approximate algorithms. Exact algorithms can in theory obtain the optimal solution, but because their computational complexity grows exponentially with the size of the problem, they are generally suitable only for small problems. Many approximate algorithms for solving POMDPs have therefore appeared; most of them are built on exact algorithms, and exact algorithms are the basis for studying and constructing approximate ones [2]. Building on an introduction to the POMDP model and its properties, this article analyzes the main current exact POMDP algorithms and briefly introduces the commonly used approximate algorithms.

Excerpted from: GUI Lin, WU Xiaoyue. A survey of algorithms for partially observable Markov decision processes. Systems Engineering and Electronics, June 2008.
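To make the belief-state idea behind the POMDP model described above concrete, here is a minimal sketch of the Bayes belief update an agent performs when it can only observe the state indirectly. The two-state "tiger" model and all of its probabilities are illustrative assumptions, not taken from the surveyed algorithms.

```python
import numpy as np

states = ["tiger-left", "tiger-right"]
actions = ["listen"]
observations = ["hear-left", "hear-right"]

# T[a][s, s']: transition probabilities (listening does not move the tiger).
T = {"listen": np.eye(2)}

# Z[a][s', o]: probability of observing o after action a lands in state s'.
Z = {"listen": np.array([[0.85, 0.15],    # tiger-left  -> mostly hear-left
                         [0.15, 0.85]])}  # tiger-right -> mostly hear-right

def belief_update(b, a, o_idx):
    """Bayes filter: b'(s') is proportional to Z(o|s',a) * sum_s T(s'|s,a) b(s)."""
    predicted = T[a].T @ b                    # prediction step
    unnormalized = Z[a][:, o_idx] * predicted # correction step
    return unnormalized / unnormalized.sum()

b = np.array([0.5, 0.5])                      # start: no idea where the tiger is
for o in ["hear-left", "hear-left"]:          # two consistent observations
    b = belief_update(b, "listen", observations.index(o))
    print(o, "->", np.round(b, 3))
```

Exact and approximate POMDP algorithms differ mainly in how they represent and optimize the value function over beliefs like `b`; the update itself is shared by essentially all of them.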
Research on Markov modeling methods for complex UML

Computer Engineering and Applications, 2018, 54(4): 60-65. doi: 10.3778/j.issn.1002-8331.1609-0429

JING Tiancai, FANG Jinglong, WEI Dan
Education Ministry Key Laboratory of Complex Systems Modeling and Simulation, School of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China
JING Tiancai, FANG Jinglong, WEI Dan. Research for Markov chain modeling based on UML. Computer Engineering and Applications, 2018, 54(4): 60-65.

Abstract: Software reliability testing technology is important to ensure the quality of software, especially for complex software such as that of aerospace and financial institutions. Although some studies have investigated the Markov chain usage model to assess software reliability, the granularity of scenario messages in the UML model of complex software is too coarse to fully describe the software. To solve this problem, a novel method is proposed to build a Markov chain usage model from a UML model with nested combination fragments. An example is then given to illustrate the application and feasibility of this modeling method, which confirms the effectiveness of the proposed method and provides a guideline for building a Markov chain usage model.
Keywords: reliability testing; UML model; nested combination fragment; Markov chain
CLC number: TP311.5; Document code: A

1 Introduction
With the rapid development of computer technology, software systems have penetrated every field of society. Demand for software systems keeps increasing, their scale and complexity grow steadily, and the software crisis is becoming ever more prominent and urgently needs to be addressed [1]. In particular, once the large, complex hybrid software-hardware systems used in fields such as aviation, aerospace, medicine and finance fail, they endanger human life, health and property; for example, the software failure in the 1996 European space program caused the rocket launch to fail, with enormous losses [2]. As software development models become increasingly diverse, traditional models based on test data can no longer keep up, and component-based reliability modeling for testing has gradually become the main topic of reliability research. Component-based reliability modeling comes in three forms: based on component transition probability graphs, based on paths, and based on states [3-4]. Modeling based on component transition probability graphs computes the reliability of a component system from the transition modes and paths between components. Path-based modeling represents the software structure as execution paths and measures the reliability of the whole system by computing the reliability of the paths. State-based modeling maps the interactions and transfers between system components onto a Markov process: the component reached after each interaction is mapped to a state, and the next state depends only on the current state and the exchanged information, not on historical states and messages; this is the Markov property [5]. The method in this paper builds on UML and maps the interactions and transfers of software system components onto a Markov chain usage model of the software. The main contributions of this paper are: (1) to address the defects that scenario execution messages are too coarse-grained and that execution is idealized as sequential, based on multi-layer nested combination fragments...

Author information: JING Tiancai (1992-), male, M.S. candidate, research interest: software reliability, E-mail: j-t-c@foxmail.com; FANG Jinglong (1964-), male, Ph.D., research interests: data mining, machine learning, software quality assessment; WEI Dan (1979-), female, Ph.D., lecturer, research interests: data mining, machine learning, software quality assessment.
Received: 2016-09-29; Revised: 2017-01-03. Article number: 1002-8331(2018)04-0060-06. CNKI online first: 2017-03-16, http://kns.cnki.net/kcms/detail/11.2127.TP.20170316.1515.014.html
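As a concrete illustration of the state-based view described above (components mapped onto the states of a Markov process), here is a small sketch of the classical Cheung-style reliability computation that a Markov chain usage model typically feeds into. The component reliabilities and transition probabilities are made-up numbers, not values from the paper.

```python
import numpy as np

# State-based (Cheung-style) reliability sketch: components C1..C3 are Markov
# states, control flows between them with probabilities P, and each component
# succeeds with reliability R[i]. All numbers below are illustrative.
R = np.array([0.999, 0.995, 0.990])   # reliability of components C1..C3
P = np.array([                        # P[i, j]: prob. control moves Ci -> Cj
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],                  # C3 is the terminal component
])

# Q_hat[i, j] = R[i] * P[i, j]: control is transferred only if Ci ran correctly.
Q_hat = R[:, None] * P
S = np.linalg.inv(np.eye(len(R)) - Q_hat)

# Probability of reaching C3 from C1 with every visited component succeeding,
# times the reliability of C3 itself.
system_reliability = S[0, -1] * R[-1]
print(round(system_reliability, 6))
```

In practice the transition matrix `P` would be estimated from the usage model built out of the UML interaction fragments rather than written down by hand.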
Solving the POMDP problem in UAV path planning

UAV flight has become an essential tool in many areas of modern society, and path planning is crucial to guaranteeing the safe flight of UAVs. In practice, however, path planning faces complex environments and situations, and traditional methods may fail to meet actual needs. The POMDP approach can therefore serve as an effective solution.

What is the POMDP approach? POMDP is the abbreviation of partially observable Markov decision process; when a problem involves uncertainty, we can model it with a Markov decision process and use probability and planning algorithms to compute an optimal policy. In practice, the uncertainty can be handled by monitoring and updating the environment in real time through sensor equipment.

Application of the POMDP approach in UAV path planning: the method relies mainly on sensor information to monitor and update the UAV's knowledge of its environment and, from that, to infer the best trajectory. The POMDP approach allows the UAV to choose optimal decisions; to do so, the UAV needs to compute a discounted return. In practice, however, the UAV model is designed to know the result of its next action but cannot know future situations, and those unknowns may profoundly affect the outcome of that next action. Therefore, when choosing a UAV trajectory, the uncertainty of the current environment must be taken into account to avoid unfavorable situations.
POMDP solution methods. The most common way to solve a POMDP is a value-iteration-based scheme: compute the expected value of the next action in each state and keep iterating on the returns until the optimal policy is found. Although this scheme is computationally expensive, it is reliable and is therefore often used in practice. For problems that must be solved quickly, policy-based methods can be used instead; compared with value iteration they need less computation time and can therefore be applied to real-time problems.
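To make the value-iteration idea concrete, here is a minimal sketch of one common value-iteration-based approximation (the QMDP heuristic): value iteration is run on the underlying fully observable MDP, and actions are then scored against the current belief. The tiny four-state "advance along a corridor" model and all of its numbers are illustrative assumptions, not the UAV model discussed above.

```python
import numpy as np

n_states, gamma = 4, 0.95
actions = ["advance", "hold"]
# T[a][s, s']: transition probabilities; R[a][s]: expected immediate reward.
T = {"advance": np.array([[0.1, 0.9, 0.0, 0.0],
                          [0.0, 0.1, 0.9, 0.0],
                          [0.0, 0.0, 0.1, 0.9],
                          [0.0, 0.0, 0.0, 1.0]]),
     "hold":    np.eye(n_states)}
R = {"advance": np.array([-1.0, -1.0, -1.0, 0.0]),
     "hold":    np.array([-0.2, -0.2, -0.2, 0.0])}

# Value iteration on the fully observable MDP.
V = np.zeros(n_states)
for _ in range(500):
    V = np.max([R[a] + gamma * T[a] @ V for a in actions], axis=0)

Q = {a: R[a] + gamma * T[a] @ V for a in actions}

def qmdp_action(belief):
    """Pick the action with the largest belief-weighted Q value."""
    return max(actions, key=lambda a: belief @ Q[a])

print(qmdp_action(np.array([0.2, 0.6, 0.2, 0.0])))
```

Exact POMDP value iteration instead maintains a value function over the whole belief space (sets of alpha vectors), which is what makes it so much more expensive than the approximation sketched here.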
Outlook. The rapid development of UAVs has heightened awareness of their safety, and the POMDP approach plays an important role in guaranteeing safe UAV flight. In the future, UAV technology will be applied ever more widely and will require more advanced path planning methods to guide flight; the prospects for applying the POMDP approach to UAV path planning are therefore broad.
The format of an MDP - a reply
To better explain and explore the bracketed topic, allow me to answer in the form of an article.

[The format of an MDP]: the Markov decision process that drives reinforcement learning. In reinforcement learning, the Markov decision process (MDP) is an important mathematical framework for describing an agent's dynamics and policy choices in sequential decision problems. This article describes the format of an MDP in detail, covering the state space, the action space, the reward function and the state transition probabilities.

First, one of the core components of an MDP is the state space. The state space is the set of all concrete states the agent may be in. Each state is an observation of the problem environment and can be discrete or continuous. For example, in an aircraft autopilot problem, the state space can include state variables such as the aircraft's position, velocity and attitude.
The second component is the action space, the set of all possible actions the agent can take. Actions can be discrete or continuous, depending on the specific problem; in the aircraft autopilot problem, the action space can include the aircraft's heading, the amount of thrust, and so on.

Besides the state space and the action space, an MDP needs a reward function. The reward function maps a state and an action to a real-valued reward; it reflects how good or bad it is for the agent to take a given action in a given state. The design of the reward function strongly affects the performance of reinforcement learning algorithms, and a well-chosen reward can guide the agent toward the desired behavior.

Finally, the state transition probabilities are the other key part of an MDP. They describe the probability distribution over next states when the agent takes a given action in a given state. In a deterministic environment the transition is deterministic: taking an action in a state leads to exactly one next state. In a stochastic environment the transition is random: the same state-action pair may lead to different next states with different probabilities.
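The four ingredients just listed can be written down directly as data. Below is a minimal sketch in Python; the two-state "battery" example and all of its numbers are invented for illustration.

```python
# State space S, action space A, reward function R(s, a), and
# transition probabilities P(s' | s, a) for a toy "battery" MDP.
states = ["high", "low"]                     # state space S
actions = ["search", "recharge"]             # action space A

# Reward function R(s, a): expected immediate reward.
R = {("high", "search"): 5.0, ("low", "search"): -3.0,
     ("high", "recharge"): 0.0, ("low", "recharge"): 1.0}

# Transition probabilities P(s' | s, a); each inner dict sums to 1.
P = {("high", "search"):   {"high": 0.7, "low": 0.3},
     ("low", "search"):    {"high": 0.0, "low": 1.0},
     ("high", "recharge"): {"high": 1.0, "low": 0.0},
     ("low", "recharge"):  {"high": 0.9, "low": 0.1}}

def expected_next_value(s, a, V, gamma=0.9):
    """One-step lookahead: R(s,a) + gamma * sum_s' P(s'|s,a) * V(s')."""
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())

V = {"high": 0.0, "low": 0.0}
print(expected_next_value("high", "search", V))   # -> 5.0 when V is all zeros
```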
Within the MDP framework, the main goal of a reinforcement learning algorithm is to maximize the cumulative reward by learning and optimizing a policy. Learning algorithms can do this by iteratively updating a value function or a policy function. The value function measures the expected long-term cumulative reward of following a policy from a given state, while the policy function defines the probability of taking each action in each state. By repeatedly updating these functions, the agent gradually learns the optimal policy and achieves the goal of maximizing cumulative reward.
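As a concrete instance of "iteratively updating a value function", here is a small tabular Q-learning sketch on the same kind of toy MDP; the environment, rewards and hyperparameters are illustrative assumptions, not part of the original text.

```python
import random

states, actions = ["high", "low"], ["search", "recharge"]
P = {("high", "search"):   [("high", 0.7), ("low", 0.3)],
     ("low", "search"):    [("low", 1.0)],
     ("high", "recharge"): [("high", 1.0)],
     ("low", "recharge"):  [("high", 0.9), ("low", 0.1)]}
R = {("high", "search"): 5.0, ("low", "search"): -3.0,
     ("high", "recharge"): 0.0, ("low", "recharge"): 1.0}

def step(s, a):
    """Sample a next state from P(.|s,a) and return it with the reward."""
    outcomes = P[(s, a)]
    s2 = random.choices([o[0] for o in outcomes], [o[1] for o in outcomes])[0]
    return s2, R[(s, a)]

Q = {(s, a): 0.0 for s in states for a in actions}
alpha, gamma, epsilon, s = 0.1, 0.9, 0.1, "high"
for _ in range(20000):
    if random.random() < epsilon:                      # explore
        a = random.choice(actions)
    else:                                              # exploit
        a = max(actions, key=lambda x: Q[(s, x)])
    s2, r = step(s, a)
    # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a').
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, x)] for x in actions) - Q[(s, a)])
    s = s2

print({k: round(v, 2) for k, v in Q.items()})
```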
2012-57-9-TAC: sampled-data consensus + double-integrator dynamics + nonuniform time-varying delays
A Sufficient Condition for Convergence of Sampled-DataConsensus for Double-Integrator Dynamics With Nonuniform and Time-Varying Communication Delays Jiahu Qin,Student Member,IEEE,andHuijun Gao,Senior Member,IEEEAbstract—This technical note investigates a discrete-time second-order consensus algorithm for networks of agents with nonuniform and time-varying communication delays under dynamically changing communica-tion topologies in a sampled-data setting.Some new proof techniques are proposed to perform the convergence analysis.It isfinally shown that under certain assumptions upon the velocity damping gain and the sampling pe-riod,consensus is achieved for arbitrary bounded time-varying commu-nication delays if the union of the associated digraphs of the interaction matrices in the presence of delays has a directed spanning tree frequently enough.Index Terms—Double-integrator agents,sampled-data consensus,span-ning tree,time-varying communication delays.I.I NTRODUCTIONIn recent years,consensus problems for agents with single-integrator dynamics have been studied from various perspectives(see,e.g.,[4], [7],[10],[11],[14],[16],[17],[26]).Taking into account that double-integrator dynamics can be used to model more complicated systems in reality,cooperative control for multiple agents with double-integrator dynamics has been studied extensively recently,see[12],[18]–[20], [23],[28]for continuous algorithms and[1]–[3],[5],[6],[8],[13]for discrete-time algorithms.In[8],a sampled-data algorithm is studied for double-integrator dy-namics through a Lyapunov-based approach.The analysis in[8]is lim-ited to an undirected network topology and cannot be extended to deal with the directed case.However,the informationflow might be directed in practical applications.In a similar sampled-data setting,[1]studies two sampled-data consensus algorithms,i.e.,the case with an absolute velocity damping term and the case with a relative velocity damping term,in the context of a directed network topology by extensively using matrix spectral analysis.Reference[2]extends the algorithms in[1]to deal with a dynamic directed network topology.References[5]and[6] mainly investigate sampled-data consensus for the case with a relative velocity damping term under a dynamic network topology.In[5],the network topologies are required to be both balanced and strongly con-nected at each sampling instant.On the other hand,considering that it might be difficult to measure the velocity information in practice,[6] Manuscript received November17,2009;revised September15,2010; August15,2011,and January24,2012;accepted January25,2012.Date of publication February17,2012;date of current version August24,2012.This work was supported in part by the National Natural Science Foundation of China under Grants60825303,60834003,and61021002,by the973Project (2009CB320600),and by the Foundation for the Author of National Excellent Doctoral Dissertation of China(2007B4).Recommended by Associate Editor H.Ito.J.Qin is with Harbin Institute of Technology,Harbin,China,and also with the Australian National University,Canberra,A.C.T.,Australia(e-mail:jiahu. qin@.au).H.Gao is with the Research Institute of Intelligent Control and Systems, Harbin Institute of Technology,Harbin150001,China(e-mail:hjgao@. 
cn).Color versions of one or more of thefigures in this paper are available online at .Digital Object Identifier10.1109/TAC.2012.2188425proposes a consensus strategy using the measurements of the relative positions between neighboring agents to estimate the relative velocities. In[13],consensus problems of second-order multi-agent systems with nonuniform time delays and dynamically changing topologies is investigated.However,the paper considers a discrete-time model es-timated by using the forward difference approximation method rather than a sampled-data model.In general,a sampled-data model is more realistic.Also,in[13],the weighting factors must be chosen from a finite set.With this background,we study the convergence of sam-pled-data consensus for double-integrator dynamics under dynamically changing topologies and allow the communication delays to be not only different but also time varying.Here,considering the weighting factors of directed edges between neighboring agents usually represent confi-dence or reliability of the transmitted information,it is more natural to consider choosing the weighting factors from an infinite set,which is more general than thefinite set case in[2]and[13].Moreover,dif-ferent from that in[13],A(k),the interaction matrix in the presence of delays at time t=kT,is introduced in this technical note and the dif-ference between A(k)and A(k),the adjacency matrix at time t=kT, is further explored as well.The reason for introducing A(k)is that it is more relevant than A(k)to the strategies investigated in this technical note.It is worth pointing out that the method employed to perform the convergence analysis is totally different from most of the existing liter-ature which heavily relies on analyzing the system matrix by spectral analysis.By using the similar transformation as that used in[13],we can treat the sampled-data consensus for double-integrator dynamics as the consensus for multiple agents modeled byfirst-integrator dynamics. Then,in order to make the transformed system dynamics mathemati-cally tractable,a new graphic method is proposed to specify the rela-tions between0(A(k)),the associated digraph of the interaction matrix in the presence of delays,and the the associated digraph of the trans-formed system matrix.Finally,motivated by the work in[22,Theorem 2.33]and[27],by employing the product properties of row-stochastic matrices from an infinite set,we present a sufficient condition in terms of the associated digraph of the interaction matrix in the presence of delays for the agents to reach consensus.Note here that the proving techniques employed in this technical note can be extended directly to derive similar results by considering the discrete-time model in[13]. 
The rest of the technical note is organized as follows.In Section II, we formulate the problem to be investigated and also provide some graph theory notations,while the convergence analysis is given in Section III.In Section IV,a numerical example is provided to show the effectiveness of the new result.Finally,some concluding remarks are drawn in Section V.II.B ACKGROUND AND P RELIMINARIESA.NotationsLet I n2n2n and0n;n2n2n denote,respectively,the identity matrix and the zero matrix,and1m2m be the column vector of all ones.Letand+denote,respectively,the set of nonnegative and positive integers.Given any matrix A=[a ij]2n2n,let diag(A) denote the diagonal matrix associated with A with the ith diagonal element equal to a ii.Hereafter,matrices are assumed to be compatible for algebraic operations if their dimensions are not explicitly stated.A matrix M2n2n is nonnegative,denoted as M 0,if all its entries are nonnegative.Let N2n2n.We write M N if M0N 0.A nonnegative matrix M is said to be row stochastic if all its row sums are1.Let k i=1M i=M k M k01111M1denote the left product of the matrices M k;M k01;111;M1.A row-stochastic matrix M is ergodic0018-9286/$31.00©2012IEEE(or indecomposable and aperiodic )if there exists a column vector f2nsuch that lim k !1M k =1n f T .B.Graph Theory NotationsLet G =(V ;E ;A )be a weighted digraph of order n with a finite nonempty set of nodes V =f 1;2;...;n g ,a set of edges E V 2V ,and a weighted adjacency matrix A =[a ij ]2n 2n with nonnegative adjacency elements a ij .An edge of G is denoted by (i;j ),meaning that there is a communication channel from agent i to agent j .The adjacency elements associated with the edges are positive,i.e.,(j;i )2E ,a ij >0.Moreover,we assume a ii =0for all i 2V .The set of neighbors of node i is denoted by N i =f j 2V :(j;i )2Eg .Denote by L =[l ij ]the Laplacian matrix associated with G ,where l ij =0a ij ,i =j ,and l ii=n k =1;k =i a ik .A directed path is a sequence of edges in a digraph of the form (i 1;i 2);(i 2;i 3);....A digraph has a directed spanning tree if there exists at least one node,called the root node,having a directed path to all the other nodes.A spanning subgraph G s of a directed graph G is a directed graph such that the node set V (G s )=V (G )and the edge set E (G s ) E (G ).Given a nonnegative matrix S =[s ij ]2n 2n ,the associated di-graph of S ,denoted by 0(S ),is the directed graph with the node set V =f 1;2;...;n g such that there is an edge in 0(S )from j to i if and only if s ij >0.Note that for arbitrary nonnegative matrices M;N2p 2p satisfying M N ,where >0,if 0(N )has a di-rected spanning tree,then 0(M )also has a directed spanning tree.C.Sampled-Data Consensus Algorithm for Double-Integrator DynamicsEach agent is regarded as a node in a digraph G of order n .Let T >0denote the sampling period and k2denote the discrete-time index.For notational simplicity,the sampling period T will be dropped in the sequel when it is clear from the context.We consider the following sampled-data discrete-time system which has been investigated in [1],[2],and [8]asr i (k +1)0r i (k )=T v i (k )+12T 2u i (k )v i (k +1)0v i (k )=T u i (k )(1)where x i (k )2p ,v i (k )2p and u i (k )2p are,respectively,the position,velocity and control input of agent i at time t =kT .For simplicity,we assume p =1.However,all results still hold for any p2+by introducing the notation of Kronecker product.In this technical note,we mainly consider the following discrete-time second-order consensus algorithm which takes into account the 
nonuniform and time-varying communication delays as u i (k )=0 v i (k )+j 2N (k )ij (k )(r j (k 0 ij (k ))0r i (k ))(2)where >0denotes the absolute velocity damping gain,N i (k )de-notes the neighbor set of agent i at time t =kT that varies with G (k )(i.e.,the dynamic communication topology at time t =kT ), ij (k )>0if agent i can receive the delayed position r j (k 0 ij (k ))from agent j at time t =kT while ij (k )=0otherwise,and 0 ij (k ) max ,where ij (k )2,is the communication delay from agent j to agent i .Here,we assume ii (t ) 0,that is,the time delays affect only the in-formation that is transmitted from one agent to another.Moreover,we assume that all the nonzero and hence positive weighting factors areboth uniformly lower and upper bounded,i.e., ij (k )2[ ;],where 0< < ,if j 2N i (k ).Remark 1:In general,(j;i )2E (G (k ))or a ij (k )>0,which cor-responds to an available communication channel from agent j to agent i at time t =kT ,does not imply ij (k )>0even if the reverse is true.This is mainly because the communication topologies are dynamicallychanging and the communication delays are time varying,which may destroy the continuity of information.Note that ij (k )>0requires a ij >0for the whole time between k 0 ij (k )and k .DefineA (k )= 11(k )111 1n (k )......... n 1(k )111 nn (k):To distinguish A (k )from the adjacency matrix A (k )at time t =kT ,we call A (k )the interaction matrix in the presence of delays to em-phasize that A (k )is closely related to not only the available commu-nication channel but also the information transmission in the presence of delays.Let L (k )be L (k )=D (k )0A (k ),where D (k )is a diag-onal matrix with the i th diagonal entrybeing n j =1;j =i ij (k ).In fact,0(A (k )),the associated digraph of A (k ),is a spanning subgraph of the communication topology G (k )at time t =kT .To illustrate,consider a team of n =3agents.The possible communication topologies are modeled by the digraph as shown in Fig.1.Assume the communica-tion delays 21(k )and 32(k ),k2,are all larger than 1T ,while the communication topology switches periodically between Ga and Gb at each sampling instant.Clearly,A (k )=03;3at each sampling instant.However,in the special case that there is no communication delay be-tween neighboring agents,0(A (k ))=G (k ).In the case that both the communication topology and the communication delays are time in-variant,0(A (k ))=G (k )after max time steps.We say that consensus is reached for algorithm (2)if for any initial position and velocity states,and any i;j 2Vlim k !1r i (k )=lim k !1r j (k )and lim k !1v i (k )=0:It is assumed that r i (k )=r i (0)and v i (k )=v i (0)for any k <0and i;j 2V .III.M AIN R ESULTSDenote G=f G 1;G 2;...;G m g as the finite set of all possible com-munication topologies for all the n agents.In the sequel,when we men-tion the union of a group of digraphs f G i ;...;G i g G,we mean a digraph with the node set V =f 1;2;...;n g and the edge set given by the union of the edge sets of G i ,j =1;...;k .Firstly,we perform the following model transformation,which helps us deal with the consensus problem for an equivalent trans-formed discrete-time system.Denote r (k )=[r 1(k );111;r n (k )]T ,v (k )=[v 1(k );111;v n (k )]T ,x (k )=(2= )v (k )+r (k ),andy (k )=[r (k )T x (k )T ]T.Then,applying algorithm (2)and by some manipulation,(1)can be written in a matrix form asy (k +1)=40(k )y (k )+`=14`(k )y (k 0`)(3)where we get the equation shown at the bottom of the next page,and 4`(k )=T2A `(k )0n;n2T +12T 2A `(k 
)0n;n;`=1;2;...; max :Here in 4p (k ),p =0;1;...; max ,the ij th element of A p (k )is either equal to ij (k )if ij (k )=p ,or equal to 0otherwise and L (k )is the Laplacian matrix of the digraph of A (k ).1ObviouslyA 0(k )+A 1(k )+111+A(k )=A (k ):The following lemma will allow us to perform the convergence anal-ysis by using the product properties of row-stochastic matrices.1NoteL (k )is different from the Laplacian matrix of the communicationtopology G(k).Fig.1.Two possible communication topologies for the three agents.Lemma 1:Let d (k )be the largest diagonal element of the Lapla-cian matrix L (k ),i.e.,d (k )=max if n j =1;j =i ij (k )g .If the ve-locity damping gain and the sampling period T satisfy the following condition:4 T 0 T >2and T 01 2T d (k )(4)then 4(k )=40(k )+41(k )+111+4(k );k2+,is a row-stochastic matrix with positive diagonal elements.Proof:It follows from A 0(k )+A 1(k )+111+A(k )=A (k )=diag L (k )0L (k )that4(k )=40(k )+41(k )+111+4(k )=411(k )412(k )421(k )422(k )(5)where 411(k )=(10( =2)T +( 2=4)T 2)I n 0(T 2=2)L (k ),412(k )=(( =2)T 0( 2=4)T 2)I n ,421(k )=(( =2)T +( 2=4)T 2)I n 0((2= )T +(1=2)T 2)L (k )422(k )=(10( =2)T 0( 2=4)T 2)I n .One can easily check from (4)that all the matrices 411(k ),412(k ),421(k ),and 422(k )are nonnegative with positive di-agonal elements.That is,4(k )is a nonnegative with positive diagonal elements.Finally,it follows straightforwardly from L (k )1n =1n that 4(k )is a row-stochastic matrix.Remark 2:By some manipulation,we can get that (4)is equivalent to the following condition:1+1+8T 2d (k )2T <p 501:(6)This is achieved by solving ( T )2+2 T 04<0and T 20 02T d (k ) 0,which can be considered the quadratic inequalities in T and ,respectively.In the sequel,4(k )will be used to denote the row-stochastic matrix as described in Lemma 1.In order to make the transformed system dynamics mathematically tractable in terms of 0(A (k )),the associated digraph of the interaction matrix in the presence of delays,we need to explore the relations be-tween 0(A (k ))and the associated digraph of the transformed system matrix 0(4(k )).To this end,a new graphic method is proposed as follows.Lemma 2:Given any digraph G (V ;E ).Let G 1(V 1;E 1)be a graph with n nodes and an empty edge set,that is,V 1=f n +1;n +2;...;2n g and E 1=.Let ~G(~V ;~E )be a digraph satisfying the fol-lowing conditions:(A)~V=V [V 1=f 1;...;n;n +1;...;2n g ;(B)there is an edge from node n +i to node i ,i.e.,(n +i;i )2~",for any i 2V ;(C)if (j;i )2E ,then (j;n +i )2~Efor any i;j 2V ;i =j .Then,G has a directed spanning tree if and only if ~Ghas a directed spanning tree.Proof:Necessity:Denote G s as a directed spanning tree of the digraph G .Assume,without loss of generality,`is the root node of G s .By rules (B )and (C ),split each edge (i;j )in G s into edges (i;n +j );(n +j;j )and add edge (n +`;`)for the root node `,then we canget a directed spanning tree for ~G.Sufficiency:Let ~Gs be a directed spanning tree of ~G .Note that by the definition of ~G,the digraph G can be obtained by contracting all the edges (n +i;i );i 2V in the digraph ~G.Thus,the operation of the edge contraction on ~Gs will result in a directed spanning tree,say G s ,of the digraph G .Based on the above lemma,now we have the following result.Lemma 3:Suppose that and T satisfy the inequality in (4).Let f z 1;z 2;...;z q g be any finite subsetof +.If the union of the digraphs 0(A (z 1));0(A (z 2));...;0(A (z q ))has a directed spanning tree,then the union of digraphs 0(4(z 1));0(4(z 2));...;0(4(z q ))also has a 
directed spanning tree.Proof:The union of the digraphs 0(4(z 1));0(4(z 2));...;0(4(z q ))hereby is exactly the digraph0(q l =14(z l )).Because and T satisfy (4),it follows that 4(z l ),l =1;2;...;q ,is a row-stochastic (and hence nonnegative)matrix with positive diagonal entries.Note that L (z l )=diag L (z l )0A (z l ).By observing the equation in (5),we get that there exists a positive number ,say =min f q (( =2)T 0( 2=4)T 2);(2= )T +(1=2)T 2g ,such that we get (7),as shown at the bottom of the page.It thus follows from ~M 12=I n that (n +i;i )20(q l =14(z l ))for any i 2V .On the other hand,~M 21=q l =1A (z l )implies that(j;i )20(q l =1A (z l ))if and only if (j;n +i )20(ql =14(z l ))for any i;j 2V ;i =j .Combining these arguments,we knowthat the digraphs0(q l =14(z l ))and0(ql =1A (z l ))correspondto the digraphs ~G and G ,respectively,as described in Lemma 2.Note that the digraph0(q l =1A (z l ))is just the union of digraphs 0(A (z 1));0(A (z 2));...;0(A (z q )).It then follows from Lemma 2that the digraph0(q l =14(z l ))has a directed spanning tree,which proves the Lemma.Let P be the set of all n by n row-stochastic matrices.Given any row-stochastic matrix P =[p ij ]2P ,define (P )=10mini;j k min f p ik ;p jk g [25].Lemma 4: (1)is continuous on P .40(k )=102T +4T2I n 0T2(diag L (k )0A 0(k))2T 04T2In2T +4T2I n 02T +12T 2(diag L (k )0A 0(k))102T 04T2I nql =14(z l )q2T 04T2I n2T +12T 2diag q l =1L (z l )0q l =1L (z l )0Inql =1A (z l )0= ~M 11~M12~M 21~M22:(7)Proof:2:P can be viewed as a subset of metricspace n .All the functions involved in the definition of (1)are continuous,and since the operations involved are sums and mins,it readily follows that (1)is continuouson n .The restriction of a continuous function is con-tinuous,so (1)is also continuous on P .Two nonnegative matrices M and N are said to be of the same type,denoted by M N ,if they have zero elements and positive elements in the same places.To derive the main result,we need the fol-lowing classical results regarding the infinite product of row-stochastic matrices.Lemma 5:([25])Let M =f M 1;M 2;...;M q g be a finite set of n 2n ergodic matrices with the property that for each se-quence M i ;M i ;...;M i of positive length,the matrix productM i M i111M i is ergodic.Then,for each infinite sequence M i ;M i ;...there exists a column vector c2n such thatlim j !1M i M i111M i =1c T :(8)In addition,when M is an infinite set, (W )<1,where W =S k S k 111Sk,S k 2M ,j =1;2;...;N (n )+1,and N (n )(which may depend on n )is the number of different types of all n 2n ergodic matrices.Furthermore,if there exists a constant 0 d <1satisfying (W ) d ,then (8)still holds.Let d=(n 01) .Assume,in the sequel,that ;T satisfy (4= T )0 T >2and T 01 (2= )T d.Then,by Lemma 1,all possible 4(k )must be nonnegative with positive diagonal elements.In addition,since the set of all 2n ( max +1)22n ( max +1)matrices can be viewed as the metricspace [2n (+1)],for each fixed pair ;T ,all possible 4(k )compose a compact set,denoted by 7( ;T ).This is because all the nonzero and hence positive entries of 4(k )are both uniformly lower and upper bounded,which can be seen by observing the form of 4(k )in (5).Let 3(A )=f B =[b ij ]22n 22n :b ij =a ij or b ij =0;i;j =1;2;...;2n g ,and denote by 5( ;T )the set of matricesM (40;41;...;4)=40411114014I 2n 0111000I 2n 11100 0111I 2nsuch that 40;41;...;423(4(k ))and 40+41+...+4=4(k ),where 4(k )27( ;T ).The set 5( ;T )is compact,since givenany 4(k )27( ;T ),all possible choices of 40;41;...;4are finite.Let (k )=[ 1(k ); 
2(k );111; 2n (+1)(k )]T =[y T (k );y T (k 01);111;y T (k 0 max )]T22n (+1).Then,there exists a matrix M (40(k );41(k );...;4(k ))25( ;T )such that system (3)is rewritten as(k +1)=M (40(k );41(k );...;4(k )) (k ):(9)Clearly,the set 5( ;T )includes all possible system matrices of system (9).2Weare indebted to Associate Editor,Prof.Jorge Cortes,for his help with a simpler proof of this lemma.Given any positive integer K,define ~5(;T )=i =1M (4i 0;4i 1;...;4i):M (1)25( ;T )and there exists a integer ;1 K suchthat the union of digraphsj =04ij ;i =1;...; ;has a directed spanningtree :~5(;T )is also a compact set,which can be derived by noticing the following facts:1)5( ;T )is a compact set;2)all possible choices of are finite since is bounded by K;3)all possible choices of the directed spanning trees are finite;and 4)given the directed spanning tree and ,the followingset:i =1M (4i 0;4i 1;...;4i):M (1)25( ;T )and the union of the digraphsj =04ij;i =1;...; ;hasthe speci ed directed spanningtreeis compact (this can be proved by following the similar proof of [27,Lemma 10]).Note that the set ~5(;T )includes all possible products of ; K ,consecutive system matrices of system (9).The following lemma is presented to prove that all the possible prod-ucts of consecutive system matrices of system (9)satisfy the result as stated in Lemma 5,which in turn allow us to use the properties of in-finite products of row-stochastic matrices from an infinite set to derive our main result.Lemma 6:If 81;...;8k 2~5(;T ),where k =N (2n ( max +1))+1,then there exists a constant 0 d <1such that(k i =18i ) d .Proof:We first prove that for any 82~5(;T );8is an er-godic matrix.According to the definition of ~5(;T ),there exist pos-itive integer (1 K),M (4i 0;4i 1;...;4i )25( ;T ),i =1;...; ,such that 8= i =1M (4i 0;4i 1;...;4i)and the union of digraphs0(j =04ij ),i =1;...; ,has a directed span-ning tree.Since M (4i 0;4i 1;...;4i )25( ;T ),j =04ij must be nonnegative matrices with positive diagonal elements.Furthermore,there exists a positive number 1such that diag(j =04ij ) I 2n ,for any M (4i 0;4i 1;...;4i )25( ;T ).Specifically,by observing (5),we can choose as=min 1;10 2T + 24T20T 22(n 01) ;10 2T 0 24T2:Combining this with the condition that the union of digraphs0(j =04ij ),i =1;...; ,has a directed spanning tree,we can prove that matrix 8is ergodic by following the proof of [26,Lemma 7].Letd =max 82~5(;T )ki =18i :From Lemma 5,we know that(k i =18i )<1.This,together withthe fact that ~5( ;T )is a compact set and (1)is continuous (Lemma4),implies d must exist and 0 d <1,which therefore completing the proof.For notational simplicity,we shall denote M (40(k );41(k );...;4(k ))by M (k )if it is self-evident from the context.Based on the preceding work,now we can present our main result as follows.Theorem 1:Assume that and T satisfy (4= T )0 T >2andT 01 (2= )T d.Then,employing algorithm (2),consensus is reached for all the agents if there exists an infinite sequence of con-tiguous,nonempty,uniformly bounded time intervals [k j ;k j +1),j =1;2;...,starting at k 1=0,with the property that the union of the di-graphs 0(A (k j ));0(A (k j +1));...;0(A (k j +101))has a directed spanning tree.Proof:We first prove that consensus can be reached for system (9)using algorithm (2).Let 8(k;k )=I 2n (+1),k 0,and 8(k;l )=M (k 01)111M (l +1)M (l ),k >l 0.Assume,without loss of generality,that the lengths of all the time intervals [k j ;k j +1),j =1;2;...,are bounded by K.It follows from Lemma 3and the condition that the union of the 
digraphs 0(A (k j ));0(A (k j +1));...;0(A (k j +101))has a directed spanning tree that the union of the digraphs 0(4(k j ));0(4(k j +1));...;0(4(k j +101))also has a directed spanning tree for each j2+,which,together with the proof ofLemma 6,implies that 8(k j +1;k j )=k 01k =k M (k )2~5(;T ).Since 8(k j ;0)=8(k j ;k j 01)8(k j 01;k j 02)1118(k 2;k 1),it then follows from Lemma 5and Lemma 6thatlim j !18(k j ;0)=12n (+1)wT(10)where w22n (+1)and w 0.For each m >0,let k l be the largest nonnegative integer such that k l m .Note that matrix 8(m;k l )is row stochastic,thus we have8(m;0)012n w T =8(m;k l)8(k l ;0)012n wT :The matrix 8(m;k l )is bounded because it is the product of fi-nite matrices which come from a bounded set ~5(;T ).By using (10),we immediately have lim m !18(m;0)=12n (+1)w T .Combining this with the fact that (m )=8(m;0) (0)yields lim m !1 (m )=(w T (0))12n (+1)which,in turn,implies lim m !1x (m )=(w T (0))1n and lim k !1v (m )=0,and there-fore completing the proof.Remark 3:Matrix A (k )is a somewhat complex object to study compared with the adjacency matrix A (k )(see Remark 1).It is worth noting that more general results in which the sufficient conditions for guaranteeing the final consensus are presented in terms of G (k )instead of the interaction matrix in the presence of delays can be provided if some additional conditions are imposed.For example,if in addition to the conditions on and T as that required in Theorem 1,it is further required that a certain communication topology which takes effect at some time will last for at least max +1time steps,then we can get that consensus can be reached if there exists an infinite sequence of contiguous,nonempty,uniformly bounded time intervals [k j ;k j +1),j =1;2;...,starting at k 1=0,with the property that the union of the digraphs G (k j );G (k j +1);...;G (k j +101)has a directed spanning tree.This can be observed by reconstructing a new sequence of con-tiguous,nonempty and uniformly bounded time intervals which satis-fies the condition in Theorem 1by using similar technique as that in in [26,Theor.3].IV .I LLUSTRATIVE E XAMPLEConsider a group of n =6agents interacting between the possible digraphs f Ga;Gb;Gc g (see Fig.2),all of which have 0–0.2weights.Fig.2.Digraphs which model all the possible communicationtopologies.Fig.3.Position and velocity trajectories for agents.Take and T as =2and T =0:6respectively.Assume that the communication delays ij (k )satisfies 21(k )= 32(k )= 43(k )=1T s , 52(k )= 54(k )=2T s ,while 65(k )= 61(k )=3T s ,for any k2+.Moreover,we assume the switching signal is periodically switched,every 3T s in a circular way from Ga to Gb ,from Gb to Gc ,and then from Gc to Ga .Obviously,the union of the digraphs 0(A (k ))across each time in-terval of 9T s is precisely the digraph G d in Fig.2,which therefore has a directed spanning tree.Fig.3shows that consensus is reached for algorithm (2),which is consistent with the result in Theorem 1.V .C ONCLUSIONS AND F UTURE W ORKIn this technical note,we have investigated a discrete-time second-order consensus algorithm for networks of agents with nonuniform and time-varying communication delays under dynamically changing com-munication topologies in a sampled-data setting.By employing graphic method,state argumentation technique as well as the product proper-ties of row-stochastic matrices from an infinite set,we have presented a sufficient condition in terms of the associated digraph of the interac-tion matrix in the presence of delays for the agents to reach 
consensus. Finally, we have shown the usefulness and advantages of the proposed result through simulation results. It is worth noting that the case with input delays is an interesting topic which deserves further investigation in our future work.
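To make the sampled-data update in (1)-(2) concrete, here is a minimal simulation sketch of the double-integrator consensus algorithm with delayed neighbor positions. The fixed ring topology, gains and constant one-step delay are illustrative assumptions; the technical note allows switching topologies and nonuniform time-varying delays.

```python
import numpy as np

# r_i(k+1) = r_i(k) + T v_i(k) + (T^2/2) u_i(k),  v_i(k+1) = v_i(k) + T u_i(k),
# u_i(k)   = -alpha v_i(k) + sum_j a_ij (r_j(k - tau) - r_i(k)).
n, T, alpha, tau = 4, 0.6, 2.0, 1           # agents, sampling period, gain, delay
A = 0.2 * np.roll(np.eye(n), 1, axis=1)     # weighted directed ring topology

rng = np.random.default_rng(0)
r = [rng.uniform(-5, 5, n)]                 # position history (index = step k)
v = [rng.uniform(-1, 1, n)]                 # velocity history

for k in range(200):
    r_delayed = r[max(k - tau, 0)]          # delayed neighbour positions
    u = -alpha * v[k] + A @ r_delayed - A.sum(axis=1) * r[k]
    r.append(r[k] + T * v[k] + 0.5 * T**2 * u)
    v.append(v[k] + T * u)

# Under the gain/period condition of Theorem 1, positions should approach a
# common value and velocities should approach zero.
print(np.round(r[-1], 3), np.round(v[-1], 3))
```

The gains chosen here (alpha = 2, T = 0.6, weights of 0.2) mirror the magnitudes used in the note's own illustrative example.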
Spectrum resource allocation for NR-V2X vehicular communication based on the FL-MADQN algorithm
第 43 卷第 3 期2024年 5 月Vol.43 No.3May 2024中南民族大学学报(自然科学版)Journal of South-Central Minzu University(Natural Science Edition)基于FL-MADQN算法的NR-V2X车载通信频谱资源分配李中捷,邱凡,姜家祥,李江虹,贾玉婷(中南民族大学a.电子信息工程学院;b.智能无线通信湖北重点实验室,武汉430074)摘要针对5G新空口-车联网(New Radio-Vehicle to Everything,NR-V2X)场景下车对基础设施(Vehicle to Infrastructure,V2I)和车对车(Vehicle to Vehicle,V2V)共享上行通信链路的频谱资源分配问题,提出了一种联邦-多智能体深度Q网络(Federated Learning-Multi-Agent Deep Q Network,FL-MADQN)算法. 该分布式算法中,每个车辆用户作为一个智能体,根据获取的本地信道状态信息,以网络信道容量最佳为目标函数,采用DQN算法训练学习本地网络模型. 采用联邦学习加快以及稳定各智能体网络模型训练的收敛速度,即将各智能体的本地模型上传至基站进行聚合形成全局模型,再将全局模型下发至各智能体更新本地模型. 仿真结果表明:与传统分布式多智能体DQN算法相比,所提出的方案具有更快的模型收敛速度,并且当车辆用户数增大时仍然保证V2V链路的通信效率以及V2I链路的信道容量.关键词车联网;资源分配;深度Q网络;联邦学习中图分类号TN929.5 文献标志码 A 文章编号1672-4321(2024)03-0401-07doi:10.20056/ki.ZNMDZK.20240315Spectrum resource allocation for NR-V2X in-vehicle communicationbased on FL-MADQN algorithmLI Zhongjie,QIU Fan,JIANG Jiaxiang,LI Jianghong,JIA Yuting(South-Central Minzu University,a.College of Electronic andInformation Engineering;b.Hubei Key Laboratory ofIntelligent Wireless Communications,Wuhan 430074,China)Abstract To address the spectrum resource allocation problem of shared uplink between vehicle-to-infrastructure (V2I)and vehicle-to-vehicle (V2V) in 5G New Radio-Vehicle to Everything (NR-V2X) scenario. A Federated Learning-Multi-Agent Deep Q Network (FL-MADQN) algorithm is proposed. In the decentralized algorithm, each vehicle user is treated as an agent to learn the local network model using the DQN algorithm based on the obtained local channel state information and the optimal network channel capacity as the objective function. Federated learning is used to speed up and stabilize the convergence rate of each agent 's model training. The local model of each agent is uploaded to the base station for aggregation to form the global model,and then the global model is distributed to each agent to update the local model. Simulation results show that this scheme has a faster model convergence speed compared with the traditional distributed multi-agent DQN algorithm, and the communication efficiency of the V2V link and the channel capacity of the V2I link are still guaranteed when the number of vehicle users increases.Keywords V2X; resource allocation; deep Q network; federated learning随着智能汽车和移动通信的发展,车联网(Vehicle to Everything,V2X)通信已被认为是支持安全和高效智能交通服务的关键技术[1].蜂窝车联网(cellular-Vehicle to Vehicle,C-V2X)也因其能够达到收稿日期2022-01-06作者简介李中捷(1974-),男,教授,博士,研究方向:无线通信网络,E-mail:*********************基金项目国家自然科学基金资助项目(61379028,61671483);中央高校基本科研业务费专项资金资助项目(CZY23027)第 43 卷中南民族大学学报(自然科学版)更好的覆盖率和服务质量(coverage and quality of service,QoS)已经被第五代汽车协会(5th Generation Automotive Association,5GAA)广泛开发和部署. 
第三代合作伙伴计划(the 3rd Generation Partnership Project,3GPP)在第16版推出了NR-V2X,作为C-V2X向5G的延伸和补充,其中定义了两种信息传输模式,即model-1(基站集中调度的资源分配方式)和model-2(终端自主式的资源分配方式).在model-1中,基站(base station,BS)在其覆盖范围内的每个传输期内为车辆分配资源,车辆在分配的资源上相互通信,而在model-2中,车辆可选择自己的资源[2].这项技术包括V2I通信与V2V通信两种重要的通信模式.V2I通信的重点是满足高数据率服务,V2V通信的重点则是保证安全关键信息的传递.为了满足V2X通信的严格要求,C-V2X需要共享频谱资源,从而实现V2I和V2V的同步通信.因此,如何在有限的频谱资源内减少干扰并且同时保证V2I链路的信道容量和V2V通信的可靠性,成为V2X通信亟待解决的一个重要问题.传统的优化方法已经被用来应对V2X通信中的资源分配和干扰管理[3-8].在文献[3]中,提出了一种资源分配策略,以提高基于设备到设备(Device-to-Device,D2D)的车辆通信框架的吞吐量.在文献[4]中,提出了一种无线电资源管理(Radio Resource Management,RRM)算法保证基于D2D的V2X系统的延迟和可靠性要求.基于类似的V2X框架,文献[5]提出了一个优化问题,在只考虑缓慢变化的大规模衰减信道的情况下为D2D车辆系统设计一个高效的频谱和功率分配算法.此外,在文献[7]和[8]中研究了排队延迟对吞吐量和可靠性的影响.针对动态的资源分配问题,由于车辆的高速移动性导致快速变化的信道以及多样化的服务需求,传统的优化算法难以利用数学方法进行精确建模并且会产生大量的计算开销,机器学习被认为是一种有前途的技术,可以解决传统优化方法中的挑战.强化学习(Reinforcement Learning,RL)作为一种可以与动态环境交互学习的机器学习技术广泛应用于无线通信领域,通过将RL与深度神经网络(Deep Neural Networks,DNN)相结合,深度强化学习(Deep Reinforcement Learning,DRL)已被广泛采用来解决V2X通信场景中更复杂的资源管理问题[9-12]. 在文献[10]中提出了一种DQN算法,通过一个综合框架控制复杂的资源来提高车辆网络的性能.在文[11]和[12]中分别使用单智能体和多智能体算法研究了考虑V2V链路延迟和V2I链路总信道容量的资源分配问题,文献[12]在提出的多智能体RL算法基础上,采用集中式学习和分布式实现来选择资源.但是集中式学习方案中V2V链路与BS交互会提高计算复杂度,违背车辆通信低延迟的原则;传统分布式学习方案中V2V链路不能获取全局信道状态信息(Channel State Information,CSI),可能导致模型无法达到最优效果.本文提出了FL-MADQN算法来解决上述问题.为了避免集中式决策方案上传车辆用户的相关信道数据,本文采用了联邦学习上传模型参数的思想,将传输成本较低的模型数据上传至中央服务器进行聚合.另一方面针对分布式决策方案V2V链路信息无法交互的问题,本文采用聚合后产生的全局模型反馈至V2V链路的方法实现信息交互.本文的其余部分安排如下:第一部分介绍系统模型以及问题描述,第二部分介绍FL-MADQN算法的详细设计和训练计划,第三部分介绍仿真结果与分析,第四部分得出最终结论.1 系统模型与问题描述1.1 系统模型本文考虑一个城市道路下的单小区V2X通信系统,如图1所示一个基站为多个车载用户提供服务,根据V2X通信不同的业务需求,系统内所有参与通信的车辆被分为m条V2I链路和n对V2V链路.其中V2I上行链路将高数据率信息从车辆传输至基站,V2V链路则用于车辆之间可靠的安全关键信息的互相传输.此通信系统基于mode-2通信模式,每V2Vm'图 1 NR-V2X车载通信系统Fig. 1 NR-V2X vehicular communication systems402第 3 期李中捷,等:基于FL-MADQN算法的NR-V2X车载通信频谱资源分配个车辆可以自主为其V2V链路选择频谱资源.假设每条V2I链路已经预先分配有一个固定发射功率的正交子频带,并且子频带的数目与V2I链路数m相等.此外,多对V2V链路可以共享同一个子频带,但每对V2V链路只能选择一个子频带进行通信.在单个时隙内,第m条子频带上的第n对V2V链路的信道增益g n[m]表示如下:gn[m]=αn h n[m],(1)其中αn和h n[m]分别表示包含每条通信链路的路径损耗和阴影的大尺度衰落和小尺度衰落.基于3GPP协议中关于V2X通信的信道准则,小尺度衰落服从零均值和单位方差的瑞利分布[15].其中第m个子频带上的第m条V2I上行链路和第n对V2V链路的SINR分别表示为:γv2im[m]=P v2imgm,B[m]σ2+∑nρn[m]P v2v n[m]g n,B[m],(2)γv2vn[m]=P v2vn[m]g n[m]σ2+I n[m],ρn[m]=1,(3)In[m]=P v2i m g m,n[m]+∑n'≠nρn'[m]P v2v n'[m]g n',n[m],(4)其中P v2v n,P v2i m和σ2分别表示第n对V2V链路和第m 条V2I链路的发射功率以及噪声功率,g m,B[m]为第m条V2I链路发射端对BS的第m个子频带上的信道增益,g n[m]为第m个子频带上的第n对V2V链路的信道增益,g n,B[m]为第n对V2V链路发射端对BS 的第m个子频带的信道增益.I n[m]为复用第m个子频带的第n对V2V受到的信道干扰,g n',n[m]为第n'对V2V链路发射端对第n对V2V链路接受端在第m 个子频带上的信道干扰,g m,n[m]为第m条V2I链路发射端对占用第m个子频带的第n对V2V链路接受端的信道干扰. 
ρn[m]以及ρn'[m]是一个二进制频谱分配,当其数值为1时,则表示该V2V链路占用第m个子频带,否则其数值为0.将公式(2),(3)代入香农公式C=W log2(1+ S/N)后可得出在第n个子频带上第m条V2I链路和第n对V2V链路的信道容量,表示如下:C v2im[m]=W log2(1+γv2i m[m]),(5)C v2vn[m]=W log2(1+γv2v n[m]),(6)其中W表示每条子频带的带宽.1.2 问题描述本文目标在传输功率约束和复用资源块(Resource Block,RB)约束下,联合优化V2V链路复用V2I链路RB决策和V2V链路传输功率的大小,保证所有V2V链路传输成功概率最大化的前提下,使得所有V2I链路信道总容量最大,问题定义为:max{∑m C v2i m[m]+∑m C v2v n[m]}s.t.ìíîïïïïïïγv2vn[m]≥SINR0,∀n∑m=1Mρn[m]≤1,∀nρn[m]ϵ{0,1},∀m,nP v2vn[m]≤P max.(7)其中SINR0表示V2V链路对通信时设定的信干噪比阈值,ρn[m]以及P v2v n[m]为V2V链路对复用频谱资源约束以及最大功率约束.2 多智能体联邦深度Q网络算法2.1 FL-MADQN算法考虑到现实场景下车辆计算能力较弱,本文结合联邦学习与深度强化学习,并对其进行改进形成FL-MADQN算法,如图2所示,该算法既能减少集中式资源分配所消耗的大量通信开销,又避免了分布式资源分配获取信息不全的缺点,实现了动态的资源分配.FL-MADQN算法步骤如下:(1)本地模型训练:BS与覆盖范围内的车辆节点建立通信并且生成V2V链路,初始化网络参数,车辆用户根据周围环境数据以及ϑavg j-1训练本地模型ϑw n j.(2)本地模型上传与聚合:车辆用户w n将训练一定轮次的本地模型ϑw n j发送给BS,BS将每对V2V 链路的网络模型进行聚合成全局模型.本文根据本地DQN模型训练奖励值r n进行加权平均,表示如下:ϑavgj←∑n=1N r n r1+r2+⋅⋅⋅+r Nϑw n j.(8)(3)基站反馈全局模型: V2V链路接收到全局模型,同时更新信道状态信息并利用该模型进行训练.该步骤不断重复直至V2V链路做出理想的决策,此过程中每对V2V链路协同训练但避免了状态信息传输的过程.2.2 DQN本文将资源分配问题建模为一个马尔科夫决策过程(Markov decision processes,MDP),其特征由403第 43 卷中南民族大学学报(自然科学版)状态空间、动作空间、转换概率和即时奖励组成.本文考虑的问题是连续状态空间和离散动作空间,在C -V2X 场景下,其MDP 模型中的状态与动作集合以及奖励函数如下所述.(1)状态集合用O (S t ,n )来表示第n 对V2V 链路在t 时间段下的C -V2X 环境下的状态集合,表示如下:O (S t ,n )={B n ,T n ,{I n [m ]}m ∈M ,{G n [m ]}m ∈M },(9)其中,B n ,T n 分别为第n 对V2V 链路在t 时间段下传输信息的剩余负载量以及完成传输信息的剩余时间,{I n [m ]}m ∈M 为第n 对V2V 链路复用子频带m 受到的干扰,{G n [m ]}m ∈M }为与第n 对V2V 链路相关的增益.(2)动作空间由于在解决资源分配问题时运用到DQN 网络,因此在设计动作空间时,V2V 链路的功率为P v 2v=[p 1,p 2,...,p r ]离散集,RB 数目为s ,所以动作空间可表示为:a n (t )={P v 2vn [m ],ρn [m ]},(10)根据公式(10)可知,a n (t )包含r ×s 组动作可供V2V 链路选择.(3)奖励函数设计在C -V2X 场景下资源分配的目标是保证V2V 链路的可靠性的同时使V2I 链路的信道容量最大化,所以奖励函数设计如下:r n (t )=c 1∑mC v 2i m [m ]+∑mc 2C v 2vn [m ]+c 3∑mψ(γv 2vn [m ]-SINR 0),(11)其中c 1,c 2为V2I 链路与V2V 链路对奖励函数的权重,c 3为V2V 链路的SINR 约束权重.2.3 联邦学习由于隐私和通信成本问题,大多数边缘设备都拒绝分享私人数据.另一方面,在设备端对本地数据进行模型训练和数据分析总是耗时且不精确的.而联邦学习作为一种分布式学习范式,边缘设备只将本地训练的模型参数上传至服务器,而原始数据保存在本地客户端,然后服务器聚合形成全局模型反馈回边缘设备,直到模型达到收敛.联邦学习包括以下4个步骤:(1)本地更新:每个客户端根据其原始数据在本地并行地更新学习模型.(2)权重上传:每个本地客户端将训练好的模型ϑw nj 的更新参数,发送到中央服务器.(3)全局汇总:中央服务器根据从本地客户端收到的参数计算出平均权重ϑavg .(4)权重反馈:服务器将更新后的参数广播给每个本地客户,以便进行下一次迭代.通过避免原始数据的传输,联邦学习解决了隐私问题,减少了通信开销,并将计算从中央服务器转移至本地客户端.在车联网通信场景下,本地客户端通过无线连接且数量巨大,将数据集中传输至BS 训练后反馈,这会消耗大量的通信资源并且无法保证较低延迟,利用联邦学习可以减少在BS 处理数据的计算开销,有效地分配有限的通信资源.avg图 2 FL -MADQN 算法流程Fig. 2 Flow chart of FL -MADQN algorithm404第 3 期李中捷,等:基于FL-MADQN算法的NR-V2X车载通信频谱资源分配2.4 复杂度分析FL-MADQN算法总时间复杂度为Ο(k),其中k为DQN网络算法计算规模,Ο(k)根据V2V链路对数量呈倍数增长,传统分布式算法的时间复杂度与其一致皆为Ο(k),但本文算法在训练一定轮数进行常数阶模型加权平均,其算法输入数值k大于传统分布式算法,导致其运算时间高于传统分布式算法,而集中式DQN算法在忽略对信道数据进行预处理的情况下,其总时间复杂度为Ο(nk),因此FL-MADQN算法的时间复杂度低于集中式DQN算法,而运算时间高于分布式DQN算法.3 仿真结果与分析本文使用3GPP-TR38.886协议中城市案例的参数建模,其中车辆基于空间泊松过程被投放在十字路口,BS位于中心,每条道路每个方向由两条车道组成,每个V2V发射端与其广播范围内最近的车辆建立V2V链路. LOS状态、路径损耗参数是基于协议中城市街道场景的计算公式确定,其中各相关参数如表1所示.仿真中采用的DQN是一个由输入层、隐藏层和输出层构成的全连接神经网络,同时随机初始化网络的参数.整个信道系统在TensorFlow中实现,使用的超参数如表2所示以及ReLu激活函数和Adam优化器. 为满足车辆通信延迟需求,在训练及测试过程中,本文将最大传输时间设置为100 ms.为了验证FL-MADQN算法的效率,我们在模拟实验中采用了三种对比算法.集中式DQN算法(C-DQN):所有车辆的位置以及信道信息都传输到BS,经过DQN训练后,为每对V2V链路分配最佳RB和发射功率.分布式DQN算法(D-MARL):在这种算法中的车辆用户只获取与其参与通信的相关信道信息,并且每条V2V链路独立选择其RB和基于本地DQN模型的发射功率.随机选择算法:V2V链路从候选RB池中随机选择RB与发射功率.图3给出了FL-MADQN算法以及对比算法的训练过程以及奖励值.在网络收敛速度方面,C-DQN 算法因其输入信息更全面,在1000次迭代后已经趋近平稳, D-MARL算法在2600次迭代后逐渐平稳. 
FL-MADQN算法则在1600次迭代后达到平稳状态,其使用加权平均方法,使奖励值平稳上升,能够获得理想的收敛速度.随着训练迭代次数的增加,可以看出三种算法的奖励皆呈上升趋势,并且在相同场景下FL-MADQN算法奖励值趋近于C-DQN算法,但其计算量比C-DQN算法更少,可以更快地进行信表 1 默认仿真参数Tab.1 DEFAULT SIMULATION PARAMETERS参数载波频率BS天线高度车辆天线高度RB数量单个RB带宽V2I链路数量V2V链路数量V2V链路路径损耗V2I链路路径损耗车辆速度最大传输功率噪声功率值5.9 GHz25 m1.5 m4180 kHz44,8,12,16PL=38.77+16.7log10(d3D)+18.2log10(f c)LOS:PL=28.0+40log10(d3D)+20log10(f c)-9log10((d'B P)2+(h BS-h UT)2)NLOS: PL=32.4+20log10(f c)+30log10(d3D)[10,15] m/s23 dBm-114 dBm表 2 训练超参Tab.2 Training super parameters参数训练轮次学习率折扣因子初始探索率最终探索率经验池大小目标网络更新step数联邦聚合step数奖励权重值30001.00e-030.7010.0120004001200.1,1,0.1图 3 FL-MADQN学习过程对比Fig.3 Comparison of FL-MADQN learning process405第 43 卷中南民族大学学报(自然科学版)息传输.图4和图5分别给出了V2V 链路数对于V2V 链路信道容量以及V2V 链路传输成功概率的影响对比,可以看出FL -MADQN 算法在这两个性能指标上都优于其他分布式算法.伴随着V2V 链路的数目增多,每个V2V 链路会受到更多干扰,也会有更多车辆处于非视距(Non Line of Sight ,NLOS )状态,这会导致分布式算法状态无法做出最佳动作选择,而FL -MADQN 算法可以依据网络模型达到平衡每条V2V 链路的效果,可以进一步观察到FL -MADQN 算法与C -DQN 算法性能接近.然而全局CSI 的获取以及集中调度V2V 链路对在大规模车联网中可行性较低,另一方面,FL -MADQN 算法仅基于本地观察做出决定,所以其更加符合车联网的需求.图6和图7分别给出了在4对V2V 链路的通信场景下,V2V 有效负载对V2V 链路信道容量以及V2V 链路传输成功概率的影响对比,随着传输有效负载的增大, V2I 链路性能都呈下降趋势,但是FL -MADQN 算法的V2I 链路性能优于其它分布式算法,而且FL -MADQN 算法在随着有效负载的增大,V2I 链路性能呈非常小的线性下降,这证明该算法对于传输负载有较好的适应性和优越性.4 总结针对NR -V2X 通信场景下的频谱资源分配问题,本文提出一种基于深度强化学习和联邦学习相结合的多智能体网络框架,并解释了设计的初衷,成功地提高了通信的性能.在与随机算法和其它网络模型对比后发现FL -MADQN 算法表现最平衡,特别是在计算量以及信息传输效率方面.另外实验表明,在车辆数目增多的情况下,FL -MADQN 算法相较于传统分布式算法仍能保证90%以上的V2V 链路传输成功率.然而在单小区内车辆数量变化对于整体资源分配方案的影响等方面还需要更为充分详实的研究,这也是接下来的研究重点.C v 2i M b p sV2V 链路数图 4 V2I 链路信道容量与V2V 链路数量对比Fig.4 Capacity of V2I versus the number of V2V pairs.V 2V 有效负载传输成功概率V2V 链路数图 5 V2V 链路通信的成功率与V2V 链路数量对比Fig.5 Satisfied rate of V2V versus the number of V2V pairs.负载(×1060 Bytes )C v 2i M b p s图 6 V2I 链路总容量与V2V 有效负载对比Fig.6 Sum capacity of V2I versus V2V payload.V 2V 有效负载传输成功概率负载(×1060 Bytes )图 7 V2V 链路通信的成功率与V2V 有效负载对比Fig.7 Satisfied rate of V2V versus V2V payload.406第 3 期李中捷,等:基于FL-MADQN算法的NR-V2X车载通信频谱资源分配参考文献[1]WANG J, LIU J, KATO N. Networking and communications in autonomous driving: A survey[J]. IEEE CommunicationsSurveys & Tutorials, 2019, 21(2):1243-1274.[2]3GPP. 3GPP TR 21.916:Technical specification group services and system aspects;release16 description;summary of Rel-16 work items (release 16)[R]. SophiaAntipolis, France, 2020.[3]REN Y,LIU F,LIU Z,et al. Power control in D2D-based vehicular communication networks[J]. IEEETransactions on Vehicular Technology,2015,64(12):5547-5562.[4]SUN W,STROM E,BRANNSTROM F,et al. Radio resource management for D2D-based V2V communication[J]. IEEE Transactions on Vehicular Technology, 2016,65(8): 6636-6650.[5]LIANG L, LI G Y, XU W. Resource allocation for D2D-enabled vehicular communications[J]. IEEE Transactionson Communications, 2017, 65(7):3186-3197.[6]LIANG L,XIE S,LI G Y,et al. Graph-based resource sharing in vehicular communication[J]. IEEE Transactionson Wireless Communications, 2018, 17(7): 4579-4592.[7]MEI J,ZHENG K,ZHAO L,et al. A latency and reliability guaranteed resource allocation scheme for LTEV2V communication systems[J]. IEEE Transactions onWireless Communications, 2018, 17(6): 3850-3860.[8]LIU C F,BENNIS M. Ultra-reliable and low-latency vehicular transmission: An extreme value theory approach[J]. IEEE Communications Letters, 2018, 22(6):1292-1295.[9]VOLODYMYR M,KORAY K,DAVID,et al. Human-level control through deep reinforcement learning[J].Nature, 2019, 518(7540):529-533.[10]HE Y,ZHAO N,YIN H. Integrated networking,caching, and computing for connected vehicles: A deepreinforcement learning approach[J]. 
IEEE Transactionson Vehicular Technology, 2018, 67(1):44-55.[11]HE Y,LI G Y,JUANG B F. Deep reinforcement learning based resource allocation for V2V communications[J]. IEEE Transactions on Vehicular Technology,2019,68(4): 3163-3173.[12]LIANG L,YE H,LI G Y. Spectrum sharing in vehicular networks based on multi-agent reinforcementlearning[J]. IEEE Journal on Selected Areas inCommunications, 2019, 37(10):2282-2292.[13]ZHANG X, PENG M, YAN S, et al. Deep reinforcement learning based mode selection and resource allocationfor cellular V2X communications[J]. IEEE Internet ofThings Journal, 2019, 7(7):6380-6391.[14]WANG L,YE H,LIANG L,et al. Learn to compress CSI and allocate resources in vehicular networks[J].IEEE Transactions on Communications, 2020, 68(6):3640-3653.[15]3GPP TR 38.886:Technical specification group radio access network;V2X services based on NR;userequipment (UE)radio transmission and reception(Release 16)[S]. 2021.(责编&校对雷建云)407。
Robustness analysis and control of complex constrained systems
Objective evaluations:

Professor Xinghuo Yu of RMIT University, Australia (IEEE Fellow, President of the IEEE Industrial Electronics Society) commented in a paper (IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 47, pp. 783-793, 2017): "A novel Lyapunov functional is proposed for sampled control in [35], which greatly reduces the conservatism of the existing results". Reference [35] is representative paper [1].

Professor Zongben Xu of Xi'an Jiaotong University (Member of the Chinese Academy of Sciences) noted in a paper (IEEE Transactions on Cybernetics, vol. 46, pp. 1189-1201, 2016): "Markov chain has achieved great success in many machine learning tasks such as speech recognition [18], biogeography-based optimization [19], DNA sequence analysis, neural networks synchronization [20] ...". Reference [20] is representative paper [2].

Professor Wei Xing Zheng of Western Sydney University, Australia (IEEE Fellow) commented in a paper (IEEE Transactions on Neural Networks and Learning Systems, vol. 26, pp. 2346-2356, 2015): "More recently, a study on filtering with partial information on mode jumps has been carried out for a class of Markov jump systems in [18]. A nonsynchronous filter is designed with nonstationary modes transition that is capable of capturing the different degree of nonsynchronous jumps between system modes and filters. It has been verified that the designed filter is effective, outperforming the mode-independent filter in achieving a better filtering performance index in the energy-to-peak sense". Reference [18] is representative paper [5].

Professor Huaguang Zhang of Northeastern University, China (IEEE Fellow) commented in a paper (IEEE Transactions on Neural Networks and Learning Systems, vol. 28, pp. 740-752, 2017): "the less conservatism synchronization criteria for neural networks have been obtained in [30]". Reference [30] is representative paper [7].

Professor Hamid Reza Karimi of Politecnico di Milano, Italy (Clarivate Highly Cited Researcher) commented in a paper (IEEE Access, vol. 5, 2017): "a novel TDLF was proposed in [21] to reduce conservativeness and enlarge the allowable sampling". Reference [21] is representative paper [1].
Daniel S.Bernstein,Shlomo Zilberstein,and Neil Immerman Department of Computer ScienceUniversity of MassachusettsAmherst,Massachusetts01003bern,shlomo,immerman@AbstractPlanning for distributed agents with partialstate information is considered from a decision-theoretic perspective.We describe generaliza-tions of both the MDP and POMDP modelsthat allow for decentralized control.For even asmall number of agents,thefinite-horizon prob-lems corresponding to both of our models arecomplete for nondeterministic exponential time.These complexity results illustrate a fundamen-tal difference between centralized and decentral-ized control of Markov processes.In contrast tothe MDP and POMDP problems,the problemswe consider provably do not admit polynomial-time algorithms and most likely require doublyexponential time to solve in the worst case.Wehave thus provided mathematical evidence corre-sponding to the intuition that decentralized plan-ning problems cannot easily be reduced to cen-tralized problems and solved exactly using estab-lished techniques.1IntroductionAmong researchers in artificial intelligence,there has been growing interest in problems with multiple distributed agents working to achieve a common goal(Grosz&Kraus, 1996;Lesser,1998;desJardins et al.,1999;Durfee,1999; Stone&Veloso,1999).In many of these problems,intera-gent communication is costly or impossible.For instance, consider two robots cooperating to push a box(Mataric, 1998).Communication between the robots may take time that could otherwise be spent performing physical actions. Thus,it may be suboptimal for the robots to communi-cate frequently.A planner is faced with the difficult task of deciding what each robot should do in between com-munications,when it only has access to its own sensory information.Other problems of planning for distributed agents with limited communication include maximizing the throughput of a multiple access broadcast channel(Ooi& Wornell,1996)and coordinating multiple spacecraft on a mission together(Estlin et al.,1999).We are interested in the question of whether these planning problems are computationally harder to solve than problems that involve planning for a single agent or multiple agents with access to the exact same information.We focus on centralized planning for distributed agents, with the Markov decision process(MDP)framework as the basis for our model of agents interacting with an envi-ronment.A partially observable Markov decision process (POMDP)is a generalization of an MDP in which an agent must base its decisions on incomplete information about the state of the environment(White,1993).We extend the POMDP model to allow for multiple distributed agents to each receive local observations and base their decisions on these observations.The state transitions and expected rewards depend on the actions of all of the agents.We call this a decentralized partially observable Markov de-cision process(DEC-POMDP).An interesting special case of a DEC-POMDP satisfies the assumption that at any time step the state is uniquely determined from the current set of observations of the agents.This is denoted a decen-tralized Markov decision process(DEC-MDP).The MDP, POMDP,and DEC-MDP can all be viewed as special cases of the DEC-POMDP.The relationships among the models are shown in Figure1.There has been some related work in AI.Boutilier(1999) studies multi-agent Markov decision processes(MMDPs), but in this model,the agents all have access to the same in-formation.In the framework we describe,this 
assumption is not made. Peshkin et al. (2000) use essentially the DEC-POMDP model (although they refer to it as a partially observable identical payoff stochastic game (POIPSG)) and discuss algorithms for obtaining approximate solutions to the corresponding optimization problem. The models that we study also exist in the control theory literature (Ooi et al., 1997; Aicardi et al., 1987). However, the computational complexity inherent in these models has not been studied. One closely related piece of work is that of Tsitsiklis and Athans (1985), in which the complexity of nonsequential decentralized decision problems is studied.

Figure 1: The relationships among the models.

We discuss the computational complexity of finding optimal policies for the finite-horizon versions of these problems. It is known that solving an MDP is P-complete and that solving a POMDP is PSPACE-complete (Papadimitriou & Tsitsiklis, 1987). We show that solving a DEC-POMDP with a constant number, m >= 2, of agents is complete for the complexity class nondeterministic exponential time (NEXP). Furthermore, solving a DEC-MDP with a constant number, m >= 3, of agents is NEXP-complete. This has a few consequences. One is that these problems provably do not admit polynomial-time algorithms. This trait is not shared by the MDP problems nor the POMDP problems. Another consequence is that any algorithm for solving either problem will most likely take doubly exponential time in the worst case. In contrast, the exact algorithms for finite-horizon POMDPs take "only" exponential time in the worst case. Thus, our results shed light on the fundamental differences between centralized and decentralized control of Markov decision processes. We now have mathematical evidence corresponding to the intuition that decentralized planning problems are more difficult to solve than their centralized counterparts. These results can steer researchers away from trying to find easy reductions from the decentralized problems to centralized ones and toward completely different approaches.

A precise categorization of the two-agent DEC-MDP problem presents an interesting mathematical challenge. The extent of our present knowledge is that the problem is PSPACE-hard and is contained in NEXP.

2 Centralized Models

A Markov decision process (MDP) models an agent acting in a stochastic environment to maximize its long-term reward. The type of MDP that we consider contains a finite set S of states, with s0 in S as the start state. For each state s in S, A_s is a finite set of actions available to the agent.
P is the table of transition probabilities, where P(s' | s, a) is the probability of a transition to state s' given that the agent performed action a in state s. R is the reward function, where R(s, a) is the expected reward received by the agent given that it chose action a in state s.

There are several different ways to define "long-term reward" and thus several different measures of optimality. In this paper, we focus on finite-horizon optimality, for which the aim is to maximize the expected sum of rewards received over T time steps. Formally, the agent should maximize E[ sum_{t=0}^{T-1} r_t ], where r_t is the reward received at time step t. A policy for a finite-horizon MDP is a mapping from each state and time to an action. This is called a non-stationary policy. The decision problem corresponding to a finite-horizon MDP is as follows: Given an MDP M, a positive integer T, and an integer K, is there a policy that yields total reward at least K?

An MDP can be generalized so that the agent does not necessarily observe the exact state of the environment at each time step. This is called a partially observable Markov decision process (POMDP). A POMDP has a state set S, a start state s0, a table of transition probabilities, and a reward function, just as an MDP does. Additionally, it contains a finite set Omega of observations, and a table O of observation probabilities, where O(o | a, s') is the probability that o is observed, given that action a was taken and led to state s'. For each observation o, A_o is a finite set of actions available to the agent. A policy is now a mapping from histories of observations o_1, ..., o_t to actions in A_{o_t}. The decision problem for a POMDP is stated in exactly the same way as for an MDP.

3 Decentralized Models

A decentralized partially observable Markov decision process (DEC-POMDP) is a generalization of a POMDP to allow for distributed control by m agents that may not be able to observe the exact state. A DEC-POMDP contains a finite set S of states, with s0 in S as the start state. The transition probabilities P(s' | s, a_1, ..., a_m) and expected rewards R(s, a_1, ..., a_m) depend on the actions of all agents. Omega_i is a finite set of observations for agent i, and O is a table of observation probabilities, where O(o_1, ..., o_m | a_1, ..., a_m, s') is the probability that o_1, ..., o_m are observed by agents 1, ..., m respectively, given that the action tuple (a_1, ..., a_m) was taken and led to state s'. Each agent i has a set A_{o_i} of actions for each observation o_i. Notice that this model reduces to a POMDP in the one-agent case.

For each action tuple (a_1, ..., a_m) and state s', consider the set of observation tuples that have a nonzero chance of occurring given that the action tuple was taken and led to state s'. To form a decentralized Markov decision process (DEC-MDP), we add the requirement that, for each such action tuple and state s' and for each observation tuple in the corresponding set, the state s' is uniquely determined by (o_1, ..., o_m). In the one-agent case, this model is essentially the same as an MDP.

We define a local policy, delta_i, to be a mapping from local histories of observations o_{i1}, ..., o_{it} to actions a_i in A_{o_{it}}. A joint policy, delta = (delta_1, ..., delta_m), is defined to be a tuple of local policies. We wish to find a joint policy that maximizes the total expected return over the finite horizon. The decision problem is stated as follows: Given a DEC-POMDP, a positive integer T, and an integer K, is there a joint policy that yields total reward at least K? Let DEC-POMDP_m and DEC-MDP_m denote the decision problems for the m-agent DEC-POMDP and the m-agent DEC-MDP, respectively.
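For concreteness, the brute-force sketch below (not from the paper; the dictionary-based encoding and all identifiers such as P, R, O, Obs1 are assumptions made for illustration, and action sets are taken to be independent of the last observation) evaluates every deterministic joint policy of a tiny two-agent DEC-POMDP and checks whether any achieves expected reward at least K over horizon T.

import itertools

# Illustrative brute-force check of the two-agent DEC-POMDP decision problem.
# Assumed encoding (not the paper's notation):
#   P[s][(a1, a2)]      -> dict {s_next: probability}
#   R[s][(a1, a2)]      -> expected immediate reward
#   O[(a1, a2, s_next)] -> dict {(o1, o2): probability}
# A local policy is a dict mapping a tuple of past observations to an action.

def evaluate(policy1, policy2, P, R, O, s0, T):
    """Expected total reward of a fixed joint policy over horizon T."""
    def value(s, h1, h2, t):
        if t == T:
            return 0.0
        a1, a2 = policy1[h1], policy2[h2]
        v = R[s][(a1, a2)]
        for s_next, p in P[s][(a1, a2)].items():
            for (o1, o2), q in O[(a1, a2, s_next)].items():
                v += p * q * value(s_next, h1 + (o1,), h2 + (o2,), t + 1)
        return v
    return value(s0, (), (), 0)

def exists_good_joint_policy(A1, A2, Obs1, Obs2, P, R, O, s0, T, K):
    """Is there a joint policy with expected reward at least K?"""
    def histories(Obs):
        # all observation histories of length 0, 1, ..., T-1
        return [h for t in range(T) for h in itertools.product(Obs, repeat=t)]
    H1, H2 = histories(Obs1), histories(Obs2)
    for acts1 in itertools.product(A1, repeat=len(H1)):
        pol1 = dict(zip(H1, acts1))
        for acts2 in itertools.product(A2, repeat=len(H2)):
            pol2 = dict(zip(H2, acts2))
            if evaluate(pol1, pol2, P, R, O, s0, T) >= K:
                return True
    return False

Even for small state and observation sets, the number of candidate joint policies in this enumeration is exponential in the number of observation histories, which is itself exponential in T; this doubly exponential blow-up is the obstacle analyzed in the next section.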
4 Complexity Results

It is necessary to consider only problems for which T <= |S|. If we place no restrictions on T, then the upper bounds do not necessarily hold. Also, we assume that each of the elements of the tables for the transition probabilities and expected rewards can be represented with a constant number of bits. With these restrictions, it was shown in (Papadimitriou & Tsitsiklis, 1987) that the decision problem for an MDP is P-complete. In the same paper, the authors showed that the decision problem for a POMDP is PSPACE-complete and thus probably does not admit a polynomial-time algorithm. We prove that for all m >= 2, DEC-POMDP_m is NEXP-complete, and for all m >= 3, DEC-MDP_m is NEXP-complete, where NEXP = NTIME(2^(n^c)) (Papadimitriou, 1994). Since P is not equal to NEXP, we can be certain that there does not exist a polynomial-time algorithm for either problem. Moreover, there probably is not even an exponential-time algorithm that solves either problem.

For our reduction, we use a problem called TILING (Papadimitriou, 1994), which is described as follows: We are given a set of square tile types tile_0, ..., tile_k, together with two relations H and V over pairs of tile types (the horizontal and vertical compatibility relations, respectively). We are also given an integer n in binary. A tiling is a function f mapping each grid position (x, y), with 0 <= x, y <= n - 1, to a tile type. A tiling f is consistent if and only if (a) f(0, 0) = tile_0, and (b) for all x, y, (f(x, y), f(x + 1, y)) is in H and (f(x, y), f(x, y + 1)) is in V. The decision problem is to tell, given the tile types, H, V, and n, whether a consistent tiling exists. It is known that TILING is NEXP-complete.
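Verifying a candidate tiling is straightforward even though finding one is hard. The helper below is an illustrative sketch, not part of the reduction; the parameter names f, n, tile0, H, V are assumptions, with the tiling given as a dictionary from grid positions to tile types.

def is_consistent(f, n, tile0, H, V):
    """Check conditions (a) and (b) for a candidate tiling.
    f     : dict mapping (x, y) with 0 <= x, y < n to a tile type
    tile0 : the designated tile that must appear at the origin
    H, V  : sets of pairs of tile types (horizontal / vertical compatibility)"""
    if f[(0, 0)] != tile0:                                           # condition (a)
        return False
    for x in range(n):
        for y in range(n):
            if x + 1 < n and (f[(x, y)], f[(x + 1, y)]) not in H:    # condition (b), horizontal
                return False
            if y + 1 < n and (f[(x, y)], f[(x, y + 1)]) not in V:    # condition (b), vertical
                return False
    return True

Note that even this check takes time proportional to n^2, which is already exponential in the length of the binary encoding of n; the grid being exponentially large relative to the input size is what places TILING in NEXP rather than NP.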
Theorem 1 For all m >= 2, DEC-POMDP_m is NEXP-complete.

Proof. First, we will show that the problem is in NEXP. We can guess a joint policy and write it down in exponential time. This is because a joint policy consists of m mappings from local histories to actions, and since T <= |S|, all histories have length less than |S|. A DEC-POMDP together with a joint policy can be viewed as a POMDP together with a policy, where the observations in the POMDP correspond to the observation tuples in the DEC-POMDP. In exponential time, each of the exponentially many possible sequences of observations can be converted into belief states. The transition probabilities and expected rewards for the corresponding "belief MDP" can be computed in exponential time (Kaelbling et al., 1998). It is possible to use dynamic programming to determine whether the policy yields expected reward at least K in this belief MDP. This takes at most exponential time.

Now we show that the problem is NEXP-hard. For simplicity, we consider only the two-agent case. Clearly, the problem with more agents can be no easier. We are given an arbitrary instance of TILING. From it, we construct a DEC-POMDP such that the existence of a joint policy that yields a reward of at least zero is equivalent to the existence of a consistent tiling in the original problem. Furthermore, T <= |S| in the DEC-POMDP that is constructed. Intuitively, a local policy in our DEC-POMDP corresponds to a mapping from tile positions to tile types, i.e., a tiling, and thus a joint policy corresponds to a pair of tilings. The process works as follows: In the position choice phase, two tile positions are randomly "chosen" by the environment. Then, at the tile choice step, each agent sees a different position and must use its policy to determine a tile to be placed in that position. Based on information about where the two positions are in relation to each other, the environment checks whether the tile types placed in the two positions could be part of one consistent tiling. Only if the necessary conditions hold do the agents obtain a nonnegative reward. It turns out that the agents can obtain a nonnegative expected reward if and only if the conditions hold for all pairs of positions the environment can choose, i.e., if and only if there exists a consistent tiling.

We now present the construction in detail. During the position choice phase, each agent only has one action available to it, and a reward of zero is obtained at each step. The states and the transition probability matrix comprise the nontrivial aspect of this phase. Recall that this phase intuitively represents the choosing of two tile positions. First, let the two tile positions be denoted (x1, y1) and (x2, y2), where x1, y1, x2, y2 are in {0, ..., n - 1}. There are 4 log n steps in this phase, and each step is devoted to the choosing of one bit of one of the numbers. (We assume that n is a power of two. It is straightforward to modify the proof to deal with the more general case.) The order in which the bits are chosen is important, and it is as follows: the bits of x1 and x2 are chosen from least significant up to most significant, alternating between the two numbers at each step; then the bits of y1 and y2 are chosen in the same way. As the bits of the numbers are being determined, information about the relationships between the numbers is being recorded in the state.
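To illustrate this ordering, the sketch below reconstructs the two positions from the environment's 4 log n coin flips. It is an illustration only, not part of the construction, and it assumes in particular that agent one's coordinate is chosen first within each alternating pair.

def decode_positions(bits, n):
    """Recover (x1, y1) and (x2, y2) from the environment's coin flips.
    bits has length 4*log2(n): first the interleaved bits of x1 and x2,
    least significant first, then the interleaved bits of y1 and y2."""
    log_n = n.bit_length() - 1          # n is assumed to be a power of two
    x1 = x2 = y1 = y2 = 0
    for i in range(log_n):
        x1 |= bits[2 * i] << i          # bit i of x1
        x2 |= bits[2 * i + 1] << i      # bit i of x2
    for i in range(log_n):
        y1 |= bits[2 * log_n + 2 * i] << i       # bit i of y1
        y2 |= bits[2 * log_n + 2 * i + 1] << i   # bit i of y2
    return (x1, y1), (x2, y2)

The construction itself never stores these numbers explicitly, since that would require exponentially many states; only the six summary components described next are tracked.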
How we express all of this as a Markov process is explained below.

Each state has six components, and each component represents a necessary piece of information about the two tile positions being chosen. We describe how each of the components changes with time. A time step in our process can be viewed as having two parts, which we refer to as the stochastic part and the deterministic part. During the stochastic part, the environment "flips a coin" to choose either the number 0 or the number 1, each with equal probability. After this choice is made, the change in each component of the state can be described by a deterministic finite automaton that takes as input a string of 0's and 1's (the environment's coin flips). The semantics of the components, along with their associated automata, are described below:

1) Bit Chosen in the Last Step. This component of the state says whether 0 or 1 was just chosen by the environment. The corresponding automaton consists of only two states.

2) Number of Bits Chosen So Far. This component simply counts up to 4 log n, in order to determine when the position choice phase should end. Its automaton consists of 4 log n states.

3) Equal Tile Positions. After the 4 log n steps, this component tells us whether the two tile positions chosen are equal or not. For this automaton, along with the following three, we need to have a notion of an accept state; each of these four automata is the DFA of a regular expression over the environment's coin flips. The DFA for this component, on an input of length 4 log n, ends in an accept state if and only if (x1, y1) = (x2, y2).

4) Upper Left Tile Position. This component is used to check whether the first tile position is the upper left corner of the grid. The corresponding DFA, on an input of length 4 log n, ends in an accept state if and only if (x1, y1) = (0, 0).

5) Horizontally Adjacent Tile Positions. This component is used to check whether the first tile position is directly to the left of the second one. The corresponding DFA, on an input of length 4 log n, ends in an accept state if and only if x2 = x1 + 1 and y1 = y2.

6) Vertically Adjacent Tile Positions. This component is used to check whether the first tile position is directly above the second one. The corresponding DFA, on an input of length 4 log n, ends in an accept state if and only if y2 = y1 + 1 and x1 = x2.

So far we have described the six automata that determine how each of the six components of the state evolves based on input (0 or 1) from the environment. We can take the cross product of these six automata to get a new automaton that is only polynomially bigger and describes how the entire state evolves based on the sequence of 0's and 1's chosen by the environment. This automaton, along with the environment's "coin flips," corresponds to a Markov process. The number of states of the process is polylogarithmic in n, and hence polynomial in the size of the TILING instance. The start state is a tuple of the start states of the six automata. The table of transition probabilities for this process can be constructed in time polylogarithmic in n.

We have described the states, actions, state transitions, and rewards for the position choice phase, and we now describe the observation function. In this DEC-POMDP, the observations are uniquely determined from the state. For the states after which a bit of x1 or y1 has been chosen, agent one observes the first component of the state, while agent two observes a dummy observation. The reverse is true for the states after which a bit of x2 or y2 has been chosen. Intuitively, agent one "sees" only (x1, y1), and agent two "sees" only (x2, y2). When the second component of the state reaches its limit, the tile positions have been chosen, and the last four components of the state contain information about the tile positions and how they are related. Of course, the exact tile positions are not recorded in the state, as this would require exponentially many states. This marks the end of the position choice phase.

In the next step, which we call the tile choice step, each agent has k + 1 actions available to it, corresponding to the tile types tile_0, ..., tile_k. We denote agent one's choice t1 and agent two's choice t2. No matter which actions are chosen, the state transitions deterministically to some final state. The reward function for this step is the nontrivial part. After the actions are chosen, the following statements are checked for validity:

1) If (x1, y1) = (x2, y2), then t1 = t2.
2) If (x1, y1) = (0, 0), then t1 = tile_0.
3) If x2 = x1 + 1 and y1 = y2, then (t1, t2) is in H.
4) If y2 = y1 + 1 and x1 = x2, then (t1, t2) is in V.

If all of these are true, then a reward of 0 is received. Otherwise, a reward of -1 is received. This reward function can be computed from the TILING instance in polynomial time. To complete the construction, the horizon is set to exactly the number of steps it takes the process to reach the tile choice step, which is fewer than the number of states.
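Written out directly, the reward at the tile choice step looks as follows. This helper is only an illustration of the four conditions: it takes the positions and the relations H and V as explicit arguments, whereas the constructed DEC-POMDP only keeps the four relational bits computed by the automata.

def tile_choice_reward(pos1, pos2, t1, t2, tile0, H, V):
    """Reward at the tile choice step: 0 if the two tile choices are locally
    consistent with a single tiling, -1 otherwise."""
    (x1, y1), (x2, y2) = pos1, pos2
    if pos1 == pos2 and t1 != t2:                        # condition 1: equal positions
        return -1
    if pos1 == (0, 0) and t1 != tile0:                   # condition 2: upper left corner
        return -1
    if x2 == x1 + 1 and y1 == y2 and (t1, t2) not in H:  # condition 3: horizontally adjacent
        return -1
    if y2 == y1 + 1 and x1 == x2 and (t1, t2) not in V:  # condition 4: vertically adjacent
        return -1
    return 0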
Now we argue that the expected reward is zero if and only if there exists a consistent tiling. First, suppose a consistent tiling exists. This tiling corresponds to a local policy for an agent. If each of the two agents follows this policy, then no matter which two positions are chosen by the environment, the agents choose tile types for those positions so that the conditions checked at the end evaluate to true. Thus, no matter what sequence of 0's and 1's the environment chooses, the agents receive a reward of zero. Hence, the expected reward for the agents is zero.

For the converse, suppose the expected reward is zero. Then the reward is zero no matter what sequence of 0's and 1's the environment chooses, i.e., no matter which two tile positions are chosen. This implies that the four conditions mentioned above are satisfied for any two tile positions that are chosen. The first condition ensures that, for all pairs of tile positions, if the positions are equal, then the tile types chosen are the same. This implies that the two agents' tilings are exactly the same. The last three conditions ensure that this common tiling is consistent.

Theorem 2 For all m >= 3, DEC-MDP_m is NEXP-complete.

Proof. (Sketch) Inclusion in NEXP follows from the fact that a DEC-MDP is a special case of a DEC-POMDP. For NEXP-hardness, we can reduce a DEC-POMDP with two agents to a DEC-MDP with three agents. We simply add a third agent to the DEC-POMDP and impose the following requirement: the state is uniquely determined by just the third agent's observation, but the third agent always has just one action and cannot affect the state transitions or rewards received. It is clear that the new problem qualifies as a DEC-MDP and is essentially the same as the original DEC-POMDP.

The reduction described above can also be used to construct a two-agent DEC-MDP from a POMDP and hence show that DEC-MDP_2 is PSPACE-hard. However, this technique is not powerful enough to prove NEXP-hardness of the problem. In fact, the question of whether DEC-MDP_2 is NEXP-hard remains open. Note that in the reduction in the proof of Theorem 1, the observation function is such that there are some parts of the state that are hidden from both agents. This needs to somehow be avoided in order to reduce to a two-agent DEC-MDP. A simpler task may actually be to derive a better upper bound for the problem. For example, it may be possible that DEC-MDP_2 is in co-NEXP, where co-NEXP is the class of problems whose complements are in NEXP.
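The reduction in this proof sketch can be phrased as a mechanical transformation. The sketch below reuses the illustrative dictionary encoding from the earlier example (all identifiers are assumptions, not the paper's notation): it adds a state-revealing third agent with a single dummy action, so the joint observation determines the state while the first two agents' choices are unaffected.

def add_state_revealing_agent(S, A1, A2, Obs1, Obs2, P, R, O, s0):
    """Turn a two-agent DEC-POMDP into a three-agent DEC-MDP.
    Agent 3 observes the exact successor state but has only one action,
    so it cannot influence transitions or rewards."""
    NOOP = "noop"
    A3 = [NOOP]
    Obs3 = list(S)                       # agent 3's observation set is the state set
    P3 = {s: {(a1, a2, NOOP): dist for (a1, a2), dist in P[s].items()} for s in S}
    R3 = {s: {(a1, a2, NOOP): r for (a1, a2), r in R[s].items()} for s in S}
    O3 = {}
    for (a1, a2, s_next), dist in O.items():
        # the joint observation now includes the successor state itself
        O3[(a1, a2, NOOP, s_next)] = {(o1, o2, s_next): p for (o1, o2), p in dist.items()}
    return S, A1, A2, A3, Obs1, Obs2, Obs3, P3, R3, O3, s0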
References

Aicardi, M., Franco, D. & Minciardi, R. (1987). Decentralized optimal control of Markov chains with a common past information set. IEEE Transactions on Automatic Control, AC-32(11), 1028-1031.

Babai, L., Fortnow, L. & Lund, C. (1991). Non-deterministic exponential time has two-prover interactive protocols. Computational Complexity, 1, 3-40.

Boutilier, C. (1999). Multiagent systems: Challenges and opportunities for decision-theoretic planning. AI Magazine, 20(4), 35-43.

Cassandra, A., Littman, M. L. & Zhang, N. L. (1997). Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proceedings of the Thirteenth Annual Conference on Uncertainty in Artificial Intelligence (pp. 54-61).

desJardins, M. E., Durfee, E. H., Ortiz, C. L. & Wolverton, M. J. (1999). A survey of research in distributed, continual planning. AI Magazine, 20(4), 13-22.

Durfee, E. H. (1999). Distributed problem solving and planning. In Multiagent Systems (pp. 121-164). Cambridge, MA: The MIT Press.

Estlin, T., Gray, A., Mann, T., Rabideau, G., Castaño, R., Chien, S. & Mjolsness, E. (1999). An integrated system for multi-rover scientific exploration. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (pp. 541-548).

Grosz, B. & Kraus, S. (1996). Collaborative plans for complex group action. Artificial Intelligence, 86(2), 269-357.

Hansen, E. (1998). Solving POMDPs by searching in policy space. In Proceedings of the Fourteenth Annual Conference on Uncertainty in Artificial Intelligence (pp. 211-219).

Jaakkola, T., Singh, S. P. & Jordan, M. I. (1995). Reinforcement learning algorithm for partially observable Markov decision problems. In Proceedings of Advances in Neural Information Processing Systems 7 (pp. 345-352).

Kaelbling, L. P., Littman, M. L. & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2), 99-134.

Lesser, V. R. (1998). Reflections on the nature of multi-agent coordination and its implications for an agent architecture. Autonomous Agents and Multi-Agent Systems, 1, 89-111.

Madani, O., Hanks, S. & Condon, A. (1999). On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision process problems. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (pp. 541-548).

Mataric, M. J. (1998). Using communication to reduce locality in distributed multi-agent learning. Journal of Experimental and Theoretical Artificial Intelligence, 10(3), 357-369.

Ooi, J. M., Verbout, S. M., Ludwig, J. T. & Wornell, G. W. (1997). A separation theorem for periodic sharing information patterns in decentralized control. IEEE Transactions on Automatic Control, 42(11), 1546-1550.

Ooi, J. M. & Wornell, G. W. (1996). Decentralized control of a multiple access broadcast channel: Performance bounds. In Proceedings of the 35th Conference on Decision and Control (pp. 293-298).

Papadimitriou, C. H. (1994). Computational Complexity. Reading, MA: Addison-Wesley.

Papadimitriou, C. H. & Tsitsiklis, J. N. (1987). The complexity of Markov decision processes. Mathematics of Operations Research, 12(3), 441-450.

Peshkin, L., Kim, K.-E., Meuleau, N. & Kaelbling, L. P. (2000). Learning to cooperate via policy search. In Proceedings of the Sixteenth International Conference on Uncertainty in Artificial Intelligence.

Peterson, G. L. & Reif, J. R. (1979). Multiple-person alternation. In 20th Annual Symposium on Foundations of Computer Science (pp. 348-363).

Stone, P. & Veloso, M. (1999). Task decomposition, dynamic role assignment, and low-bandwidth communication for real-time strategic teamwork. Artificial Intelligence, 110(2), 241-273.

Tsitsiklis, J. N. & Athans, M. (1985). On the complexity of decentralized decision making and detection problems. IEEE Transactions on Automatic Control, AC-30(5), 440-446.

White, D. J. (1993). Markov Decision Processes. West Sussex, England: John Wiley & Sons.