Chinese Title: | 面向微分博弈的动态系统策略优化方法研究 (Research on Policy Optimization Methods for Dynamical Systems of Differential Games) |
Name: | |
Confidentiality Level: | Open |
Thesis Language: | chi |
Discipline Code: | 071102 |
Discipline: | |
Student Type: | Doctoral |
Degree: | Doctor of Science |
Degree Type: | |
Degree Year: | 2024 |
Campus: | |
School: | |
Research Direction: | Adaptive Dynamic Programming and Reinforcement Learning |
First Supervisor: | |
First Supervisor's Affiliation: | |
Submission Date: | 2024-06-17 |
Defense Date: | 2024-05-24 |
Foreign Title: | Research on Policy Optimization Methods for Dynamical Systems of Differential Games |
Chinese Keywords: | |
Foreign Keywords: | Optimal Control; Adaptive Dynamic Programming; Differential Games; Reinforcement Learning; Intelligent Control |
Chinese Abstract: |
Systems science is a comprehensive interdisciplinary discipline that takes systems in different fields as its research objects and studies the universal relationships among the behavior, structure, environment, and function of various systems, as well as the general laws of their evolution and regulation. As one of the key areas of systems science, cybernetics is devoted to the modeling, optimization, and regulation of dynamical systems, and to this day its ideas and methods continue to exert a broad influence on many fields of natural science and social engineering. Performance optimization of dynamical systems, i.e., optimal control, has attracted wide attention from researchers because it can improve production efficiency, reduce costs, and enhance system stability, and it has been widely applied in systems engineering, management and decision-making, the social economy, and other fields. Against the background of the information age, more and more dynamical systems contain multiple or even large numbers of regulating units or subsystems, exhibiting diverse interaction relationships and decision-making behaviors. However, most existing research on optimal control considers only a single decision maker and ignores the influence of strategic interactions among decision makers, so the decision makers' optimal control policies cannot always achieve the overall cooperative objective and can hardly satisfy practical control requirements. Therefore, an in-depth and systematic investigation of cooperation and conflict among different decision makers in dynamical systems (called a differential game when the dynamical system is described by differential equations) is of important theoretical significance and potential practical value. Meanwhile, in recent years, inspired by the idea of reinforcement learning, researchers have combined dynamic programming theory, neural networks, and stability theory to form a reinforcement learning theory with a certain degree of reliability, also known as adaptive dynamic programming. As a control policy optimization method, adaptive dynamic programming has shown great potential in solving differential game problems for nonlinear systems. However, current control policy optimization methods for differential-game dynamical systems still have many limitations: most existing studies assume that the model of the multi-decision-input dynamical system is known and that the system operates in an ideal interaction environment, ignoring the uncertainties of real physical systems, such as unknown system models, external disturbances, physical device limitations, and communication resource constraints; moreover, as the number of decision makers grows and even forms large-scale populations, existing game equilibrium policy optimization methods still face difficulties and deficiencies in guaranteeing the efficiency and stability of solving game equilibrium policies online. These factors severely restrict the applicability of existing equilibrium-solving methods and the closed-loop stability and safety of the resulting equilibrium policies, hindering the advancement of related work toward real physical systems.

Motivated by this, from the perspective of optimal control and game theory, and based on adaptive dynamic programming and reinforcement learning techniques, this dissertation studies control policy optimization methods for various classes of differential games (multi-player zero-sum games, multi-player Stackelberg–Nash games, and mean-field games of large-scale populations), following the principle of progressively increasing complexity of the decision makers' interaction mechanisms. The specific work and innovations are as follows:

1) To address the safety challenges caused simultaneously by input saturation and state constraints in multi-player zero-sum differential games, as well as the difficulty of designing control policies when the system dynamics are completely unknown, a safe model-free off-policy learning algorithm is proposed and its convergence is proved. First, the H∞ optimal control of a multi-control-input system with external disturbances is formulated as a multi-player zero-sum differential game. A cost function is designed that integrates the cooperative objective of the game, a quadratic term in the barrier-function-transformed state, and non-quadratic terms in the control inputs, and the equivalence between the solution of the Hamilton–Jacobi–Isaacs (HJI) equation and the Nash equilibrium policies is established within the multi-player zero-sum game framework. Then, an equivalent model-free off-policy Bellman equation is derived from the HJI equation, which solves the control policy design problem under unknown system dynamics, and the developed method is proved to be insensitive to the bias induced by probing noise.

2) Considering that the actual system states may be only partially measurable, a model-free optimal learning algorithm for output tracking control policy optimization based on measured output data is proposed from the perspective of direct policy optimization, and its global convergence to the optimal value function is proved. Unlike off-policy learning and Q-learning, the algorithm minimizes the discounted value function by optimizing directly in the parameter space of the policy, providing a new perspective for model-free output control policy optimization.

3) To address the safety challenges caused by system model uncertainties in multi-player Stackelberg–Nash differential games, as well as the difficulty that limited communication and computational resources restrict the solution of game equilibrium policies, an event-triggered robust control policy optimization algorithm is proposed. First, the hierarchical decision-making process between the leader and the followers in a dynamical system is formulated as a multi-player Stackelberg–Nash differential game. For the nominal system, a cost function is designed that incorporates an upper-bound function of the system uncertainty, the evolving system state, and a term reflecting the hierarchical decision-making mechanism between the leader and the followers, thereby transforming robust control policy optimization of the uncertain dynamical system into control policy optimization of the nominal system. Second, to save communication resources and reduce the computational burden, an event-triggered mechanism is introduced within the multi-player Stackelberg–Nash differential game framework; the event-triggered coupled Hamilton–Jacobi (HJ) equations are constructed, and the stability and robustness of the closed-loop system are analyzed. Furthermore, an online control policy optimization algorithm based on a single critic network is proposed to solve the game equilibrium policies, and Zeno behavior is excluded.

4) To address the difficulty of guaranteeing efficient and stable online solution of game equilibrium policies in mean-field games of large-scale populations, an online federated control policy optimization algorithm is proposed. First, based on the idea of mean-field games, the complex interactions among individuals are approximated by the interaction between an individual and the population, and a novel undiscounted performance index function incorporating a mean-field coupling term is designed. This design requires no characterization of the complex and time-varying interaction mechanisms among individuals, avoids the heavy communication burden within a large-scale population, and eliminates the influence of the discount factor on control accuracy. Furthermore, the Hamilton–Jacobi–Bellman (HJB) and Fokker–Planck (FP) equations are constructed within the mean-field game framework, and a new federated adaptive critic–density learning algorithm is proposed, which overcomes the difficulty that, in existing online decentralized learning, the estimation error of the global density information may accumulate continuously and thus cause the game equilibrium solution to fail to exist, while guaranteeing closed-loop stability and improving the training speed of the critic–density neural network structure. |
Foreign Abstract: |
Systems science, taking systems in different fields as its research objects, is a comprehensive interdisciplinary discipline that investigates the universal relationships among the behaviors, structures, environments, and functions of various systems, as well as the general laws of their evolution and regulation. As one of the key areas of systems science, cybernetics is devoted to the modeling, optimization, and regulation of dynamical systems, and its ideas and methods continue to influence many fields of natural science and social engineering. Performance optimization of dynamical systems, i.e., optimal control, has received extensive attention from researchers because it can improve productivity, reduce costs, and enhance system stability, and it has found wide application in systems engineering, management and decision-making, the social economy, and other fields. In the information age, more and more dynamical systems contain multiple or even large numbers of regulating units or subsystems, exhibiting diverse interaction relationships and decision-making behaviors. However, most existing studies on optimal control consider only a single decision maker and ignore the impact of strategic interactions among decision makers, so the decision makers' optimal control policies cannot always achieve the overall cooperative objective and can hardly meet practical control requirements. Therefore, an in-depth and systematic study of cooperation and conflict among different decision makers in dynamical systems (called a differential game when the dynamical system is described by differential equations) has important theoretical significance and potential practical application value. Meanwhile, in recent years, inspired by the idea of reinforcement learning, researchers have combined dynamic programming, neural networks, and stability theory to form a reinforcement learning theory with a certain degree of reliability, also known as adaptive dynamic programming. As a control policy optimization method, adaptive dynamic programming shows great potential in solving differential game problems of nonlinear systems. However, current control policy optimization methods for differential-game dynamical systems still have many limitations. Most existing studies assume that the model of the multi-control-input dynamical system is known and that the system operates in an ideal interaction environment, ignoring the uncertainties of real physical systems, such as unknown system models, external disturbances, limitations of physical devices, communication resource constraints, and other complex situations. Furthermore, as the number of decision makers increases and even forms large-scale populations, existing game equilibrium policy optimization methods still face difficulties in ensuring the efficiency and stability of the online solution of game equilibrium policies. These factors severely restrict the applicability of existing equilibrium-seeking methods and the closed-loop stability and safety of the obtained equilibrium policies, which impedes the advancement of related work toward real physical systems.

Thus, from the perspective of optimal control and game theory, and based on adaptive dynamic programming and reinforcement learning, this dissertation investigates control policy optimization methods for various types of differential games (multi-player zero-sum games, multi-player Stackelberg–Nash games, and mean-field games of large-scale populations), following the principle of progressively increasing complexity of the decision makers' interaction mechanisms. The specific work and innovations are as follows:

1) To address the safety challenges posed simultaneously by input saturation and state constraints in multi-player zero-sum differential games, as well as the control design difficulties posed by completely unknown system dynamics, a safe model-free off-policy learning algorithm is proposed and its convergence is proved. First, the H∞ optimal control of a multi-control-input dynamical system with external disturbances is formulated as a multi-player zero-sum differential game. A cost function is designed that integrates the cooperative objective of the game, a quadratic function of the barrier-function-transformed state, and non-quadratic functions of the control inputs, and the equivalence between the solution of the Hamilton–Jacobi–Isaacs (HJI) equation and the Nash equilibrium policies is established within the multi-player zero-sum game framework. Then, an equivalent model-free off-policy Bellman equation is derived from the HJI equation, which solves the control policy design problem under unknown system dynamics, and the developed method is shown to be insensitive to the bias induced by probing noise.

2) Considering that the actual system states may be only partially measurable, a model-free optimal learning method for output tracking control policy optimization based on measured output data is proposed from the perspective of direct policy optimization, and the global convergence of the algorithm to the optimal value function is proved. Unlike off-policy learning and Q-learning, the algorithm minimizes the discounted value function by optimizing directly in the parameter space of the policy, which provides a new perspective for model-free output-feedback control policy optimization.

3) To address the safety challenges brought by system model uncertainties in multi-player Stackelberg–Nash differential games, as well as the difficulty that limited communication and computational resources restrict the solution of game equilibrium policies, an event-triggered robust control policy optimization algorithm is proposed. First, the hierarchical decision-making process between the leader and the followers in a dynamical system is formulated as a multi-player Stackelberg–Nash differential game. For the nominal system, a cost function is designed that incorporates an upper-bound function of the system uncertainty, the evolving system state, and a term reflecting the hierarchical decision-making mechanism between the leader and the followers, thereby transforming robust control policy optimization of the uncertain dynamical system into control policy optimization of the nominal system. Second, to save communication resources and reduce the computational burden, an event-triggered mechanism is introduced within the framework of multi-player Stackelberg–Nash differential games; the event-triggered coupled Hamilton–Jacobi (HJ) equations are constructed, and the stability and robustness of the closed-loop system are analyzed. Furthermore, an online control policy optimization algorithm based on a single critic network is proposed to solve the game equilibrium policies, and Zeno behavior is excluded.

4) To address the difficulty of ensuring efficient and stable online solution of game equilibrium policies in mean-field games with large-scale populations, an online federated control policy optimization algorithm is proposed. First, based on the idea of mean-field games, the complex interactions among individuals are approximated by the interaction between an individual and the population, and a novel undiscounted performance index function incorporating a mean-field coupling term is designed. This design requires no characterization of the complex and time-varying interaction mechanisms among individuals, avoids the heavy communication burden within a large-scale population, and removes the influence of the discount factor on control accuracy. Furthermore, the Hamilton–Jacobi–Bellman (HJB) and Fokker–Planck (FP) equations are constructed under the framework of the mean-field game, and a new federated adaptive critic–density learning algorithm is proposed, which overcomes the difficulty that, in existing online decentralized learning, the estimation error of the global density information may accumulate continuously and thus cause the game equilibrium solution to fail to exist. The developed method also improves the training speed of the critic–density neural network structure while ensuring the stability of the closed-loop system. |
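For item 1), a generic sketch of the underlying formulation may help; the dynamics F, G_j, K, the weights Q, R_j, the saturation bound λ, the attenuation level γ, and the barrier-transformed state s are assumed notation for illustration, not taken from the thesis. With saturating inputs, the H∞ problem becomes a multi-player zero-sum game of the form

\[ \dot{s} = F(s) + \sum_{j=1}^{N} G_j(s)\,u_j + K(s)\,w, \qquad \|u_j\|_{\infty} \le \lambda, \]
\[ V^{*}(s_0) = \min_{u_1,\dots,u_N}\,\max_{w} \int_{0}^{\infty} \Big( s^{\top} Q\, s + \sum_{j=1}^{N} W(u_j) - \gamma^{2} \|w\|^{2} \Big)\, dt, \qquad W(u_j) = 2 \int_{0}^{u_j} \big( \lambda \tanh^{-1}(v/\lambda) \big)^{\top} R_j\, dv, \]

whose value function satisfies a Hamilton–Jacobi–Isaacs equation of the form

\[ 0 = \min_{u_1,\dots,u_N}\,\max_{w} \Big[ s^{\top} Q\, s + \sum_{j} W(u_j) - \gamma^{2} \|w\|^{2} + (\nabla V^{*})^{\top} \Big( F(s) + \sum_{j} G_j(s)\,u_j + K(s)\,w \Big) \Big]. \]

The non-quadratic penalty W keeps the minimizing inputs inside the saturation bound, and an off-policy Bellman equation of the kind described above replaces the model terms F, G_j, K with measured trajectory data.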
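For item 2), a minimal sketch of the direct policy optimization viewpoint, under the assumption (introduced here only for illustration) of a linearly parameterized output-feedback tracking policy u_t = K z_t built from measured outputs and reference signals z_t: the discounted value is treated as a function of the policy parameter K and descended directly,

\[ J(K) = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^{t} \big( \|y_t - r_t\|_{Q}^{2} + \|u_t\|_{R}^{2} \big) \Big], \qquad K_{k+1} = K_k - \eta\, \widehat{\nabla_K J}(K_k), \]

where the gradient estimate \widehat{\nabla_K J} is formed from measured input-output data rather than from a system model; this is what distinguishes the approach from off-policy learning and Q-learning, which instead solve a Bellman equation.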
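For item 3), the event-triggered mechanism can be pictured generically as follows (the threshold function e_T and the sampled-state notation are assumptions for illustration, not the thesis's exact triggering condition): each player applies a control computed from the state sampled at the last triggering instant t_k and updates it only when the sampling gap grows too large,

\[ u_j(t) = \mu_j\big(x(t_k)\big), \quad t \in [t_k, t_{k+1}), \qquad t_{k+1} = \inf\big\{\, t > t_k : \|x(t) - x(t_k)\|^{2} \ge e_T\big(x(t)\big) \,\big\}, \]

so that the coupled HJ equations are evaluated, and controls are transmitted, only at the triggering instants; the stability analysis must then establish a positive lower bound on t_{k+1} - t_k, which is what excluding Zeno behavior means.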
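For item 4), the two equations named there form, in a standard mean-field game with population density m and representative-agent value function V, the coupled system below, written in its textbook time-dependent form with assumed running cost L, drift f, and noise intensity σ, not the thesis's exact undiscounted formulation:

\[ -\partial_t V = \min_{u} \Big[ L(x, u, m) + \nabla V \cdot f(x, u) + \tfrac{\sigma^{2}}{2} \Delta V \Big] \quad \text{(HJB)}, \qquad \partial_t m = -\nabla \cdot \big( m\, f(x, u^{*}) \big) + \tfrac{\sigma^{2}}{2} \Delta m \quad \text{(FP)}, \]

where u^{*} is the minimizer in the HJB equation and the mean-field coupling enters through the dependence of L on m. The federated critic–density learning described above approximates V and m with neural networks, with the federated update designed to keep the estimate of the global density information from accumulating error, the failure mode identified above for purely decentralized learning.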
Total References: | 140 |
Library Location: | Dissertation Reading Area of the Library (Zones B and C, 3rd Floor, South Section of the Main Building) |
Call Number: | 博071102/24004 |
Open Access Date: | 2025-06-18 |