1. Introduction
Cooperative Multi-Agent Reinforcement Learning (MARL) is a natural model of learning
in multi-agent systems, such as robot swarms (Hüttenrauch et al., 2017, 2019), autonomous
cars (Cao et al., 2012), and traffic signal control (Calvo and Dusparic, 2018). To solve coop-
erative MARL problems, a naive approach is to apply a single-agent reinforcement learning algorithm directly to each agent and treat the other agents as part of the environment, a
paradigm commonly referred to as Independent Learning (Tan, 1993; de Witt et al., 2020).
Though effective in certain tasks, independent learning fails in the face of more complex
scenarios (Hu et al., 2022b; Foerster et al., 2018). The reason is intuitive: once a learning agent updates its policy, so do its teammates, which changes the effective environment of each agent in ways that single-agent algorithms are not prepared for (Claus and Boutilier,
1998). To address this, a learning paradigm named Centralised Training with Decentralised
Execution (CTDE) (Lowe et al., 2017; Foerster et al., 2018; Zhou et al., 2023) was devel-
oped. The CTDE framework learns a joint value function which, during training, has access
to the global state and teammates’ actions. With the help of the centralised value function
that accounts for the non-stationarity caused by others, each agent adapts its policy pa-
rameters accordingly. Training thus effectively leverages global information, while execution still relies only on decentralised agents. As such, the CTDE paradigm allows a straightforward
extension of single-agent policy gradient theorems (Sutton et al., 2000; Silver et al., 2014)
to multi-agent scenarios (Lowe et al., 2017; Kuba et al., 2021; Mguni et al., 2021). Con-
sequently, numerous multi-agent policy gradient algorithms have been developed (Foerster
et al., 2018; Peng et al., 2017; Zhang et al., 2020; Wen et al., 2018, 2020; Yang et al., 2018;
Ackermann et al., 2019).
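As a schematic illustration (the notation here is ours and is not drawn verbatim from the cited works), for agent $i$ with policy $\pi^i_{\theta^i}$ conditioned on its local observation $o^i$, such a centralised policy gradient takes the form
$$\nabla_{\theta^i} J(\boldsymbol{\theta}) = \mathbb{E}_{s,\, \boldsymbol{a} \sim \boldsymbol{\pi}}\Big[\nabla_{\theta^i} \log \pi^i_{\theta^i}\big(a^i \mid o^i\big)\, Q_{\boldsymbol{\pi}}\big(s, a^1, \dots, a^n\big)\Big],$$
where the centralised critic $Q_{\boldsymbol{\pi}}$ sees the global state $s$ and all agents' actions during training, while each policy $\pi^i$ consumes only its own observation at execution time.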
Though existing methods have achieved reasonable performance on common bench-
marks, several limitations remain. Firstly, some algorithms (Yu et al., 2022; de Witt et al.,
2020) rely on parameter sharing and require agents to be homogeneous (i.e., share the same
observation space and action space, and play similar roles in a cooperation task), which
largely limits their applicability to heterogeneous-agent settings (i.e., no constraint on the
observation spaces, action spaces, and the roles of agents) and potentially harms the perfor-
mance (Christianos et al., 2021). While prior work has extended parameter sharing to heterogeneous agents (Terry et al., 2020), the proposed methods rely on padding, which is neither elegant nor general. Secondly, existing algorithms update the agents simultaneously (a schematic contrast with sequential updates follows this paragraph). As we show in Section 2.3.1, the agents are unaware of their partners' update directions under this scheme, which can lead to conflicting updates, resulting in training instability and failure to converge. Lastly, some algorithms, such as IPPO and MAPPO, are developed based on intuition and empirical results; the lack of theory compromises their trustworthiness in critical applications.
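To preview the issue schematically (in our own notation; the formal treatment appears in Section 2.3.1), a simultaneous scheme has every agent $i$ solve, in parallel,
$$\pi^i_{k+1} \in \arg\max_{\pi^i} \; \mathbb{E}_{s,\, a^i \sim \pi^i,\, \boldsymbol{a}^{-i} \sim \boldsymbol{\pi}^{-i}_{k}}\Big[A_{\boldsymbol{\pi}_k}\big(s, a^i, \boldsymbol{a}^{-i}\big)\Big],$$
so that each agent optimises against its teammates' stale policies $\boldsymbol{\pi}^{-i}_{k}$ and cannot anticipate how they are about to change. A sequential scheme instead updates agents one after another, letting each agent sample the already-updated policies of its predecessors when optimising its own objective, so later agents can account for, rather than conflict with, the changes made earlier.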
To resolve these challenges, in this work we propose the Heterogeneous-Agent Reinforcement Learning (HARL) algorithm series, which is designed for the general heterogeneous-agent setting, achieves effective coordination through a novel sequential update scheme, and is theoretically grounded.
In particular, we capitalise on the multi-agent advantage decomposition lemma (Kuba et al., 2021) and derive the theoretically underpinned multi-agent extension of trust region learning, which is proved to enjoy the monotonic improvement property and convergence to