Journal of Machine Learning Research 25 (2024) 1-67 Submitted 4/23; Revised 10/23; Published 1/24
Heterogeneous-Agent Reinforcement Learning
Yifan Zhong 1,2,*
Jakub Grudzien Kuba 3,*  (jakub.gr[email protected]c.uk)
Xidong Feng 4
Siyi Hu 5
Jiaming Ji 1
Yaodong Yang 1,†

1 Institute for Artificial Intelligence, Peking University
2 Beijing Institute for General Artificial Intelligence
3 University of Oxford
4 University College London
5 ReLER, AAII, University of Technology Sydney

* Equal contribution. † Corresponding author.
Editor: George Konidaris
Abstract
The necessity for cooperation among intelligent machines has popularised cooperative multi-agent reinforcement learning (MARL) in AI research. However, many research endeavours heavily rely on parameter sharing among agents, which confines them to the homogeneous-agent setting and leads to training instability and a lack of convergence guarantees. To achieve effective cooperation in the general heterogeneous-agent setting, we propose Heterogeneous-Agent Reinforcement Learning (HARL) algorithms that resolve the aforementioned issues. Central to our findings are the multi-agent advantage decomposition lemma and the sequential update scheme. Based on these, we develop the provably correct Heterogeneous-Agent Trust Region Learning (HATRL) and derive HATRPO and HAPPO by tractable approximations. Furthermore, we discover a novel framework named Heterogeneous-Agent Mirror Learning (HAML), which strengthens theoretical guarantees for HATRPO and HAPPO and provides a general template for cooperative MARL algorithmic designs. We prove that all algorithms derived from HAML inherently enjoy monotonic improvement of joint return and convergence to Nash Equilibrium. As its natural outcome, HAML validates more novel algorithms in addition to HATRPO and HAPPO, including HAA2C, HADDPG, and HATD3, which generally outperform their existing MA-counterparts. We comprehensively test HARL algorithms on six challenging benchmarks and demonstrate their superior effectiveness and stability for coordinating heterogeneous agents compared to strong baselines such as MAPPO and QMIX.¹
Keywords: cooperative multi-agent reinforcement learning, heterogeneous-agent trust
region learning, heterogeneous-agent mirror learning, heterogeneous-agent reinforcement
learning algorithms, sequential update scheme
1. Our code is available at https://github.com/PKU-MARL/HARL.
© 2024 Yifan Zhong, Jakub Grudzien Kuba, Xidong Feng, Siyi Hu, Jiaming Ji, and Yaodong Yang.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-0488.html.
1. Introduction
Cooperative Multi-Agent Reinforcement Learning (MARL) is a natural model of learning
in multi-agent systems, such as robot swarms (Hüttenrauch et al., 2017, 2019), autonomous
cars (Cao et al., 2012), and traffic signal control (Calvo and Dusparic, 2018). To solve coop-
erative MARL problems, one naive approach is to directly apply a single-agent reinforcement learning algorithm to each agent and treat the other agents as part of the environment, a paradigm commonly referred to as Independent Learning (Tan, 1993; de Witt et al., 2020). Though effective in certain tasks, independent learning fails in the face of more complex scenarios (Hu et al., 2022b; Foerster et al., 2018), which is intuitively clear: once a learning agent updates its policy, so do its teammates, and the resulting changes in each agent's effective environment are ones that single-agent algorithms are not prepared for (Claus and Boutilier, 1998). To address this, a learning paradigm named Centralised Training with Decentralised
Execution (CTDE) (Lowe et al., 2017; Foerster et al., 2018; Zhou et al., 2023) was devel-
oped. The CTDE framework learns a joint value function which, during training, has access
to the global state and teammates’ actions. With the help of the centralised value function
that accounts for the non-stationarity caused by others, each agent adapts its policy pa-
rameters accordingly. Thus, it effectively leverages global information while still preserving
decentralised agents for execution. As such, the CTDE paradigm allows a straightforward
extension of single-agent policy gradient theorems (Sutton et al., 2000; Silver et al., 2014)
to multi-agent scenarios (Lowe et al., 2017; Kuba et al., 2021; Mguni et al., 2021). Con-
sequently, numerous multi-agent policy gradient algorithms have been developed (Foerster
et al., 2018; Peng et al., 2017; Zhang et al., 2020; Wen et al., 2018, 2020; Yang et al., 2018;
Ackermann et al., 2019).
Though existing methods have achieved reasonable performance on common bench-
marks, several limitations remain. Firstly, some algorithms (Yu et al., 2022; de Witt et al.,
2020) rely on parameter sharing and require agents to be homogeneous (i.e., share the same
observation space and action space, and play similar roles in a cooperation task), which
largely limits their applicability to heterogeneous-agent settings (i.e., no constraint on the
observation spaces, action spaces, and the roles of agents) and potentially harms the perfor-
mance (Christianos et al., 2021). While there has been work extending parameter sharing to heterogeneous agents (Terry et al., 2020), it relies on padding, which is neither elegant nor general. Secondly, existing algorithms update the agents simultaneously. As we show in Section 2.3.1, under this update scheme the agents are unaware of their partners' update directions, which could lead to conflicting updates and, in turn, training instability and failure to converge. Lastly, some algorithms, such as IPPO and MAPPO, are developed based on intuition and empirical results. The lack of theory compromises their trustworthiness in critical applications.
To resolve these challenges, in this work we propose the Heterogeneous-Agent Reinforcement Learning (HARL) algorithm series, which is designed for the general heterogeneous-agent setting, achieves effective coordination through a novel sequential update scheme, and is theoretically grounded.
In particular, we capitalize on the multi-agent advantage decomposition lemma (Kuba
et al., 2021) and derive the theoretically underpinned multi-agent extension of trust region
learning, which is proved to enjoy the monotonic improvement property and the Nash equilibrium (NE) convergence guarantee. Based on this, we propose Heterogeneous-Agent
Trust Region Policy Optimisation (HATRPO) and Heterogeneous-Agent Proximal Policy
Optimisation (HAPPO) as tractable approximations to theoretical procedures.
Furthermore, inspired by Mirror Learning (Kuba et al., 2022b), which provides a theoretical explanation for the effectiveness of TRPO and PPO, we discover a novel framework named
Heterogeneous-Agent Mirror Learning (HAML), which strengthens theoretical guarantees
for HATRPO and HAPPO and provides a general template for cooperative MARL algorith-
mic designs. We prove that all algorithms derived from HAML inherently satisfy the desired
property of the monotonic improvement of joint return and the convergence to Nash equi-
librium. Thus, HAML dramatically expands the theoretically sound algorithm space and,
potentially, provides cooperative MARL solutions to more practical settings. We explore the
HAML class and derive more theoretically underpinned and practical heterogeneous-agent
algorithms, including HAA2C, HADDPG, and HATD3.
To facilitate the usage of HARL algorithms, we open-source our PyTorch-based in-
tegrated implementation. Based on this, we test HARL algorithms comprehensively on
Multi-Agent Particle Environment (MPE) (Lowe et al., 2017; Mordatch and Abbeel, 2018),
Multi-Agent MuJoCo (MAMuJoCo) (Peng et al., 2021), StarCraft Multi-Agent Challenge
(SMAC) (Samvelyan et al., 2019), SMACv2 (Ellis et al., 2022), Google Research Football
Environment (GRF) (Kurach et al., 2020), and Bi-DexterousHands (Chen et al., 2022). The
empirical results confirm the algorithms’ effectiveness in practice. On all benchmarks with
heterogeneous agents, including MPE, MAMuJoCo, GRF, and Bi-DexterousHands, HARL algorithms generally outperform their existing MA-counterparts, and the performance gaps become larger as the heterogeneity of agents increases, showing that HARL algorithms are more robust and better suited for the general heterogeneous-agent setting. While all HARL algorithms show competitive performance, HAPPO and HATD3 in particular establish new state-of-the-art results. As an off-policy algorithm, HATD3 also improves sample efficiency, leading to more efficient learning and faster convergence.
On tasks where agents are mostly homogeneous such as SMAC and SMACv2, HAPPO and
HATRPO attain comparable or superior win rates at convergence while not relying on the
parameter-sharing trick, demonstrating their general applicability. Through ablation analy-
sis, we empirically show that the novelties introduced by HARL theory and algorithms are crucial for learning the optimal cooperation strategy, thus signifying their importance. Finally, we
systematically analyse the computational overhead of sequential update and conclude that
it does not need to be a concern.
2. Preliminaries
In this section, we first introduce the problem formulation and notation for cooperative MARL, and then review existing work and analyse its limitations.
2.1 Cooperative MARL Problem Formulation and Notations
We consider a fully cooperative multi-agent task that can be described as a Markov game
(MG) (Littman, 1994), also known as a stochastic game (Shapley, 1953).
Definition 1 A cooperative Markov game is defined by a tuple $\langle N, S, \mathcal{A}, r, P, \gamma, d \rangle$. Here, $N = \{1, \ldots, n\}$ is a set of $n$ agents, $S$ is the state space, $\mathcal{A} = \times_{i=1}^{n} \mathcal{A}^i$ is the product of all agents' action spaces, known as the joint action space. Further, $r : S \times \mathcal{A} \to \mathbb{R}$ is the joint reward function, $P : S \times \mathcal{A} \times S \to [0, 1]$ is the transition probability kernel, $\gamma \in [0, 1)$ is the discount factor, and $d \in \mathcal{P}(S)$ (where $\mathcal{P}(X)$ denotes the set of probability distributions over a set $X$) is the positive initial state distribution.
Although our results hold for general compact state and action spaces, in this paper we assume that they are finite, for simplicity. In this work, we will also use the notation $\mathbb{P}(X)$ to denote the power set of a set $X$. At time step $t \in \mathbb{N}$, the agents are at state $s_t$; they take independent actions $a^i_t$, $i \in N$, drawn from their policies $\pi^i(\cdot^i|s_t) \in \mathcal{P}(\mathcal{A}^i)$, and equivalently, they take a joint action $a_t = (a^1_t, \ldots, a^n_t)$ drawn from their joint policy $\pi(\cdot|s_t) = \prod_{i=1}^{n} \pi^i(\cdot^i|s_t) \in \mathcal{P}(\mathcal{A})$. We write $\Pi^i \triangleq \{\times_{s\in S}\, \pi^i(\cdot^i|s) \mid \forall s \in S,\ \pi^i(\cdot^i|s) \in \mathcal{P}(\mathcal{A}^i)\}$ to denote the policy space of agent $i$, and $\Pi \triangleq (\Pi^1, \ldots, \Pi^n)$ to denote the joint policy space. It is important to note that when $\pi^i(\cdot^i|s)$ is a Dirac delta distribution for all $s \in S$, the policy is referred to as deterministic (Silver et al., 2014) and we write $\mu^i(s)$ to refer to its centre. Then, the environment emits the joint reward $r_t = r(s_t, a_t)$ and moves to the next state $s_{t+1} \sim P(\cdot|s_t, a_t) \in \mathcal{P}(S)$. The joint policy $\pi$, the transition probability kernel $P$, and the initial state distribution $d$ induce a marginal state distribution at time $t$, denoted by $\rho^t_\pi$. We define an (improper) marginal state distribution $\rho_\pi \triangleq \sum_{t=0}^{\infty} \gamma^t \rho^t_\pi$. The state value function and the state-action value function are defined as$^2$

$$V_\pi(s) \triangleq \mathbb{E}_{a_{0:\infty}\sim\pi,\, s_{1:\infty}\sim P}\left[ \sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s \right] \quad \text{and} \quad Q_\pi(s, a) \triangleq \mathbb{E}_{s_{1:\infty}\sim P,\, a_{1:\infty}\sim\pi}\left[ \sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a \right].$$
The advantage function is defined to be
$$A_\pi(s, a) \triangleq Q_\pi(s, a) - V_\pi(s).$$
In this paper, we consider the fully-cooperative setting where the agents aim to maximise the expected joint return, defined as
$$J(\pi) \triangleq \mathbb{E}_{s_{0:\infty}\sim\rho^{0:\infty}_\pi,\, a_{0:\infty}\sim\pi}\left[ \sum_{t=0}^{\infty} \gamma^t r_t \right].$$
We adopt the most common solution concept for multi-agent problems which is that of
Nash equilibrium (NE) (Nash, 1951; Yang and Wang, 2020; Filar and Vrieze, 2012; Başar
and Olsder, 1998), defined as follows.
Definition 2 In a fully-cooperative game, a joint policy $\pi_* = (\pi^1_*, \ldots, \pi^n_*)$ is a Nash equilibrium (NE) if for every $i \in N$, $\pi^i \in \Pi^i$ implies $J(\pi_*) \geq J\left(\pi^i, \pi^{-i}_*\right)$.
2. We write $a^i$, $a$, and $s$ when we refer to the action, joint action, and state as values, and $\mathrm{a}^i$, $\mathrm{a}$, and $\mathrm{s}$ (in a distinct font) when we refer to them as random variables.
NE is a well-established game-theoretic solution concept. Definition 2 characterises the
equilibrium point at convergence for cooperative MARL tasks. To study the problem of
finding a NE, we pay close attention to the contribution to performance from different
subsets of agents. To this end, we introduce the following novel definitions.
Definition 3 Let $i_{1:m}$ denote an ordered subset $\{i_1, \ldots, i_m\}$ of $N$. We write $-i_{1:m}$ to refer to its complement, and $i$ and $-i$, respectively, when $m = 1$. We write $i_k$ to refer to the $k$th agent in the ordered subset. Correspondingly, the multi-agent state-action value function is defined as
$$Q^{i_{1:m}}_\pi\left(s, a^{i_{1:m}}\right) \triangleq \mathbb{E}_{a^{-i_{1:m}}\sim\pi^{-i_{1:m}}}\left[ Q_\pi\left(s, a^{i_{1:m}}, a^{-i_{1:m}}\right) \right].$$
In particular, when $m = n$ (the joint action of all agents is considered), then $i_{1:n} \in \text{Sym}(n)$, where $\text{Sym}(n)$ denotes the set of permutations of integers $1, \ldots, n$, known as the symmetric group. In that case, $Q^{i_{1:n}}_\pi(s, a^{i_{1:n}})$ is equivalent to $Q_\pi(s, a)$. On the other hand, when $m = 0$, i.e., $i_{1:m} = \emptyset$, the function takes the form of $V_\pi(s)$. Moreover, consider two disjoint subsets of agents, $j_{1:k}$ and $i_{1:m}$. Then, the multi-agent advantage function of $i_{1:m}$ with respect to $j_{1:k}$ is defined as
$$A^{i_{1:m}}_\pi\left(s, a^{j_{1:k}}, a^{i_{1:m}}\right) \triangleq Q^{j_{1:k}, i_{1:m}}_\pi\left(s, a^{j_{1:k}}, a^{i_{1:m}}\right) - Q^{j_{1:k}}_\pi\left(s, a^{j_{1:k}}\right). \quad (1)$$
In words, $Q^{i_{1:m}}_\pi\left(s, a^{i_{1:m}}\right)$ evaluates the value of agents $i_{1:m}$ taking actions $a^{i_{1:m}}$ in state $s$ while marginalising out $a^{-i_{1:m}}$, and $A^{i_{1:m}}_\pi\left(s, a^{j_{1:k}}, a^{i_{1:m}}\right)$ evaluates the advantage of agents $i_{1:m}$ taking actions $a^{i_{1:m}}$ in state $s$ given that the actions taken by agents $j_{1:k}$ are $a^{j_{1:k}}$, with the rest of the agents' actions marginalised out by expectation. As we show later in Section 3, these functions allow us to decompose the joint advantage function, thus shedding new light on the credit assignment problem.
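To make these definitions concrete, here is a minimal numerical sketch (an illustration we add for the reader; the Q-table and policies are hypothetical) that computes $V_\pi(s)$, a multi-agent state-action value, and the corresponding multi-agent advantages for a single-state, two-agent game.

```python
import numpy as np

# Hypothetical single-state, two-agent game: Q[a1, a2] is the joint Q-value Q_pi(s, a).
Q = np.array([[0.0, 2.0],
              [2.0, -1.0]])

# Factorised joint policy pi(a|s) = pi1(a1|s) * pi2(a2|s).
pi1 = np.array([0.7, 0.3])   # agent 1's action distribution
pi2 = np.array([0.7, 0.3])   # agent 2's action distribution

# V_pi(s): marginalise both agents' actions (the m = 0 case of Definition 3).
V = np.einsum("i,j,ij->", pi1, pi2, Q)

# Multi-agent Q of agent 1 alone, Q^{1}_pi(s, a^1): marginalise agent 2's action.
Q1 = Q @ pi2                       # shape (2,), indexed by a^1

# Multi-agent advantage of agent 1 w.r.t. the empty set: A^{1}_pi(s, a^1) = Q^{1} - V.
A1 = Q1 - V

# Multi-agent advantage of agent 2 given agent 1's action a^1:
# A^{2}_pi(s, a^1, a^2) = Q^{1,2}_pi(s, a^1, a^2) - Q^{1}_pi(s, a^1).
A2_given_1 = Q - Q1[:, None]

print("V(s) =", V)
print("A^1(s, a^1) =", A1)
print("A^2(s, a^1, a^2) =\n", A2_given_1)
```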
2.2 Dealing With Partial Observability
Notably, in some cooperative multi-agent tasks, the global state s may be only partially
observable to the agents. That is, instead of the omniscient global state, each agent can
only perceive a local observation of the environment, which does not satisfy the Markov
property. The model that accounts for partial observability is the Decentralized Partially Observable Markov Decision Process (Dec-POMDP) (Oliehoek and Amato, 2016). However, Dec-POMDP is proven to be NEXP-complete (Bernstein et al., 2002) and requires super-exponential time to solve in the worst case (Zhang et al., 2021). To obtain tractable results, we assume full observability in our theoretical derivations and let each agent take actions conditioned on the global state, i.e., $a^i_t \sim \pi^i(\cdot^i|s)$, thereby arriving at practical algorithms. In the literature (Yang et al., 2018; Kuba et al., 2021; Wang et al., 2023), this is a common modelling choice for rigor, consistency, and simplicity of the proofs.
In our implementation, we either compensate for partial observability by employing RNNs, so that agent actions are conditioned on the action-observation history, or directly use MLP networks, so that agent actions are conditioned on the partial observations. Both are common approaches adopted by existing work, including MAPPO (Yu et al., 2022), QMIX (Rashid et al., 2018), COMA (Foerster et al., 2018), OB (Kuba et al., 2021), and MACPF (Wang et al., 2023). Our experiments (Section 5) show that both approaches are capable of solving partially observable tasks.
2.3 The State of Affairs in Cooperative MARL
Before we review existing SOTA algorithms for cooperative MARL, we introduce two settings
in which the algorithms can be implemented. Both of them can be considered appealing
depending on the application, but their benefits also come with limitations which, if not
taken care of, may deteriorate an algorithm’s performance and applicability.
2.3.1 Homogeneity vs. Heterogeneity
The first setting is that of homogeneous policies, i.e., those where all agents share one set of policy parameters: $\pi^i = \pi, \forall i \in N$, so that $\pi = (\pi, \ldots, \pi)$ (de Witt et al., 2020; Yu et al., 2022), commonly referred to as Full Parameter Sharing (FuPS) (Christianos et al., 2021).
This approach enables a straightforward adoption of an RL algorithm to MARL, and it does
not introduce much computational and sample complexity burden with the increasing num-
ber of agents. As such, it has been a common practice in the MARL community to improve
sample efficiency and boost algorithm performance (Sunehag et al., 2018; Foerster et al.,
2018; Rashid et al., 2018). However, FuPS could lead to an exponentially-suboptimal out-
come in the extreme case (see Example 2 in Appendix A). While agent identity information
could be added to observation to alleviate this difficulty, FuPS+id still suffers from inter-
ference during agents’ learning process in scenarios where they have different abilities and
goals, resulting in poor performance, as analysed by Christianos et al. (2021) and shown by
our experiments (Figure 8). One remedy is the Selective Parameter Sharing (SePS) (Chris-
tianos et al., 2021), which only shares parameters among similar agents. Nevertheless, this
approach has been shown to be suboptimal and highly scenario-dependent, emphasizing the
need for prior understanding of task and agent attributes to effectively utilize the SePS
strategy (Hu et al., 2022a). More severely, both FuPS and SePS require the observation and
action spaces of agents in a sharing group to be the same, restricting their applicability to
the general heterogeneous-agent setting. Existing work that extends parameter sharing to
heterogeneous agents relies on padding (Terry et al., 2020), which also cannot be generally
applied. To summarise, algorithms relying on parameter sharing potentially suffer from
compromised performance and applicability.
A more ambitious approach to MARL is to allow for heterogeneity of policies among
agents, i.e., to let $\pi^i$ and $\pi^j$ be different functions when $i \neq j \in N$. This setting has greater applicability, as heterogeneous agents can operate in different action spaces. Furthermore, thanks to this model's flexibility, they may learn more sophisticated joint behaviours. Lastly, they can recover homogeneous policies as a result of training, if that is indeed optimal. Nevertheless, training heterogeneous agents is highly non-trivial. Given a joint reward, an individual agent may not be able to distill its own contribution to it, a problem known as credit assignment (Foerster et al., 2018; Kuba et al., 2021). Furthermore, even if an agent identifies its improvement direction, it may conflict with those of other agents when not optimised properly. We provide two examples to illustrate this phenomenon.
The first one is shown in Figure 1. We design a single-state differentiable game where two agents choose continuous actions $a^1, a^2 \in \mathbb{R}$, and the reward function is $r(a^1, a^2) = a^1 a^2$. When we initialise the agent policies in the second or fourth quadrant and use a large learning rate, the simultaneous update approach can decrease the joint reward. In contrast, the sequential update proposed in this paper enables agent 2 to fully adapt to agent 1's updated policy, improving the joint reward.

Figure 1: Example of a two-agent differentiable game with $r(a^1, a^2) = a^1 a^2$. [The figure plots the reward landscape over the two agents' actions, together with the individual and joint update directions, the initial joint policy, and the simultaneous and sequential update results.] We initialise the two policies in the fourth quadrant. Under the straightforward simultaneous update scheme (red), agent 1 takes a positive update to improve the joint reward, while agent 2 moves towards the negative axis for the same purpose. However, their update directions conflict with each other and lead to a decrease in the joint return. By contrast, under our proposed sequential update scheme (blue), agent 1 updates first, and agent 2 adapts to agent 1's updated policy, jointly leading to improvement.
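The conflict can be reproduced with a few lines of arithmetic. The sketch below (ours; the initial point and learning rate are chosen for illustration) uses the analytic gradients $\partial r/\partial a^1 = a^2$ and $\partial r/\partial a^2 = a^1$ and contrasts one simultaneous step with one sequential sweep.

```python
# Deterministic policies are single scalars a1, a2; reward r(a1, a2) = a1 * a2.
def reward(a1, a2):
    return a1 * a2

a1, a2 = -1.0, 1.0   # initial joint policy (r = -1)
lr = 3.0             # deliberately large learning rate

# Simultaneous update: both agents ascend their own gradient computed at the old joint policy.
a1_sim = a1 + lr * a2          # d r / d a1 = a2
a2_sim = a2 + lr * a1          # d r / d a2 = a1

# Sequential update: agent 1 updates first, then agent 2 adapts to agent 1's new policy.
a1_seq = a1 + lr * a2
a2_seq = a2 + lr * a1_seq      # gradient evaluated at the already-updated a1

print("initial reward      :", reward(a1, a2))          # -1.0
print("simultaneous update :", reward(a1_sim, a2_sim))  # 2.0 * (-2.0) = -4.0, worse
print("sequential update   :", reward(a1_seq, a2_seq))  # 2.0 * 7.0 = 14.0, better
```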
We consider a matrix game with a discrete action space as the second example. Our matrix game is illustrated as follows:

Example 1 Consider a fully-cooperative game with 2 agents, one state, and the joint action space $\{0, 1\}^2$, where the reward is given by $r(0, 0) = 0$, $r(0, 1) = r(1, 0) = 2$, and $r(1, 1) = -1$. Suppose that $\pi^i_{\text{old}}(0) > 0.6$ for $i = 1, 2$. Then, if the agents $i$ update their policies by
$$\pi^i_{\text{new}} = \arg\max_{\pi^i} \mathbb{E}_{a^i\sim\pi^i,\, a^{-i}\sim\pi^{-i}_{\text{old}}}\left[ A_{\pi_{\text{old}}}\left(a^i, a^{-i}\right) \right], \quad \forall i \in N,$$
then the resulting policy will yield a lower return,
$$J(\pi_{\text{old}}) > J(\pi_{\text{new}}) = \min_{\pi} J(\pi).$$
This example helpfully illustrates the miscoordination problem when agents conduct
independent reward maximisation simultaneously. A similar miscoordination problem when
heterogeneous agents update at the same time is also shown in Example 2 of Alós-Ferrer
and Netzer (2010).
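The claim of Example 1 can be verified directly by enumeration. The following sketch (ours) uses the reward table as reconstructed above (with $r(1,1) = -1$); maximising the expected advantage against the partner's old policy is equivalent here to maximising the expected reward, since the two differ only by the constant $V(s)$.

```python
import numpy as np

# Reward table of Example 1: r[a1, a2].
r = np.array([[0.0, 2.0],
              [2.0, -1.0]])

def joint_return(p1, p2):
    """Expected reward when agent i plays action 0 with probability p_i."""
    pi1 = np.array([p1, 1.0 - p1])
    pi2 = np.array([p2, 1.0 - p2])
    return pi1 @ r @ pi2

p_old = 0.7  # pi_old^i(0) > 0.6 for both agents

# Independent maximisation: agent i greedily picks the action with the highest
# expected value against the *old* policy of its partner.
values_agent1 = r @ np.array([p_old, 1.0 - p_old])   # value of a1 = 0, 1 vs old pi2
values_agent2 = np.array([p_old, 1.0 - p_old]) @ r   # value of a2 = 0, 1 vs old pi1
a1_new = int(np.argmax(values_agent1))               # -> 1
a2_new = int(np.argmax(values_agent2))               # -> 1

print("J(pi_old)             =", joint_return(p_old, p_old))   # 0.75
print("J(pi_new)             =", r[a1_new, a2_new])            # -1.0
print("min over joint actions =", r.min())                     # -1.0: pi_new is the worst policy
```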
Therefore, our discussion in this section not only implies that homogeneous algorithms could have restricted performance and applicability, but also highlights that heterogeneous algorithms must be developed with extra care: when they are not optimised properly (e.g., with a large learning rate as in Figure 1, or with independent reward maximisation as in Example 1), miscoordination arises, and such situations could be common in complex high-dimensional problems. In the next subsection, we describe existing SOTA actor-critic algorithms which, while often very effective, are still not impeccable, as they suffer from one of the above two limitations.
2.3.2 Analysis of Existing Work
MAA2C (Papoudakis et al., 2021) extends A2C (Mnih et al., 2016) to MARL by replacing the RL optimisation objective (single-agent policy) with the MARL one (joint policy),
$$L^{\text{MAA2C}}(\pi) \triangleq \mathbb{E}_{s\sim\rho_{\pi},\, a\sim\pi}\left[ A_{\pi_{\text{old}}}(s, a) \right], \quad (2)$$
which computes the gradient with respect to every agent i’s policy parameters, and performs
a gradient-ascent update for each agent. This algorithm is straightforward to implement and
is capable of solving simple multi-agent problems (Papoudakis et al., 2021). We point out,
however, that by simply following their own MAPG, the agents could perform uncoordinated
updates, as illustrated in Figure 1. Furthermore, MAPG estimates have been proved to suffer
from large variance which grows linearly with the number of agents (Kuba et al., 2021), thus
making the algorithm unstable. To assure greater stability, the following MARL methods,
inspired by stable RL approaches, have been developed.
MADDPG (Lowe et al., 2017) is a MARL extension of the popular DDPG algorithm
(Lillicrap et al., 2016). At every iteration, every agent i updates its deterministic policy by
maximising the following objective
$$L^{\text{MADDPG}}_i(\mu^i) \triangleq \mathbb{E}_{s\sim\beta_{\mu_{\text{old}}}}\left[ Q^i_{\mu_{\text{old}}}\left(s, \mu^i(s)\right) \right] = \mathbb{E}_{s\sim\beta_{\mu_{\text{old}}}}\left[ Q_{\mu_{\text{old}}}\left(s, \mu^i(s), \mu^{-i}_{\text{old}}(s)\right) \right], \quad (3)$$
where $\beta_{\mu_{\text{old}}}$ is a state distribution that is not necessarily equivalent to $\rho_{\mu_{\text{old}}}$, thus allowing for off-policy training. In practice, MADDPG maximises Equation (3) by a few steps of gradient ascent. The main advantages of MADDPG include the small variance of its MAPG estimates, a property granted by deterministic policies (Silver et al., 2014), as well as low sample complexity due to learning from off-policy data. Such a combination makes the algorithm competitive on certain continuous-action tasks (Lowe et al., 2017). However, MADDPG does not address the multi-agent credit assignment problem (Foerster et al., 2018). Moreover, when training the decentralised actors, MADDPG does not take into account the updates that other agents have made and naively uses the off-policy data from the replay buffer, which, much like in Section 2.3.1, leads to uncoordinated updates and suboptimal performance in the face of harder tasks (Peng et al., 2021; Ray-Team, accessed on 2023-03-14). MATD3 (Ackermann et al., 2019) proposes to reduce the overestimation bias in MADDPG using double centralised critics, which improves its performance and stability but does not eliminate the aforementioned limitations.
MAPPO (Yu et al., 2022) is a relatively straightforward extension of PPO (Schulman
et al., 2017) to MARL. In its default formulation, the agents employ the trick of parameter
sharing described in the previous subsection. As such, the policy is updated to maximise
$$L^{\text{MAPPO}}(\pi) \triangleq \mathbb{E}_{s\sim\rho_{\pi_{\text{old}}},\, a\sim\pi_{\text{old}}}\left[ \sum_{i=1}^{n} \min\left( \frac{\pi(a^i|s)}{\pi_{\text{old}}(a^i|s)} A_{\pi_{\text{old}}}(s, a),\ \text{clip}\left( \frac{\pi(a^i|s)}{\pi_{\text{old}}(a^i|s)}, 1 \pm \epsilon \right) A_{\pi_{\text{old}}}(s, a) \right) \right], \quad (4)$$
where the $\text{clip}(\cdot, 1 \pm \epsilon)$ operator clips the input to $1-\epsilon$ / $1+\epsilon$ if it is below / above this value.
Such an operation removes the incentive for agents to make large policy updates, thus stabilising the training effectively. Indeed, the algorithm's performance on the StarCraft II benchmark is remarkable, and it is accomplished by using only on-policy data. Nevertheless, the parameter-sharing strategy limits the algorithm's applicability and could lead to its suboptimality when agents have different roles. To avoid this issue, one can implement the algorithm without parameter sharing, so that the agents simply take simultaneous PPO updates while employing a joint advantage estimator. In this case, the updates could be uncoordinated, as we discussed in Section 2.3.1.
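For concreteness, a minimal sketch (ours; tensor names and shapes are assumptions, not MAPPO's reference implementation) of the clipped surrogate in Equation (4): every agent's ratio multiplies the same joint advantage, and all agents are optimised simultaneously, which is exactly the simultaneous-update issue discussed in Section 2.3.1.

```python
import torch

def mappo_loss(log_probs_new, log_probs_old, joint_advantage, eps=0.2):
    """Clipped surrogate of Equation (4) for one batch.

    log_probs_new / log_probs_old: tensors of shape (batch, n_agents) holding
    log pi(a^i | s) under the current and the data-collecting policy.
    joint_advantage: tensor of shape (batch,) holding A_{pi_old}(s, a).
    Returns the loss to *minimise* (negative of the surrogate objective).
    """
    ratio = torch.exp(log_probs_new - log_probs_old)              # (batch, n_agents)
    adv = joint_advantage.unsqueeze(-1)                           # broadcast over agents
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    surrogate = torch.min(unclipped, clipped).sum(dim=-1).mean()  # sum over agents, mean over batch
    return -surrogate
```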
In summary, none of these algorithms possesses performance guarantees. Altering their implementation settings to avoid one of the limitations from Section 2.3.1 makes them, at best, fall into another. This shows that the MARL problem introduces additional complexity into the single-agent RL setting, and requires additional care to be solved rigorously. With this motivation, in the next section we propose novel heterogeneous-agent methods based on the sequential update scheme, with correctness guarantees.
3. Our Methods
The purpose of this section is to introduce the Heterogeneous-Agent Reinforcement Learning (HARL) algorithm series, which we prove to solve cooperative problems theoretically. HARL algorithms are designed for the general and expressive setting of heterogeneous agents, and their essence is to coordinate agents' updates, thus resolving the challenges in Section 2.3.1. We start by developing the theoretically justified Heterogeneous-Agent Trust Region Learning (HATRL) procedure in Section 3.1 and deriving practical algorithms, namely HATRPO and HAPPO, as its tractable approximations in Section 3.2. We further introduce the novel Heterogeneous-Agent Mirror Learning (HAML) framework in Section 3.3, which strengthens the performance guarantees of HATRPO and HAPPO (Section 3.4) and provides a general template for cooperative MARL algorithmic designs, leading to more HARL algorithms (Section 3.5).
3.1 Heterogeneous-Agent Trust Region Learning (HATRL)
Intuitively, if we parameterise all agents separately and let them learn one by one, then
we will break the homogeneity constraint and allow the agents to coordinate their updates,
thereby avoiding the two limitations from Section 2.3. Such coordination can be achieved, for example, by accounting for previous agents' updates in the optimisation objective of the current one along the aforementioned sequence. Fortunately, this idea is embodied in the multi-agent advantage function $A^{i_m}_{\pi}\left(s, a^{i_{1:m-1}}, a^{i_m}\right)$, which allows agent $i_m$ to evaluate the utility of its action $a^{i_m}$ given the actions $a^{i_{1:m-1}}$ of previous agents. Intriguingly, multi-agent advantage functions allow for a rigorous decomposition of the joint advantage function, as described by the following pivotal lemma.
Figure 2: The multi-agent advantage decomposition lemma and the sequential update scheme are naturally consistent. The former (upper part of the figure) decomposes the joint advantage into sequential advantage evaluations, each of which takes into consideration the previous agents' actions. Based on this, the latter (lower part of the figure) allows each policy to be updated considering previous updates during the training stage. The rigor of their connection is embodied in Lemma 6 and Lemma 13, where the multi-agent advantage decomposition lemma is crucial for the proofs and leads to algorithms that employ the sequential update scheme.
Lemma 4 (Multi-Agent Advantage Decomposition) In any cooperative Markov game, given a joint policy $\pi$, for any state $s$, and any agent subset $i_{1:m}$, the below equation holds:
$$A^{i_{1:m}}_\pi\left(s, a^{i_{1:m}}\right) = \sum_{j=1}^{m} A^{i_j}_\pi\left(s, a^{i_{1:j-1}}, a^{i_j}\right).$$
For proof see Appendix B. Notably, Lemma 4 holds in general for cooperative Markov
games, with no need for any assumptions on the decomposability of the joint value function
such as those in VDN (Sunehag et al., 2018), QMIX (Rashid et al., 2018) or Q-DPP (Yang
et al., 2020).
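Numerically, the decomposition can be checked on a randomly generated single-state game; the sketch below (ours, with arbitrary random values) confirms that the joint advantage equals the sum of the sequential multi-agent advantages, reflecting the telescoping structure that underlies the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_actions = 3, 4                               # three agents, four actions each
Q = rng.normal(size=(num_actions,) * n)             # joint Q_pi(s, a) for a single state s
policies = [rng.dirichlet(np.ones(num_actions)) for _ in range(n)]  # factorised joint policy

def expected_Q(fixed):
    """Q^{i_{1:m}}_pi(s, a^{i_{1:m}}): fix agents in `fixed` (agent index -> action),
    marginalise the remaining agents under their policies."""
    out = Q
    for i in reversed(range(n)):                    # contract trailing axes first
        if i in fixed:
            out = np.take(out, fixed[i], axis=i)    # condition on the given action
        else:
            out = np.tensordot(out, policies[i], axes=([i], [0]))  # expectation over pi^i
    return out

a = (1, 3, 0)                                       # an arbitrary joint action
V = expected_Q({})                                  # m = 0 gives V_pi(s)
joint_advantage = expected_Q({i: a[i] for i in range(n)}) - V

# Sum of sequential multi-agent advantages A^{i_j}_pi(s, a^{i_{1:j-1}}, a^{i_j})
# for the natural order i_{1:n} = (1, 2, 3).
sequential_sum = sum(
    expected_Q({i: a[i] for i in range(j + 1)}) - expected_Q({i: a[i] for i in range(j)})
    for j in range(n)
)

print(np.isclose(joint_advantage, sequential_sum))   # True: the decomposition holds
```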
Lemma 4 confirms that a sequential update is an effective approach to search for the direction of performance improvement (i.e., joint actions with positive advantage values) in multi-agent learning. That is, imagine that agents take actions sequentially by following an arbitrary order $i_{1:n}$. Let agent $i_1$ take an action $\bar{a}^{i_1}$ such that $A^{i_1}_\pi(s, \bar{a}^{i_1}) > 0$, and then, for the remaining $m = 2, \ldots, n$, let each agent $i_m$ take an action $\bar{a}^{i_m}$ such that $A^{i_m}_\pi(s, \bar{a}^{i_{1:m-1}}, \bar{a}^{i_m}) > 0$. For the induced joint action $\bar{a}$, Lemma 4 assures that $A_\pi(s, \bar{a})$ is positive, and thus the performance is guaranteed to improve. To formally extend the above process into a policy iteration procedure with a monotonic improvement guarantee, we begin by introducing the following definitions.
Definition 5 Let $\pi$ be a joint policy, $\bar{\pi}^{i_{1:m-1}} = \prod_{j=1}^{m-1} \bar{\pi}^{i_j}$ be some other joint policy of agents $i_{1:m-1}$, and $\hat{\pi}^{i_m}$ be some other policy of agent $i_m$. Then
$$L^{i_{1:m}}_\pi\left(\bar{\pi}^{i_{1:m-1}}, \hat{\pi}^{i_m}\right) \triangleq \mathbb{E}_{s\sim\rho_\pi,\, a^{i_{1:m-1}}\sim\bar{\pi}^{i_{1:m-1}},\, a^{i_m}\sim\hat{\pi}^{i_m}}\left[ A^{i_m}_\pi\left(s, a^{i_{1:m-1}}, a^{i_m}\right) \right].$$
Note that, for any $\bar{\pi}^{i_{1:m-1}}$, we have
$$L^{i_{1:m}}_\pi\left(\bar{\pi}^{i_{1:m-1}}, \pi^{i_m}\right) = \mathbb{E}_{s\sim\rho_\pi,\, a^{i_{1:m-1}}\sim\bar{\pi}^{i_{1:m-1}},\, a^{i_m}\sim\pi^{i_m}}\left[ A^{i_m}_\pi\left(s, a^{i_{1:m-1}}, a^{i_m}\right) \right] = \mathbb{E}_{s\sim\rho_\pi,\, a^{i_{1:m-1}}\sim\bar{\pi}^{i_{1:m-1}}}\left[ \mathbb{E}_{a^{i_m}\sim\pi^{i_m}}\left[ A^{i_m}_\pi\left(s, a^{i_{1:m-1}}, a^{i_m}\right) \right] \right] = 0. \quad (5)$$
Building on Lemma 4 and Definition 5, we derive the bound for the joint policy update.

Lemma 6 Let $\pi$ be a joint policy. Then, for any joint policy $\bar{\pi}$, we have
$$J(\bar{\pi}) \geq J(\pi) + \sum_{m=1}^{n} \left[ L^{i_{1:m}}_\pi\left(\bar{\pi}^{i_{1:m-1}}, \bar{\pi}^{i_m}\right) - C D^{\max}_{\text{KL}}\left(\pi^{i_m}, \bar{\pi}^{i_m}\right) \right], \quad \text{where } C = \frac{4\gamma \max_{s,a} |A_\pi(s, a)|}{(1-\gamma)^2}. \quad (6)$$
For proof see Appendix B.2. This lemma provides an idea of how a joint policy can be improved. Namely, suppose that the agents sequentially optimise the above summands $L^{i_{1:m}}_\pi(\bar{\pi}^{i_{1:m-1}}, \bar{\pi}^{i_m}) - C D^{\max}_{\text{KL}}(\pi^{i_m}, \bar{\pi}^{i_m})$ by updating their policies one by one. By Equation (5), each of them can always make its summand equal to zero by making no policy update (i.e., $\bar{\pi}^{i_m} = \pi^{i_m}$). This implies that any positive update will lead to an increment in the summation. Moreover, as there are $n$ agents making policy updates, the compound increment can be large, leading to a substantial improvement. Lastly, note that this property holds with no requirement on the specific order by which agents make their updates; this allows for flexible scheduling of the update order at each iteration. To summarise, we propose the following Algorithm 1.
Algorithm 1: Multi-Agent Policy Iteration with Monotonic Improvement Guarantee
Initialise the joint policy $\pi_0 = (\pi^1_0, \ldots, \pi^n_0)$.
for $k = 0, 1, \ldots$ do
    Compute the advantage function $A_{\pi_k}(s, a)$ for all state-(joint)action pairs $(s, a)$.
    Compute $\epsilon = \max_{s,a} |A_{\pi_k}(s, a)|$ and $C = \frac{4\gamma\epsilon}{(1-\gamma)^2}$.
    Draw a permutation $i_{1:n}$ of agents at random.
    for $m = 1 : n$ do
        Make an update $\pi^{i_m}_{k+1} = \arg\max_{\pi^{i_m}} \left[ L^{i_{1:m}}_{\pi_k}\left(\pi^{i_{1:m-1}}_{k+1}, \pi^{i_m}\right) - C D^{\max}_{\text{KL}}\left(\pi^{i_m}_k, \pi^{i_m}\right) \right]$.
We want to highlight that the algorithm is markedly different from naively applying the
TRPO update on the joint policy of all agents. Firstly, our Algorithm 1 does not update the
entire joint policy at once, but rather updates each agent’s individual policy sequentially.
Secondly, during the sequential update, each agent has a unique optimisation objective that
takes into account all previous agents’ updates, which is also the key for the monotonic
improvement property to hold. The following theorem confirms that Algorithm 1 enjoys the monotonic improvement property.
Theorem 7 A sequence $(\pi_k)_{k=0}^{\infty}$ of joint policies updated by Algorithm 1 has the monotonic improvement property, i.e., $J(\pi_{k+1}) \geq J(\pi_k)$ for all $k \in \mathbb{N}$.
For proof see Appendix B.2. With the above theorem, we claim a successful develop-
ment of Heterogeneous-Agent Trust Region Learning (HATRL), as it retains the monotonic
improvement property of trust region learning. Moreover, we take a step further to prove
Algorithm 1’s asymptotic convergence behavior towards NE.
Theorem 8 Suppose that, in Algorithm 1, any permutation of agents has a fixed non-zero probability of being drawn to begin the update. Then a sequence $(\pi_k)_{k=0}^{\infty}$ of joint policies generated by the algorithm, in a cooperative Markov game, has a non-empty set of limit points, each of which is a Nash equilibrium.
For proof see Appendix B.3. In deriving this result, the novel details introduced by
Algorithm 1 played an important role. The monotonic improvement property (Theorem
7), achieved through the multi-agent advantage decomposition lemma and the sequential
update scheme, provided us with a guarantee of the convergence of the return. Further-
more, randomisation of the update order ensured that, at convergence, none of the agents
is incentivised to make an update. The proof is finalised by excluding the possibility that the
algorithm converges at non-equilibrium points.
3.2 Practical Algorithms
When implementing Algorithm 1 in practice, large state and action spaces could prevent agents from designating policies $\pi^i(\cdot|s)$ for each state $s$ separately. To handle this, we parameterise each agent's policy $\pi^i_{\theta^i}$ by $\theta^i$, which, together with the other agents' policies, forms a joint policy $\pi_{\theta}$ parameterised by $\theta = (\theta^1, \ldots, \theta^n)$. In this subsection, we develop two deep MARL algorithms to optimise $\theta$.
3.2.1 HATRPO
Computing $D^{\max}_{\text{KL}}\left(\pi^{i_m}_{\theta^{i_m}_k}, \pi^{i_m}_{\theta^{i_m}}\right)$ in Algorithm 1 is challenging; it requires evaluating the KL-divergence for all states at each iteration. Similar to TRPO, one can ease this maximal KL-divergence penalty $D^{\max}_{\text{KL}}\left(\pi^{i_m}_{\theta^{i_m}_k}, \pi^{i_m}_{\theta^{i_m}}\right)$ by replacing it with the expected KL-divergence constraint $\mathbb{E}_{s\sim\rho_{\pi_{\theta_k}}}\left[ D_{\text{KL}}\left(\pi^{i_m}_{\theta^{i_m}_k}(\cdot|s), \pi^{i_m}_{\theta^{i_m}}(\cdot|s)\right) \right] \leq \delta$, where $\delta$ is a threshold hyperparameter and the expectation can be easily approximated by stochastic sampling. With the above amendment, we propose the practical HATRPO algorithm in which, at every iteration $k+1$, given a permutation of agents $i_{1:n}$, each agent $i_m$ ($m \in \{1, \ldots, n\}$) sequentially optimises its policy parameter $\theta^{i_m}_{k+1}$ by maximising a constrained objective:
$$\theta^{i_m}_{k+1} = \arg\max_{\theta^{i_m}} \mathbb{E}_{s\sim\rho_{\pi_{\theta_k}},\, a^{i_{1:m-1}}\sim\pi^{i_{1:m-1}}_{\theta^{i_{1:m-1}}_{k+1}},\, a^{i_m}\sim\pi^{i_m}_{\theta^{i_m}}}\left[ A^{i_m}_{\pi_{\theta_k}}\left(s, a^{i_{1:m-1}}, a^{i_m}\right) \right],$$
$$\text{subject to } \mathbb{E}_{s\sim\rho_{\pi_{\theta_k}}}\left[ D_{\text{KL}}\left(\pi^{i_m}_{\theta^{i_m}_k}(\cdot|s), \pi^{i_m}_{\theta^{i_m}}(\cdot|s)\right) \right] \leq \delta. \quad (7)$$
To compute the above equation, similar to TRPO, one can apply a linear approximation to the objective function and a quadratic approximation to the KL constraint; the optimisation problem in Equation (7) can then be solved by a closed-form update rule as
$$\theta^{i_m}_{k+1} = \theta^{i_m}_k + \alpha^j \sqrt{\frac{2\delta}{\left(g^{i_m}_k\right)^{\!\top} \left(\mathbf{H}^{i_m}_k\right)^{-1} g^{i_m}_k}}\, \left(\mathbf{H}^{i_m}_k\right)^{-1} g^{i_m}_k, \quad (8)$$
where $\mathbf{H}^{i_m}_k = \nabla^2_{\theta^{i_m}} \mathbb{E}_{s\sim\rho_{\pi_{\theta_k}}}\left[ D_{\text{KL}}\left( \pi^{i_m}_{\theta^{i_m}_k}(\cdot|s), \pi^{i_m}_{\theta^{i_m}}(\cdot|s) \right) \right] \big|_{\theta^{i_m} = \theta^{i_m}_k}$ is the Hessian of the expected KL-divergence, $g^{i_m}_k$ is the gradient of the objective in Equation (7), $\alpha^j < 1$ is a positive coefficient that is found via backtracking line search, and the product $\left(\mathbf{H}^{i_m}_k\right)^{-1} g^{i_m}_k$ can be efficiently computed with the conjugate gradient algorithm.
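The machinery behind Equation (8) mirrors standard TRPO. The sketch below (ours, not the paper's implementation) assumes user-supplied callables: `hvp(v)` returns the product of the KL Hessian with a vector, while `surrogate` and `kl` return sampled estimates of the objective and the expected KL; the backtracking coefficient plays the role of $\alpha^j$ in Equation (8).

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g, given only a Hessian-vector product callable hvp(v)."""
    x = np.zeros_like(g)
    r = g.copy()                  # residual g - H x (x = 0 initially)
    p = g.copy()
    rr = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rr / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

def hatrpo_step(theta, g, hvp, surrogate, kl, delta=0.01, backtrack=0.8, max_tries=15):
    """One agent's update of Equation (8): natural-gradient step plus backtracking line search."""
    x = conjugate_gradient(hvp, g)                       # approximates (H^-1) g
    # x^T H x approximates g^T H^-1 g, the denominator inside the square root of Equation (8).
    step = np.sqrt(2.0 * delta / (x @ hvp(x))) * x
    old_obj = surrogate(theta)
    coef = 1.0
    for _ in range(max_tries):
        theta_new = theta + coef * step
        if kl(theta_new) <= delta and surrogate(theta_new) > old_obj:
            return theta_new                             # accepted: improves the objective within the trust region
        coef *= backtrack                                # shrinking coefficient, i.e., alpha^j
    return theta                                         # fall back to no update
```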
Estimating $\mathbb{E}_{a^{i_{1:m-1}}\sim\pi^{i_{1:m-1}}_{\theta^{i_{1:m-1}}_{k+1}},\, a^{i_m}\sim\pi^{i_m}_{\theta^{i_m}}}\left[ A^{i_m}_{\pi_{\theta_k}}\left(s, a^{i_{1:m-1}}, a^{i_m}\right) \right]$ is the last missing piece for HATRPO, which poses new challenges because each agent's objective has to take into account all previous agents' updates, and the size of its input grows with the number of agents. Fortunately, with the following proposition, we can efficiently estimate this objective with a joint advantage estimator.
Proposition 9 Let $\pi = \prod_{j=1}^{n} \pi^{i_j}$ be a joint policy, and $A_\pi(s, a)$ be its joint advantage function. Let $\bar{\pi}^{i_{1:m-1}} = \prod_{j=1}^{m-1} \bar{\pi}^{i_j}$ be some other joint policy of agents $i_{1:m-1}$, and $\hat{\pi}^{i_m}$ be some other policy of agent $i_m$. Then, for every state $s$,
$$\mathbb{E}_{a^{i_{1:m-1}}\sim\bar{\pi}^{i_{1:m-1}},\, a^{i_m}\sim\hat{\pi}^{i_m}}\left[ A^{i_m}_\pi\left(s, a^{i_{1:m-1}}, a^{i_m}\right) \right] = \mathbb{E}_{a\sim\pi}\left[ \left( \frac{\hat{\pi}^{i_m}(a^{i_m}|s)}{\pi^{i_m}(a^{i_m}|s)} - 1 \right) \frac{\bar{\pi}^{i_{1:m-1}}(a^{i_{1:m-1}}|s)}{\pi^{i_{1:m-1}}(a^{i_{1:m-1}}|s)} A_\pi(s, a) \right]. \quad (9)$$
For proof see Appendix C.1. One benefit of applying Equation (9) is that agents only need to maintain a joint advantage estimator $A_\pi(s, a)$ rather than one centralised critic for each individual agent (e.g., unlike CTDE methods such as MADDPG). Another practical benefit one can draw is that, given an estimator $\hat{A}(s, a)$ of the advantage function $A_{\pi_{\theta_k}}(s, a)$, for example GAE (Schulman et al., 2016), $\mathbb{E}_{a^{i_{1:m-1}}\sim\pi^{i_{1:m-1}}_{\theta^{i_{1:m-1}}_{k+1}},\, a^{i_m}\sim\pi^{i_m}_{\theta^{i_m}}}\left[ A^{i_m}_{\pi_{\theta_k}}\left(s, a^{i_{1:m-1}}, a^{i_m}\right) \right]$ can be estimated with an estimator of
$$\left( \frac{\pi^{i_m}_{\theta^{i_m}}(a^{i_m}|s)}{\pi^{i_m}_{\theta^{i_m}_k}(a^{i_m}|s)} - 1 \right) M^{i_{1:m}}\left(s, a\right), \quad \text{where } M^{i_{1:m}}\left(s, a\right) = \frac{\pi^{i_{1:m-1}}_{\theta^{i_{1:m-1}}_{k+1}}(a^{i_{1:m-1}}|s)}{\pi^{i_{1:m-1}}_{\theta^{i_{1:m-1}}_k}(a^{i_{1:m-1}}|s)}\, \hat{A}\left(s, a\right). \quad (10)$$
Notably, Equation (10) aligns nicely with the sequential update scheme in HATRPO. For agent $i_m$, since the previous agents $i_{1:m-1}$ have already made their updates, the compound policy ratio in $M^{i_{1:m}}$ in Equation (10) is easy to compute. Given a batch $\mathcal{B}$ of trajectories with length $T$, we can estimate the gradient with respect to policy parameters (derived in Appendix C.2) as follows,
$$\hat{g}^{i_m}_k = \frac{1}{|\mathcal{B}|} \sum_{\tau\in\mathcal{B}} \sum_{t=0}^{T} M^{i_{1:m}}(s_t, a_t)\, \nabla_{\theta^{i_m}} \log \pi^{i_m}_{\theta^{i_m}}\left(a^{i_m}_t|s_t\right) \Big|_{\theta^{i_m}=\theta^{i_m}_k}.$$
The term $-1 \cdot M^{i_{1:m}}(s, a)$ of Equation (10) is not reflected in $\hat{g}^{i_m}_k$, as it only introduces a constant with zero gradient. Along with the Hessian of the expected KL-divergence, i.e., $\mathbf{H}^{i_m}_k$, we can update $\theta^{i_m}_{k+1}$ by following Equation (8). The detailed pseudocode of HATRPO is listed in Appendix C.3.
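To illustrate how the estimator in Equation (10) is assembled during the sequential sweep, here is a minimal sketch (ours; the `log_prob_new`/`log_prob_old` methods and batch keys are assumptions). It maintains the compound factor $M^{i_{1:m}}$ from the already-updated agents and evaluates the sampled surrogate whose gradient corresponds to $\hat{g}^{i_m}_k$.

```python
import torch

def hatrpo_surrogate(agents, batch, m):
    """Sampled surrogate for the m-th agent of the drawn permutation (0-indexed).

    `agents` is the permuted list; each agent exposes log_prob_new / log_prob_old,
    returning log pi(a^i | s) under the updated and the data-collecting parameters.
    `batch["advantage"]` is the GAE estimate of the joint advantage, shape (batch_size,).
    """
    # Compound ratio of all previously updated agents (Equation (10)); detached because
    # their parameters are already fixed at theta_{k+1}.
    M = batch["advantage"].clone()
    for j in range(m):
        ratio_j = torch.exp(agents[j].log_prob_new(batch) - agents[j].log_prob_old(batch))
        M = M * ratio_j.detach()

    # Current agent's ratio keeps its gradient; the "-1 * M" term is a constant with
    # respect to this agent's parameters and is therefore omitted.
    ratio_m = torch.exp(agents[m].log_prob_new(batch) - agents[m].log_prob_old(batch))
    return (ratio_m * M).mean()
```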
3.2.2 HAPPO
To further alleviate the computation burden from $\mathbf{H}^{i_m}_k$ in HATRPO, one can follow the idea of PPO and use only first-order derivatives. This is achieved by making agent $i_m$ choose a policy parameter $\theta^{i_m}_{k+1}$ which maximises the clipping objective
$$\mathbb{E}_{s\sim\rho_{\pi_{\theta_k}},\, a\sim\pi_{\theta_k}}\left[ \min\left( \frac{\pi^{i_m}_{\theta^{i_m}}(a^{i_m}|s)}{\pi^{i_m}_{\theta^{i_m}_k}(a^{i_m}|s)} M^{i_{1:m}}(s, a),\ \text{clip}\left( \frac{\pi^{i_m}_{\theta^{i_m}}(a^{i_m}|s)}{\pi^{i_m}_{\theta^{i_m}_k}(a^{i_m}|s)}, 1 \pm \epsilon \right) M^{i_{1:m}}(s, a) \right) \right]. \quad (11)$$
The optimisation process can be performed by stochastic gradient methods such as Adam (Kingma and Ba, 2015). We refer to the above procedure as HAPPO; its full pseudocode is given in Appendix C.4.
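Putting the pieces together, a condensed sketch (ours; module and attribute names are assumptions rather than the released codebase's API) of one HAPPO iteration: agents are visited in a random order, each maximises the clipped objective of Equation (11) for a few epochs, and the factor $M^{i_{1:m}}$ is refreshed with the just-updated agent's ratio before moving to the next agent.

```python
import torch

def happo_update(agents, batch, eps=0.2, epochs=5):
    """One HAPPO iteration over a batch collected by pi_{theta_k}.
    Each agent exposes log_prob(obs, act) and an `optimizer`; `batch["advantage"]`
    holds the joint advantage estimate A_hat(s, a)."""
    perm = torch.randperm(len(agents)).tolist()          # random agent order
    M = batch["advantage"].clone()                       # initially M = A_hat(s, a)

    for idx in perm:
        agent = agents[idx]
        logp_old = agent.log_prob(batch["obs"], batch["act"][idx]).detach()

        for _ in range(epochs):                          # a few epochs of clipped-surrogate ascent
            logp = agent.log_prob(batch["obs"], batch["act"][idx])
            ratio = torch.exp(logp - logp_old)
            clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
            loss = -torch.min(ratio * M, clipped * M).mean()   # negative of Equation (11)
            agent.optimizer.zero_grad()
            loss.backward()
            agent.optimizer.step()

        # Fold this agent's final ratio into M for the agents updating after it.
        with torch.no_grad():
            logp_new = agent.log_prob(batch["obs"], batch["act"][idx])
            M = M * torch.exp(logp_new - logp_old)
```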
3.3 Heterogeneous-Agent Mirror Learning: A Continuum of Solutions to
Cooperative MARL
Recently, Mirror Learning (Kuba et al., 2022b) provided a theoretical explanation of the effectiveness of TRPO and PPO, in addition to the original trust region interpretation, and unified a class of policy optimisation algorithms. Inspired by their work, we further
discover a novel theoretical framework for cooperative MARL, named Heterogeneous-Agent
Mirror Learning (HAML), which enhances theoretical guarantees of HATRPO and HAPPO.
As a proven template for algorithmic designs, HAML substantially generalises the desired
guarantees of monotonic improvement and NE convergence to a continuum of algorithms
and naturally hosts HATRPO and HAPPO as its instances, further explaining their robust
performance. We begin by introducing the necessary definitions of HAML attributes: the
drift functional.
Definition 10 Let $i \in N$. A heterogeneous-agent drift functional (HADF) $D^i$ of $i$ consists of a map, which is defined as
$$D^i : \Pi \times \Pi \times \mathbb{P}(-i) \times S \to \left\{ D^i_\pi(\cdot\,|s, \bar{\pi}^{j_{1:m}}) : \mathcal{P}(\mathcal{A}^i) \to \mathbb{R} \right\},$$
such that for all arguments, under the notation $D^i_\pi\left(\hat{\pi}^i \,\middle|\, s, \bar{\pi}^{j_{1:m}}\right) \triangleq D^i_\pi\left(\hat{\pi}^i(\cdot^i|s) \,\middle|\, s, \bar{\pi}^{j_{1:m}}(\cdot|s)\right)$,
1. $D^i_\pi\left(\hat{\pi}^i|s, \bar{\pi}^{j_{1:m}}\right) \geq D^i_\pi\left(\pi^i|s, \bar{\pi}^{j_{1:m}}\right) = 0$ (non-negativity),
2. $D^i_\pi\left(\hat{\pi}^i|s, \bar{\pi}^{j_{1:m}}\right)$ has all Gâteaux derivatives zero at $\hat{\pi}^i = \pi^i$ (zero gradient).
We say that the HADF is positive if $D^i_\pi(\hat{\pi}^i|s, \bar{\pi}^{j_{1:m}}) = 0,\ \forall s \in S$, implies $\hat{\pi}^i = \pi^i$, and trivial if $D^i_\pi(\hat{\pi}^i|s, \bar{\pi}^{j_{1:m}}) = 0,\ \forall s \in S$, for all $\pi$, $\bar{\pi}^{j_{1:m}}$, and $\hat{\pi}^i$.
Intuitively, the drift $D^i_\pi\left(\hat{\pi}^i|s, \bar{\pi}^{j_{1:m}}\right)$ is a notion of distance between $\pi^i$ and $\hat{\pi}^i$, given that agents $j_{1:m}$ have just updated to $\bar{\pi}^{j_{1:m}}$. We highlight that, under this conditionality, the same update (from $\pi^i$ to $\hat{\pi}^i$) can have different sizes; this will later enable HAML agents to softly constrain their learning steps in a coordinated way. Before that, we introduce a notion that renders hard constraints, which may be a part of an algorithm design, or an inherent limitation.
Definition 11 Let $i \in N$. We say that $U^i : \Pi \times \Pi^i \to \mathbb{P}(\Pi^i)$ is a neighbourhood operator if $\forall \pi^i \in \Pi^i$, $U^i_\pi(\pi^i)$ contains a closed ball, i.e., there exists a state-wise monotonically non-decreasing metric $\chi : \Pi^i \times \Pi^i \to \mathbb{R}$ such that $\forall \pi^i \in \Pi^i$ there exists $\delta^i > 0$ such that $\chi(\pi^i, \bar{\pi}^i) \leq \delta^i \implies \bar{\pi}^i \in U^i_\pi(\pi^i)$.
For every joint policy $\pi$, we will associate it with its sampling distribution, a positive state distribution $\beta_\pi \in \mathcal{P}(S)$ that is continuous in $\pi$ (Kuba et al., 2022b). With these notions defined, we introduce the main definition of the HAML framework.
Definition 12 Let $i \in N$, $j_{1:m} \in \mathbb{P}(-i)$, and $D^i$ be a HADF of agent $i$. The heterogeneous-agent mirror operator (HAMO) integrates the advantage function as
$$\left[ \mathcal{M}^{(\hat{\pi}^i)}_{D^i, \bar{\pi}^{j_{1:m}}} A_\pi \right](s) \triangleq \mathbb{E}_{a^{j_{1:m}}\sim\bar{\pi}^{j_{1:m}},\, a^i\sim\hat{\pi}^i}\left[ A^i_\pi\left(s, a^{j_{1:m}}, a^i\right) \right] - D^i_\pi\left(\hat{\pi}^i \,\middle|\, s, \bar{\pi}^{j_{1:m}}\right).$$

Note that when $\hat{\pi}^i = \pi^i$, HAMO evaluates to zero. Therefore, as the HADF is non-negative, a policy $\hat{\pi}^i$ that improves HAMO must make it positive and thus leads to an improvement of the multi-agent advantage of agent $i$. It turns out that, under certain configurations, agents' local improvements result in the joint improvement of all agents, as described by the lemma below, proved in Appendix D.
Lemma 13 (HAMO Is All You Need) Let $\pi_{\text{old}}$ and $\pi_{\text{new}}$ be joint policies and let $i_{1:n} \in \text{Sym}(n)$ be an agent permutation. Suppose that, for every state $s \in S$ and every $m = 1, \ldots, n$,
$$\left[ \mathcal{M}^{(\pi^{i_m}_{\text{new}})}_{D^{i_m}, \pi^{i_{1:m-1}}_{\text{new}}} A_{\pi_{\text{old}}} \right](s) \geq \left[ \mathcal{M}^{(\pi^{i_m}_{\text{old}})}_{D^{i_m}, \pi^{i_{1:m-1}}_{\text{new}}} A_{\pi_{\text{old}}} \right](s). \quad (12)$$
Then, $\pi_{\text{new}}$ is jointly better than $\pi_{\text{old}}$, so that for every state $s$,
$$V_{\pi_{\text{new}}}(s) \geq V_{\pi_{\text{old}}}(s).$$
Subsequently, the monotonic improvement property of the joint return follows naturally, as
$$J(\pi_{\text{new}}) = \mathbb{E}_{s\sim d}\left[ V_{\pi_{\text{new}}}(s) \right] \geq \mathbb{E}_{s\sim d}\left[ V_{\pi_{\text{old}}}(s) \right] = J(\pi_{\text{old}}).$$
However, the conditions of the lemma require every agent to solve $|S|$ instances of Inequality (12), which may be an intractable problem. Instead, we shall design a single optimisation objective whose solution satisfies those inequalities. Furthermore, to be practically applicable to large-scale problems, such an objective should be estimable via sampling. To handle these challenges, we introduce the following Algorithm Template 2, which generates a continuum of HAML algorithms.
Algorithm Template 2: Heterogeneous-Agent Mirror Learning
Initialise a joint policy $\pi_0 = (\pi^1_0, \ldots, \pi^n_0)$;
for $k = 0, 1, \ldots$ do
    Compute the advantage function $A_{\pi_k}(s, a)$ for all state-(joint)action pairs $(s, a)$;
    Draw a permutation $i_{1:n}$ of agents at random from a positive distribution $p \in \mathcal{P}(\text{Sym}(n))$;
    for $m = 1 : n$ do
        Make an update $\pi^{i_m}_{k+1} = \arg\max_{\pi^{i_m} \in U^{i_m}_{\pi_k}(\pi^{i_m}_k)} \mathbb{E}_{s\sim\beta_{\pi_k}}\left[ \left[ \mathcal{M}^{(\pi^{i_m})}_{D^{i_m}, \pi^{i_{1:m-1}}_{k+1}} A_{\pi_k} \right](s) \right]$;
Output: A limit-point joint policy $\pi$
Based on Lemma 13 and the fact that $\pi^i \in U^i_\pi(\pi^i)$ for all $i \in N$ and $\pi^i \in \Pi^i$, we know that any HAML algorithm (weakly) improves the joint return at every iteration. In practical settings, such as deep MARL, the maximisation step of a HAML method can be performed by a few steps of gradient ascent on a sample average of HAMO (see Definition 12). We also highlight that if the neighbourhood operators $U^i$ are chosen so that they produce small policy-space subsets, then the resulting updates will be not only improving but also small. This, again, is a desirable property when optimising neural-network policies, as it helps stabilise the algorithm. Similar to HATRL, the order of agents in HAML updates is randomised at every iteration; this condition is necessary to establish convergence to NE, which is intuitively comprehensible: fixed-point joint policies of this randomised procedure assure that none of the agents is incentivised to make an update, namely that a NE is reached. We provide the full list of the most fundamental HAML properties in Theorem 14, which shows that any method derived from Algorithm Template 2 solves the cooperative MARL problem.
Theorem 14 (The Fundamental Theorem of Heterogeneous-Agent Mirror Learning) Let, for every agent $i \in N$, $D^i$ be a HADF, $U^i$ be a neighbourhood operator, and let the sampling distributions $\beta_\pi$ depend continuously on $\pi$. Let $\pi_0 \in \Pi$, and let the sequence of joint policies $(\pi_k)_{k=0}^{\infty}$ be obtained by a HAML algorithm induced by $D^i$, $U^i$, $\forall i \in N$, and $\beta_\pi$. Then, the joint policies induced by the algorithm enjoy the following list of properties:
1. They attain the monotonic improvement property, $J(\pi_{k+1}) \geq J(\pi_k)$;
2. Their value functions converge to a Nash value function $V^{\text{NE}}$: $\lim_{k\to\infty} V_{\pi_k} = V^{\text{NE}}$;
3. Their expected returns converge to a Nash return: $\lim_{k\to\infty} J(\pi_k) = J^{\text{NE}}$;
4. Their $\omega$-limit set consists of Nash equilibria.
See the proof in Appendix E. With the above theorem, we can conclude that HAML provides a template for generating theoretically sound, stable, monotonically improving algorithms that enable agents to learn to solve multi-agent cooperation tasks.
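To relate Algorithm Template 2 to code, the skeleton below (ours; the drift, advantage estimator, neighbourhood projection, and `improve` routine are abstract placeholders that a concrete HAML instance must supply) makes explicit where the HADF, the neighbourhood operator, and the sampling distribution enter the sequential update.

```python
import random

def haml_iteration(agents, sample_states, advantage, drift, project):
    """One iteration of Algorithm Template 2 (schematic).

    sample_states() draws states from the sampling distribution beta_{pi_k};
    advantage(states, updated, agent, candidate) estimates the expected multi-agent
    advantage of `candidate` given the already-updated agents; drift(...) is the HADF
    penalty; project(agent, candidate) enforces the neighbourhood constraint U^i.
    """
    order = random.sample(range(len(agents)), len(agents))   # random permutation of agents
    updated = []                                             # agents already moved to pi_{k+1}
    for idx in order:
        agent = agents[idx]
        states = sample_states()

        def hamo(candidate):
            # HAMO of Definition 12: expected multi-agent advantage minus the drift penalty.
            return advantage(states, updated, agent, candidate) - drift(agent, candidate, states, updated)

        candidate = agent.improve(hamo)          # e.g., a few gradient steps on the HAMO estimate
        agent.policy = project(agent, candidate) # keep the new policy inside the neighbourhood
        updated.append(agent)
```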
3.4 Casting HATRPO and HAPPO as HAML Instances
In this section, we show that HATRPO and HAPPO are in fact valid instances of HAML,
which provides a more direct theoretical explanation for their excellent empirical perfor-
mance.
We begin with the example of HATRPO, where agent $i_m$ (the permutation $i_{1:n}$ is drawn from the uniform distribution) updates its policy so as to maximise (in $\bar{\pi}^{i_m}$)
$$\mathbb{E}_{s\sim\rho_{\pi_{\text{old}}},\, a^{i_{1:m-1}}\sim\pi^{i_{1:m-1}}_{\text{new}},\, a^{i_m}\sim\bar{\pi}^{i_m}}\left[ A^{i_m}_{\pi_{\text{old}}}\left(s, a^{i_{1:m-1}}, a^{i_m}\right) \right], \quad \text{subject to } \mathbb{E}_{s\sim\rho_{\pi_{\text{old}}}}\left[ D_{\text{KL}}\left(\pi^{i_m}_{\text{old}}(\cdot^{i_m}|s), \bar{\pi}^{i_m}(\cdot^{i_m}|s)\right) \right] \leq \delta.$$
This optimisation objective can be cast as a HAMO with the HADF $D^{i_m} \equiv 0$, and the KL-divergence neighbourhood operator
$$U^{i_m}_\pi\left(\pi^{i_m}\right) = \left\{ \bar{\pi}^{i_m} \,\middle|\, \mathbb{E}_{s\sim\rho_\pi}\left[ \text{KL}\left(\pi^{i_m}(\cdot^{i_m}|s), \bar{\pi}^{i_m}(\cdot^{i_m}|s)\right) \right] \leq \delta \right\}.$$
The sampling distribution used in HATRPO is $\beta_\pi = \rho_\pi$. Lastly, as the agents update their policies in a random loop, the algorithm is an instance of HAML. Hence, it is monotonically improving and converges to a Nash equilibrium set.
In HAPPO, the update rule of agent $i_m$ is changed with respect to HATRPO as
$$\mathbb{E}_{s\sim\rho_{\pi_{\text{old}}},\, a^{i_{1:m-1}}\sim\pi^{i_{1:m-1}}_{\text{new}},\, a^{i_m}\sim\pi^{i_m}_{\text{old}}}\left[ \min\left( \mathrm{r}(\bar{\pi}^{i_m}) A^{i_{1:m}}_{\pi_{\text{old}}}\left(s, a^{i_{1:m}}\right),\ \text{clip}\left(\mathrm{r}(\bar{\pi}^{i_m}), 1 \pm \epsilon\right) A^{i_{1:m}}_{\pi_{\text{old}}}\left(s, a^{i_{1:m}}\right) \right) \right],$$
where $\mathrm{r}(\bar{\pi}^i) = \frac{\bar{\pi}^i(a^i|s)}{\pi^i_{\text{old}}(a^i|s)}$. We show in Appendix F that this optimisation objective is equivalent to
$$\mathbb{E}_{s\sim\rho_{\pi_{\text{old}}}}\Big[ \mathbb{E}_{a^{i_{1:m-1}}\sim\pi^{i_{1:m-1}}_{\text{new}},\, a^{i_m}\sim\bar{\pi}^{i_m}}\left[ A^{i_m}_{\pi_{\text{old}}}\left(s, a^{i_{1:m-1}}, a^{i_m}\right) \right] - \mathbb{E}_{a^{i_{1:m-1}}\sim\pi^{i_{1:m-1}}_{\text{new}},\, a^{i_m}\sim\pi^{i_m}_{\text{old}}}\left[ \text{ReLU}\left( \left[ \mathrm{r}(\bar{\pi}^{i_m}) - \text{clip}\left(\mathrm{r}(\bar{\pi}^{i_m}), 1 \pm \epsilon\right) \right] A^{i_{1:m}}_{\pi_{\text{old}}}\left(s, a^{i_{1:m}}\right) \right) \right] \Big].$$
The second term (the one involving the ReLU) is clearly non-negative due to the presence of the ReLU function. Furthermore, for policies $\bar{\pi}^{i_m}$ sufficiently close to $\pi^{i_m}_{\text{old}}$, the clip operator does not activate, thus rendering $\mathrm{r}(\bar{\pi}^{i_m})$ unchanged. Therefore, this term is zero at, and in a region around, $\bar{\pi}^{i_m} = \pi^{i_m}_{\text{old}}$, which also implies that its Gâteaux derivatives are zero. Hence, it evaluates a HADF for agent $i_m$, thus making HAPPO a valid HAML instance.
Finally, we would like to highlight that these conclusions about HATRPO and HAPPO strengthen the results in Sections 3.1 and 3.2. In addition to their origin in HATRL, we now show that their optimisation objectives directly enjoy favourable theoretical properties endowed by the HAML framework. Both interpretations underpin their empirical performance.
3.5 More HAML Instances
In this subsection, we exemplify how HAML can be used for the derivation of principled MARL algorithms, solely by constructing a valid drift functional, neighbourhood operator, and sampling distribution. Our goal is to verify the correctness of HAML theory and to enrich cooperative MARL with more theoretically guaranteed and practical algorithms. The results are more robust heterogeneous-agent versions of popular RL algorithms, including A2C, DDPG, and TD3, different from those in Section 2.3.2.
Figure 3: This figure presents a simplified schematic overview of HARL algorithms repre-
sented as valid instances of HAML. The complete details are available in Appendix H. By
recasting HATRPO and HAPPO as HAML formulations, we demonstrate that their guaran-
tees pertaining to monotonic improvement and NE convergence are enhanced by leveraging
the HAML framework. Moreover, HAA2C, HADDPG, and HATD3 are obtained by design-
ing HAML components, thereby securing those same performance guarantees. The variety
of drift functionals, neighborhood operators, and sampling distributions utilised by these
approaches further attests to the versatility and richness of the HAML framework.
3.5.1 HAA2C
HAA2C intends to optimise the policy for the joint advantage function at every iteration and, similar to A2C, does not impose any penalties or constraints on that procedure. This learning procedure is accomplished by, first, drawing a random permutation of agents $i_{1:n}$, and then performing a few steps of gradient ascent on the objective
$$\mathbb{E}_{s\sim\rho_{\pi_{\text{old}}},\, a^{i_{1:m}}\sim\pi^{i_{1:m}}_{\text{old}}}\left[ \frac{\pi^{i_{1:m-1}}_{\text{new}}(a^{i_{1:m-1}}|s)\, \pi^{i_m}(a^{i_m}|s)}{\pi^{i_{1:m-1}}_{\text{old}}(a^{i_{1:m-1}}|s)\, \pi^{i_m}_{\text{old}}(a^{i_m}|s)}\, A^{i_m}_{\pi_{\text{old}}}\left(s, a^{i_{1:m-1}}, a^{i_m}\right) \right], \quad (13)$$
with respect to the $\pi^{i_m}$ parameters, for each agent $i_m$ in the permutation, sequentially. In practice, we replace the multi-agent advantage $A^{i_m}_{\pi_{\text{old}}}(s, a^{i_{1:m-1}}, a^{i_m})$ with the joint advantage estimate, which, thanks to the joint importance sampling in Equation (13), poses the same objective on the agent (see Appendix G for the full pseudocode).
3.5.2 HADDPG
HADDPG exploits the fact that $\beta_\pi$ can be independent of $\pi$ and aims to maximise the state-action value function off-policy. As it is a deterministic-action method, importance sampling in its case translates to a replacement of the old action inputs to the critic with the new ones. Namely, agent $i_m$ in a random permutation $i_{1:n}$ maximises
$$\mathbb{E}_{s\sim\beta_{\mu_{\text{old}}}}\left[ Q^{i_{1:m}}_{\mu_{\text{old}}}\left(s, \mu^{i_{1:m-1}}_{\text{new}}(s), \mu^{i_m}(s)\right) \right], \quad (14)$$
with respect to $\mu^{i_m}$, also with a few steps of gradient ascent. Similar to HAA2C, optimising the state-action value function (with the old action replacement) is equivalent to optimising the original multi-agent value (see Appendix G for the full pseudocode).
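A minimal sketch (ours; the critic signature, actor modules, and batch layout are assumptions) of the action-replacement step behind Equation (14): the centralised critic is queried with the already-updated agents' new deterministic actions, the current agent's differentiable action, and the remaining agents' old actions from the replay buffer.

```python
import torch

def haddpg_actor_loss(critic, actors, batch, m, perm):
    """Actor loss for the m-th agent of the permutation `perm` (off-policy batch from the buffer).
    `actors[i](obs_i)` returns agent i's deterministic action; `critic(state, joint_action)`
    is the centralised Q-function."""
    joint_action = list(batch["actions"])                    # old actions a ~ beta_{mu_old}

    # Replace the actions of agents that already updated in this iteration (no gradient needed).
    with torch.no_grad():
        for j in perm[:m]:
            joint_action[j] = actors[j](batch["obs"][j])

    # The current agent's action keeps the gradient path for the deterministic policy gradient.
    i = perm[m]
    joint_action[i] = actors[i](batch["obs"][i])

    q = critic(batch["state"], torch.cat(joint_action, dim=-1))
    return -q.mean()                                         # gradient ascent on Equation (14)
```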
3.5.3 HATD3
HATD3 improves HADDPG with tricks proposed by Fujimoto et al. (2018). Similar to
HADDPG, HATD3 is also an off-policy algorithm and optimises the same target, but it
employs target policy smoothing, clipped double Q-learning, and delayed policy updates
techniques (see Appendix G for the full pseudocode). We observe that HATD3 consistently outperforms HADDPG on all tasks, showing that relevant reinforcement learning advances can be directly applied to MARL without the need for rediscovery, another benefit of HAML.
As the HADDPG and HATD3 algorithms have been derived, it is logical to consider
the possibility of HADQN, given that DQN can be viewed as a pure value-based version of
DDPG for discrete action problems. In light of this, we introduce HAD3QN, a value-based
approximation of HADDPG that incorporates techniques proposed by Van Hasselt et al.
(2016) and Wang et al. (2016). The details of HAD3QN are presented in Appendix I, which
includes the pseudocode, performance analysis, and an ablation study demonstrating the
importance of the dueling double Q-network architecture for achieving stable and efficient
multi-agent learning.
To elucidate the formulations and differences of HARL approaches in their HAML representation, we provide a simplified summary in Figure 3 and list the full details in Appendix H. While these approaches come with tailored HADFs, neighbourhood operators, and sampling distributions, we believe that the full richness of the HAML framework remains to be explored in future work. Nevertheless, we commence addressing the heterogeneous-agent cooperation problem with these five methods, and analyse their performance in Section 5.
4. Related Work
There have been previous attempts that tried to solve the cooperative MARL problem by
developing multi-agent trust region learning theories. Despite empirical successes, most of them did not manage to propose a theoretically-justified trust region protocol in multi-agent learning, or to maintain the monotonic improvement property. Instead, they tended to impose certain assumptions to enable direct implementations of TRPO/PPO in MARL problems.
For example, IPPO (de Witt et al., 2020) assumes homogeneity of action spaces for all
agents and enforces parameter sharing. Yu et al. (2022) proposed MAPPO which enhances
IPPO by considering a joint critic function and finer implementation techniques for on-
policy methods. Yet, it still suffers from similar drawbacks to IPPO due to the lack of a monotonic improvement guarantee, especially when the parameter-sharing condition is switched off.
Wen et al. (2022) adjusted PPO for MARL by considering a game-theoretical approach at the
meta-game level among agents. Unfortunately, it can only deal with two-agent cases due to
the intractability of Nash equilibrium. Recently, Li and He (2023) tried to implement TRPO
for MARL through distributed consensus optimisation; however, they enforced the same ratio $\bar{\pi}^i(a^i|s)/\pi^i(a^i|s)$ for all agents (see their Equation (7)), which, similar to parameter sharing, largely limits the policy space for optimisation. Moreover, their method comes with a $\delta/n$ KL-constraint threshold that fails to account for scenarios with a large number of agents. While Coordinated PPO (CoPPO) (Wu et al., 2021) derived a theoretically-grounded joint objective and obtained practical algorithms through a set of approximations, it still suffers from the non-stationarity problem as it updates agents simultaneously.
One of the key ideas behind our Heterogeneous-Agent algorithm series is the sequential
update scheme. A similar idea of multi-agent sequential update was also discussed in the context of dynamic programming (Bertsekas, 2019), where artificial "in-between" states have to be considered. In contrast, our sequential update scheme is developed based on Lemma 4, which does not require any artificial assumptions and holds for any cooperative game. The idea of sequential update also appeared in principal component analysis; in
EigenGame (Gemp et al., 2021) eigenvectors, represented as players, maximise their own
utility functions one-by-one. Although EigenGame provably solves the PCA problem, it is
of little use in MARL, where a single iteration of sequential updates is insufficient to learn
complex policies. Furthermore, its design and analysis rely on closed-form matrix calculus,
which has no extension to MARL.
Lastly, we would like to highlight the importance of the decomposition result in Lemma
4. This result could serve as an effective tool for value-based methods in MARL, where tremendous efforts have been made to decompose the joint Q-function into individual Q-functions when the joint Q-function is decomposable (Rashid et al., 2018). Lemma 4, in
contrast, is a general result that holds for any cooperative MARL problems regardless of
decomposability. As such, we think of it as an appealing contribution to future developments
on value-based MARL methods.
Our work is an extension of our previous work on HATRPO and HAPPO, which was originally proposed in a conference paper (Kuba et al., 2022a). The main additions in our work are:
- Introducing Heterogeneous-Agent Mirror Learning (HAML), a more general theoretical framework that strengthens theoretical guarantees for HATRPO and HAPPO and can induce a continuum of sound algorithms with guarantees of monotonic improvement and convergence to Nash Equilibrium;
- Designing novel algorithm instances of HAML, including HAA2C, HADDPG, and HATD3, which attain better performance than their existing MA-counterparts, with HATD3 establishing new SOTA results for off-policy algorithms;
- Releasing a PyTorch-based implementation of HARL algorithms, which is more unified, modularised, user-friendly, extensible, and effective than the previous one;
- Conducting comprehensive experiments evaluating HARL algorithms on six challenging benchmarks: Multi-Agent Particle Environment (MPE), Multi-Agent MuJoCo (MAMuJoCo), StarCraft Multi-Agent Challenge (SMAC), SMACv2, Google Research Football Environment (GRF), and Bi-DexterousHands.
5. Experiments and Analysis
In this section, we evaluate and analyse HARL algorithms on six cooperative multi-agent
benchmarks: Multi-Agent Particle Environment (MPE) (Lowe et al., 2017; Mordatch and
Abbeel, 2018), Multi-Agent MuJoCo (MAMuJoCo) (Peng et al., 2021), StarCraft Multi-
Agent Challenge (SMAC) (Samvelyan et al., 2019), SMACv2 (Ellis et al., 2022), Google Re-
search Football Environment (GRF) (Kurach et al., 2020), and Bi-DexterousHands (Chen
et al., 2022), as shown in Figure 4, and compare their performance to existing SOTA methods.

Figure 4: The six environments used for testing HARL algorithms.

These benchmarks are diverse in task difficulty, agent number, action type, di-
mensionality of observation space and action space, and cooperation strategy required, and
hence provide a comprehensive assessment of the effectiveness, stability, robustness, and
generality of our methods. The experimental results demonstrate that HAPPO, HADDPG,
and HATD3 generally outperform their MA-counterparts on heterogeneous-agent coopera-
tion tasks. Moreover, HARL algorithms culminate in HAPPO and HATD3, which exhibit
superior effectiveness and stability for heterogeneous-agent cooperation tasks over existing
strong baselines such as MAPPO, QMIX, MADDPG, and MATD3, refreshing the state-of-
the-art results. Our ablation study also reveals that the novel details introduced by the HATRL
and HAML theories, namely the non-sharing of parameters and the randomised agent order in the sequential
update scheme, are crucial for obtaining the strong performance. Finally, we empirically show that
the computational overhead introduced by sequential update does not need to be a concern.
Our implementation of HARL algorithms takes advantage of the sequential update
scheme and the CTDE framework that HARL algorithms share in common, and unifies
them into either the on-policy or the off-policy training pipeline, resulting in modularisation
and extensibility. It also naturally hosts MAPPO, MADDPG, and MATD3 as special cases
and provides the (re)implementation of these three algorithms along with HARL algorithms.
For fair comparisons, we use our (re)implementation of MAPPO, MADDPG, and MATD3 as
baselines on MPE and MAMuJoCo, where publicly acknowledged performance reports
under exactly the same settings are lacking, and we ensure that their performance matches
or exceeds the results reported in their original papers and subsequent work; on the other
benchmarks, the original implementations of baselines are used. To be consistent with the
officially reported results of MAPPO, we let it utilize parameter sharing on all but Bi-
DexterousHands and the Speaker Listener task in MPE. Details of hyper-parameters and
experiment setups can be found in Appendix K.
5.1 MPE Testbed
We consider the three fully cooperative tasks in MPE (Lowe et al., 2017): Spread, Reference,
and Speaker Listener. These tasks require agents to explore and then learn the optimal
cooperation strategies, such as spreading to targets as quickly as possible without collision,
instructing companions, and so on. The Speaker Listener scenario, in particular, explicitly
assigns different roles to the agents and defeats the homogeneous-agent approach. As the original codebase
of MPE is no longer maintained, we choose to use its PettingZoo version (Terry et al.,
2021). To make it compatible with the cooperative MARL problem formulation in Section
2, we implement the interface of MPE so that agents do not have access to their individual
rewards, as opposed to the setting used by MADDPG. Instead, the individual rewards of agents
are summed up to form the joint reward, which is available during centralised training.
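As a minimal illustration of this interface change (a hedged sketch, not the code of our released implementation; the wrapper name and the assumed five-tuple step return of a PettingZoo-style parallel environment are assumptions), the individual rewards can be replaced by their sum as follows:

class JointRewardWrapper:
    """Replace per-agent rewards with their sum, shared by all agents."""

    def __init__(self, env):
        self.env = env  # a PettingZoo-style parallel environment (assumed API)

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

    def step(self, actions):
        obs, rewards, terminations, truncations, infos = self.env.step(actions)
        joint_reward = sum(rewards.values())          # joint reward for centralised training
        shared = {agent: joint_reward for agent in rewards}
        return obs, shared, terminations, truncations, infos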
[Figure 5 panels: Reference, Speaker Listener, and Spread, each in continuous and discrete action-space versions; episode return vs. environment steps for MATD3, MADDPG, MAPPO, HATD3, HADDPG, HAA2C, HATRPO, and HAPPO (continuous) and for MAPPO, HAA2C, HATRPO, and HAPPO (discrete).]

Figure 5: Comparisons of average episode return on Multi-Agent Particle Environments.
The “continuous” and “discrete” in parentheses refer to the type of action space in each task.
We evaluate HAPPO, HATRPO, HAA2C, HADDPG, and HATD3 on the continuous action-
space version of these three tasks against MAPPO, MADDPG, and MATD3, with on-policy
algorithms running for 10 million steps and off-policy ones for 5 million steps. Since the
stochastic policy algorithms, namely HAPPO, HATRPO, and HAA2C, can also be applied to
discrete action-space scenarios, we additionally compare them with MAPPO on the discrete
version of these three tasks, using the same number of timesteps. The learning curves plotted
from training data across three random seeds are shown in Figure 5.
While MPE tasks are relatively simple, they are sufficient for identifying several patterns.
HAPPO consistently solves all six combinations of tasks, with its performance comparable
to or better than MAPPO. With a single set of hyper-parameters, HATRPO also solves
five combinations easily and achieves steady learning curves due to the explicitly specified
distance constraint and reward improvement between policy updates. It should be noted that
the oscillations observed after convergence are due to the randomness of test environments
which affects the maximum reward an algorithm can attain. HAA2C, on the other hand,
is equally competitive on the discrete version of the tasks, but shows higher variance and
empirically finds it harder to achieve the same level of episode return on the continuous versions.
This is a limitation of the method, since its update rule cannot be precisely realised
in practice while it imposes no constraint. Nevertheless, it still constitutes a
potentially competitive solution.
Furthermore, two off-policy HARL methods, HADDPG and HATD3, exhibit extremely
fast mastery of the three tasks with small variance, demonstrating their advantage in high
sample efficiency. Their performances are similar to MA-counterparts on these simple tasks,
with TD3-based methods achieving faster convergence rate and higher total rewards, es-
tablishing new SOTA off-policy results. Off-policy HARL methods consistently converge
with much fewer samples than on-policy methods across all tasks, holding the potential to
alleviate the high sample complexity and slow training speed problems, which are commonly
observed in MARL experiments.

[Figure 6 panels: Ant-v2 4x2, HalfCheetah-v2 2x3, Hopper-v2 3x1, Walker2d-v2 2x3, and Walker2d-v2 6x1; episode return vs. environment steps for MAPPO, HAA2C, HATRPO, and HAPPO.]

Figure 6: Comparisons of average episode return of on-policy algorithms on Multi-Agent
MuJoCo. HAPPO generally outperforms MAPPO, refreshing the state-of-the-art (SOTA)
results for on-policy algorithms.
These observations show that while HARL algorithms have the same improvement and
convergence guarantees in theory, they differ in learning behaviours due to diverse algorith-
mic designs. In general, they complement each other and collectively solve all tasks.
5.2 MAMuJoCo Testbed
The Multi-Agent MuJoCo (MAMuJoCo) environment is a multi-agent extension of MuJoCo.
While the MuJoCo tasks challenge a robot to learn an optimal way of motion, MAMuJoCo
models each part of a robot as an independent agent (for example, a leg of a spider or
an arm of a swimmer) and requires the agents to collectively perform efficient motion.
With the increasing variety of the body parts, modeling heterogeneous policies becomes
necessary. Thus, we believe that MAMuJoCo is a suitable task suite for evaluating the
effectiveness of our heterogeneous-agent methods. We evaluate HAPPO, HATRPO, HAA2C,
HADDPG, and HATD3 on the five most representative tasks against MAPPO, MADDPG,
and MATD3 and plot the learning curves across at least three seeds in Figure 6 and 7.
We observe that on all five tasks, HAPPO, HADDPG, and HATD3 achieve generally
better average episode return than their MA-counterparts. HATRPO and HAA2C also
achieve strong and steady learning behaviours on most tasks. Since the running motion
cannot be realised by any subset of the agents alone, the episode return metric measures
the quality of agents’ cooperation. Rendered videos from the trained models of HARL
algorithms confirm that agents develop effective cooperation strategies for controlling their
corresponding body parts.

[Figure 7 panels: Ant-v2 4x2, HalfCheetah-v2 2x3, Hopper-v2 3x1, Walker2d-v2 2x3, and Walker2d-v2 6x1; episode return vs. environment steps for MADDPG, MATD3, HADDPG, and HATD3.]

Figure 7: Comparisons of average episode return of off-policy algorithms on Multi-Agent Mu-
JoCo. HADDPG and HATD3 generally outperform MADDPG and MATD3, while HATD3
achieves the highest average return across all tasks, thereby refreshing the state-of-the-art
(SOTA) results for off-policy algorithms.

For example, on the 2-agent HalfCheetah task, agents trained by
HAPPO learn to alternately hit the ground, forming a swift kinematic gait that resembles a
real cheetah. The motion performed by each agent is meaningless alone and only takes effect
when combined with the other agent’s actions. In other words, all agents play indispensable
roles and have unique contributions in completing the task, which is the most desirable form
of cooperation. Empirically, HARL algorithms prove their capability to generate this level
of cooperation from random initialisation.
As for the off-policy HARL algorithms, HATD3 outperforms both MATD3 and HAD-
DPG on all tasks, owing to the beneficial combination of the sequential update scheme and the stabilising
effects brought by twin critics, delayed actor updates, and target action smoothing.
This also demonstrates the feasibility of introducing single-agent RL tricks into MARL. Its performance is even
generally better than that of HAPPO, showing its competence in handling continuous tasks. Ex-
perimental results on MAMuJoCo not only prove the superiority of HARL algorithms over
existing strong baselines, but also reveal that HARL renders multiple effective solutions to
multi-agent cooperation tasks.
Though MAMuJoCo tasks are heterogeneous in nature, parameter sharing can still be effec-
tive in scenarios where learning a single “versatile” policy that controls all body parts, by relying on the
expressiveness of neural networks, is sufficient. As a result, on these five tasks, MAPPO under-
performs HAPPO only by moderate margins. To fully distinguish HAPPO from MAPPO, we
additionally compare them on the 17-agent Humanoid task and report the learning curves
averaged across three seeds in Figure 8. In this scenario, the 17 agents control dissimilar
body parts and it is harder for a single policy to select the right action for each part. In-
deed, MAPPO completely fails to learn. In contrast, HAPPO still manages to coordinate the
agents’ updates with its sequential update scheme, which leads to a walking humanoid with
the joint effort from all agents.

[Figure 8 panel: Humanoid-v2 17x1; episode return vs. environment steps for MAPPO, HATD3, and HAPPO.]

Figure 8: Comparisons of average episode return on the 17-agent Humanoid control task
in Multi-Agent MuJoCo. In the face of this many-heterogeneous-agent task, HAPPO and
HATD3 achieve state-of-the-art (SOTA) performance, while MAPPO fails completely. This
highlights the superior effectiveness of HARL algorithms for promoting cooperation among
heterogeneous agents.

With the same theoretical properties granted by HAML,
HATD3 also successfully learns to control the 17-agent humanoid. Therefore, HARL al-
gorithms are more applicable and effective for the general many-heterogeneous-agent cases.
Their advantage becomes increasingly significant with the increasing heterogeneity of agents.
5.3 SMAC & SMACv2 Testbed
The StarCraft Multi-Agent Challenge (SMAC) contains a set of StarCraft maps in which
a team of mostly homogeneous ally units aims to defeat the opponent team. It challenges
an algorithm to develop effective teamwork and decentralised unit micromanagement, and
serves as a common arena for algorithm comparison. We benchmark HAPPO and HATRPO
on five hard maps and five super hard maps in SMAC against QMIX (Rashid et al., 2018)
and MAPPO (Yu et al., 2022), which are known to achieve the strongest reported results. Furthermore,
as Ellis et al. (2022) propose SMACv2 to increase the randomness of tasks and the diversity among
unit types in SMAC, we additionally test HAPPO and HATRPO on five maps in SMACv2
against QMIX and MAPPO. On these two sets of tasks, we adopt the implementations of
QMIX and MAPPO that have achieved the best-reported results, i.e. in SMAC we use the
implementation by Yu et al. (2022) and in SMACv2 we use the implementation by Ellis
et al. (2022). Following the evaluation metric proposed by Wang et al. (2021), we report the
win rates computed across at least three seeds in Table 1 and provide the learning curves in
Appendix J.
We observe that HAPPO and HATRPO are able to achieve comparable or superior per-
formance to QMIX and MAPPO across five hard maps and five super hard maps in SMAC,
while not relying on the restrictive parameter-sharing trick, as opposed to MAPPO. The
learning curves show that HAPPO and HATRPO exhibit steadily improving learn-
ing behaviours, while baselines experience large oscillations on 25m and 27m_vs_30m, again
demonstrating the monotonic improvement property of our methods. On SMACv2, though
randomness and heterogeneity increase, HAPPO and HATRPO robustly achieve
Map Difficulty HAPPO HATRPO MAPPO QMIX Steps
8m_vs_9m Hard 83.8(4.1) 92.5(3.7) 87.5(4.0) 92.2(1.0) 1e7
25m Hard 95.0(2.0) 100.0(0.0) 100.0(0.0) 89.1(3.8) 1e7
5m_vs_6m Hard 77.5(7.2) 75.0(6.5) 75.0(18.2) 77.3(3.3) 1e7
3s5z Hard 97.5(1.2) 93.8(1.2) 96.9(0.7) 89.8(2.5) 1e7
10m_vs_11m Hard 87.5(6.7) 98.8(0.6) 96.9(4.8) 95.3(2.2) 1e7
MMM2 Super Hard 88.8(2.0) 97.5(6.4) 93.8(4.7) 87.5(2.5) 2e7
3s5z_vs_3s6z Super Hard 66.2(3.1) 72.5(14.7) 70.0(10.7) 87.5(12.6) 2e7
27m_vs_30m Super Hard 76.6(1.3) 93.8(2.1) 80.0(6.2) 45.3(14.0) 2e7
corridor Super Hard 92.5(13.9) 88.8(2.7) 97.5(1.2) 82.8(4.4) 2e7
6h_vs_8z Super Hard 76.2(3.1) 78.8(0.6) 85.0(2.0) 92.2(26.2) 4e7
protoss_5_vs_5 - 57.5(1.2) 50.0(2.4) 56.2(3.2) 65.6(3.9) 1e7
terran_5_vs_5 - 57.5(1.3) 56.8(2.9) 53.1(2.7) 62.5(3.8) 1e7
zerg_5_vs_5 - 42.5(2.5) 43.8(1.2) 40.6(7.0) 34.4(2.2) 1e7
zerg_10_vs_10 - 28.4(2.2) 34.6(0.2) 37.5(3.2) 40.6(3.4) 1e7
zerg_10_vs_11 - 16.2(0.6) 19.3(2.1) 29.7(3.8) 25.0(3.9) 1e7
Table 1: Median evaluation win rate and standard deviation on ten SMAC maps (upper
in the table) and five SMACv2 maps (lower in the table) for different methods. All values
within 1 standard deviation of the maximum win rate are marked in bold. The column
labeled “Steps” specifies the number of steps used for training. Our results suggest that
HAPPO and HATRPO perform comparably or better than MAPPO and QMIX on these
tasks, which mainly involve homogeneous agents. Moreover, HAPPO and HATRPO do
not rely on the restrictive parameter-sharing technique, demonstrating their versatility in
various scenarios.
competitive win rates and are comparable to QMIX and MAPPO. Another important observation
is that HATRPO is more effective than HAPPO in SMAC and SMACv2, outperforming
HAPPO on 10 out of 15 tasks. This implies that HATRPO could enhance learning stabil-
ity by imposing explicit constraints on update distance and reward improvement, making
it a promising approach to tackling novel and challenging tasks. Overall, the performance
of HAPPO and HATRPO in SMAC and SMACv2 confirms their capability to coordinate
agents’ training in largely homogeneous settings.
5.4 Google Research Football Testbed
Google Research Football Environment (GRF) comprises a series of tasks where agents are
trained to play football in an advanced, physics-based 3D simulator. The literature (Yu
et al., 2022) shows that GRF remains challenging for existing methods. We apply
HAPPO to the five academy tasks of GRF, namely 3 vs 1 with keeper (3v.1), counterattack
(CA) easy and hard, pass and shoot with keeper (PS), and run pass and shoot with keeper
(RPS), with MAPPO and QMIX as baselines. As GRF does not provide a global state
interface, our solution is to implement a global state based on agents’ observations following
the Simple115StateWrapper of GRF. Concretely, the global state consists of common com-
ponents in agents’ observations and the concatenation of agent-specific parts, and is taken as input by the centralised critic for value prediction.
scenarios HAPPO MAPPO QMIX
PS 96.93(1.11) 94.92(0.85) 8.05(5.58)
RPS 77.30(7.40) 76.83(3.57) 8.08(3.29)
3v.1 94.74(3.05) 88.03(4.15) 8.12(4.46)
CA(easy) 92.00(1.62) 87.76(6.40) 15.98(11.77)
CA(hard) 88.14(5.77) 77.38(10.95) 3.22(4.39)
Table 2: Average evaluation score rate and standard deviation (over six seeds) on GRF
scenarios for different methods. All values within 1 standard deviation of the maximum score
rate are marked in bold. Our results reveal that HAPPO generally outperforms MAPPO
and QMIX on all tasks, setting a new state-of-the-art performance benchmark.
[Figure 9 panels: pass and shoot with keeper (2 agents), run pass and shoot with keeper (2 agents), 3 vs 1 with keeper (3 agents), counterattack easy (4 agents), and counterattack hard (4 agents); score rate (%) vs. environment steps for QMIX, MAPPO, and HAPPO.]
Figure 9: The figure displays the average score rate comparisons for different methods on
GRF, and also illustrates how the performance gaps between HAPPO and MAPPO widen
as the roles and difficulty levels of tasks increase. Overall, our results demonstrate that
HAPPO outperforms MAPPO in tackling complex multi-agent scenarios.
We also utilize the dense-reward setting
in GRF. All methods are trained for 25 million environment steps in all scenarios with the
exception of CA (hard), in which methods are trained for 50 million environment steps. We
compute the success rate over 100 rollouts of the game and report the average success rate
over the last 10 evaluations across 6 seeds in Table 2. We also report the learning curves of
the algorithms in Figure 9.
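As an illustration of the global-state construction described above, a minimal sketch follows (a hedged example; the helper name and the assumption that the shared components occupy a fixed leading block of each Simple115-style observation are ours, and the actual interface in the codebase may differ):

import numpy as np

def build_global_state(obs_list, common_dim):
    """Concatenate the shared observation components with agent-specific parts.

    obs_list:   list of per-agent observation vectors (Simple115-style layout assumed)
    common_dim: number of leading entries assumed to be shared by all agents
    """
    common = obs_list[0][:common_dim]                  # components common to all agents
    specific = [obs[common_dim:] for obs in obs_list]  # agent-specific parts, kept per agent
    return np.concatenate([common] + specific)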
We observe that HAPPO is generally better than MAPPO, establishing new state-of-the-
art results, and they both significantly outperform QMIX. In particular, as the number of
agents increases and the roles they play become more diverse, the performance gap between
HAPPO and MAPPO becomes larger, again showing the effectiveness and advantage of
HARL algorithms for the many-heterogeneous-agent settings. The rendered videos show
that agents trained by HAPPO develop clever teamwork strategies for ensuring a
high score rate, such as cooperative breakthroughs to form one-on-one chances. This
result further supports the effectiveness of applying HAPPO to cooperative MARL problems.

[Figure 10 panels: ShadowHandOver, ShadowHandCatchOver2Underarm, and ShadowHandPen; episode return vs. environment steps for PPO, MAPPO, and HAPPO.]

Figure 10: Comparisons of average episode return on Bi-DexterousHands. The learning
curves demonstrate that HAPPO consistently achieves the highest return, outperforming
both MAPPO and PPO.
5.5 Bi-DexterousHands Testbed
Based on IsaacGym, Bi-DexterousHands provides a suite of tasks for learning human-level
bimanual dexterous manipulation. It leverages GPU parallelisation and enables simultane-
ous instantiation of thousands of environments. Compared with other CPU-based environ-
ments, Bi-DexterousHands significantly increases the number of samples generated in the
same time interval, thus alleviating the sample efficiency problem of on-policy algorithms.
We choose three representative tasks and compare HAPPO with MAPPO as well as PPO.
As the existing reported results of MAPPO on these tasks do not utilize parameter sharing,
we follow this setting to be consistent. The learning curves plotted from training data
across three random seeds are shown in Figure 10. On all three tasks, HAPPO consistently
outperforms MAPPO, and is at least comparable to or better than the single-agent baseline
PPO, while also showing less variance. The comparison between HAPPO and MAPPO
demonstrates the superior competence of the sequential update scheme adopted by HARL
algorithms over simultaneous updates for coordinating multiple heterogeneous agents.
5.6 Ablation Experiments
In this subsection, we conduct an ablation study to investigate the importance of two key
novelties introduced by our HARL algorithms: the heterogeneity of agents’ parameters
and the randomisation of the agent order in the sequential update scheme. We compare the
performance of original HAPPO with a version that shares parameters, and with a version
where the order in sequential update scheme is fixed throughout training. We run the
experiments on two MAMuJoCo tasks, namely 2-agent Walker and 6-agent Walker.
The experiments reveal that the deviation from the theory harms performance. In par-
ticular, parameter sharing introduces unreasonable policy constraints to training, harms the
monotonic improvement property (Theorem 7 assumes heterogeneity), and causes HAPPO
to converge to suboptimal policies. The suboptimality is more severe in the task with more
diverse agents, as discussed in Section 2.3.1.

[Figure 11 panels: Walker2d-v2 2x3 and Walker2d-v2 6x1; episode return vs. environment steps for HAPPO (no random order), HAPPO (share param), and HAPPO (original).]

Figure 11: Performance comparison between the original HAPPO and its modified versions:
HAPPO with parameter sharing, and HAPPO without randomisation of the sequential
update scheme.

Similarly, a fixed order in the sequential update scheme negatively affects the performance at convergence, as suggested by Theorem
8. In the 2-agent task, fixing update order leads to inferior performance throughout the
training process; in the 6-agent task, while the fixed order version initially learns faster,
it is gradually overtaken by the randomised order version and achieves worse convergence
results. We conclude that the strong performance of HARL algorithms relies heavily on the
close connection between theory and implementation.
5.7 Analysis of Computational Overhead
We then analyse the computational overhead introduced by the sequential update scheme.
We mainly compare HAPPO with MAPPO in the parameter-sharing setting, where our im-
plementation conducts a single vectorized update.3 We conduct experiments on seven
MAMuJoCo tasks with all hyperparameters fixed. Both methods are trained for 1 million
steps and we record the computational performance in Table 3. The machine for experi-
ments in this subsection is equipped with an AMD Ryzen 9 5950X 16-Core Processor and
an NVIDIA RTX 3090 Ti GPU, and we ensure that no other experiments are running.
We generally observe that the update time of both HAPPO and MAPPO grows linearly
with the number of agents. For HAPPO, each agent is trained on a constant-
sized batch input, denoted as |B|, across tasks. Thus the total time consumed to update
all agents correlates directly with the agent count. For MAPPO, on the other hand, the
shared parameter is trained on a batch input of size n × |B| when the number of agents is
n. However, as the batch size |B| used in MAMuJoCo is typically large, in this case 4000,
vectorizing the agents’ data does not significantly enhance GPU parallelization. This is evidenced
by the relatively consistent FLOPS recorded across tasks. As a result, the MAPPO update
time also exhibits linear growth with an increasing number of agents. The ratio of HAPPO to
MAPPO update time is almost constant on the first six tasks, and it nearly degenerates
to 1 when both of them sufficiently utilize the computational resources, as shown in the
case of the 17-agent Humanoid, where the significantly higher-dimensional observation space
leads to increased GPU utilization, i.e., FLOPS, for HAPPO.
3. Corresponding to the original implementation at https://github.com/marlbenchmark/on-policy/
blob/0affe7f4b812ed25e280af8115f279fbffe45bbe/onpolicy/algorithms/r_mappo/r_mappo.py#
L205.
scenarios          HAPPO                                        MAPPO
                   experiment   agents update   FLOPS           experiment   share param       FLOPS
                   time(s)      time(s)                         time(s)      update time(s)
HalfCheetah 2x3    203.4(0.4)    8.6(0.0)       368             197.6(1.0)    4.9(0.1)         588
HalfCheetah 3x2    264.7(1.0)   12.9(0.0)       368             256.0(1.1)    6.2(0.0)         644
HalfCheetah 6x1    451.0(1.4)   25.3(0.0)       366             441.4(0.9)   12.3(0.0)         717
Walker 2x3         193.9(2.8)    8.6(0.0)       368             187.4(4.8)    4.9(0.0)         588
Walker 3x2         245.0(6.2)   12.9(0.2)       368             232.6(1.5)    6.3(0.0)         649
Walker 6x1         408.6(9.9)   25.4(0.2)       370             383.2(5.6)   12.2(0.0)         711
Humanoid 17x1      912.0(11.8)  76.7(0.9)       568             988.0(2.1)   71.3(0.1)         738
Table 3: Computational performance comparisons between HAPPO and MAPPO on seven
MAMuJoCo tasks across three seeds. As for the comparison items, “experiment time” de-
notes the overall running time of a single experiment; “agents update time” of HAPPO
denotes the total time of all agent updates; “share param update time” of MAPPO de-
notes the total time consumed in updating the shared parameters; “FLOPS” (floating-point
operations per second) during the update is calculated as the total floating-point opera-
tions in a network forward pass divided by data transfer time plus computation time (unit:
GFLOPS). The main figure represents the mean and the value in parentheses represents the standard
deviation. These figures suggest that the sequential update scheme does not introduce much
computational burden compared to a single vectorized update.
[Figure 12 panels: (a) return vs. wall-time on 2-agent HalfCheetah (HalfCheetah-v2 2x3); (b) return vs. wall-time on 6-agent Walker (Walker2d-v2 6x1); curves for HAPPO and MAPPO.]
Figure 12: Performance comparison between HAPPO and MAPPO with the x-axis being the
wall-time. At the same wall-time, HAPPO generally outperforms parameter-sharing MAPPO.
These facts suggest that
the sequential update scheme does not introduce much computational burden compared to
the single vectorized update. As the update only constitutes a small portion of the whole
experiment, such an additional computational overhead is almost negligible.
In Figure 12, we further provide the learning curves of HAPPO and MAPPO on two
MAMuJoCo tasks corresponding to Figure 6, with the x-axis being wall-time. The oscillation
observed in Figure 12(b) is due to a slight difference in training time across the seeds
rather than the instability of algorithms. These figures demonstrate that HAPPO generally
outperforms MAPPO at the same wall-time. To run 10 million steps, HAPPO needs 8.12%
and 8.64% more time than MAPPO on the two tasks, respectively, an acceptable tradeoff to enjoy the benefits
of the sequential update scheme in terms of improved performance and rigorous theoretical
guarantees. Thus, we justify that computational overhead does not need to be a concern.
6. Conclusion
In this paper, we present Heterogeneous-Agent Reinforcement Learning (HARL) algorithm
series, a set of powerful solutions to cooperative multi-agent problems with theoretical guar-
antees of monotonic improvement and convergence to Nash Equilibrium. Based on the multi-
agent advantage decomposition lemma and the sequential update scheme, we successfully
develop Heterogeneous-Agent Trust Region Learning (HATRL) and introduce two practical
algorithms, HATRPO and HAPPO, via tractable approximations. We further discover
the Heterogeneous-Agent Mirror Learning (HAML) framework, which strengthens the theoretical
guarantees of HATRPO and HAPPO and is a general template for designing provably correct
MARL algorithms whose properties are rigorously profiled. Its consequences include the deriva-
tion of more HARL algorithms, HAA2C, HADDPG, and HATD3, which significantly enrich
the tools for solving cooperative MARL problems. Experimental analysis on MPE, MA-
MuJoCo, SMAC, SMACv2, GRF, and Bi-DexterousHands confirms that HARL algorithms
generally outperform existing MA-counterparts and refresh SOTA results on heterogeneous-
agent benchmarks, showing their superior effectiveness for heterogeneous-agent cooperation
over strong baselines such as MAPPO and QMIX. Ablation studies further substantiate the
importance of the key novelties required by the theory and strengthen the connection between HARL
theory and implementation. For future work, we plan to consider more possibilities of the
HAML framework and validate the effectiveness of HARL algorithms on real-world multi-
robot cooperation tasks.
Acknowledgments
We would like to thank Chengdong Ma for insightful discussions; the authors of MAPPO
(Yu et al., 2022) for providing original training data of MAPPO and QMIX on SMAC
and GRF; the authors of SMACv2 (Ellis et al., 2022) for providing original training data
of MAPPO and QMIX on SMACv2; and the authors of Bi-DexterousHands (Chen et al.,
2022) for providing original training data of MAPPO and PPO on Bi-DexterousHands.
This project is funded by the National Key R&D Program of China (2022ZD0114900), the
Collective Intelligence & Collaboration Laboratory (QXZ23014101), the CCF-Tencent Open
Research Fund (RAGR20220109), the Young Elite Scientists Sponsorship Program by CAST
(2022QNRC002), and the Beijing Municipal Science & Technology Commission (Z221100003422004).
Appendix A. Proofs of Example 2 and 1
Example 2 Consider a fully-cooperative game with an even number of agents $n$, one state, and the joint action space $\{0, 1\}^n$, where the reward is given by $r(0^{n/2}, 1^{n/2}) = r(1^{n/2}, 0^{n/2}) = 1$, and $r(a^{1:n}) = 0$ for all other joint actions. Let $J^*$ be the optimal joint reward, and $J^*_{\text{share}}$ be the optimal joint reward under the shared policy constraint. Then
\[
\frac{J^*_{\text{share}}}{J^*} = \frac{2}{2^n}.
\]
Proof Clearly $J^* = 1$. An optimal joint policy in this case is, for example, the deterministic policy with joint action $(0^{n/2}, 1^{n/2})$.
Now, let the shared policy be $(\theta, 1-\theta)$, where $\theta$ determines the probability that an agent takes action $0$. Then, the expected reward is
\[
J(\theta) = \Pr\big(a^{1:n} = (0^{n/2}, 1^{n/2})\big) \cdot 1 + \Pr\big(a^{1:n} = (1^{n/2}, 0^{n/2})\big) \cdot 1 = 2\,\theta^{n/2}(1-\theta)^{n/2}.
\]
In order to maximise $J(\theta)$, we must maximise $\theta(1-\theta)$, or equivalently, $\sqrt{\theta(1-\theta)}$. By the arithmetic-geometric means inequality, we have
\[
\sqrt{\theta(1-\theta)} \leq \frac{\theta + (1-\theta)}{2} = \frac{1}{2},
\]
where the equality holds if and only if $\theta = 1-\theta$, that is, $\theta = \frac{1}{2}$. In such case we have
\[
J^*_{\text{share}} = J\Big(\frac{1}{2}\Big) = 2 \cdot 2^{-n/2} \cdot 2^{-n/2} = \frac{2}{2^n},
\]
which finishes the proof.
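The ratio can also be verified numerically; the brute-force check below (a small standalone script, not part of the proof) enumerates joint actions for a given $n$ and maximises the shared-policy return over a grid of $\theta$:

import itertools
import numpy as np

def reward(joint_action):
    n = len(joint_action)
    half = n // 2
    zeros_then_ones = (0,) * half + (1,) * half
    ones_then_zeros = (1,) * half + (0,) * half
    return 1.0 if joint_action in (zeros_then_ones, ones_then_zeros) else 0.0

def shared_policy_return(theta, n):
    # all agents independently take action 0 with probability theta
    total = 0.0
    for a in itertools.product((0, 1), repeat=n):
        prob = np.prod([theta if ai == 0 else 1 - theta for ai in a])
        total += prob * reward(a)
    return total

n = 6
best_shared = max(shared_policy_return(t, n) for t in np.linspace(0, 1, 1001))
print(best_shared, 2 / 2 ** n)   # both print (approximately) 0.03125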
Example 1 Let us consider a fully-cooperative game with 2 agents, one state, and the joint action space $\{0, 1\}^2$, where the reward is given by $r(0, 0) = 0$, $r(0, 1) = r(1, 0) = 2$, and $r(1, 1) = -1$. Suppose that $\pi^i_{\text{old}}(0) > 0.6$ for $i = 1, 2$. Then, if agents $i$ update their policies by
\[
\pi^i_{\text{new}} = \arg\max_{\pi^i} \mathbb{E}_{a^i \sim \pi^i,\, a^{-i} \sim \pi^{-i}_{\text{old}}}\big[A_{\pi_{\text{old}}}(a^i, a^{-i})\big], \quad \forall i \in \mathcal{N},
\]
then the resulting policy will yield a lower return,
\[
J(\pi_{\text{old}}) > J(\pi_{\text{new}}) = \min_{\pi} J(\pi).
\]
Proof As there is only one state, we can ignore the infinite horizon and the discount factor $\gamma$, thus making the state-action value and the reward functions equivalent, $Q \equiv r$.
Let us, for brevity, define $\pi^i = \pi^i_{\text{old}}(0) > 0.6$, for $i = 1, 2$. Since the conditional expectation below is at least $-1$ and $\Pr(a^1 = a^2 = 0) > 0.36$, we have
\[
J(\pi_{\text{old}}) = \Pr(a^1 = a^2 = 0)\, r(0, 0) + \big(1 - \Pr(a^1 = a^2 = 0)\big)\, \mathbb{E}\big[r(a^1, a^2) \mid (a^1, a^2) \neq (0, 0)\big] > 0.6^2 \times 0 - (1 - 0.6^2) = -0.64.
\]
The update rule stated in the proposition can be equivalently written as
\[
\pi^i_{\text{new}} = \arg\max_{\pi^i} \mathbb{E}_{a^i \sim \pi^i,\, a^{-i} \sim \pi^{-i}_{\text{old}}}\big[Q_{\pi_{\text{old}}}(a^i, a^{-i})\big]. \tag{15}
\]
We have
\[
\mathbb{E}_{a^{-i} \sim \pi^{-i}_{\text{old}}}\big[Q_{\pi_{\text{old}}}(0, a^{-i})\big] = \pi^{-i} Q(0, 0) + (1 - \pi^{-i}) Q(0, 1) = \pi^{-i} r(0, 0) + (1 - \pi^{-i}) r(0, 1) = 2(1 - \pi^{-i}),
\]
and similarly
\[
\mathbb{E}_{a^{-i} \sim \pi^{-i}_{\text{old}}}\big[Q_{\pi_{\text{old}}}(1, a^{-i})\big] = \pi^{-i} r(1, 0) + (1 - \pi^{-i}) r(1, 1) = 2\pi^{-i} - (1 - \pi^{-i}) = 3\pi^{-i} - 1.
\]
Hence, if $\pi^{-i} > 0.6$, then
\[
\mathbb{E}_{a^{-i} \sim \pi^{-i}_{\text{old}}}\big[Q_{\pi_{\text{old}}}(1, a^{-i})\big] = 3\pi^{-i} - 1 > 3 \times 0.6 - 1 = 0.8 > 2 - 2\pi^{-i} = \mathbb{E}_{a^{-i} \sim \pi^{-i}_{\text{old}}}\big[Q_{\pi_{\text{old}}}(0, a^{-i})\big].
\]
Therefore, for every $i$, the solution to Equation (15) is the greedy policy $\pi^i_{\text{new}}(1) = 1$. Therefore,
\[
J(\pi_{\text{new}}) = Q(1, 1) = r(1, 1) = -1,
\]
which finishes the proof.
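A standalone numeric check of this example (not part of the proof; the reward table matches the statement above) confirms that the greedy response to the old policies lands both agents on the worst joint action:

import itertools

r = {(0, 0): 0.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): -1.0}

def J(p1, p2):
    """Expected reward when agent i plays action 0 with probability p_i."""
    pr1 = {0: p1, 1: 1 - p1}
    pr2 = {0: p2, 1: 1 - p2}
    return sum(pr1[a1] * pr2[a2] * r[(a1, a2)] for a1, a2 in r)

p_old = 0.7                                          # any value > 0.6
q0 = p_old * r[(0, 0)] + (1 - p_old) * r[(0, 1)]     # 2(1 - p_old)  = 0.6
q1 = p_old * r[(1, 0)] + (1 - p_old) * r[(1, 1)]     # 3*p_old - 1   = 1.1
assert q1 > q0                                       # so both agents switch to action 1
print(J(p_old, p_old))                               # old return, 0.75 > -0.64
print(J(0.0, 0.0))                                   # new return: both play 1, yielding -1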
Appendix B. Derivation and Analysis of Algorithm 1
B.1 Recap of Existing Results
Lemma 15 (Performance Difference) Let $\bar{\pi}$ and $\pi$ be two policies. Then, the following identity holds,
\[
J(\bar{\pi}) - J(\pi) = \mathbb{E}_{s \sim \rho_{\bar{\pi}},\, a \sim \bar{\pi}}\big[A_{\pi}(s, a)\big].
\]
Proof See Kakade and Langford (2002) (Lemma 6.1) or Schulman et al. (2015) (Appendix A).

Theorem 16 (Schulman et al., 2015, Theorem 1) Let $\pi$ be the current policy and $\bar{\pi}$ be the next candidate policy. We define $L_{\pi}(\bar{\pi}) = J(\pi) + \mathbb{E}_{s \sim \rho_{\pi},\, a \sim \bar{\pi}}\big[A_{\pi}(s, a)\big]$ and $D^{\max}_{\mathrm{KL}}(\pi, \bar{\pi}) = \max_{s} D_{\mathrm{KL}}\big(\pi(\cdot|s), \bar{\pi}(\cdot|s)\big)$. Then the inequality
\[
J(\bar{\pi}) \geq L_{\pi}(\bar{\pi}) - C D^{\max}_{\mathrm{KL}}(\pi, \bar{\pi}) \tag{16}
\]
holds, where $C = \frac{4\gamma \max_{s,a}|A_{\pi}(s,a)|}{(1-\gamma)^2}$.
Proof See Schulman et al. (2015) (Appendix A and Equation (9) of the paper).
B.2 Analysis of Training of Algorithm 1
Lemma 17 (Multi-Agent Advantage Decomposition) In any cooperative Markov game, given a joint policy $\pi$, for any state $s$ and any agent subset $i_{1:m}$, the following equation holds:
\[
A^{i_{1:m}}_{\pi}\big(s, a^{i_{1:m}}\big) = \sum_{j=1}^{m} A^{i_j}_{\pi}\big(s, a^{i_{1:j-1}}, a^{i_j}\big).
\]
Proof By the definition of the multi-agent advantage function,
\begin{align*}
A^{i_{1:m}}_{\pi}(s, a^{i_{1:m}}) &= Q^{i_{1:m}}_{\pi}(s, a^{i_{1:m}}) - V_{\pi}(s) \\
&= \sum_{k=1}^{m} \Big[ Q^{i_{1:k}}_{\pi}(s, a^{i_{1:k}}) - Q^{i_{1:k-1}}_{\pi}(s, a^{i_{1:k-1}}) \Big] \\
&= \sum_{k=1}^{m} A^{i_k}_{\pi}(s, a^{i_{1:k-1}}, a^{i_k}),
\end{align*}
which finishes the proof.
Note that a similar finding has been shown in Kuba et al. (2021).
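For a quick sanity check of the decomposition, the standalone script below (a randomly generated single-state game with three binary-action agents; not part of the proof) verifies the telescoping identity numerically:

import numpy as np

rng = np.random.default_rng(0)
n = 3                                    # three agents, binary actions, one state
Q = rng.normal(size=(2,) * n)            # joint value Q(a^1, a^2, a^3)
pi = rng.dirichlet(np.ones(2), size=n)   # independent per-agent policies pi^i(a^i)

def q_subset(actions):
    """Q^{1:k}(a^{1:k}) = E_{a^{k+1:n} ~ pi}[Q(a)] for the fixed prefix `actions`."""
    out = Q[tuple(actions)]
    for j in range(len(actions), n):     # average out the remaining agents
        out = np.tensordot(pi[j], out, axes=([0], [0]))
    return out

V = q_subset(())                         # value of the single state
prefix = (1, 0, 1)                       # an arbitrary joint action a^{1:3}
joint_adv = q_subset(prefix) - V
sum_of_multi_advs = sum(
    q_subset(prefix[: k + 1]) - q_subset(prefix[:k]) for k in range(n)
)
assert np.isclose(joint_adv, sum_of_multi_advs)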
Lemma 18 Let $\pi = \prod_{i=1}^{n} \pi^i$ and $\bar{\pi} = \prod_{i=1}^{n} \bar{\pi}^i$ be joint policies. Then
\[
D^{\max}_{\mathrm{KL}}(\pi, \bar{\pi}) \leq \sum_{i=1}^{n} D^{\max}_{\mathrm{KL}}\big(\pi^i, \bar{\pi}^i\big).
\]
Proof For any state $s$, we have
\begin{align*}
D_{\mathrm{KL}}\big(\pi(\cdot|s), \bar{\pi}(\cdot|s)\big) &= \mathbb{E}_{a \sim \pi}\big[\log \pi(a|s) - \log \bar{\pi}(a|s)\big] \\
&= \mathbb{E}_{a \sim \pi}\Bigg[\log\Bigg(\prod_{i=1}^{n} \pi^i(a^i|s)\Bigg) - \log\Bigg(\prod_{i=1}^{n} \bar{\pi}^i(a^i|s)\Bigg)\Bigg] \\
&= \mathbb{E}_{a \sim \pi}\Bigg[\sum_{i=1}^{n} \log \pi^i(a^i|s) - \sum_{i=1}^{n} \log \bar{\pi}^i(a^i|s)\Bigg] \\
&= \sum_{i=1}^{n} \mathbb{E}_{a^i \sim \pi^i,\, a^{-i} \sim \pi^{-i}}\big[\log \pi^i(a^i|s) - \log \bar{\pi}^i(a^i|s)\big] \\
&= \sum_{i=1}^{n} D_{\mathrm{KL}}\big(\pi^i(\cdot|s), \bar{\pi}^i(\cdot|s)\big). \tag{17}
\end{align*}
Now, taking the maximum over $s$ on both sides yields
\[
D^{\max}_{\mathrm{KL}}(\pi, \bar{\pi}) \leq \sum_{i=1}^{n} D^{\max}_{\mathrm{KL}}\big(\pi^i, \bar{\pi}^i\big),
\]
as required.
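The additivity in Equation (17) is easy to confirm numerically for product policies (a standalone check, not part of the proof):

import numpy as np
from scipy.stats import entropy   # entropy(p, q) computes KL(p || q)

rng = np.random.default_rng(1)
p = [rng.dirichlet(np.ones(3)) for _ in range(4)]     # pi^i(.|s) for 4 agents
q = [rng.dirichlet(np.ones(3)) for _ in range(4)]     # bar{pi}^i(.|s)

def product_dist(factors):
    joint = factors[0]
    for f in factors[1:]:
        joint = np.outer(joint, f).ravel()            # independent joint policy
    return joint

kl_joint = entropy(product_dist(p), product_dist(q))
kl_sum = sum(entropy(pi, qi) for pi, qi in zip(p, q))
assert np.isclose(kl_joint, kl_sum)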
Lemma 19 Let $\pi$ be a joint policy. Then, for any joint policy $\bar{\pi}$, we have
\[
J(\bar{\pi}) \geq J(\pi) + \sum_{m=1}^{n} \Big[ L^{i_{1:m}}_{\pi}\big(\bar{\pi}^{i_{1:m-1}}, \bar{\pi}^{i_m}\big) - C D^{\max}_{\mathrm{KL}}\big(\pi^{i_m}, \bar{\pi}^{i_m}\big) \Big], \quad \text{where } C = \frac{4\gamma \max_{s,a}|A_{\pi}(s, a)|}{(1-\gamma)^2}. \tag{6}
\]
Proof By Theorem 16,
\begin{align*}
J(\bar{\pi}) &\geq L_{\pi}(\bar{\pi}) - C D^{\max}_{\mathrm{KL}}(\pi, \bar{\pi}) \\
&= J(\pi) + \mathbb{E}_{s \sim \rho_{\pi},\, a \sim \bar{\pi}}\big[A_{\pi}(s, a)\big] - C D^{\max}_{\mathrm{KL}}(\pi, \bar{\pi}),
\end{align*}
which by Lemma 4 equals
\[
J(\pi) + \mathbb{E}_{s \sim \rho_{\pi},\, a \sim \bar{\pi}}\Bigg[\sum_{m=1}^{n} A^{i_m}_{\pi}\big(s, a^{i_{1:m-1}}, a^{i_m}\big)\Bigg] - C D^{\max}_{\mathrm{KL}}(\pi, \bar{\pi}),
\]
and by Lemma 18 this is at least
\begin{align*}
&J(\pi) + \mathbb{E}_{s \sim \rho_{\pi},\, a \sim \bar{\pi}}\Bigg[\sum_{m=1}^{n} A^{i_m}_{\pi}\big(s, a^{i_{1:m-1}}, a^{i_m}\big)\Bigg] - \sum_{m=1}^{n} C D^{\max}_{\mathrm{KL}}\big(\pi^{i_m}, \bar{\pi}^{i_m}\big) \\
&= J(\pi) + \sum_{m=1}^{n} \mathbb{E}_{s \sim \rho_{\pi},\, a^{i_{1:m-1}} \sim \bar{\pi}^{i_{1:m-1}},\, a^{i_m} \sim \bar{\pi}^{i_m}}\Big[A^{i_m}_{\pi}\big(s, a^{i_{1:m-1}}, a^{i_m}\big)\Big] - \sum_{m=1}^{n} C D^{\max}_{\mathrm{KL}}\big(\pi^{i_m}, \bar{\pi}^{i_m}\big) \\
&= J(\pi) + \sum_{m=1}^{n} \Big[ L^{i_{1:m}}_{\pi}\big(\bar{\pi}^{i_{1:m-1}}, \bar{\pi}^{i_m}\big) - C D^{\max}_{\mathrm{KL}}\big(\pi^{i_m}, \bar{\pi}^{i_m}\big) \Big].
\end{align*}
Theorem 7 A sequence $(\pi_k)_{k=0}^{\infty}$ of joint policies updated by Algorithm 1 has the monotonic improvement property, i.e., $J(\pi_{k+1}) \geq J(\pi_k)$ for all $k \in \mathbb{N}$.
Proof Let $\pi_0$ be any joint policy. For every $k \geq 0$, the joint policy $\pi_{k+1}$ is obtained from $\pi_k$ by the Algorithm 1 update; for $m = 1, \dots, n$,
\[
\pi^{i_m}_{k+1} = \arg\max_{\pi^{i_m}} \Big[ L^{i_{1:m}}_{\pi_k}\big(\pi^{i_{1:m-1}}_{k+1}, \pi^{i_m}\big) - C D^{\max}_{\mathrm{KL}}\big(\pi^{i_m}_k, \pi^{i_m}\big) \Big].
\]
By Theorem 16, we have
\[
J(\pi_{k+1}) \geq L_{\pi_k}(\pi_{k+1}) - C D^{\max}_{\mathrm{KL}}(\pi_k, \pi_{k+1}),
\]
which by Lemma 18 is lower-bounded by
\[
L_{\pi_k}(\pi_{k+1}) - \sum_{m=1}^{n} C D^{\max}_{\mathrm{KL}}\big(\pi^{i_m}_k, \pi^{i_m}_{k+1}\big) = J(\pi_k) + \sum_{m=1}^{n} \Big[ L^{i_{1:m}}_{\pi_k}\big(\pi^{i_{1:m-1}}_{k+1}, \pi^{i_m}_{k+1}\big) - C D^{\max}_{\mathrm{KL}}\big(\pi^{i_m}_k, \pi^{i_m}_{k+1}\big) \Big], \tag{18}
\]
and as for every $m$, $\pi^{i_m}_{k+1}$ is the argmax, this is lower-bounded by
\[
J(\pi_k) + \sum_{m=1}^{n} \Big[ L^{i_{1:m}}_{\pi_k}\big(\pi^{i_{1:m-1}}_{k+1}, \pi^{i_m}_k\big) - C D^{\max}_{\mathrm{KL}}\big(\pi^{i_m}_k, \pi^{i_m}_k\big) \Big],
\]
which, as mentioned in Definition 5, equals
\[
J(\pi_k) + \sum_{m=1}^{n} 0 = J(\pi_k),
\]
where the last step follows from Equation (5). This proves that Algorithm 1 achieves monotonic improvement.
B.3 Analysis of Convergence of Algorithm 1
Theorem 8 Supposing in Algorithm 1 any permutation of agents has a fixed non-zero probability to begin the update, a sequence $(\pi_k)_{k=0}^{\infty}$ of joint policies generated by the algorithm, in a cooperative Markov game, has a non-empty set of limit points, each of which is a Nash equilibrium.
Proof
Step 1 (convergence). Firstly, it is clear that the sequence $(J(\pi_k))_{k=0}^{\infty}$ converges as, by Theorem 7, it is non-decreasing and bounded above by $\frac{R_{\max}}{1-\gamma}$. Let us denote the limit by $\bar{J}$. For every $k$, we denote the tuple of agents, according to whose order the agents perform the sequential updates, by $i^k_{1:n}$, and we note that $\big(i^k_{1:n}\big)_{k \in \mathbb{N}}$ is a random process. Furthermore, we know that the sequence of policies $(\pi_k)$ is bounded, so by the Bolzano-Weierstrass Theorem, it has at least one convergent subsequence. Let $\bar{\pi}$ be any limit point of the sequence (note that the set of limit points is a random set), and $\big(\pi_{k_j}\big)_{j=0}^{\infty}$ be a subsequence converging to $\bar{\pi}$ (which is a random subsequence as well). By continuity of $J$ in $\pi$, we have
\[
J(\bar{\pi}) = J\Big(\lim_{j \to \infty} \pi_{k_j}\Big) = \lim_{j \to \infty} J\big(\pi_{k_j}\big) = \bar{J}. \tag{19}
\]
For now, we introduce an auxiliary definition.

Definition 20 (TR-Stationarity) A joint policy $\bar{\pi}$ is trust-region-stationary (TR-stationary) if, for every agent $i$,
\[
\bar{\pi}^i = \arg\max_{\pi^i} \Big( \mathbb{E}_{s \sim \rho_{\bar{\pi}},\, a^i \sim \pi^i}\big[A^i_{\bar{\pi}}(s, a^i)\big] - C_{\bar{\pi}} D^{\max}_{\mathrm{KL}}\big(\bar{\pi}^i, \pi^i\big) \Big),
\]
where $C_{\bar{\pi}} = \frac{4\gamma\Delta}{(1-\gamma)^2}$, and $\Delta = \max_{s,a}|A_{\bar{\pi}}(s, a)|$.

We will now establish the TR-stationarity of any limit point joint policy $\bar{\pi}$ (which, as stated above, is a random variable). Let $\mathbb{E}_{i^{0:\infty}_{1:n}}[\cdot]$ denote the expected value operator under the random process $\big(i^{0:\infty}_{1:n}\big)$. Let also $\Delta_k = \max_{s,a}|A_{\pi_k}(s, a)|$, and $C_k = \frac{4\gamma\Delta_k}{(1-\gamma)^2}$. We have
\begin{align*}
0 &= \lim_{k \to \infty} \mathbb{E}_{i^{0:\infty}_{1:n}}\big[J(\pi_{k+1}) - J(\pi_k)\big] \\
&\geq \lim_{k \to \infty} \mathbb{E}_{i^{0:\infty}_{1:n}}\big[L_{\pi_k}(\pi_{k+1}) - J(\pi_k) - C_k D^{\max}_{\mathrm{KL}}(\pi_k, \pi_{k+1})\big] \qquad \text{by Theorem 16} \\
&\geq \lim_{k \to \infty} \mathbb{E}_{i^{0:\infty}_{1:n}}\Big[L^{i^k_1}_{\pi_k}\big(\pi^{i^k_1}_{k+1}\big) - C_k D^{\max}_{\mathrm{KL}}\big(\pi^{i^k_1}_k, \pi^{i^k_1}_{k+1}\big)\Big]
\end{align*}
by Equation (18) and the fact that each of its summands is non-negative.
Now, we consider an arbitrary limit point $\bar{\pi}$ from the (random) limit set, and a (random) subsequence $\big(\pi_{k_j}\big)_{j=0}^{\infty}$ that converges to $\bar{\pi}$. We get
\[
0 \geq \lim_{j \to \infty} \mathbb{E}_{i^{0:\infty}_{1:n}}\Big[L^{i^{k_j}_1}_{\pi_{k_j}}\big(\pi^{i^{k_j}_1}_{k_j+1}\big) - C_{k_j} D^{\max}_{\mathrm{KL}}\big(\pi^{i^{k_j}_1}_{k_j}, \pi^{i^{k_j}_1}_{k_j+1}\big)\Big].
\]
As the expectation is taken of non-negative random variables, and for every $i \in \mathcal{N}$ and $k \in \mathbb{N}$, with some positive probability $p_i$, we have $i^{k_j}_1 = i$ (because every permutation has non-zero probability), the above is bounded from below by
\[
p_i \lim_{j \to \infty} \max_{\pi^i} \Big[L^i_{\pi_{k_j}}(\pi^i) - C_{k_j} D^{\max}_{\mathrm{KL}}\big(\pi^i_{k_j}, \pi^i\big)\Big],
\]
which, as $\pi_{k_j}$ converges to $\bar{\pi}$, equals
\[
p_i \max_{\pi^i} \Big[L^i_{\bar{\pi}}(\pi^i) - C_{\bar{\pi}} D^{\max}_{\mathrm{KL}}\big(\bar{\pi}^i, \pi^i\big)\Big] \geq 0, \quad \text{by Equation (5)}.
\]
This proves that, for any limit point $\bar{\pi}$ of the random process $(\pi_k)$ induced by Algorithm 1, $\max_{\pi^i}\big[L^i_{\bar{\pi}}(\pi^i) - C_{\bar{\pi}} D^{\max}_{\mathrm{KL}}(\bar{\pi}^i, \pi^i)\big] = 0$, which is equivalent to Definition 20.
Step 2 (dropping the penalty term). Now, we have to prove that TR-stationary points are NEs of cooperative Markov games. The main step is to prove the following statement: a TR-stationary joint policy $\bar{\pi}$, for every state $s \in \mathcal{S}$, satisfies
\[
\bar{\pi}^i = \arg\max_{\pi^i} \mathbb{E}_{a^i \sim \pi^i}\big[A^i_{\bar{\pi}}(s, a^i)\big]. \tag{20}
\]
We will use the technique of proof by contradiction. Suppose that there is a state $s_0$ such that there exists a policy $\hat{\pi}^i$ with
\[
\mathbb{E}_{a^i \sim \hat{\pi}^i}\big[A^i_{\bar{\pi}}(s_0, a^i)\big] > \mathbb{E}_{a^i \sim \bar{\pi}^i}\big[A^i_{\bar{\pi}}(s_0, a^i)\big]. \tag{21}
\]
Let us parametrise the policies $\pi^i$ according to the template
\[
\pi^i(\cdot|s_0) = \Big(x^i_1, \dots, x^i_{d^i-1},\; 1 - \sum_{j=1}^{d^i-1} x^i_j\Big),
\]
where the values of $x^i_j$ ($j = 1, \dots, d^i-1$) are such that $\pi^i(\cdot|s_0)$ is a valid probability distribution. Then we can rewrite our quantity of interest (the objective of Equation (20)) as
\begin{align*}
\mathbb{E}_{a^i \sim \pi^i}\big[A^i_{\bar{\pi}}(s_0, a^i)\big] &= \sum_{j=1}^{d^i-1} x^i_j \cdot A^i_{\bar{\pi}}\big(s_0, a^i_j\big) + \Big(1 - \sum_{h=1}^{d^i-1} x^i_h\Big) A^i_{\bar{\pi}}\big(s_0, a^i_{d^i}\big) \\
&= \sum_{j=1}^{d^i-1} x^i_j \Big[A^i_{\bar{\pi}}\big(s_0, a^i_j\big) - A^i_{\bar{\pi}}\big(s_0, a^i_{d^i}\big)\Big] + A^i_{\bar{\pi}}\big(s_0, a^i_{d^i}\big),
\end{align*}
which is an affine function of the policy parameterisation. It follows that its gradient (with respect to $x^i$) and directional derivatives are constant in the space of policies at state $s_0$. The existence of a policy $\hat{\pi}^i(\cdot|s_0)$ for which Inequality (21) holds implies that the directional derivative in the direction from $\bar{\pi}^i(\cdot|s_0)$ to $\hat{\pi}^i(\cdot|s_0)$ is strictly positive. We also have
\begin{align*}
\frac{\partial D_{\mathrm{KL}}\big(\bar{\pi}^i(\cdot|s_0), \pi^i(\cdot|s_0)\big)}{\partial x^i_j}
&= \frac{\partial}{\partial x^i_j}\Big[\big(\bar{\pi}^i(\cdot|s_0)\big)^{T}\big(\log \bar{\pi}^i(\cdot|s_0) - \log \pi^i(\cdot|s_0)\big)\Big] \\
&= -\frac{\partial}{\partial x^i_j}\big(\bar{\pi}^i\big)^{T}\log \pi^i \qquad \text{(omitting state $s_0$ for brevity)} \\
&= -\frac{\partial}{\partial x^i_j}\Bigg[\sum_{k=1}^{d^i-1} \bar{\pi}^i_k \log x^i_k + \bar{\pi}^i_{d^i}\log\Big(1 - \sum_{k=1}^{d^i-1} x^i_k\Big)\Bigg] \\
&= -\frac{\bar{\pi}^i_j}{x^i_j} + \frac{\bar{\pi}^i_{d^i}}{1 - \sum_{k=1}^{d^i-1} x^i_k} = -\frac{\bar{\pi}^i_j}{\pi^i_j} + \frac{\bar{\pi}^i_{d^i}}{\pi^i_{d^i}} = 0, \quad \text{when evaluated at } \pi^i = \bar{\pi}^i, \tag{22}
\end{align*}
which means that the KL-penalty has zero gradient at $\bar{\pi}^i(\cdot|s_0)$. Hence, when evaluated at $\pi^i(\cdot|s_0) = \bar{\pi}^i(\cdot|s_0)$, the objective
\[
\rho_{\bar{\pi}}(s_0)\,\mathbb{E}_{a^i \sim \pi^i}\big[A^i_{\bar{\pi}}(s_0, a^i)\big] - C_{\bar{\pi}} D_{\mathrm{KL}}\big(\bar{\pi}^i(\cdot|s_0), \pi^i(\cdot|s_0)\big)
\]
has a strictly positive directional derivative in the direction of $\hat{\pi}^i(\cdot|s_0)$. Thus, there exists a policy $\tilde{\pi}^i(\cdot|s_0)$, sufficiently close to $\bar{\pi}^i(\cdot|s_0)$ on the path joining it with $\hat{\pi}^i(\cdot|s_0)$, for which
\[
\rho_{\bar{\pi}}(s_0)\,\mathbb{E}_{a^i \sim \tilde{\pi}^i}\big[A^i_{\bar{\pi}}(s_0, a^i)\big] - C_{\bar{\pi}} D_{\mathrm{KL}}\big(\bar{\pi}^i(\cdot|s_0), \tilde{\pi}^i(\cdot|s_0)\big) > 0.
\]
Let $\pi^i$ be a policy such that $\pi^i(\cdot|s_0) = \tilde{\pi}^i(\cdot|s_0)$, and $\pi^i(\cdot|s) = \bar{\pi}^i(\cdot|s)$ for states $s \neq s_0$. As for these states we have
\[
\rho_{\bar{\pi}}(s)\,\mathbb{E}_{a^i \sim \pi^i}\big[A^i_{\bar{\pi}}(s, a^i)\big] = \rho_{\bar{\pi}}(s)\,\mathbb{E}_{a^i \sim \bar{\pi}^i}\big[A^i_{\bar{\pi}}(s, a^i)\big] = 0, \quad \text{and} \quad D_{\mathrm{KL}}\big(\bar{\pi}^i(\cdot|s), \pi^i(\cdot|s)\big) = 0,
\]
it follows that
\[
L^i_{\bar{\pi}}(\pi^i) - C_{\bar{\pi}} D^{\max}_{\mathrm{KL}}\big(\bar{\pi}^i, \pi^i\big) = \rho_{\bar{\pi}}(s_0)\,\mathbb{E}_{a^i \sim \tilde{\pi}^i}\big[A^i_{\bar{\pi}}(s_0, a^i)\big] - C_{\bar{\pi}} D_{\mathrm{KL}}\big(\bar{\pi}^i(\cdot|s_0), \tilde{\pi}^i(\cdot|s_0)\big) > 0 = L^i_{\bar{\pi}}(\bar{\pi}^i) - C_{\bar{\pi}} D^{\max}_{\mathrm{KL}}\big(\bar{\pi}^i, \bar{\pi}^i\big),
\]
which is a contradiction with the TR-stationarity of $\bar{\pi}$. Hence, the claim of Equation (20) is proved.
Step 3 (optimality). Now, for a fixed joint policy $\bar{\pi}^{-i}$ of the other agents, $\bar{\pi}^i$ satisfies
\[
\bar{\pi}^i = \arg\max_{\pi^i} \mathbb{E}_{a^i \sim \pi^i}\big[A^i_{\bar{\pi}}(s, a^i)\big] = \arg\max_{\pi^i} \mathbb{E}_{a^i \sim \pi^i}\big[Q^i_{\bar{\pi}}(s, a^i)\big], \quad \forall s \in \mathcal{S},
\]
which is the Bellman optimality equation (Sutton and Barto, 2018). Hence, for a fixed joint policy $\bar{\pi}^{-i}$, the policy $\bar{\pi}^i$ is optimal:
\[
\bar{\pi}^i = \arg\max_{\pi^i} J\big(\pi^i, \bar{\pi}^{-i}\big).
\]
As agent $i$ was chosen arbitrarily, $\bar{\pi}$ is a Nash equilibrium.
Appendix C. HATRPO and HAPPO
C.1 Proof of Proposition 9
Proposition 21 Let $\pi = \prod_{j=1}^{n} \pi^{i_j}$ be a joint policy, and $A_{\pi}(s, a)$ be its joint advantage function. Let $\bar{\pi}^{i_{1:m-1}} = \prod_{j=1}^{m-1} \bar{\pi}^{i_j}$ be some other joint policy of agents $i_{1:m-1}$, and $\hat{\pi}^{i_m}$ be some other policy of agent $i_m$. Then, for every state $s$,
\[
\mathbb{E}_{a^{i_{1:m-1}} \sim \bar{\pi}^{i_{1:m-1}},\, a^{i_m} \sim \hat{\pi}^{i_m}}\Big[A^{i_m}_{\pi}\big(s, a^{i_{1:m-1}}, a^{i_m}\big)\Big]
= \mathbb{E}_{a \sim \pi}\Bigg[\bigg(\frac{\hat{\pi}^{i_m}(a^{i_m}|s)}{\pi^{i_m}(a^{i_m}|s)} - 1\bigg)\frac{\bar{\pi}^{i_{1:m-1}}(a^{i_{1:m-1}}|s)}{\pi^{i_{1:m-1}}(a^{i_{1:m-1}}|s)}\, A_{\pi}(s, a)\Bigg]. \tag{9}
\]
Proof
\begin{align*}
&\mathbb{E}_{a \sim \pi}\Bigg[\bigg(\frac{\hat{\pi}^{i_m}(a^{i_m}|s)}{\pi^{i_m}(a^{i_m}|s)} - 1\bigg)\frac{\bar{\pi}^{i_{1:m-1}}(a^{i_{1:m-1}}|s)}{\pi^{i_{1:m-1}}(a^{i_{1:m-1}}|s)}\, A_{\pi}(s, a)\Bigg] \\
&= \mathbb{E}_{a \sim \pi}\Bigg[\frac{\hat{\pi}^{i_m}(a^{i_m}|s)\,\bar{\pi}^{i_{1:m-1}}(a^{i_{1:m-1}}|s)}{\pi^{i_{1:m}}(a^{i_{1:m}}|s)}\, A_{\pi}(s, a) - \frac{\bar{\pi}^{i_{1:m-1}}(a^{i_{1:m-1}}|s)}{\pi^{i_{1:m-1}}(a^{i_{1:m-1}}|s)}\, A_{\pi}(s, a)\Bigg] \\
&= \mathbb{E}_{a^{i_{1:m}} \sim \pi^{i_{1:m}},\, a^{-i_{1:m}} \sim \pi^{-i_{1:m}}}\Bigg[\frac{\hat{\pi}^{i_m}(a^{i_m}|s)\,\bar{\pi}^{i_{1:m-1}}(a^{i_{1:m-1}}|s)}{\pi^{i_{1:m}}(a^{i_{1:m}}|s)}\, A_{\pi}\big(s, a^{i_{1:m}}, a^{-i_{1:m}}\big)\Bigg] \\
&\quad - \mathbb{E}_{a^{i_{1:m-1}} \sim \pi^{i_{1:m-1}},\, a^{-i_{1:m-1}} \sim \pi^{-i_{1:m-1}}}\Bigg[\frac{\bar{\pi}^{i_{1:m-1}}(a^{i_{1:m-1}}|s)}{\pi^{i_{1:m-1}}(a^{i_{1:m-1}}|s)}\, A_{\pi}\big(s, a^{i_{1:m-1}}, a^{-i_{1:m-1}}\big)\Bigg] \\
&= \mathbb{E}_{a^{i_{1:m-1}} \sim \bar{\pi}^{i_{1:m-1}},\, a^{i_m} \sim \hat{\pi}^{i_m},\, a^{-i_{1:m}} \sim \pi^{-i_{1:m}}}\Big[A_{\pi}\big(s, a^{i_{1:m}}, a^{-i_{1:m}}\big)\Big] - \mathbb{E}_{a^{i_{1:m-1}} \sim \bar{\pi}^{i_{1:m-1}},\, a^{-i_{1:m-1}} \sim \pi^{-i_{1:m-1}}}\Big[A_{\pi}\big(s, a^{i_{1:m-1}}, a^{-i_{1:m-1}}\big)\Big] \\
&= \mathbb{E}_{a^{i_{1:m-1}} \sim \bar{\pi}^{i_{1:m-1}},\, a^{i_m} \sim \hat{\pi}^{i_m}}\Big[\mathbb{E}_{a^{-i_{1:m}} \sim \pi^{-i_{1:m}}}\big[A_{\pi}(s, a^{i_{1:m}}, a^{-i_{1:m}})\big]\Big] - \mathbb{E}_{a^{i_{1:m-1}} \sim \bar{\pi}^{i_{1:m-1}}}\Big[\mathbb{E}_{a^{-i_{1:m-1}} \sim \pi^{-i_{1:m-1}}}\big[A_{\pi}(s, a^{i_{1:m-1}}, a^{-i_{1:m-1}})\big]\Big] \\
&= \mathbb{E}_{a^{i_{1:m-1}} \sim \bar{\pi}^{i_{1:m-1}},\, a^{i_m} \sim \hat{\pi}^{i_m}}\Big[A^{i_{1:m}}_{\pi}\big(s, a^{i_{1:m}}\big)\Big] - \mathbb{E}_{a^{i_{1:m-1}} \sim \bar{\pi}^{i_{1:m-1}}}\Big[A^{i_{1:m-1}}_{\pi}\big(s, a^{i_{1:m-1}}\big)\Big] \\
&= \mathbb{E}_{a^{i_{1:m-1}} \sim \bar{\pi}^{i_{1:m-1}},\, a^{i_m} \sim \hat{\pi}^{i_m}}\Big[A^{i_{1:m}}_{\pi}\big(s, a^{i_{1:m}}\big) - A^{i_{1:m-1}}_{\pi}\big(s, a^{i_{1:m-1}}\big)\Big],
\end{align*}
which, by Lemma 4, equals
\[
\mathbb{E}_{a^{i_{1:m-1}} \sim \bar{\pi}^{i_{1:m-1}},\, a^{i_m} \sim \hat{\pi}^{i_m}}\Big[A^{i_m}_{\pi}\big(s, a^{i_{1:m-1}}, a^{i_m}\big)\Big].
\]
C.2 Derivation of the gradient estimator for HATRPO
\begin{align*}
&\nabla_{\theta^{i_m}} \mathbb{E}_{s \sim \rho_{\pi_{\theta_k}},\, a \sim \pi_{\theta_k}}\Bigg[\bigg(\frac{\pi^{i_m}_{\theta^{i_m}}(a^{i_m}|s)}{\pi^{i_m}_{\theta^{i_m}_k}(a^{i_m}|s)} - 1\bigg) M^{i_{1:m}}(s, a)\Bigg] \\
&= \nabla_{\theta^{i_m}} \mathbb{E}_{s \sim \rho_{\pi_{\theta_k}},\, a \sim \pi_{\theta_k}}\Bigg[\frac{\pi^{i_m}_{\theta^{i_m}}(a^{i_m}|s)}{\pi^{i_m}_{\theta^{i_m}_k}(a^{i_m}|s)}\, M^{i_{1:m}}(s, a)\Bigg] - \nabla_{\theta^{i_m}} \mathbb{E}_{s \sim \rho_{\pi_{\theta_k}},\, a \sim \pi_{\theta_k}}\Big[M^{i_{1:m}}(s, a)\Big] \\
&= \mathbb{E}_{s \sim \rho_{\pi_{\theta_k}},\, a \sim \pi_{\theta_k}}\Bigg[\frac{\nabla_{\theta^{i_m}} \pi^{i_m}_{\theta^{i_m}}(a^{i_m}|s)}{\pi^{i_m}_{\theta^{i_m}_k}(a^{i_m}|s)}\, M^{i_{1:m}}(s, a)\Bigg] \\
&= \mathbb{E}_{s \sim \rho_{\pi_{\theta_k}},\, a \sim \pi_{\theta_k}}\Bigg[\frac{\pi^{i_m}_{\theta^{i_m}}(a^{i_m}|s)}{\pi^{i_m}_{\theta^{i_m}_k}(a^{i_m}|s)}\, \nabla_{\theta^{i_m}} \log \pi^{i_m}_{\theta^{i_m}}(a^{i_m}|s)\, M^{i_{1:m}}(s, a)\Bigg].
\end{align*}
Evaluated at $\theta^{i_m} = \theta^{i_m}_k$, the above expression equals
\[
\mathbb{E}_{s \sim \rho_{\pi_{\theta_k}},\, a \sim \pi_{\theta_k}}\Big[M^{i_{1:m}}(s, a)\, \nabla_{\theta^{i_m}} \log \pi^{i_m}_{\theta^{i_m}}(a^{i_m}|s)\big|_{\theta^{i_m} = \theta^{i_m}_k}\Big],
\]
which finishes the derivation.
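In an automatic-differentiation framework, the derivation above says that differentiating the ratio-weighted surrogate at $\theta^{i_m} = \theta^{i_m}_k$ gives the same gradient as the log-probability estimator used in Algorithm 3. A minimal PyTorch check with a toy categorical policy (hypothetical shapes; not our released code):

import torch

torch.manual_seed(0)
logits = torch.randn(5, 3, requires_grad=True)      # toy policy parameters theta^{i_m}
actions = torch.randint(0, 3, (5,))
M = torch.randn(5)                                   # M^{i_{1:m}}(s, a), treated as a constant

log_prob = torch.log_softmax(logits, dim=-1)[torch.arange(5), actions]
old_log_prob = log_prob.detach()                     # theta_k is the current parameter value

ratio_objective = ((log_prob - old_log_prob).exp() * M).mean()
grad_ratio = torch.autograd.grad(ratio_objective, logits, retain_graph=True)[0]

score_objective = (log_prob * M).mean()
grad_score = torch.autograd.grad(score_objective, logits)[0]

assert torch.allclose(grad_ratio, grad_score, atol=1e-6)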
C.3 Pseudocode of HATRPO
Algorithm 3: HATRPO
Input: stepsize α, batch size B, number of: agents n, episodes K, steps per episode T, possible steps in line search L, line search acceptance threshold κ.
Initialize: Actor networks {θ^i_0, ∀i ∈ N}, global V-value network {φ_0}, replay buffer B.
for k = 0, 1, . . . , K − 1 do
    Collect a set of trajectories by running the joint policy π_{θ_k} = (π^1_{θ^1_k}, . . . , π^n_{θ^n_k}).
    Push transitions {(s_t, o^i_t, a^i_t, r_t, s_{t+1}, o^i_{t+1}), ∀i ∈ N, t ≤ T} into B.
    Sample a random minibatch of B transitions from B.
    Compute the advantage function Â(s, a) based on the global V-value network with GAE.
    Draw a random permutation of agents i_{1:n}.
    Set M^{i_1}(s, a) = Â(s, a).
    for agent i_m = i_1, . . . , i_n do
        Estimate the gradient of the agent's maximisation objective
            ĝ^{i_m}_k = (1/B) Σ_{b=1}^{B} Σ_{t=1}^{T} ∇_{θ^{i_m}_k} log π^{i_m}_{θ^{i_m}_k}(a^{i_m}_t | o^{i_m}_t) M^{i_{1:m}}(s_t, a_t).
        Use the conjugate gradient algorithm to compute the update direction
            x̂^{i_m}_k ≈ (Ĥ^{i_m}_k)^{−1} ĝ^{i_m}_k,
        where Ĥ^{i_m}_k is the Hessian of the average KL-divergence
            (1/(BT)) Σ_{b=1}^{B} Σ_{t=1}^{T} D_KL(π^{i_m}_{θ^{i_m}_k}(·|o^{i_m}_t), π^{i_m}_{θ^{i_m}}(·|o^{i_m}_t)).
        Estimate the maximal step size allowing for meeting the KL-constraint
            β̂^{i_m}_k ≈ sqrt( 2δ / ((x̂^{i_m}_k)^T Ĥ^{i_m}_k x̂^{i_m}_k) ).
        Update agent i_m's policy by
            θ^{i_m}_{k+1} = θ^{i_m}_k + α^j β̂^{i_m}_k x̂^{i_m}_k,
        where j ∈ {0, 1, . . . , L} is the smallest j which improves the sample loss by at least
        κ α^j β̂^{i_m}_k x̂^{i_m}_k · ĝ^{i_m}_k, found by the backtracking line search.
        Compute M^{i_{1:m+1}}(s, a) = (π^{i_m}_{θ^{i_m}_{k+1}}(a^{i_m}|o^{i_m}) / π^{i_m}_{θ^{i_m}_k}(a^{i_m}|o^{i_m})) M^{i_{1:m}}(s, a).   // Unless m = n.
    Update the V-value network by the following formula:
        φ_{k+1} = arg min_φ (1/(BT)) Σ_{b=1}^{B} Σ_{t=0}^{T} ( V_φ(s_t) − R̂_t )^2
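As a concrete illustration of the backtracking line search used in Algorithm 3, here is a minimal standalone Python sketch (the callables surrogate_loss and mean_kl are hypothetical placeholders for quantities estimated on the sampled minibatch; our released implementation organises these steps differently):

import numpy as np

def backtracking_line_search(theta, step_dir, max_coef, grad, surrogate_loss,
                             mean_kl, delta, alpha=0.5, L=10, kappa=0.5):
    """Return theta + alpha^j * max_coef * step_dir for the smallest acceptable j.

    A step is accepted if it improves the surrogate by at least
    kappa * alpha^j * max_coef * <step_dir, grad> and satisfies the KL constraint.
    """
    old_loss = surrogate_loss(theta)
    for j in range(L + 1):
        coef = (alpha ** j) * max_coef
        theta_new = theta + coef * step_dir
        expected_improve = kappa * coef * np.dot(step_dir, grad)
        actual_improve = surrogate_loss(theta_new) - old_loss
        if actual_improve >= expected_improve and mean_kl(theta, theta_new) <= delta:
            return theta_new
    return theta        # no acceptable step found; keep the old parameters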
C.4 Pseudocode of HAPPO
Algorithm 4: HAPPO
Input: stepsize α, batch size B, number of: agents n, episodes K, steps per episode T.
Initialize: Actor networks {θ^i_0, ∀i ∈ N}, global V-value network {φ_0}, replay buffer B.
for k = 0, 1, . . . , K − 1 do
    Collect a set of trajectories by running the joint policy π_{θ_k} = (π^1_{θ^1_k}, . . . , π^n_{θ^n_k}).
    Push transitions {(s_t, o^i_t, a^i_t, r_t, s_{t+1}, o^i_{t+1}), ∀i ∈ N, t ≤ T} into B.
    Sample a random minibatch of B transitions from B.
    Compute the advantage function Â(s, a) based on the global V-value network with GAE.
    Draw a random permutation of agents i_{1:n}.
    Set M^{i_1}(s, a) = Â(s, a).
    for agent i_m = i_1, . . . , i_n do
        Update actor i_m with θ^{i_m}_{k+1}, the argmax of the PPO-Clip objective
            (1/(BT)) Σ_{b=1}^{B} Σ_{t=0}^{T} min( r^{i_m}_t(θ^{i_m}) M^{i_{1:m}}(s_t, a_t), clip(r^{i_m}_t(θ^{i_m}), 1 ± ε) M^{i_{1:m}}(s_t, a_t) ),
            where r^{i_m}_t(θ^{i_m}) = π^{i_m}_{θ^{i_m}}(a^{i_m}_t|o^{i_m}_t) / π^{i_m}_{θ^{i_m}_k}(a^{i_m}_t|o^{i_m}_t).
        Compute M^{i_{1:m+1}}(s, a) = (π^{i_m}_{θ^{i_m}_{k+1}}(a^{i_m}|o^{i_m}) / π^{i_m}_{θ^{i_m}_k}(a^{i_m}|o^{i_m})) M^{i_{1:m}}(s, a).   // Unless m = n.
    Update the V-value network by the following formula:
        φ_{k+1} = arg min_φ (1/(BT)) Σ_{b=1}^{B} Σ_{t=0}^{T} ( V_φ(s_t) − R̂_t )^2
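To illustrate how Algorithm 4 differs from standard PPO only through the factor M^{i_{1:m}}, the following minimal PyTorch sketch (tensor names are hypothetical; a full implementation additionally handles recurrent states, masks, and mini-batching) computes the per-agent loss and the updated factor:

import torch

def happo_actor_loss(log_prob, old_log_prob, m_factor, clip_eps=0.2):
    """PPO-Clip loss for agent i_m, with the advantage replaced by M^{i_{1:m}}.

    log_prob:     log pi^{i_m}_{theta}(a^{i_m}_t | o^{i_m}_t), requires grad
    old_log_prob: log pi^{i_m}_{theta_k}(a^{i_m}_t | o^{i_m}_t), detached
    m_factor:     M^{i_{1:m}}(s_t, a_t), detached
    """
    ratio = (log_prob - old_log_prob).exp()
    unclipped = ratio * m_factor
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * m_factor
    return -torch.min(unclipped, clipped).mean()   # minimise the negative objective

def update_m_factor(m_factor, new_log_prob, old_log_prob):
    """M^{i_{1:m+1}} = (pi_new / pi_old) * M^{i_{1:m}}, computed without gradients."""
    with torch.no_grad():
        return (new_log_prob - old_log_prob).exp() * m_factor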
Appendix D. Proof of HAMO Is All You Need Lemma
Lemma 22 (HAMO Is All You Need) Let $\pi_{\text{old}}$ and $\pi_{\text{new}}$ be joint policies and let $i_{1:n} \in \mathrm{Sym}(n)$ be an agent permutation. Suppose that, for every state $s \in \mathcal{S}$ and every $m = 1, \dots, n$,
\[
\Big[\mathcal{M}^{(\pi^{i_m}_{\text{new}})}_{\mathcal{D}^{i_m},\, \pi^{i_{1:m-1}}_{\text{new}}} A_{\pi_{\text{old}}}\Big](s) \geq \Big[\mathcal{M}^{(\pi^{i_m}_{\text{old}})}_{\mathcal{D}^{i_m},\, \pi^{i_{1:m-1}}_{\text{new}}} A_{\pi_{\text{old}}}\Big](s). \tag{12}
\]
Then, $\pi_{\text{new}}$ is jointly better than $\pi_{\text{old}}$, so that for every state $s$,
\[
V_{\pi_{\text{new}}}(s) \geq V_{\pi_{\text{old}}}(s).
\]
Proof Let $\widetilde{\mathcal{D}}_{\pi_{\text{old}}}(\pi_{\text{new}}|s) \triangleq \sum_{m=1}^{n} \mathcal{D}^{i_m}_{\pi_{\text{old}}}\big(\pi^{i_m}_{\text{new}} \,\big|\, s, \pi^{i_{1:m-1}}_{\text{new}}\big)$. Combining this with Lemma 4 gives
\begin{align*}
&\mathbb{E}_{a \sim \pi_{\text{new}}}\big[A_{\pi_{\text{old}}}(s, a)\big] - \widetilde{\mathcal{D}}_{\pi_{\text{old}}}(\pi_{\text{new}}|s) \\
&= \sum_{m=1}^{n} \Big( \mathbb{E}_{a^{i_{1:m-1}} \sim \pi^{i_{1:m-1}}_{\text{new}},\, a^{i_m} \sim \pi^{i_m}_{\text{new}}}\big[A^{i_m}_{\pi_{\text{old}}}(s, a^{i_{1:m-1}}, a^{i_m})\big] - \mathcal{D}^{i_m}_{\pi_{\text{old}}}\big(\pi^{i_m}_{\text{new}} \,\big|\, s, \pi^{i_{1:m-1}}_{\text{new}}\big) \Big) \\
\intertext{by Inequality (12)}
&\geq \sum_{m=1}^{n} \Big( \mathbb{E}_{a^{i_{1:m-1}} \sim \pi^{i_{1:m-1}}_{\text{new}},\, a^{i_m} \sim \pi^{i_m}_{\text{old}}}\big[A^{i_m}_{\pi_{\text{old}}}(s, a^{i_{1:m-1}}, a^{i_m})\big] - \mathcal{D}^{i_m}_{\pi_{\text{old}}}\big(\pi^{i_m}_{\text{old}} \,\big|\, s, \pi^{i_{1:m-1}}_{\text{new}}\big) \Big) \\
&= \mathbb{E}_{a \sim \pi_{\text{old}}}\big[A_{\pi_{\text{old}}}(s, a)\big] - \widetilde{\mathcal{D}}_{\pi_{\text{old}}}(\pi_{\text{old}}|s).
\end{align*}
The resulting inequality can be equivalently rewritten as
\[
\mathbb{E}_{a \sim \pi_{\text{new}}}\big[Q_{\pi_{\text{old}}}(s, a)\big] - \widetilde{\mathcal{D}}_{\pi_{\text{old}}}(\pi_{\text{new}}|s) \geq \mathbb{E}_{a \sim \pi_{\text{old}}}\big[Q_{\pi_{\text{old}}}(s, a)\big] - \widetilde{\mathcal{D}}_{\pi_{\text{old}}}(\pi_{\text{old}}|s), \quad \forall s \in \mathcal{S}. \tag{23}
\]
We use it to prove the claim as follows,
\begin{align*}
V_{\pi_{\text{new}}}(s) &= \mathbb{E}_{a \sim \pi_{\text{new}}}\big[Q_{\pi_{\text{new}}}(s, a)\big] \\
&= \mathbb{E}_{a \sim \pi_{\text{new}}}\big[Q_{\pi_{\text{old}}}(s, a)\big] - \widetilde{\mathcal{D}}_{\pi_{\text{old}}}(\pi_{\text{new}}|s) + \widetilde{\mathcal{D}}_{\pi_{\text{old}}}(\pi_{\text{new}}|s) + \mathbb{E}_{a \sim \pi_{\text{new}}}\big[Q_{\pi_{\text{new}}}(s, a) - Q_{\pi_{\text{old}}}(s, a)\big], \\
\intertext{by Inequality (23)}
&\geq \mathbb{E}_{a \sim \pi_{\text{old}}}\big[Q_{\pi_{\text{old}}}(s, a)\big] - \widetilde{\mathcal{D}}_{\pi_{\text{old}}}(\pi_{\text{old}}|s) + \widetilde{\mathcal{D}}_{\pi_{\text{old}}}(\pi_{\text{new}}|s) + \mathbb{E}_{a \sim \pi_{\text{new}}}\big[Q_{\pi_{\text{new}}}(s, a) - Q_{\pi_{\text{old}}}(s, a)\big] \\
&= V_{\pi_{\text{old}}}(s) + \widetilde{\mathcal{D}}_{\pi_{\text{old}}}(\pi_{\text{new}}|s) + \mathbb{E}_{a \sim \pi_{\text{new}}}\big[Q_{\pi_{\text{new}}}(s, a) - Q_{\pi_{\text{old}}}(s, a)\big] \\
&= V_{\pi_{\text{old}}}(s) + \widetilde{\mathcal{D}}_{\pi_{\text{old}}}(\pi_{\text{new}}|s) + \mathbb{E}_{a \sim \pi_{\text{new}},\, s' \sim P}\big[ r(s, a) + \gamma V_{\pi_{\text{new}}}(s') - r(s, a) - \gamma V_{\pi_{\text{old}}}(s') \big] \\
&= V_{\pi_{\text{old}}}(s) + \widetilde{\mathcal{D}}_{\pi_{\text{old}}}(\pi_{\text{new}}|s) + \gamma\, \mathbb{E}_{a \sim \pi_{\text{new}},\, s' \sim P}\big[ V_{\pi_{\text{new}}}(s') - V_{\pi_{\text{old}}}(s') \big] \\
&\geq V_{\pi_{\text{old}}}(s) + \gamma \inf_{s'}\big[ V_{\pi_{\text{new}}}(s') - V_{\pi_{\text{old}}}(s') \big].
\end{align*}
Hence $V_{\pi_{\text{new}}}(s) - V_{\pi_{\text{old}}}(s) \geq \gamma \inf_{s'}\big[V_{\pi_{\text{new}}}(s') - V_{\pi_{\text{old}}}(s')\big]$. Taking the infimum over $s$ and simplifying,
\[
(1 - \gamma) \inf_{s}\big[V_{\pi_{\text{new}}}(s) - V_{\pi_{\text{old}}}(s)\big] \geq 0.
\]
Therefore, $\inf_{s}\big[V_{\pi_{\text{new}}}(s) - V_{\pi_{\text{old}}}(s)\big] \geq 0$, which proves the lemma.
Appendix E. Proof of Theorem 14
Lemma 23 Suppose an agent $i_m$ maximises the expected HAMO,
\[
\pi^{i_m}_{\text{new}} = \arg\max_{\pi^{i_m} \in \mathcal{U}^{i_m}_{\pi_{\text{old}}}(\pi^{i_m}_{\text{old}})} \mathbb{E}_{s \sim \beta_{\pi_{\text{old}}}}\Big[\big(\mathcal{M}^{(\pi^{i_m})}_{\mathcal{D}^{i_m},\, \pi^{i_{1:m-1}}_{\text{new}}} A_{\pi_{\text{old}}}\big)(s)\Big]. \tag{24}
\]
Then, for every state $s \in \mathcal{S}$,
\[
\big(\mathcal{M}^{(\pi^{i_m}_{\text{new}})}_{\mathcal{D}^{i_m},\, \pi^{i_{1:m-1}}_{\text{new}}} A_{\pi_{\text{old}}}\big)(s) \geq \big(\mathcal{M}^{(\pi^{i_m}_{\text{old}})}_{\mathcal{D}^{i_m},\, \pi^{i_{1:m-1}}_{\text{new}}} A_{\pi_{\text{old}}}\big)(s).
\]
Proof We prove this statement by contradiction. Suppose that there exists $s_0 \in \mathcal{S}$ such that
\[
\big(\mathcal{M}^{(\pi^{i_m}_{\text{new}})}_{\mathcal{D}^{i_m},\, \pi^{i_{1:m-1}}_{\text{new}}} A_{\pi_{\text{old}}}\big)(s_0) < \big(\mathcal{M}^{(\pi^{i_m}_{\text{old}})}_{\mathcal{D}^{i_m},\, \pi^{i_{1:m-1}}_{\text{new}}} A_{\pi_{\text{old}}}\big)(s_0). \tag{25}
\]
Let us define the following policy $\hat{\pi}^{i_m}$:
\[
\hat{\pi}^{i_m}(\cdot^{i_m}|s) =
\begin{cases}
\pi^{i_m}_{\text{old}}(\cdot^{i_m}|s), & \text{at } s = s_0, \\
\pi^{i_m}_{\text{new}}(\cdot^{i_m}|s), & \text{at } s \neq s_0.
\end{cases}
\]
Note that $\hat{\pi}^{i_m}$ is (weakly) closer to $\pi^{i_m}_{\text{old}}$ than $\pi^{i_m}_{\text{new}}$ at $s_0$, and at the same distance at other states. Together with $\pi^{i_m}_{\text{new}} \in \mathcal{U}^{i_m}_{\pi_{\text{old}}}(\pi^{i_m}_{\text{old}})$, this implies that $\hat{\pi}^{i_m} \in \mathcal{U}^{i_m}_{\pi_{\text{old}}}(\pi^{i_m}_{\text{old}})$. Further,
\begin{align*}
&\mathbb{E}_{s \sim \beta_{\pi_{\text{old}}}}\Big[\big(\mathcal{M}^{(\hat{\pi}^{i_m})}_{\mathcal{D}^{i_m},\, \pi^{i_{1:m-1}}_{\text{new}}} A_{\pi_{\text{old}}}\big)(s)\Big] - \mathbb{E}_{s \sim \beta_{\pi_{\text{old}}}}\Big[\big(\mathcal{M}^{(\pi^{i_m}_{\text{new}})}_{\mathcal{D}^{i_m},\, \pi^{i_{1:m-1}}_{\text{new}}} A_{\pi_{\text{old}}}\big)(s)\Big] \\
&= \beta_{\pi_{\text{old}}}(s_0)\Big[\big(\mathcal{M}^{(\pi^{i_m}_{\text{old}})}_{\mathcal{D}^{i_m},\, \pi^{i_{1:m-1}}_{\text{new}}} A_{\pi_{\text{old}}}\big)(s_0) - \big(\mathcal{M}^{(\pi^{i_m}_{\text{new}})}_{\mathcal{D}^{i_m},\, \pi^{i_{1:m-1}}_{\text{new}}} A_{\pi_{\text{old}}}\big)(s_0)\Big] > 0.
\end{align*}
The above contradicts $\pi^{i_m}_{\text{new}}$ being the argmax in Equation (24), as $\hat{\pi}^{i_m}$ is strictly better. The contradiction finishes the proof.
Theorem 14 (The Fundamental Theorem of Heterogeneous-Agent Mirror Learning) Let, for every agent $i \in \mathcal{N}$, $\mathcal{D}^i$ be a HADF, $\mathcal{U}^i$ be a neighbourhood operator, and let the sampling distributions $\beta_{\pi}$ depend continuously on $\pi$. Let $\pi_0 \in \Pi$, and the sequence of joint policies $(\pi_k)_{k=0}^{\infty}$ be obtained by a HAML algorithm induced by $\mathcal{D}^i$, $\mathcal{U}^i$, $\forall i \in \mathcal{N}$, and $\beta_{\pi}$. Then, the joint policies induced by the algorithm enjoy the following list of properties:
1. Attain the monotonic improvement property, $J(\pi_{k+1}) \geq J(\pi_k)$;
2. Their value functions converge to a Nash value function $V^{\mathrm{NE}}$: $\lim_{k \to \infty} V_{\pi_k} = V^{\mathrm{NE}}$;
3. Their expected returns converge to a Nash return: $\lim_{k \to \infty} J(\pi_k) = J^{\mathrm{NE}}$;
4. Their $\omega$-limit set consists of Nash equilibria.
44
Heterogeneous-Agent Reinforcement Learning
Proof
Proof of Property 1. It follows from combining Lemmas 13 & 23.
Proof of Properties 2, 3 & 4.
Step 1: convergence of the value function. By Lemma 13, we have that V
π
k
(s)
V
π
k+1
(s), s S, and that the value function is upper-bounded by V
max
. Hence, the
sequence of value functions (V
π
k
)
kN
converges. We denote its limit by V .
Step 2: characterisation of limit points. As the joint policy space Π is bounded, by
Bolzano-Weierstrass theorem, we know that the sequence (π
k
)
kN
has a convergent subse-
quence. Therefore, it has at least one limit point policy. Let ¯π be such a limit point. We
introduce an auxiliary notation: for a joint policy π and a permutation i
1:n
, let HU(π, i
1:n
)
be a joint policy obtained by a HAML update from π along the permutation i
1:n
.
Claim: For any permutation z
1:n
Sym(n),
¯π = HU(¯π, z
1:n
). (26)
Proof of Claim. Let
ˆ
π = HU(¯π, z
1:n
) 6= ¯π and (π
k
r
)
rN
be a subsequence converging to
¯π. Let us recall that the limit value function is unique and denoted as V . Writing E
i
0:
1:n
[·]
for the expectation operator under the stochastic process (i
k
1:n
)
kN
of update orders, for a
state s S, we have
0 = lim
r→∞
E
i
0:
1:n
V
π
k
r
+1
(s) V
π
k
r
(s)
as every choice of permutation improves the value function
lim
r→∞
P(i
k
r
1:n
= z
1:n
)
V
HU(π
k
r
,z
1:n
)
(s) V
π
k
r
(s)
= p(z
1:n
) lim
r→∞
V
HU(π
k
r
,z
1:n
)
(s) V
π
k
r
(s)
.
By the continuity of the expected HAMO , we obtain that the first component of HU(π
k
r
, z
1:n
),
which is π
z
1
k
r
+1
, is continuous in π
k
r
by Berge’s Maximum Theorem (Ausubel and Deneckere,
1993). Applying this argument recursively for z
2
, . . . , z
n
, we have that HU(π
k
r
, z
1:n
) is con-
tinuous in π
k
r
. Hence, as π
k
r
converges to ¯π, its HU converges to the HU of ¯π, which is
ˆ
π.
Hence, we continue writing the above derivation as
= p(z
1:n
)
V
ˆ
π
(s) V
¯π
(s)
0, by Lemma 13.
As s was arbitrary, the state-value function of
ˆ
π is the same as that of π: V
ˆ
π
= V
π
, by the
Bellman equation (Sutton and Barto, 2018): Q(s, a) = r(s, a) + γEV (s
0
), this also implies
that their state-value and advantage functions are the same: Q
ˆ
π
= Q
¯π
and A
ˆ
π
= A
¯π
. Let
m be the smallest integer such that ˆπ
z
m
6= ¯π
z
m
. This means that ˆπ
z
m
achieves a greater
45
Zhong, Kuba, Feng, Hu, Ji, and Yang
expected HAMO than ¯π
z
m
, for which it is zero. Hence,
0 < E
sβ
π
h
M
(ˆπ
z
m
)
D
z
m
,¯π
z
1:m1
A
¯π
(s)
i
= E
sβ
π
h
E
a
z
1:m
¯π
z
1:m1
,a
z
m
ˆπ
z
m
A
z
m
¯π
(s, a
z
1:m1
, a
z
m
)
D
z
m
π
(ˆπ
z
m
|s, ¯π
z
1:m1
)
i
= E
sβ
π
h
E
a
z
1:m
¯π
z
1:m1
,a
z
m
ˆπ
z
m
A
z
m
ˆ
π
(s, a
z
1:m1
, a
z
m
)
D
z
m
π
(ˆπ
z
m
|s, ¯π
z
1:m1
)
i
and as the expected value of the multi-agent advantage function is zero
= E
sβ
π
h
D
z
m
π
(ˆπ
z
m
|s, ¯π
z
1:m1
)
i
0.
This is a contradiction, and so the claim in Equation (26) is proved, and the Step 2 is
finished.
Step 3: dropping the HADF. Consider an arbitrary limit point joint policy $\bar{\boldsymbol{\pi}}$. By Step 2, for any permutation $i_{1:n}$, considering the first component of the HU,
$$\bar{\pi}^{i_1} = \arg\max_{\pi^{i_1} \in \mathcal{U}^{i_1}_{\bar{\boldsymbol{\pi}}}(\bar{\pi}^{i_1})} \mathbb{E}_{s\sim\beta_{\bar{\boldsymbol{\pi}}}}\Big[ \big[\mathrm{M}^{(\pi^{i_1})}_{\mathfrak{D}^{i_1}} A_{\bar{\boldsymbol{\pi}}}\big](s) \Big] \tag{27}$$
$$= \arg\max_{\pi^{i_1} \in \mathcal{U}^{i_1}_{\bar{\boldsymbol{\pi}}}(\bar{\pi}^{i_1})} \mathbb{E}_{s\sim\beta_{\bar{\boldsymbol{\pi}}}}\Big[ \mathbb{E}_{a^{i_1}\sim\pi^{i_1}}\big[ A^{i_1}_{\bar{\boldsymbol{\pi}}}(s, a^{i_1}) \big] - \mathfrak{D}^{i_1}_{\bar{\boldsymbol{\pi}}}\big(\pi^{i_1} \,\big|\, s\big) \Big].$$
As the HADF is non-negative, and at $\pi^{i_1} = \bar{\pi}^{i_1}$ its value and all of its Gâteaux derivatives are zero, it follows by Step 3 of Theorem 1 of Kuba et al. (2022b) that for every $s \in \mathcal{S}$,
$$\bar{\pi}^{i_1}(\cdot^{i_1}|s) = \arg\max_{\pi^{i_1} \in \mathcal{P}(\mathcal{A}^{i_1})} \mathbb{E}_{a^{i_1}\sim\pi^{i_1}}\big[ Q^{i_1}_{\bar{\boldsymbol{\pi}}}(s, a^{i_1}) \big].$$
Step 4: Nash equilibrium. We have proved that $\bar{\boldsymbol{\pi}}$ satisfies
$$\bar{\pi}^{i}(\cdot^{i}|s) = \arg\max_{\pi^{i}(\cdot^{i}|s) \in \mathcal{P}(\mathcal{A}^{i})} \mathbb{E}_{a^{i}\sim\pi^{i}}\big[ Q^{i}_{\bar{\boldsymbol{\pi}}}(s, a^{i}) \big] = \arg\max_{\pi^{i}(\cdot^{i}|s) \in \mathcal{P}(\mathcal{A}^{i})} \mathbb{E}_{a^{i}\sim\pi^{i},\, a^{-i}\sim\bar{\boldsymbol{\pi}}^{-i}}\big[ Q_{\bar{\boldsymbol{\pi}}}(s, \boldsymbol{a}) \big], \quad \forall i \in \mathcal{N},\ \forall s \in \mathcal{S}.$$
Hence, by considering $\bar{\boldsymbol{\pi}}^{-i}$ fixed, we see that $\bar{\pi}^{i}$ satisfies the condition for the optimal policy (Sutton and Barto, 2018), and hence
$$\bar{\pi}^{i} = \arg\max_{\pi^{i} \in \Pi^{i}} J(\pi^{i}, \bar{\boldsymbol{\pi}}^{-i}).$$
Thus, $\bar{\boldsymbol{\pi}}$ is a Nash equilibrium. Lastly, this implies that its value function is a Nash value function $V^{\text{NE}}$, and its expected return is a Nash return $J^{\text{NE}}$.
Appendix F. Casting HAPPO as HAML
The maximisation objective of agent $i_m$ in HAPPO is
$$\mathbb{E}_{s\sim\rho_{\boldsymbol{\pi}_{\text{old}}},\, a^{i_{1:m-1}}\sim\boldsymbol{\pi}^{i_{1:m-1}}_{\text{new}},\, a^{i_m}\sim\pi^{i_m}_{\text{old}}}\Big[ \min\Big( r(\bar{\pi}^{i_m}) A^{i_{1:m}}_{\boldsymbol{\pi}_{\text{old}}}(s, a^{i_{1:m}}),\ \mathrm{clip}\big( r(\bar{\pi}^{i_m}), 1 \pm \epsilon \big) A^{i_{1:m}}_{\boldsymbol{\pi}_{\text{old}}}(s, a^{i_{1:m}}) \Big) \Big].$$
Fixing $s$ and $a^{i_{1:m-1}}$, we can rewrite it as
$$\mathbb{E}_{a^{i_m}\sim\bar{\pi}^{i_m}}\Big[ A^{i_{1:m}}_{\boldsymbol{\pi}_{\text{old}}}(s, a^{i_{1:m-1}}, a^{i_m}) \Big] - \mathbb{E}_{a^{i_m}\sim\pi^{i_m}_{\text{old}}}\Big[ r(\bar{\pi}^{i_m}) A^{i_{1:m}}_{\boldsymbol{\pi}_{\text{old}}}(s, a^{i_{1:m-1}}, a^{i_m}) - \min\Big( r(\bar{\pi}^{i_m}) A^{i_{1:m}}_{\boldsymbol{\pi}_{\text{old}}}(s, a^{i_{1:m-1}}, a^{i_m}),\ \mathrm{clip}\big( r(\bar{\pi}^{i_m}), 1 \pm \epsilon \big) A^{i_{1:m}}_{\boldsymbol{\pi}_{\text{old}}}(s, a^{i_{1:m-1}}, a^{i_m}) \Big) \Big].$$
By the multi-agent advantage decomposition,
$$\mathbb{E}_{a^{i_m}\sim\bar{\pi}^{i_m}}\Big[ A^{i_{1:m}}_{\boldsymbol{\pi}_{\text{old}}}(s, a^{i_{1:m-1}}, a^{i_m}) \Big] = A^{i_{1:m-1}}_{\boldsymbol{\pi}_{\text{old}}}(s, a^{i_{1:m-1}}) + \mathbb{E}_{a^{i_m}\sim\bar{\pi}^{i_m}}\Big[ A^{i_m}_{\boldsymbol{\pi}_{\text{old}}}(s, a^{i_{1:m-1}}, a^{i_m}) \Big].$$
Hence, the presence of the joint advantage of agents $i_{1:m}$ is equivalent to the multi-agent advantage of $i_m$ given $a^{i_{1:m-1}}$ that appears in HAMO, since the term $A^{i_{1:m-1}}_{\boldsymbol{\pi}_{\text{old}}}(s, a^{i_{1:m-1}})$ cancels out with the $-1 \cdot M^{i_{1:m}}(s, \boldsymbol{a})$ term of Equation 10, which we drop due to its zero gradient. Hence, we only need to show that the subtracted term is an HADF. Firstly, we change $\min$ into $\max$ with the identity $\min f(x) = -\max[-f(x)]$.
$$\mathbb{E}_{a^{i_m}\sim\pi^{i_m}_{\text{old}}}\Big[ r(\bar{\pi}^{i_m}) A^{i_{1:m}}_{\boldsymbol{\pi}_{\text{old}}}(s, a^{i_{1:m-1}}, a^{i_m}) + \max\Big( -r(\bar{\pi}^{i_m}) A^{i_{1:m}}_{\boldsymbol{\pi}_{\text{old}}}(s, a^{i_{1:m-1}}, a^{i_m}),\ -\mathrm{clip}\big( r(\bar{\pi}^{i_m}), 1 \pm \epsilon \big) A^{i_{1:m}}_{\boldsymbol{\pi}_{\text{old}}}(s, a^{i_{1:m-1}}, a^{i_m}) \Big) \Big]$$
which we then simplify
$$= \mathbb{E}_{a^{i_m}\sim\pi^{i_m}_{\text{old}}}\Big[ \max\Big( 0,\ \big( r(\bar{\pi}^{i_m}) - \mathrm{clip}\big( r(\bar{\pi}^{i_m}), 1 \pm \epsilon \big) \big) A^{i_{1:m}}_{\boldsymbol{\pi}_{\text{old}}}(s, a^{i_{1:m-1}}, a^{i_m}) \Big) \Big]$$
$$= \mathbb{E}_{a^{i_m}\sim\pi^{i_m}_{\text{old}}}\Big[ \mathrm{ReLU}\Big( \big( r(\bar{\pi}^{i_m}) - \mathrm{clip}\big( r(\bar{\pi}^{i_m}), 1 \pm \epsilon \big) \big) A^{i_{1:m}}_{\boldsymbol{\pi}_{\text{old}}}(s, a^{i_{1:m-1}}, a^{i_m}) \Big) \Big].$$
As discussed in the main body of the paper, this is an HADF.
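The identity used in this derivation can also be checked numerically. Below is a small, self-contained sketch (not taken from the paper's codebase) that verifies $\min(rA, \mathrm{clip}(r, 1\pm\epsilon)A) = rA - \mathrm{ReLU}\big((r - \mathrm{clip}(r, 1\pm\epsilon))A\big)$ on random ratios and advantages; the sample size, seed, and value ranges are arbitrary.

    # Numerical check (illustrative only) of the clipped-objective / ReLU-drift identity above.
    import numpy as np

    rng = np.random.default_rng(0)
    eps = 0.2
    r = rng.uniform(0.5, 1.5, size=10_000)        # probability ratios r(pi)
    A = rng.normal(size=10_000)                   # advantage estimates

    clipped = np.clip(r, 1 - eps, 1 + eps)
    lhs = np.minimum(r * A, clipped * A)          # the clipped surrogate objective
    drift = np.maximum(0.0, (r - clipped) * A)    # the ReLU term, non-negative by construction
    rhs = r * A - drift

    assert np.allclose(lhs, rhs)
    print("identity holds; mean drift =", drift.mean())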
Appendix G. Algorithms
Algorithm 5: HAA2C
Input: stepsize $\alpha$, batch size $B$, number of: agents $n$, episodes $K$, steps per episode $T$, mini-epochs $e$;
Initialize: the critic network $\phi$, the policy networks $\{\theta^i\}_{i\in\mathcal{N}}$, replay buffer $\mathcal{B}$;
for $k = 0, 1, \dots, K-1$ do
    Collect a set of trajectories by letting the agents act according to their policies, $a^i \sim \pi^i_{\theta^i}(\cdot^i \,|\, o^i)$;
    Push transitions $\{(s_t, o^i_t, a^i_t, r_t, s_{t+1}, o^i_{t+1}),\ \forall i \in \mathcal{N},\ t \in T\}$ into $\mathcal{B}$;
    Sample a random minibatch of $B$ transitions from $\mathcal{B}$;
    Estimate the returns $R$ and the advantage function, $\hat{A}(s, \boldsymbol{a})$, using $\hat{V}_\phi$ and GAE;
    Draw a permutation of agents $i_{1:n}$ at random;
    Set $M^{i_1}(s, \boldsymbol{a}) = \hat{A}(s, \boldsymbol{a})$;
    for agent $i_m = i_1, \dots, i_n$ do
        Set $\pi^{i_m}_0(a^{i_m}|o^{i_m}) = \pi^{i_m}_{\theta^{i_m}}(a^{i_m}|o^{i_m})$;
        for mini-epoch $= 1, \dots, e$ do
            Compute agent $i_m$'s policy gradient
            $g^{i_m} = \nabla_{\theta^{i_m}} \frac{1}{B} \sum_{b=1}^{B} M^{i_m}(s_b, \boldsymbol{a}_b) \frac{\pi^{i_m}_{\theta^{i_m}}(a^{i_m}_b | o^{i_m}_b)}{\pi^{i_m}_0(a^{i_m}_b | o^{i_m}_b)}$.
            Update agent $i_m$'s policy by $\theta^{i_m} = \theta^{i_m} + \alpha g^{i_m}$.
        Compute $M^{i_{m+1}}(s, \boldsymbol{a}) = \frac{\pi^{i_m}_{\theta^{i_m}}(a^{i_m}|o^{i_m})}{\pi^{i_m}_0(a^{i_m}|o^{i_m})} M^{i_m}(s, \boldsymbol{a})$ // Unless $m = n$.
    Update the critic by gradient descent on $\frac{1}{B} \sum_b \big( \hat{V}_\phi(s_b) - R_b \big)^2$.
Discard $\phi$. Deploy $\{\theta^i\}_{i\in\mathcal{N}}$ in execution;
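To make the cascading of the advantage factor in Algorithm 5 concrete, the following is a minimal PyTorch sketch of the inner sequential update; it is not the official HARL implementation, and the network sizes, the random minibatch, and the use of plain SGD on the negated objective are illustrative assumptions.

    # Sketch of the sequential HAA2C update; all data below stand in for a sampled minibatch.
    import torch
    import torch.nn as nn

    n_agents, obs_dim, act_dim, batch = 3, 8, 4, 32
    actors = [nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
              for _ in range(n_agents)]
    opts = [torch.optim.SGD(a.parameters(), lr=5e-4) for a in actors]

    obs = torch.randn(n_agents, batch, obs_dim)        # per-agent observations o^i_b
    acts = torch.randint(act_dim, (n_agents, batch))   # per-agent actions a^i_b
    advantage = torch.randn(batch)                     # stand-in for the GAE estimate \hat{A}(s, a)

    M = advantage.clone()                              # M^{i_1}(s, a) = \hat{A}(s, a)
    for m in torch.randperm(n_agents).tolist():        # random permutation i_{1:n}
        with torch.no_grad():                          # freeze pi^{i_m}_0, the pre-update policy
            logp_old = torch.log_softmax(actors[m](obs[m]), -1).gather(1, acts[m].unsqueeze(1)).squeeze(1)
        for _ in range(5):                             # e mini-epochs of gradient ascent
            logp = torch.log_softmax(actors[m](obs[m]), -1).gather(1, acts[m].unsqueeze(1)).squeeze(1)
            loss = -(M * torch.exp(logp - logp_old)).mean()   # maximise M^{i_m} * pi_theta / pi_0
            opts[m].zero_grad(); loss.backward(); opts[m].step()
        with torch.no_grad():                          # cascade the factor to the next agent in the order
            logp = torch.log_softmax(actors[m](obs[m]), -1).gather(1, acts[m].unsqueeze(1)).squeeze(1)
            M = torch.exp(logp - logp_old) * M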
Algorithm 6: HADDPG
Input: stepsize $\alpha$, Polyak coefficient $\tau$, batch size $B$, number of: agents $n$, episodes $K$, steps per episode $T$, mini-epochs $e$;
Initialize: the critic networks $\phi$ and $\hat{\phi}$ and policy networks $\{\theta^i\}_{i\in\mathcal{N}}$ and $\{\hat{\theta}^i\}_{i\in\mathcal{N}}$, replay buffer $\mathcal{B}$, random processes $\{\mathcal{X}^i\}_{i\in\mathcal{N}}$ for exploration;
for $k = 0, 1, \dots, K-1$ do
    Collect a set of transitions by letting the agents act according to their deterministic policies with the exploratory noise $a^i_t = \mu^i_{\theta^i}(o^i_t) + \mathcal{X}^i_t$.
    Push transitions $\{(s_t, o^i_t, a^i_t, r_t, s_{t+1}, o^i_{t+1}),\ \forall i \in \mathcal{N},\ t \in T\}$ into $\mathcal{B}$;
    Sample a random minibatch of $B$ transitions from $\mathcal{B}$;
    Compute the critic targets $y_t = r_t + \gamma Q_{\hat{\phi}}(s_{t+1}, \hat{\boldsymbol{a}}_{t+1})$, where $\hat{\boldsymbol{a}}_{t+1}$ is sampled by $\{\hat{\theta}^i\}_{i\in\mathcal{N}}$.
    Update the critic by minimising the loss $\phi = \arg\min_\phi \frac{1}{B} \sum_t \big( y_t - Q_\phi(s_t, \boldsymbol{a}_t) \big)^2$.
    Draw a permutation of agents $i_{1:n}$ at random;
    for agent $i_m = i_1, \dots, i_n$ do
        Update agent $i_m$ by solving
        $\theta^{i_m}_{\text{new}} = \arg\max_{\tilde{\theta}^{i_m}} \frac{1}{B} \sum_t Q_\phi\Big( s_t,\ \mu^{i_{1:m-1}}_{\theta^{i_{1:m-1}}_{\text{new}}}(o^{i_{1:m-1}}_t),\ \mu^{i_m}_{\tilde{\theta}^{i_m}}(o^{i_m}_t),\ \mu^{i_{m+1:n}}_{\theta^{i_{m+1:n}}_{\text{old}}}(o^{i_{m+1:n}}_t) \Big)$
        with $e$ mini-epochs of deterministic policy gradient ascent;
    Update the target networks smoothly
        $\hat{\phi} = \tau \phi + (1 - \tau) \hat{\phi}$,
        $\hat{\theta}^i = \tau \theta^i + (1 - \tau) \hat{\theta}^i$.
Discard $\phi$, $\hat{\phi}$, and $\hat{\theta}^i, \forall i \in \mathcal{N}$. Deploy $\theta^i, \forall i \in \mathcal{N}$ in execution.
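The following is a minimal PyTorch sketch, not the official HARL implementation, of the sequential policy update in Algorithm 6: agent $i_m$ ascends the centralised critic evaluated at the already-updated actions of its predecessors in the permutation, its own candidate action, and the old actions of the remaining agents. Network sizes, the random minibatch, and the use of Adam are illustrative assumptions, and the critic is left untrained here.

    # Sketch of the sequential HADDPG policy update step.
    import copy
    import torch
    import torch.nn as nn

    n_agents, obs_dim, act_dim, batch = 3, 8, 2, 64
    actors = [nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
              for _ in range(n_agents)]
    critic = nn.Sequential(nn.Linear(obs_dim * n_agents + act_dim * n_agents, 64),
                           nn.ReLU(), nn.Linear(64, 1))      # centralised Q_phi(s, a^{1:n})
    for p in critic.parameters():                            # the critic is held fixed during policy updates
        p.requires_grad_(False)

    state = torch.randn(batch, obs_dim * n_agents)           # stand-in for the sampled global states s_t
    obs = torch.randn(n_agents, batch, obs_dim)              # per-agent observations o^i_t

    old_actors = [copy.deepcopy(a) for a in actors]          # mu_old, used for not-yet-updated agents
    order = torch.randperm(n_agents).tolist()                # random permutation i_{1:n}
    for pos, m in enumerate(order):
        opt = torch.optim.Adam(actors[m].parameters(), lr=5e-4)
        for _ in range(5):                                   # e mini-epochs of gradient ascent
            joint = []
            for j in range(n_agents):
                if j == m:
                    joint.append(actors[j](obs[j]))          # the candidate mu^{i_m}
                elif j in order[:pos]:
                    joint.append(actors[j](obs[j]).detach())      # predecessors: updated policies
                else:
                    joint.append(old_actors[j](obs[j]).detach())  # successors: old policies
            q = critic(torch.cat([state] + joint, dim=-1))
            loss = -q.mean()                                 # ascend on the centralised critic value
            opt.zero_grad(); loss.backward(); opt.step()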
Algorithm 7: HATD3
Input: stepsize $\alpha$, Polyak coefficient $\tau$, batch size $B$, number of: agents $n$, episodes $K$, steps per episode $T$, mini-epochs $e$, target noise range $c$;
Initialize: the critic networks $\phi_1, \phi_2$ and $\hat{\phi}_1, \hat{\phi}_2$ and policy networks $\{\theta^i\}_{i\in\mathcal{N}}$ and $\{\hat{\theta}^i\}_{i\in\mathcal{N}}$, replay buffer $\mathcal{B}$, random processes $\{\mathcal{X}^i\}_{i\in\mathcal{N}}$ for exploration;
for $k = 0, 1, \dots, K-1$ do
    Collect a set of transitions by letting the agents act according to their deterministic policies with the exploratory noise $a^i_t = \mu^i_{\theta^i}(o^i_t) + \mathcal{X}^i_t$.
    Push transitions $\{(s_t, o^i_t, a^i_t, r_t, s_{t+1}, o^i_{t+1}),\ \forall i \in \mathcal{N},\ t \in T\}$ into $\mathcal{B}$;
    Sample a random minibatch of $B$ transitions from $\mathcal{B}$;
    Compute the critic targets $y_t = r_t + \gamma \min_{j=1,2} Q_{\hat{\phi}_j}(s_{t+1}, \hat{\boldsymbol{a}}_{t+1})$, where $\hat{a}^i_{t+1} = \mathrm{clip}\big( \mu^i_{\hat{\theta}^i}(o^i_{t+1}) + \epsilon,\ a^i_{\text{Low}},\ a^i_{\text{High}} \big)$, $\epsilon \sim \mathrm{clip}(\mathcal{N}(0, \tilde{\sigma}), -c, c)$. // Here $\mathcal{N}$ denotes the Normal distribution.
    Update the critic by minimising the loss $\phi_j = \arg\min_{\phi_j} \frac{1}{B} \sum_t \big( y_t - Q_{\phi_j}(s_t, \boldsymbol{a}_t) \big)^2$, $j = 1, 2$.
    if $k$ mod policy_delay $= 0$ then
        Draw a permutation of agents $i_{1:n}$ at random;
        for agent $i_m = i_1, \dots, i_n$ do
            Update agent $i_m$ by solving
            $\theta^{i_m}_{\text{new}} = \arg\max_{\tilde{\theta}^{i_m}} \frac{1}{B} \sum_t Q_{\phi_1}\Big( s_t,\ \mu^{i_{1:m-1}}_{\theta^{i_{1:m-1}}_{\text{new}}}(o^{i_{1:m-1}}_t),\ \mu^{i_m}_{\tilde{\theta}^{i_m}}(o^{i_m}_t),\ \mu^{i_{m+1:n}}_{\theta^{i_{m+1:n}}_{\text{old}}}(o^{i_{m+1:n}}_t) \Big)$
            with $e$ mini-epochs of deterministic policy gradient ascent;
        Update the target networks smoothly
            $\hat{\phi}_1 = \tau \phi_1 + (1 - \tau) \hat{\phi}_1$,
            $\hat{\phi}_2 = \tau \phi_2 + (1 - \tau) \hat{\phi}_2$,
            $\hat{\theta}^i = \tau \theta^i + (1 - \tau) \hat{\theta}^i$.
Discard $\phi_1$, $\phi_2$, $\hat{\phi}_1$, $\hat{\phi}_2$, and $\hat{\theta}^i, \forall i \in \mathcal{N}$. Deploy $\theta^i, \forall i \in \mathcal{N}$ in execution.
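The distinguishing steps of Algorithm 7 relative to Algorithm 6 are the smoothed target actions and the twin-critic minimum in the targets. The following is a minimal PyTorch sketch of those steps, not the official HARL implementation; it treats the joint action as a single vector, uses one stand-in target policy, and the sizes and random minibatch are illustrative assumptions.

    # Sketch of the HATD3 critic targets and Polyak update.
    import copy
    import torch
    import torch.nn as nn

    batch, state_dim, act_dim = 64, 12, 4                 # act_dim stands in for the joint action
    gamma, sigma, c, tau = 0.99, 0.2, 0.5, 0.005
    a_low, a_high = -1.0, 1.0

    def q_net():
        return nn.Sequential(nn.Linear(state_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    q1, q2 = q_net(), q_net()
    q1_t, q2_t = copy.deepcopy(q1), copy.deepcopy(q2)     # target critics phi_hat_1, phi_hat_2
    actor_t = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())

    s, s_next = torch.randn(batch, state_dim), torch.randn(batch, state_dim)  # sampled minibatch
    a, r = torch.rand(batch, act_dim) * 2 - 1, torch.randn(batch, 1)

    with torch.no_grad():
        noise = torch.clamp(sigma * torch.randn(batch, act_dim), -c, c)  # clip(N(0, sigma~), -c, c)
        a_next = torch.clamp(actor_t(s_next) + noise, a_low, a_high)     # smoothed, bounded target action
        q_next = torch.minimum(q1_t(torch.cat([s_next, a_next], -1)),
                               q2_t(torch.cat([s_next, a_next], -1)))
        y = r + gamma * q_next                                           # clipped double-Q target y_t

    critic_loss = sum(((q(torch.cat([s, a], -1)) - y) ** 2).mean() for q in (q1, q2))
    critic_loss.backward()                                               # regression step (optimiser step omitted)

    with torch.no_grad():                                                # Polyak-averaged target updates
        for src, dst in ((q1, q1_t), (q2, q2_t)):
            for p, p_t in zip(src.parameters(), dst.parameters()):
                p_t.mul_(1 - tau).add_(tau * p)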
Appendix H. The Summary of HARL algorithms as Instances of HAML
Recap of HAML
Definition of HAMO:
$$\big[\mathrm{M}^{(\pi^{i_m})}_{\mathfrak{D}^{i_m},\, \boldsymbol{\pi}^{i_{1:m-1}}_{k+1}} A_{\boldsymbol{\pi}_k}\big](s) \triangleq \mathbb{E}_{a^{i_{1:m-1}}\sim\boldsymbol{\pi}^{i_{1:m-1}}_{k+1},\, a^{i_m}\sim\pi^{i_m}}\Big[ A^{i_m}_{\boldsymbol{\pi}_k}\big(s, a^{i_{1:m-1}}, a^{i_m}\big) \Big] - \mathfrak{D}^{i_m}_{\boldsymbol{\pi}_k}\big( \pi^{i_m} \,\big|\, s, \boldsymbol{\pi}^{i_{1:m-1}}_{k+1} \big).$$
Optimisation target:
$$\pi^{i_m}_{k+1} = \arg\max_{\pi^{i_m} \in \mathcal{U}^{i_m}_{\boldsymbol{\pi}_k}(\pi^{i_m}_k)} \mathbb{E}_{s\sim\beta_{\boldsymbol{\pi}_k}}\Big[ \big[\mathrm{M}^{(\pi^{i_m})}_{\mathfrak{D}^{i_m},\, \boldsymbol{\pi}^{i_{1:m-1}}_{k+1}} A_{\boldsymbol{\pi}_k}\big](s) \Big].$$
HATRPO
$$\pi^{i_m}_{k+1} = \arg\max_{\pi^{i_m}} \mathbb{E}_{s\sim\rho_{\boldsymbol{\pi}_k},\, a^{i_{1:m-1}}\sim\boldsymbol{\pi}^{i_{1:m-1}}_{k+1},\, a^{i_m}\sim\pi^{i_m}}\Big[ A^{i_m}_{\boldsymbol{\pi}_k}\big(s, a^{i_{1:m-1}}, a^{i_m}\big) \Big], \quad \text{subject to } \bar{D}_{\mathrm{KL}}\big(\pi^{i_m}_k, \pi^{i_m}\big) \leq \delta. \tag{28}$$
Drift functional: HADF $\mathfrak{D}^{i_m}_{\boldsymbol{\pi}_k}\big( \pi^{i_m} \,\big|\, s, \boldsymbol{\pi}^{i_{1:m-1}}_{k+1} \big) \equiv 0$.
Neighborhood operator: $\mathcal{U}^{i_m}_{\boldsymbol{\pi}_k}(\pi^{i_m}_k) = \Big\{ \pi^{i_m} \in \Pi^{i_m} \,\Big|\, \mathbb{E}_{s\sim\rho_{\boldsymbol{\pi}_k}}\big[ D_{\mathrm{KL}}\big(\pi^{i_m}_k(\cdot|s), \pi^{i_m}(\cdot|s)\big) \big] \leq \delta \Big\}$.
Sampling distribution: $\beta_{\boldsymbol{\pi}_k} = \rho_{\boldsymbol{\pi}_k}$.
HAPPO
$$\pi^{i_m}_{k+1} = \arg\max_{\pi^{i_m}} \mathbb{E}_{s\sim\rho_{\boldsymbol{\pi}_k},\, a^{i_{1:m-1}}\sim\boldsymbol{\pi}^{i_{1:m-1}}_{k+1},\, a^{i_m}\sim\pi^{i_m}_k}\Big[ \min\Big( r(\pi^{i_m}) A^{i_{1:m}}_{\boldsymbol{\pi}_k}(s, a^{i_{1:m}}),\ \mathrm{clip}\big( r(\pi^{i_m}), 1 \pm \epsilon \big) A^{i_{1:m}}_{\boldsymbol{\pi}_k}(s, a^{i_{1:m}}) \Big) \Big], \quad \text{where } r(\pi^{i_m}) = \frac{\pi^{i_m}(a^{i_m}|s)}{\pi^{i_m}_k(a^{i_m}|s)}. \tag{29}$$
Drift functional:
$$\mathfrak{D}^{i_m}_{\boldsymbol{\pi}_k}\big( \pi^{i_m} \,\big|\, s, \boldsymbol{\pi}^{i_{1:m-1}}_{k+1} \big) = \mathbb{E}_{a^{i_{1:m-1}}\sim\boldsymbol{\pi}^{i_{1:m-1}}_{k+1},\, a^{i_m}\sim\pi^{i_m}_k}\Big[ \mathrm{ReLU}\Big( \big( r(\pi^{i_m}) - \mathrm{clip}\big( r(\pi^{i_m}), 1 \pm \epsilon \big) \big) A^{i_{1:m}}_{\boldsymbol{\pi}_k}(s, a^{i_{1:m}}) \Big) \Big] \tag{30}$$
Neighborhood operator: $\mathcal{U}^{i_m}_{\boldsymbol{\pi}_k}(\pi^{i_m}_k) \equiv \Pi^{i_m}$.
Sampling distribution: $\beta_{\boldsymbol{\pi}_k} = \rho_{\boldsymbol{\pi}_k}$.
HAA2C
$$\pi^{i_m}_{k+1} = \arg\max_{\pi^{i_m}} \mathbb{E}_{s\sim\rho_{\boldsymbol{\pi}_k},\, a^{i_{1:m-1}}\sim\boldsymbol{\pi}^{i_{1:m-1}}_{k+1},\, a^{i_m}\sim\pi^{i_m}}\Big[ A^{i_m}_{\boldsymbol{\pi}_k}\big(s, a^{i_{1:m-1}}, a^{i_m}\big) \Big] \tag{31}$$
Drift functional: HADF $\mathfrak{D}^{i_m}_{\boldsymbol{\pi}_k}\big( \pi^{i_m} \,\big|\, s, \boldsymbol{\pi}^{i_{1:m-1}}_{k+1} \big) \equiv 0$.
Neighborhood operator: $\mathcal{U}^{i_m}_{\boldsymbol{\pi}_k}(\pi^{i_m}_k) \equiv \Pi^{i_m}$.
Sampling distribution: $\beta_{\boldsymbol{\pi}_k} = \rho_{\boldsymbol{\pi}_k}$.
HADDPG & HATD3
$$\mu^{i_m}_{k+1} = \arg\max_{\mu^{i_m}} \mathbb{E}_{s\sim\beta_{\boldsymbol{\mu}_k}}\Big[ Q^{i_{1:m}}_{\boldsymbol{\mu}_k}\big( s, \boldsymbol{\mu}^{i_{1:m-1}}_{k+1}(s), \mu^{i_m}(s) \big) \Big], \tag{32}$$
Drift functional: HADF $\mathfrak{D}^{i_m}_{\boldsymbol{\mu}_k}\big( \mu^{i_m} \,\big|\, s, \boldsymbol{\mu}^{i_{1:m-1}}_{k+1} \big) \equiv 0$.
Neighborhood operator: $\mathcal{U}^{i_m}_{\boldsymbol{\mu}_k}(\mu^{i_m}_k) \equiv \Pi^{i_m}$ (the deterministic policy space).
Sampling distribution: $\beta_{\boldsymbol{\mu}_k}$ is a uniform distribution over the states in the off-policy replay buffer.
Appendix I. HAD3QN: A Pure Value-based Approximation to HADDPG
In this section, we propose HAD3QN, which is a pure value-based approximation of HAD-
DPG. Corresponding to HADDPG where each agent learns to maximise the joint target
given the previous agents’ updates, HAD3QN models decentralised agents as individual Q
networks that predict the centralised critic’s output. In particular, the centralised critic’s
output is sequentially maximised for sequential learning. During execution, for each obser-
vation each agent chooses the action that maximises its individual Q network. We provide
its pseudocode in Algorithm 8.
[Figure: two panels, Speaker Listener (discrete) and Spread (discrete); x-axis: Step (×10^7), y-axis: Episode Return; curves: MAPPO, HAD3QN, HAA2C, HATRPO, HAPPO.]
Figure 13: Average episode return of HAD3QN on Speaker Listener and Spread compared with existing methods.
[Figure: two panels, Speaker Listener (discrete) and Spread (discrete); x-axis: Step (×10^7), y-axis: Episode Return; curves: vanilla HADQN, HAD3QN.]
Figure 14: Ablation study on the effect of dueling network architecture in HAD3QN.
Empirically, we test HAD3QN on the Speaker Listener and Spread tasks in MPE and observe that it solves them within 10 million steps (Figure 13). Compared with a vanilla HADQN that does not use the dueling architecture (Figure 14), we find that the dueling network architecture effectively improves learning efficiency and stability, and is crucial for HAD3QN to achieve higher return. The hyperparameters are reported in Appendix K.
However, we note that HAD3QN does not scale well, as it suffers from the curse of dimensionality when the number of agents and the dimensionality of individual action spaces grow. This phenomenon is similar to what has been discussed for DQN in the single-agent RL setting by Lillicrap et al. (2016). The purpose of proposing HAD3QN is not to set a new state of the art, but to show that a discretised approximation of HADDPG is also possible and performs well on low-dimensional tasks. It also shows that our HARL framework allows direct extension of single-agent RL research results, in this case the dueling network design, which is potentially powerful as it saves the effort of re-deriving analogous multi-agent results.
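For completeness, the following is a minimal PyTorch sketch of a dueling Q head of the kind referred to above, in the spirit of the dueling architecture of Wang et al. (2016); it is not the HARL implementation, and the layer sizes are placeholders.

    # Sketch of a dueling Q head: Q(o, a) = V(o) + A(o, a) - mean_a A(o, a).
    import torch
    import torch.nn as nn

    class DuelingQNet(nn.Module):
        def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
            super().__init__()
            self.base = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
            self.v_head = nn.Linear(hidden, 1)           # state-value stream V(o)
            self.a_head = nn.Linear(hidden, n_actions)   # advantage stream A(o, a)

        def forward(self, obs: torch.Tensor) -> torch.Tensor:
            h = self.base(obs)
            v, a = self.v_head(h), self.a_head(h)
            # Subtracting the mean advantage keeps the V/A decomposition identifiable.
            return v + a - a.mean(dim=-1, keepdim=True)

    q = DuelingQNet(obs_dim=6, n_actions=4)
    greedy_action = q(torch.randn(2, 6)).argmax(dim=-1)   # greedy decentralised execution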
Appendix J. Additional Experiment Results
In this section, we present the learning curves of HAPPO, HATRPO, MAPPO, and QMIX
across at least three seeds on ten SMAC maps and five SMACv2 maps in Figure 15.
[Figure: fifteen panels of Win Rate (%) against Step — SMAC: 8m vs 9m (Hard), 25m (Hard), 5m vs 6m (Hard), 3s5z (Hard), 10m vs 11m (Hard), MMM2 (Super Hard), 3s5z vs 3s6z (Super Hard), 27m vs 30m (Super Hard), corridor (Super Hard), 6h vs 8z (Super Hard); SMACv2: protoss 5 vs 5, terran 5 vs 5, zerg 5 vs 5, zerg 10 vs 10, zerg 10 vs 11; curves: QMIX, MAPPO, HATRPO, HAPPO.]
Figure 15: Comparisons of average win rate on SMAC and SMACv2. It should be noted that some of the QMIX experiments were terminated early if they had already converged, as observed in MMM2, 3s5z_vs_3s6z, and corridor, or if the computational resources required were excessive, as observed in the case of 27m_vs_30m. Specifically, running QMIX for a single seed for 20 million steps in 27m_vs_30m would have necessitated more than 250 GB of memory and 10 days, which exceeded the computational budget allocated for this study. Consequently, we executed the experiment for only 10 million steps.
Algorithm 8: HAD3QN
Input: stepsize $\alpha$, Polyak coefficient $\tau$, batch size $B$, exploration parameter $\epsilon$, number of: agents $n$, episodes $K$, steps per episode $T$.
Initialize: global critic and target networks $\phi$ and $\hat{\phi}$, distributed critic and target networks $\{\theta^i, \forall i \in \mathcal{N}\}$ and $\{\hat{\theta}^i, \forall i \in \mathcal{N}\}$, replay buffer $\mathcal{B}$.
for $k = 0, 1, \dots, K-1$ do
    Collect a set of trajectories by letting the agents act $\epsilon$-greedily with respect to the distributed critics:
    $a^i_t = \arg\max_{a^i} Q^i_{\theta^i}(o^i_t, a^i)$ with probability $1 - \epsilon$, and a random action with probability $\epsilon$.
    Push transitions $\{(s_t, o^i_t, a^i_t, r_t, s_{t+1}, o^i_{t+1}),\ \forall i \in \mathcal{N},\ t \in T\}$ into $\mathcal{B}$.
    Sample a random minibatch of $B$ transitions from $\mathcal{B}$.
    Compute the global target $y_t = r_t + \gamma \cdot Q_{\hat{\phi}}(s_{t+1}, \bar{\boldsymbol{a}})$, where $\bar{a}^i = \arg\max_{a^i} Q^i_{\hat{\theta}^i}(o^i_{t+1}, a^i)$, for all $i \in \mathcal{N}$.
    Compute the global loss $L(\phi) = \frac{1}{B} \sum_{b=1}^{B} \big( Q_\phi(s_b, \boldsymbol{a}_b) - y_b \big)^2$.
    Update the critic parameters $\phi = \phi - \alpha \nabla_\phi L(\phi)$.
    Draw a permutation of agents $i_{1:n}$ at random;
    for agent $i_m = i_1, \dots, i_n$ do
        Compute the local targets $y^{i_m}_t = Q_\phi\big(s_t, \bar{a}^{i_{1:m-1}}, a^{-i_{1:m-1}}_t\big)$, where $\bar{a}^{i_j} = \arg\max_{a^{i_j}} Q_\phi\big(s_t, \bar{a}^{i_{1:j-1}}, a^{i_j}, a^{-i_{1:j}}_t\big)$, for $j < m$.
        Compute the agent's local loss $L(\theta^{i_m}) = \frac{1}{B} \sum_{b=1}^{B} \big( Q^{i_m}_{\theta^{i_m}}(o^{i_m}_b, a^{i_m}_b) - y^{i_m}_b \big)^2$.
        Update the critic parameters $\theta^{i_m} = \theta^{i_m} - \alpha \nabla_{\theta^{i_m}} L(\theta^{i_m})$.
    Update the target networks smoothly $\hat{\phi} = \tau \phi + (1 - \tau) \hat{\phi}$, $\hat{\theta}^i = \tau \theta^i + (1 - \tau) \hat{\theta}^i$.
Discard $\phi$, $\hat{\phi}$, and $\hat{\theta}^i, \forall i \in \mathcal{N}$. Deploy $\theta^i, \forall i \in \mathcal{N}$ in execution.
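The following is a minimal PyTorch sketch, not the official HARL implementation, of how the local targets of Algorithm 8 can be formed for a small discrete action space: the permuted agents' slots in the joint action are replaced one by one with the greedy maximisers of the central critic $Q_\phi$, and each agent's individual Q network regresses onto the resulting critic value. Sizes are illustrative, discrete actions are fed to the critic as raw indices for brevity, and the dueling heads and optimiser steps are omitted.

    # Sketch of the sequential local-target construction in HAD3QN.
    import torch
    import torch.nn as nn

    n_agents, obs_dim, state_dim, n_actions, batch = 3, 6, 12, 4, 32
    critic = nn.Sequential(nn.Linear(state_dim + n_agents, 64), nn.ReLU(), nn.Linear(64, 1))
    local_q = [nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
               for _ in range(n_agents)]              # dueling heads omitted for brevity

    state = torch.randn(batch, state_dim)             # sampled global states s_t
    obs = torch.randn(n_agents, batch, obs_dim)       # per-agent observations o^i_t
    acts = torch.randint(n_actions, (batch, n_agents))  # buffer actions a^i_t (discrete indices)

    def critic_value(joint_acts):                     # Q_phi(s, a^{1:n}); indices fed as floats for brevity
        return critic(torch.cat([state, joint_acts.float()], dim=-1)).squeeze(-1)

    joint = acts.clone()                              # start from the buffer's joint action
    for m in torch.randperm(n_agents).tolist():       # random permutation i_{1:n}
        with torch.no_grad():                         # y^{i_m}: greedy predecessors, buffer actions otherwise
            y = critic_value(joint)
        q_taken = local_q[m](obs[m]).gather(1, acts[:, m].unsqueeze(1)).squeeze(1)
        loss = ((q_taken - y) ** 2).mean()            # regress the local Q network onto the central critic
        loss.backward()                               # (optimiser step omitted)
        with torch.no_grad():                         # replace agent m's slot by its greedy maximiser of Q_phi
            cand = []
            for a in range(n_actions):
                trial = joint.clone()
                trial[:, m] = a
                cand.append(critic_value(trial))
            joint[:, m] = torch.stack(cand, dim=1).argmax(dim=1)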
Appendix K. Hyperparameter Settings for Experiments
Before we report the hyperparameters used in the experiments, we would like to clarify the reporting conventions that we follow. Firstly, for simplicity and clarity, we specify the network architecture as MLP or RNN, whereas in the configuration files the corresponding term is the boolean value use_recurrent_policy. The only difference between the RNN network and the MLP network is that the former adds a GRU layer after the same MLP backbone; the related configuration of this GRU layer is provided in Table 4. Secondly, a hyperparameter only takes effect when it is actually used. For example, the number of GRU layers is set to 1 across all environments, but it is only relevant when the network architecture is RNN; as another example, while we report kl_threshold in the on-policy hyperparameter tables, it is only used by HATRPO. Finally, the batch_size reported for on-policy algorithms is calculated as the product of n_rollout_threads and episode_length, as illustrated below.
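As a brief illustration of these conventions, the following hypothetical snippet shows how the reported batch_size relates to n_rollout_threads and episode_length; the numeric values are examples only and are not tied to any particular task.

    # Illustrative (hypothetical) configuration values; only the key names come from the text above.
    config = {
        "use_recurrent_policy": False,   # reported as network = MLP in the tables; True would mean RNN
        "n_rollout_threads": 20,         # example value only
        "episode_length": 200,           # example value only
    }
    batch_size = config["n_rollout_threads"] * config["episode_length"]
    print(batch_size)                    # 4000 under these example values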
K.1 Common Hyperparameters Across All Environments
In this part, we present the common hyperparameters used for on-policy algorithms in Table
4 and for off-policy algorithms in Table 5 across all environments.
Table 4: Common hyperparameters used for on-policy algorithms HAPPO, HATRPO,
HAA2C, and MAPPO (when our MAPPO implementation is used) across all environments.
hyperparameters value hyperparameters value
use valuenorm True use proper time limits True
activation ReLU use feature normalization True
initialization method orthogonal gain 0.01
use naive recurrent policy False num GRU layers 1
data chunk length 10 optim eps 1e-5
weight decay 0 std x coef 1
std y coef 0.5 use clipped value loss True
value loss coef 1 use max grad norm True
max grad norm 10.0 use GAE True
GAE lambda 0.95 use huber loss True
use policy active masks True huber delta 10.0
action aggregation prod ls step 10
accept ratio 0.5
K.2 Multi-Agent Particle Environment (MPE)
In this part, we present the hyperparameters used in MPE tasks for HAPPO, HATRPO,
HAA2C, and MAPPO in Table 6, for HADDPG, HATD3, MADDPG, and MATD3 in Table
7, and for HAD3QN in Table 8.
Table 5: Common hyperparameters used for off-policy algorithms HADDPG, HATD3,
HAD3QN, MADDPG, and MATD3 across all environments.
hyperparameters value hyperparameters value
proper time limits True warmup steps 1e4
activation ReLU final activation Tanh
base activation ReLU dueling v activation Hardswish
dueling a activation Hardswish buffer size 1e6
batch size 1000 polyak 0.005
epsilon 0.05 policy noise 0.2
noise clip 0.5
Table 6: Common hyperparameters used for HAPPO, HATRPO, HAA2C, and MAPPO in
the MPE domain.
hyperparameters value hyperparameters value hyperparameters value
batch size 4000 linear lr decay False network MLP
hidden sizes [128, 128] actor lr 5e-4 critic lr 5e-4
ppo epoch 5 critic epoch 5 a2c epoch 5
clip param 0.2 actor mini batch 1 critic mini batch 1
entropy coef 0.01 gamma 0.99 kl threshold 0.005
backtrack coeff 0.8
Table 7: Common hyperparameters used for HADDPG, HATD3, MADDPG, and MATD3
in the MPE domain.
hyperparameters value hyperparameters value hyperparameters value
rollout threads 20 train interval 50 update per train 1
linear lr decay False hidden sizes [128, 128] actor lr 5e-4
critic lr 1e-3 gamma 0.99 n step 1
policy update frequency 2
Table 8: Common hyperparameters used for HAD3QN in the MPE domain.
hyperparameters value hyperparameters value hyperparameters value
rollout threads 20 train interval 50 base hidden sizes [128, 128]
linear lr decay False update per train 1 dueling v hidden sizes [128]
actor lr 5e-4 critic lr 1e-3 dueling a hidden sizes [128]
gamma 0.95 n step 1
K.3 Multi-Agent MuJoCo (MAMuJoCo)
In this part, we report the hyperparameters used in MAMuJoCo tasks for HAPPO, HA-
TRPO, HAA2C, and MAPPO in Table 9, 10, 11, and 12, and for HADDPG, HATD3,
MADDPG, and MATD3 in Table 13, 14, 15, and 16.
Table 9: Common hyperparameters used for HAPPO, HATRPO, HAA2C, and MAPPO in
the MAMuJoCo domain.
hyperparameters value hyperparameters value hyperparameters value
batch size 4000 network MLP hidden sizes [128, 128, 128]
gamma 0.99 backtrack coeff 0.8
Table 10: Different hyperparameters used for HAPPO and MAPPO in the MAMuJoCo
domain.
scenarios | linear lr decay | actor/critic lr | ppo/critic epoch | clip param | actor/critic mini batch | entropy coef
Ant 4x2 | False | 5e-4 | 5 | 0.1 | 1 | 0
HalfCheetah 2x3 | False | 5e-4 | 15 | 0.05 | 1 | 0.01
Hopper 3x1 | True | 5e-4 | 10 | 0.05 | 1 | 0
Walker 2x3 | True | 1e-3 | 5 | 0.05 | 2 | 0
Walker 6x1 | False | 5e-4 | 5 | 0.1 | 1 | 0.01
Humanoid 17x1 | True | 5e-4 | 5 | 0.1 | 1 | 0
Table 11: Different hyperparameters used for HATRPO in the MAMuJoCo domain.
scenarios | linear lr decay | critic lr | critic epoch | clip param | critic mini batch | kl threshold
Ant 4x2 | False | 5e-4 | 5 | 0.2 | 1 | 5e-3
HalfCheetah 2x3 | False | 5e-4 | 5 | 0.2 | 1 | 1e-2
Hopper 3x1 | False | 5e-4 | 5 | 0.2 | 1 | 1e-3
Walker 2x3 | False | 5e-4 | 5 | 0.2 | 1 | 1e-2
Walker 6x1 | False | 5e-4 | 5 | 0.2 | 1 | 5e-3
K.4 StarCraft Multi-Agent Challenge (SMAC)
In the SMAC domain, for MAPPO and QMIX baselines we adopt the implementation and
tuned hyperparameters reported in the MAPPO paper. Here we report the hyperparam-
eters for HAPPO and HATRPO in Table 17, 18, 19, and 20, which are kept comparable
with the baselines for fairness purposes. The state type hyperparameter can take the value “EP”
(for the Environment-Provided global state) or “FP” (for the Featured-Pruned agent-specific global
state), as named by Yu et al. (2022).
Table 12: Different hyperparameters used for HAA2C in the MAMuJoCo domain.
scenarios | linear lr decay | actor/critic lr | a2c/critic epoch | clip param | actor/critic mini batch | entropy coef
Ant 4x2 | True | 5e-4 | 5 | 0.1 | 1 | 0
HalfCheetah 2x3 | True | 5e-4 | 5 | 0.1 | 1 | 0
Hopper 3x1 | True | 1e-4 | 3 | 0.1 | 1 | 0
Walker 2x3 | True | 1e-4 | 5 | 0.1 | 1 | 0
Walker 6x1 | True | 1e-4 | 5 | 0.1 | 1 | 0
Table 13: Common hyperparameters used for HADDPG and MADDPG in the MAMuJoCo
domain.
hyperparameters value hyperparameters value hyperparameters value
rollout threads 10 train interval 50 linear lr decay False
hidden sizes [256, 256] actor lr 5e-4 critic lr 1e-3
gamma 0.99
Table 14: Different hyperparameters used for HADDPG and MADDPG in the MAMuJoCo
domain.
scenarios | update per train | exploration noise | n step
Ant 4x2 | 0.5 | 0.05 | 20
HalfCheetah 2x3 | 1 | 0.1 | 20
Hopper 3x1 | 1 | 0.1 | 20
Walker 2x3 | 1 | 0.1 | 10
Walker 6x1 | 1 | 0.1 | 20
Table 15: Common hyperparameters used for HATD3 and MATD3 in the MAMuJoCo
domain.
hyperparameters value hyperparameters value hyperparameters value
rollout threads 10 train interval 50 update per train 1
linear lr decay False hidden sizes [256, 256] actor lr 5e-4
critic lr 1e-3 gamma 0.99 exploration noise 0.1
Table 16: Different hyperparameters used for HATD3 and MATD3 in the MAMuJoCo
domain.
scenarios policy update frequency n step
Ant 4x2 2 5
HalfCheetah 2x3 2 10
Hopper 3x1 2 5
Walker 2x3 8 20
Walker 6x1 2 25
Humanoid 17x1 2 5
Table 17: Common hyperparameters used for HAPPO in the SMAC domain.
hyperparameters value hyperparameters value hyperparameters value
batch size 3200 linear lr decay False hidden sizes [64, 64, 64]
actor lr 5e-4 critic lr 5e-4 entropy coef 0.01
Table 18: Different hyperparameters used for HAPPO in the SMAC domain.
Map | network | ppo/critic epoch | clip param | actor/critic mini batch | gamma | state type
8m_vs_9m | RNN | 5 | 0.05 | 1 | 0.95 | EP
25m | RNN | 5 | 0.2 | 1 | 0.99 | EP
5m_vs_6m | RNN | 5 | 0.05 | 1 | 0.95 | FP
3s5z | RNN | 5 | 0.2 | 1 | 0.99 | EP
10m_vs_11m | RNN | 5 | 0.05 | 1 | 0.95 | FP
MMM2 | MLP | 5 | 0.2 | 1 | 0.95 | EP
3s5z_vs_3s6z | RNN | 5 | 0.1 | 2 | 0.95 | FP
27m_vs_30m | RNN | 5 | 0.05 | 1 | 0.95 | FP
6h_vs_8z | MLP | 10 | 0.05 | 2 | 0.95 | FP
corridor | MLP | 5 | 0.2 | 1 | 0.99 | FP
Table 19: Common hyperparameters used for HATRPO in the SMAC domain.
hyperparameters value hyperparameters value hyperparameters value
batch size 3200 linear lr decay False hidden sizes [64, 64, 64]
critic epoch 5 clip param 0.2 critic mini batch 1
Table 20: Different hyperparameters used for HATRPO in the SMAC domain.
Map | network | critic lr | gamma | kl threshold | backtrack coeff | state type
8m_vs_9m | MLP | 5e-4 | 0.99 | 5e-3 | 0.5 | FP
25m | RNN | 5e-4 | 0.99 | 1e-2 | 0.5 | EP
5m_vs_6m | RNN | 5e-4 | 0.99 | 1e-2 | 0.5 | FP
3s5z | MLP | 5e-4 | 0.95 | 1e-2 | 0.5 | EP
10m_vs_11m | MLP | 5e-4 | 0.95 | 5e-3 | 0.5 | FP
MMM2 | MLP | 5e-4 | 0.95 | 6e-2 | 0.5 | EP
3s5z_vs_3s6z | MLP | 5e-4 | 0.99 | 5e-3 | 0.5 | FP
27m_vs_30m | RNN | 5e-4 | 0.99 | 1e-3 | 0.8 | FP
6h_vs_8z | MLP | 1e-3 | 0.99 | 1e-3 | 0.8 | FP
corridor | RNN | 5e-4 | 0.99 | 6e-2 | 0.5 | FP
K.5 SMACv2
In the SMACv2 domain, for MAPPO and QMIX baselines we adopt the implementation and
tuned hyperparameters reported in Ellis et al. (2022). Here we report the hyperparameters
for HAPPO and HATRPO in Table 21 and 22, which are kept comparable with the baselines
for fairness purposes.
Table 21: Hyperparameters used for HAPPO in the SMACv2 domain.
hyperparameters value hyperparameters value hyperparameters value
batch size 3200 linear lr decay False hidden sizes [64]
network RNN ppo / critic epoch 5 clip param 0.05
actor lr 5e-4 critic lr 5e-4 entropy coef 0.01
gamma 0.99 actor / critic mini batch 2 for terran_5_vs_5 and 1 otherwise
Table 22: Hyperparameters used for HATRPO on all tasks in the SMACv2 domain.
hyperparameters value hyperparameters value hyperparameters value
batch size 3200 linear lr decay False hidden sizes [64]
network RNN critic lr 5e-4 critic epoch 5
clip param 0.2 critic mini batch 1 gamma 0.99
kl threshold 5e-3 backtrack coeff 0.5
K.6 Google Research Football Environment (GRF)
In the GRF domain, for MAPPO and QMIX baselines we adopt the implementation and
tuned hyperparameters reported in the MAPPO paper. Here we report the hyperparameters
for HAPPO in Table 23 and 24, which are kept similar and comparable to the baselines for
fairness purposes.
Table 23: Common hyperparameters used for HAPPO in the GRF domain.
hyperparameters value hyperparameters value hyperparameters value
rollout threads 50 hidden sizes [64, 64] actor lr 5e-4
critic lr 5e-4 ppo epoch 15 critic epoch 15
clip param 0.2 actor mini batch 2 critic mini batch 2
entropy coef 0.01 gamma 0.99
K.7 Bi-DexterousHands
In the Bi-DexterousHands domain, we compare against the PPO and MAPPO baselines implemented in the Bi-DexterousHands benchmark, and for them we adopt the officially reported hyperparameters. Here we report the hyperparameters used for HAPPO in Table 25. As Bi-DexterousHands tasks are GPU-parallelised, we reuse the configuration term n_rollout_threads to mean the number of parallel environments. Thus, parallel envs in Table 25 refers to n_rollout_threads.
Table 24: Different hyperparameters used for HAPPO in the GRF domain.
scenarios | network | episode length | linear lr decay
PS | RNN | 200 | True
RPS | MLP | 200 | False
3v.1 | MLP | 200 | True
CA(easy) | MLP | 200 | True
CA(hard) | MLP | 1000 | True
Table 25: Common hyperparameters used for HAPPO in the Bi-DexterousHands domain.
hyperparameters value hyperparameters value hyperparameters value
parallel envs 256 linear lr decay False network MLP
hidden sizes [256, 256, 256] actor lr 5e-4 critic lr 5e-4
ppo epoch 5 critic epoch 5 clip param 0.2
actor mini batch 1 critic mini batch 1 entropy coef 0.01
gamma 0.95
References
Johannes Ackermann, Volker Gabler, Takayuki Osa, and Masashi Sugiyama. Reducing
overestimation bias in multi-agent domains using double centralized critics. arXiv preprint
arXiv:1910.01465, 2019.
Carlos Alós-Ferrer and Nick Netzer. The logit-response dynamics. Games and Economic
Behavior, 68(2):413–427, 2010.
Lawrence M Ausubel and Raymond J Deneckere. A generalized theorem of the maximum.
Economic Theory, 3(1):99–107, 1993.
Tamer Başar and Geert Jan Olsder. Dynamic noncooperative game theory. SIAM, 1998.
Daniel S Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity
of decentralized control of markov decision processes. Mathematics of operations research,
27(4):819–840, 2002.
Dimitri Bertsekas. Multiagent rollout algorithms and reinforcement learning. arXiv preprint
arXiv:1910.00120, 2019.
Jeancarlo Arguello Calvo and Ivana Dusparic. Heterogeneous multi-agent deep reinforcement
learning for traffic lights control. In AICS, pages 2–13, 2018.
Yongcan Cao, Wenwu Yu, Wei Ren, and Guanrong Chen. An overview of recent progress
in the study of distributed multi-agent coordination. IEEE Transactions on Industrial
informatics, 9(1):427–438, 2012.
Yuanpei Chen, Yaodong Yang, Tianhao Wu, Shengjie Wang, Xidong Feng, Jiechuan Jiang,
Zongqing Lu, Stephen Marcus McAleer, Hao Dong, and Song-Chun Zhu. Towards human-
level bimanual dexterous manipulation with reinforcement learning. In Thirty-sixth Con-
ference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
URL https://openreview.net/forum?id=D29JbExncTP.
Filippos Christianos, Georgios Papoudakis, Muhammad A Rahman, and Stefano V Al-
brecht. Scaling multi-agent reinforcement learning with selective parameter sharing. In
International Conference on Machine Learning, pages 1989–1998. PMLR, 2021.
Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative
multiagent systems. AAAI/IAAI, 1998(746-752):2, 1998.
Christian Schroeder de Witt, Tarun Gupta, Denys Makoviichuk, Viktor Makoviychuk,
Philip HS Torr, Mingfei Sun, and Shimon Whiteson. Is independent learning all you
need in the starcraft multi-agent challenge? arXiv preprint arXiv:2011.09533, 2020.
Benjamin Ellis, Skander Moalla, Mikayel Samvelyan, Mingfei Sun, Anuj Mahajan, Jakob N
Foerster, and Shimon Whiteson. Smacv2: An improved benchmark for cooperative multi-
agent reinforcement learning. arXiv preprint arXiv:2212.07489, 2022.
Jerzy Filar and Koos Vrieze. Competitive Markov decision processes. Springer Science &
Business Media, 2012.
Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon
Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Con-
ference on Artificial Intelligence, volume 32, 2018.
Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in
actor-critic methods. In International conference on machine learning, pages 1587–1596.
PMLR, 2018.
Ian Gemp, Brian McWilliams, Claire Vernade, and Thore Graepel. Eigengame: {PCA} as
a nash equilibrium. In International Conference on Learning Representations, 2021. URL
https://openreview.net/forum?id=NzTU59SYbNq.
Siyi Hu, Chuanlong Xie, Xiaodan Liang, and Xiaojun Chang. Policy diagnosis via measur-
ing role diversity in cooperative multi-agent rl. In International Conference on Machine
Learning, pages 9041–9071. PMLR, 2022a.
Siyi Hu, Yifan Zhong, Minquan Gao, Weixun Wang, Hao Dong, Zhihui Li, Xiaodan Liang,
Xiaojun Chang, and Yaodong Yang. Marllib: Extending rllib for multi-agent reinforce-
ment learning. arXiv preprint arXiv:2210.13708, 2022b.
Maximilian Hüttenrauch, Adrian Šošić, and Gerhard Neumann. Guided deep reinforcement
learning for swarm systems. arXiv preprint arXiv:1709.06011, 2017.
Maximilian Hüttenrauch, Sosic Adrian, Gerhard Neumann, et al. Deep reinforcement learn-
ing for swarm systems. Journal of Machine Learning Research, 20(54):1–31, 2019.
Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proc. 19th International Conference on Machine Learning. Citeseer, 2002.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In
International Conference on Learning Representations, 2015.
Jakub Grudzien Kuba, Muning Wen, Linghui Meng, Haifeng Zhang, David Mguni, Jun
Wang, Yaodong Yang, et al. Settling the variance of multi-agent policy gradients. Advances
in Neural Information Processing Systems, 34:13458–13470, 2021.
Jakub Grudzien Kuba, Ruiqing Chen, Muning Wen, Ying Wen, Fanglei Sun, Jun Wang,
and Yaodong Yang. Trust region policy optimisation in multi-agent reinforcement learn-
ing. In International Conference on Learning Representations, 2022a. URL https:
//openreview.net/forum?id=EcGGFkNTxdJ.
Jakub Grudzien Kuba, Christian Schroeder de Witt, and Jakob Foerster. Mirror learning:
A unifying framework of policy optimisation. ICML, 2022b.
Karol Kurach, Anton Raichuk, Piotr Stańczyk, Michał Zając, Olivier Bachem, Lasse Es-
peholt, Carlos Riquelme, Damien Vincent, Marcin Michalski, Olivier Bousquet, et al.
Google research football: A novel reinforcement learning environment. In Proceedings of
the AAAI Conference on Artificial Intelligence, volume 34, pages 4501–4510, 2020.
Hepeng Li and Haibo He. Multiagent trust region policy optimization. IEEE Transactions
on Neural Networks and Learning Systems, 2023.
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval
Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement
learning. In International Conference on Learning Representations, 2016.
Michael L Littman. Markov games as a framework for multi-agent reinforcement learning.
In Machine learning proceedings 1994, pages 157–163. Elsevier, 1994.
Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent
actor-critic for mixed cooperative-competitive environments. In Proceedings of the 31st
International Conference on Neural Information Processing Systems, pages 6382–6393,
2017.
David H Mguni, Yutong Wu, Yali Du, Yaodong Yang, Ziyi Wang, Minne Li, Ying Wen, Joel
Jennings, and Jun Wang. Learning in nonzero-sum stochastic games with potentials. In
Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference
on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages
7688–7699. PMLR, 18–24 Jul 2021.
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap,
Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep
reinforcement learning. In International conference on machine learning, pages 1928–
1937. PMLR, 2016.
Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-
agent populations. In Proceedings of the AAAI conference on artificial intelligence, vol-
ume 32, 2018.
John Nash. Non-cooperative games. Annals of mathematics, pages 286–295, 1951.
Frans A Oliehoek and Christopher Amato. A concise introduction to decentralized POMDPs.
Springer, 2016.
Georgios Papoudakis, Filippos Christianos, Lukas Schäfer, and Stefano V. Albrecht. Bench-
marking multi-agent deep reinforcement learning algorithms in cooperative tasks. In Pro-
ceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks
(NeurIPS), 2021. URL http://arxiv.org/abs/2006.07869.
Bei Peng, Tabish Rashid, Christian Schroeder de Witt, Pierre-Alexandre Kamienny, Philip
Torr, Wendelin Böhmer, and Shimon Whiteson. Facmac: Factored multi-agent centralised
policy gradients. Advances in Neural Information Processing Systems, 34:12208–12221,
2021.
P Peng, Q Yuan, Y Wen, Y Yang, Z Tang, H Long, and J Wang. Multiagent bidirectionally-coordinated nets for learning to play starcraft combat games. arXiv preprint arXiv:1703.10069, 2017.
Tabish Rashid, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob Foerster,
and Shimon Whiteson. Qmix: Monotonic value function factorisation for deep multi-agent
reinforcement learning. In International Conference on Machine Learning, pages 4295–
4304. PMLR, 2018.
Ray-Team. Ray rllib documentation: Multi-agent deep deterministic policy gra-
dient (maddpg). https://docs.ray.io/en/latest/rllib/rllib-algorithms.html#
multi-agent-deep-deterministic-policy-gradient-maddpg, accessed on 2023-03-14.
Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas
Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philiph H. S. Torr, Jakob Foerster, and
Shimon Whiteson. The StarCraft Multi-Agent Challenge. CoRR, abs/1902.04043, 2019.
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust
region policy optimization. In International conference on machine learning, pages 1889–
1897. PMLR, 2015.
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-
dimensional continuous control using generalized advantage estimation. In International
Conference on Learning Representations, 2016.
John Schulman, F. Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal
policy optimization algorithms. ArXiv, abs/1707.06347, 2017.
Lloyd S Shapley. Stochastic games. Proceedings of the national academy of sciences, 39(10):
1095–1100, 1953.
David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Ried-
miller. Deterministic policy gradient algorithms. In International conference on machine
learning, pages 387–395. PMLR, 2014.
Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zam-
baldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al.
Value-decomposition networks for cooperative multi-agent learning based on team reward.
In Proceedings of the 17th International Conference on Autonomous Agents and MultiA-
gent Systems, pages 2085–2087, 2018.
R. S. Sutton, D. Mcallester, S. Singh, and Y. Mansour. Policy gradient methods for re-
inforcement learning with function approximation. In Advances in Neural Information
Processing Systems 12, volume 12, pages 1057–1063. MIT Press, 2000.
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT
press, 2018.
Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In
Proceedings of the tenth international conference on machine learning, pages 330–337,
1993.
J Terry, Benjamin Black, Nathaniel Grammel, Mario Jayakumar, Ananth Hari, Ryan Sulli-
van, Luis S Santos, Clemens Dieffendahl, Caroline Horsch, Rodrigo Perez-Vicente, et al.
Pettingzoo: Gym for multi-agent reinforcement learning. Advances in Neural Information
Processing Systems, 34:15032–15043, 2021.
Justin K Terry, Nathaniel Grammel, Sanghyun Son, and Benjamin Black. Parameter
sharing for heterogeneous agents in multi-agent reinforcement learning. arXiv preprint
arXiv:2005.13625, 2020.
Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double
q-learning. In Proceedings of the AAAI conference on artificial intelligence, volume 30,
2016.
Jiangxing Wang, Deheng Ye, and Zongqing Lu. More centralized training, still decentralized
execution: Multi-agent conditional policy factorization. In The Eleventh International
Conference on Learning Representations, 2023. URL https://openreview.net/forum?
id=znLlSgN-4S0.
Tonghan Wang, Tarun Gupta, Anuj Mahajan, Bei Peng, Shimon Whiteson, and Chongjie
Zhang. Rode: Learning roles to decompose multi-agent tasks. International Conference
on Learning Representations, 2021.
Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas.
Dueling network architectures for deep reinforcement learning. In International conference
on machine learning, pages 1995–2003. PMLR, 2016.
Ying Wen, Yaodong Yang, Rui Luo, Jun Wang, and Wei Pan. Probabilistic recursive rea-
soning for multi-agent reinforcement learning. In International Conference on Learning
Representations, 2018.
Ying Wen, Yaodong Yang, and Jun Wang. Modelling bounded rationality in multi-agent
interactions by generalized recursive reasoning. In Christian Bessiere, editor, Proceedings
of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20,
pages 414–421. International Joint Conferences on Artificial Intelligence Organization, 7
2020. Main track.
Ying Wen, Hui Chen, Yaodong Yang, Minne Li, Zheng Tian, Xu Chen, and Jun Wang.
A game-theoretic approach to multi-agent trust region optimization. In International
Conference on Distributed Artificial Intelligence, pages 74–87. Springer, 2022.
Zifan Wu, Chao Yu, Deheng Ye, Junge Zhang, Hankz Hankui Zhuo, et al. Coordinated
proximal policy optimization. Advances in Neural Information Processing Systems, 34:
26437–26448, 2021.
Yaodong Yang and Jun Wang. An overview of multi-agent reinforcement learning from game
theoretical perspective. arXiv preprint arXiv:2011.00583, 2020.
Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean field
multi-agent reinforcement learning. In International Conference on Machine Learning,
pages 5571–5580. PMLR, 2018.
Yaodong Yang, Ying Wen, Jun Wang, Liheng Chen, Kun Shao, David Mguni, and Weinan
Zhang. Multi-agent determinantal q-learning. In International Conference on Machine
Learning, pages 10757–10766. PMLR, 2020.
Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and
Yi Wu. The surprising effectiveness of PPO in cooperative multi-agent games. In Thirty-
sixth Conference on Neural Information Processing Systems Datasets and Benchmarks
Track, 2022.
Haifeng Zhang, Weizhe Chen, Zeren Huang, Minne Li, Yaodong Yang, Weinan Zhang, and
Jun Wang. Bi-level actor-critic for multi-agent coordination. In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 34, pages 7325–7332, 2020.
Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-agent reinforcement learning: A
selective overview of theories and algorithms. Handbook of reinforcement learning and
control, pages 321–384, 2021.
Ming Zhou, Ziyu Wan, Hanjing Wang, Muning Wen, Runzhe Wu, Ying Wen, Yaodong Yang,
Yong Yu, Jun Wang, and Weinan Zhang. Malib: A parallel framework for population-
based multi-agent reinforcement learning. Journal of Machine Learning Research, 24(150):
1–12, 2023. URL http://jmlr.org/papers/v24/22-0169.html.