Heterogeneous-Agent Mirror Learning:
A Continuum of Solutions to Cooperative MARL

Jakub Grudzien Kuba*,1, Xidong Feng*,2, Shiyao Ding3, Hao Dong4, Jun Wang2, Yaodong Yang†,4
1 University of Oxford, 2 University College London, 3 Kyoto University, 4 Peking University
* Equal contribution. † Corresponding author <[email protected]>.
Preprint. Under review. arXiv:2208.01682v1 [cs.MA] 2 Aug 2022
Abstract
The necessity for cooperation among intelligent machines has popularised coop-
erative multi-agent reinforcement learning (MARL) in the artificial intelligence
(AI) research community. However, many research endeavours have been focused
on developing practical MARL algorithms whose effectiveness has been studied
only empirically, thereby lacking theoretical guarantees. As recent studies have
revealed, MARL methods often achieve performance that is unstable in terms of
reward monotonicity or suboptimal at convergence. To resolve these issues, in
this paper, we introduce a novel framework named Heterogeneous-Agent Mirror
Learning (HAML) that provides a general template for MARL algorithmic designs.
We prove that algorithms derived from the HAML template satisfy the desired
properties of the monotonic improvement of the joint reward and the convergence
to Nash equilibrium. We verify the practicality of HAML by proving that the
current state-of-the-art cooperative MARL algorithms, HATRPO and HAPPO,
are in fact HAML instances. Next, as a natural outcome of our theory, we pro-
pose HAML extensions of two well-known RL algorithms, HAA2C (for A2C)
and HADDPG (for DDPG), and demonstrate their effectiveness against strong
baselines on StarCraftII and Multi-Agent MuJoCo tasks.
1 Introduction
While the policy gradient (PG) formula has long been known in the reinforcement learning (RL) community [25], it was not until trust region learning [21] that deep RL algorithms started to solve complex tasks, such as real-world robotic control, successfully. Nowadays, methods that follow the trust-region framework, including TRPO [21], PPO [23] and their extensions [22, 7], have become effective tools for solving challenging AI problems [2]. It was believed that the key to their success is the rigorously described stability and the monotonic improvement property of the trust-region learning that they approximate. This reasoning, however, is of limited scope, since it fails to explain why some algorithms that follow it (e.g. PPO-KL) largely underperform in contrast to the success of others (e.g. PPO-clip) [23]. Furthermore, the trust-region interpretation of PPO has been formally rejected by recent studies, both empirically [5] and theoretically [27]; these revealed that the algorithm violates the trust-region constraints: it neither constrains the KL-divergence between two consecutive policies, nor bounds their likelihood ratios. These findings suggest that, while the number of available RL algorithms grows, our understanding of them does not, and the algorithms often come without theoretical guarantees.
Only recently, Kuba et al. [9] showed that well-known algorithms, such as PPO, are in fact instances of the so-called mirror learning framework, within which any induced algorithm is theoretically sound. On a high level, methods that fall into this class optimise the mirror objective, which shapes an advantage surrogate by means of a drift functional, a quasi-distance between policies. Such an update provably leads them to monotonic improvements of the return, as well as convergence to the optimal policy. The mirror learning result offers RL researchers strong confidence that there exists a connection between an algorithm's practicality and its theoretical properties, and assures the soundness of common RL practice.
While the problem of the lack of theoretical guarantees has been severe in RL, in multi-agent reinforcement learning (MARL) it has only been exacerbated. Although the PG theorem has been successfully extended to the multi-agent PG (MAPG) version [31], it has only recently been shown that the variance of MAPG estimators grows linearly with the number of agents [10]. Prior to this, however, a novel paradigm of centralised training for decentralised execution (CTDE) [6, 14] greatly alleviated the difficulty of multi-agent learning by assuming that the global state and the opponents' actions and policies are accessible during the training phase; this enabled the development of practical MARL methods by merely extending single-agent algorithms' implementations to the multi-agent setting. As a result, direct extensions of TRPO [11] and PPO [3, 30] have been proposed whose performance, although impressive in some settings, varies according to the version used and the environment tested against. However, these extensions do not assure the monotonic improvement property or a convergence result of any kind [8]. Importantly, these methods can be proved to be suboptimal at convergence in the common setting of parameter sharing [8], which is adopted by default in popular multi-agent algorithms [30] and popular multi-agent benchmarks such as SMAC [20] due to the computational convenience it provides.
In this paper, we resolve these issues by proposing Heterogeneous-Agent Mirror Learning (HAML), a template that can induce a continuum of cooperative MARL algorithms with theoretical guarantees for monotonic improvement as well as Nash equilibrium (NE) convergence. The purpose of HAML is to endow MARL researchers with a template for rigorous algorithmic design so that, having been granted a method's correctness upfront, they can focus on other aspects, such as effective implementation through deep neural networks. We demonstrate the expressive power of the HAML framework by showing that two existing state-of-the-art (SOTA) MARL algorithms, HATRPO and HAPPO [8], are rigorous instances of HAML. This stands in contrast to viewing them merely as approximations to provably correct multi-agent trust-region algorithms, as which they were originally considered [8]. Furthermore, although HAML is mainly a theoretical contribution, we naturally demonstrate its usefulness by using it to derive two heterogeneous-agent extensions of successful RL algorithms: HAA2C (for A2C [16]) and HADDPG (for DDPG [12]), whose strengths are demonstrated on the StarCraftII (SMAC) [3] and Multi-Agent MuJoCo [4] benchmarks against strong baselines such as MADDPG [14] and MAA2C [18].
2 Problem Formulations
We formulate the cooperative MARL problem as a cooperative Markov game [13] defined by a tuple ⟨N, S, A, r, P, γ, d⟩. Here, N = {1, . . . , n} is a set of n agents, S is the state space, and A = ×_{i=1}^n A^i is the product of all agents' action spaces, known as the joint action space. Although our results hold for general compact state and action spaces, in this paper we assume that they are finite, for simplicity. Further, r : S × A → R is the joint reward function, P : S × A × S → [0, 1] is the transition probability kernel, γ ∈ [0, 1) is the discount factor, and d ∈ P(S) (where P(X) denotes the set of probability distributions over a set X) is the positive initial state distribution. At time step t ∈ ℕ, the agents are at state s_t (which may not be fully observable); they take independent actions a^i_t, ∀i ∈ N, drawn from their policies π^i(·^i | s_t) ∈ P(A^i), and equivalently, they take a joint action a_t = (a^1_t, . . . , a^n_t) drawn from their joint policy π(·|s_t) = ∏_{i=1}^n π^i(·^i | s_t) ∈ P(A). We write Π^i ≜ {π^i | ∀s ∈ S, π^i(·^i | s) ∈ P(A^i)} to denote the policy space of agent i, and Π ≜ (Π^1, . . . , Π^n) to denote the joint policy space. It is important to note that when π^i(·^i | s) is a Dirac delta distribution, the policy is referred to as deterministic [24] and we write μ^i(s) to refer to its centre. Then, the environment emits the joint reward r(s_t, a_t) and moves to the next state s_{t+1} ∼ P(·|s_t, a_t) ∈ P(S). The initial state distribution d, the joint policy π, and the transition kernel P induce the (improper) marginal state distribution ρ_π(s) ≜ ∑_{t=0}^∞ γ^t Pr(s_t = s | d, π). The agents aim to maximise the expected joint return, defined as
J(π) = E_{s_0∼d, a_{0:∞}∼π, s_{1:∞}∼P}[ ∑_{t=0}^∞ γ^t r(s_t, a_t) ].
We adopt the most common solution concept for multi-agent problems, which is that of Nash equilibria [17, 29]. We say that a joint policy π_NE ∈ Π is a NE if none of the agents can increase the joint return by unilaterally altering its policy. More formally, π_NE is a NE if
∀i ∈ N, ∀π^i ∈ Π^i,   J(π^i, π^{-i}_NE) ≤ J(π_NE).
To study the problem of finding a NE, we introduce the following notions. Let i_{1:m} = (i_1, . . . , i_m) ⊆ N be an ordered subset of agents. We write -i_{1:m} to refer to its complement, and i and -i, respectively, when m = 1. We define the multi-agent state-action value function as
Q^{i_{1:m}}_π(s, a^{i_{1:m}}) ≜ E_{a^{-i_{1:m}}_0∼π^{-i_{1:m}}, s_{1:∞}∼P, a_{1:∞}∼π}[ ∑_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s, a^{i_{1:m}}_0 = a^{i_{1:m}} ].
When m = n (the joint action of all agents is considered), then i_{1:n} ∈ Sym(n), where Sym(n) denotes the set of permutations of the integers 1, . . . , n, known as the symmetric group. In that case we write Q^{i_{1:n}}_π(s, a^{i_{1:n}}) = Q_π(s, a), which is known as the (joint) state-action value function. On the other hand, when m = 0, i.e., i_{1:m} = ∅, the function takes the form V_π(s), known as the state-value function. Consider two disjoint subsets of agents, j_{1:k} and i_{1:m}. Then, the multi-agent advantage function of i_{1:m} with respect to j_{1:k} is defined as
A^{i_{1:m}}_π(s, a^{j_{1:k}}, a^{i_{1:m}}) ≜ Q^{j_{1:k}, i_{1:m}}_π(s, a^{j_{1:k}}, a^{i_{1:m}}) − Q^{j_{1:k}}_π(s, a^{j_{1:k}}).
As shown in [10], the multi-agent advantage function allows for an additive decomposition of the joint advantage function by means of the following lemma.
Lemma 1 (Multi-Agent Advantage Decomposition). Let π be a joint policy, and i_1, . . . , i_m be an arbitrary ordered subset of agents. Then, for any state s and joint action a^{i_{1:m}},
A^{i_{1:m}}_π(s, a^{i_{1:m}}) = ∑_{j=1}^m A^{i_j}_π(s, a^{i_{1:j−1}}, a^{i_j}).   (1)
Although the multi-agent advantage function has been discovered only recently and has not been studied thoroughly, it is this function and the above lemma that build the foundation for the development of the HAML theory.
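For concreteness, the decomposition in Lemma 1 can be checked numerically. The following is a minimal sketch for a single-state game with three agents and randomly drawn joint values and product policies; the names, shapes, and the chosen joint action are illustrative assumptions, not part of the paper's codebase.

import numpy as np

rng = np.random.default_rng(0)
n, A = 3, 2                                          # three agents, two actions each, one state
Q = rng.normal(size=(A,) * n)                        # joint action values Q(s, a^1, ..., a^n)
pi = [rng.dirichlet(np.ones(A)) for _ in range(n)]   # independent policies pi^i(.|s)

def Q_sub(acts):
    # Q^{i_{1:m}}(s, a^{i_{1:m}}): fix the first m agents' actions, average the rest under pi.
    out = Q[tuple(acts)]
    for i in range(n - 1, len(acts) - 1, -1):        # marginalise the trailing agents
        out = out @ pi[i]
    return out

a = (1, 0, 1)                                        # an arbitrary joint action
lhs = Q_sub(a) - Q_sub(())                           # A^{i_{1:n}}(s, a) = Q(s, a) - V(s)
rhs = sum(Q_sub(a[:j + 1]) - Q_sub(a[:j]) for j in range(n))  # sum of A^{i_j} terms
print(np.isclose(lhs, rhs))                          # True: Equation (1) holds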
3 The State of Affairs in MARL
Before we review existing SOTA algorithms for cooperative MARL, we introduce two settings in which the algorithms can be implemented. Both of them can be considered appealing depending on the application, but their benefits also come with "traps" which, if not taken care of, may provably deteriorate an algorithm's performance.
3.1 Homogeneity vs. Heterogeneity
The first setting is that of homogeneous policies, i.e., those where all agents share one policy: π^i = π, ∀i ∈ N, so that π = (π, . . . , π) [3, 30]. This approach enables a straightforward adoption of an RL algorithm to MARL, and it does not introduce much computational or sample-complexity burden as the number of agents grows. Nevertheless, sharing one policy across all agents requires that their action spaces are also the same, i.e., A^i = A^j, ∀i, j ∈ N. Furthermore, policy sharing prevents agents from learning different skills. This scenario may be particularly dangerous in games with a large number of agents, as shown in Proposition 1, proved by [8].
Proposition 1 (Trap of Homogeneity). Let us consider a fully-cooperative game with an even number of agents n, one state, and the joint action space {0, 1}^n, where the reward is given by r(0_{n/2}, 1_{n/2}) = r(1_{n/2}, 0_{n/2}) = 1, and r(a^{1:n}) = 0 for all other joint actions. Let J* be the optimal joint reward, and J*_share be the optimal joint reward under the shared policy constraint. Then
J*_share / J* = 2 / 2^n.
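A quick numerical illustration of Proposition 1 (a sketch of the stated game, not part of the paper's code): under a shared Bernoulli policy that plays action 0 with probability p, the probability of producing one of the two rewarding joint actions is 2 p^{n/2} (1 − p)^{n/2}, which is maximised at p = 1/2.

import numpy as np

n = 4                                                # an even number of agents
p = np.linspace(0.0, 1.0, 100001)                    # shared probability of playing action 0
J_share = 2 * p ** (n // 2) * (1 - p) ** (n // 2)    # expected reward under the shared policy
print(J_share.max(), 2 / 2 ** n)                     # both 0.125; since J* = 1, J*_share / J* = 2 / 2^n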
A more ambitious approach to MARL is to allow for heterogeneity of policies among agents, i.e., to let π^i and π^j be different functions when i ≠ j ∈ N. This setting has greater applicability, as heterogeneous agents can operate in different action spaces. Furthermore, thanks to this model's flexibility, they may learn more sophisticated joint behaviours. Lastly, they can recover homogeneous policies as a result of training, if that is indeed optimal. Nevertheless, training heterogeneous agents is highly non-trivial. Given a joint reward, an individual agent may not be able to distill its own contribution to it, a problem known as credit assignment [6, 10]. Furthermore, even if an agent identifies its improvement direction, it may conflict with those of other agents, which then damages performance, as we exemplify in Proposition 2, proved in Appendix A.2.
Proposition 2 (Trap of Heterogeneity). Let us consider a fully-cooperative game with 2 agents, one state, and the joint action space {0, 1}^2, where the reward is given by r(0, 0) = 0, r(0, 1) = r(1, 0) = 2, and r(1, 1) = −1. Suppose that π^i_old(0) > 0.6 for i = 1, 2. Then, if the agents i update their policies by
π^i_new = arg max_{π^i} E_{a^i∼π^i, a^{-i}∼π^{-i}_old}[ A_{π_old}(a^i, a^{-i}) ],   ∀i ∈ N,
then the resulting policy will yield a lower return,
J(π_old) > J(π_new) = min_π J(π).
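The trap of Proposition 2 can also be reproduced numerically. The sketch below uses the proposition's reward matrix; the starting probability 0.7 is an arbitrary choice satisfying π^i_old(0) > 0.6, and the code is an illustration rather than part of the paper's implementation.

import numpy as np

r = np.array([[0.0, 2.0],
              [2.0, -1.0]])                     # r(a^1, a^2), with r(1, 1) = -1
pi_old = [np.array([0.7, 0.3]),                 # both agents put > 0.6 mass on action 0
          np.array([0.7, 0.3])]

def J(pi1, pi2):
    return pi1 @ r @ pi2                        # expected joint reward of a product policy

# Each agent greedily maximises its own expected value against the *old* opponent policy.
q1 = r @ pi_old[1]                              # agent 1's expected payoff for actions 0 and 1
q2 = r.T @ pi_old[0]                            # agent 2's expected payoff for actions 0 and 1
pi_new = [np.eye(2)[q1.argmax()], np.eye(2)[q2.argmax()]]
print(J(*pi_old), J(*pi_new))                   # 0.75 vs -1.0: uncoordinated updates hurt the return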
Consequently, these facts imply that homogeneous algorithms should not be applied to complex problems, but they also highlight that heterogeneous algorithms must be developed with extra care. In the next subsection, we describe existing SOTA actor-critic algorithms which, while often performing well, are still not impeccable, as they fall into one of the above two traps.
3.2 A Second Look at SOTA MARL Algorithms
Multi-Agent Advantage Actor-Critic. MAA2C [18] extends the A2C [16] algorithm to MARL by replacing the RL optimisation (single-agent policy) objective with the MARL one (joint policy),
L^{MAA2C}(π) ≜ E_{s∼π, a∼π}[ A_{π_old}(s, a) ],   (2)
which computes the gradient with respect to every agent i's policy parameters, and performs a gradient-ascent update for each agent. This algorithm is straightforward to implement and is capable of solving simple multi-agent problems [18]. We point out, however, that by simply following their own MAPG, the agents perform uncoordinated updates, thus getting caught by Proposition 2. Furthermore, MAPG estimates have been proved to suffer from large variance which grows linearly with the number of agents [10], thus making the algorithm unstable. To assure greater stability, the following MARL methods, inspired by stable RL approaches, have been developed.
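A score-function surrogate whose gradient matches the MAPG of Equation (2) can be written in a few lines; the PyTorch-style sketch below is illustrative, with the tensor names and shapes as assumptions.

import torch

def maa2c_loss(joint_logp, joint_adv):
    # joint_logp = sum_i log pi^i(a^i|s); every agent ascends the same joint advantage,
    # so differentiating this loss updates all agents' parameters simultaneously.
    return -(joint_logp * joint_adv.detach()).mean()   # minimise the negative => gradient ascent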
Multi-Agent Deep Deterministic Policy Gradient. MADDPG [15] is a MARL extension of the popular DDPG algorithm [12]. At every iteration, every agent i updates its deterministic policy by maximising the following objective
L^{MADDPG}_i(μ^i) ≜ E_{s∼β_{μ_old}}[ Q^i_{μ_old}(s, μ^i(s)) ] = E_{s∼β_{μ_old}}[ Q^i_{μ_old}(s, μ^i(s), μ^{-i}_old(s)) ],   (3)
where β_{μ_old} is a state distribution that is not necessarily equivalent to ρ_{μ_old}, thus allowing for off-policy training. In practice, MADDPG maximises Equation (3) by a few steps of gradient ascent. The main advantages of MADDPG include the small variance of its MAPG estimates, a property granted by deterministic policies [24], as well as low sample complexity due to learning from off-policy data. Such a combination gives the algorithm strong performance in continuous-action tasks [15, 8]. However, this method's strengths also constitute its limitations: it is applicable to continuous problems only (discrete problems require categorical-distribution policies), and it relies on a large memory capacity to store the off-policy data.
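A minimal sketch of the actor update implied by Equation (3) follows; the call signatures (critic_i(state, joint_action), an actor network, the other agents' old actions concatenated in a fixed order) are assumptions for illustration, not the reference MADDPG implementation.

import torch

def maddpg_actor_loss(critic_i, state, actor_i, obs_i, other_actions_old):
    # Maximise Q^i(s, mu^i(s), mu^{-i}_old(s)): only agent i's own action is differentiated.
    joint_action = torch.cat([actor_i(obs_i), other_actions_old.detach()], dim=-1)
    return -critic_i(state, joint_action).mean()       # a few gradient-ascent steps on -loss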
Multi-Agent PPO. MAPPO [30] is a relatively straightforward extension of PPO [23] to MARL. In its default formulation, the agents employ the policy-sharing trick described in the previous subsection. As such, the policy is updated to maximise
L^{MAPPO}(π) ≜ E_{s∼ρ_{π_old}, a∼π_old}[ ∑_{i=1}^n min( (π(a^i|s)/π_old(a^i|s)) A_{π_old}(s, a), clip( π(a^i|s)/π_old(a^i|s), 1 ± ε ) A_{π_old}(s, a) ) ],   (4)
where the clip(·, 1 ± ε) operator clips the input to 1 − ε / 1 + ε if it is below/above this value. Such an operation removes the incentive for agents to make large policy updates, thus effectively stabilising the training. Indeed, the algorithm's performance on the StarCraftII benchmark is remarkable, and it is accomplished using only on-policy data. Nevertheless, the policy-sharing strategy limits the algorithm's applicability and leads to its suboptimality, as we discussed in Proposition 1 and also in [8]. Trying to avoid this issue, one can implement the algorithm without policy sharing, thus making the agents simply take simultaneous PPO updates while employing a joint advantage estimator. In this case, the updates are not coordinated, making MAPPO fall into the trap of Proposition 2.
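For completeness, a minimal PyTorch-style sketch of the clipped objective in Equation (4); logp_new and logp_old are per-agent log-probabilities of shape [batch, n_agents] and joint_adv is the joint advantage estimate. These shapes and names are assumptions, not the reference MAPPO code.

import torch

def mappo_clip_loss(logp_new, logp_old, joint_adv, eps=0.2):
    ratio = torch.exp(logp_new - logp_old)               # pi(a^i|s) / pi_old(a^i|s), per agent
    adv = joint_adv.detach().unsqueeze(-1)               # the same joint advantage for every agent
    surr = torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    return -surr.sum(dim=-1).mean()                      # sum over agents as in Eq. (4), then maximise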
In summary, none of these algorithms possesses performance guarantees. Altering their implementation settings to escape one of the traps from Subsection 3.1 makes them, at best, fall into the other. This shows that the MARL problem introduces additional complexity into the single-agent RL setting, and needs additional care to be solved rigorously. With this motivation, in the next section, we develop a theoretical framework for the development of MARL algorithms with correctness guarantees.
4 Heterogeneous-Agent Mirror Learning
In this section, we introduce heterogeneous-agent mirror learning (HAML), a template that includes a continuum of MARL algorithms which we prove to solve cooperative problems with correctness guarantees. HAML is designed for the general and expressive setting of heterogeneous agents, thus avoiding Proposition 1, and it is capable of coordinating the agents' updates, leaving Proposition 2 behind.
4.1 Setting up HAML
We begin by introducing the necessary definitions of the HAML attributes, starting with the drift functional.
Definition 1. Let i ∈ N. A heterogeneous-agent drift functional (HADF) D^{i,ν} of i consists of a map, which is defined as
D^i : Π × Π × P(-i) × S → { D^i_π(· | s, π̄^{j_{1:m}}) : P(A^i) → R },
such that for all arguments, under the notation D^i_π(π̂^i | s, π̄^{j_{1:m}}) ≜ D^i_π( π̂^i(·^i|s) | π̄^{j_{1:m}}, s ),
1. D^i_π(π̂^i | s, π̄^{j_{1:m}}) ≥ D^i_π(π^i | s, π̄^{j_{1:m}}) = 0 (non-negativity),
2. D^i_π(π̂^i | s, π̄^{j_{1:m}}) has all Gâteaux derivatives zero at π̂^i = π^i (zero gradient),
and a probability distribution ν^i_{π,π̂^i} ∈ P(S) over states that can (but does not have to) depend on π and π̂^i, and such that the drift D^{i,ν} of π̂^i from π^i with respect to π̄^{j_{1:m}}, defined as
D^{i,ν}_π(π̂^i | π̄^{j_{1:m}}) ≜ E_{s∼ν^i_{π,π̂^i}}[ D^i_π(π̂^i | s, π̄^{j_{1:m}}) ],
is continuous with respect to π, π̄^{j_{1:m}}, and π̂^i. We say that the HADF is positive if D^{i,ν}_π(π̂^i | π̄^{j_{1:m}}) = 0 implies π̂^i = π^i, and trivial if D^{i,ν}_π(π̂^i | π̄^{j_{1:m}}) = 0 for all π, π̄^{j_{1:m}}, and π̂^i.
Intuitively, the drift D^{i,ν}_π(π̂^i | π̄^{j_{1:m}}) is a notion of distance between π^i and π̂^i, given that the agents j_{1:m} have just updated to π̄^{j_{1:m}}. We highlight that, under this conditionality, the same update (from π^i to π̂^i) can have different sizes; this will later enable HAML agents to softly constrain their learning steps in a coordinated way. Before that, we introduce a notion that renders hard constraints, which may be a part of an algorithm design, or an inherent limitation.
Definition 2. Let i ∈ N. We say that U^i : Π × Π^i → P(Π^i) is a neighbourhood operator if ∀π^i ∈ Π^i, U^i_π(π^i) contains a closed ball, i.e., there exists a state-wise monotonically non-decreasing metric χ : Π^i × Π^i → R such that ∀π^i ∈ Π^i there exists δ^i > 0 such that
χ(π^i, π̄^i) ≤ δ^i  ⟹  π̄^i ∈ U^i_π(π^i).
Throughout this work, for every joint policy π, we will associate it with its sampling distribution, a positive state distribution β_π ∈ P(S) that is continuous in π [9]. With these notions defined, we introduce the main definition of the paper.
Definition 3. Let i ∈ N, j_{1:m} ∈ P(-i), and D^{i,ν} be a HADF of agent i. The heterogeneous-agent mirror operator (HAMO) integrates the advantage function as
[ M^{(π̂^i)}_{D^{i,ν}, π̄^{j_{1:m}}} A_π ](s) ≜ E_{a^{j_{1:m}}∼π̄^{j_{1:m}}, a^i∼π̂^i}[ A^i_π(s, a^{j_{1:m}}, a^i) ] − ( ν^i_{π,π̂^i}(s) / β_π(s) ) D^i_π( π̂^i | s, π̄^{j_{1:m}} ).
We note two important facts. First, when ν^i_{π,π̂^i} = β_π, the fraction in front of the HADF in HAMO disappears, making the operator simply a difference between the advantage and the HADF's evaluation. Second, when π̂^i = π^i, HAMO evaluates to zero. Therefore, as the HADF is non-negative, a policy π̂^i that improves HAMO must make it positive, and thus leads to an improvement of the multi-agent advantage of agent i. In the next subsection, we study the properties of HAMO in more detail, as well as use it to construct HAML, a general framework for MARL algorithm design.
4.2 Theoretical Properties of HAML
It turns out that, under a certain configuration, agents' local improvements result in the joint improvement of all agents, as described by the lemma below.
Lemma 2 (HAMO Is All You Need). Let π_old and π_new be joint policies and let i_{1:n} ∈ Sym(n) be an agent permutation. Suppose that, for every state s ∈ S and every m = 1, . . . , n,
[ M^{(π^{i_m}_new)}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s) ≥ [ M^{(π^{i_m}_old)}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s).   (5)
Then, π_new is jointly better than π_old, so that for every state s,
V_{π_new}(s) ≥ V_{π_old}(s).
Subsequently, the monotonic improvement property of the joint return follows naturally, as
J(π_new) = E_{s∼d}[ V_{π_new}(s) ] ≥ E_{s∼d}[ V_{π_old}(s) ] = J(π_old).
However, the conditions of the lemma require every agent to solve |S| instances of Inequality (5), which may be an intractable problem. Instead, we shall design a single optimisation objective whose solution satisfies those inequalities. Furthermore, to be applicable to large-scale problems, such an objective should be estimable via sampling. To handle these challenges, we introduce the following Algorithm Template 1, which generates a continuum of HAML algorithms.
Algorithm Template 1: Heterogeneous-Agent Mirror Learning
Initialise a joint policy π_0 = (π^1_0, . . . , π^n_0);
for k = 0, 1, . . . do
    Compute the advantage function A_{π_k}(s, a) for all state-(joint)action pairs (s, a);
    Draw a permutation i_{1:n} of agents at random from a positive distribution p ∈ P(Sym(n));
    for m = 1 : n do
        Make an update π^{i_m}_{k+1} = arg max_{π^{i_m} ∈ U^{i_m}_{π_k}(π^{i_m}_k)} E_{s∼β_{π_k}}[ [ M^{(π^{i_m})}_{D^{i_m}, π^{i_{1:m−1}}_{k+1}} A_{π_k} ](s) ];
Output: A limit-point joint policy π_∞
Based on Lemma 2 and the fact that π^i ∈ U^i_π(π^i), ∀i ∈ N, ∀π^i ∈ Π^i, we know that any HAML algorithm (weakly) improves the joint return at every iteration. In practical settings, such as deep MARL, the maximisation step of a HAML method can be performed by a few steps of gradient ascent on a sample average of HAMO (see Definition 3). We also highlight that if the neighbourhood operators U^i can be chosen so that they produce small policy-space subsets, then the resulting updates will be not only improving but also small. This, again, is a desirable property while optimising neural-network policies, as it helps stabilise the algorithm. One may wonder why the order of agents in HAML updates is randomised at every iteration; this condition is necessary to establish convergence to NE, and is intuitively comprehensible: fixed-point joint policies of this randomised procedure assure that none of the agents is incentivised to make an update, which is precisely the NE condition.
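To make Algorithm Template 1 concrete at the implementation level, the following is a structural Python sketch of one HAML iteration; the HAMO objective, the neighbourhood projection, and the inner optimiser are passed in as placeholders, since they are exactly the design choices that a particular HAML instance has to fix.

import random

def haml_iteration(policies, hamo_objective, project, optimise):
    """One iteration of Algorithm Template 1 (schematic sketch).

    policies: dict agent_id -> policy. hamo_objective(candidate, agent, updated, policies)
    should return a sample-average HAMO estimate; project enforces the neighbourhood U^i;
    optimise performs a few gradient-ascent steps on the given objective.
    """
    order = list(policies.keys())
    random.shuffle(order)                           # draw a permutation i_{1:n} at random
    updated = {}                                    # policies of the already-updated agents i_{1:m-1}
    for agent in order:                             # sequential, coordinated updates
        objective = lambda candidate, a=agent: hamo_objective(candidate, a, updated, policies)
        candidate = optimise(objective, policies[agent])
        updated[agent] = project(candidate, policies[agent])
    policies.update(updated)                        # this is pi_{k+1}
    return policies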
We provide the full list of the most fundamental HAML properties in Theorem 1 which shows that
any method derived from Algorithm Template 1 solves the cooperative MARL problem.
Theorem 1 (The Fundamental Theorem of Heterogeneous-Agent Mirror Learning). Let, for every agent i ∈ N, D^{i,ν} be a HADF, U^i be a neighbourhood operator, and let the sampling distributions β_π depend continuously on π. Let π_0 ∈ Π, and let the sequence of joint policies (π_k)_{k=0}^∞ be obtained by a HAML algorithm induced by D^{i,ν}, U^i, ∀i ∈ N, and β_π. Then, the joint policies induced by the algorithm enjoy the following list of properties:
1. Attain the monotonic improvement property, J(π_{k+1}) ≥ J(π_k),
2. Their value functions converge to a Nash value function V^NE, lim_{k→∞} V_{π_k} = V^NE,
3. Their expected returns converge to a Nash return, lim_{k→∞} J(π_k) = J^NE,
4. Their ω-limit set consists of Nash equilibria.
See the proof in Appendix C. With the above theorem, we can conclude that HAML provides a template for generating theoretically sound, stable, monotonically improving algorithms that enable agents to learn to solve multi-agent cooperation tasks.
4.3 Existing HAML Instances: HATRPO and HAPPO
As a sanity check on its practicality, we show that two SOTA MARL methods, HATRPO and HAPPO [8], are valid instances of HAML, which also provides an explanation for their excellent empirical performance.
We begin with the intuitive example of HATRPO, where agent i_m (the permutation i_{1:n} is drawn from the uniform distribution) updates its policy so as to maximise (in π̄^{i_m})
E_{s∼ρ_{π_old}, a^{i_{1:m−1}}∼π^{i_{1:m−1}}_new, a^{i_m}∼π̄^{i_m}}[ A^{i_m}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ],   subject to D_KL(π^{i_m}_old, π̄^{i_m}) ≤ δ.
This optimisation objective can be cast as a HAMO with the trivial HADF D^{i_m} ≡ 0, and the KL-divergence neighbourhood operator
U^{i_m}_π(π^{i_m}) = { π̄^{i_m} | E_{s∼ρ_π}[ KL( π^{i_m}(·^{i_m}|s), π̄^{i_m}(·^{i_m}|s) ) ] ≤ δ }.
The sampling distribution used in HATRPO is β_π = ρ_π. Lastly, as the agents update their policies in a random loop, the algorithm is an instance of HAML. Hence, it is monotonically improving and converges to a Nash equilibrium set.
In HAPPO, the update rule of agent i_m changes with respect to HATRPO to
E_{s∼ρ_{π_old}, a^{i_{1:m−1}}∼π^{i_{1:m−1}}_new, a^{i_m}∼π^{i_m}_old}[ min( r(π̄^{i_m}) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m}}), clip( r(π̄^{i_m}), 1 ± ε ) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m}}) ) ],
where r(π̄^i) = π̄^i(a^i|s) / π^i_old(a^i|s). We show in Appendix D that this optimisation objective is equivalent to
E_{s∼ρ_{π_old}}[ E_{a^{i_{1:m−1}}∼π^{i_{1:m−1}}_new, a^{i_m}∼π̄^{i_m}}[ A^{i_m}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ] − E_{a^{i_{1:m}}∼π^{i_{1:m}}_old}[ ReLU( ( r(π̄^{i_m}) − clip( r(π̄^{i_m}), 1 ± ε ) ) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m}}) ) ] ].
The subtracted (ReLU) term is clearly non-negative due to the presence of the ReLU function. Furthermore, for policies π̄^{i_m} sufficiently close to π^{i_m}_old, the clip operator does not activate, thus leaving r(π̄^{i_m}) unchanged. Therefore, this term is zero at, and in a region around, π̄^{i_m} = π^{i_m}_old, which also implies that its Gâteaux derivatives are zero. Hence, it evaluates a HADF for agent i_m, thus making HAPPO a valid HAML instance.
Finally, we would like to highlight that these results about HATRPO and HAPPO significantly strengthen the original work of [8], which only derived these learning protocols as approximations to a theoretically sound algorithm (see Algorithm 1 in [8]), without this level of insight.
Figure 1: Comparison between HAA2C (yellow) vs MAA2C-S (blue) and MAA2C-NS (pink) in SMAC.
4.4 New HAML Instances: HAA2C and HADDPG
In this subsection, we exemplify how HAML can be used to derive principled MARL algorithms, and introduce heterogeneous-agent extensions of A2C and DDPG, different from those in Subsection 3.2. Our goal is not to establish new SOTA performance on challenging tasks, but rather to verify the correctness of our theory, as well as to deliver more robust multi-agent extensions of popular RL algorithms such as A2C and DDPG.
HAA2C intends to optimise the policy for the joint advantage function at every iteration and, similar to A2C, does not impose any penalties or constraints on that procedure. This learning procedure is accomplished by, first, drawing a random permutation of agents i_{1:n}, and then performing a few steps of gradient ascent on the objective
E_{s∼ρ_{π_old}, a^{i_{1:m}}∼π^{i_{1:m}}_old}[ ( π^{i_{1:m−1}}_new(a^{i_{1:m−1}}|s) π^{i_m}(a^{i_m}|s) ) / ( π^{i_{1:m−1}}_old(a^{i_{1:m−1}}|s) π^{i_m}_old(a^{i_m}|s) ) · A^{i_m}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ],   (6)
with respect to the parameters of π^{i_m}, for each agent i_m in the permutation, sequentially. In practice, we replace the multi-agent advantage A^{i_m}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) with the joint advantage estimate which, thanks to the joint importance sampling in Equation (6), poses the same objective for the agent (see Appendix E for the full pseudocode).
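In a deep-learning implementation, the objective in Equation (6) for agent i_m reduces to re-weighting the joint advantage by the (detached) likelihood ratios of the previously updated agents. A minimal PyTorch-style sketch, with assumed tensor names, is given below.

import torch

def haa2c_agent_loss(logp_m_new, logp_m_old, logratio_prev, joint_adv):
    # Surrogate of Eq. (6) for agent i_m. logratio_prev is the summed log-ratio of agents
    # i_{1:m-1} (new vs old policies), detached so that only agent i_m receives gradients.
    weight = torch.exp(logratio_prev.detach())
    ratio_m = torch.exp(logp_m_new - logp_m_old)             # agent i_m's own likelihood ratio
    return -(weight * ratio_m * joint_adv.detach()).mean()   # joint advantage replaces A^{i_m}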
HADDPG aims to maximise the state-action value function off-policy. As it is a deterministic-action method, importance sampling in its case translates to replacing the old action inputs to the critic with the new ones. Namely, agent i_m in a random permutation i_{1:n} maximises
E_{s∼β_{μ_old}}[ Q^{i_{1:m}}_{μ_old}( s, μ^{i_{1:m−1}}_new(s), μ^{i_m}(s) ) ],   (7)
with respect to μ^{i_m}, also with a few steps of gradient ascent. Similar to HAA2C, optimising the state-action value function (with the old-action replacement) is equivalent to optimising the original multi-agent value (see Appendix E for the full pseudocode).
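Correspondingly, a minimal sketch of the HADDPG objective in Equation (7): compared with the MADDPG sketch above, the freshly updated actions of agents i_{1:m−1} replace their old ones at the critic's input. The names and the action ordering below are assumptions for illustration.

import torch

def haddpg_agent_loss(critic, state, prev_actions_new, actor_m, obs_m, remaining_actions_old):
    # Maximise Q^{i_{1:m}}(s, mu^{i_{1:m-1}}_new(s), mu^{i_m}(s)); only agent i_m is differentiated.
    joint_action = torch.cat([prev_actions_new.detach(),
                              actor_m(obs_m),
                              remaining_actions_old.detach()], dim=-1)
    return -critic(state, joint_action).mean()               # a few steps of gradient ascent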
We are fully aware that neither of these two methods exploits the entire abundance of the HAML framework: they do not possess non-trivial HADFs or neighbourhood operators, as opposed to HATRPO and HAPPO. Thus, we speculate that the range of opportunities for HAML algorithm design is yet to be explored in future work. Nevertheless, we begin this search with these two straightforward methods, and analyse their performance in the next section.
5 Experiments and Results
In this section, we challenge the capabilities of HAML in practice by testing HAA2C and HADDPG on the most challenging MARL benchmarks: we use the StarCraft Multi-Agent Challenge [20] and Multi-Agent MuJoCo [19] for the discrete (HAA2C only) and continuous action settings, respectively. Our code is available at https://github.com/anonymouswater/HAML.
Figure 2: Comparison between HAA2C (yellow) vs MAA2C-S (blue) and MAA2C-NS (pink) in MAMuJoCo.
Figure 3: Comparison between HADDPG (yellow) and MADDPG (pink) in MAMuJoCo.
We begin by demonstrating the performance of HAA2C. As a baseline, we use its "naive" predecessor, MAA2C, in both its policy-sharing (MAA2C-S) and heterogeneous (MAA2C-NS) versions. The results in Figures 1 & 2 show that HAA2C generally achieves higher rewards in SMAC than both versions of MAA2C, while maintaining lower variance. The performance gap increases in MAMuJoCo, where the agents must learn diverse policies to master complex movements [8]. Here, echoing Proposition 1, the homogeneous MAA2C-S fails completely, and the conventional, heterogeneous MAA2C-NS underperforms by a significant margin (recall Proposition 2).
As HADDPG is a continuous-action method, we test it only on MAMuJoCo (precisely, 6 tasks) and compare it to MADDPG. Figure 3 reveals that HADDPG achieves higher reward than MADDPG, while sometimes displaying significantly lower variance (e.g. in Reacher-2x1 and Swimmer-2x1). Hence, we conclude that HADDPG performs better.
6 Conclusion
In this paper, we described and addressed the lack of principled treatments of cooperative MARL tasks. Our main contribution is the development of heterogeneous-agent mirror learning (HAML), a class of provably correct MARL algorithms whose properties are rigorously profiled. We verified the correctness and the practicality of HAML by interpreting current SOTA methods, HATRPO and HAPPO, as HAML instances, and also by deriving and testing heterogeneous-agent extensions of successful RL algorithms, named HAA2C and HADDPG. We expect HAML to serve as a template for designing both principled and practical MARL algorithms hereafter.
References
[1]
Lawrence M Ausubel and Raymond J Deneckere. A generalized theorem of the maximum.
Economic Theory, 3(1):99–107, 1993.
[2]
Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław D˛ebiak, Christy
Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large
scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
[3]
Christian Schroeder de Witt, Tarun Gupta, Denys Makoviichuk, Viktor Makoviychuk, Philip HS
Torr, Mingfei Sun, and Shimon Whiteson. Is independent learning all you need in the starcraft
multi-agent challenge? arXiv preprint arXiv:2011.09533, 2020.
[4]
Christian Schröder de Witt, Bei Peng, Pierre-Alexandre Kamienny, Philip H. S. Torr, Wendelin
Böhmer, and Shimon Whiteson. Deep multi-agent reinforcement learning for decentralized
continuous cooperative control. CoRR, abs/2003.06709, 2020.
[5]
Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry
Rudolph, and Aleksander Madry. Implementation matters in deep policy gradients: A case
study on ppo and trpo. arXiv preprint arXiv:2005.12729, 2020.
[6]
Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson.
Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 32, 2018.
[7]
Chloe Ching-Yun Hsu, Celestine Mendler-Dünner, and Moritz Hardt. Revisiting design choices
in proximal policy optimization. arXiv preprint arXiv:2009.10897, 2020.
[8]
Jakub Grudzien Kuba, Ruiqing Chen, Munning Wen, Ying Wen, Fanglei Sun, Jun Wang, and
Yaodong Yang. Trust region policy optimisation in multi-agent reinforcement learning. ICLR,
2022.
[9]
Jakub Grudzien Kuba, Christian Schroeder de Witt, and Jakob Foerster. Mirror learning: A
unifying framework of policy optimisation. ICML, 2022.
[10]
Jakub Grudzien Kuba, Muning Wen, Linghui Meng, Haifeng Zhang, David Mguni, Jun Wang,
Yaodong Yang, et al. Settling the variance of multi-agent policy gradients. Advances in Neural
Information Processing Systems, 34:13458–13470, 2021.
[11]
Hepeng Li and Haibo He. Multi-agent trust region policy optimization. arXiv preprint
arXiv:2010.07916, 2020.
[12]
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa,
David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv
preprint arXiv:1509.02971, 2015.
[13] Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In
Machine learning proceedings 1994, pages 157–163. Elsevier, 1994.
[14]
Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-
critic for mixed cooperative-competitive environments. In Proceedings of the 31st International
Conference on Neural Information Processing Systems, pages 6382–6393, 2017.
[15]
Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-
critic for mixed cooperative-competitive environments. In Proceedings of the 31st International
Conference on Neural Information Processing Systems, pages 6382–6393, 2017.
[16]
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap,
Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforce-
ment learning. In International conference on machine learning, pages 1928–1937. PMLR,
2016.
[17] John Nash. Non-cooperative games. Annals of mathematics, pages 286–295, 1951.
[18]
Georgios Papoudakis, Filippos Christianos, Lukas Schäfer, and Stefano V. Albrecht. Benchmark-
ing multi-agent deep reinforcement learning algorithms in cooperative tasks. In Proceedings
of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS),
2021.
[19]
Bei Peng, Tabish Rashid, Christian A Schroeder de Witt, Pierre-Alexandre Kamienny, Philip HS
Torr, Wendelin Böhmer, and Shimon Whiteson. Facmac: Factored multi-agent centralised
policy gradients. arXiv e-prints, pages arXiv–2003, 2020.
[20]
Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nan-
tas Nardelli, Tim GJ Rudner, Chia-Man Hung, Philip HS Torr, Jakob Foerster, and Shimon
Whiteson. The starcraft multi-agent challenge. 2019.
[21]
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust
region policy optimization. In International conference on machine learning, pages 1889–1897.
PMLR, 2015.
[22]
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-
dimensional continuous control using generalized advantage estimation. arXiv preprint
arXiv:1506.02438, 2015.
[23]
John Schulman, F. Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms. ArXiv, abs/1707.06347, 2017.
[24]
David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller.
Deterministic policy gradient algorithms. In International conference on machine learning,
pages 387–395. PMLR, 2014.
[25]
R. S. Sutton, D. Mcallester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement
learning with function approximation. In Advances in Neural Information Processing Systems
12, volume 12, pages 1057–1063. MIT Press, 2000.
[26] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. 2018.
[27]
Yuhui Wang, Hao He, and Xiaoyang Tan. Truly proximal policy optimization. In Uncertainty
in Artificial Intelligence, pages 113–122. PMLR, 2020.
[28]
Jiayi Weng, Huayu Chen, Dong Yan, Kaichao You, Alexis Duburcq, Minghao Zhang, Hang
Su, and Jun Zhu. Tianshou: a highly modularized deep reinforcement learning library. arXiv
preprint arXiv:2107.14171, 2021.
[29]
Yaodong Yang and Jun Wang. An overview of multi-agent reinforcement learning from game
theoretical perspective. arXiv preprint arXiv:2011.00583, 2020.
[30]
Chao Yu, A. Velu, Eugene Vinitsky, Yu Wang, A. Bayen, and Yi Wu. The surprising effectiveness
of mappo in cooperative, multi-agent games. ArXiv, abs/2103.01955, 2021.
[31]
Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, and Tamer Basar. Fully decentralized
multi-agent reinforcement learning with networked agents. In International Conference on
Machine Learning, pages 5872–5881. PMLR, 2018.
A Proofs of Preliminary Results
A.1 Proof of Lemma 1
Lemma 1 (Multi-Agent Advantage Decomposition). Let π be a joint policy, and i_1, . . . , i_m be an arbitrary ordered subset of agents. Then, for any state s and joint action a^{i_{1:m}},
A^{i_{1:m}}_π(s, a^{i_{1:m}}) = ∑_{j=1}^m A^{i_j}_π(s, a^{i_{1:j−1}}, a^{i_j}).   (1)
Proof. (We quote the proof from [8].) We start by expressing the multi-agent advantage as a telescoping sum, and then rewrite it using the definition of the multi-agent advantage,
A^{i_{1:m}}_π(s, a^{i_{1:m}}) = Q^{i_{1:m}}_π(s, a^{i_{1:m}}) − V_π(s)
= ∑_{j=1}^m [ Q^{i_{1:j}}_π(s, a^{i_{1:j}}) − Q^{i_{1:j−1}}_π(s, a^{i_{1:j−1}}) ]
= ∑_{j=1}^m A^{i_j}_π(s, a^{i_{1:j−1}}, a^{i_j}).
A.2 Proof of Proposition 2
Proposition 2 (Trap of Heterogeneity). Let us consider a fully-cooperative game with 2 agents, one state, and the joint action space {0, 1}^2, where the reward is given by r(0, 0) = 0, r(0, 1) = r(1, 0) = 2, and r(1, 1) = −1. Suppose that π^i_old(0) > 0.6 for i = 1, 2. Then, if the agents i update their policies by
π^i_new = arg max_{π^i} E_{a^i∼π^i, a^{-i}∼π^{-i}_old}[ A_{π_old}(a^i, a^{-i}) ],   ∀i ∈ N,
then the resulting policy will yield a lower return,
J(π_old) > J(π_new) = min_π J(π).
Proof. As there is only one state, we can ignore the infinite horizon and the discount factor γ, thus making the state-action value and the reward functions equivalent, Q ≡ r.
Let us, for brevity, define π^i = π^i_old(0) > 0.6, for i = 1, 2. We have
J(π_old) = Pr(a^1 = a^2 = 0) r(0, 0) + ( 1 − Pr(a^1 = a^2 = 0) ) E[ r(a^1, a^2) | (a^1, a^2) ≠ (0, 0) ]
> 0.6^2 × 0 − (1 − 0.6^2) = −0.64.
The update rule stated in the proposition can be equivalently written as
π^i_new = arg max_{π^i} E_{a^i∼π^i, a^{-i}∼π^{-i}_old}[ Q_{π_old}(a^i, a^{-i}) ].   (8)
We have
E_{a^{-i}∼π^{-i}_old}[ Q_{π_old}(0, a^{-i}) ] = π^{-i} Q(0, 0) + (1 − π^{-i}) Q(0, 1) = π^{-i} r(0, 0) + (1 − π^{-i}) r(0, 1) = 2(1 − π^{-i}),
and similarly
E_{a^{-i}∼π^{-i}_old}[ Q_{π_old}(1, a^{-i}) ] = π^{-i} r(1, 0) + (1 − π^{-i}) r(1, 1) = 2π^{-i} − (1 − π^{-i}) = 3π^{-i} − 1.
Hence, if π^{-i} > 0.6, then
E_{a^{-i}∼π^{-i}_old}[ Q_{π_old}(1, a^{-i}) ] = 3π^{-i} − 1 > 3 × 0.6 − 1 = 0.8 > 2 − 2π^{-i} = E_{a^{-i}∼π^{-i}_old}[ Q_{π_old}(0, a^{-i}) ].
Therefore, for every i, the solution to Equation (8) is the greedy policy with π^i_new(1) = 1. Therefore,
J(π_new) = Q(1, 1) = r(1, 1) = −1,
which is both smaller than J(π_old) > −0.64 and equal to min_π J(π). This finishes the proof.
B Proof of HAMO Is All You Need Lemma
Lemma 2 (HAMO Is All You Need). Let π_old and π_new be joint policies and let i_{1:n} ∈ Sym(n) be an agent permutation. Suppose that, for every state s ∈ S and every m = 1, . . . , n,
[ M^{(π^{i_m}_new)}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s) ≥ [ M^{(π^{i_m}_old)}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s).   (5)
Then, π_new is jointly better than π_old, so that for every state s,
V_{π_new}(s) ≥ V_{π_old}(s).
Proof. Let
D̃_{π_old}(π_new | s) ≜ ∑_{m=1}^n ( ν^{i_m}_{π_old, π^{i_m}_new}(s) / β_{π_old}(s) ) D^{i_m}_{π_old}( π^{i_m}_new | s, π^{i_{1:m−1}}_new ).
Combining this with Lemma 1 gives
E_{a∼π_new}[ A_{π_old}(s, a) ] − D̃_{π_old}(π_new | s)
= ∑_{m=1}^n ( E_{a^{i_{1:m}}∼π^{i_{1:m}}_new}[ A^{i_m}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ] − ( ν^{i_m}_{π_old, π^{i_m}_new}(s) / β_{π_old}(s) ) D^{i_m}_{π_old}( π^{i_m}_new | s, π^{i_{1:m−1}}_new ) )
(by Inequality (5))
≥ ∑_{m=1}^n ( E_{a^{i_{1:m−1}}∼π^{i_{1:m−1}}_new, a^{i_m}∼π^{i_m}_old}[ A^{i_m}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ] − ( ν^{i_m}_{π_old, π^{i_m}_old}(s) / β_{π_old}(s) ) D^{i_m}_{π_old}( π^{i_m}_old | s, π^{i_{1:m−1}}_new ) )
= E_{a∼π_old}[ A_{π_old}(s, a) ] − D̃_{π_old}(π_old | s).
The resulting inequality can be equivalently rewritten as
E_{a∼π_new}[ Q_{π_old}(s, a) ] − D̃_{π_old}(π_new | s) ≥ E_{a∼π_old}[ Q_{π_old}(s, a) ] − D̃_{π_old}(π_old | s),   ∀s ∈ S.   (9)
We use it to prove the claim as follows,
V_{π_new}(s) = E_{a∼π_new}[ Q_{π_new}(s, a) ]
= ( E_{a∼π_new}[ Q_{π_old}(s, a) ] − D̃_{π_old}(π_new | s) ) + D̃_{π_old}(π_new | s) + E_{a∼π_new}[ Q_{π_new}(s, a) − Q_{π_old}(s, a) ]
(by Inequality (9))
≥ ( E_{a∼π_old}[ Q_{π_old}(s, a) ] − D̃_{π_old}(π_old | s) ) + D̃_{π_old}(π_new | s) + E_{a∼π_new}[ Q_{π_new}(s, a) − Q_{π_old}(s, a) ]
= V_{π_old}(s) + D̃_{π_old}(π_new | s) + E_{a∼π_new}[ Q_{π_new}(s, a) − Q_{π_old}(s, a) ]
= V_{π_old}(s) + D̃_{π_old}(π_new | s) + E_{a∼π_new, s′∼P}[ r(s, a) + γ V_{π_new}(s′) − r(s, a) − γ V_{π_old}(s′) ]
= V_{π_old}(s) + D̃_{π_old}(π_new | s) + γ E_{a∼π_new, s′∼P}[ V_{π_new}(s′) − V_{π_old}(s′) ]
≥ V_{π_old}(s) + γ inf_{s′}[ V_{π_new}(s′) − V_{π_old}(s′) ].
Hence,
V_{π_new}(s) − V_{π_old}(s) ≥ γ inf_{s′}[ V_{π_new}(s′) − V_{π_old}(s′) ].
Taking the infimum over s and simplifying,
(1 − γ) inf_s [ V_{π_new}(s) − V_{π_old}(s) ] ≥ 0.
Therefore, inf_s [ V_{π_new}(s) − V_{π_old}(s) ] ≥ 0, which proves the lemma.
C Proof of Theorem 1
Lemma 3. Suppose an agent i_m maximises the expected HAMO,
π^{i_m}_new = arg max_{π^{i_m} ∈ U^{i_m}_{π_old}(π^{i_m}_old)} E_{s∼β_{π_old}}[ [ M^{(π^{i_m})}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s) ].   (10)
Then, for every state s ∈ S,
[ M^{(π^{i_m}_new)}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s) ≥ [ M^{(π^{i_m}_old)}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s).
Proof. We will prove this statement by contradiction. Suppose that there exists s_0 ∈ S such that
[ M^{(π^{i_m}_new)}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s_0) < [ M^{(π^{i_m}_old)}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s_0).   (11)
Let us define the following policy π̂^{i_m}:
π̂^{i_m}(·^{i_m}|s) = π^{i_m}_old(·^{i_m}|s) at s = s_0, and π̂^{i_m}(·^{i_m}|s) = π^{i_m}_new(·^{i_m}|s) at s ≠ s_0.
Note that π̂^{i_m} is (weakly) closer to π^{i_m}_old than π^{i_m}_new at s_0, and at the same distance at other states. Together with π^{i_m}_new ∈ U^{i_m}_{π_old}(π^{i_m}_old), this implies that π̂^{i_m} ∈ U^{i_m}_{π_old}(π^{i_m}_old). Further,
E_{s∼β_{π_old}}[ [ M^{(π̂^{i_m})}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s) ] − E_{s∼β_{π_old}}[ [ M^{(π^{i_m}_new)}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s) ]
= β_{π_old}(s_0) ( [ M^{(π̂^{i_m})}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s_0) − [ M^{(π^{i_m}_new)}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s_0) ) > 0.
The above contradicts π^{i_m}_new being the maximiser in Equation (10), as π̂^{i_m} is strictly better. The contradiction finishes the proof.
Theorem 1 (The Fundamental Theorem of Heterogeneous-Agent Mirror Learning). Let, for every agent i ∈ N, D^{i,ν} be a HADF, U^i be a neighbourhood operator, and let the sampling distributions β_π depend continuously on π. Let π_0 ∈ Π, and let the sequence of joint policies (π_k)_{k=0}^∞ be obtained by a HAML algorithm induced by D^{i,ν}, U^i, ∀i ∈ N, and β_π. Then, the joint policies induced by the algorithm enjoy the following list of properties:
1. Attain the monotonic improvement property, J(π_{k+1}) ≥ J(π_k),
2. Their value functions converge to a Nash value function V^NE, lim_{k→∞} V_{π_k} = V^NE,
3. Their expected returns converge to a Nash return, lim_{k→∞} J(π_k) = J^NE,
4. Their ω-limit set consists of Nash equilibria.
Proof. Proof of Property 1. It follows from combining Lemmas 2 & 3.
Proof of Properties 2, 3 & 4.
Step 1: convergence of the value function. By Lemma 2, we have that V_{π_k}(s) ≤ V_{π_{k+1}}(s), ∀s ∈ S, and the value function is upper-bounded by V_max. Hence, the sequence of value functions (V_{π_k})_{k∈ℕ} converges. We denote its limit by V*.
Step 2: characterisation of limit points. As the joint policy space Π is bounded, by the Bolzano-Weierstrass theorem, we know that the sequence (π_k)_{k∈ℕ} has a convergent subsequence. Therefore, it has at least one limit point policy. Let π̄ be such a limit point. We introduce an auxiliary notation: for a joint policy π and a permutation i_{1:n}, let HU(π, i_{1:n}) be the joint policy obtained by a HAML update from π along the permutation i_{1:n}.
Claim: For any permutation z_{1:n} ∈ Sym(n),
π̄ = HU(π̄, z_{1:n}).   (12)
Proof of Claim. Suppose, for contradiction, that π̂ = HU(π̄, z_{1:n}) ≠ π̄, and let (π_{k_r})_{r∈ℕ} be a subsequence converging to π̄. Let us recall that the limit value function is unique and denoted as V*. Writing E_{i^{0:∞}_{1:n}}[·] for the expectation operator under the stochastic process (i^k_{1:n})_{k∈ℕ} of update orders, for a state s ∈ S, we have
0 = lim_{r→∞} E_{i^{0:∞}_{1:n}}[ V_{π_{k_r+1}}(s) − V_{π_{k_r}}(s) ]
(as every choice of permutation improves the value function)
≥ lim_{r→∞} P( i^{k_r}_{1:n} = z_{1:n} ) ( V_{HU(π_{k_r}, z_{1:n})}(s) − V_{π_{k_r}}(s) )
= p(z_{1:n}) lim_{r→∞} ( V_{HU(π_{k_r}, z_{1:n})}(s) − V_{π_{k_r}}(s) ).
By the continuity of the expected HAMO (following from the continuity of the value function [8, Appendix A], the HADFs, the neighbourhood operators, and the sampling distribution), we obtain that the first component of HU(π_{k_r}, z_{1:n}), which is π^{z_1}_{k_r+1}, is continuous in π_{k_r} by Berge's Maximum Theorem [1]. Applying this argument recursively for z_2, . . . , z_n, we have that HU(π_{k_r}, z_{1:n}) is continuous in π_{k_r}. Hence, as π_{k_r} converges to π̄, its HU converges to the HU of π̄, which is π̂. Hence, we continue writing the above derivation as
= p(z_{1:n}) ( V_{π̂}(s) − V_{π̄}(s) ) ≥ 0, by Lemma 2.
As s was arbitrary, the state-value function of π̂ is the same as that of π̄: V_{π̂} = V_{π̄}. By the Bellman equation [26], Q(s, a) = r(s, a) + γ E[ V(s′) ], this also implies that their state-action value and advantage functions are the same: Q_{π̂} = Q_{π̄} and A_{π̂} = A_{π̄}. Let m be the smallest integer such that π̂^{z_m} ≠ π̄^{z_m}. This means that π̂^{z_m} achieves a greater expected HAMO than π̄^{z_m}, for which it is zero. Hence,
0 < E_{s∼β_{π̄}}[ [ M^{(π̂^{z_m})}_{D^{z_m,ν}, π̄^{z_{1:m−1}}} A_{π̄} ](s) ]
= E_{s∼β_{π̄}}[ E_{a^{z_{1:m−1}}∼π̄^{z_{1:m−1}}, a^{z_m}∼π̂^{z_m}}[ A^{z_m}_{π̄}(s, a^{z_{1:m−1}}, a^{z_m}) ] − ( ν^{z_m}_{π̄,π̂^{z_m}}(s) / β_{π̄}(s) ) D^{z_m}_{π̄}( π̂^{z_m} | s, π̄^{z_{1:m−1}} ) ]
= E_{s∼β_{π̄}}[ E_{a^{z_{1:m−1}}∼π̄^{z_{1:m−1}}, a^{z_m}∼π̂^{z_m}}[ A^{z_m}_{π̂}(s, a^{z_{1:m−1}}, a^{z_m}) ] − ( ν^{z_m}_{π̄,π̂^{z_m}}(s) / β_{π̄}(s) ) D^{z_m}_{π̄}( π̂^{z_m} | s, π̄^{z_{1:m−1}} ) ]
and, as the expected value of the multi-agent advantage function is zero,
= E_{s∼β_{π̄}}[ − ( ν^{z_m}_{π̄,π̂^{z_m}}(s) / β_{π̄}(s) ) D^{z_m}_{π̄}( π̂^{z_m} | s, π̄^{z_{1:m−1}} ) ] ≤ 0.
This is a contradiction, and so the claim in Equation (12) is proved, and Step 2 is finished.
Step 3: dropping the HADF. Consider an arbitrary limit point joint policy π̄. By Step 2, for any permutation i_{1:n}, considering the first component of the HU, and writing ν^{i_1} for the state distribution ν^{i_1}_{π̄, π^{i_1}},
π̄^{i_1} = arg max_{π^{i_1} ∈ U^{i_1}_{π̄}(π̄^{i_1})} E_{s∼β_{π̄}}[ [ M^{(π^{i_1})}_{D^{i_1}} A_{π̄} ](s) ]   (13)
= arg max_{π^{i_1} ∈ U^{i_1}_{π̄}(π̄^{i_1})} E_{s∼β_{π̄}}[ E_{a^{i_1}∼π^{i_1}}[ A^{i_1}_{π̄}(s, a^{i_1}) ] − ( ν^{i_1}(s) / β_{π̄}(s) ) D^{i_1}_{π̄}( π^{i_1} | s ) ].
As the HADF is non-negative, and at π^{i_1} = π̄^{i_1} its value and all of its Gâteaux derivatives are zero, it follows by Step 3 of the proof of Theorem 1 of [9] that for every s ∈ S,
π̄^{i_1}(·^{i_1}|s) = arg max_{π^{i_1}(·^{i_1}|s) ∈ P(A^{i_1})} E_{a^{i_1}∼π^{i_1}}[ Q^{i_1}_{π̄}(s, a^{i_1}) ].
Step 4: Nash equilibrium. We have proved that π̄ satisfies
π̄^i(·^i|s) = arg max_{π^i(·^i|s) ∈ P(A^i)} E_{a^i∼π^i}[ Q^i_{π̄}(s, a^i) ]
= arg max_{π^i(·^i|s) ∈ P(A^i)} E_{a^i∼π^i, a^{-i}∼π̄^{-i}}[ Q_{π̄}(s, a) ],   ∀i ∈ N, ∀s ∈ S.
Hence, by considering π̄^{-i} fixed, we see that π̄^i satisfies the condition for the optimal policy [26], and hence
π̄^i = arg max_{π^i ∈ Π^i} J(π^i, π̄^{-i}).
Thus, π̄ is a Nash equilibrium. Lastly, this implies that the limit value function corresponds to a Nash value function V^NE, and the limit return corresponds to a Nash return J^NE.
D Casting HAPPO as HAML
The maximisation objective of agent i_m in HAPPO is
E_{s∼ρ_{π_old}, a^{i_{1:m−1}}∼π^{i_{1:m−1}}_new, a^{i_m}∼π^{i_m}_old}[ min( r(π̄^{i_m}) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m}}), clip( r(π̄^{i_m}), 1 ± ε ) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m}}) ) ].
Fixing s and a^{i_{1:m−1}}, we can rewrite it as
E_{a^{i_m}∼π̄^{i_m}}[ A^{i_{1:m}}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ]
− E_{a^{i_m}∼π^{i_m}_old}[ r(π̄^{i_m}) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) − min( r(π̄^{i_m}) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}), clip( r(π̄^{i_m}), 1 ± ε ) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ) ].
By the multi-agent advantage decomposition,
E_{a^{i_m}∼π̄^{i_m}}[ A^{i_{1:m}}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ] = A^{i_{1:m−1}}_{π_old}(s, a^{i_{1:m−1}}) + E_{a^{i_m}∼π̄^{i_m}}[ A^{i_m}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ].
Hence, the presence of the joint advantage of agents i_{1:m} is equivalent to the multi-agent advantage of i_m given a^{i_{1:m−1}} that appears in HAMO. Hence, we only need to show that the subtracted term is an HADF. Firstly, we change min into max with the identity min f(x) = −max[−f(x)],
E_{a^{i_m}∼π^{i_m}_old}[ r(π̄^{i_m}) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) + max( −r(π̄^{i_m}) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}), −clip( r(π̄^{i_m}), 1 ± ε ) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ) ],
which we then simplify to
E_{a^{i_m}∼π^{i_m}_old}[ max( 0, ( r(π̄^{i_m}) − clip( r(π̄^{i_m}), 1 ± ε ) ) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ) ]
= E_{a^{i_m}∼π^{i_m}_old}[ ReLU( ( r(π̄^{i_m}) − clip( r(π̄^{i_m}), 1 ± ε ) ) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ) ].
As discussed in the main body of the paper, this is an HADF.
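As a sanity check of the two HADF properties (non-negativity, and vanishing in a region around π̄^{i_m} = π^{i_m}_old), the term above can be evaluated pointwise. The following sketch assumes sampled likelihood ratios and advantages given as plain tensors; it is illustrative only.

import torch

def happo_drift_term(ratio, joint_adv, eps=0.2):
    # ReLU((r - clip(r, 1-eps, 1+eps)) * A): zero whenever the ratio stays inside the clip
    # range (in particular at ratio = 1), and non-negative everywhere else.
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    return torch.relu((ratio - clipped) * joint_adv)

ratio = torch.tensor([1.0, 1.1, 1.5, 0.5])
adv = torch.tensor([1.0, -2.0, 3.0, 1.0])
print(happo_drift_term(ratio, adv))              # tensor([0.0000, 0.0000, 0.9000, 0.0000])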
E Algorithms
Algorithm 2: HAA2C
Input: stepsize α, batch size B, number of: agents n, episodes K, steps per episode T, mini-epochs e;
Initialize: the critic network φ, the policy networks {θ^i}_{i∈N}, the replay buffer B;
for k = 0, 1, . . . , K − 1 do
    Collect a set of trajectories by letting the agents act according to their policies, a^i ∼ π^i_{θ^i}(·^i|o^i);
    Push the transitions {(o^i_t, a^i_t, o^i_{t+1}, r_t), ∀i ∈ N, t ≤ T} into B;
    Sample a random minibatch of B transitions from B;
    Estimate the returns R and the advantage function Â(s, a), using V̂_φ and GAE;
    Draw a permutation of agents i_{1:n} at random;
    Set M^{i_1}(s, a) = Â(s, a);
    for agent i_m = i_1, . . . , i_n do
        Set π^{i_m}_0(a^{i_m}|o^{i_m}) = π^{i_m}_{θ^{i_m}}(a^{i_m}|o^{i_m});
        for mini-epoch = 1, . . . , e do
            Compute agent i_m's policy gradient
            g^{i_m} = ∇_{θ^{i_m}} (1/B) ∑_{b=1}^B M^{i_m}(s_b, a_b) · π^{i_m}_{θ^{i_m}}(a^{i_m}_b|o^{i_m}_b) / π^{i_m}_0(a^{i_m}_b|o^{i_m}_b);
            Update agent i_m's policy by θ^{i_m} = θ^{i_m} + α g^{i_m};
        Compute M^{i_{m+1}}(s, a) = ( π^{i_m}_{θ^{i_m}}(a^{i_m}|o^{i_m}) / π^{i_m}_0(a^{i_m}|o^{i_m}) ) M^{i_m}(s, a)  // unless m = n;
    Update the critic by gradient descent on (1/B) ∑_b ( V̂_φ(s_b) − R_b )^2;
Discard φ. Deploy {θ^i}_{i∈N} in execution;
Algorithm 3: HADDPG
Input: stepsize α, Polyak coefficient τ, batch size B, number of: agents n, episodes K, steps per episode T, mini-epochs e;
Initialize: the critic networks φ and φ′, the policy networks {θ^i}_{i∈N}, the replay buffer B, random processes {X^i}_{i∈N} for exploration;
for k = 0, 1, . . . , K − 1 do
    Collect a set of transitions by letting the agents act according to their deterministic policies with the exploratory noise, a^i = μ^i_{θ^i}(o^i) + X^i_t;
    Push the transitions {(o^i_t, a^i_t, o^i_{t+1}, r_t), ∀i ∈ N, t ≤ T} into B;
    Sample a random minibatch of B transitions from B;
    Compute the critic targets y_t = r_t + γ Q_{φ′}(s_{t+1}, a_{t+1});
    Update the critic by minimising the loss, φ = arg min_φ (1/B) ∑_t ( y_t − Q_φ(s_t, a_t) )^2;
    Draw a permutation of agents i_{1:n} at random;
    for agent i_m = i_1, . . . , i_n do
        Update agent i_m by solving
        θ^{i_m} = arg max_{θ̂^{i_m}} (1/B) ∑_t Q_φ( s_t, μ^{i_{1:m−1}}_{θ^{i_{1:m−1}}}(o^{i_{1:m−1}}_t), μ^{i_m}_{θ̂^{i_m}}(o^{i_m}_t), a^{i_{m+1:n}}_t )
        with e mini-epochs of deterministic policy gradient ascent;
    Update the target critic network smoothly, φ′ = τφ + (1 − τ)φ′;
Discard φ. Deploy {θ^i}_{i∈N} in execution;
F Experiments
F.1 Compute resources
We used one internal compute server consisting of 6 RTX 3090 cards and 112 CPUs; however, each model is trained on at most one card.
F.2 Hyperparameters
We implement MAA2C and HAA2C based on the HAPPO/HATRPO codebase [8]. We list the hyperparameters we use for SMAC in Table 1 and for MuJoCo in Table 2.
Table 1: Common hyperparameters used in the SMAC domain.
hyperparameters value hyperparameters value hyperparameters value
critic lr 5e-4 optimizer Adam stacked-frames 1
gamma 0.99 optim eps 1e-5 batch size 3200
gain 0.01 hidden layer 1 training threads 64
actor network mlp num mini-batch 1 rollout threads 8
hypernet embed 64 max grad norm 10 episode length 400
activation ReLU hidden layer dim 64 use huber loss True
Table 2: Common hyperparameters used for MAA2C-NS, MAA2C-S and HAA2C in the Multi-Agent MuJoCo.
hyperparameters value hyperparameters value hyperparameters value
critic lr 1e-3 optimizer Adam num mini-batch 1
gamma 0.99 optim eps 1e-5 batch size 4000
gain 0.01 hidden layer 1 training threads 8
std y coef 0.5 actor network mlp rollout threads 4
std x coef 1 max grad norm 10 episode length 1000
activation ReLU hidden layer dim 64 eval episode 32
In addition to these common hyperparameters, we set the number of mini-epochs for HAA2C to 5. We set the actor learning rate to 2e-4 for HalfCheetah and Ant, and to 1e-4 for Walker2d.
We implement MADDPG and HADDPG based on the Tianshou framework [28]. We list the hyperparameters we use in Tables 3 and 4.
Table 3: Hyper-parameter used for MADDPG/HADDPG in the Multi-Agent MuJoCo domain
hyperparameters value hyperparameters value hyperparameters value
actor lr 3e-4 optimizer Adam replay buffer size 1e6
critic lr 1e-3 exploration noise 0.1 batch size 1000
gamma 0.99 step-per-epoch 50000 training num 20
tau 0.1 step-per-collector 2000 test num 10
start-timesteps 25000 update-per-step 0.025 epoch 200
hidden-sizes [64, 64] episode length 1000
Table 4: Parameter n-step used for MADDPG/HADDPG in the Multi-Agent MuJoCo
task value task value task value
Reacher (2 × 1) 5 Hopper (3 × 1) 20 Walker (3 × 2) 5
Ant (4 × 2) 20 Swimmer (2 × 1) 5 Humanoid (9|8) 5