Heterogeneous-Agent Mirror Learning:
A Continuum of Solutions to Cooperative MARL

Jakub Grudzien Kuba*,1, Xidong Feng*,2, Shiyao Ding3, Hao Dong4, Jun Wang2, Yaodong Yang†,4
1 University of Oxford, 2 University College London, 3 Kyoto University, 4 Peking University
* Equal contribution. † Corresponding author <[email protected]>.
Preprint. Under review. arXiv:2208.01682v1 [cs.MA] 2 Aug 2022
Abstract
The necessity for cooperation among intelligent machines has popularised coop-
erative multi-agent reinforcement learning (MARL) in the artificial intelligence
(AI) research community. However, many research endeavours have been focused
on developing practical MARL algorithms whose effectiveness has been studied
only empirically, thereby lacking theoretical guarantees. As recent studies have
revealed, MARL methods often achieve performance that is unstable in terms of
reward monotonicity or suboptimal at convergence. To resolve these issues, in
this paper, we introduce a novel framework named Heterogeneous-Agent Mirror
Learning (HAML) that provides a general template for MARL algorithmic designs.
We prove that algorithms derived from the HAML template satisfy the desired
properties of the monotonic improvement of the joint reward and the convergence
to Nash equilibrium. We verify the practicality of HAML by proving that the
current state-of-the-art cooperative MARL algorithms, HATRPO and HAPPO,
are in fact HAML instances. Next, as a natural outcome of our theory, we pro-
pose HAML extensions of two well-known RL algorithms, HAA2C (for A2C)
and HADDPG (for DDPG), and demonstrate their effectiveness against strong
baselines on StarCraftII and Multi-Agent MuJoCo tasks.
1 Introduction
While the policy gradient (PG) formula has long been known in the reinforcement learning (RL) community [25], it was not until trust region learning [21] that deep RL algorithms started to solve complex tasks, such as real-world robotic control, successfully. Nowadays, methods that follow the trust-region framework, including TRPO [21], PPO [23] and their extensions [22, 7], have become effective tools for solving challenging AI problems [2]. It was believed that the key to their success is the rigorously described stability and the monotonic improvement property of the trust-region learning that they approximate. This reasoning, however, is of limited scope, since it fails to explain why some algorithms that follow it (e.g. PPO-KL) largely underperform in contrast to the success of others (e.g. PPO-clip) [23]. Furthermore, the trust-region interpretation of PPO has been formally rejected by recent studies, both empirically [5] and theoretically [27]; these revealed that the algorithm violates the trust-region constraints: it neither constrains the KL-divergence between two consecutive policies, nor bounds their likelihood ratios. These findings suggest that, while the number of available RL algorithms grows, our understanding of them does not, and the algorithms often come without theoretical guarantees.
Only recently, Kuba et al. [9] showed that well-known algorithms, such as PPO, are in fact instances of the so-called mirror learning framework, within which any induced algorithm is theoretically sound. On a high level, methods that fall into this class optimise the mirror objective, which shapes an advantage surrogate by means of a drift functional, a quasi-distance between policies. Such an update provably leads them to monotonic improvements of the return, as well as convergence to the optimal policy. The mirror learning result offers RL researchers strong confidence that there exists a connection between an algorithm's practicality and its theoretical properties, and assures the soundness of common RL practice.
While the problem of the lack of theoretical guarantees has been severe in RL, in multi-agent reinforcement learning (MARL) it has only been exacerbated. Although the PG theorem has been successfully extended to the multi-agent PG (MAPG) version [31], it has only recently been shown that the variance of MAPG estimators grows linearly with the number of agents [10]. Prior to this, however, a novel paradigm of centralised training for decentralised execution (CTDE) [6, 14] greatly alleviated the difficulty of multi-agent learning by assuming that the global state and the opponents' actions and policies are accessible during the training phase; this enabled the development of practical MARL methods by merely extending single-agent algorithms' implementations to the multi-agent setting. As a result, direct extensions of TRPO [11] and PPO [3, 30] have been proposed whose performance, although impressive in some settings, varies according to the version used and the environment tested against. However, these extensions do not assure the monotonic improvement property or a convergence result of any kind [8]. Importantly, these methods can be proved to be suboptimal at convergence in the common setting of parameter sharing [8], which is adopted by default in popular multi-agent algorithms [30] and popular multi-agent benchmarks such as SMAC [20] due to the computational convenience it provides.
In this paper, we resolve these issues by proposing Heterogeneous-Agent Mirror Learning (HAML), a template that can induce a continuum of cooperative MARL algorithms with theoretical guarantees for monotonic improvement as well as Nash equilibrium (NE) convergence. The purpose of HAML is to endow MARL researchers with a template for rigorous algorithmic design so that, having been granted a method's correctness upfront, they can focus on other aspects, such as effective implementation through deep neural networks. We demonstrate the expressive power of the HAML framework by showing that two existing state-of-the-art (SOTA) MARL algorithms, HATRPO and HAPPO [8], are rigorous instances of HAML. This stands in contrast to viewing them merely as approximations to provably correct multi-agent trust-region algorithms, as which they were originally considered [8]. Furthermore, although HAML is mainly a theoretical contribution, we naturally demonstrate its usefulness by using it to derive two heterogeneous-agent extensions of successful RL algorithms: HAA2C (for A2C [16]) and HADDPG (for DDPG [12]), whose strengths are demonstrated on the StarCraftII (SMAC) [3] and Multi-Agent MuJoCo [4] benchmarks against strong baselines such as MADDPG [14] and MAA2C [18].
2 Problem Formulations
We formulate the cooperative MARL problem as a cooperative Markov game [13] defined by a tuple ⟨N, S, A, r, P, γ, d⟩. Here, N = {1, . . . , n} is a set of n agents, S is the state space, and A = ×_{i=1}^n A^i is the product of all agents' action spaces, known as the joint action space. Although our results hold for general compact state and action spaces, in this paper we assume that they are finite, for simplicity. Further, r : S × A → R is the joint reward function, P : S × A × S → [0, 1] is the transition probability kernel, γ ∈ [0, 1) is the discount factor, and d ∈ P(S) (where P(X) denotes the set of probability distributions over a set X) is the positive initial state distribution. At time step t ∈ ℕ, the agents are at state s_t (which may not be fully observable); they take independent actions a^i_t, ∀i ∈ N, drawn from their policies π^i(·^i | s_t) ∈ P(A^i), and equivalently, they take a joint action a_t = (a^1_t, . . . , a^n_t) drawn from their joint policy π(·|s_t) = ∏_{i=1}^n π^i(·^i | s_t) ∈ P(A). We write Π^i ≜ {π^i | ∀s ∈ S, π^i(·^i | s) ∈ P(A^i)} to denote the policy space of agent i, and Π ≜ (Π^1, . . . , Π^n) to denote the joint policy space. It is important to note that when π^i(·^i | s) is a Dirac delta distribution, the policy is referred to as deterministic [24] and we write μ^i(s) to refer to its centre. Then, the environment emits the joint reward r(s_t, a_t) and moves to the next state s_{t+1} ∼ P(·|s_t, a_t) ∈ P(S). The initial state distribution d, the joint policy π, and the transition kernel P induce the (improper) marginal state distribution ρ_π(s) ≜ ∑_{t=0}^∞ γ^t Pr(s_t = s | d, π). The agents aim to maximise the expected joint return, defined as
J(π) = E_{s_0∼d, a_{0:∞}∼π, s_{1:∞}∼P}[ ∑_{t=0}^∞ γ^t r(s_t, a_t) ].
We adopt the most common solution concept for multi-agent problems, which is that of Nash equilibria [17, 29]. We say that a joint policy π_NE ∈ Π is a NE if none of the agents can increase the joint return by unilaterally altering its policy. More formally, π_NE is a NE if
∀i ∈ N, ∀π^i ∈ Π^i,   J(π^i, π^{-i}_NE) ≤ J(π_NE).
To study the problem of finding a NE, we introduce the following notions. Let i_{1:m} = (i_1, . . . , i_m) ⊆ N be an ordered subset of agents. We write -i_{1:m} to refer to its complement, and i and -i, respectively, when m = 1. We define the multi-agent state-action value function as
Q^{i_{1:m}}_π(s, a^{i_{1:m}}) ≜ E_{a^{-i_{1:m}}_0∼π^{-i_{1:m}}, s_{1:∞}∼P, a_{1:∞}∼π}[ ∑_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s, a^{i_{1:m}}_0 = a^{i_{1:m}} ].
When m = n (the joint action of all agents is considered), then i_{1:n} ∈ Sym(n), where Sym(n) denotes the set of permutations of the integers 1, . . . , n, known as the symmetric group. In that case we write Q^{i_{1:n}}_π(s, a^{i_{1:n}}) = Q_π(s, a), which is known as the (joint) state-action value function. On the other hand, when m = 0, i.e., i_{1:m} = ∅, the function takes the form V_π(s), known as the state-value function. Consider two disjoint subsets of agents, j_{1:k} and i_{1:m}. Then, the multi-agent advantage function of i_{1:m} with respect to j_{1:k} is defined as
A^{i_{1:m}}_π(s, a^{j_{1:k}}, a^{i_{1:m}}) ≜ Q^{j_{1:k}, i_{1:m}}_π(s, a^{j_{1:k}}, a^{i_{1:m}}) − Q^{j_{1:k}}_π(s, a^{j_{1:k}}).
As shown in [10], the multi-agent advantage function allows for an additive decomposition of the joint advantage function by means of the following lemma.
Lemma 1 (Multi-Agent Advantage Decomposition). Let π be a joint policy, and i_1, . . . , i_m be an arbitrary ordered subset of agents. Then, for any state s and joint action a^{i_{1:m}},
A^{i_{1:m}}_π(s, a^{i_{1:m}}) = ∑_{j=1}^m A^{i_j}_π(s, a^{i_{1:j−1}}, a^{i_j}).   (1)
Although the multi-agent advantage function has been discovered only recently and has not been studied thoroughly, it is this function and the above lemma that build the foundation for the development of the HAML theory.
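For concreteness, the decomposition in Lemma 1 can be checked numerically. The following is a minimal sketch for a single-state game with three agents and randomly drawn joint values and product policies; the names, shapes, and the chosen joint action are illustrative assumptions, not part of the paper's codebase.

import numpy as np

rng = np.random.default_rng(0)
n, A = 3, 2                                          # three agents, two actions each, one state
Q = rng.normal(size=(A,) * n)                        # joint action values Q(s, a^1, ..., a^n)
pi = [rng.dirichlet(np.ones(A)) for _ in range(n)]   # independent policies pi^i(.|s)

def Q_sub(acts):
    # Q^{i_{1:m}}(s, a^{i_{1:m}}): fix the first m agents' actions, average the rest under pi.
    out = Q[tuple(acts)]
    for i in range(n - 1, len(acts) - 1, -1):        # marginalise the trailing agents
        out = out @ pi[i]
    return out

a = (1, 0, 1)                                        # an arbitrary joint action
lhs = Q_sub(a) - Q_sub(())                           # A^{i_{1:n}}(s, a) = Q(s, a) - V(s)
rhs = sum(Q_sub(a[:j + 1]) - Q_sub(a[:j]) for j in range(n))  # sum of A^{i_j} terms
print(np.isclose(lhs, rhs))                          # True: Equation (1) holds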
3 The State of Affairs in MARL
Before we review existing SOTA algorithms for cooperative MARL, we introduce two settings in which the algorithms can be implemented. Both of them can be considered appealing depending on the application, but their benefits also come with "traps" which, if not taken care of, may provably deteriorate an algorithm's performance.
3.1 Homogeneity vs. Heterogeneity
The first setting is that of homogeneous policies, i.e., those where all agents share one policy: π^i = π, ∀i ∈ N, so that π = (π, . . . , π) [3, 30]. This approach enables a straightforward adoption of an RL algorithm to MARL, and it does not introduce much computational or sample-complexity burden as the number of agents grows. Nevertheless, sharing one policy across all agents requires that their action spaces are also the same, i.e., A^i = A^j, ∀i, j ∈ N. Furthermore, policy sharing prevents agents from learning different skills. This scenario may be particularly dangerous in games with a large number of agents, as shown in Proposition 1, proved by [8].
Proposition 1 (Trap of Homogeneity). Let us consider a fully-cooperative game with an even number of agents n, one state, and the joint action space {0, 1}^n, where the reward is given by r(0_{n/2}, 1_{n/2}) = r(1_{n/2}, 0_{n/2}) = 1, and r(a^{1:n}) = 0 for all other joint actions. Let J* be the optimal joint reward, and J*_share be the optimal joint reward under the shared policy constraint. Then
J*_share / J* = 2 / 2^n.
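A quick numerical illustration of Proposition 1 (a sketch of the stated game, not part of the paper's code): under a shared Bernoulli policy that plays action 0 with probability p, the probability of producing one of the two rewarding joint actions is 2 p^{n/2} (1 − p)^{n/2}, which is maximised at p = 1/2.

import numpy as np

n = 4                                                # an even number of agents
p = np.linspace(0.0, 1.0, 100001)                    # shared probability of playing action 0
J_share = 2 * p ** (n // 2) * (1 - p) ** (n // 2)    # expected reward under the shared policy
print(J_share.max(), 2 / 2 ** n)                     # both 0.125; since J* = 1, J*_share / J* = 2 / 2^n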
A more ambitious approach to MARL is to allow for heterogeneity of policies among agents, i.e., to let π^i and π^j be different functions when i ≠ j ∈ N. This setting has greater applicability, as heterogeneous agents can operate in different action spaces. Furthermore, thanks to this model's flexibility, they may learn more sophisticated joint behaviours. Lastly, they can recover homogeneous policies as a result of training, if that is indeed optimal. Nevertheless, training heterogeneous agents is highly non-trivial. Given a joint reward, an individual agent may not be able to distill its own contribution to it, a problem known as credit assignment [6, 10]. Furthermore, even if an agent identifies its improvement direction, it may conflict with those of other agents, which then damages performance, as we exemplify in Proposition 2, proved in Appendix A.2.
Proposition 2 (Trap of Heterogeneity). Let us consider a fully-cooperative game with 2 agents, one state, and the joint action space {0, 1}^2, where the reward is given by r(0, 0) = 0, r(0, 1) = r(1, 0) = 2, and r(1, 1) = −1. Suppose that π^i_old(0) > 0.6 for i = 1, 2. Then, if the agents i update their policies by
π^i_new = arg max_{π^i} E_{a^i∼π^i, a^{-i}∼π^{-i}_old}[ A_{π_old}(a^i, a^{-i}) ],   ∀i ∈ N,
then the resulting policy will yield a lower return,
J(π_old) > J(π_new) = min_π J(π).
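The trap of Proposition 2 can also be reproduced numerically. The sketch below uses the proposition's reward matrix; the starting probability 0.7 is an arbitrary choice satisfying π^i_old(0) > 0.6, and the code is an illustration rather than part of the paper's implementation.

import numpy as np

r = np.array([[0.0, 2.0],
              [2.0, -1.0]])                     # r(a^1, a^2), with r(1, 1) = -1
pi_old = [np.array([0.7, 0.3]),                 # both agents put > 0.6 mass on action 0
          np.array([0.7, 0.3])]

def J(pi1, pi2):
    return pi1 @ r @ pi2                        # expected joint reward of a product policy

# Each agent greedily maximises its own expected value against the *old* opponent policy.
q1 = r @ pi_old[1]                              # agent 1's expected payoff for actions 0 and 1
q2 = r.T @ pi_old[0]                            # agent 2's expected payoff for actions 0 and 1
pi_new = [np.eye(2)[q1.argmax()], np.eye(2)[q2.argmax()]]
print(J(*pi_old), J(*pi_new))                   # 0.75 vs -1.0: uncoordinated updates hurt the return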
Consequently, these facts imply that homogeneous algorithms should not be applied to complex problems, but they also highlight that heterogeneous algorithms must be developed with extra care. In the next subsection, we describe existing SOTA actor-critic algorithms which, while often performing well, are still not impeccable, as they fall into one of the above two traps.
3.2 A Second Look at SOTA MARL Algorithms
Multi-Agent Advantage Actor-Critic. MAA2C [18] extends the A2C [16] algorithm to MARL by replacing the RL optimisation (single-agent policy) objective with the MARL one (joint policy),
L^{MAA2C}(π) ≜ E_{s∼π, a∼π}[ A_{π_old}(s, a) ],   (2)
which computes the gradient with respect to every agent i's policy parameters, and performs a gradient-ascent update for each agent. This algorithm is straightforward to implement and is capable of solving simple multi-agent problems [18]. We point out, however, that by simply following their own MAPG, the agents perform uncoordinated updates, thus getting caught by Proposition 2. Furthermore, MAPG estimates have been proved to suffer from large variance which grows linearly with the number of agents [10], thus making the algorithm unstable. To assure greater stability, the following MARL methods, inspired by stable RL approaches, have been developed.
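A score-function surrogate whose gradient matches the MAPG of Equation (2) can be written in a few lines; the PyTorch-style sketch below is illustrative, with the tensor names and shapes as assumptions.

import torch

def maa2c_loss(joint_logp, joint_adv):
    # joint_logp = sum_i log pi^i(a^i|s); every agent ascends the same joint advantage,
    # so differentiating this loss updates all agents' parameters simultaneously.
    return -(joint_logp * joint_adv.detach()).mean()   # minimise the negative => gradient ascent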
Multi-Agent Deep Deterministic Policy Gradient. MADDPG [15] is a MARL extension of the popular DDPG algorithm [12]. At every iteration, every agent i updates its deterministic policy by maximising the following objective
L^{MADDPG}_i(μ^i) ≜ E_{s∼β_{μ_old}}[ Q^i_{μ_old}(s, μ^i(s)) ] = E_{s∼β_{μ_old}}[ Q^i_{μ_old}(s, μ^i(s), μ^{-i}_old(s)) ],   (3)
where β_{μ_old} is a state distribution that is not necessarily equivalent to ρ_{μ_old}, thus allowing for off-policy training. In practice, MADDPG maximises Equation (3) by a few steps of gradient ascent. The main advantages of MADDPG include the small variance of its MAPG estimates, a property granted by deterministic policies [24], as well as low sample complexity due to learning from off-policy data. Such a combination gives the algorithm strong performance in continuous-action tasks [15, 8]. However, this method's strengths also constitute its limitations: it is applicable to continuous problems only (discrete problems require categorical-distribution policies), and it relies on a large memory capacity to store the off-policy data.
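A minimal sketch of the actor update implied by Equation (3) follows; the call signatures (critic_i(state, joint_action), an actor network, the other agents' old actions concatenated in a fixed order) are assumptions for illustration, not the reference MADDPG implementation.

import torch

def maddpg_actor_loss(critic_i, state, actor_i, obs_i, other_actions_old):
    # Maximise Q^i(s, mu^i(s), mu^{-i}_old(s)): only agent i's own action is differentiated.
    joint_action = torch.cat([actor_i(obs_i), other_actions_old.detach()], dim=-1)
    return -critic_i(state, joint_action).mean()       # a few gradient-ascent steps on -loss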
Multi-Agent PPO. MAPPO [30] is a relatively straightforward extension of PPO [23] to MARL. In its default formulation, the agents employ the policy-sharing trick described in the previous subsection. As such, the policy is updated to maximise
L^{MAPPO}(π) ≜ E_{s∼ρ_{π_old}, a∼π_old}[ ∑_{i=1}^n min( (π(a^i|s)/π_old(a^i|s)) A_{π_old}(s, a), clip( π(a^i|s)/π_old(a^i|s), 1 ± ε ) A_{π_old}(s, a) ) ],   (4)
where the clip(·, 1 ± ε) operator clips the input to 1 − ε / 1 + ε if it is below/above this value. Such an operation removes the incentive for agents to make large policy updates, thus effectively stabilising the training. Indeed, the algorithm's performance on the StarCraftII benchmark is remarkable, and it is accomplished using only on-policy data. Nevertheless, the policy-sharing strategy limits the algorithm's applicability and leads to its suboptimality, as we discussed in Proposition 1 and also in [8]. Trying to avoid this issue, one can implement the algorithm without policy sharing, thus making the agents simply take simultaneous PPO updates while employing a joint advantage estimator. In this case, the updates are not coordinated, making MAPPO fall into the trap of Proposition 2.
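For completeness, a minimal PyTorch-style sketch of the clipped objective in Equation (4); logp_new and logp_old are per-agent log-probabilities of shape [batch, n_agents] and joint_adv is the joint advantage estimate. These shapes and names are assumptions, not the reference MAPPO code.

import torch

def mappo_clip_loss(logp_new, logp_old, joint_adv, eps=0.2):
    ratio = torch.exp(logp_new - logp_old)               # pi(a^i|s) / pi_old(a^i|s), per agent
    adv = joint_adv.detach().unsqueeze(-1)               # the same joint advantage for every agent
    surr = torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    return -surr.sum(dim=-1).mean()                      # sum over agents as in Eq. (4), then maximise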
In summary, none of these algorithms possesses performance guarantees. Altering their implementation settings to escape one of the traps from Subsection 3.1 makes them, at best, fall into the other. This shows that the MARL problem introduces additional complexity into the single-agent RL setting, and needs additional care to be solved rigorously. With this motivation, in the next section, we develop a theoretical framework for the development of MARL algorithms with correctness guarantees.
4 Heterogeneous-Agent Mirror Learning
In this section, we introduce heterogeneous-agent mirror learning (HAML), a template that includes a continuum of MARL algorithms which we prove to solve cooperative problems with correctness guarantees. HAML is designed for the general and expressive setting of heterogeneous agents, thus avoiding Proposition 1, and it is capable of coordinating the agents' updates, leaving Proposition 2 behind.
4.1 Setting up HAML
We begin by introducing the necessary definitions of the HAML attributes, starting with the drift functional.
Definition 1. Let i ∈ N. A heterogeneous-agent drift functional (HADF) D^{i,ν} of i consists of a map, which is defined as
D^i : Π × Π × P(-i) × S → { D^i_π(· | s, π̄^{j_{1:m}}) : P(A^i) → R },
such that for all arguments, under the notation D^i_π(π̂^i | s, π̄^{j_{1:m}}) ≜ D^i_π( π̂^i(·^i|s) | π̄^{j_{1:m}}, s ),
1. D^i_π(π̂^i | s, π̄^{j_{1:m}}) ≥ D^i_π(π^i | s, π̄^{j_{1:m}}) = 0 (non-negativity),
2. D^i_π(π̂^i | s, π̄^{j_{1:m}}) has all Gâteaux derivatives zero at π̂^i = π^i (zero gradient),
and a probability distribution ν^i_{π,π̂^i} ∈ P(S) over states that can (but does not have to) depend on π and π̂^i, and such that the drift D^{i,ν} of π̂^i from π^i with respect to π̄^{j_{1:m}}, defined as
D^{i,ν}_π(π̂^i | π̄^{j_{1:m}}) ≜ E_{s∼ν^i_{π,π̂^i}}[ D^i_π(π̂^i | s, π̄^{j_{1:m}}) ],
is continuous with respect to π, π̄^{j_{1:m}}, and π̂^i. We say that the HADF is positive if D^{i,ν}_π(π̂^i | π̄^{j_{1:m}}) = 0 implies π̂^i = π^i, and trivial if D^{i,ν}_π(π̂^i | π̄^{j_{1:m}}) = 0 for all π, π̄^{j_{1:m}}, and π̂^i.
Intuitively, the drift D^{i,ν}_π(π̂^i | π̄^{j_{1:m}}) is a notion of distance between π^i and π̂^i, given that the agents j_{1:m} have just updated to π̄^{j_{1:m}}. We highlight that, under this conditionality, the same update (from π^i to π̂^i) can have different sizes; this will later enable HAML agents to softly constrain their learning steps in a coordinated way. Before that, we introduce a notion that renders hard constraints, which may be a part of an algorithm design, or an inherent limitation.
Definition 2. Let i ∈ N. We say that U^i : Π × Π^i → P(Π^i) is a neighbourhood operator if ∀π^i ∈ Π^i, U^i_π(π^i) contains a closed ball, i.e., there exists a state-wise monotonically non-decreasing metric χ : Π^i × Π^i → R such that ∀π^i ∈ Π^i there exists δ^i > 0 such that
χ(π^i, π̄^i) ≤ δ^i  ⟹  π̄^i ∈ U^i_π(π^i).
Throughout this work, for every joint policy π, we will associate it with its sampling distribution, a positive state distribution β_π ∈ P(S) that is continuous in π [9]. With these notions defined, we introduce the main definition of the paper.
Definition 3. Let i ∈ N, j_{1:m} ∈ P(-i), and D^{i,ν} be a HADF of agent i. The heterogeneous-agent mirror operator (HAMO) integrates the advantage function as
[ M^{(π̂^i)}_{D^{i,ν}, π̄^{j_{1:m}}} A_π ](s) ≜ E_{a^{j_{1:m}}∼π̄^{j_{1:m}}, a^i∼π̂^i}[ A^i_π(s, a^{j_{1:m}}, a^i) ] − ( ν^i_{π,π̂^i}(s) / β_π(s) ) D^i_π( π̂^i | s, π̄^{j_{1:m}} ).
We note two important facts. First, when ν^i_{π,π̂^i} = β_π, the fraction in front of the HADF in HAMO disappears, making the operator simply a difference between the advantage and the HADF's evaluation. Second, when π̂^i = π^i, HAMO evaluates to zero. Therefore, as the HADF is non-negative, a policy π̂^i that improves HAMO must make it positive, and thus leads to an improvement of the multi-agent advantage of agent i. In the next subsection, we study the properties of HAMO in more detail, as well as use it to construct HAML, a general framework for MARL algorithm design.
4.2 Theoretical Properties of HAML
It turns out that, under a certain configuration, agents' local improvements result in the joint improvement of all agents, as described by the lemma below.
Lemma 2 (HAMO Is All You Need). Let π_old and π_new be joint policies and let i_{1:n} ∈ Sym(n) be an agent permutation. Suppose that, for every state s ∈ S and every m = 1, . . . , n,
[ M^{(π^{i_m}_new)}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s) ≥ [ M^{(π^{i_m}_old)}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s).   (5)
Then, π_new is jointly better than π_old, so that for every state s,
V_{π_new}(s) ≥ V_{π_old}(s).
Subsequently, the monotonic improvement property of the joint return follows naturally, as
J(π_new) = E_{s∼d}[ V_{π_new}(s) ] ≥ E_{s∼d}[ V_{π_old}(s) ] = J(π_old).
However, the conditions of the lemma require every agent to solve |S| instances of Inequality (5), which may be an intractable problem. Instead, we shall design a single optimisation objective whose solution satisfies those inequalities. Furthermore, to be applicable to large-scale problems, such an objective should be estimable via sampling. To handle these challenges, we introduce the following Algorithm Template 1, which generates a continuum of HAML algorithms.
Algorithm Template 1: Heterogeneous-Agent Mirror Learning
Initialise a joint policy π_0 = (π^1_0, . . . , π^n_0);
for k = 0, 1, . . . do
    Compute the advantage function A_{π_k}(s, a) for all state-(joint)action pairs (s, a);
    Draw a permutation i_{1:n} of agents at random from a positive distribution p ∈ P(Sym(n));
    for m = 1 : n do
        Make an update π^{i_m}_{k+1} = arg max_{π^{i_m} ∈ U^{i_m}_{π_k}(π^{i_m}_k)} E_{s∼β_{π_k}}[ [ M^{(π^{i_m})}_{D^{i_m}, π^{i_{1:m−1}}_{k+1}} A_{π_k} ](s) ];
Output: A limit-point joint policy π_∞
Based on Lemma 2 and the fact that π^i ∈ U^i_π(π^i), ∀i ∈ N, ∀π^i ∈ Π^i, we know that any HAML algorithm (weakly) improves the joint return at every iteration. In practical settings, such as deep MARL, the maximisation step of a HAML method can be performed by a few steps of gradient ascent on a sample average of HAMO (see Definition 3). We also highlight that if the neighbourhood operators U^i can be chosen so that they produce small policy-space subsets, then the resulting updates will be not only improving but also small. This, again, is a desirable property while optimising neural-network policies, as it helps stabilise the algorithm. One may wonder why the order of agents in HAML updates is randomised at every iteration; this condition is necessary to establish convergence to NE, and is intuitively comprehensible: fixed-point joint policies of this randomised procedure assure that none of the agents is incentivised to make an update, which is precisely the NE condition.
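To make Algorithm Template 1 concrete at the implementation level, the following is a structural Python sketch of one HAML iteration; the HAMO objective, the neighbourhood projection, and the inner optimiser are passed in as placeholders, since they are exactly the design choices that a particular HAML instance has to fix.

import random

def haml_iteration(policies, hamo_objective, project, optimise):
    """One iteration of Algorithm Template 1 (schematic sketch).

    policies: dict agent_id -> policy. hamo_objective(candidate, agent, updated, policies)
    should return a sample-average HAMO estimate; project enforces the neighbourhood U^i;
    optimise performs a few gradient-ascent steps on the given objective.
    """
    order = list(policies.keys())
    random.shuffle(order)                           # draw a permutation i_{1:n} at random
    updated = {}                                    # policies of the already-updated agents i_{1:m-1}
    for agent in order:                             # sequential, coordinated updates
        objective = lambda candidate, a=agent: hamo_objective(candidate, a, updated, policies)
        candidate = optimise(objective, policies[agent])
        updated[agent] = project(candidate, policies[agent])
    policies.update(updated)                        # this is pi_{k+1}
    return policies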
We provide the full list of the most fundamental HAML properties in Theorem 1 which shows that
any method derived from Algorithm Template 1 solves the cooperative MARL problem.
Theorem 1 (The Fundamental Theorem of Heterogeneous-Agent Mirror Learning). Let, for every agent i ∈ N, D^{i,ν} be a HADF, U^i be a neighbourhood operator, and let the sampling distributions β_π depend continuously on π. Let π_0 ∈ Π, and let the sequence of joint policies (π_k)_{k=0}^∞ be obtained by a HAML algorithm induced by D^{i,ν}, U^i, ∀i ∈ N, and β_π. Then, the joint policies induced by the algorithm enjoy the following list of properties:
1. Attain the monotonic improvement property, J(π_{k+1}) ≥ J(π_k),
2. Their value functions converge to a Nash value function V^NE, lim_{k→∞} V_{π_k} = V^NE,
3. Their expected returns converge to a Nash return, lim_{k→∞} J(π_k) = J^NE,
4. Their ω-limit set consists of Nash equilibria.
See the proof in Appendix C. With the above theorem, we can conclude that HAML provides a template for generating theoretically sound, stable, monotonically improving algorithms that enable agents to learn to solve multi-agent cooperation tasks.
4.3 Existing HAML Instances: HATRPO and HAPPO
As a sanity check on its practicality, we show that two SOTA MARL methods, HATRPO and HAPPO [8], are valid instances of HAML, which also provides an explanation for their excellent empirical performance.
We begin with the intuitive example of HATRPO, where agent i_m (the permutation i_{1:n} is drawn from the uniform distribution) updates its policy so as to maximise (in π̄^{i_m})
E_{s∼ρ_{π_old}, a^{i_{1:m−1}}∼π^{i_{1:m−1}}_new, a^{i_m}∼π̄^{i_m}}[ A^{i_m}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ],   subject to D_KL(π^{i_m}_old, π̄^{i_m}) ≤ δ.
This optimisation objective can be cast as a HAMO with the trivial HADF D^{i_m} ≡ 0, and the KL-divergence neighbourhood operator
U^{i_m}_π(π^{i_m}) = { π̄^{i_m} | E_{s∼ρ_π}[ KL( π^{i_m}(·^{i_m}|s), π̄^{i_m}(·^{i_m}|s) ) ] ≤ δ }.
The sampling distribution used in HATRPO is β_π = ρ_π. Lastly, as the agents update their policies in a random loop, the algorithm is an instance of HAML. Hence, it is monotonically improving and converges to a Nash equilibrium set.
In HAPPO, the update rule of agent i_m changes with respect to HATRPO to
E_{s∼ρ_{π_old}, a^{i_{1:m−1}}∼π^{i_{1:m−1}}_new, a^{i_m}∼π^{i_m}_old}[ min( r(π̄^{i_m}) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m}}), clip( r(π̄^{i_m}), 1 ± ε ) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m}}) ) ],
where r(π̄^i) = π̄^i(a^i|s) / π^i_old(a^i|s). We show in Appendix D that this optimisation objective is equivalent to
E_{s∼ρ_{π_old}}[ E_{a^{i_{1:m−1}}∼π^{i_{1:m−1}}_new, a^{i_m}∼π̄^{i_m}}[ A^{i_m}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ] − E_{a^{i_{1:m}}∼π^{i_{1:m}}_old}[ ReLU( ( r(π̄^{i_m}) − clip( r(π̄^{i_m}), 1 ± ε ) ) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m}}) ) ] ].
The subtracted (ReLU) term is clearly non-negative due to the presence of the ReLU function. Furthermore, for policies π̄^{i_m} sufficiently close to π^{i_m}_old, the clip operator does not activate, thus leaving r(π̄^{i_m}) unchanged. Therefore, this term is zero at, and in a region around, π̄^{i_m} = π^{i_m}_old, which also implies that its Gâteaux derivatives are zero. Hence, it evaluates a HADF for agent i_m, thus making HAPPO a valid HAML instance.
Finally, we would like to highlight that these results about HATRPO and HAPPO significantly strengthen the original work of [8], which only derived these learning protocols as approximations to a theoretically sound algorithm (see Algorithm 1 in [8]), without this level of insight.
Figure 1: Comparison between HAA2C (yellow) vs MAA2C-S (blue) and MAA2C-NS (pink) in SMAC.
4.4 New HAML Instances: HAA2C and HADDPG
In this subsection, we exemplify how HAML can be used to derive principled MARL algorithms, and introduce heterogeneous-agent extensions of A2C and DDPG, different from those in Subsection 3.2. Our goal is not to establish new SOTA performance on challenging tasks, but rather to verify the correctness of our theory, as well as to deliver more robust multi-agent extensions of popular RL algorithms such as A2C and DDPG.
HAA2C intends to optimise the policy for the joint advantage function at every iteration and, similar to A2C, does not impose any penalties or constraints on that procedure. This learning procedure is accomplished by, first, drawing a random permutation of agents i_{1:n}, and then performing a few steps of gradient ascent on the objective
E_{s∼ρ_{π_old}, a^{i_{1:m}}∼π^{i_{1:m}}_old}[ ( π^{i_{1:m−1}}_new(a^{i_{1:m−1}}|s) π^{i_m}(a^{i_m}|s) ) / ( π^{i_{1:m−1}}_old(a^{i_{1:m−1}}|s) π^{i_m}_old(a^{i_m}|s) ) · A^{i_m}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ],   (6)
with respect to the parameters of π^{i_m}, for each agent i_m in the permutation, sequentially. In practice, we replace the multi-agent advantage A^{i_m}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) with the joint advantage estimate which, thanks to the joint importance sampling in Equation (6), poses the same objective for the agent (see Appendix E for the full pseudocode).
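In a deep-learning implementation, the objective in Equation (6) for agent i_m reduces to re-weighting the joint advantage by the (detached) likelihood ratios of the previously updated agents. A minimal PyTorch-style sketch, with assumed tensor names, is given below.

import torch

def haa2c_agent_loss(logp_m_new, logp_m_old, logratio_prev, joint_adv):
    # Surrogate of Eq. (6) for agent i_m. logratio_prev is the summed log-ratio of agents
    # i_{1:m-1} (new vs old policies), detached so that only agent i_m receives gradients.
    weight = torch.exp(logratio_prev.detach())
    ratio_m = torch.exp(logp_m_new - logp_m_old)             # agent i_m's own likelihood ratio
    return -(weight * ratio_m * joint_adv.detach()).mean()   # joint advantage replaces A^{i_m}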
HADDPG aims to maximise the state-action value function off-policy. As it is a deterministic-action method, importance sampling in its case translates to replacing the old action inputs to the critic with the new ones. Namely, agent i_m in a random permutation i_{1:n} maximises
E_{s∼β_{μ_old}}[ Q^{i_{1:m}}_{μ_old}( s, μ^{i_{1:m−1}}_new(s), μ^{i_m}(s) ) ],   (7)
with respect to μ^{i_m}, also with a few steps of gradient ascent. Similar to HAA2C, optimising the state-action value function (with the old-action replacement) is equivalent to optimising the original multi-agent value (see Appendix E for the full pseudocode).
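Correspondingly, a minimal sketch of the HADDPG objective in Equation (7): compared with the MADDPG sketch above, the freshly updated actions of agents i_{1:m−1} replace their old ones at the critic's input. The names and the action ordering below are assumptions for illustration.

import torch

def haddpg_agent_loss(critic, state, prev_actions_new, actor_m, obs_m, remaining_actions_old):
    # Maximise Q^{i_{1:m}}(s, mu^{i_{1:m-1}}_new(s), mu^{i_m}(s)); only agent i_m is differentiated.
    joint_action = torch.cat([prev_actions_new.detach(),
                              actor_m(obs_m),
                              remaining_actions_old.detach()], dim=-1)
    return -critic(state, joint_action).mean()               # a few steps of gradient ascent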
We are fully aware that neither of these two methods exploits the entire abundance of the HAML framework: they do not possess non-trivial HADFs or neighbourhood operators, as opposed to HATRPO and HAPPO. Thus, we speculate that the range of opportunities for HAML algorithm design is yet to be explored in future work. Nevertheless, we begin this search with these two straightforward methods, and analyse their performance in the next section.
5 Experiments and Results
In this section, we challenge the capabilities of HAML in practice by testing HAA2C and HADDPG on the most challenging MARL benchmarks: we use the StarCraft Multi-Agent Challenge [20] and Multi-Agent MuJoCo [19] for the discrete (HAA2C only) and continuous action settings, respectively. Our code is available at https://github.com/anonymouswater/HAML.
Figure 2: Comparison between HAA2C (yellow) vs MAA2C-S (blue) and MAA2C-NS (pink) in MAMuJoCo.
Figure 3: Comparison between HADDPG (yellow) and MADDPG (pink) in MAMuJoCo.
We begin by demonstrating the performance of HAA2C. As a baseline, we use its "naive" predecessor, MAA2C, in both its policy-sharing (MAA2C-S) and heterogeneous (MAA2C-NS) versions. The results in Figures 1 & 2 show that HAA2C generally achieves higher rewards in SMAC than both versions of MAA2C, while maintaining lower variance. The performance gap increases in MAMuJoCo, where the agents must learn diverse policies to master complex movements [8]. Here, echoing Proposition 1, the homogeneous MAA2C-S fails completely, and the conventional, heterogeneous MAA2C-NS underperforms by a significant margin (recall Proposition 2).
As HADDPG is a continuous-action method, we test it only on MAMuJoCo (precisely, 6 tasks) and compare it to MADDPG. Figure 3 reveals that HADDPG achieves higher reward than MADDPG, while sometimes displaying significantly lower variance (e.g. in Reacher-2x1 and Swimmer-2x1). Hence, we conclude that HADDPG performs better.
6 Conclusion
In this paper, we described and addressed the lack of principled treatments of cooperative MARL tasks. Our main contribution is the development of heterogeneous-agent mirror learning (HAML), a class of provably correct MARL algorithms whose properties are rigorously profiled. We verified the correctness and the practicality of HAML by interpreting current SOTA methods, HATRPO and HAPPO, as HAML instances, and also by deriving and testing heterogeneous-agent extensions of successful RL algorithms, named HAA2C and HADDPG. We expect HAML to serve as a template for designing both principled and practical MARL algorithms hereafter.
References
[1]
Lawrence M Ausubel and Raymond J Deneckere. A generalized theorem of the maximum.
Economic Theory, 3(1):99–107, 1993.
[2]
Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław D˛ebiak, Christy
Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large
scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
[3]
Christian Schroeder de Witt, Tarun Gupta, Denys Makoviichuk, Viktor Makoviychuk, Philip HS
Torr, Mingfei Sun, and Shimon Whiteson. Is independent learning all you need in the starcraft
multi-agent challenge? arXiv preprint arXiv:2011.09533, 2020.
[4]
Christian Schröder de Witt, Bei Peng, Pierre-Alexandre Kamienny, Philip H. S. Torr, Wendelin
Böhmer, and Shimon Whiteson. Deep multi-agent reinforcement learning for decentralized
continuous cooperative control. CoRR, abs/2003.06709, 2020.
[5]
Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry
Rudolph, and Aleksander Madry. Implementation matters in deep policy gradients: A case
study on ppo and trpo. arXiv preprint arXiv:2005.12729, 2020.
[6]
Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson.
Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 32, 2018.
[7]
Chloe Ching-Yun Hsu, Celestine Mendler-Dünner, and Moritz Hardt. Revisiting design choices
in proximal policy optimization. arXiv preprint arXiv:2009.10897, 2020.
[8]
Jakub Grudzien Kuba, Ruiqing Chen, Munning Wen, Ying Wen, Fanglei Sun, Jun Wang, and
Yaodong Yang. Trust region policy optimisation in multi-agent reinforcement learning. ICLR,
2022.
[9]
Jakub Grudzien Kuba, Christian Schroeder de Witt, and Jakob Foerster. Mirror learning: A
unifying framework of policy optimisation. ICML, 2022.
[10]
Jakub Grudzien Kuba, Muning Wen, Linghui Meng, Haifeng Zhang, David Mguni, Jun Wang,
Yaodong Yang, et al. Settling the variance of multi-agent policy gradients. Advances in Neural
Information Processing Systems, 34:13458–13470, 2021.
[11]
Hepeng Li and Haibo He. Multi-agent trust region policy optimization. arXiv preprint
arXiv:2010.07916, 2020.
[12]
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa,
David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv
preprint arXiv:1509.02971, 2015.
[13] Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In
Machine learning proceedings 1994, pages 157–163. Elsevier, 1994.
[14]
Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-
critic for mixed cooperative-competitive environments. In Proceedings of the 31st International
Conference on Neural Information Processing Systems, pages 6382–6393, 2017.
[15]
Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-
critic for mixed cooperative-competitive environments. In Proceedings of the 31st International
Conference on Neural Information Processing Systems, pages 6382–6393, 2017.
[16]
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap,
Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforce-
ment learning. In International conference on machine learning, pages 1928–1937. PMLR,
2016.
[17] John Nash. Non-cooperative games. Annals of mathematics, pages 286–295, 1951.
[18]
Georgios Papoudakis, Filippos Christianos, Lukas Schäfer, and Stefano V. Albrecht. Benchmark-
ing multi-agent deep reinforcement learning algorithms in cooperative tasks. In Proceedings
of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS),
2021.
[19]
Bei Peng, Tabish Rashid, Christian A Schroeder de Witt, Pierre-Alexandre Kamienny, Philip HS
Torr, Wendelin Böhmer, and Shimon Whiteson. Facmac: Factored multi-agent centralised
policy gradients. arXiv e-prints, pages arXiv–2003, 2020.
[20]
Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nan-
tas Nardelli, Tim GJ Rudner, Chia-Man Hung, Philip HS Torr, Jakob Foerster, and Shimon
Whiteson. The starcraft multi-agent challenge. 2019.
[21]
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust
region policy optimization. In International conference on machine learning, pages 1889–1897.
PMLR, 2015.
[22]
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-
dimensional continuous control using generalized advantage estimation. arXiv preprint
arXiv:1506.02438, 2015.
[23]
John Schulman, F. Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms. ArXiv, abs/1707.06347, 2017.
[24]
David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller.
Deterministic policy gradient algorithms. In International conference on machine learning,
pages 387–395. PMLR, 2014.
[25]
R. S. Sutton, D. Mcallester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement
learning with function approximation. In Advances in Neural Information Processing Systems
12, volume 12, pages 1057–1063. MIT Press, 2000.
[26] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. 2018.
[27]
Yuhui Wang, Hao He, and Xiaoyang Tan. Truly proximal policy optimization. In Uncertainty
in Artificial Intelligence, pages 113–122. PMLR, 2020.
[28]
Jiayi Weng, Huayu Chen, Dong Yan, Kaichao You, Alexis Duburcq, Minghao Zhang, Hang
Su, and Jun Zhu. Tianshou: a highly modularized deep reinforcement learning library. arXiv
preprint arXiv:2107.14171, 2021.
[29]
Yaodong Yang and Jun Wang. An overview of multi-agent reinforcement learning from game
theoretical perspective. arXiv preprint arXiv:2011.00583, 2020.
[30]
Chao Yu, A. Velu, Eugene Vinitsky, Yu Wang, A. Bayen, and Yi Wu. The surprising effectiveness
of mappo in cooperative, multi-agent games. ArXiv, abs/2103.01955, 2021.
[31]
Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, and Tamer Basar. Fully decentralized
multi-agent reinforcement learning with networked agents. In International Conference on
Machine Learning, pages 5872–5881. PMLR, 2018.
A Proofs of Preliminary Results
A.1 Proof of Lemma 1
Lemma 1 (Multi-Agent Advantage Decomposition). Let π be a joint policy, and i_1, . . . , i_m be an arbitrary ordered subset of agents. Then, for any state s and joint action a^{i_{1:m}},
A^{i_{1:m}}_π(s, a^{i_{1:m}}) = ∑_{j=1}^m A^{i_j}_π(s, a^{i_{1:j−1}}, a^{i_j}).   (1)
Proof. (We quote the proof from [8].) We start by expressing the multi-agent advantage as a telescoping sum, and then rewrite it using the definition of the multi-agent advantage,
A^{i_{1:m}}_π(s, a^{i_{1:m}}) = Q^{i_{1:m}}_π(s, a^{i_{1:m}}) − V_π(s)
= ∑_{j=1}^m [ Q^{i_{1:j}}_π(s, a^{i_{1:j}}) − Q^{i_{1:j−1}}_π(s, a^{i_{1:j−1}}) ]
= ∑_{j=1}^m A^{i_j}_π(s, a^{i_{1:j−1}}, a^{i_j}).
A.2 Proof of Proposition 2
Proposition 2 (Trap of Heterogeneity). Let us consider a fully-cooperative game with 2 agents, one state, and the joint action space {0, 1}^2, where the reward is given by r(0, 0) = 0, r(0, 1) = r(1, 0) = 2, and r(1, 1) = −1. Suppose that π^i_old(0) > 0.6 for i = 1, 2. Then, if the agents i update their policies by
π^i_new = arg max_{π^i} E_{a^i∼π^i, a^{-i}∼π^{-i}_old}[ A_{π_old}(a^i, a^{-i}) ],   ∀i ∈ N,
then the resulting policy will yield a lower return,
J(π_old) > J(π_new) = min_π J(π).
Proof. As there is only one state, we can ignore the infinite horizon and the discount factor γ, thus making the state-action value and the reward functions equivalent, Q ≡ r.
Let us, for brevity, define π^i = π^i_old(0) > 0.6, for i = 1, 2. We have
J(π_old) = Pr(a^1 = a^2 = 0) r(0, 0) + ( 1 − Pr(a^1 = a^2 = 0) ) E[ r(a^1, a^2) | (a^1, a^2) ≠ (0, 0) ]
> 0.6^2 × 0 − (1 − 0.6^2) = −0.64.
The update rule stated in the proposition can be equivalently written as
π^i_new = arg max_{π^i} E_{a^i∼π^i, a^{-i}∼π^{-i}_old}[ Q_{π_old}(a^i, a^{-i}) ].   (8)
We have
E_{a^{-i}∼π^{-i}_old}[ Q_{π_old}(0, a^{-i}) ] = π^{-i} Q(0, 0) + (1 − π^{-i}) Q(0, 1) = π^{-i} r(0, 0) + (1 − π^{-i}) r(0, 1) = 2(1 − π^{-i}),
and similarly
E_{a^{-i}∼π^{-i}_old}[ Q_{π_old}(1, a^{-i}) ] = π^{-i} r(1, 0) + (1 − π^{-i}) r(1, 1) = 2π^{-i} − (1 − π^{-i}) = 3π^{-i} − 1.
Hence, if π^{-i} > 0.6, then
E_{a^{-i}∼π^{-i}_old}[ Q_{π_old}(1, a^{-i}) ] = 3π^{-i} − 1 > 3 × 0.6 − 1 = 0.8 > 2 − 2π^{-i} = E_{a^{-i}∼π^{-i}_old}[ Q_{π_old}(0, a^{-i}) ].
Therefore, for every i, the solution to Equation (8) is the greedy policy with π^i_new(1) = 1. Therefore,
J(π_new) = Q(1, 1) = r(1, 1) = −1,
which is both smaller than J(π_old) > −0.64 and equal to min_π J(π). This finishes the proof.
B Proof of HAMO Is All You Need Lemma
Lemma 2 (HAMO Is All You Need). Let π_old and π_new be joint policies and let i_{1:n} ∈ Sym(n) be an agent permutation. Suppose that, for every state s ∈ S and every m = 1, . . . , n,
[ M^{(π^{i_m}_new)}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s) ≥ [ M^{(π^{i_m}_old)}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s).   (5)
Then, π_new is jointly better than π_old, so that for every state s,
V_{π_new}(s) ≥ V_{π_old}(s).
Proof. Let
D̃_{π_old}(π_new | s) ≜ ∑_{m=1}^n ( ν^{i_m}_{π_old, π^{i_m}_new}(s) / β_{π_old}(s) ) D^{i_m}_{π_old}( π^{i_m}_new | s, π^{i_{1:m−1}}_new ).
Combining this with Lemma 1 gives
E_{a∼π_new}[ A_{π_old}(s, a) ] − D̃_{π_old}(π_new | s)
= ∑_{m=1}^n ( E_{a^{i_{1:m}}∼π^{i_{1:m}}_new}[ A^{i_m}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ] − ( ν^{i_m}_{π_old, π^{i_m}_new}(s) / β_{π_old}(s) ) D^{i_m}_{π_old}( π^{i_m}_new | s, π^{i_{1:m−1}}_new ) )
(by Inequality (5))
≥ ∑_{m=1}^n ( E_{a^{i_{1:m−1}}∼π^{i_{1:m−1}}_new, a^{i_m}∼π^{i_m}_old}[ A^{i_m}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ] − ( ν^{i_m}_{π_old, π^{i_m}_old}(s) / β_{π_old}(s) ) D^{i_m}_{π_old}( π^{i_m}_old | s, π^{i_{1:m−1}}_new ) )
= E_{a∼π_old}[ A_{π_old}(s, a) ] − D̃_{π_old}(π_old | s).
The resulting inequality can be equivalently rewritten as
E_{a∼π_new}[ Q_{π_old}(s, a) ] − D̃_{π_old}(π_new | s) ≥ E_{a∼π_old}[ Q_{π_old}(s, a) ] − D̃_{π_old}(π_old | s),   ∀s ∈ S.   (9)
We use it to prove the claim as follows,
V_{π_new}(s) = E_{a∼π_new}[ Q_{π_new}(s, a) ]
= ( E_{a∼π_new}[ Q_{π_old}(s, a) ] − D̃_{π_old}(π_new | s) ) + D̃_{π_old}(π_new | s) + E_{a∼π_new}[ Q_{π_new}(s, a) − Q_{π_old}(s, a) ]
(by Inequality (9))
≥ ( E_{a∼π_old}[ Q_{π_old}(s, a) ] − D̃_{π_old}(π_old | s) ) + D̃_{π_old}(π_new | s) + E_{a∼π_new}[ Q_{π_new}(s, a) − Q_{π_old}(s, a) ]
= V_{π_old}(s) + D̃_{π_old}(π_new | s) + E_{a∼π_new}[ Q_{π_new}(s, a) − Q_{π_old}(s, a) ]
= V_{π_old}(s) + D̃_{π_old}(π_new | s) + E_{a∼π_new, s′∼P}[ r(s, a) + γ V_{π_new}(s′) − r(s, a) − γ V_{π_old}(s′) ]
= V_{π_old}(s) + D̃_{π_old}(π_new | s) + γ E_{a∼π_new, s′∼P}[ V_{π_new}(s′) − V_{π_old}(s′) ]
≥ V_{π_old}(s) + γ inf_{s′}[ V_{π_new}(s′) − V_{π_old}(s′) ].
Hence,
V_{π_new}(s) − V_{π_old}(s) ≥ γ inf_{s′}[ V_{π_new}(s′) − V_{π_old}(s′) ].
Taking the infimum over s and simplifying,
(1 − γ) inf_s [ V_{π_new}(s) − V_{π_old}(s) ] ≥ 0.
Therefore, inf_s [ V_{π_new}(s) − V_{π_old}(s) ] ≥ 0, which proves the lemma.
C Proof of Theorem 1
Lemma 3. Suppose an agent i_m maximises the expected HAMO,
π^{i_m}_new = arg max_{π^{i_m} ∈ U^{i_m}_{π_old}(π^{i_m}_old)} E_{s∼β_{π_old}}[ [ M^{(π^{i_m})}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s) ].   (10)
Then, for every state s ∈ S,
[ M^{(π^{i_m}_new)}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s) ≥ [ M^{(π^{i_m}_old)}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s).
Proof. We will prove this statement by contradiction. Suppose that there exists s_0 ∈ S such that
[ M^{(π^{i_m}_new)}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s_0) < [ M^{(π^{i_m}_old)}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s_0).   (11)
Let us define the following policy π̂^{i_m}:
π̂^{i_m}(·^{i_m}|s) = π^{i_m}_old(·^{i_m}|s) at s = s_0, and π̂^{i_m}(·^{i_m}|s) = π^{i_m}_new(·^{i_m}|s) at s ≠ s_0.
Note that π̂^{i_m} is (weakly) closer to π^{i_m}_old than π^{i_m}_new at s_0, and at the same distance at other states. Together with π^{i_m}_new ∈ U^{i_m}_{π_old}(π^{i_m}_old), this implies that π̂^{i_m} ∈ U^{i_m}_{π_old}(π^{i_m}_old). Further,
E_{s∼β_{π_old}}[ [ M^{(π̂^{i_m})}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s) ] − E_{s∼β_{π_old}}[ [ M^{(π^{i_m}_new)}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s) ]
= β_{π_old}(s_0) ( [ M^{(π̂^{i_m})}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s_0) − [ M^{(π^{i_m}_new)}_{D^{i_m}, π^{i_{1:m−1}}_new} A_{π_old} ](s_0) ) > 0.
The above contradicts π^{i_m}_new being the maximiser in Equation (10), as π̂^{i_m} is strictly better. The contradiction finishes the proof.
Theorem 1 (The Fundamental Theorem of Heterogeneous-Agent Mirror Learning). Let, for every agent i ∈ N, D^{i,ν} be a HADF, U^i be a neighbourhood operator, and let the sampling distributions β_π depend continuously on π. Let π_0 ∈ Π, and let the sequence of joint policies (π_k)_{k=0}^∞ be obtained by a HAML algorithm induced by D^{i,ν}, U^i, ∀i ∈ N, and β_π. Then, the joint policies induced by the algorithm enjoy the following list of properties:
1. Attain the monotonic improvement property, J(π_{k+1}) ≥ J(π_k),
2. Their value functions converge to a Nash value function V^NE, lim_{k→∞} V_{π_k} = V^NE,
3. Their expected returns converge to a Nash return, lim_{k→∞} J(π_k) = J^NE,
4. Their ω-limit set consists of Nash equilibria.
Proof. Proof of Property 1. It follows from combining Lemmas 2 & 3.
Proof of Properties 2, 3 & 4.
Step 1: convergence of the value function. By Lemma 2, we have that V_{π_k}(s) ≤ V_{π_{k+1}}(s), ∀s ∈ S, and the value function is upper-bounded by V_max. Hence, the sequence of value functions (V_{π_k})_{k∈ℕ} converges. We denote its limit by V*.
Step 2: characterisation of limit points. As the joint policy space Π is bounded, by the Bolzano-Weierstrass theorem, we know that the sequence (π_k)_{k∈ℕ} has a convergent subsequence. Therefore, it has at least one limit point policy. Let π̄ be such a limit point. We introduce an auxiliary notation: for a joint policy π and a permutation i_{1:n}, let HU(π, i_{1:n}) be the joint policy obtained by a HAML update from π along the permutation i_{1:n}.
Claim: For any permutation z_{1:n} ∈ Sym(n),
π̄ = HU(π̄, z_{1:n}).   (12)
Proof of Claim. Suppose, for contradiction, that π̂ = HU(π̄, z_{1:n}) ≠ π̄, and let (π_{k_r})_{r∈ℕ} be a subsequence converging to π̄. Let us recall that the limit value function is unique and denoted as V*. Writing E_{i^{0:∞}_{1:n}}[·] for the expectation operator under the stochastic process (i^k_{1:n})_{k∈ℕ} of update orders, for a state s ∈ S, we have
0 = lim_{r→∞} E_{i^{0:∞}_{1:n}}[ V_{π_{k_r+1}}(s) − V_{π_{k_r}}(s) ]
(as every choice of permutation improves the value function)
≥ lim_{r→∞} P( i^{k_r}_{1:n} = z_{1:n} ) ( V_{HU(π_{k_r}, z_{1:n})}(s) − V_{π_{k_r}}(s) )
= p(z_{1:n}) lim_{r→∞} ( V_{HU(π_{k_r}, z_{1:n})}(s) − V_{π_{k_r}}(s) ).
By the continuity of the expected HAMO (following from the continuity of the value function [8, Appendix A], the HADFs, the neighbourhood operators, and the sampling distribution), we obtain that the first component of HU(π_{k_r}, z_{1:n}), which is π^{z_1}_{k_r+1}, is continuous in π_{k_r} by Berge's Maximum Theorem [1]. Applying this argument recursively for z_2, . . . , z_n, we have that HU(π_{k_r}, z_{1:n}) is continuous in π_{k_r}. Hence, as π_{k_r} converges to π̄, its HU converges to the HU of π̄, which is π̂. Hence, we continue writing the above derivation as
= p(z_{1:n}) ( V_{π̂}(s) − V_{π̄}(s) ) ≥ 0, by Lemma 2.
As s was arbitrary, the state-value function of π̂ is the same as that of π̄: V_{π̂} = V_{π̄}. By the Bellman equation [26], Q(s, a) = r(s, a) + γ E[ V(s′) ], this also implies that their state-action value and advantage functions are the same: Q_{π̂} = Q_{π̄} and A_{π̂} = A_{π̄}. Let m be the smallest integer such that π̂^{z_m} ≠ π̄^{z_m}. This means that π̂^{z_m} achieves a greater expected HAMO than π̄^{z_m}, for which it is zero. Hence,
0 < E_{s∼β_{π̄}}[ [ M^{(π̂^{z_m})}_{D^{z_m,ν}, π̄^{z_{1:m−1}}} A_{π̄} ](s) ]
= E_{s∼β_{π̄}}[ E_{a^{z_{1:m−1}}∼π̄^{z_{1:m−1}}, a^{z_m}∼π̂^{z_m}}[ A^{z_m}_{π̄}(s, a^{z_{1:m−1}}, a^{z_m}) ] − ( ν^{z_m}_{π̄,π̂^{z_m}}(s) / β_{π̄}(s) ) D^{z_m}_{π̄}( π̂^{z_m} | s, π̄^{z_{1:m−1}} ) ]
= E_{s∼β_{π̄}}[ E_{a^{z_{1:m−1}}∼π̄^{z_{1:m−1}}, a^{z_m}∼π̂^{z_m}}[ A^{z_m}_{π̂}(s, a^{z_{1:m−1}}, a^{z_m}) ] − ( ν^{z_m}_{π̄,π̂^{z_m}}(s) / β_{π̄}(s) ) D^{z_m}_{π̄}( π̂^{z_m} | s, π̄^{z_{1:m−1}} ) ]
and, as the expected value of the multi-agent advantage function is zero,
= E_{s∼β_{π̄}}[ − ( ν^{z_m}_{π̄,π̂^{z_m}}(s) / β_{π̄}(s) ) D^{z_m}_{π̄}( π̂^{z_m} | s, π̄^{z_{1:m−1}} ) ] ≤ 0.
This is a contradiction, and so the claim in Equation (12) is proved, and Step 2 is finished.
Step 3: dropping the HADF. Consider an arbitrary limit point joint policy π̄. By Step 2, for any permutation i_{1:n}, considering the first component of the HU, and writing ν^{i_1} for the state distribution ν^{i_1}_{π̄, π^{i_1}},
π̄^{i_1} = arg max_{π^{i_1} ∈ U^{i_1}_{π̄}(π̄^{i_1})} E_{s∼β_{π̄}}[ [ M^{(π^{i_1})}_{D^{i_1}} A_{π̄} ](s) ]   (13)
= arg max_{π^{i_1} ∈ U^{i_1}_{π̄}(π̄^{i_1})} E_{s∼β_{π̄}}[ E_{a^{i_1}∼π^{i_1}}[ A^{i_1}_{π̄}(s, a^{i_1}) ] − ( ν^{i_1}(s) / β_{π̄}(s) ) D^{i_1}_{π̄}( π^{i_1} | s ) ].
As the HADF is non-negative, and at π^{i_1} = π̄^{i_1} its value and all of its Gâteaux derivatives are zero, it follows by Step 3 of the proof of Theorem 1 of [9] that for every s ∈ S,
π̄^{i_1}(·^{i_1}|s) = arg max_{π^{i_1}(·^{i_1}|s) ∈ P(A^{i_1})} E_{a^{i_1}∼π^{i_1}}[ Q^{i_1}_{π̄}(s, a^{i_1}) ].
Step 4: Nash equilibrium. We have proved that π̄ satisfies
π̄^i(·^i|s) = arg max_{π^i(·^i|s) ∈ P(A^i)} E_{a^i∼π^i}[ Q^i_{π̄}(s, a^i) ]
= arg max_{π^i(·^i|s) ∈ P(A^i)} E_{a^i∼π^i, a^{-i}∼π̄^{-i}}[ Q_{π̄}(s, a) ],   ∀i ∈ N, ∀s ∈ S.
Hence, by considering π̄^{-i} fixed, we see that π̄^i satisfies the condition for the optimal policy [26], and hence
π̄^i = arg max_{π^i ∈ Π^i} J(π^i, π̄^{-i}).
Thus, π̄ is a Nash equilibrium. Lastly, this implies that the limit value function corresponds to a Nash value function V^NE, and the limit return corresponds to a Nash return J^NE.
D Casting HAPPO as HAML
The maximisation objective of agent i_m in HAPPO is
E_{s∼ρ_{π_old}, a^{i_{1:m−1}}∼π^{i_{1:m−1}}_new, a^{i_m}∼π^{i_m}_old}[ min( r(π̄^{i_m}) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m}}), clip( r(π̄^{i_m}), 1 ± ε ) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m}}) ) ].
Fixing s and a^{i_{1:m−1}}, we can rewrite it as
E_{a^{i_m}∼π̄^{i_m}}[ A^{i_{1:m}}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ]
− E_{a^{i_m}∼π^{i_m}_old}[ r(π̄^{i_m}) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) − min( r(π̄^{i_m}) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}), clip( r(π̄^{i_m}), 1 ± ε ) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ) ].
By the multi-agent advantage decomposition,
E_{a^{i_m}∼π̄^{i_m}}[ A^{i_{1:m}}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ] = A^{i_{1:m−1}}_{π_old}(s, a^{i_{1:m−1}}) + E_{a^{i_m}∼π̄^{i_m}}[ A^{i_m}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ].
Hence, the presence of the joint advantage of agents i_{1:m} is equivalent to the multi-agent advantage of i_m given a^{i_{1:m−1}} that appears in HAMO. Hence, we only need to show that the subtracted term is an HADF. Firstly, we change min into max with the identity min f(x) = −max[−f(x)],
E_{a^{i_m}∼π^{i_m}_old}[ r(π̄^{i_m}) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) + max( −r(π̄^{i_m}) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}), −clip( r(π̄^{i_m}), 1 ± ε ) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ) ],
which we then simplify to
E_{a^{i_m}∼π^{i_m}_old}[ max( 0, ( r(π̄^{i_m}) − clip( r(π̄^{i_m}), 1 ± ε ) ) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ) ]
= E_{a^{i_m}∼π^{i_m}_old}[ ReLU( ( r(π̄^{i_m}) − clip( r(π̄^{i_m}), 1 ± ε ) ) A^{i_{1:m}}_{π_old}(s, a^{i_{1:m−1}}, a^{i_m}) ) ].
As discussed in the main body of the paper, this is an HADF.
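As a sanity check of the two HADF properties (non-negativity, and vanishing in a region around π̄^{i_m} = π^{i_m}_old), the term above can be evaluated pointwise. The following sketch assumes sampled likelihood ratios and advantages given as plain tensors; it is illustrative only.

import torch

def happo_drift_term(ratio, joint_adv, eps=0.2):
    # ReLU((r - clip(r, 1-eps, 1+eps)) * A): zero whenever the ratio stays inside the clip
    # range (in particular at ratio = 1), and non-negative everywhere else.
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    return torch.relu((ratio - clipped) * joint_adv)

ratio = torch.tensor([1.0, 1.1, 1.5, 0.5])
adv = torch.tensor([1.0, -2.0, 3.0, 1.0])
print(happo_drift_term(ratio, adv))              # tensor([0.0000, 0.0000, 0.9000, 0.0000])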
E Algorithms
Algorithm 2: HAA2C
Input: stepsize α, batch size B, number of: agents n, episodes K, steps per episode T, mini-epochs e;
Initialize: the critic network φ, the policy networks {θ^i}_{i∈N}, the replay buffer B;
for k = 0, 1, . . . , K − 1 do
    Collect a set of trajectories by letting the agents act according to their policies, a^i ∼ π^i_{θ^i}(·^i|o^i);
    Push the transitions {(o^i_t, a^i_t, o^i_{t+1}, r_t), ∀i ∈ N, t ≤ T} into B;
    Sample a random minibatch of B transitions from B;
    Estimate the returns R and the advantage function Â(s, a), using V̂_φ and GAE;
    Draw a permutation of agents i_{1:n} at random;
    Set M^{i_1}(s, a) = Â(s, a);
    for agent i_m = i_1, . . . , i_n do
        Set π^{i_m}_0(a^{i_m}|o^{i_m}) = π^{i_m}_{θ^{i_m}}(a^{i_m}|o^{i_m});
        for mini-epoch = 1, . . . , e do
            Compute agent i_m's policy gradient
            g^{i_m} = ∇_{θ^{i_m}} (1/B) ∑_{b=1}^B M^{i_m}(s_b, a_b) · π^{i_m}_{θ^{i_m}}(a^{i_m}_b|o^{i_m}_b) / π^{i_m}_0(a^{i_m}_b|o^{i_m}_b);
            Update agent i_m's policy by θ^{i_m} = θ^{i_m} + α g^{i_m};
        Compute M^{i_{m+1}}(s, a) = ( π^{i_m}_{θ^{i_m}}(a^{i_m}|o^{i_m}) / π^{i_m}_0(a^{i_m}|o^{i_m}) ) M^{i_m}(s, a)  // unless m = n;
    Update the critic by gradient descent on (1/B) ∑_b ( V̂_φ(s_b) − R_b )^2;
Discard φ. Deploy {θ^i}_{i∈N} in execution;
Algorithm 3: HADDPG
Input: stepsize α, Polyak coefficient τ, batch size B, number of: agents n, episodes K, steps per episode T, mini-epochs e;
Initialize: the critic networks φ and φ′, the policy networks {θ^i}_{i∈N}, the replay buffer B, random processes {X^i}_{i∈N} for exploration;
for k = 0, 1, . . . , K − 1 do
    Collect a set of transitions by letting the agents act according to their deterministic policies with the exploratory noise, a^i = μ^i_{θ^i}(o^i) + X^i_t;
    Push the transitions {(o^i_t, a^i_t, o^i_{t+1}, r_t), ∀i ∈ N, t ≤ T} into B;
    Sample a random minibatch of B transitions from B;
    Compute the critic targets y_t = r_t + γ Q_{φ′}(s_{t+1}, a_{t+1});
    Update the critic by minimising the loss, φ = arg min_φ (1/B) ∑_t ( y_t − Q_φ(s_t, a_t) )^2;
    Draw a permutation of agents i_{1:n} at random;
    for agent i_m = i_1, . . . , i_n do
        Update agent i_m by solving
        θ^{i_m} = arg max_{θ̂^{i_m}} (1/B) ∑_t Q_φ( s_t, μ^{i_{1:m−1}}_{θ^{i_{1:m−1}}}(o^{i_{1:m−1}}_t), μ^{i_m}_{θ̂^{i_m}}(o^{i_m}_t), a^{i_{m+1:n}}_t )
        with e mini-epochs of deterministic policy gradient ascent;
    Update the target critic network smoothly, φ′ = τφ + (1 − τ)φ′;
Discard φ. Deploy {θ^i}_{i∈N} in execution;
F Experiments
F.1 Compute resources
We used one internal compute server consisting of 6 RTX 3090 cards and 112 CPUs; however, each model is trained on at most one card.
F.2 Hyperparameters
We implement MAA2C and HAA2C based on the HAPPO/HATRPO codebase [8]. We list the hyperparameters we use for SMAC in Table 1 and for MuJoCo in Table 2.
Table 1: Common hyperparameters used in the SMAC domain.
hyperparameters value hyperparameters value hyperparameters value
critic lr 5e-4 optimizer Adam stacked-frames 1
gamma 0.99 optim eps 1e-5 batch size 3200
gain 0.01 hidden layer 1 training threads 64
actor network mlp num mini-batch 1 rollout threads 8
hypernet embed 64 max grad norm 10 episode length 400
activation ReLU hidden layer dim 64 use huber loss True
Table 2: Common hyperparameters used for MAA2C-NS, MAA2C-S and HAA2C in the Multi-Agent MuJoCo.
hyperparameters value hyperparameters value hyperparameters value
critic lr 1e-3 optimizer Adam num mini-batch 1
gamma 0.99 optim eps 1e-5 batch size 4000
gain 0.01 hidden layer 1 training threads 8
std y coef 0.5 actor network mlp rollout threads 4
std x coef 1 max grad norm 10 episode length 1000
activation ReLU hidden layer dim 64 eval episode 32
In addition to these common hyperparameters, we set the number of mini-epochs for HAA2C to 5. We set the actor learning rate to 2e-4 for HalfCheetah and Ant, and to 1e-4 for Walker2d.
We implement MADDPG and HADDPG based on the Tianshou framework [28]. We list the hyperparameters we use in Tables 3 and 4.
Table 3: Hyper-parameter used for MADDPG/HADDPG in the Multi-Agent MuJoCo domain
hyperparameters value hyperparameters value hyperparameters value
actor lr 3e-4 optimizer Adam replay buffer size 1e6
critic lr 1e-3 exploration noise 0.1 batch size 1000
gamma 0.99 step-per-epoch 50000 training num 20
tau 0.1 step-per-collector 2000 test num 10
start-timesteps 25000 update-per-step 0.025 epoch 200
hidden-sizes [64, 64] episode length 1000
Table 4: Parameter n-step used for MADDPG/HADDPG in the Multi-Agent MuJoCo
task value task value task value
Reacher (2 × 1) 5 Hopper (3 × 1) 20 Walker (3 × 2) 5
Ant (4 × 2) 20 Swimmer (2 × 1) 5 Humanoid (9|8) 5