Merged

212 commits
5fa938e
various changes
HenriDeh Feb 17, 2022
fbe4bc7
improve SAC docstring
HenriDeh Feb 18, 2022
f30ca9d
SAC: add target network and start policy defaults
HenriDeh Feb 18, 2022
aab6b5d
make GaussianNetwork normalizer customizable
HenriDeh Feb 18, 2022
98f1633
create MPOPolicy struct
HenriDeh Feb 21, 2022
d4d4b9c
Merge branch 'master' of https://github.com/HenriDeh/ReinforcementLea…
HenriDeh Feb 22, 2022
fab1b83
continue mpo
HenriDeh Feb 25, 2022
dae8af0
make GaussianNetwork normalizer customizable
HenriDeh Feb 18, 2022
967a7cc
add multiple action samples per state
HenriDeh Feb 25, 2022
1d50b26
Merge branch 'GaussianNets'
HenriDeh Feb 25, 2022
391cd25
Add MPO algorithm
HenriDeh Feb 25, 2022
99d5c89
custom normalizer and multi action sampling
HenriDeh Feb 25, 2022
78ac2b8
Complete docs on gaussian normalizer
HenriDeh Feb 25, 2022
857987d
Upgrade manifest format and update dependecies
HenriDeh Mar 3, 2022
1cc3c35
add default initializer
HenriDeh Mar 3, 2022
027d0b3
Fix logp_pi
HenriDeh Mar 3, 2022
075b1a8
add convenience function
HenriDeh Mar 3, 2022
fefd141
add unit tests for GaussianNetwork
HenriDeh Mar 3, 2022
4d488b8
use isapprox in tests
HenriDeh Mar 4, 2022
e24854f
Update src/ReinforcementLearningCore/src/policies/q_based_policies/le…
findmyway Mar 4, 2022
eb2008b
add unknown words
findmyway Mar 4, 2022
f7348da
Rand directly on the device for all NN approx
HenriDeh Mar 4, 2022
76fb288
fix CUDA functional
HenriDeh Mar 4, 2022
6d627b6
Fix CUDA rand test
HenriDeh Mar 4, 2022
ef06ba4
add CURAND rng
HenriDeh Mar 4, 2022
a8c5fa1
Merge branch 'master' into GaussianNets
findmyway Mar 4, 2022
6c919a7
logdet of matrix from L decomposition
HenriDeh Mar 10, 2022
24ee6de
computing logpdf given cholesky of Covariance
HenriDeh Mar 10, 2022
4fe1f61
add CovGaussianNetwork
HenriDeh Mar 10, 2022
5d163aa
add unit tests
HenriDeh Mar 10, 2022
0cc6d2e
disallow scalar indexing in tests
HenriDeh Mar 10, 2022
76cb606
Merge branch 'master' into CovGaussianNet
HenriDeh Mar 10, 2022
2cea000
Merge branch 'CovGaussianNet'
HenriDeh Mar 10, 2022
2ec382d
Merge branch 'master' into mpo
HenriDeh Mar 10, 2022
05e0659
add 2D input compatibility
HenriDeh Mar 10, 2022
c86869e
Fix and add tests
HenriDeh Mar 10, 2022
2a94983
add unknown words
HenriDeh Mar 10, 2022
99c29b6
stabilize test
HenriDeh Mar 10, 2022
25a3c50
Add failing tests
HenriDeh Mar 11, 2022
936b1c3
Fix gradient of GaussianNetwork
HenriDeh Mar 11, 2022
88b9749
add more tests
HenriDeh Mar 11, 2022
4dd7dc8
remove a problamatic newline
HenriDeh Mar 11, 2022
eb77b64
Merge branch 'GN_gradient'
HenriDeh Mar 11, 2022
6da3523
Merge branch 'master' into mpo
HenriDeh Mar 11, 2022
e0f5028
Update src/ReinforcementLearningCore/src/policies/q_based_policies/le…
HenriDeh Mar 13, 2022
1e129ae
Update src/ReinforcementLearningCore/src/policies/q_based_policies/le…
HenriDeh Mar 13, 2022
5daeee2
rename vec_to_tril
HenriDeh Mar 14, 2022
09e62c5
use similar zero constructor
HenriDeh Mar 14, 2022
5dfe9fe
remove CUDA specific method
HenriDeh Mar 14, 2022
5a4881c
Merge branch 'CovGaussianNet' of https://github.com/HenriDeh/Reinforc…
HenriDeh Mar 14, 2022
72d2040
spacing
HenriDeh Mar 14, 2022
cc3452a
rename all to vec_to_tril
HenriDeh Mar 14, 2022
f0ad02d
Fix NaN problem with vec_to_tril
HenriDeh Mar 14, 2022
8b7569d
Merge branch 'master' into CovGaussianNet
HenriDeh Mar 14, 2022
523498c
Merge branch 'CovGaussianNet'
HenriDeh Mar 14, 2022
043dea5
Merge branch 'master' into mpo
HenriDeh Mar 14, 2022
b9df23f
typo
HenriDeh Mar 14, 2022
ed8eec9
Return L instead of Sigma
HenriDeh Mar 14, 2022
743c0a8
Change tests accordingly
HenriDeh Mar 14, 2022
40b2660
Merge branch 'CovGaussianNet'
HenriDeh Mar 15, 2022
95f708c
Merge branch 'master' into mpo
HenriDeh Mar 15, 2022
c1f79d8
change comment
HenriDeh Mar 16, 2022
89d144c
revise struct
HenriDeh Mar 17, 2022
08137c5
add constructor
HenriDeh Mar 17, 2022
00cc3f1
simplify call
HenriDeh Mar 17, 2022
b58efea
update in two functions. full et diag gaussians
HenriDeh Mar 17, 2022
5c51bd8
Merge branch 'master' into mpo
HenriDeh Mar 17, 2022
8e3e5aa
Revert "SAC: add target network and start policy defaults"
HenriDeh Mar 18, 2022
c1f50f3
Merge branch 'mpo' of https://github.com/HenriDeh/ReinforcementLearni…
HenriDeh Mar 18, 2022
d2d0132
Revert "Revert "SAC: add target network and start policy defaults""
HenriDeh Mar 18, 2022
b302a76
move update step to policy call
HenriDeh Mar 18, 2022
132c225
Revert "remove a problamatic newline"
HenriDeh Mar 18, 2022
de32919
Revert "remove a problamatic newline"
HenriDeh Mar 18, 2022
1f791b6
Merge branch 'mpo' of https://github.com/HenriDeh/ReinforcementLearni…
HenriDeh Mar 18, 2022
7a0742e
remove assignement in struct
HenriDeh Mar 18, 2022
288f385
Move Optim to Zoo
HenriDeh Mar 18, 2022
463f659
include mpo
HenriDeh Mar 18, 2022
81ceb7a
remove comas
HenriDeh Mar 18, 2022
e371da3
updgrade manifest
HenriDeh Mar 18, 2022
22cf087
fix constructor
HenriDeh Mar 18, 2022
102656b
make CovGaussian compatible with vec state
HenriDeh Mar 18, 2022
ccb62f5
remove max entropy
HenriDeh Mar 18, 2022
ce557df
more changes
HenriDeh Mar 21, 2022
9bf34a3
add functor and gpu utils
HenriDeh Mar 21, 2022
486a74b
new manifest
HenriDeh Mar 22, 2022
b351a75
switch to batch sampler, loop inside update calls
HenriDeh Mar 22, 2022
1f4d8d2
Use a smarter ldiv solve
HenriDeh Mar 22, 2022
3aa003f
use a mean instead of sum
HenriDeh Mar 22, 2022
43d8289
return mu and Sigma in identical shapes
HenriDeh Mar 22, 2022
c3431ee
make kldiv gpu friendly
HenriDeh Mar 23, 2022
84117b8
fix a typo
HenriDeh Mar 23, 2022
03a4468
Move kldivergences
HenriDeh Mar 23, 2022
c619eb8
remove scalar indexing
HenriDeh Mar 24, 2022
f232b81
export kldiv
HenriDeh Mar 24, 2022
e189127
fix losses
HenriDeh Mar 24, 2022
bbf1fcd
add grad checks
HenriDeh Mar 24, 2022
c8023b3
fix parenthesis
HenriDeh Mar 24, 2022
1db38c7
add a testmode
HenriDeh Mar 25, 2022
3b6e780
add logger
HenriDeh Mar 25, 2022
87e5b6c
Change dual to univariate
HenriDeh Mar 25, 2022
4a7e42a
change CUDA logdet
HenriDeh Mar 29, 2022
7e770fb
improve mvnormlogpdf
HenriDeh Mar 29, 2022
6d45dcd
add is_return_log_prob
HenriDeh Mar 29, 2022
979d5ea
log_prob is useless
HenriDeh Mar 29, 2022
f06b179
some fixes for diag norm
HenriDeh Mar 29, 2022
c108d69
return logprob fix
HenriDeh Mar 29, 2022
dcc90c8
Create rewardnormalizer.jl
HenriDeh Mar 29, 2022
b668608
inlcude and export
HenriDeh Mar 29, 2022
b41c40e
Fix NaN
HenriDeh Mar 29, 2022
6b762cd
comment
HenriDeh Mar 29, 2022
256cc9e
typo
HenriDeh Mar 29, 2022
4bbd01a
Merge branch 'reward_normalizer' into mpo
HenriDeh Mar 29, 2022
55743ad
rename file
HenriDeh Mar 30, 2022
9d670a1
add an exponential MA
HenriDeh Mar 30, 2022
24afb4b
actually move to Zoo
HenriDeh Mar 30, 2022
db5b3cf
Merge branch 'reward_normalizer' into mpo
HenriDeh Mar 30, 2022
b51c948
add reward normalizer
HenriDeh Mar 30, 2022
82f7383
fix identity normalizer
HenriDeh Mar 30, 2022
79a7623
use logexpfunctions
HenriDeh Mar 31, 2022
b244f78
Add a categorical Network
HenriDeh Apr 26, 2022
7f42d55
include
HenriDeh Apr 28, 2022
b9eeb85
Merge branch 'master' into mpo
HenriDeh Jun 23, 2022
48513a7
Merge branch 'master' into mpo
HenriDeh Jun 23, 2022
173ddb8
implement optimise! and remove normalizer
HenriDeh Jun 23, 2022
23c91a8
single batch update_policy
HenriDeh Jun 23, 2022
e839c70
remove batch_sampler
HenriDeh Jun 24, 2022
52e81b5
remove update args and batch sizes
HenriDeh Jun 24, 2022
684856b
remove reward normalizer
HenriDeh Jun 24, 2022
1565fd6
delete reward normalizer
HenriDeh Jun 24, 2022
b426e9b
Merge branch 'master' into discretenetwork
HenriDeh Jun 24, 2022
056ce8a
add test
HenriDeh Jun 24, 2022
56591f0
remove rn include
HenriDeh Jun 27, 2022
97fcdf0
add logdetLorU back
HenriDeh Jun 28, 2022
48c43a5
add dependencies
HenriDeh Jun 28, 2022
1938b74
use Approximator
HenriDeh Jun 28, 2022
c2ab6cb
incldue mpo
HenriDeh Jun 28, 2022
df34538
fix doc underscores
HenriDeh Jun 28, 2022
ff5741a
move to networks.jl
HenriDeh Jun 30, 2022
c9baf73
add back tests for other networks
HenriDeh Jun 30, 2022
400dd5e
Merge branch 'master' into discretenetwork
HenriDeh Jun 30, 2022
f4574f6
Merge branch 'master' into discretenetwork
HenriDeh Jul 1, 2022
a009d46
fixes and add tests
HenriDeh Jul 1, 2022
99a3196
fix typo
HenriDeh Jul 1, 2022
25ee90d
fix ci ?
HenriDeh Jul 4, 2022
83c196c
add action_masking
HenriDeh Jul 4, 2022
166e158
Merge branch 'JuliaReinforcementLearning:master' into mpo
HenriDeh Jul 5, 2022
38b3866
Merge branch 'master' into mpo
HenriDeh Jul 5, 2022
6ff1d8f
restore CovGaussianNetwork tests
HenriDeh Jul 5, 2022
28a7043
rename logits kwarg to log_prob
HenriDeh Jul 5, 2022
dbfb8ae
updating flux api
HenriDeh Jul 6, 2022
1ee8012
add diagnormlogpdf
HenriDeh Jul 18, 2022
c56405a
kldiv
HenriDeh Jul 19, 2022
352e7c0
finalize losses
HenriDeh Jul 19, 2022
4fdb93a
manifest update
HenriDeh Jul 22, 2022
b2d26ed
add kwarg propagation to approximator
HenriDeh Jul 22, 2022
6e16447
fix a few things
HenriDeh Jul 22, 2022
d3f16e1
fit to paper style
HenriDeh Jul 27, 2022
1784f99
Merge branch 'mpo' of https://github.com/HenriDeh/ReinforcementLearni…
HenriDeh Jul 27, 2022
63ad5e8
fix networks normalizer
HenriDeh Jul 28, 2022
2041aa7
use paper looping logic
HenriDeh Jul 28, 2022
dd90147
add experiment
HenriDeh Jul 28, 2022
221d915
Merge branch 'mpo' of https://github.com/HenriDeh/ReinforcementLearni…
HenriDeh Jul 28, 2022
2ff354f
update dependency
findmyway Jul 28, 2022
d6c85f6
move normalizer in gaussians
HenriDeh Jul 28, 2022
a65058c
add eta caching
HenriDeh Jul 28, 2022
3444b1d
update exp
HenriDeh Jul 28, 2022
a9e35be
Merge branch 'mpo' of https://github.com/HenriDeh/ReinforcementLearni…
HenriDeh Jul 28, 2022
6ae59b3
Merge pull request #65 from findmyway/mpo
HenriDeh Jul 28, 2022
36a887d
Revert "add eta caching"
HenriDeh Jul 28, 2022
513ff66
environment
HenriDeh Jul 28, 2022
156a72e
finishing stuff
HenriDeh Dec 12, 2022
8159580
Merge branch 'gumbelsoftmax' into discretenetwork
HenriDeh Dec 12, 2022
45faa27
Merge branch 'discretenetwork' into mpo
HenriDeh Dec 12, 2022
5887cab
Merge branch 'master' into mpo
HenriDeh Dec 12, 2022
dde1692
adding docstrings
HenriDeh Dec 12, 2022
6fe1e85
Merge branch 'master' into mpo
HenriDeh Dec 16, 2022
bbc986f
Merge branch 'master' into mpo
HenriDeh Dec 16, 2022
69ff077
Merge branch 'master' into mpo
HenriDeh Dec 19, 2022
e06554e
add missing Nothing argument
HenriDeh Dec 19, 2022
80a0ee2
add MPO experiments
HenriDeh Dec 19, 2022
b4ddc8a
add new deps for the 36th time
HenriDeh Dec 19, 2022
f977e05
import Random
HenriDeh Dec 19, 2022
07ff000
remove normalizers again
HenriDeh Dec 19, 2022
6d349cc
train qnetworks with different batches
HenriDeh Dec 19, 2022
f20b39c
fix CUDA deps
HenriDeh Dec 19, 2022
5dbf6c9
Make a doc page
HenriDeh Dec 19, 2022
e99b78a
fix spelling
HenriDeh Dec 19, 2022
8deaee8
remove normalizer in tests
HenriDeh Dec 19, 2022
708f040
fix typo
HenriDeh Dec 19, 2022
c585fc3
"fix" spelling
HenriDeh Dec 19, 2022
b726371
unfix DQN
HenriDeh Dec 19, 2022
aab75c0
Attempting to solve CI
HenriDeh Dec 20, 2022
aca37df
add missing dep
HenriDeh Dec 20, 2022
d25e0e9
add my name to cspell
HenriDeh Dec 20, 2022
138f847
trying again
HenriDeh Dec 20, 2022
33a074e
up compat for trajectories
HenriDeh Dec 20, 2022
0e3e0af
exp
HenriDeh Dec 20, 2022
7be5c74
Merge branch 'master' into mpo
HenriDeh Dec 20, 2022
2de4b3c
add missing pkg dep
HenriDeh Dec 20, 2022
ca6e3d5
fix doc
HenriDeh Dec 20, 2022
1c48bac
update tutorial
HenriDeh Dec 21, 2022
616bc04
update compat for trajectories
HenriDeh Dec 21, 2022
0e8944e
use rng
HenriDeh Dec 21, 2022
a5c7474
change default HPs
HenriDeh Dec 21, 2022
2c78e86
ci fix maybe
HenriDeh Dec 21, 2022
d7382d7
add plots
HenriDeh Dec 21, 2022
f4d44b2
use ignore_derivatives()
HenriDeh Dec 21, 2022
134060e
fix runtests and tangle
HenriDeh Dec 21, 2022
913feb7
fix runtests
HenriDeh Dec 21, 2022
ebc4aa8
fix devmode
HenriDeh Dec 21, 2022
5147182
fix ci
HenriDeh Dec 21, 2022
970128d
correct a mistake in doc
HenriDeh Dec 21, 2022
3 changes: 2 additions & 1 deletion .cspell/cspell.json
@@ -180,7 +180,8 @@
"rsold",
"rsnew",
"unnormalized",
"baedan"
"baedan",
"Dehaybe"
],
"ignoreWords": [],
"minWordLength": 5,
10 changes: 9 additions & 1 deletion .cspell/julia_words.txt
@@ -5284,4 +5284,12 @@ inworld
Posteriori
normalised
kldivergence
devmode
qnetworks
mpodual
lagrangeμ
mvnormkldivergence
diagnormkldivergence
normkldivergence
sqmahal
logdpf
devmode
4 changes: 2 additions & 2 deletions .github/workflows/ci.yml
@@ -21,7 +21,7 @@ jobs:
fail-fast: false
matrix:
version:
- '1.6'
- '1.8'
- '1'
os:
- ubuntu-latest
@@ -157,7 +157,7 @@ jobs:
# - run: python -m pip install --user matplotlib
# - uses: julia-actions/setup-julia@v1
# with:
# version: '1.6'
# version: '1.8'
# - name: Build homepage
# run: |
# cd docs/homepage
1 change: 1 addition & 0 deletions Project.toml
@@ -4,6 +4,7 @@ authors = ["Johanni Brea <[email protected]>", "Jun Tian <tianjun.c
version = "0.11.0"

[deps]
Pkg = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f"
Reexport = "189a3867-3050-52da-a836-e630ba90ab69"
ReinforcementLearningBase = "e575027e-6cd6-5018-9292-cdc6200d2b44"
ReinforcementLearningCore = "de1b191a-4ae0-4afa-a27b-92d07f46b2d6"
3 changes: 3 additions & 0 deletions docs/make.jl
@@ -51,6 +51,9 @@ makedocs(
"Which algorithm should I use?" => "Which_algorithm_should_I_use.md",
"Episodic vs. Non-episodic environments" => "non_episodic.md",
],
"Zoo Algorithms" => [
"MPO" => "src/Zoo Algorithms/MPO.md"
],
"FAQ" => "FAQ.md",
experiments,
"Tips for Developers" => "tips.md",
121 changes: 121 additions & 0 deletions docs/src/Zoo Algorithms/MPO.md
@@ -0,0 +1,121 @@
# Maximum a Posteriori Policy Optimization

ReinforcementLearningZoo provides an implementation of the Maximum a Posteriori Policy Optimization (MPO) algorithm. This algorithm was initially proposed by [Abdolmaleki et al. (2018)](https://arxiv.org/abs/1806.06920) and is further detailed in a [subsequent paper](https://arxiv.org/abs/1812.02256). This implementation is not identical to that of the paper, for several reasons that we will detail later. The purpose of this page is to guide an RLZoo user through the creation of an experiment that uses the MPO algorithm. We will recreate [one of the three experiments](../../../src/ReinforcementLearningExperiments/deps/experiments/experiments/Policy%20Gradient/JuliaRL_MPO_CartPole.jl) available in RLExperiments.jl.

The implementation of MPO comes in three forms (one for each CartPole experiment):

- With a Categorical Actor (for discrete action spaces)
- With a Diagonal Gaussian (the standard actor for continuous action spaces in RL)
- With a Full Gaussian (which can learn a covariance between the different action dimensions)

The latter is the approach used in the paper for continuous actions. It is implemented but is very slow on a GPU at the moment. Although more expressive, it may not be worth the extra computation time.

## Learning a continuous Cartpole policy
First, we instantiate the environment from the package `ReinforcementLearningEnvironments`. We wrap it into an `ActionTransformedEnv` with a `tanh` mapping to constrain the action to [-1, 1].

```julia
using ReinforcementLearning, Flux

env = ActionTransformedEnv(CartPoleEnv(continuous = true), action_mapping = x->tanh(only(x)))
```

Because we want our experiment to be reproducible, we also use a seed.

```julia
using Random
Random.seed!(123)
```

Then we instantiate an `MPOPolicy`:
```julia
policy = MPOPolicy(
actor = Approximator(GaussianNetwork(
Chain(Dense(4, 64, tanh), Dense(64,64,tanh)),
Dense(64, 1),
Dense(64, 1)), ADAM(3f-4)),
qnetwork1 = Approximator(Chain(Dense(5, 64, gelu), Dense(64,64,gelu), Dense(64,1)), ADAM(3f-4)),
qnetwork2 = Approximator(Chain(Dense(5, 64, gelu), Dense(64,64,gelu), Dense(64,1)), ADAM(3f-4)),
action_sample_size = 32,
ϵμ = 0.1f0,
ϵΣ = 1f-2,
ϵ = 0.1f0)
```
`MPOPolicy` needs an actor that is an `Approximator`; we use a deep neural network and the `ADAM` optimiser from the `Flux.jl` package. Notice that the NN is a `GaussianNetwork` made of three parts. The first is a common body with an input size equal to the length of the state of the environment (4 in this case). Then we have two "heads", one for the mean of the Gaussian policy and one for the standard deviation. With a `GaussianNetwork`, both heads must have the same output size (the size of the action vectors, 1 in this case) and no activation at the output layers.

Then we have `qnetwork1` and `qnetwork2`. This implementation of MPO uses twin Q-networks with targets. Both must be `Approximator`s, but they do not necessarily need the same architecture. The input size should be the size of the state plus the size of the action (5 here), and the output size must be 1. The original MPO paper uses the Retrace algorithm instead of 1-step TD to train the critics; Retrace is currently not implemented in RL.jl.

`MPOPolicy` has several keyword arguments in its constructor. We omit the least important ones here (those that are not specific to MPO); you can see them using `?MPOPolicy` in the REPL.

- `action_sample_size` is the number of actions sampled for each state during the E-step of the algorithm ($K$ in the second paper).
- `ϵ` is the maximum KL divergence between the E-step variational distribution and the current policy.
- `ϵμ` is the maximum KL divergence between the updated policy at the M-step and the current policy, with respect to the mean of the Gaussian.
- `ϵΣ` is the maximum KL divergence between the updated policy at the M-step and the current policy, with respect to the standard deviation of the Gaussian. It should typically be lower than `ϵμ` to ensure the standard deviation does not shrink to 0 before the mean settles around its optimum (both constraints are written out just after this list).
- `α_scale = 1f0` and `αΣ_scale = 100f0` are the gradient-descent learning rates for the Lagrange penalties on the mean and covariance, respectively. We leave them at their default values here.
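
For Gaussian policies, these two constraints correspond to the decoupled KL terms of the MPO paper. As a sketch in the paper's notation, with $\mu_k, \Sigma_k$ the parameters of the current policy and $d$ the action dimension:

```math
\frac{1}{2}(\mu - \mu_k)^\top \Sigma_k^{-1} (\mu - \mu_k) \le \epsilon_\mu,
\qquad
\frac{1}{2}\left(\operatorname{tr}\left(\Sigma^{-1}\Sigma_k\right) - d + \ln\frac{\det\Sigma}{\det\Sigma_k}\right) \le \epsilon_\Sigma .
```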

The next step is to wrap this policy into an `Agent`. An agent is a combination of a policy and a `Trajectory`. We will use the following trajectory.

```julia
trajectory = Trajectory(
CircularArraySARTTraces(capacity = 1000, state = Float32 => (4,),action = Float32 => (1,)),
MetaSampler(
actor = MultiBatchSampler(BatchSampler{(:state,)}(32), 10),
critic = MultiBatchSampler(BatchSampler{SS′ART}(32), 1000)
),
InsertSampleRatioController(ratio = 1/1000, threshold = 1000)
)
```

MPO needs to store `SART` traces, i.e. State-Action-Reward-Terminal-NextState. Here we use a fixed-size buffer with a capacity of 1000 steps. Then we specify the sampler. MPO needs a specific type of sampler called a `MetaSampler`. A `MetaSampler` contains several named samplers, here one named `:actor` and the other `:critic`. As you may have guessed, one samples batches to update the actor and the other to update the critic (the Q-networks). You must use these exact names. Each sampler must be a `MultiBatchSampler`, which samples multiple batches so that the networks can be updated for several iterations. Here we update the critic 1000 times but the actor only 10 times. The actor sampler only needs to sample `(:state,)` traces; the critic needs the `SS′ART` traces to perform the 1-step TD update on the `qnetwork`s. Here we sample batches of 32 transitions; this is of course a hyperparameter that you can tune to your liking.
Finally, we configure the `InsertSampleRatioController`. We start sampling to update the networks once we have inserted `threshold = 1000` transitions in the buffer (that is, when the buffer is full). You can choose another value, but it does not make sense to pick one that is larger than the capacity of the buffer. The `ratio` defines how many environment steps are taken between each sample call. In this case, we take 1000 steps to collect data before sampling and updating the networks.

To summarize, with this setup the algorithm will perform the following (a rough sketch in plain Julia follows the list):
1. Interact 1000 times with the environment to fill the buffer.
2. Sample 1000 batches of 32 state-action-reward-terminal-next_state transitions.
3. Update each qnetwork 500 times, once with each batch.
4. Sample 10 batches of 32 states.
5. Update the actor 10 times.
6. Perform 1000 new steps with the updated policy, replacing the old transitions in the buffer.
7. Unless the stopping criterion is met, go back to 2.
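
For intuition, here is a minimal, self-contained sketch of that cycle in plain Julia. It does not use the actual RL.jl types: `collect_step!`, `sample_critic_batch`, `sample_actor_batch`, `update_critic!` and `update_actor!` are hypothetical placeholders standing in for what `Agent`, `Trajectory` and `MPOPolicy` do internally, and the controller condition is only an approximation of `InsertSampleRatioController`.

```julia
# Hypothetical stubs so the sketch runs on its own; in a real experiment this work
# is done by Agent, Trajectory and MPOPolicy.
collect_step!() = nothing
sample_critic_batch() = nothing
sample_actor_batch() = nothing
update_critic!(batch) = nothing
update_actor!(batch) = nothing

function training_cycle!(total_steps; threshold = 1000, ratio = 1/1000,
                         n_critic_batches = 1000, n_actor_batches = 10)
    inserted = 0  # transitions inserted into the buffer so far
    rounds = 0    # sampling/update rounds performed so far
    for _ in 1:total_steps
        collect_step!()  # interact once with the environment and insert a transition
        inserted += 1
        # Roughly one update round per 1/ratio insertions, once `threshold` transitions exist.
        if inserted >= threshold && rounds < (inserted - threshold + 1) * ratio
            for _ in 1:n_critic_batches
                update_critic!(sample_critic_batch())  # 1-step TD update of the twin Q-networks
            end
            for _ in 1:n_actor_batches
                update_actor!(sample_actor_batch())    # E-step + M-step on a batch of states
            end
            rounds += 1
        end
    end
end

training_cycle!(50_000)
```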

We can now create the agent, and run the experiment for 50,000 steps:
```julia
agent = Agent(policy = policy, trajectory = trajectory)
stop_condition = StopAfterStep(50_000, is_show_progress=true)
hook = TotalRewardPerEpisode()
run(agent, env, stop_condition, hook)
```

This should take a couple of minutes on a recent CPU. You can plot the result, for example with UnicodePlots:
```julia
using UnicodePlots
lineplot(hook.episodes, hook.mean_rewards, xlabel="episode", ylabel="mean episode reward", title = "Cartpole Continuous Action Space")
```

### Learning on a GPU

If you have a CUDA-compatible GPU, you can accelerate your experiments by transferring the neural networks to it. `MPOPolicy` comes with a method for the `gpu` function from the `Flux` package.

```julia
using CUDA

policy = gpu(policy) #Recreate a new policy if you already trained it.
agent = Agent(policy = policy, trajectory = trajectory)
stop_condition = StopAfterStep(50_000, is_show_progress=true)
hook = TotalRewardPerEpisode()
run(agent, env, stop_condition, hook) #Using the GPU is slower in this case because the NN and the batch size are small.
```

## Learning a discrete Cartpole policy

Using MPO with a discrete action space only requires a few simple changes (a sketch of the modified setup follows the list):
1. Instantiate the environment with `continuous = false`.
2. Instead of using a `GaussianNetwork`, use a `CategoricalNetwork`.
3. The action is now a one-hot vector of length two, because the action size is 2.
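
A minimal sketch of what that setup might look like. This is an assumption-laden illustration, not the exact experiment from RLExperiments: it assumes `CategoricalNetwork` simply wraps a model producing one logit per discrete action (check `?CategoricalNetwork` for the exact constructor), that the Q-networks take the state concatenated with the one-hot action as input (4 + 2 = 6), and that the categorical variant of `MPOPolicy` only needs `ϵ` (see `?MPOPolicy` for the exact keywords).

```julia
env = CartPoleEnv(continuous = false)  # two discrete actions

policy = MPOPolicy(
    # Assumed constructor: CategoricalNetwork wrapping a model with one logit per action.
    actor = Approximator(CategoricalNetwork(
        Chain(Dense(4, 64, tanh), Dense(64, 64, tanh), Dense(64, 2))), ADAM(3f-4)),
    # Q-networks now take state (4) + one-hot action (2) = 6 inputs.
    qnetwork1 = Approximator(Chain(Dense(6, 64, gelu), Dense(64, 64, gelu), Dense(64, 1)), ADAM(3f-4)),
    qnetwork2 = Approximator(Chain(Dense(6, 64, gelu), Dense(64, 64, gelu), Dense(64, 1)), ADAM(3f-4)),
    action_sample_size = 32,
    ϵ = 0.1f0)

# The trajectory's action trace should now match the one-hot length (2,):
trajectory = Trajectory(
    CircularArraySARTTraces(capacity = 1000, state = Float32 => (4,), action = Float32 => (2,)),
    MetaSampler(
        actor = MultiBatchSampler(BatchSampler{(:state,)}(32), 10),
        critic = MultiBatchSampler(BatchSampler{SS′ART}(32), 1000)
    ),
    InsertSampleRatioController(ratio = 1/1000, threshold = 1000)
)
```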

## How to use the CovGaussianNetwork

`CovGaussianNetwork` allows approximating a policy with correlations between action dimensions, unlike the `GaussianNetwork`, which only models a standard deviation for each dimension independently. In practice, this only requires two changes to the above example with `GaussianNetwork`:
1. Use a `CovGaussianNetwork` instead of a `GaussianNetwork`.
2. The output size of the second head ($\Sigma$) should not be the action size ($|A|$) but $\frac{|A|(|A|+1)}{2}$, the number of entries of a lower-triangular Cholesky factor (see the small helper below). For the CartPole environment, this remains 1 since the action is of length 1.
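
A tiny helper (not part of RL.jl, just an illustration) to compute that head size:

```julia
# Number of entries in the lower-triangular Cholesky factor of a |A|×|A| covariance matrix.
sigma_head_size(action_size::Int) = action_size * (action_size + 1) ÷ 2

sigma_head_size(1)  # 1, as for the continuous CartPole action
sigma_head_size(3)  # 6
```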


5 changes: 3 additions & 2 deletions src/ReinforcementLearningCore/Project.toml
@@ -13,6 +13,7 @@ Distributions = "31c24e10-a181-5473-b8eb-7969acd0382f"
FillArrays = "1a297f60-69ca-5386-bcde-b61e274b549b"
Flux = "587475ba-b771-5e3f-ad9e-33799f191a9c"
Functors = "d9f16b24-f501-4c13-a1f2-28368ffc5196"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
Parsers = "69de0a69-1ddd-5017-9359-2bf0b02dc9f0"
ProgressMeter = "92933f4c-e287-5a05-a399-4b506db050ca"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
@@ -26,18 +27,18 @@ UnicodePlots = "b8865327-cd53-5732-bb35-84acbb429228"
[compat]
AbstractTrees = "0.3, 0.4"
Adapt = "3"
Crayons = "4"
CUDA = "3.5"
ChainRulesCore = "1"
CircularArrayBuffers = "0.1"
Crayons = "4"
Distributions = "0.25"
FillArrays = "0.8, 0.9, 0.10, 0.11, 0.12, 0.13"
Flux = "0.13"
Functors = "0.1, 0.2, 0.3"
Parsers = "2"
ProgressMeter = "1.2"
ReinforcementLearningBase = "0.10, 0.11"
ReinforcementLearningTrajectories = "0.1.5"
ReinforcementLearningTrajectories = "0.1.8"
StatsBase = "0.32, 0.33"
UnicodePlots = "1.3, 2, 3"
julia = "1.6"
2 changes: 1 addition & 1 deletion src/ReinforcementLearningCore/src/policies/learners.jl
@@ -18,7 +18,7 @@ Base.show(io::IO, m::MIME"text/plain", A::Approximator) = show(io, m, convert(An

@functor Approximator (model,)

(A::Approximator)(args...) = A.model(args...)
(A::Approximator)(args...; kwargs...) = A.model(args...; kwargs...)

RLBase.optimise!(A::Approximator, gs) =
Flux.Optimise.update!(A.optimiser, Flux.params(A), gs)
84 changes: 82 additions & 2 deletions src/ReinforcementLearningCore/src/utils/distributions.jl
@@ -1,14 +1,15 @@
export normlogpdf, mvnormlogpdf
export normlogpdf, mvnormlogpdf, diagnormlogpdf, mvnormkldivergence, diagnormkldivergence, normkldivergence

using Flux: unsqueeze, stack
using LinearAlgebra

# watch https://github.com/JuliaStats/Distributions.jl/issues/1183
const log2π = log(2.0f0π)

"""
normlogpdf(μ, σ, x; ϵ = 1.0f-8)

GPU automatic differentiable version for the logpdf function of normal distributions.
GPU automatic differentiable version for the logpdf function of a univariate normal distribution.
Adding an epsilon value to guarantee numeric stability if sigma is exactly zero
(e.g. if relu is used in output layer).
"""
@@ -17,6 +18,24 @@ function normlogpdf(μ, σ, x; ϵ=1.0f-8)
-(z .^ 2 .+ log2π) / 2.0f0 .- log.(σ .+ ϵ)
end

"""
diagnormlogpdf(μ, σ, x; ϵ = 1.0f-8)

GPU automatic differentiable version for the logpdf function of normal distributions with
diagonal covariance. Adding an epsilon value to guarantee numeric stability if sigma is
exactly zero (e.g. if relu is used in output layer).
"""
function diagnormlogpdf(μ, σ, x; ϵ = 1.0f-8)
v = (σ .+ ϵ) .^2
-0.5f0*(log(prod(v)) .+ inv.(v)'*((x .- μ).^2) .+ length(μ)*log2π)
end

#3D tensor version
function diagnormlogpdf(μ::AbstractArray{<:Any,3}, σ::AbstractArray{<:Any,3}, x::AbstractArray{<:Any,3}; ϵ = 1.0f-8)
logp = [diagnormlogpdf(μ[:, :, k], σ[:, :, k], x[:, :, k]) for k in 1:size(x, 3)]
return reduce((x,y)->cat(x,y,dims=3), logp) #returns a 3D vector
end

"""
mvnormlogpdf(μ::AbstractVecOrMat, L::AbstractMatrix, x::AbstractVecOrMat)

@@ -47,3 +66,64 @@ function mvnormlogpdf(μ::A, LorU::A, x::A; ϵ=1.0f-8) where {A<:AbstractArray}
logp = [mvnormlogpdf(μ[:, :, k], LorU[:, :, k], x[:, :, k]) for k in 1:size(x, 3)]
return unsqueeze(stack(logp, 2), dims=1) #returns a 3D vector
end

#Used for mvnormlogpdf
"""
logdetLorU(LorU::AbstractMatrix)
Log-determinant of the positive semi-definite matrix A = L*U (Cholesky lower and upper triangular factors), given L or U.
Has a sign uncertainty for non-PSD matrices.
"""
function logdetLorU(LorU::CuArray)
return 2*sum(log.(diag(LorU)))
end

#Cpu fallback
logdetLorU(LorU::AbstractMatrix) = logdet(LorU)*2

"""
mvnormkldivergence(μ1, L1, μ2, L2)

GPU differentiable implementation of the kl_divergence between two MultiVariate Gaussian distributions with mean vectors `μ1, μ2` respectively and
with cholesky decomposition of covariance matrices `L1, L2`.
"""
function mvnormkldivergence(μ1, L1M, μ2, L2M)
L1 = LowerTriangular(L1M)
L2 = LowerTriangular(L2M)
U1 = UpperTriangular(permutedims(L1M))
U2 = UpperTriangular(permutedims(L2M))
d = size(μ1,1)
logdet = logdetLorU(L2M) - logdetLorU(L1M)
M1 = L1*U1
L2i = inv(L2)
U2i = inv(U2)
M2i = U2i*L2i
X = M2i*M1
trace = tr(X) # trace of inv(Σ2) * Σ1
sqmahal = sum(abs2.(L2i*(μ2 .- μ1))) #mahalanobis square distance
return (logdet - d + trace + sqmahal)/2
end

"""
diagnormkldivergence(μ1, σ1, μ2, σ2)

GPU differentiable implementation of the kl_divergence between two MultiVariate Gaussian distributions with mean vectors `μ1, μ2` respectively and
diagonal standard deviations `σ1, σ2`. Arguments must be Vectors or single-column Matrices.
"""
function diagnormkldivergence(μ1, σ1, μ2, σ2)
v1, v2 = σ1.^2, σ2.^2
d = size(μ1,1)
logdet = sum(log.(v2)) - sum(log.(v1))
trace = sum(v1 ./ v2)
sqmahal = sum((μ2 .- μ1) .^2 ./ v2)
return (logdet - d + trace + sqmahal)/2
end

"""
normkldivergence(μ1, σ1, μ2, σ2)

GPU differentiable implementation of the kl_divergence between two univariate Gaussian
distributions with means `μ1, μ2` and standard deviations `σ1, σ2` respectively.
"""
function normkldivergence(μ1, σ1, μ2, σ2)
log(σ2) - log(σ1) + (σ1^2 + (μ1 - μ2)^2)/(2σ2^2) - typeof(μ1)(0.5)
end