RLSolver: GPU-based Massively Parallel Environments for Large-Scale Combinatorial Optimization (CO) Problems Using Reinforcement Learning
RLSolver Contest 2025: docs website
We aim to showcase the effectiveness of GPU-based massively parallel environments for large-scale combinatorial optimization (CO) problems using reinforcement learning (RL). With the help of GPU-based parallel environments, RL can significantly improve the sampling speed and obtain high-quality solutions within a short time.
RLSolver has three layers:
- Environments: providing massively parallel environments using GPUs.
- RL agents: providing RL algorithms, e.g., REINFORCE, PPO, and DQN.
- Problems: typical CO problems, e.g., graph maxcut, TNCO and TSP.
- GPU-based massively parallel environments: Markov chain Monte Carlo (MCMC) simulations run on GPUs using thousands of CUDA cores and tensor cores.
- Distribution-wise methods are much faster than instance-wise methods (such as MCPG and iSCO), since the results are obtained directly by inference.
The bottleneck of using RL to solve large-scale CO problems, especially in distribution-wise scenarios, is the low sampling speed, since existing solver engines (a.k.a. gym-style environments) are implemented on CPUs. Training the policy network essentially amounts to estimating gradients via a Markov chain Monte Carlo (MCMC) simulation, which requires a large number of samples from the environments.
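As a minimal sketch of this gradient estimation (assuming a toy distribution-wise policy with one Bernoulli probability per node; the names, sizes, and random instance below are illustrative, not taken from RLSolver), a single REINFORCE-style update estimated from many parallel samples looks like this:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
num_envs, num_nodes = 1000, 100

# Illustrative random symmetric edge weights standing in for a CO instance.
adj = torch.rand(num_nodes, num_nodes, device=device)
adj = torch.triu(adj, diagonal=1)
adj = adj + adj.T

# Toy policy: one Bernoulli probability per node (a real policy would be a neural network).
logits = torch.zeros(num_nodes, device=device, requires_grad=True)
optimizer = torch.optim.Adam([logits], lr=1e-2)

dist = torch.distributions.Bernoulli(probs=torch.sigmoid(logits))
xs = dist.sample((num_envs,))                   # samples collected from the parallel environments
objs = ((xs @ adj) * (1.0 - xs)).sum(dim=1)     # maxcut objective of every sample

# Monte Carlo (REINFORCE) gradient estimate with a mean baseline.
log_probs = dist.log_prob(xs).sum(dim=1)
loss = -(log_probs * (objs - objs.mean())).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The variance of this Monte Carlo estimate shrinks as the number of samples per update grows, which is why the sampling speed of the environments dominates the training time.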
Existing CPU-based environments have two significant disadvantages: 1) the number of CPU cores is typically small, generally ranging from 16 to 256, which limits the number of parallel environments; 2) the communication link between CPUs and GPUs has limited bandwidth. Massively parallel environments overcome both disadvantages: we can build thousands of environments on the GPU and bypass the CPU-GPU communication bottleneck, so the sampling speed is significantly improved.
The sampling speed denotes the number of collected samples per second. The above figures compare CPU-based and GPU-based environments. We see that the sampling speed improves by at least two orders of magnitude when using GPU-based massively parallel environments instead of conventional CPU-based ones.
We see that increasing the number of CPU-based environments does not improve the sampling speed noticeably, whereas the sampling speed improves greatly with GPUs. To achieve the same objective value, the more parallel environments we use, the less running time is required.
GPU-based parallel environments can also significantly improve the quality of solutions during training, since RL methods require many high-quality samples from the environments for training. Take graph maxcut as an example: we select G22 from the Gset dataset. The above figure shows the objective values vs. the number of epochs for different numbers of GPU-based parallel environments. We see that, generally, the more parallel environments, the higher the objective values and the faster the convergence.
Library | RL methods | Supported pattern | Actor-critic algs | Non-actor-critic algs | Euclidean topology | Non-Euclidean topology | Distribution-wise | Instance-wise | Problem-specific methods | Commercial solvers | Integration of OR and RL |
---|---|---|---|---|---|---|---|---|---|---|---|
Jumanji | A2C | I, II | Y | N | Y | N | Y | N | N | N | N |
RL4CO | A2C, PPO, REINFORCE | I | Y | Only REINFORCE | Y | N | Y | N | N | N | N |
RLSolver (Ours) | PPO, S2V-DQN, ECO-DQN, MCPG, dREINFORCE, iSCO, PI-GNN, etc. | I, II | Y | Y | Y | Y | Y | Y | Y | Y | Y |
RLSolver crosses two domains: operations research (OR) and RL. Commercial solver-based OR methods use Gurobi (or SCIP) based on ILP or QUBO/Ising models. RLSolver supports the integration of OR and RL, e.g., RL for branching, cutting plane, or column generation (CG).
Sparse-rewards Pattern (I): Rewards are given only when the goal is reached, and are therefore sparse. RL-based heuristics formulate the CO problem as a Markov decision process (MDP) and then use RL algorithms to select a node and add it to the node set. There are three important functions for a gym-style environment (see the sketch after this list):
- reset(): Set the selected nodes as an empty set.
- step(): Select the node with the maximum Q-value and then add it to the set.
- reward(): Calculate the objective values over all parallel environments.
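For concreteness, the following is a minimal sketch of such a gym-style Pattern I environment for graph maxcut with batched parallel environments. The class name, signatures, and bodies are illustrative assumptions, not the exact RLSolver API.

```python
import torch

class MaxcutEnvPatternI:
    """Illustrative sparse-reward environment: nodes are added one at a time,
    and the cut value is evaluated over all parallel environments."""

    def __init__(self, adjacency: torch.Tensor, num_envs: int, device: str = "cuda"):
        self.adj = adjacency.to(device)      # (num_nodes, num_nodes) symmetric weight matrix
        self.num_nodes = adjacency.shape[0]
        self.num_envs = num_envs
        self.device = device

    def reset(self) -> torch.Tensor:
        # Set the selected node set of every parallel environment to empty.
        self.x = torch.zeros(self.num_envs, self.num_nodes, device=self.device)
        return self.x

    def step(self, q_values: torch.Tensor) -> torch.Tensor:
        # Select the node with the maximum Q-value among the unselected nodes
        # and add it to the set, independently in each environment.
        masked_q = q_values.masked_fill(self.x.bool(), float("-inf"))
        chosen = masked_q.argmax(dim=1)      # (num_envs,)
        self.x[torch.arange(self.num_envs, device=self.device), chosen] = 1.0
        return self.x                        # a real environment would also return done flags

    def reward(self) -> torch.Tensor:
        # Cut value of the current partition in every environment:
        # the sum of w_ij over edges (i, j) with x_i = 1 and x_j = 0.
        return ((self.x @ self.adj) * (1.0 - self.x)).sum(dim=1)
```

A DQN-style agent would call step() repeatedly with the Q-values of its network and receive the cut values from reward() only at the end of the episode, which is what makes the rewards sparse.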
Dense-rewards Pattern (II): Rewards are obtained immediately, and are therefore dense. Policy-based methods first formulate the CO problem as a QUBO problem and then learn a policy, e.g., with the REINFORCE algorithm, to minimize the Hamiltonian objective function. Here, the policy is a vector of probabilities that the nodes belong to the set. For example, for a graph with 3 nodes, the policy [0, 0, 0.9] means that the probabilities that the first two nodes belong to the set are 0, and the probability that the third node belongs to the set is 0.9. We introduce four important functions for all parallel environments (see the sketch after this list):
- reset(): Generate random initial solutions for all parallel environments.
- step(): Search for better solutions based on the current solutions. It has two sub-functions.
  - sampling() is the sampling method.
  - local_search() returns a better solution by flipping some bits; it improves the quality of the current solution within a local neighborhood.
- pick_good_xs(): Select the good solutions from all parallel environments; each environment returns exactly one good solution together with its objective value.
- obj(): Calculate the objective values.
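The sketch below illustrates these four functions for graph maxcut with all parallel environments batched in a single tensor. The random bit-flip local search and the Bernoulli sampler are simplified assumptions for illustration, not the exact RLSolver/L2A implementation.

```python
import torch

def obj(xs: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
    # Cut values of the solutions xs, which have shape (num_envs, num_nodes).
    return ((xs @ adj) * (1.0 - xs)).sum(dim=1)

def reset(num_envs: int, num_nodes: int, device: str = "cuda") -> torch.Tensor:
    # Generate random initial 0/1 solutions for all parallel environments.
    return torch.randint(0, 2, (num_envs, num_nodes), device=device).float()

def sampling(probs: torch.Tensor, num_envs: int) -> torch.Tensor:
    # Sample one candidate solution per environment from the per-node probabilities.
    return torch.bernoulli(probs.repeat(num_envs, 1))

def local_search(xs: torch.Tensor, adj: torch.Tensor, num_trials: int = 8) -> torch.Tensor:
    # Flip one random bit per environment and keep the flip only if it improves the objective.
    best_xs, best_objs = xs.clone(), obj(xs, adj)
    rows = torch.arange(xs.shape[0], device=xs.device)
    for _ in range(num_trials):
        cols = torch.randint(0, xs.shape[1], (xs.shape[0],), device=xs.device)
        cand = best_xs.clone()
        cand[rows, cols] = 1.0 - cand[rows, cols]
        cand_objs = obj(cand, adj)
        better = cand_objs > best_objs
        best_xs[better], best_objs[better] = cand[better], cand_objs[better]
    return best_xs

def step(probs: torch.Tensor, adj: torch.Tensor, num_envs: int) -> torch.Tensor:
    # One environment step: sample new solutions, then improve them by local search.
    return local_search(sampling(probs, num_envs), adj)

def pick_good_xs(xs: torch.Tensor, adj: torch.Tensor):
    # Each environment holds a single solution in this sketch, so simply return it with
    # its objective value; in RLSolver an environment may hold several samples and keep the best.
    return xs, obj(xs, adj)
```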
Sparse-rewards Pattern (I): In the left part of the above figure, the initial state is empty, i.e., no node is selected. Then we select node 1, which has the maximum Q-value, and add it to the state; the new state is [1], and the reward is 2.
Dense-rewards Pattern (II): In the right part of the above figure, the current state is [2, 3], i.e., nodes 2 and 3 are selected, and the objective value is 2. The new state is [1, 3, 4], i.e., nodes 1, 3, and 4 are selected, and the objective value is 4.
- All states and objective values are stored as PyTorch tensors, so that they are mapped to the CUDA cores and tensor cores of GPUs.
- We use vmap (one GPU) or pmap (multiple GPUs) to push the map into PyTorch operations, effectively vectorizing those operations.
For example, we calculate the objective values of the states over all parallel environments with the following code:
```python
from torch import vmap
batched_obj = vmap(objective)
objs = batched_obj(states)
```
where the "objective" function is denotes the calculation of the objective value for one state.
Dimension of states and objective values: The states over all parallel environments are stored as a PyTorch tensor with the environments as the first dimension and the graph nodes as the second dimension. For example, for a graph with 100 nodes and 1000 parallel environments, the states tensor has dimension 1000 * 100. Correspondingly, the objective values over all parallel environments have dimension 1000 * 1.
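As a self-contained illustration (the random graph, sizes, and tensor names below are made up for demonstration and not taken from RLSolver), the batched objective above can be run as follows:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
num_envs, num_nodes = 1000, 100

# Illustrative symmetric weight matrix; in RLSolver it would be read from a dataset file.
adj = torch.rand(num_nodes, num_nodes, device=device)
adj = torch.triu(adj, diagonal=1)
adj = adj + adj.T

# States over all parallel environments: first dimension = environments, second = nodes.
states = torch.randint(0, 2, (num_envs, num_nodes), device=device).float()   # 1000 * 100

def objective(state: torch.Tensor) -> torch.Tensor:
    # Cut value of a single 0/1 state of shape (num_nodes,); adj is captured from the outer scope.
    return ((state @ adj) * (1.0 - state)).sum()

batched_obj = torch.vmap(objective)   # vectorize over the environment dimension
objs = batched_obj(states)            # one objective value per environment
print(objs.shape)                     # torch.Size([1000])
```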
RLSolver provides a diverse range of methods for large-scale combinatorial optimization (CO) problems.
Method | Category | Pattern | Source | Parallel environments |
---|---|---|---|---|
Gurobi (QUBO) | Conventional | - | code | - |
Greedy | Conventional | - | code | - |
SDP | Conventional | - | code | - |
S2V-DQN | RL | I | code | code |
ECO-DQN | RL | I | code | code |
Jumanji | RL | I | code | code |
PI-GNN | RL | I | code | code |
RUN-CSP | RL | I | code | code |
iSCO | RL | II | code | code |
seq2seq | RL | II | code | code |
MCPG | RL | II | code | code |
dREINFORCE | RL | II | code | code |
- Mazyavkina, Nina, et al. "Reinforcement learning for combinatorial optimization: A survey." Computers & Operations Research 134 (2021): 105400.
- Bengio, Yoshua, Andrea Lodi, and Antoine Prouvost. "Machine learning for combinatorial optimization: a methodological tour d’horizon." European Journal of Operational Research 290.2 (2021): 405-421.
- Peng, Yun, Byron Choi, and Jianliang Xu. "Graph learning for combinatorial optimization: a survey of state-of-the-art." Data Science and Engineering 6.2 (2021): 119-141.
- Nair, Vinod, et al. "Solving mixed integer programs using neural networks." arXiv preprint arXiv:2012.13349 (2020).
- Makoviychuk, Viktor, et al. "Isaac Gym: High performance GPU based physics simulation for robot learning." Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). 2021.
Python>=3.7
PyTorch>=2.0.0
Numpy>=1.23
```
rlsolver
└── data
└── docs
└── envs
    └── Env_ECO.py
    └── Env_ISCO.py
    └── Env_Jumanji.py
    └── Env_L2A.py
    └── Env_MCPG.py
    └── Env_S2V.py
└── methods                       # generic RL methods, i.e., one method can solve multiple problems
    └── attention_model
    └── eco_s2v
        └── eco
        └── s2v
    └── jumanji
    └── rl4co
    └── iSCO
    └── L2A (ours)
    └── PI-GNN
    └── RUN-CSP
    └── s2v_ppo
    └── config.py
    └── genetic_algorithm.py
    └── greedy.py
    └── gurobi.py
    └── mcpg.py
    └── random_walk.py
    └── scip.py
    └── simulated_annealing.py
    └── util.py
    └── util_generate.py
    └── util_obj.py
    └── util_read_data.py
    └── util_result.py
└── methods_problem_specific      # may also contain generic methods
    └── TSP
    └── VRPTW
    └── compressive_sensing
    └── maxcut
    └── mimo_beamforming
    └── portfolio_allocation
    └── quantum_circuits
    └── tensor_train
└── methods_RLOR                  # generic RL+OR methods (one method can solve multiple problems); integration of RL and OR using Gurobi
    └── RL_branching
    └── RL_column_generation
    └── RL_cutting
└── README.md
└── result
```
- Gset is published by Stanford University and is stored in the "data" folder of this repo. The number of nodes ranges from 800 to 10000.
- Syn is the synthetic data. The number of nodes ranges from 100 to 1000, over three distributions: Barabasi-Albert (BA), Erdos-Renyi (ER), and powerlaw (PL). Each generated dataset has 10 instances. Part of the synthetic data is stored in the "data" folder of this repo.
Take gset_14.txt (an undirected graph with 800 nodes and 4694 edges) as an example:
```
800 4694    # #nodes is 800, and #edges is 4694
1 7 1       # node 1 connects with node 7, weight = 1
1 10 1      # node 1 connects with node 10, weight = 1
1 12 1      # node 1 connects with node 12, weight = 1
```
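As a hedged sketch (the helper name read_graph_file and the example path are ours for illustration, not necessarily the API of util_read_data.py), this format can be parsed into a dense weight matrix as follows:

```python
import torch

def read_graph_file(path: str) -> torch.Tensor:
    """Parse a Gset-style file: the first line is '<#nodes> <#edges>', followed by one
    '<node_i> <node_j> <weight>' line per edge, with 1-indexed nodes."""
    with open(path) as f:
        num_nodes, num_edges = map(int, f.readline().split())
        adj = torch.zeros(num_nodes, num_nodes)
        for _ in range(num_edges):
            i, j, w = f.readline().split()
            i, j, w = int(i) - 1, int(j) - 1, float(w)   # convert to 0-indexed
            adj[i, j] = w
            adj[j, i] = w                                # undirected graph
    return adj

# e.g., adj = read_graph_file("data/gset_14.txt")   # 800 x 800 weight matrix
```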
Results will be written to the file "result.txt" in the folder "result". Take graph maxcut as an example: the first column is the node, and the second column is the label of the set it belongs to.
```
1 2   # node 1 in set 2
2 1   # node 2 in set 1
3 2   # node 3 in set 2
4 1   # node 4 in set 1
5 2   # node 5 in set 2
```
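For illustration (assuming the result file contains only such "node label" lines; the helper below reuses the read_graph_file sketch above), the cut value implied by this output can be recomputed as:

```python
import torch

def cut_from_labels(result_path: str, adj: torch.Tensor) -> float:
    # Recompute the maxcut objective from a result file of "node label" lines (labels 1 and 2).
    labels = torch.zeros(adj.shape[0])
    with open(result_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2 or not parts[0].isdigit():
                continue                                  # skip any non-assignment lines
            node, label = int(parts[0]) - 1, int(parts[1])
            labels[node] = 1.0 if label == 2 else 0.0
    # Sum the weights of edges whose endpoints lie in different sets.
    return float(((labels @ adj) * (1.0 - labels)).sum())

# e.g., print(cut_from_labels("result/result.txt", read_graph_file("data/gset_14.txt")))
```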
- 1: Select the problem in config.py:
```python
PROBLEM = Problem.maxcut
```
We can select a problem such as maxcut.
- 2: Select the dataset. Take methods/greedy.py as an example:
```python
directory_data = '../data/syn_BA'  # the directory of datasets
prefixes = ['BA_100_']             # select the graphs with 100 nodes
```
- 3: Run a method:
```bash
python methods/greedy.py                 # run greedy
python methods/gurobi.py                 # run Gurobi
python methods/simulated_annealing.py    # run simulated annealing
python methods/mcpg.py                   # run MCPG
python methods/iSCO/main.py              # run iSCO
python methods/PI-GNN/main.py            # run PI-GNN
python methods/L2A/main.py               # run L2A (ours)
```
- RL-based annealing using massively parallel environments
code 2024 NeurIPS Learning to search feasible and infeasible regions of routing problems with flexible neural k-opt
- RL-based search using massively parallel environments
code 2024 ICLR Jumanji: a diverse suite of scalable reinforcement learning environments in jax
- RL/ML-based heuristic
code (greedy) 2017 NeurIPS Learning Combinatorial Optimization Algorithms over Graphs
code (local search) 2023, A Monte Carlo Policy Gradient Method with Local Search for Binary Optimization
- Variational annealing
code (VCA_RNN) 2023 Machine Learning Supplementing recurrent neural networks with annealing to solve combinatorial optimization problems
- Discrete sampling
code (iSCO) 2023 ICML Revisiting Sampling for Combinatorial Optimization
- Learning to branch
code 2023 AAAI Reinforcement Learning for Branch-and-Bound Optimisation using Retrospective Trajectories
code 2021 AAAI Parameterizing Branch-and-Bound Search Trees to Learn Branching Policies
- Learning to cut
code 2020 ICML Reinforcement learning for integer programming: Learning to cut
- Classical methods
In the following experiments, we used a GPU during training by default. The best-known results are labeled in bold. The "obj bound" is the bound of the objective value, computed by Gurobi; no solution can exceed this value.
- Instance-wise (Gset)
We use the instance-wise version of L2A, i.e., end-to-end, on the Gset dataset, which is published by Stanford University.
Graph | Nodes | Edges | BLS | DSDP | KHLWG | RUN-CSP | PI-GNN | Gurobi (1 h) | Gap | Obj bound | iSCO | MCPG | Ours | Improvement |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
  |  |  |  |  |  |  | Pattern I |  |  |  | Pattern II | Pattern II | Pattern II |  |
G14 | 800 | 4694 | 3064 | 2922 | 3061 | 2943 | - | 3037 | 4.28% | 3167 | 3056 | 3064 | 3064 | +0%
G15 | 800 | 4661 | 3050 | 2938 | 3050 | 2928 | 2990 | 3022 | 4.37% | 3151 | 3046 | 3050 | 3050 | +0% |
G22 | 2000 | 19990 | 13359 | 12960 | 13359 | 13028 | 13181 | 13217 | 30.56% | 17256 | 13289 | 13359 | 13359 | +0% |
G49 | 3000 | 6000 | 6000 | 6000 | 6000 | 6000 | 5918 | 6000 | 0 | 6000 | 5940 | 6000 | 6000 | +0% |
G50 | 3000 | 6000 | 5880 | 5880 | 5880 | 5880 | 5820 | 5880 | 0.51% | 5910 | 5880 | 5880 | 5880 | +0% |
G55 | 5000 | 12468 | 10294 | 9960 | 10236 | 10116 | 10138 | 10115 | 17.09% | 11844 | 10218 | 10296 | 10298 | +0.04% |
G70 | 10000 | 9999 | 9541 | 9456 | 9458 | - | 9421 | 9579 | 1.75% | 9747 | 9442 | 9578 | 9586 | +0.47% |
- Distribution-wise (synthetic data)
We use the distribution-wise version of L2A on synthetic datasets from three distributions: Barabasi-Albert (BA), Erdos-Renyi (ER), and powerlaw (PL). That is, after training, we test the instances by inference with the neural networks. The distribution-wise version of L2A is much faster than instance-wise methods (such as MCPG and iSCO), since we obtain the results directly by inference. For graphs with
Results on the BA distribution.
Nodes | Greedy | SDP | SA | GA | Gurobi (1 h) | Obj bound | PI-GNN | iSCO | MCPG | Ours |
---|---|---|---|---|---|---|---|---|---|---|
  |  |  |  |  |  |  | Pattern I | Pattern II | Pattern II | Pattern II |
100 | 272.1 | 272.5 | 272.3 | 284.1 | 284.1 | 284.1 | 273.0 | 284.1 | 284.1 | 284.1 |
200 | 546.9 | 552.9 | 560.2 | 582.9 | 583.0 | 583.0 | 560.6 | 581.5 | 583.0 | 583.0 |
300 | 833.2 | 839.3 | 845.3 | 880.4 | 880.4 | 880.4 | 846.3 | 877.2 | 880.4 | 880.4 |
400 | 1112.1 | 1123.9 | 1134.6 | 1180.9 | 1180.4 | 1195.1 | 1174.6 | 1176.5 | 1179.5 | 1181.9 |
500 | 1383.8 | 1406.3 | 1432.8 | 1477.7 | 1476.0 | 1523.2 | 1436.8 | 1471.3 | 1478.3 | 1478.3 |
600 | 1666.7 | 1701.2 | 1770.3 | 1780.3 | 1777.0 | 1867.7 | 1768.5 | 1771.0 | 1778.6 | 1781.5 |
700 | 1961.9 | 1976.7 | 1984.3 | 1989.2 | 2071.2 | 2197.3 | 1989.4 | 2070.2 | 2076.6 | 2076.6 |
800 | 2237.9 | 2268.8 | 2273.6 | 2375.5 | 2358.9 | 2539.5 | 2365.9 | 2366.9 | 2372.9 | 2377.8 |
900 | 2518.1 | 2550.3 | 2554.3 | 2670.1 | 2658.3 | 2866.6 | 2539.7 | 2662.4 | 2670.6 | 2675.1 |
1000 | 2793.8 | 2834.3 | 2856.2 | 2967.9 | 2950.2 | 3217.6 | 2846.8 | 2954.0 | 2968.7 | 2972.3 |