RLSolver: GPU-based Massively Parallel Environments for Large-Scale Combinatorial Optimization (CO) Problems Using Reinforcement Learning
RLSolver Contest 2025: docs website
We aim to showcase the effectiveness of GPU-based massively parallel environments for large-scale combinatorial optimization (CO) problems using reinforcement learning (RL). With the help of GPU-based parallel environments, RL can significantly improve the sampling speed and obtain high-quality solutions within a short time.
RLSolver has three layers:
- Environments: providing massively parallel environments using GPUs.
- RL agents: providing RL algorithms, e.g., REINFORCE, PPO, and DQN.
- Problems: typical CO problems, e.g., graph maxcut, TNCO and TSP.
- GPU-based massively parallel environments: Markov chain Monte Carlo (MCMC) simulations run on GPUs using thousands of CUDA cores and tensor cores.
- Distribution-wise methods are much faster than instance-wise methods (such as MCPG and iSCO), since the results are obtained directly by inference.
The bottleneck of using RL to solve large-scale CO problems, especially in distribution-wise scenarios, is the low sampling speed, since existing solver engines (a.k.a. gym-style environments) are implemented on CPUs. Training the policy network essentially amounts to estimating gradients via a Markov chain Monte Carlo (MCMC) simulation, which requires a large number of samples from the environments.
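As a minimal sketch of this gradient estimation (assuming a toy distribution-wise policy with one Bernoulli probability per node; the names, sizes, and random instance below are illustrative, not taken from RLSolver), a single REINFORCE-style update estimated from many parallel samples looks like this:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
num_envs, num_nodes = 1000, 100

# Illustrative random symmetric edge weights standing in for a CO instance.
adj = torch.rand(num_nodes, num_nodes, device=device)
adj = torch.triu(adj, diagonal=1)
adj = adj + adj.T

# Toy policy: one Bernoulli probability per node (a real policy would be a neural network).
logits = torch.zeros(num_nodes, device=device, requires_grad=True)
optimizer = torch.optim.Adam([logits], lr=1e-2)

dist = torch.distributions.Bernoulli(probs=torch.sigmoid(logits))
xs = dist.sample((num_envs,))                   # samples collected from the parallel environments
objs = ((xs @ adj) * (1.0 - xs)).sum(dim=1)     # maxcut objective of every sample

# Monte Carlo (REINFORCE) gradient estimate with a mean baseline.
log_probs = dist.log_prob(xs).sum(dim=1)
loss = -(log_probs * (objs - objs.mean())).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The variance of this Monte Carlo estimate shrinks as the number of samples per update grows, which is why the sampling speed of the environments dominates the training time.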
Existing CPU-based environments have two significant disadvantages: 1) the number of CPU cores is typically small, generally ranging from 16 to 256, which limits the number of parallel environments; 2) the communication link between CPUs and GPUs has limited bandwidth. Massively parallel environments overcome both disadvantages: we can build thousands of environments on the GPU and bypass the CPU-GPU communication bottleneck, so the sampling speed is significantly improved.
The sampling speed denotes the number of collected samples per second. The above figures compare CPU-based and GPU-based environments. We see that the sampling speed improves by at least two orders of magnitude when using GPU-based massively parallel environments instead of conventional CPU-based ones.
We see that increasing the number of CPU-based environments does not improve the sampling speed noticeably, whereas the sampling speed improves greatly with GPUs. To achieve the same objective value, the more parallel environments we use, the less running time is required.
GPU-based parallel environments can also significantly improve the quality of solutions during training, since RL methods require many high-quality samples from the environments for training. Take graph maxcut as an example: we select G22 from the Gset dataset. The above figure shows the objective values vs. the number of epochs for different numbers of GPU-based parallel environments. We see that, generally, the more parallel environments, the higher the objective values and the faster the convergence.
Library | RL methods | Supported pattern | Actor-critic algs | Non-actor-critic algs | Euclidean topology | Non-Euclidean topology | Distribution-wise | Instance-wise | Problem-specific methods | Commercial solvers | Integration of OR and RL |
---|---|---|---|---|---|---|---|---|---|---|---|
Jumanji | A2C | I, II | Y | N | Y | N | Y | N | N | N | N |
RL4CO | A2C, PPO, REINFORCE | I | Y | Only REINFORCE | Y | N | Y | N | N | N | N |
RLSolver (Ours) | PPO, S2V-DQN, ECO-DQN, MCPG, dREINFORCE, iSCO, PI-GNN, etc. | I, II | Y | Y | Y | Y | Y | Y | Y | Y | Y |
RLSolver crosses two domains: operations research (OR) and RL. Commercial solver-based OR methods use Gurobi (or SCIP) based on ILP or QUBO/Ising models. RLSolver supports the integration of OR and RL, e.g., RL for branching, cutting plane, or column generation (CG).
Sparse-rewards Pattern (I): Rewards are given only when the goal is reached, and are therefore sparse. RL-based heuristics formulate the CO problem as a Markov decision process (MDP) and then use RL algorithms to select a node and add it to the node set. There are three important functions for a gym-style environment (see the sketch after this list):
- reset(): Set the selected nodes as an empty set.
- step(): Select the node with the maximum Q-value and then add it to the set.
- reward(): Calculate the objective values over all parallel environments.
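For concreteness, the following is a minimal sketch of such a gym-style Pattern I environment for graph maxcut with batched parallel environments. The class name, signatures, and bodies are illustrative assumptions, not the exact RLSolver API.

```python
import torch

class MaxcutEnvPatternI:
    """Illustrative sparse-reward environment: nodes are added one at a time,
    and the cut value is evaluated over all parallel environments."""

    def __init__(self, adjacency: torch.Tensor, num_envs: int, device: str = "cuda"):
        self.adj = adjacency.to(device)      # (num_nodes, num_nodes) symmetric weight matrix
        self.num_nodes = adjacency.shape[0]
        self.num_envs = num_envs
        self.device = device

    def reset(self) -> torch.Tensor:
        # Set the selected node set of every parallel environment to empty.
        self.x = torch.zeros(self.num_envs, self.num_nodes, device=self.device)
        return self.x

    def step(self, q_values: torch.Tensor) -> torch.Tensor:
        # Select the node with the maximum Q-value among the unselected nodes
        # and add it to the set, independently in each environment.
        masked_q = q_values.masked_fill(self.x.bool(), float("-inf"))
        chosen = masked_q.argmax(dim=1)      # (num_envs,)
        self.x[torch.arange(self.num_envs, device=self.device), chosen] = 1.0
        return self.x                        # a real environment would also return done flags

    def reward(self) -> torch.Tensor:
        # Cut value of the current partition in every environment:
        # the sum of w_ij over edges (i, j) with x_i = 1 and x_j = 0.
        return ((self.x @ self.adj) * (1.0 - self.x)).sum(dim=1)
```

A DQN-style agent would call step() repeatedly with the Q-values of its network and receive the cut values from reward() only at the end of the episode, which is what makes the rewards sparse.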
Dense-rewards Pattern (II): Rewards are obtained immediately, and are therefore dense. Policy-based methods first formulate the CO problem as a QUBO problem and then learn a policy, e.g., with the REINFORCE algorithm, to minimize the Hamiltonian objective function. Here, the policy is a vector of probabilities that the nodes belong to the set. For example, for a graph with 3 nodes, the policy [0, 0, 0.9] means that the probabilities that the first two nodes belong to the set are 0, and the probability that the third node belongs to the set is 0.9. We introduce four important functions for all parallel environments (see the sketch after this list):
- reset(): Generate random initial solutions for all parallel environments.
- step(): Search for better solutions based on the current solutions. It has two sub-functions.
  - sampling() is the sampling method.
  - local_search() returns a better solution by flipping some bits; it improves the quality of the current solution within a local neighborhood.
- pick_good_xs(): Select the good solutions from all parallel environments; each environment returns exactly one good solution together with its objective value.
- obj(): Calculate the objective values.
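The sketch below illustrates these four functions for graph maxcut with all parallel environments batched in a single tensor. The random bit-flip local search and the Bernoulli sampler are simplified assumptions for illustration, not the exact RLSolver/L2A implementation.

```python
import torch

def obj(xs: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
    # Cut values of the solutions xs, which have shape (num_envs, num_nodes).
    return ((xs @ adj) * (1.0 - xs)).sum(dim=1)

def reset(num_envs: int, num_nodes: int, device: str = "cuda") -> torch.Tensor:
    # Generate random initial 0/1 solutions for all parallel environments.
    return torch.randint(0, 2, (num_envs, num_nodes), device=device).float()

def sampling(probs: torch.Tensor, num_envs: int) -> torch.Tensor:
    # Sample one candidate solution per environment from the per-node probabilities.
    return torch.bernoulli(probs.repeat(num_envs, 1))

def local_search(xs: torch.Tensor, adj: torch.Tensor, num_trials: int = 8) -> torch.Tensor:
    # Flip one random bit per environment and keep the flip only if it improves the objective.
    best_xs, best_objs = xs.clone(), obj(xs, adj)
    rows = torch.arange(xs.shape[0], device=xs.device)
    for _ in range(num_trials):
        cols = torch.randint(0, xs.shape[1], (xs.shape[0],), device=xs.device)
        cand = best_xs.clone()
        cand[rows, cols] = 1.0 - cand[rows, cols]
        cand_objs = obj(cand, adj)
        better = cand_objs > best_objs
        best_xs[better], best_objs[better] = cand[better], cand_objs[better]
    return best_xs

def step(probs: torch.Tensor, adj: torch.Tensor, num_envs: int) -> torch.Tensor:
    # One environment step: sample new solutions, then improve them by local search.
    return local_search(sampling(probs, num_envs), adj)

def pick_good_xs(xs: torch.Tensor, adj: torch.Tensor):
    # Each environment holds a single solution in this sketch, so simply return it with
    # its objective value; in RLSolver an environment may hold several samples and keep the best.
    return xs, obj(xs, adj)
```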
Sparse-rewards Pattern (I): In the left part of the above figure, the initial state is empty, i.e., no node is selected. Then we select node 1, which has the maximum Q-value, and add it to the state; the new state is [1], and the reward is 2.
Dense-rewards Pattern (II): In the right part of the above figure, the current state is [2, 3], i.e., nodes 2 and 3 are selected, and the objective value is 2. The new state is [1, 3, 4], i.e., nodes 1, 3, and 4 are selected, and the objective value is 4.
- All states and objective values are stored as PyTorch tensors, so that they are mapped to the CUDA cores and tensor cores of GPUs.
- We use vmap (one GPU) or pmap (multiple GPUs) to push the map into PyTorch operations, effectively vectorizing those operations.
For example, we calculate the objective values of the states over all parallel environments with the following code:
```python
from torch import vmap
batched_obj = vmap(objective)
objs = batched_obj(states)
```
where the "objective" function is denotes the calculation of the objective value for one state.
Dimension of states and objective values: The states over all parallel environments are stored as a PyTorch tensor with the environments as the first dimension and the graph nodes as the second dimension. For example, for a graph with 100 nodes and 1000 parallel environments, the states tensor has dimension 1000 * 100. Correspondingly, the objective values over all parallel environments have dimension 1000 * 1.
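As a self-contained illustration (the random graph, sizes, and tensor names below are made up for demonstration and not taken from RLSolver), the batched objective above can be run as follows:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
num_envs, num_nodes = 1000, 100

# Illustrative symmetric weight matrix; in RLSolver it would be read from a dataset file.
adj = torch.rand(num_nodes, num_nodes, device=device)
adj = torch.triu(adj, diagonal=1)
adj = adj + adj.T

# States over all parallel environments: first dimension = environments, second = nodes.
states = torch.randint(0, 2, (num_envs, num_nodes), device=device).float()   # 1000 * 100

def objective(state: torch.Tensor) -> torch.Tensor:
    # Cut value of a single 0/1 state of shape (num_nodes,); adj is captured from the outer scope.
    return ((state @ adj) * (1.0 - state)).sum()

batched_obj = torch.vmap(objective)   # vectorize over the environment dimension
objs = batched_obj(states)            # one objective value per environment
print(objs.shape)                     # torch.Size([1000])
```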
RLSolver provides a diverse range of methods for large-scale combinatorial optimization (CO) problems.
Method | Category | Pattern | Source | Parallel environments |
---|---|---|---|---|
Gurobi (QUBO) | Conventional | - | code | - |
Greedy | Conventional | - | code | - |
SDP | Conventional | - | code | - |
S2V-DQN | RL | I | code | code |
ECO-DQN | RL | I | code | code |
Jumanji | RL | I | code | code |
PI-GNN | RL | I | code | code |
RUN-CSP | RL | I | code | code |
iSCO | RL | II | code | code |
seq2seq | RL | II | code | code |
MCPG | RL | II | code | code |
dREINFORCE | RL | II | code | code |
- Mazyavkina, Nina, et al. "Reinforcement learning for combinatorial optimization: A survey." Computers & Operations Research 134 (2021): 105400.
- Bengio, Yoshua, Andrea Lodi, and Antoine Prouvost. "Machine learning for combinatorial optimization: a methodological tour d’horizon." European Journal of Operational Research 290.2 (2021): 405-421.
- Peng, Yun, Byron Choi, and Jianliang Xu. "Graph learning for combinatorial optimization: a survey of state-of-the-art." Data Science and Engineering 6.2 (2021): 119-141.
- Nair, Vinod, et al. "Solving mixed integer programs using neural networks." arXiv preprint arXiv:2012.13349 (2020).
- Makoviychuk, Viktor, et al. "Isaac Gym: High performance GPU based physics simulation for robot learning." Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). 2021.
Python>=3.7
PyTorch>=2.0.0
Numpy>=1.23
```
rlsolver
└── data
└── docs
└── envs
    └── Env_ECO.py
    └── Env_ISCO.py
    └── Env_Jumanji.py
    └── Env_L2A.py
    └── Env_MCPG.py
    └── Env_S2V.py
└── methods                       # generic RL methods, i.e., one method can solve multiple problems
    └── attention_model
    └── eco_s2v
        └── eco
        └── s2v
    └── jumanji
    └── rl4co
    └── iSCO
    └── L2A (ours)
    └── PI-GNN
    └── RUN-CSP
    └── s2v_ppo
    └── config.py
    └── genetic_algorithm.py
    └── greedy.py
    └── gurobi.py
    └── mcpg.py
    └── random_walk.py
    └── scip.py
    └── simulated_annealing.py
    └── util.py
    └── util_generate.py
    └── util_obj.py
    └── util_read_data.py
    └── util_result.py
└── methods_problem_specific      # may also contain generic methods
    └── TSP
    └── VRPTW
    └── compressive_sensing
    └── maxcut
    └── mimo_beamforming
    └── portfolio_allocation
    └── quantum_circuits
    └── tensor_train
└── methods_RLOR                  # generic RL+OR methods (one method can solve multiple problems); integration of RL and OR using Gurobi
    └── RL_branching
    └── RL_column_generation
    └── RL_cutting
└── README.md
└── result
```
- Gset is published by Stanford University and is stored in the "data" folder of this repo. The number of nodes ranges from 800 to 10000.
- Syn is the synthetic data. The number of nodes ranges from 100 to 1000, over three distributions: Barabasi-Albert (BA), Erdos-Renyi (ER), and powerlaw (PL). Each generated dataset has 10 instances. Part of the synthetic data is stored in the "data" folder of this repo.
Take gset_14.txt (an undirected graph with 800 nodes and 4694 edges) as an example:
```
800 4694    # #nodes is 800, and #edges is 4694
1 7 1       # node 1 connects with node 7, weight = 1
1 10 1      # node 1 connects with node 10, weight = 1
1 12 1      # node 1 connects with node 12, weight = 1
```
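As a hedged sketch (the helper name read_graph_file and the example path are ours for illustration, not necessarily the API of util_read_data.py), this format can be parsed into a dense weight matrix as follows:

```python
import torch

def read_graph_file(path: str) -> torch.Tensor:
    """Parse a Gset-style file: the first line is '<#nodes> <#edges>', followed by one
    '<node_i> <node_j> <weight>' line per edge, with 1-indexed nodes."""
    with open(path) as f:
        num_nodes, num_edges = map(int, f.readline().split())
        adj = torch.zeros(num_nodes, num_nodes)
        for _ in range(num_edges):
            i, j, w = f.readline().split()
            i, j, w = int(i) - 1, int(j) - 1, float(w)   # convert to 0-indexed
            adj[i, j] = w
            adj[j, i] = w                                # undirected graph
    return adj

# e.g., adj = read_graph_file("data/gset_14.txt")   # 800 x 800 weight matrix
```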
Results will be written to the file "result.txt" in the folder "result". Take graph maxcut as an example: the first column is the node, and the second column is the label of the set it belongs to.
```
1 2   # node 1 in set 2
2 1   # node 2 in set 1
3 2   # node 3 in set 2
4 1   # node 4 in set 1
5 2   # node 5 in set 2
```
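For illustration (assuming the result file contains only such "node label" lines; the helper below reuses the read_graph_file sketch above), the cut value implied by this output can be recomputed as:

```python
import torch

def cut_from_labels(result_path: str, adj: torch.Tensor) -> float:
    # Recompute the maxcut objective from a result file of "node label" lines (labels 1 and 2).
    labels = torch.zeros(adj.shape[0])
    with open(result_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2 or not parts[0].isdigit():
                continue                                  # skip any non-assignment lines
            node, label = int(parts[0]) - 1, int(parts[1])
            labels[node] = 1.0 if label == 2 else 0.0
    # Sum the weights of edges whose endpoints lie in different sets.
    return float(((labels @ adj) * (1.0 - labels)).sum())

# e.g., print(cut_from_labels("result/result.txt", read_graph_file("data/gset_14.txt")))
```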
- 1: Select the problem in config.py:
```python
PROBLEM = Problem.maxcut
```
We can select a problem such as maxcut.
- 2: Select the dataset. Take methods/greedy.py as an example:
```python
directory_data = '../data/syn_BA'  # the directory of datasets
prefixes = ['BA_100_']             # select the graphs with 100 nodes
```
- 3: Run a method:
```bash
python methods/greedy.py                 # run greedy
python methods/gurobi.py                 # run Gurobi
python methods/simulated_annealing.py    # run simulated annealing
python methods/mcpg.py                   # run MCPG
python methods/iSCO/main.py              # run iSCO
python methods/PI-GNN/main.py            # run PI-GNN
python methods/L2A/main.py               # run L2A (ours)
```
- RL-based annealing using massively parallel environments
code 2024 NeurIPS Learning to search feasible and infeasible regions of routing problems with flexible neural k-opt
- RL-based search using massively parallel environments
code 2024 ICLR Jumanji: a diverse suite of scalable reinforcement learning environments in jax
- RL/ML-based heuristic
code (greedy) 2017 NeurIPS Learning Combinatorial Optimization Algorithms over Graphs
code (local search) 2023, A Monte Carlo Policy Gradient Method with Local Search for Binary Optimization
- Variational annealing
code (VCA_RNN) 2023 Machine Learning Supplementing recurrent neural networks with annealing to solve combinatorial optimization problems
- Discrete sampling
code (iSCO) 2023 ICML Revisiting Sampling for Combinatorial Optimization
- Learning to branch
code 2023 AAAI Reinforcement Learning for Branch-and-Bound Optimisation using Retrospective Trajectories
code 2021 AAAI Parameterizing Branch-and-Bound Search Trees to Learn Branching Policies
- Learning to cut
code 2020 ICML Reinforcement learning for integer programming: Learning to cut
- Classical methods
In the following experiments, we used a GPU during training by default. The best-known results are labeled in bold. The "obj bound" is the bound of the objective value, computed by Gurobi; no solution can exceed this value.
- Instance-wise (Gset)
We use the instance-wise version of L2A, i.e., end-to-end, on the Gset dataset, which is published by Stanford University.
Graph | Nodes | Edges | BLS | DSDP | KHLWG | RUN-CSP | PI-GNN | Gurobi (1 h) | Gap | Obj bound | iSCO | MCPG | Ours | Improvement |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
  |  |  |  |  |  |  | Pattern I |  |  |  | Pattern II | Pattern II | Pattern II |  |
G14 | 800 | 4694 | 3064 | 2922 | 3061 | 2943 | - | 3037 | 4.28% | 3167 | 3056 | 3064 | 3064 | +0%
G15 | 800 | 4661 | 3050 | 2938 | 3050 | 2928 | 2990 | 3022 | 4.37% | 3151 | 3046 | 3050 | 3050 | +0% |
G22 | 2000 | 19990 | 13359 | 12960 | 13359 | 13028 | 13181 | 13217 | 30.56% | 17256 | 13289 | 13359 | 13359 | +0% |
G49 | 3000 | 6000 | 6000 | 6000 | 6000 | 6000 | 5918 | 6000 | 0 | 6000 | 5940 | 6000 | 6000 | +0% |
G50 | 3000 | 6000 | 5880 | 5880 | 5880 | 5880 | 5820 | 5880 | 0.51% | 5910 | 5880 | 5880 | 5880 | +0% |
G55 | 5000 | 12468 | 10294 | 9960 | 10236 | 10116 | 10138 | 10115 | 17.09% | 11844 | 10218 | 10296 | 10298 | +0.04% |
G70 | 10000 | 9999 | 9541 | 9456 | 9458 | - | 9421 | 9579 | 1.75% | 9747 | 9442 | 9578 | 9586 | +0.47% |
- Distribution-wise (synthetic data)
We use the distribution-wise version of L2A on synthetic datasets from three distributions: Barabasi-Albert (BA), Erdos-Renyi (ER), and powerlaw (PL). That is, after training, we test the instances by inference with the neural networks. The distribution-wise version of L2A is much faster than instance-wise methods (such as MCPG and iSCO), since we obtain the results directly by inference. For graphs with
Results on the BA distribution.
Nodes | Greedy | SDP | SA | GA | Gurobi (1 h) | Obj bound | PI-GNN | iSCO | MCPG | Ours |
---|---|---|---|---|---|---|---|---|---|---|
  |  |  |  |  |  |  | Pattern I | Pattern II | Pattern II | Pattern II |
100 | 272.1 | 272.5 | 272.3 | 284.1 | 284.1 | 284.1 | 273.0 | 284.1 | 284.1 | 284.1 |
200 | 546.9 | 552.9 | 560.2 | 582.9 | 583.0 | 583.0 | 560.6 | 581.5 | 583.0 | 583.0 |
300 | 833.2 | 839.3 | 845.3 | 880.4 | 880.4 | 880.4 | 846.3 | 877.2 | 880.4 | 880.4 |
400 | 1112.1 | 1123.9 | 1134.6 | 1180.9 | 1180.4 | 1195.1 | 1174.6 | 1176.5 | 1179.5 | 1181.9 |
500 | 1383.8 | 1406.3 | 1432.8 | 1477.7 | 1476.0 | 1523.2 | 1436.8 | 1471.3 | 1478.3 | 1478.3 |
600 | 1666.7 | 1701.2 | 1770.3 | 1780.3 | 1777.0 | 1867.7 | 1768.5 | 1771.0 | 1778.6 | 1781.5 |
700 | 1961.9 | 1976.7 | 1984.3 | 1989.2 | 2071.2 | 2197.3 | 1989.4 | 2070.2 | 2076.6 | 2076.6 |
800 | 2237.9 | 2268.8 | 2273.6 | 2375.5 | 2358.9 | 2539.5 | 2365.9 | 2366.9 | 2372.9 | 2377.8 |
900 | 2518.1 | 2550.3 | 2554.3 | 2670.1 | 2658.3 | 2866.6 | 2539.7 | 2662.4 | 2670.6 | 2675.1 |
1000 | 2793.8 | 2834.3 | 2856.2 | 2967.9 | 2950.2 | 3217.6 | 2846.8 | 2954.0 | 2968.7 | 2972.3 |