Bad multi-threaded parallel execution efficiency question. #1501

@GoodManWEN

Description

🌍 Environment

  • Your operating system and version: Windows 10 20H2
  • Your python version: python 3.8.7
  • How did you install python (e.g. apt or pyenv)? Did you use a virtualenv?: from exe file / no.
  • Your Rust version (rustc --version): 1.51.0 (beta)
  • Your PyO3 version: 0.13.2
  • Have you tried using latest PyO3 master (replace version = "0.x.y" with git = "https://github.com/PyO3/pyo3")?: no

💥 Description

Hi everyone, I recently migrated some of my algorithms from Python to Rust; the code is around 5,000 lines in total. The bad news is that despite excellent single-threaded execution efficiency, the PyO3 extension does not scale well in multi-threaded parallel mode. After releasing the GIL and running on my 8-core CPU, I was expecting a 4-8x speedup, but the actual speedup was only about 2x.

I cautiously assume this is because the type conversions between Python and Rust (converting Python lists into Rust vectors and back) are all executed while holding the GIL; this may be aggravated by the fact that the data I pass in is some relatively long two-dimensional Python lists. I would like to ask whether this situation can be improved (the low efficiency may be caused by my calling the extension incorrectly), or whether my requirements make it impossible to improve at all.
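If a serial per-call section (such as list↔Vec conversion under the GIL) is the bottleneck, Amdahl's law gives a quick estimate of how large that serial fraction would have to be to cap the speedup at the observed 2x. A minimal sketch (just the formula rearranged to solve for the serial fraction, plugged with the numbers from this report):

```python
def serial_fraction(speedup: float, n_threads: int) -> float:
    """Invert Amdahl's law: speedup = 1 / (s + (1 - s) / n), solved for s."""
    return (n_threads / speedup - 1) / (n_threads - 1)

# An observed ~2x speedup on 8 threads implies roughly 3/7 (~43%) of each
# call's wall time is serial, i.e. presumably spent holding the GIL.
print(f"{serial_fraction(2.0, 8):.3f}")  # ~0.429
```

So if the conversion hypothesis is right, nearly half of each call would be running under the GIL, which is consistent with the plateau at 2x.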

Minimum Implementation

lib.rs:
It accepts an M by N matrix and returns it with each item incremented by 1. The algorithm is much more complex in real production.

use pyo3::prelude::*;
use pyo3::wrap_pyfunction;

fn multithread_logic<const N: usize>(matrix: Vec<[f64; N]>) -> Vec<Vec<f64>> {
    matrix
        .iter()
        .map(|row| row.iter().map(|x| x + 1.0).collect())
        .collect()
}

#[pyfunction]
fn multithread(
    py: Python,
    matrix: Vec<[f64;32]>,
) -> Vec<Vec<f64>> {
    py.allow_threads(|| multithread_logic(matrix))
}

#[pymodule]
fn testlib(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(multithread, m)?)?;
    Ok(())
}

call.py:
A simple way to compare single-threaded and multi-threaded execution speed. Note that the multi-threaded run performs 8x the total work, so with perfect scaling its wall time would roughly match the single-threaded run. In practice the time increases nearly linearly with thread count; I'd like to know if this is due to my mismanagement of the GIL.

import testlib
import time
import threading

matrix = [list(range(32)) for _ in range(2000)]

def single_thread(matrix):
    for i in range(1000):
        testlib.multithread(matrix)

st_time = time.time()
single_thread(matrix)
print(f"Single thread time: {time.time() - st_time} s")

st_time = time.time()
threads = []
for _ in range(8):
    threads.append(threading.Thread(target=single_thread, args=(matrix,)))
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Multi-threaded time: {time.time() - st_time} s")
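For comparison, the same benchmark can be written with concurrent.futures, which also propagates exceptions raised inside worker threads (the plain threading version above silently swallows them). The `work` function below is a hypothetical pure-Python stand-in for `testlib.multithread`, used only so the sketch is self-contained; in the real benchmark you would call the extension instead:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def work(matrix):
    # Hypothetical stand-in for testlib.multithread: add 1.0 to each element.
    return [[x + 1.0 for x in row] for row in matrix]

def bench(n_threads, matrix, calls_per_thread=100):
    """Run `calls_per_thread` calls on each of `n_threads` threads; return wall time."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [
            pool.submit(lambda: [work(matrix) for _ in range(calls_per_thread)])
            for _ in range(n_threads)
        ]
        for f in futures:
            f.result()  # re-raises any exception from the worker thread
    return time.time() - start

matrix = [[float(x) for x in range(32)] for _ in range(200)]
print(f"1 thread:  {bench(1, matrix):.3f} s")
print(f"8 threads: {bench(8, matrix):.3f} s")
```

With a pure-Python `work` the 8-thread run should take about 8x the 1-thread run (the GIL is never released); the interesting question is how far the extension deviates from that.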
