Bad multi-threaded parallel execution efficiency question. #1501

@GoodManWEN

Description

🌍 Environment

  • Your operating system and version: Windows 10 20H2
  • Your python version: python 3.8.7
  • How did you install python (e.g. apt or pyenv)? Did you use a virtualenv?: from exe file / no.
  • Your Rust version (rustc --version): 1.51.0 (beta)
  • Your PyO3 version: 0.13.2
  • Have you tried using latest PyO3 master (replace version = "0.x.y" with git = "https://github.com/PyO3/pyo3")?: no

💥 Description

Hi everyone, I recently migrated some of my algorithms from Python to Rust; the code is around 5,000 lines in total. The bad news is that despite excellent single-threaded execution efficiency, the PyO3 extension does not scale well in multi-threaded parallel mode. After releasing the GIL and running on my 8-core CPU, I was expecting a 4-8x speedup, but the actual speedup was only about 2x.

I cautiously assume this is because the type conversions between Python and Rust (converting Python lists into Rust vectors and back) are all executed while holding the GIL; this may be aggravated by the fact that the data I pass in is some relatively long two-dimensional Python lists. I would like to ask whether this situation can be improved (the low efficiency may be caused by my calling the extension incorrectly), or whether my requirements make it impossible to improve at all.
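If a serial per-call section (such as list↔Vec conversion under the GIL) is the bottleneck, Amdahl's law gives a quick estimate of how large that serial fraction would have to be to cap the speedup at the observed 2x. A minimal sketch (just the formula rearranged to solve for the serial fraction, plugged with the numbers from this report):

```python
def serial_fraction(speedup: float, n_threads: int) -> float:
    """Invert Amdahl's law: speedup = 1 / (s + (1 - s) / n), solved for s."""
    return (n_threads / speedup - 1) / (n_threads - 1)

# An observed ~2x speedup on 8 threads implies roughly 3/7 (~43%) of each
# call's wall time is serial, i.e. presumably spent holding the GIL.
print(f"{serial_fraction(2.0, 8):.3f}")  # ~0.429
```

So if the conversion hypothesis is right, nearly half of each call would be running under the GIL, which is consistent with the plateau at 2x.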

Minimum Implementation

lib.rs:
It accepts an M by N matrix and returns it with each item incremented by 1. The algorithm is much more complex in real production.

use pyo3::prelude::*;
use pyo3::wrap_pyfunction;

fn multithread_logic<const N: usize>(matrix: Vec<[f64; N]>) -> Vec<Vec<f64>> {
    matrix
        .iter()
        .map(|row| row.iter().map(|x| x + 1.0).collect())
        .collect()
}

#[pyfunction]
fn multithread(
    py: Python,
    matrix: Vec<[f64;32]>,
) -> Vec<Vec<f64>> {
    py.allow_threads(|| multithread_logic(matrix))
}

#[pymodule]
fn testlib(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(multithread, m)?)?;
    Ok(())
}

call.py:
A simple way to compare single-threaded and multi-threaded execution speed. Note that the multi-threaded run performs 8x the total work, so with perfect scaling its wall time would roughly match the single-threaded run. In practice the time increases nearly linearly with thread count; I'd like to know if this is due to my mismanagement of the GIL.

import testlib
import time
import threading

matrix = [list(range(32)) for _ in range(2000)]

def single_thread(matrix):
    for i in range(1000):
        testlib.multithread(matrix)

st_time = time.time()
single_thread(matrix)
print(f"Single thread time: {time.time() - st_time} s")

st_time = time.time()
threads = []
for _ in range(8):
    threads.append(threading.Thread(target=single_thread, args=(matrix,)))
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Multi-threaded time: {time.time() - st_time} s")
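For comparison, the same benchmark can be written with concurrent.futures, which also propagates exceptions raised inside worker threads (the plain threading version above silently swallows them). The `work` function below is a hypothetical pure-Python stand-in for `testlib.multithread`, used only so the sketch is self-contained; in the real benchmark you would call the extension instead:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def work(matrix):
    # Hypothetical stand-in for testlib.multithread: add 1.0 to each element.
    return [[x + 1.0 for x in row] for row in matrix]

def bench(n_threads, matrix, calls_per_thread=100):
    """Run `calls_per_thread` calls on each of `n_threads` threads; return wall time."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [
            pool.submit(lambda: [work(matrix) for _ in range(calls_per_thread)])
            for _ in range(n_threads)
        ]
        for f in futures:
            f.result()  # re-raises any exception from the worker thread
    return time.time() - start

matrix = [[float(x) for x in range(32)] for _ in range(200)]
print(f"1 thread:  {bench(1, matrix):.3f} s")
print(f"8 threads: {bench(8, matrix):.3f} s")
```

With a pure-Python `work` the 8-thread run should take about 8x the 1-thread run (the GIL is never released); the interesting question is how far the extension deviates from that.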
