research area: parallel zip creation #2158

@cosmicexplorer

Problem

I was creating a pex file to resolve dependencies while playing around with fawkes (https://github.com/shawn-shan/fawkes), and like most modern ML projects, it contains many large binary dependencies. This meant that while resolution (with the 2020 pip resolver) was relatively fast, the creation of the zip itself took about a minute after that (without any progress indicator), which led me to prefer a venv when iterating on the fawkes source code.

Use Case Discussion

I understand that pex's supported use cases revolve much more around robust, reproducible deployment scenarios, where taking a single minute to zip up a massive bundle of dependencies is more than acceptable, and where relying on the battle-tested stdlib zipfile.ZipFile is extremely important to ensure pex files can be executed and inspected on all platforms and by all applications. However, for use cases like the one described above, where the pex is created strictly for the local platform, I think it would be really convenient to avoid having to set up a stateful venv.

Alternatives

I could just create a pex file for the dependencies alone and use it to launch a python process that runs from source code. Indeed, that is what we decided on to implement pantsbuild/pants#8793, which worked perfectly for the Twitter ML infra team's jupyter notebooks. But (assuming this is actually possible) I would still personally find a feature that zips up pex files much faster to be useful for a lot of "I really just wanna hack something together" scenarios where I don't want to set up a two-phase build process like that.

Implementation Strategy

After going through the proposal in pypa/pip#8448 to hack the zip format to minimize wheel downloads, which was implemented much more thoughtfully in pypa/pip#8467 as LazyZipOverHTTP, and then realizing I had overpromised the potential speedup at pypa/pip#7049 (comment), I am wary of assuming that hacking around with the zip format will necessarily improve performance over the battle-tested synchronous stdlib implementation.

However, it seems plausible that the compression of individual entries could be parallelized. pigz describes its methodology at https://github.com/madler/pigz/blob/cb8a432c91a1dbaee896cd1ad90be62e5d82d452/pigz.c#L279-L340, and there is a python codebase named fastzip that does this using threads, with a great discussion of performance bottlenecks and a few notes on compatibility issues. The Apache Commons Compress library appears to have implemented this too with ParallelScatterZipCreator.
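To make the pigz methodology concrete, here is a minimal sketch of the core trick in pure python: split the payload into blocks, deflate each block independently on a thread pool (zlib releases the GIL during compression, so threads actually run in parallel), and byte-align each fragment with Z_FULL_FLUSH so the fragments concatenate into one well-formed raw deflate stream. All names and the chunk size are illustrative; this is not pex's or fastzip's actual implementation, and it skips pigz's dictionary-priming refinement, which costs a little compression ratio.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 128 * 1024  # pigz compresses 128 KiB blocks by default


def _deflate_chunk(chunk, is_last):
    # Raw deflate (wbits=-15), matching what a zip entry body contains.
    co = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
    out = co.compress(chunk)
    # Z_FULL_FLUSH byte-aligns the fragment and resets compressor state, so
    # independently compressed fragments concatenate into one valid stream;
    # only the final chunk emits the BFINAL block via Z_FINISH.
    out += co.flush(zlib.Z_FINISH if is_last else zlib.Z_FULL_FLUSH)
    return out


def parallel_deflate(data, workers=4):
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)] or [b""]
    last_flags = [i == len(chunks) - 1 for i in range(len(chunks))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(_deflate_chunk, chunks, last_flags))
```

The remaining (harder) part, which this sketch omits, is splicing such precompressed streams into zip entries, since the stdlib zipfile API wants to do the compression itself; that is where fastzip's compatibility notes become relevant.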

End-User Interface

Due to the likely compatibility issues (both with executing the parallel method at all, and with consuming the resulting zip file), it seems best to put this behind a flag, and probably to explicitly call it experimental (I like the way pip does --use-feature=fast-deps, for example), and possibly even to print a warning to stderr when executing or processing any pex files created this way. To enable the warning message, or in case we want to do any other special-case processing of pex files created this way, we could put a key in PEX-INFO's .build_properties, or perhaps add an empty sentinel file named .parallel-zip to the output.
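A rough sketch of how the .build_properties marker could round-trip: stamp a key at creation time, then check for it (and warn on stderr) when a pex built this way is consumed. The "parallel_zip" key name and both function names are assumptions for illustration, not existing pex API.

```python
import json
import sys
import zipfile


def mark_parallel_zip(pex_info):
    # Return a copy of the PEX-INFO dict with a hypothetical marker key
    # recorded under .build_properties.
    props = dict(pex_info.get("build_properties", {}))
    props["parallel_zip"] = True  # hypothetical marker key
    return dict(pex_info, build_properties=props)


def warn_if_parallel_zip(pex_path):
    # Read PEX-INFO out of the zip and warn if the marker is present.
    with zipfile.ZipFile(pex_path) as zf:
        pex_info = json.loads(zf.read("PEX-INFO"))
    flagged = bool(pex_info.get("build_properties", {}).get("parallel_zip"))
    if flagged:
        print("warning: %s was zipped with the experimental parallel method"
              % pex_path, file=sys.stderr)
    return flagged
```

The sentinel-file alternative would be even simpler to detect (a namelist() membership check) and would not require parsing PEX-INFO at all.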
