research area: parallel zip creation #2158

@cosmicexplorer

Problem

I was creating a pex file to resolve dependencies while playing around with fawkes (https://github.com/shawn-shan/fawkes), and like most modern ML projects, it contains many large binary dependencies. This meant that while resolution (with the 2020 pip resolver) was relatively fast, the creation of the zip itself took about a minute after that (without any progress indicator), which led me to prefer a venv when iterating on the fawkes source code.

Use Case Discussion

I understand that pex's supported use cases revolve much more around robust, reproducible deployment scenarios, where taking a single minute to zip up a massive bundle of dependencies is more than acceptable, and where relying on the battle-tested stdlib zipfile.ZipFile is extremely important to ensure pex files can be executed and inspected on all platforms and by all applications. However, for use cases like the one described above, where the pex is created strictly for the local platform, I think it would be really convenient to avoid having to set up a stateful venv.

Alternatives

I could just create a pex file for the dependencies alone and use it to launch a python process that runs from source code. Indeed, that is what we decided on to implement pantsbuild/pants#8793, which worked perfectly for the Twitter ML infra team's jupyter notebooks. But (assuming this is actually possible) I would still personally find a feature that zips up pex files much faster to be useful for a lot of "I really just wanna hack something together" scenarios where I don't want to set up a two-phase build process like that.

Implementation Strategy

After going through the proposal in pypa/pip#8448 to hack the zip format to minimize wheel downloads, which was implemented much more thoughtfully in pypa/pip#8467 as LazyZipOverHTTP, and then realizing I had overpromised the potential speedup at pypa/pip#7049 (comment), I am wary of assuming that hacking around with the zip format will necessarily improve performance over the battle-tested synchronous stdlib implementation.

However, it seems plausible that the compression of individual entries could be parallelized. pigz describes its methodology at https://github.com/madler/pigz/blob/cb8a432c91a1dbaee896cd1ad90be62e5d82d452/pigz.c#L279-L340, and there is a python codebase named fastzip that does this using threads, with a great discussion of performance bottlenecks and a few notes on compatibility issues. The Apache Commons Compress library appears to have implemented this too with ParallelScatterZipCreator.
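To make the pigz methodology concrete, here is a minimal sketch of the core trick in pure python: split the payload into blocks, deflate each block independently on a thread pool (zlib releases the GIL during compression, so threads actually run in parallel), and byte-align each fragment with Z_FULL_FLUSH so the fragments concatenate into one well-formed raw deflate stream. All names and the chunk size are illustrative; this is not pex's or fastzip's actual implementation, and it skips pigz's dictionary-priming refinement, which costs a little compression ratio.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 128 * 1024  # pigz compresses 128 KiB blocks by default


def _deflate_chunk(chunk, is_last):
    # Raw deflate (wbits=-15), matching what a zip entry body contains.
    co = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
    out = co.compress(chunk)
    # Z_FULL_FLUSH byte-aligns the fragment and resets compressor state, so
    # independently compressed fragments concatenate into one valid stream;
    # only the final chunk emits the BFINAL block via Z_FINISH.
    out += co.flush(zlib.Z_FINISH if is_last else zlib.Z_FULL_FLUSH)
    return out


def parallel_deflate(data, workers=4):
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)] or [b""]
    last_flags = [i == len(chunks) - 1 for i in range(len(chunks))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(_deflate_chunk, chunks, last_flags))
```

The remaining (harder) part, which this sketch omits, is splicing such precompressed streams into zip entries, since the stdlib zipfile API wants to do the compression itself; that is where fastzip's compatibility notes become relevant.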

End-User Interface

Due to the likely compatibility issues (both with executing the parallel method at all, and with consuming the resulting zip file), it seems best to put this behind a flag, and probably to explicitly call it experimental (I like the way pip does --use-feature=fast-deps, for example), and possibly even to print a warning to stderr when executing or processing any pex files created this way. To enable the warning message, or in case we want to do any other special-case processing of pex files created this way, we could put a key in PEX-INFO's .build_properties, or perhaps add an empty sentinel file named .parallel-zip to the output.
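A rough sketch of how the .build_properties marker could round-trip: stamp a key at creation time, then check for it (and warn on stderr) when a pex built this way is consumed. The "parallel_zip" key name and both function names are assumptions for illustration, not existing pex API.

```python
import json
import sys
import zipfile


def mark_parallel_zip(pex_info):
    # Return a copy of the PEX-INFO dict with a hypothetical marker key
    # recorded under .build_properties.
    props = dict(pex_info.get("build_properties", {}))
    props["parallel_zip"] = True  # hypothetical marker key
    return dict(pex_info, build_properties=props)


def warn_if_parallel_zip(pex_path):
    # Read PEX-INFO out of the zip and warn if the marker is present.
    with zipfile.ZipFile(pex_path) as zf:
        pex_info = json.loads(zf.read("PEX-INFO"))
    flagged = bool(pex_info.get("build_properties", {}).get("parallel_zip"))
    if flagged:
        print("warning: %s was zipped with the experimental parallel method"
              % pex_path, file=sys.stderr)
    return flagged
```

The sentinel-file alternative would be even simpler to detect (a namelist() membership check) and would not require parsing PEX-INFO at all.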
