Conversation

@hategan (Collaborator) commented Aug 2, 2023

This adds file staging as part of Layer 0.

File staging was initially part of Layer 1, but there are good reasons for existing Layer 0 executors to support it.
For example, the local executor is useful as a test executor and should support staging so that software can be tested without the complexity of remote execution. Another example is PBS, which natively supports staging directives in submit scripts.

There are some other changes, such as the removal of the "early draft" status and a move to something that is closer to a change log.

@hategan hategan marked this pull request as draft August 2, 2023 00:25
@ketancmaheshwari commented

Is there a standard definition of staging behavior here, or is that up for discussion? For example, would staging be a default or non-default behavior? Is it a copy or link operation, or is it configurable? Can it be partial, such as just stage-out operations but not stage-in? Can it be partial among the set of files subject to staging, such as based on size or maybe a regex/wildcard?

@hategan (Collaborator, Author) commented Aug 2, 2023

Is there a standard definition of staging behavior here or is that up for discussion?

I would say that given the nature of the project, everything is up for discussion.

For example, would staging be a default or non-default behavior?

I'm not sure I fully understand what you mean, but if a job does not specify files to stage, then nothing is staged.

Is it a copy or link operation or is it configurable?

You can take a look at the changes, but yes, copy/link/move is configurable per file.

Can it be partial, such as just stage-out operations but not stage-in?

A user can specify, for a job, files to stage out but no files to stage in.
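For illustration, here is a minimal sketch of what per-file staging modes and a stage-out-only job might look like. The class and field names (`StageOut`, `mode`, `stage_in`, `stage_out`) are hypothetical stand-ins, not the actual names in the proposed spec:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class StageOut:
    # Hypothetical staging directive; all names here are illustrative only.
    source: str          # path on the execution side
    target: str          # client-side destination
    mode: str = 'copy'   # per-file choice: 'copy', 'link', or 'move'

@dataclass
class JobSpec:
    executable: str
    stage_in: List = field(default_factory=list)    # empty list: nothing staged in
    stage_out: List[StageOut] = field(default_factory=list)

# A job that stages out one result file but stages nothing in:
spec = JobSpec(
    executable='/bin/simulate',
    stage_out=[StageOut('out.dat', 'results/out.dat', mode='move')],
)
```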

Can it be partial among the set of files subject to staging, such as based on size or maybe a regex/wildcard?

There is no complex filtering at this time (i.e., based on size or regex), but you can stage entire directories recursively. This is a good point, though, and we should probably discuss whether there is a need for filtering that cannot be done otherwise. For example, stage-in filtering can be done by user code, since files to stage in are, by definition, accessible to the client side. Stage-out filtering may be a little trickier, because user code does not necessarily know what files the job will produce before the job runs. However, it is always possible to have the job run a post-processing step that filters relevant files into an output directory.
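The post-processing idea can be as simple as a final step in the job that copies matching files into a directory designated for stage-out. `filter_outputs` below is a hypothetical helper, not part of PSI/J:

```python
import shutil
from pathlib import Path

def filter_outputs(workdir, outdir, pattern='*.dat', max_size=None):
    """Copy job outputs matching a glob (and, optionally, under a size
    limit) into a directory designated for stage-out. Intended to run as
    the job's last step, since only the job knows what it produced."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    for f in Path(workdir).glob(pattern):
        if max_size is not None and f.stat().st_size > max_size:
            continue  # skip files over the size limit
        shutil.copy2(f, out / f.name)
```

The job would then declare only the output directory for stage-out, sidestepping the need for filtering in the staging layer itself.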

@andre-merzky (Collaborator) left a comment

I know this is a draft, but left some comments anyway. Thanks a bunch for working on this!

initialized at construction. This is mostly something that would
be characteristic of Python with big-list-kwargs and does not encourage
a uniform API.
@hategan hategan marked this pull request as ready for review September 22, 2023 22:56
@andre-merzky (Collaborator) left a comment

Thanks for addressing my concerns! I left a comment on the last open one, but please don't consider that blocking a merge!

@hategan (Collaborator, Author) commented Dec 11, 2023

An issue that came up is supporting file staging for batch schedulers.

Background: Some job schedulers support a basic form of staging. For example, PBS allows staging with the -W stage[in|out] attributes for qsub (see https://docs.adaptivecomputing.com/torque/2-5-12/Content/topics/commands/qsub.htm), Slurm can do staging via burst buffers (see https://slurm.schedmd.com/burst_buffer.html), and LSF uses '-stage' (see https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=options-stage). However, none of these reflects file staging unambiguously in the job states; the exception is Slurm, which has a STAGE_OUT state (see https://slurm.schedmd.com/squeue.html) but no corresponding STAGE_IN.
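As a sketch of what mapping to a native mechanism could look like, the following renders qsub arguments in the `stagein=local_file@hostname:remote_file` form from the Torque docs linked above. `pbs_staging_args` is a hypothetical helper, and the exact syntax varies between PBS variants, so treat this as illustrative only:

```python
def pbs_staging_args(stage_in, stage_out):
    """Render Torque/PBS -W staging attributes from (local, host, remote)
    triples, following the stagein=local_file@hostname:remote_file form in
    the Torque qsub documentation. Hypothetical sketch; syntax differs
    between PBS variants."""
    args = []
    if stage_in:
        args += ['-W', 'stagein=' + ','.join(
            f'{local}@{host}:{remote}' for local, host, remote in stage_in)]
    if stage_out:
        args += ['-W', 'stageout=' + ','.join(
            f'{local}@{host}:{remote}' for local, host, remote in stage_out)]
    return args
```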

While the specification is kept separate from implementation issues, it should still allow for the implementation to actually be possible. So the issue at hand is whether and how a potential batch scheduler executor could implement file staging.

There are a few possibilities:

  1. Do not implement file staging for batch schedulers. Most file staging mechanisms supported by batch schedulers work at the job level (not the node level), so staging is most often equivalent to a copy to and from a persistent location at some point before or after the job runs. Consequently, the staging can be done from the job script or from outside the job entirely. The more interesting case of staging to and from compute nodes is not generally covered and may or may not be considered to be at the level at which PSI/J Layer 0 is supposed to work. The disadvantage of this option is a lack of parity on a significant issue between the batch scheduler executors and, e.g., the local executor, which may diminish the benefits of having the local executor as a stand-in testing executor (i.e., when using batch schedulers one may have to implement file staging outside of PSI/J, which would allow local/batch parity but would break batch/remote parity).
  2. Implement file staging for batch schedulers, either with native mechanisms or with cp/mv/ln commands in the job script, but acknowledge the (near) impossibility of correctly reflecting job states. That is because, in most cases, file staging would happen while the job is in an active scheduler state. Implementing out-of-band states is hindered by the following technical difficulties:
    • A potential network-based solution, in which a PSI/J implementation runs a server to which jobs connect to update status, is made difficult by the fact that most login nodes have multiple network interfaces, only some of which are accessible from the compute nodes. There is no automatic, reliable, and efficient way of detecting which of the IP addresses associated with these interfaces is functional for the task at hand: probing them serially can run into minute-long timeouts, while probing them in parallel can be resource intensive (maybe sending datagrams to all relevant IPs could work here?).
    • File-based solutions are equally problematic: tailing multiple files (one for each job) can be resource intensive, whereas using a single file can run into atomicity/synchronization issues with writes on shared filesystems.
  3. Implement file staging and find a mechanism to correctly reflect job states. The downside here is that it would force a potentially difficult technical solution on the implementations.
  4. Leave the decision to the implementation.
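For the parallel-probing idea in option 2, a rough sketch might look like the following. It uses TCP connects rather than datagrams, `first_reachable` is a hypothetical function, and it glosses over the resource concerns raised above:

```python
import socket
from concurrent.futures import ThreadPoolExecutor

def first_reachable(addresses, port, timeout=0.5):
    """Probe all candidate addresses in parallel and return the first one
    that accepts a TCP connection on the given port, or None if none do.
    Hypothetical sketch of picking a usable login-node interface."""
    if not addresses:
        return None

    def probe(addr):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)  # bound the per-address wait
            try:
                s.connect((addr, port))
                return addr
            except OSError:
                return None

    # One thread per candidate so a slow/unreachable address does not
    # serialize into minute-long waits.
    with ThreadPoolExecutor(max_workers=len(addresses)) as pool:
        for result in pool.map(probe, addresses):
            if result is not None:
                return result
    return None
```

In practice the job wrapper would run this against the set of login-node addresses recorded at submission time and report its state updates to whichever one answers.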

@andre-merzky (Collaborator) commented

Late reply, sorry.

IMHO we should distinguish between (a) staging to the resource, (b) staging to the task sandbox (workdir), and (c) staging to node-local storage. (a) is only needed for Layers 1 and 2, but in that case it is not enacted by the batch system; it needs to happen before submission (I think). (b) and (c) might be relevant to Layer 0, with the difference that (c) cannot possibly be enacted before job queuing, as we do not know the exact set of nodes the job will run on (and/or don't have access to those nodes). If the batch system does not support (c), the only option I see is via the batch script, with all the caveats you point out.

I would suggest specifying (a) in Layers 1/2 and (b) in Layer 0, and relying not on the batch system but on the PSI/J implementation to enact the staging. An implementation MAY choose to support (c) as an optimization.
