
Conversation

@vymao commented Aug 26, 2025

If we aren't using the Metaflow metadata service provider, Metaflow defaults to generating task IDs locally. These task IDs are just small integers, incremented sequentially via new_task_id in metaflow/plugins/metadata_providers/local.py based on how many tasks/steps have run. This is a problem for AWS Batch multi-node parallel (MNP) jobs, because we currently do a blanket string replacement of the task ID across the whole secondary command. If the task ID is a plain integer, that replacement can hit many unintended places.

For example, if the task ID is "3", there could be many other instances of "3" in the secondary command, and each of them would get "-node-$AWS_BATCH_JOB_NODE_INDEX" appended, when we only want to rewrite the actual task ID.

Here, I've identified the two places that should be the only parts of the command containing the actual task ID and needing replacement: the task ID passed via --task-id, and the task ID embedded in MF_PATHSPEC. Targeting these with more specific regexes is safer than a blanket replacement.
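
As a rough sketch of what the targeted replacement looks like (the function and command layout below are illustrative, not the exact PR code), only the `--task-id` option and the trailing task ID in `MF_PATHSPEC` are rewritten:

```python
import re

NODE_SUFFIX = "-node-$AWS_BATCH_JOB_NODE_INDEX"

def retarget_task_id(cmd, task_id):
    # Rewrite only the value of the --task-id option...
    cmd = re.sub(
        r"(--task-id\s+)%s\b" % re.escape(task_id),
        r"\g<1>" + task_id + NODE_SUFFIX,
        cmd,
    )
    # ...and the final component of MF_PATHSPEC=<flow>/<run>/<step>/<task>.
    cmd = re.sub(
        r"(MF_PATHSPEC=\S*/)%s\b" % re.escape(task_id),
        r"\g<1>" + task_id + NODE_SUFFIX,
        cmd,
    )
    return cmd

# With task_id "3", a stray "3" elsewhere in the command (say, a retry count)
# is left alone; only the two targeted occurrences get the node suffix.
```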

Furthermore, if there is no metadata provider, I've added a new check that lets the control MNP job determine when its worker tasks have finished by consulting the S3 datastore instead.

```python
# Diff context under review: the task ID (with any "control-" prefix stripped)
# gets the per-node suffix appended for worker nodes.
self._task_id.replace("control-", "")
+ "-node-$AWS_BATCH_JOB_NODE_INDEX",
)
# Fix: Only replace task ID in specific arguments, not in environment variables
```
Collaborator

Thanks for the fix! @saikonen, any suggestions on approaches that would let us not lean on string substitution (quite finicky)?

Collaborator

Haven't touched this implementation in a bit, so maybe there is a reason the task-id patterns are handled in execute(), but at first glance this seems (in the original implementation) like quite a late phase to be doing anything with the command.

Could a better place be where the cmd is not yet joined into a string, so we can operate on the options separately? E.g. batch_cli.py#251 and step_functions.py#925?
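
Roughly what I mean (names are illustrative, not actual Metaflow code): if the rewrite happens while the command is still a list of arguments, the task-id option can be patched positionally, with no regex over the joined string:

```python
def patch_task_id_option(cmd_parts, suffix):
    # Append the per-node suffix to the value that follows "--task-id",
    # operating on the argument list before it is joined into a shell string.
    patched = list(cmd_parts)
    for i, part in enumerate(patched):
        if part == "--task-id" and i + 1 < len(patched):
            patched[i + 1] += suffix
    return patched

# patch_task_id_option(["step", "train", "--task-id", "3"],
#                      "-node-$AWS_BATCH_JOB_NODE_INDEX")
# -> ["step", "train", "--task-id", "3-node-$AWS_BATCH_JOB_NODE_INDEX"]
```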

Author @vymao Aug 27, 2025

I'm not 100% sure of the flow, but I think that is possible if the --task-id flag is the only thing that needs to be changed. Looking at the command, I'm not sure whether something else also requires the task ID to be modified (e.g. MF_PATHSPEC). If --task-id is the only one, then I can definitely make that change.

Author

I've updated the code. I think it is difficult to rely fully on batch_cli.py#251 because we don't have access to $AWS_BATCH_JOB_NODE_INDEX there; it is only available at runtime on a given MNP worker node, I believe. But I've added more placeholders in batch_cli.py#251 that should make the regex more reliable. I'm less familiar with the step_functions file, as I'm not using that right now.
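
To make the intent concrete (the placeholder name and helpers below are hypothetical, not the exact code added in batch_cli.py): the command builder can only emit a stable marker, since $AWS_BATCH_JOB_NODE_INDEX is resolved later inside each worker node, and execute() then swaps the marker instead of pattern-matching a bare integer:

```python
# Hypothetical marker; the actual placeholders added in batch_cli.py may differ.
TASK_ID_NODE_PLACEHOLDER = "{{TASK_ID_NODE_SUFFIX}}"

def build_command(task_id):
    # At command-build time the node index is unknown, so emit a marker that is
    # unambiguous to replace later.
    return "python flow.py step train --task-id %s%s" % (
        task_id,
        TASK_ID_NODE_PLACEHOLDER,
    )

def finalize_for_worker_node(cmd):
    # At execute() time, the marker becomes the runtime-resolved node suffix.
    return cmd.replace(TASK_ID_NODE_PLACEHOLDER, "-node-$AWS_BATCH_JOB_NODE_INDEX")
```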

Is it possible to get approval on this soon?


Collaborator

The direction looks good as is, so no further changes should be required. My final testing is still pending due to some unrelated infra issues on my end that I need to look into. Aiming to get this PR wrapped up by this week.

Collaborator

Were you able to have success with this PR on your end? I was finally able to test things, and it seems that it is not working as expected. The parallel jobs do launch, but the main control node never finishes, instead timing out.

I would assume that the issue lies in https://github.com/Netflix/metaflow/blob/master/metaflow/plugins/aws/batch/batch_decorator.py#L412, where we poll the status of mapper tasks via the Metaflow client. This relies on task metadata, either through a metadata service or, in the absence of one, data stored on local disk.

Multi-node parallel jobs on Batch are single-tenant, so they do not share a disk with each other. This means that a mapper task's finished_at, recorded locally (due to --metadata local), will not be accessible to the polling control task.

Author

Apologies @saikonen, I had prematurely terminated the jobs assuming it was OK. I've now run this end-to-end and confirmed that it works on my end! If we don't have a metadata provider, I've resorted to using the S3 datastore to check MNP job statuses, which has worked smoothly for me.
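
A minimal sketch of the fallback (the is_task_done helper is an assumption, standing in for however the S3 datastore is actually queried in the PR): the control task polls until every mapper task looks complete in the shared datastore, which works even when mapper metadata exists only on another node's local disk:

```python
import time

def wait_for_mapper_tasks(task_ids, is_task_done, poll_interval=30, timeout=3600):
    # is_task_done(task_id) is assumed to check the S3 datastore for the task's
    # completed artifacts / done marker; S3 is shared, unlike local metadata.
    deadline = time.time() + timeout
    pending = set(task_ids)
    while pending:
        pending = {t for t in pending if not is_task_done(t)}
        if not pending:
            break
        if time.time() > deadline:
            raise TimeoutError("mapper tasks still pending: %s" % sorted(pending))
        time.sleep(poll_interval)
```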

Author

Bumping this!

@savingoyal requested a review from saikonen August 26, 2025 21:43
@vymao requested a review from savingoyal August 28, 2025 20:23
@saikonen linked an issue Sep 5, 2025 that may be closed by this pull request: "Is it possible to use @metaflow_ray with foreach on AWS Batch?"