[core] Audit and standardize error handling for Raylet IPC and Plasma store interactions #55612

@codope

Background

We just refactored PutInLocalPlasmaCallback to return Status and propagated failures through TaskManager instead of crashing. During shutdown, IOError from plasma puts is tolerated and logged; otherwise, we surface a system failure to the task owner. See #55367

We should apply this pattern consistently across other Raylet IPC and plasma callsites.
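As a rough illustration of the pattern described above, the sketch below shows one way a caller might classify the result of a plasma put. The `Status` class is a simplified, hypothetical stand-in for `ray::Status`, and `PutOutcome` and `HandlePlasmaPutResult` are names invented for this example; none of this is the code from #55367.

```cpp
#include <iostream>
#include <string>
#include <utility>

// Hypothetical, simplified stand-in for ray::Status (assumption: the real
// class has many more codes and helpers).
enum class StatusCode { OK, IOError, ObjectStoreFull };

class Status {
 public:
  Status() = default;
  Status(StatusCode code, std::string msg) : code_(code), msg_(std::move(msg)) {}

  static Status OK() { return Status(); }
  static Status IOError(std::string msg) {
    return Status(StatusCode::IOError, std::move(msg));
  }
  static Status ObjectStoreFull(std::string msg) {
    return Status(StatusCode::ObjectStoreFull, std::move(msg));
  }

  bool ok() const { return code_ == StatusCode::OK; }
  bool IsIOError() const { return code_ == StatusCode::IOError; }
  const std::string &message() const { return msg_; }

 private:
  StatusCode code_ = StatusCode::OK;
  std::string msg_;
};

// Hypothetical outcome type for this sketch only.
enum class PutOutcome { kStored, kSkippedDuringShutdown, kSystemFailure };

// Classifies a put result instead of crashing on a non-OK status.
// An IOError while shutting down is logged and tolerated; any other
// failure is reported so the caller can surface it to the task owner.
PutOutcome HandlePlasmaPutResult(const Status &put_status, bool is_shutting_down) {
  if (put_status.ok()) {
    return PutOutcome::kStored;
  }
  if (is_shutting_down && put_status.IsIOError()) {
    std::cerr << "WARNING: plasma put failed during shutdown: "
              << put_status.message() << "\n";
    return PutOutcome::kSkippedDuringShutdown;
  }
  return PutOutcome::kSystemFailure;
}
```

The key design choice is that the decision is made at the call site from the status code and the shutdown flag, so the lower-level put never needs to abort the process.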

Goal:

Ensure all Raylet IPC and Plasma interactions:

  • Return/propagate Status to callers instead of hard-crashing (no RAY_CHECK_OK on external IPC paths).
  • Document expected Status codes per operation (e.g., OK/ObjectExists, ObjectStoreFull, IOError, Disconnected).
  • Tolerate shutdown-specific IOErrors where appropriate (warn and continue), but treat non-shutdown failures as real errors handled at the task/owner level.
  • Avoid status-message string matching; use status codes.
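The last two goals could look roughly like the following sketch: the expected codes are documented next to the operation, and the caller dispatches on the code rather than on message text. The `Status` struct is a hypothetical stand-in for `ray::Status`, and `Action` and `ClassifyPutStatus` are invented for illustration; the mapping from codes to actions is an assumed policy, not Ray's actual behavior.

```cpp
#include <string>

// Hypothetical, simplified stand-in for ray::Status, limited to the codes
// named in this issue.
enum class StatusCode { OK, ObjectExists, ObjectStoreFull, IOError, Disconnected };

struct Status {
  StatusCode code;
  std::string message;
};

// Hypothetical caller-side actions for this sketch only.
enum class Action { kProceed, kRetryLater, kFailTask };

// Expected status codes for a (hypothetical) plasma put, documented in one place:
//   OK, ObjectExists       -> the object is available; proceed.
//   ObjectStoreFull        -> transient capacity problem; retry later.
//   IOError, Disconnected  -> the store or raylet is unreachable; fail the task.
Action ClassifyPutStatus(const Status &status) {
  // Dispatch on the code only. The message is for logs and is never parsed.
  switch (status.code) {
    case StatusCode::OK:
    case StatusCode::ObjectExists:
      return Action::kProceed;
    case StatusCode::ObjectStoreFull:
      return Action::kRetryLater;
    case StatusCode::IOError:
    case StatusCode::Disconnected:
      return Action::kFailTask;
  }
  // Unreachable for the codes above; treat anything unexpected as a failure.
  return Action::kFailTask;
}
```

Because the switch covers every enumerator without a `default` label, most compilers can warn when a new code is added and left unhandled, which string matching on messages cannot offer.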

Scope (primary targets to audit):

  • Core worker plasma paths in CoreWorker and CoreWorkerPlasmaStoreProvider:
    • PutInLocalPlasmaStore, Put, CreateExisting, SealExisting, Release, GetIfLocal, Contains, Delete/DeleteImpl, GetPlasmaUsage.
  • Raylet IPC paths relevant to object lifecycle:
    • PinObjectIDs RPC callback handling, UpdateObjectLocation/pubsub paths.
  • Higher-level consumers:
    • TaskManager return-object handling (done in current PR), ObjectRecoveryManager, FutureResolver, generator streaming HandleReportGeneratorItemReturns (evaluate best-effort vs strict propagation).
  • Any remaining RAY_CHECK_OK immediately after Raylet/Plasma calls that can return expected runtime errors.

Labels

core: Issues that should be addressed in Ray Core
