Skip to content

Problems dealing with snapshot create requests timing out #134

@ShyamsundarR

Description

@ShyamsundarR

When a CSI plugin is passed a CreateSnapshot request and the caller (snapshotter sidecar) times out, the snapshotter sidecar marks this as an error and does not retry the snapshot. Further, as the call only timed out and did not fail, the storage provider may have actually created the said snapshot (although delayed).

When such an snapshot is deleted, there are no requests to the CSI plugin to delete the same, which cannot be issued by the sidecar as it does not have the SnapID.

The end result of this is that the snapshot is leaked on the storage provider.

The question/issue hence is as follows,

Should the snapshot be retried on timeouts from the CreateSnapshot call?

Based on the ready_to_use parameter in the CSI spec [1] and possibilities of application freeze as the snapshot is taken, I would assume this operation cannot be done indefinitely. But, also as per the spec timeout errors, the behavior should be a retry, as implemented for volume create and delete operations in the provisioner sidecar [2].

So to fix the potential snapshot leak by the storage provider, should the snapshotter sidecar retry till it gets an error from the plugin or a success with a SnapID, but mark the snapshot as bad/unusable as it was not completed in time (to honor the application freeze times and such)?

[1] CSI spec ready_to_use section: https://github.com/container-storage-interface/spec/blob/master/spec.md#the-ready_to_use-parameter

[2] timeout handling in provisioner sidecar: https://github.com/kubernetes-csi/external-provisioner#csi-error-and-timeout-handling

Metadata

Metadata

Assignees

Labels

kind/bugCategorizes issue or PR as related to a bug.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions