Description
When a CreateSnapshot request to a CSI plugin times out on the caller side (the snapshotter sidecar), the sidecar marks the snapshot as failed and does not retry it. Because the call only timed out and did not fail, the storage provider may still have created the snapshot, just later than the caller was willing to wait.
When such a snapshot is later deleted, no delete request reaches the CSI plugin: the sidecar cannot issue one because it never received the SnapID.
The end result of this is that the snapshot is leaked on the storage provider.
The question/issue is hence as follows:
Should the snapshot be retried on timeouts from the CreateSnapshot call?
Based on the ready_to_use parameter in the CSI spec [1], and the possibility that the application is frozen while the snapshot is taken, I would assume this operation cannot be retried indefinitely. However, per the spec's handling of timeout errors, the expected behavior is a retry, as implemented for volume create and delete operations in the provisioner sidecar [2].
So, to fix the potential snapshot leak on the storage provider, should the snapshotter sidecar retry until it gets either an error from the plugin or a success with a SnapID, but mark the snapshot as bad/unusable if it was not completed in time (to honor application freeze times and such)?
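To make the proposal concrete, here is a rough sketch in Go of the retry behavior described above. It is not the sidecar's actual code: `CreateFunc`, `Snapshot`, `ErrTimeout`, `ErrRejected`, `createWithRetry`, `fakePlugin` and the `freezeBudget` parameter are all hypothetical names invented for illustration, and `ErrTimeout` merely stands in for a gRPC deadline-exceeded status. The sketch relies on CreateSnapshot being idempotent for a given snapshot name, so repeating the call should eventually return the SnapID of a snapshot that was created late.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// Hypothetical sentinel errors for this sketch. ErrTimeout stands in for a
// gRPC deadline-exceeded status; ErrRejected for any definitive plugin error.
var (
	ErrTimeout  = errors.New("CreateSnapshot timed out")
	ErrRejected = errors.New("CreateSnapshot rejected by plugin")
)

// Snapshot is a hypothetical record of what the sidecar learned.
type Snapshot struct {
	ID     string // SnapID returned by the plugin, needed for a later delete
	Usable bool   // false if the snapshot completed after the freeze budget
}

// CreateFunc is a hypothetical stand-in for an idempotent CreateSnapshot call.
type CreateFunc func(ctx context.Context, name string) (snapID string, err error)

// createWithRetry keeps calling create until the plugin returns either a
// SnapID or a non-timeout error. A snapshot that only completes after
// freezeBudget has elapsed is still returned (so it can be deleted and is
// not leaked) but is marked as not usable.
func createWithRetry(create CreateFunc, name string,
	callTimeout, freezeBudget, backoff time.Duration) (Snapshot, error) {
	start := time.Now()
	for {
		ctx, cancel := context.WithTimeout(context.Background(), callTimeout)
		id, err := create(ctx, name)
		cancel()

		switch {
		case err == nil:
			return Snapshot{ID: id, Usable: time.Since(start) <= freezeBudget}, nil
		case errors.Is(err, ErrTimeout), errors.Is(err, context.DeadlineExceeded):
			// Outcome unknown: the snapshot may still be in progress, so retry.
			time.Sleep(backoff)
		default:
			// Definitive failure reported by the plugin: stop retrying.
			return Snapshot{}, err
		}
	}
}

// fakePlugin simulates a plugin that times out a given number of times and
// then returns either the final error or, if final is nil, a SnapID.
func fakePlugin(timeouts int, final error) CreateFunc {
	calls := 0
	return func(ctx context.Context, name string) (string, error) {
		calls++
		if calls <= timeouts {
			return "", ErrTimeout
		}
		if final != nil {
			return "", final
		}
		return "snap-" + name, nil
	}
}
```

For example, `createWithRetry(fakePlugin(2, nil), "pvc-snap", time.Second, 30*time.Second, 100*time.Millisecond)` would retry through two simulated timeouts and come back with a SnapID that the sidecar could later pass to DeleteSnapshot. Whether an unbounded retry loop is acceptable, or whether it needs an overall cap, is part of the open question here.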
[1] CSI spec ready_to_use section: https://github.com/container-storage-interface/spec/blob/master/spec.md#the-ready_to_use-parameter
[2] timeout handling in provisioner sidecar: https://github.com/kubernetes-csi/external-provisioner#csi-error-and-timeout-handling