
CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx` after the first couple of epochs #580

@tastyminerals

Description


I am training a U-Net variant with joint classification and semantic segmentation heads, using amp opt level O1. Training crashes after I explicitly cast box_coord_tensor in the roi_pool call:

from torchvision.ops import roi_pool  # assuming torchvision's roi_pool

rois = roi_pool(
        input=classification_feature_map_tensor,  # float16, produced under amp
        boxes=box_coord_tensor.half(),            # float32 unless explicitly cast to half
        output_size=roi_size,
        spatial_scale=1,
)

The issue is that classification_feature_map_tensor comes out as float16 because it is handled by amp, while box_coord_tensor comes from the input batch and is float32. roi_pool, however, requires both tensors to have the same precision and throws:

RuntimeError: Expected tensor for argument #1 'input' to have the same type as tensor for argument #2 'rois'; but type Variable[CUDAHalfType] does not equal Variable[CUDAFloatType] (while checking arguments for ROIPool_forward_cuda) (checkSameType at /pytorch/aten/src/ATen/TensorUtils.cpp:140)

But if I cast box_coord_tensor to float16, CUDA throws the illegal memory access error below:

  File "/usr/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/usr/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/usr/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 241, in post_backward_no_master_weights
    post_backward_models_are_masters(scaler, params, stashed_grads)
  File "/usr/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 120, in post_backward_models_are_masters
    scale_override=grads_have_scale/out_scale)
  File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 119, in unscale
    self.unscale_python(model_grads, master_grads, scale)
  File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 89, in unscale_python
    self.dynamic)
  File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 9, in scale_check_overflow_python
    cpu_sum = float(model_grad.float().sum())
RuntimeError: CUDA error: an illegal memory access was encountered
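
One alternative I have been considering (untested beyond satisfying the dtype check, so treat it as a sketch) is to cast the feature map up to float32 instead of casting the boxes down, so roi_pool runs entirely in full precision:

rois = roi_pool(
        input=classification_feature_map_tensor.float(),  # cast the fp16 feature map up to fp32
        boxes=box_coord_tensor,                            # already fp32 from the input batch
        output_size=roi_size,
        spatial_scale=1,
)

The result could presumably be cast back to half afterwards if downstream layers expect float16.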

Is there anything else I could try? So far every attempt ends in one of the errors above.
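
I also wondered whether registering roi_pool as a float function with amp would avoid the manual cast entirely; something like the following (a sketch, assuming torchvision's roi_pool and that registration happens before amp.initialize):

import torchvision
from apex import amp

# ask amp to always run roi_pool in float32, casting its inputs up automatically;
# registration has to happen before amp.initialize is called
amp.register_float_function(torchvision.ops, 'roi_pool')

model, optimizer = amp.initialize(model, optimizer, opt_level='O1')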
