
CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx` after the first couple of epochs #580

@tastyminerals

Description


I am training a U-Net variant with joint classification and semantic segmentation heads, using amp opt level O1. Training crashes after I explicitly cast box_coord_tensor in the roi_pool call:

from torchvision.ops import roi_pool  # assuming torchvision's roi_pool

rois = roi_pool(
        input=classification_feature_map_tensor,  # float16, produced under amp
        boxes=box_coord_tensor.half(),            # float32 unless explicitly cast to half
        output_size=roi_size,
        spatial_scale=1,
)

The issue is that classification_feature_map_tensor comes out as float16 because it is handled by amp, while box_coord_tensor comes from the input batch and is float32. roi_pool, however, requires both tensors to have the same precision and throws:

RuntimeError: Expected tensor for argument #1 'input' to have the same type as tensor for argument #2 'rois'; but type Variable[CUDAHalfType] does not equal Variable[CUDAFloatType] (while checking arguments for ROIPool_forward_cuda) (checkSameType at /pytorch/aten/src/ATen/TensorUtils.cpp:140)

But if I cast box_coord_tensor to float16, CUDA throws the illegal memory access error below:

  File "/usr/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/usr/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/usr/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 241, in post_backward_no_master_weights
    post_backward_models_are_masters(scaler, params, stashed_grads)
  File "/usr/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 120, in post_backward_models_are_masters
    scale_override=grads_have_scale/out_scale)
  File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 119, in unscale
    self.unscale_python(model_grads, master_grads, scale)
  File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 89, in unscale_python
    self.dynamic)
  File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 9, in scale_check_overflow_python
    cpu_sum = float(model_grad.float().sum())
RuntimeError: CUDA error: an illegal memory access was encountered
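
One alternative I have been considering (untested beyond satisfying the dtype check, so treat it as a sketch) is to cast the feature map up to float32 instead of casting the boxes down, so roi_pool runs entirely in full precision:

rois = roi_pool(
        input=classification_feature_map_tensor.float(),  # cast the fp16 feature map up to fp32
        boxes=box_coord_tensor,                            # already fp32 from the input batch
        output_size=roi_size,
        spatial_scale=1,
)

The result could presumably be cast back to half afterwards if downstream layers expect float16.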

Is there anything else I could try? So far every attempt ends in one of the errors above.
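
I also wondered whether registering roi_pool as a float function with amp would avoid the manual cast entirely; something like the following (a sketch, assuming torchvision's roi_pool and that registration happens before amp.initialize):

import torchvision
from apex import amp

# ask amp to always run roi_pool in float32, casting its inputs up automatically;
# registration has to happen before amp.initialize is called
amp.register_float_function(torchvision.ops, 'roi_pool')

model, optimizer = amp.initialize(model, optimizer, opt_level='O1')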
