Open
Description
I am training a U-Net variant with joint classification and semantic segmentation using amp at opt level O1. Training crashes after I explicitly cast box_coord_tensor to float16 in the roi_pool call.
rois = roi_pool(
    input=classification_feature_map_tensor,  # float16
    boxes=box_coord_tensor.half(),  # float32 if not cast explicitly
    output_size=roi_size,
    spatial_scale=1,
)

The issue is that classification_feature_map_tensor arrives as float16 because it is handled by amp, while box_coord_tensor comes from the input batch and is float32. However, roi_pool requires both tensors to have the same dtype; without the cast it throws
RuntimeError: Expected tensor for argument #1 'input' to have the same type as tensor for argument #2 'rois'; but type Variable[CUDAHalfType] does not equal Variable[CUDAFloatType] (while checking arguments for ROIPool_forward_cuda) (checkSameType at /pytorch/aten/src/ATen/TensorUtils.cpp:140)
But if I cast box_coord_tensor to float16, CUDA throws the illegal memory access error below.
  File "/usr/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/usr/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/usr/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 241, in post_backward_no_master_weights
    post_backward_models_are_masters(scaler, params, stashed_grads)
  File "/usr/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 120, in post_backward_models_are_masters
    scale_override=grads_have_scale/out_scale)
  File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 119, in unscale
    self.unscale_python(model_grads, master_grads, scale)
  File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 89, in unscale_python
    self.dynamic)
  File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 9, in scale_check_overflow_python
    cpu_sum = float(model_grad.float().sum())
RuntimeError: CUDA error: an illegal memory access was encountered
Is there anything I could try? So far, every attempt has resulted in the error above.