csrc/cpu/dnnl_kernels.cpp (1 addition, 1 deletion)
@@ -523,7 +523,7 @@ void onednn_mm(torch::Tensor& c, // [M, OC], row-major
   CPU_KERNEL_GUARD_IN(onednn_mm)
   TORCH_CHECK(a.dim() == 2);
   TORCH_CHECK(a.stride(-1) == 1);
-  TORCH_CHECK(c.is_contiguous());
+  TORCH_CHECK(c.stride(-1) == 1);
Contributor

critical

Relaxing the check from c.is_contiguous() to c.stride(-1) == 1 without providing the full tensor strides to the underlying OneDNN kernel is dangerous and will likely lead to memory corruption.

When c is not contiguous (e.g., it's a view of a larger tensor, which can be the case with torch.compile's tensor reuse), its rows are not packed together in memory. The MatMulPrimitiveHandler receives c.data_ptr() but does not appear to receive the stride for c's first dimension (unlike for tensor a, where a.stride(0) is passed via exec_args).

Without the stride information, the kernel will write output rows assuming a contiguous layout, overwriting memory that does not belong to c. This can cause silent data corruption and difficult-to-debug crashes.

To fix this correctly, you must either:

  1. Pass the strides of c to MatMulPrimitiveHandler and ensure the OneDNN primitive is configured to use them. This would likely involve adding c.stride(0) to MatMulPrimitiveHandler::ExecArgs.
  2. If modifying the handler is not feasible, you should enforce contiguity. Instead of relaxing the check, you could create a temporary contiguous tensor for the output and copy it back to c if c was not originally contiguous (a minimal sketch of this fallback follows this comment).

Given the potential for silent memory corruption, this is a critical issue.
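
To make option 2 concrete, here is a minimal C++ sketch (not code from this PR) built on the public libtorch tensor API. A hypothetical helper, run_with_contiguous_output, hands the kernel a packed buffer and copies the result back when the caller's output tensor is a strided view; the OneDNN matmul itself is abstracted as the run_kernel callable.

#include <torch/torch.h>

// Hypothetical helper, not part of this PR: guarantee that the kernel writes
// into a contiguous buffer, then copy the result back into the caller's
// (possibly strided) output view.
template <typename Fn>
void run_with_contiguous_output(torch::Tensor& c, Fn&& run_kernel) {
  // Reuse c directly when it is already packed; otherwise allocate a packed
  // temporary with the same shape and dtype.
  torch::Tensor c_buf =
      c.is_contiguous() ? c : torch::empty(c.sizes(), c.options());
  run_kernel(c_buf);  // the kernel may safely assume contiguous rows here
  if (!c_buf.is_same(c)) {
    c.copy_(c_buf);  // scatter the packed rows back into the strided output
  }
}

A fallback like this trades one extra copy for correctness whenever c is a non-contiguous view; option 1, plumbing c.stride(0) through MatMulPrimitiveHandler::ExecArgs so the OneDNN primitive uses the real leading dimension, avoids the copy but requires changing the handler.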

   MatMulPrimitiveHandler* ptr =
       reinterpret_cast<MatMulPrimitiveHandler*>(handler);

vllm/platforms/cpu.py (5 additions, 0 deletions)
@@ -185,6 +185,11 @@ def check_and_update_config(cls, vllm_config: VllmConfig) -> None:
             parallel_config.distributed_executor_backend = "mp"
         if parallel_config.worker_cls == "auto":
             parallel_config.worker_cls = "vllm.v1.worker.cpu_worker.CPUWorker"
+        # Disable DBO
+        if parallel_config.enable_dbo:
+            logger.warning(
+                "Dual-Batch Overlap is not supported on CPU, disabled.")
+            parallel_config.enable_dbo = False

         # Note: workaround for v1 gpu_model_runner
         from vllm.config import CompilationLevel
vllm/v1/worker/cpu_model_runner.py (8 additions, 0 deletions)
@@ -145,12 +145,20 @@ def __init__(self, *args, **kwargs) -> None:
             self.record = lambda: None
             self.synchronize = lambda: None

+    class _StreamPlaceholder:
+
+        def __init__(self, *args, **kwargs) -> None:
+            pass
+
     cuda_event = torch.cuda.Event
+    cuda_stream = torch.cuda.Stream
     try:
         torch.cuda.Event = _EventPlaceholder
+        torch.cuda.Stream = _StreamPlaceholder
         yield
     finally:
         torch.cuda.Event = cuda_event
+        torch.cuda.Stream = cuda_stream


@contextmanager