forked from pytorch/pytorch
[Intel GPU] Allow XPU backend in Depthwise_conv2d&3d operators #2
Open
yucai-intel wants to merge 332 commits into main from yucai-intel-depthwise_conv-xpu
Conversation
yucai-intel pushed a commit that referenced this pull request on Mar 31, 2025
Summary:
Fix another combo kernel logging error:
```
  File "/home/guorachel/local/fbsource/buck-out/v2/gen/fbcode/4bcbfa3ef39dbd6f/caffe2/test/inductor/__combo_kernels__/combo_kernels#link-tree/torch/_inductor/scheduler.py", line 2036, in _init
    self.create_combo_kernel_nodes(num_ck_nodes=None)
  File "/home/guorachel/local/fbsource/buck-out/v2/gen/fbcode/4bcbfa3ef39dbd6f/caffe2/test/inductor/__combo_kernels__/combo_kernels#link-tree/torch/_inductor/scheduler.py", line 3068, in create_combo_kernel_nodes
    log.debug("ComboKernels: Generating with num_ck_nodes = %d...", num_ck_nodes)
Message: 'ComboKernels: Generating with num_ck_nodes = %d...'
Arguments: (None,)
```
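For context, a minimal standalone sketch of this failure mode (plain `logging`, not the PyTorch code itself): `%d` cannot format `None`, so the handler reports the message and arguments exactly as seen above.

```python
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("combo_kernels_sketch")  # hypothetical logger name

num_ck_nodes = None  # the problematic value from the traceback above

# %d cannot format None; the logging handler swallows the TypeError and
# prints "--- Logging error ---" followed by the Message/Arguments lines.
log.debug("ComboKernels: Generating with num_ck_nodes = %d...", num_ck_nodes)

# One way to avoid it (not necessarily the PR's exact fix): use %s, which
# formats None cleanly.
log.debug("ComboKernels: Generating with num_ck_nodes = %s...", num_ck_nodes)
```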
Test Plan:
Verified in test_combo_kernel.py that the logging error went away.
Differential Revision: D71655949
Pull Request resolved: pytorch#149772
Approved by: https://github.com/ColinPeppler, https://github.com/Skylion007
…ass (pytorch#150450) Pull Request resolved: pytorch#150450 Approved by: https://github.com/zou3519 ghstack dependencies: pytorch#150082
…`rendezvous` (pytorch#149793) This is a "temporary" fix, as the current internal API requires strings at some interfaces instead of `std::optional`, and empty strings are presumably used in lieu of `nullopt`, e.g., https://github.com/pytorch/pytorch/blob/9d02b3993f7dae7fa3379d5190ac88291ecd4dce/torch/csrc/distributed/c10d/intra_node_comm.cu#L49 This currently breaks `test_intra_node_comm_all_reduce`. Pull Request resolved: pytorch#149793 Approved by: https://github.com/kwen2501, https://github.com/cyyever
Summary: My commandeer of pytorch#150102. Based on the description of that PR, it seems we need to add C calls for each starting Python event with a callable, such that when the tracing exits we will have a matching enter for any given exit. It adds some unnecessary events at worst but prevents segfaults/failures. My PR just cleans up some refcount impl and logging. Test Plan: Ran the resnet test internally. Will check CI and ask reviewers to make sure it resolves their issues. Differential Revision: D72207570 Pull Request resolved: pytorch#150370 Approved by: https://github.com/aaronenyeshi
…orch#150269) operations Summary: Fix the test for memory tracking. This PR does: (1) Add tracking before and after for all memory-related operations, making sure each operation indeed captures the consumed memory both in CUDA and in torch's CUDACachingAllocator. (2) Keep track of memory being reserved by the CUDACachingAllocator in torch and its relationship with global CUDA memory consumption. Test Plan: This PR is adding tests. Pull Request resolved: pytorch#150269 Approved by: https://github.com/jingsh, https://github.com/chenyang78, https://github.com/desertfire
Implements pytorch#146445 Pull Request resolved: pytorch#150341 Approved by: https://github.com/zou3519, https://github.com/jansel
Pull Request resolved: pytorch#150440 Approved by: https://github.com/jansel, https://github.com/zou3519 ghstack dependencies: pytorch#150341
That restricts the store operation to the 0th thread, which should be much better, shouldn't it? (Though I don't observe it in the benchmark.) Pull Request resolved: pytorch#150457 Approved by: https://github.com/jansel, https://github.com/dcci ghstack dependencies: pytorch#150452
…ion of types. (pytorch#150204) Adds an `Any` return type annotation to `__getattr__` methods in `torch/_ops.py` that return a union of types. Attribute access returning a union of types can cause issues downstream because consumers would need to handle all of the possible types to make the type checker happy. This doesn't seem to matter today for mypy, presumably because `Any` is always inferred when a return type annotation is missing, but it still makes explicit what mypy is already doing implicitly. Pull Request resolved: pytorch#150204 Approved by: https://github.com/malfet
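As an illustration (a sketch, not the actual `torch/_ops.py` code), a union return type from `__getattr__` leaks into every attribute access, while `Any` keeps access unconstrained for the type checker:

```python
from typing import Any, Union

class UnionNamespace:
    # Hypothetical sketch: a union return type forces every caller to narrow.
    def __getattr__(self, name: str) -> Union[int, str]:
        return 0 if name.startswith("num_") else name

class AnyNamespace:
    # With Any, downstream code need not handle int | str on each access,
    # matching what mypy already infers when __getattr__ has no annotation.
    def __getattr__(self, name: str) -> Any:
        return 0 if name.startswith("num_") else name

x: int = UnionNamespace().num_ops  # type checker error: int | str not assignable to int
y: int = AnyNamespace().num_ops    # accepted: Any is assignable to int
```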
Compare self.use_device of torch.autograd.profiler.profiler with _get_privateuse1_backend_name(), since privateuse1 backend can be renamed. Pull Request resolved: pytorch#150396 Approved by: https://github.com/sraikund16
Previously, cudagraph was skipped if the graph contained any meta tensor. However, we should not skip, since a meta tensor involves no actual computation. This PR fixes the issue.
### Example
```python
import torch

def foobar(x, y):
    return x * 2, y * 3

foo_c = torch.compile(mode="reduce-overhead")(foobar)
t = torch.empty((1, 16, 128, 128), device="meta")
y = torch.rand([64], device="cuda")
eager_out = foobar(t, y)
for _ in range(3):
    compiled_out = foo_c(t, y)
```
Prior to this PR, the above code leads to
```
skipping cudagraphs due to multiple devices: device(type='cuda', index=0), device(type='meta')
```
With this PR, we don't skip.
Pull Request resolved: pytorch#150478
Approved by: https://github.com/eellison
… RAIIPyObject interface (pytorch#149350) Add includes for torch.device, torch.dtype, torch.layout, and torch.memory_format to the cpp_wrapper common header, so that they get precompiled. Additionally, add move constructors and operator bool to RAIIPyObject. Closes pytorch#142005. Pull Request resolved: pytorch#149350 Approved by: https://github.com/desertfire
Summary: Emit the corresponding Triton kernel code as comment in each call_triton_ wrapper function, for easier debugging. Differential Revision: [D72178907](https://our.internmc.facebook.com/intern/diff/D72178907) Pull Request resolved: pytorch#150188 Approved by: https://github.com/yushangdi
Instead of always propagating arg_kwarg_vals in _COPY_META_FIELDS, we special-case the pattern matcher to propagate arg_kwarg_vals when it sees triton_kernel_wrapper_functional. The strategy is: 1) trace out the replacement graph with arg_kwarg_vals (which have accurate eager-mode metadata) 2) trace out the replacement graph with vals (which have the accurate Inductor metadata) 3) Propagate the arg_kwarg_vals from the first graph to the second. 4) Use the second graph as the replacement graph. The strategy is this because we want to extend this to handle auto_functionalized later up in the stack. Test Plan: - existing tests Pull Request resolved: pytorch#148046 Approved by: https://github.com/eellison
…des (pytorch#148063) Inductor will force exact strides on a custom operator tagged with needs_exact_strides. I'll make this the default in a follow-up PR. Test Plan: - tests Pull Request resolved: pytorch#148063 Approved by: https://github.com/eellison ghstack dependencies: pytorch#148046
…48091) Mutable custom operators get wrapped into an auto_functionalized HOP, so we need to store the arg_kwarg_vals on the auto_functionalized HOP itself. When Inductor does the re-inplacing, it'll use the pattern matcher to decompose the auto_functionalized HOP back into the original op (and 0+ other view or clone operations). The pattern matcher uses the arg_kwarg_vals to trace the subgraph to do the decomposition, so it ultimately sets arg_kwarg_vals on the original op's node correctly. Test Plan: - new test Pull Request resolved: pytorch#148091 Approved by: https://github.com/eellison ghstack dependencies: pytorch#148046, pytorch#148063
…ytorch#148092) And added a comment about it. Otherwise it might be confusing Test Plan: - wait for CI Pull Request resolved: pytorch#148092 Approved by: https://github.com/eellison ghstack dependencies: pytorch#148046, pytorch#148063, pytorch#148091
Pull Request resolved: pytorch#147592 Approved by: https://github.com/eellison
Pull Request resolved: pytorch#148210 Approved by: https://github.com/eellison
…alse (pytorch#150486) I am not sure if this is the right way. Pull Request resolved: pytorch#150486 Approved by: https://github.com/zou3519 ghstack dependencies: pytorch#150082, pytorch#150450
Fixes pytorch#150480 Pull Request resolved: pytorch#150512 Approved by: https://github.com/atalman Co-authored-by: Andrey Talman <[email protected]>
Disables mm/bmm decompositions.
torch.compile on MPS was speeding up stories15M (~4x) but it was making stories110M much slower.
Self-contained reproducer to demonstrate the difference (before the change; after it, the two timings should be identical):
```python
import torch
import timeit

def bench_mm(f, x, y):
    from torch.utils.benchmark import Timer
    return Timer(stmt="f(x, y); torch.mps.synchronize()",
                 globals={"x": x, "y": y, "f": f},
                 language="python", timer=timeit.default_timer).blocked_autorange()

x = torch.rand(1024, 512, device='mps')
y = torch.rand(512, 1, device='mps')

mm_c = torch.compile(torch.mm, options={"coordinate_descent_tuning": False})
mm_c_cdt = torch.compile(torch.mm, options={"coordinate_descent_tuning": True})

print(f"Compiled torch.mm perf (with cdt disabled) for 1024x512 and 512x1 matrices are {bench_mm(mm_c, x, y).median}")
print(f"Compiled torch.mm perf (with cdt enabled) for 1024x512 and 512x1 matrices are {bench_mm(mm_c_cdt, x, y).median}")
```
Disabling the inductor mm decomposition speeds up stories15M further (~6x) and speeds up stories110M (~7x).
The table below shows average tokens/sec across 5 runs on M1 Pro for stories15M and stories110M:
| | stories15M | stories110M |
|------------------------|------------|-------------|
| without compile | 99.40 | 53.11 |
| compile before change | 367.68 | 19.43 |
| compile after change | 582.96 | 355.07 |
stories110M (without compile)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps
[...]
Average tokens/sec: 53.11
```
stories110M (compile before change)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps --compile
[...]
Average tokens/sec: 19.43
```
stories110M (compile after change)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps --compile
[...]
Average tokens/sec: 355.07
```
stories15M (without compile)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps
[...]
Average tokens/sec: 99.40
```
stories15M (compile before change)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps --compile
[...]
Average tokens/sec: 367.68
```
stories15M (compile after change)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps --compile
[...]
Average tokens/sec: 582.96
```
Pull Request resolved: pytorch#150541
Approved by: https://github.com/malfet
This reverts commit 5734909. Reverted pytorch#150370 on behalf of https://github.com/clee2000 due to broke some profiler tests when building with debug asserts profiler/test_memory_profiler.py::TestMemoryProfiler::test_config_check [GH job link](https://github.com/pytorch/pytorch/actions/runs/14211763078/job/39822158330) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/3ac5a499ddac701f607a9f7206f9bec8871e1cbb) ([comment](pytorch#150370 (comment)))
… `Parameter.__torch_function__` (pytorch#149482) This fixes most of huggingface/diffusers#10795, except for `torch.Tensor._make_subclass`, which will be fixed in a subsequent patch. The relevant tensor subclass from the aforementioned issue is defined here: https://github.com/huggingface/diffusers/blob/fbf6b856cc61fd22ad8635547bff4aafe05723f3/src/diffusers/quantizers/gguf/utils.py#L398-L435. There are two things to note about the tensor subclass: 1. it calls `super().__torch_function__`, which is `torch._C._disabled_torch_function_impl`, so this patch updates `SuperVariable.call_method` to handle it (we can't do a simpler polyfill due to some bug with `var_getattr` raising `NotImplementedError`, which forgot to restore symbolic context). 2. it sets and reads attributes (`quant_type`), and defines new methods (`as_data`), so this patch adds support for those. 3. it has a `__init__`, which Dynamo needs to trace through in `TensorSubclassVariable.call_function`. Differential Revision: [D71906140](https://our.internmc.facebook.com/intern/diff/D71906140) Pull Request resolved: pytorch#149482 Approved by: https://github.com/jansel, https://github.com/mlazos
…nsor subclass `__new__` (pytorch#149483) This builds off the previous patch in the stack, and fully fixes huggingface/diffusers#10795. Essentially, tensor subclass in the issue uses `torch.Tensor._make_subclass`, which has a pretty simple shallow-copy plus type change semantics, as far as Dynamo is concerned. So this patch adds a polyfill for it. As a result, this allows us to trace through many user-defined `__new__` in tensor subclass (it's similar to how we trace through user-defined `__new__` for `UserDefinedClassVariable`), so this patch also faithfully trace through these `__new__` methods. Differential Revision: [D71906139](https://our.internmc.facebook.com/intern/diff/D71906139) Pull Request resolved: pytorch#149483 Approved by: https://github.com/zou3519, https://github.com/mlazos ghstack dependencies: pytorch#149482
…operties (pytorch#149484) This fixes most of the "torch.compile X tensor-subclass" issues encountered in city96/ComfyUI-GGUF#118. The relevant tensor subclass definition is here: https://github.com/city96/ComfyUI-GGUF/blob/298192ed60f8ca821c6fe5f8030cae23424cada5/ops.py#L18-L65. A few things to note about the tensor subclass: 1. it overrides a lot of the `torch.Tensor` methods (e.g., `to`, `clone`), so this patch updates `TensorWithTFOverrideVariable.var_getattr` to support that. 2. it overrides the `shape` property, so this patch updates `TensorWithTFOverrideVariable.var_getattr` to support property as well. 3. it has calls to `torch.Tensor.size`, which returns `torch.Size`, which gets reconstructed in `torch.Tensor.__torch_function__`, so this patch adds support for calling `torch.Size(...)` on non-constant inputs. Differential Revision: [D71906137](https://our.internmc.facebook.com/intern/diff/D71906137) Pull Request resolved: pytorch#149484 Approved by: https://github.com/jansel, https://github.com/mlazos ghstack dependencies: pytorch#149482, pytorch#149483
…rch#149792) This patch effectively ignores traceable_tensor_subclasses, allowing Dynamo to always try tracing into the `__torch_function__` of tensor subclass. This helps us with 2 things: 1. allowing users to directly benefit from better compilation of tensor subclass, by just upgrading pytorch, without having to change legacy library code (see earlier patches in the stack for examples). 2. potentially exposing more issues in compiling tensor subclass, so we can get signals and improve them. As a consequence, it exposed and fixes 2 subtle bugs: 1. In `build_torch_function_fn`, we could get `torch._C._disabled_torch_function_impl` because we have a `Parameter` subclass without `__torch_function__` override or if we have a tensor subclass with `__torch_dispatch__` override. We graph break on this for now, and plan to add support -- the logic for simulating `torch._C._disabled_torch_function_impl` is already in `SuperVariable`, we just need to reuse it. 2. Sometimes we create `SyntheticLocalSource` and need to remove all the guards installed on it, but we only removed the ones whose source _is_ the created synthetic source `s`, but forgot about chained source like `s.foo`, this showed up as `SYNTHETIC_LOCAL['tmp_0'].__torch_function__.__func__`. Differential Revision: [D71906141](https://our.internmc.facebook.com/intern/diff/D71906141) Pull Request resolved: pytorch#149792 Approved by: https://github.com/jansel, https://github.com/mlazos ghstack dependencies: pytorch#149482, pytorch#149483, pytorch#149484
…torch#150045 (pytorch#150441) Merges pytorch#150438 and pytorch#150045. pytorch#150045 was already landed, but did not include a change that makes it unable to land internally. Pull Request resolved: pytorch#150441 Approved by: https://github.com/clee2000
1. Fixes CMake update error: https://github.com/pytorch/pytorch/actions/runs/14223930697/job/39858632864
```
CMake Error at CMakeLists.txt:1 (cmake_minimum_required):
  Compatibility with CMake < 3.5 has been removed from CMake.
  Update the VERSION argument <min> value. Or, use the <min>...<max> syntax
  to tell CMake that the project requires at least <min> but has been updated
  to work with policies introduced by <max> or earlier. Or, add
  -DCMAKE_POLICY_VERSION_MINIMUM=3.5 to try configuring anyway.
```
2. Removes deprecated CUDA 12.4 build
Pull Request resolved: pytorch#150549 Approved by: https://github.com/clee2000
Install nccl in the docker image (which is already being done in some docker images), and use USE_SYSTEM_NCCL=1 in CI builds. It takes some time to build nccl and it doesn't happen in parallel, so there's less benefit in switching to a bigger runner and using more processes.

The other changes in this PR are because there is an install_cuda script and an install_cuda_aarch64 script, and they both build nccl from source and define their own pins for the nccl version. There is also a .ci/docker/nccl-cu11.txt and cu12.txt that define the pins, and this is an attempt to unify them. Unfortunately this leads to a lot of files needing to be copied into the docker build.

Generally this seems to increase docker pull times by <1 min (P1768456379), but it's hard to tell what the real increase is: 15761 MiB -> 16221 MiB for [linux-focal-cuda11.8-py3.10-gcc9 / test (distributed](https://github.com/pytorch/pytorch/actions/runs/14114171729/job/39545500161#logs), measured with `jq '[.layers[].size, .config.size] | add / 1024 / 1024'`. Example: https://hud.pytorch.org/pytorch/pytorch/commit/6eb3c2e2822c50d8a87b43938a9cf7ef0561ede2#39520169577-box

TODO:
* Figure out a way to verify that nccl was built + works properly when it is expected (this time I just checked torch.distributed.is_nccl_available)
* Merge the cusparse installation scripts
* Merge the cuda installation scripts
* Either split the nccl, cuda, and cusparse installations always, or make them always go together in one bash script

distributed/test_distributed_spawn
Pull Request resolved: pytorch#150226 Approved by: https://github.com/seemethere, https://github.com/atalman
…orch#149901) - Allows opset_version to determine which onnx decomposition to choose - Adds a cleanup function to modify the registry after it is built Pull Request resolved: pytorch#149901 Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
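For context, a usage sketch of where `opset_version` enters the exporter (public `torch.onnx.export` API on a recent PyTorch build; the decomposition registry and its cleanup function are internal and not shown):

```python
import torch

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.gelu(x)

# The requested opset also steers which ONNX decomposition is chosen when
# the registry is built; the cleanup step runs inside the exporter.
torch.onnx.export(
    TinyModel(),
    (torch.randn(2, 8),),
    "tiny_gelu.onnx",
    opset_version=18,
    dynamo=True,  # dynamo-based exporter path; assumed for this sketch
)
```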
…e profile title (pytorch#150863) While looking at enabling FR analysis for coalesced collectives, I found that for the slow-path coalescing (collectives which are not all-gather, all-reduce or reduce-scatter), we still record a start event for them. This is wrong and we should do the same thing as endEvent recording. I also made the profiler title more visible when we pass in the opType for coalesced all-gather and reduce-scatter. Pull Request resolved: pytorch#150863 Approved by: https://github.com/eqy, https://github.com/d4l3k, https://github.com/kwen2501
…r coalesce collectives (pytorch#150881) Trying to make the FR analysis code more reusable and modularized, we split the core error analysis logic into separate functions. This PR mostly shuffles the code around a bit. Differential Revision: [D72690120](https://our.internmc.facebook.com/intern/diff/D72690120) Pull Request resolved: pytorch#150881 Approved by: https://github.com/wz337
tracing_state_functions references some torch functions from submodules like `torch.onnx.is_in_onnx_export` that could trigger module initialization & circular imports. I turned the mapping into a function so that the dictionary is not initialized at torch import. (discovered in pytorch#149646) Pull Request resolved: pytorch#150325 Approved by: https://github.com/zou3519
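A minimal sketch of the import-time vs. call-time pattern described here (the mapping contents are illustrative, not the actual `tracing_state_functions` table):

```python
# Module-level dict: resolving torch.onnx.is_in_onnx_export at import time can
# trigger submodule initialization and circular imports.
# TRACING_STATE_FUNCTIONS = {torch.onnx.is_in_onnx_export: False}

from functools import lru_cache
import torch

@lru_cache(maxsize=1)
def tracing_state_functions_sketch():
    # Deferred: the submodule attribute is only resolved on first call,
    # well after `import torch` has finished.
    return {torch.onnx.is_in_onnx_export: False}
```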
This PR creates two utils for generating a schema for HOPs from example inputs and uses the base HOP as an example. 1. HopArgumentInfoGen creates an argument or an output schema with mutation information. 2. CFuncitonSchemaGen pieces together the argument info of inputs and outputs and produces torch._C.FunctionSchema. The is_write attribute of argument info can be computed. Note that the is_write annotation only works when the inputs are flattened (e.g. it cannot support mutation inside a tuple). We need to specially handle the case where we have tuple inputs, like cond. Pull Request resolved: pytorch#149688 Approved by: https://github.com/zou3519
) If a tag is not specified on a custom operator, then inductor will assume that it needs exact strides. Test Plan: - tests + CI Pull Request resolved: pytorch#150511 Approved by: https://github.com/eellison, https://github.com/shunting314 ghstack dependencies: pytorch#150495, pytorch#148104
…orch#150511)" This reverts commit a4bb2f1. Reverted pytorch#150511 on behalf of https://github.com/atalman due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/14357056427/job/40251630946) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/2e7c9d33e7f933ac3b723cb3bb05b9c88432c25c) ([comment](pytorch#148104 (comment)))
This reverts commit 2e7c9d3. Reverted pytorch#148104 on behalf of https://github.com/atalman due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/14357056427/job/40251630946) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/2e7c9d33e7f933ac3b723cb3bb05b9c88432c25c) ([comment](pytorch#148104 (comment)))
The util converts a list of placements in the traditional DTensor format (e.g. [_StridedShard(0), Shard(0)], where list position is mesh_dim and sharding is always applied left-to-right, from dim 0 to higher dims) to a more explicitly ordered format, also replacing '_StridedShard' with simple 'Shard' placements in the process (e.g. the above becomes [(1, Shard(0)), (0, Shard(0))], where the first item in each tuple is the mesh_dim and the ordering of the tuples is the sharding order). This is useful so far as a helper for fixing local shape computation for strided sharding in the uneven-shape case in the following PR, but may also be useful more broadly if we can use explicit orderings to simplify other parts of DTensor logic. This skips implementing some combinations of _StridedSharding that are not currently used in the wild today, but they could be supported easily. Pull Request resolved: pytorch#150493 Approved by: https://github.com/wanchaol, https://github.com/XilunWu
Pull Request resolved: pytorch#150890 Approved by: https://github.com/jerryzh168
…t exported programs (pytorch#150651) Summary: Sometimes we get `MetadataMismatchError` in aoti compilation because draft export uses the flag below to infer the fake kernel when there’s a mismatch, but aoti doesn’t have this flag turned on. https://fburl.com/code/9qzytl6q torch._functorch.config.generate_fake_kernels_from_real_mismatches If we set this flag to True, then aoti compilation would work. Test Plan: ``` buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_runtime_asserts ``` Differential Revision: D72345085 Pull Request resolved: pytorch#150651 Approved by: https://github.com/angelayi
+ a few small fixes (don't error out on 0-element tensors, a few more checks for contiguous outputs, more threads for better perf). Pull Request resolved: pytorch#150813 Approved by: https://github.com/xw285cornell
Summary: as title; a refactor is very much needed, I think, or at least we should unify the internal/external AOTI wrapper hipification method. Test Plan: P1780296121 Differential Revision: D72683568 Pull Request resolved: pytorch#150893 Approved by: https://github.com/davidberard98
Summary: When we divide a FakeTensor by an integer using the fast op implementation, the type promotion should be `ELEMENTWISE_TYPE_PROMOTION_KIND.INT_TO_FLOAT` so we get a float when dividing an int FakeTensor by an integer.
```
FAST = get_fast_op_impls()
fast_div = FAST[torch.ops.aten.div.Tensor]
fast_div(fake_tensor, some_int)
```
Test Plan:
```
python test/test_fake_tensor.py -k test_fast_div
```
Differential Revision: D72667430 Pull Request resolved: pytorch#150874 Approved by: https://github.com/angelayi
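For reference, the eager-mode promotion the fast path is expected to match (a quick check, not part of the PR):

```python
import torch

t = torch.ones(3, dtype=torch.int64)
# True division of an int tensor by an int promotes to the default float dtype.
print((t / 2).dtype)  # torch.float32
```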
…/export (pytorch#150884) Differential Revision: D72667175 Pull Request resolved: pytorch#150884 Approved by: https://github.com/ydwu4
This adds lazy initialization support to ProcessGroupGloo via `TORCH_GLOO_LAZY_INIT` or via `create_device(..., lazy_init=True)`. This is still a draft PR as there's one race condition when doing coalesced operations that needs to be fixed upstream in Gloo first. Depends on pytorch/gloo#427 landing first. This also updates the gloo submodule to include the required changes.
Test plan: added lazy init test variants
```
pytest -v test/distributed/test_c10d_gloo.py -k Lazy
```
Pull Request resolved: pytorch#150801 Approved by: https://github.com/fduwjj
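A hedged usage sketch, assuming a gloo-enabled build; the rendezvous address, port, and single-process world size are placeholders, not values from the PR:

```python
import os
import torch
import torch.distributed as dist

os.environ["TORCH_GLOO_LAZY_INIT"] = "1"  # opt in to lazy connection setup

# Single process purely to illustrate the call; real runs get rank/world_size
# from a launcher such as torchrun.
dist.init_process_group(
    backend="gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1
)
dist.all_reduce(torch.ones(1))  # connections are established lazily on first use
dist.destroy_process_group()
```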
Instead of using refine_dynamic_shapes_from_suggested_fixes to fix ConstraintViolationErrors in draft-export, we can just convert the dims to Dim.AUTO, which is less error prone Pull Request resolved: pytorch#150876 Approved by: https://github.com/pianpwk
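For context, a small usage sketch of `Dim.AUTO` (illustrative of the target behavior; the module and shapes are made up, and this is not the draft-export code path itself):

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x.sum(dim=0)

# Dim.AUTO lets export infer constraints for the batch dimension instead of
# failing with a ConstraintViolationError on an over-specified Dim.
ep = export(M(), (torch.randn(4, 8),), dynamic_shapes={"x": {0: Dim.AUTO}})
```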
Test Plan: Sandcastle Reviewed By: wenxin0319 Pull Request resolved: pytorch#150802 Approved by: https://github.com/Skylion007
<img width="1503" alt="Screenshot 2025-04-09 at 9 07 13 AM" src="https://github.com/user-attachments/assets/e16f31b0-c5dc-4dd6-8adb-aac11ed988db" /> PR https://hud.pytorch.org/pr/148104 is acceptable, but we have to update this to avoid flakiness in the future. Pull Request resolved: pytorch#150937 Approved by: https://github.com/zou3519
Fixes pytorch#144188 Pull Request resolved: pytorch#144458 Approved by: https://github.com/amjames, https://github.com/eellison
Summary: typedef is prone to name collision. Explicitly spell out the actual aten types, needed for the libtorch-free codegen. Pull Request resolved: pytorch#150657 Approved by: https://github.com/malfet
…orch#145834)

## Improvements to `docstring_linter`

* Add a "grandfather list" of existing undocumented classes and functions (`--grandfather`, `--grandfather-tolerance`, `--no-grandfather`, `--write-grandfather`)
* In classes, now just one of the class itself or its `__init__()` method needs to be documented (`--lint-init` turns the old behavior back on)
* Now classes and functions defined local to other functions do not need to be documented (`--lint-local` turns the old behavior back on)
* New `--report` flag produces a compact report of long, undocumented classes or function definitions: see attached example run over all pytorch: [pytorch-docs.json](https://github.com/user-attachments/files/18455981/pytorch-docs.json)

## Help text

```
$ python tools/linter/adapters/docstring_linter.py --help
usage: docstring_linter.py [-h] [-l] [-v] [--grandfather GRANDFATHER]
                           [--grandfather-tolerance GRANDFATHER_TOLERANCE]
                           [--lint-init] [--lint-local] [--lint-protected]
                           [--max-class MAX_CLASS] [--max-def MAX_DEF]
                           [--min-docstring MIN_DOCSTRING] [--no-grandfather]
                           [--report] [--write-grandfather]
                           [files ...]

`docstring_linter` reports on long functions, methods or classes without docstrings

positional arguments:
  files                 A list of files or directories to lint

optional arguments:
  -h, --help            show this help message and exit
  -l, --lintrunner      Run for lintrunner and print LintMessages which aren't edits
  -v, --verbose         Print more debug info
  --grandfather GRANDFATHER, -g GRANDFATHER
                        Set the grandfather list
  --grandfather-tolerance GRANDFATHER_TOLERANCE, -t GRANDFATHER_TOLERANCE
                        Tolerance for grandfather sizes, in percent
  --lint-init, -i       Lint __init__ and class separately
  --lint-local, -o      Lint definitions inside other functions
  --lint-protected, -p  Lint functions, methods and classes that start with _
  --max-class MAX_CLASS, -c MAX_CLASS
                        Maximum number of lines for an undocumented class
  --max-def MAX_DEF, -d MAX_DEF
                        Maximum number of lines for an undocumented function
  --min-docstring MIN_DOCSTRING, -s MIN_DOCSTRING
                        Minimum number of characters for a docstring
  --no-grandfather, -n  Disable the grandfather list
  --report, -r          Print a report on all classes and defs
  --write-grandfather, -w
                        Rewrite the grandfather list
```

---

Pull Request resolved: pytorch#145834 Approved by: https://github.com/amjames, https://github.com/eellison
…h#150231) Pull Request resolved: pytorch#150231 Approved by: https://github.com/pianpwk
Remove a workaround added in pytorch#149381. Fixes pytorch/xla#8934 Pull Request resolved: pytorch#150693 Approved by: https://github.com/albanD
`world_size = int(os.getenv("WORLD_SIZE", 4))` in subsequent lines indicates that the tests in this file do not just require more than 1 GPU, but at least 4 GPUs. `skip_if_lt_x_gpu(4)` does not properly skip this on a platform with 2 GPUs; `skip_if_lt_x_gpu` being broken is potentially related to a similar issue: pytorch#146094.
Pull Request resolved: pytorch#148578
Approved by: https://github.com/atalman
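For reference, a placement sketch of the guard (the import paths come from PyTorch's internal test utilities; the class name, world_size override, and test body are illustrative, not the affected file):

```python
from torch.testing._internal.common_distributed import (
    MultiProcessTestCase,
    skip_if_lt_x_gpu,
)

class FourGpuTestSketch(MultiProcessTestCase):
    @property
    def world_size(self):
        return 4  # mirrors world_size = int(os.getenv("WORLD_SIZE", 4))

    @skip_if_lt_x_gpu(4)  # the guard must match the real 4-GPU requirement
    def test_needs_four_gpus(self):
        ...  # illustrative; the real tests set up a process group here
```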
Branch updated from e631a0d to b347f0c.
yucai-intel pushed a commit that referenced this pull request on Jun 24, 2025
Use uint64_t index types to avoid
```
torch_np/numpy_tests/core/test_einsum.py::TestEinsum::test_einsum_broadcast /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24: runtime error: signed integer overflow: 9223365439786057728 + 13194139533312 cannot be represented in type 'long'
#0 0x7f30d26166ba in std::enable_if<std::is_same_v<long, long>, void>::type at::native::cpublas::(anonymous namespace)::gemm_notrans_<long, long, long>(long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24
#1 0x7f30d26166ba in void at::native::cpublas::(anonymous namespace)::gemm_core_<long, long, long>(at::native::TransposeType, at::native::TransposeType, long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:451:12
#2 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const::'lambda2'()::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3
#3 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3
```
Pull Request resolved: pytorch#154809
Approved by: https://github.com/soulitzer
yucai-intel pushed a commit that referenced this pull request on Jun 24, 2025
Vibe-coded with Codex, after collecting a backtrace, see https://chatgpt.com/s/cd_68438be8a1248191adbfa0a5f000e60b
Even though a check for an empty tensor list exists in `at::cat`, a crash might happen while resolving a named dimension to a position, by calling `dimname_to_position(tensors[0], dim)`, see backtrace below
```
(lldb) up
frame #1: 0x00000001101146dc libtorch_cpu.dylib`at::TensorBase::has_names(this=0x0000000000000000) const at TensorBase.h:559:10
   556   bool has_names() const {
   557     // If a user is using unnamed tensors, then we can short-circuit right here.
   558     // Otherwise, impl::has_names attempts to retrieve names.
-> 559     if (!impl_->has_named_tensor_meta()) {
   560       return false;
   561     }
   562     return impl::has_names(unsafeGetTensorImpl());
(lldb) up
frame #2: 0x00000001101144c4 libtorch_cpu.dylib`at::dimname_to_position(tensor=0x0000000000000000, dim=Dimname @ 0x000000016fdfe348) at NamedTensorUtils.cpp:23:3
   20   int64_t dimname_to_position(const Tensor& tensor, Dimname dim) {
   21     TORCH_CHECK(dim.type() != NameType::WILDCARD,
   22         "Please look up dimensions by name, got: name = None.");
-> 23     TORCH_CHECK(tensor.has_names(),
   24         "Name ", dim, " not found in ", toDimnameRepr(tensor), ".");
   25     const auto names = tensor.names();
   26
```
TODOs:
- Maybe move the test from `test_tensor_creation.py` to OpInfo (not sure which one is more readable)
- Replace `TORCH_CHECK` with `TORCH_CHECK_VALUE` and adjust unit tests

Fixes pytorch#155306
Pull Request resolved: pytorch#155383 Approved by: https://github.com/cyyever, https://github.com/ezyang ghstack dependencies: pytorch#155382
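A Python-level sketch of the reported repro, assuming the post-fix behavior is an ordinary Python exception rather than a crash:

```python
import torch

# Concatenating an empty tensor list with a named dimension used to
# dereference tensors[0] inside dimname_to_position and crash; after the
# check lands it should raise instead (RuntimeError today, possibly
# ValueError once TORCH_CHECK_VALUE is used).
try:
    torch.cat([], dim="N")
except (RuntimeError, ValueError) as e:
    print("raises instead of crashing:", e)
```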
pytorchmergebot pushed a commit that referenced this pull request on Oct 22, 2025
) Summary: This diff fixes two things which come up when testing a tgif-published pt2 model remote net:
1) Updates isSameDevice to handle meta device, to avoid this error:
```
what(): Unsupported device typemeta and meta
Exception raised from isSameDevice at fbcode/caffe2/torch/nativert/executor/PlacementUtils.cpp:20
```
2) Updates xl weight v2 loading logic in Weights.cpp to handle non-TBE xl-weights. Today, we enforce that the device is the same for an old weight and a new weight when replacing with ModelRunnerAdapter.setAttr(). However, the way we replace non-TBE xl weights is to find any weights on the "meta" device and then replace them with their correct weight, with a real device, from the xl_weights folder. Therefore, the new weight and old weight will always have different devices and the device check is invalid. I don't think we've run into this so far because non-TBE xl weights have not been thoroughly tested until now.

Test Plan: Run the MRS you model merge net, which uses non-TBE xl weights. Confirm that before change #1 we get the error:
```
Unsupported device typemeta and meta
```
Then after change #1 and before change #2 we get:
```
what(): Mismatched device for merge.user_tower.linear.weight: meta vs cpu
Exception raised from validateValue at fbcode/caffe2/torch/nativert/executor/Weights.cpp:374
```
After the change, the run is successful.

Command:
```
MODEL_ENTITY_ID=921242082 SNAPSHOT_ID=1269 module_name=merge SAMPLE_INPUT_DIR=/data/users/georgiaphillips/models/921242082/${SNAPSHOT_ID}/${module_name}_archive/package/data/sample_inputs buck2 run mode/dev-nosan -c fbcode.nvcc_arch=h100,a100 -c fbcode.enable_gpu_sections=true caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.${module_name} --moduleName=${module_name} --submodToDevice="merge|cuda0" --benchmarkEnableProfiling=false --disableStaticRuntime=true --doNotRandomizeSampleInputs=true --benchmarkDontRebatchSamples=true --pytorch_predictor_sigmoid_static_dispatch_enable=false --pytorch_predictor_sigmoid_graph_passes_enable=false --sampleInputFilePath=${SAMPLE_INPUT_DIR}/${module_name}.pt
```
Rollback Plan:
Differential Revision: D80713052
Pull Request resolved: pytorch#162842 Approved by: https://github.com/henryoier
This modification adds XPU kernel support for the depthwise_conv2d and depthwise_conv3d operators.
Currently, when running depthwise_conv on XPU devices, it is computed with Mkldnn via the ConvBackend::Overrideable path.
After this modification, depthwise_conv is computed directly using XpuDepthwise3d when the Mkldnn backend is disabled.
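A minimal usage sketch of the operators this PR targets (assumes a PyTorch build with XPU support, falling back to CPU otherwise; `groups == in_channels` is what makes a convolution depthwise):

```python
import torch
import torch.nn as nn

device = "xpu" if torch.xpu.is_available() else "cpu"

# Depthwise 2-D convolution: one filter per input channel (groups == in_channels).
conv2d = nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3,
                   padding=1, groups=32).to(device)
x2d = torch.randn(8, 32, 56, 56, device=device)
out2d = conv2d(x2d)

# Depthwise 3-D convolution, which this PR routes to the XPU kernel
# when the Mkldnn backend is disabled.
conv3d = nn.Conv3d(in_channels=16, out_channels=16, kernel_size=3,
                   padding=1, groups=16).to(device)
x3d = torch.randn(2, 16, 8, 32, 32, device=device)
out3d = conv3d(x3d)

print(out2d.shape, out3d.shape)
```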