Skip to content

Conversation

@yucai-intel
Copy link
Owner

This modification is to support XPU kernels for depthwise_conv2d and depthwise_conv3d.
Currently, when running depthwise_conv on XPU devices, it is calculated with Mkldnn via the ConvBackend::Overrideable path.
After this modification, depthwise_conv will be calculated directly using XpuDepthwise3d when the Mkldnn backend is disabled.

yucai-intel pushed a commit that referenced this pull request Mar 31, 2025
Summary:
fix another combo kernel logging error:

  File "/home/guorachel/local/fbsource/buck-out/v2/gen/fbcode/4bcbfa3ef39dbd6f/caffe2/test/inductor/__combo_kernels__/combo_kernels#link-tree/torch/_inductor/scheduler.py", line 2036, in _init
    self.create_combo_kernel_nodes(num_ck_nodes=None)
  File "/home/guorachel/local/fbsource/buck-out/v2/gen/fbcode/4bcbfa3ef39dbd6f/caffe2/test/inductor/__combo_kernels__/combo_kernels#link-tree/torch/_inductor/scheduler.py", line 3068, in create_combo_kernel_nodes
    log.debug("ComboKernels: Generating with num_ck_nodes = %d...", num_ck_nodes)
Message: 'ComboKernels: Generating with num_ck_nodes = %d...'
Arguments: (None,)

Test Plan:
Verified in test_combo_kernel.py

the logging error went away.

Differential Revision: D71655949

Pull Request resolved: pytorch#149772
Approved by: https://github.com/ColinPeppler, https://github.com/Skylion007
anijain2305 and others added 29 commits April 2, 2025 02:31
…`rendezvous` (pytorch#149793)

this is a "temporary" fix as current internal API requires strings at some interfaces instead of `std::optional` and empty strings are presumably used in-lieu of `nullopt`.
e.g.,
https://github.com/pytorch/pytorch/blob/9d02b3993f7dae7fa3379d5190ac88291ecd4dce/torch/csrc/distributed/c10d/intra_node_comm.cu#L49

this currently breaks `test_intra_node_comm_all_reduce`

Pull Request resolved: pytorch#149793
Approved by: https://github.com/kwen2501, https://github.com/cyyever
Summary:
My commandeer of pytorch#150102

Based on description of PR it seems that we need to add C calls for each starting python event with a callable such that when the tracing exits we will have a matching enter for any given exit. It adds some unnecessary events at worst but prevents segfaults/failures. My PR just cleans up some refcount impl and logging.

Test Plan: Ran resnet test internally. Will check CI and ask reviewers to make sure it resolves their issues.

Differential Revision: D72207570

Pull Request resolved: pytorch#150370
Approved by: https://github.com/aaronenyeshi
…orch#150269)

operations

Summary:
Fix the test for memory tracking. This PR does:
(1) Add tracking before and after for all memory-related operations.
Make sure the operation do indeed captures memory both in CUDA and
torch's CUDACachAllocator Make sure the operation do indeed captures
consumed memory both in CUDA and torch's CUDACachAllocator.
(2) Keep track of memory being reserved by CUDACacheAllocator in
torch and it's relationship with global CUDA memory consumption.

Test Plan:
This PR is adding tests.

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: pytorch#150269
Approved by: https://github.com/jingsh, https://github.com/chenyang78, https://github.com/desertfire
That restrict the store operation to 0th thread, which should be much better, shouldn't it
(Though I don't observe it in the benchmark)

Pull Request resolved: pytorch#150457
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: pytorch#150452
…ion of types. (pytorch#150204)

Adds an `Any` return type annotation to `__getattr__` methods in `torch/_ops.py` that return a union of types. Attribute access returning a union of types can cause issues downstream because consumers would need to handle all of the possible types to make the type checker happy. This doesn't seem to matter today for mypy, presumably because `Any` is always inferred when a return type annotation is missing, but it still makes explicit what mypy is already doing implicitly.

Pull Request resolved: pytorch#150204
Approved by: https://github.com/malfet
Compare self.use_device of torch.autograd.profiler.profiler with _get_privateuse1_backend_name(), since privateuse1 backend can be renamed.

Pull Request resolved: pytorch#150396
Approved by: https://github.com/sraikund16
Previously, cudagraph is skipped if the graph contains any meta tensor. However, we should not skip since meta tensor does not have actual computation. This PR fixes the issue.

### Example

```python
import torch

def foobar(x, y):
    return x * 2, y * 3

foo_c = torch.compile(mode="reduce-overhead")(foobar)
t = torch.empty((1, 16, 128, 128), device="meta")
y = torch.rand([64], device="cuda")

eager_out = foobar(t, y)

for _ in range(3):
    compiled_out = foo_c(t, y)
```

Prior to this PR, above code leads to
```
skipping cudagraphs due to multiple devices: device(type='cuda', index=0), device(type='meta')
```

With this PR, we don't skip.

Pull Request resolved: pytorch#150478
Approved by: https://github.com/eellison
… RAIIPyObject interface (pytorch#149350)

Add includes for torch.device, torch.dtype, torch.layout, and torch.memory_format to the cpp_wrapper common header, so that they get precompiled. Additionally, add move constructors and operator bool to RAIIPyObject.

Closes pytorch#142005.

Pull Request resolved: pytorch#149350
Approved by: https://github.com/desertfire
Summary: Emit the corresponding Triton kernel code as comment in each call_triton_ wrapper function, for easier debugging.

Differential Revision: [D72178907](https://our.internmc.facebook.com/intern/diff/D72178907)
Pull Request resolved: pytorch#150188
Approved by: https://github.com/yushangdi
Instead of always propagating arg_kwarg_vals in _COPY_META_FIELDS, we
special-case the pattern matcher to propagate arg_kwarg_vals when
it sees triton_kernel_wrapper_functional.

The strategy is:
1) trace out the replacement graph with arg_kwarg_vals (which have accurate eager-mode metadata)
2) trace out the replacement graph with vals (which have the accurate Inductor metadata)
3) Propagate the arg_kwarg_vals from the first graph to the second.
4) Use the second graph as the replacement graph.

The strategy is this because we want to extend this to handle
auto_functionalized later up in the stack.

Test Plan:
- existing tests
Pull Request resolved: pytorch#148046
Approved by: https://github.com/eellison
…des (pytorch#148063)

Inductor will force exact strides on a custom operator tagged with
needs_exact_strides. I'll make this the default in a follow-up PR.

Test Plan:
- tests
Pull Request resolved: pytorch#148063
Approved by: https://github.com/eellison
ghstack dependencies: pytorch#148046
…48091)

Mutable custom operators get wrapped into an auto_functionalized HOP, so
we need to store the arg_kwarg_vals on the auto_functionalized HOP
itself.

When Inductor does the re-inplacing, it'll use the pattern matcher to
decompose the auto_functionalized HOP back into the original op (and
0+ other view or clone operations). The pattern matcher uses the
arg_kwarg_vals to trace the subgraph to do the decomposition, so it
ultimately sets arg_kwarg_vals on the original op's node correctly.

Test Plan:
- new test
Pull Request resolved: pytorch#148091
Approved by: https://github.com/eellison
ghstack dependencies: pytorch#148046, pytorch#148063
…ytorch#148092)

And added a comment about it. Otherwise it might be confusing

Test Plan:
- wait for CI

Pull Request resolved: pytorch#148092
Approved by: https://github.com/eellison
ghstack dependencies: pytorch#148046, pytorch#148063, pytorch#148091
…alse (pytorch#150486)

I am not sure if this is the right way.
Pull Request resolved: pytorch#150486
Approved by: https://github.com/zou3519
ghstack dependencies: pytorch#150082, pytorch#150450
Disables mm/bmm decompositions.
torch.compile on MPS was speeding up stories15M (~4x) but it was making stories110M much slower.

Self-contained reproducer to demonstrate the difference (before the change, after it should be identical)
```python
import torch
import timeit

def bench_mm(f, x, y):
    from torch.utils.benchmark import Timer
    return Timer(stmt="f(x, y); torch.mps.synchronize()",
                 globals={"x": x, "y": y, "f": f},
                  language="python", timer=timeit.default_timer).blocked_autorange()

x = torch.rand(1024, 512, device='mps')
y = torch.rand(512, 1, device='mps')

mm_c = torch.compile(torch.mm, options={"coordinate_descent_tuning": False})
mm_c_cdt = torch.compile(torch.mm, options={"coordinate_descent_tuning": True})

print(f"Compiled torch.mm perf (with cdt disabled) for 1024x512 and  512x1 matrices are {bench_mm(mm_c, x, y).median}")
print(f"Compiled torch.mm perf (with cdt enabled) for 1024x512 and  512x1 matrices are {bench_mm(mm_c_cdt, x, y).median}")
```

Disabling the inductor mm decomposition, speeds up stories15M further (~6x) and speeds up stories110M (~7x)
The table below show average tokens/sec across 5 runs on M1 Pro for stories15M and stories110M:

|                        | stories15M | stories110M |
|------------------------|------------|-------------|
| without compile         | 99.40      | 53.11       |
| compile before change   | 367.68     | 19.43       |
| compile after change    | 582.96     | 355.07      |

stories110M (without compile)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps
[...]
Average tokens/sec: 53.11
```

stories110M (compile before change)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps --compile
[...]
Average tokens/sec: 19.43
```

stories110M (compile after change)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps --compile
[...]
Average tokens/sec: 355.07
```

stories15M (without compile)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps
[...]
Average tokens/sec: 99.40
```

stories15M (compile before change)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps --compile
[...]
Average tokens/sec: 367.68
```

stories15M (compile after change)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps --compile
[...]
Average tokens/sec: 582.96
```

Pull Request resolved: pytorch#150541
Approved by: https://github.com/malfet
This reverts commit 5734909.

Reverted pytorch#150370 on behalf of https://github.com/clee2000 due to broke some profiler tests when building with debug asserts profiler/test_memory_profiler.py::TestMemoryProfiler::test_config_check [GH job link](https://github.com/pytorch/pytorch/actions/runs/14211763078/job/39822158330) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/3ac5a499ddac701f607a9f7206f9bec8871e1cbb) ([comment](pytorch#150370 (comment)))
… `Parameter.__torch_function__` (pytorch#149482)

This fixes most of huggingface/diffusers#10795,
except for `torch.Tensor._make_subclass`, which will be fixed in a
subsequent patch.

The relevant tensor subclass from the aforementioned issue is defined
here: https://github.com/huggingface/diffusers/blob/fbf6b856cc61fd22ad8635547bff4aafe05723f3/src/diffusers/quantizers/gguf/utils.py#L398-L435.

There are two things to note about the tensor subclass:
1. it calls `super().__torch_function__`, which is
   `torch._C._disabled_torch_function_impl`, so this patch updates
   `SuperVariable.call_method` to handle it (we can't do a simpler
   polyfill due to some bug with `var_getattr` raising
   `NotImplementedError`, which forgot to restore symbolic context).
2. it sets and reads attributes (`quant_type`), and
   defines new methods (`as_data`), so this patch adds support for those.
3. it has a `__init__`, which Dynamo needs to trace through in
   `TensorSubclassVariable.call_function`.

Differential Revision: [D71906140](https://our.internmc.facebook.com/intern/diff/D71906140)

Pull Request resolved: pytorch#149482
Approved by: https://github.com/jansel, https://github.com/mlazos
…nsor subclass `__new__` (pytorch#149483)

This builds off the previous patch in the stack, and fully fixes
huggingface/diffusers#10795.

Essentially, tensor subclass in the issue uses
`torch.Tensor._make_subclass`, which has a pretty simple shallow-copy
plus type change semantics, as far as Dynamo is concerned. So this patch
adds a polyfill for it.

As a result, this allows us to trace through many user-defined `__new__`
in tensor subclass (it's similar to how we trace through user-defined
`__new__` for `UserDefinedClassVariable`), so this patch also faithfully
trace through these `__new__` methods.

Differential Revision: [D71906139](https://our.internmc.facebook.com/intern/diff/D71906139)
Pull Request resolved: pytorch#149483
Approved by: https://github.com/zou3519, https://github.com/mlazos
ghstack dependencies: pytorch#149482
…operties (pytorch#149484)

This fixes most of the "torch.compile X tensor-subclass" issues
encountered in city96/ComfyUI-GGUF#118. The
relevant tensor subclass definition is here:
https://github.com/city96/ComfyUI-GGUF/blob/298192ed60f8ca821c6fe5f8030cae23424cada5/ops.py#L18-L65.

A few things to note about the tensor subclass:
1. it overrides a lot of the `torch.Tensor` methods (e.g., `to`,
   `clone`), so this patch updates `TensorWithTFOverrideVariable.var_getattr`
   to support that.
2. it overrides the `shape` property, so this patch updates
   `TensorWithTFOverrideVariable.var_getattr` to support property as well.
3. it has calls to `torch.Tensor.size`, which returns `torch.Size`,
   which gets reconstructed in `torch.Tensor.__torch_function__`, so
   this patch adds support for calling `torch.Size(...)` on non-constant
   inputs.

Differential Revision: [D71906137](https://our.internmc.facebook.com/intern/diff/D71906137)
Pull Request resolved: pytorch#149484
Approved by: https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: pytorch#149482, pytorch#149483
…rch#149792)

This patch effectively ignores traceable_tensor_subclasses, allowing
Dynamo to always try tracing into the `__torch_function__` of tensor
subclass. This helps us with 2 things:
1. allowing users to directly benefit from better compilation of tensor
   subclass, by just upgrading pytorch, without having to change legacy
   library code (see earlier patches in the stack for examples).
2. potentially exposing more issues in compiling tensor subclass, so we
   can get signals and improve them.

As a consequence, it exposed and fixes 2 subtle bugs:
1. In `build_torch_function_fn`, we could get
   `torch._C._disabled_torch_function_impl` because we have a
   `Parameter` subclass without `__torch_function__` override or if we
   have a tensor subclass with `__torch_dispatch__` override. We graph
   break on this for now, and plan to add support -- the logic for
   simulating `torch._C._disabled_torch_function_impl` is already in
   `SuperVariable`, we just need to reuse it.
2. Sometimes we create `SyntheticLocalSource` and need to remove all the
   guards installed on it, but we only removed the ones whose source
   _is_ the created synthetic source `s`, but forgot about chained
   source like `s.foo`, this showed up as
   `SYNTHETIC_LOCAL['tmp_0'].__torch_function__.__func__`.

Differential Revision: [D71906141](https://our.internmc.facebook.com/intern/diff/D71906141)
Pull Request resolved: pytorch#149792
Approved by: https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: pytorch#149482, pytorch#149483, pytorch#149484
…torch#150045 (pytorch#150441)

Merges pytorch#150438 and pytorch#150045. pytorch#150045 was already landed, but did not include a change that makes it unable to land internally.

Pull Request resolved: pytorch#150441
Approved by: https://github.com/clee2000
1. Fixes Cmake update error: https://github.com/pytorch/pytorch/actions/runs/14223930697/job/39858632864
```
CMake Error at CMakeLists.txt:1 (cmake_minimum_required):
  Compatibility with CMake < 3.5 has been removed from CMake.

  Update the VERSION argument <min> value.  Or, use the <min>...<max> syntax
  to tell CMake that the project requires at least <min> but has been updated
  to work with policies introduced by <max> or earlier.

  Or, add -DCMAKE_POLICY_VERSION_MINIMUM=3.5 to try configuring anyway.
```
2.  Removes deprecated CUDA 12.4 build
Pull Request resolved: pytorch#150549
Approved by: https://github.com/clee2000
Install nccl in the docker image (which is already being done in some docker images), and use USE_SYSTEM_NCCL=1 in CI builds

It takes some time to build nccl and doesn't happen in parallel, so theres less benefit in switching to a bigger runner and using more processes

The other changes in this PR are because there is an install_cuda script and an install_cuda_aarch64 script and they both build nccl from source and define their own pins for the nccl version.  There is also a .ci/docker/nccl-cu11.txt and cu12.txt that define the pins, and this is an attempt to unify them.  Unfortunately this leads to a lot of files needing to be copied to the docker build

Generally seems to increase docker pull times by <1 min, P1768456379 but its hard to tell what the real increase is
15761 mib -> 16221 [linux-focal-cuda11.8-py3.10-gcc9 / test (distributed](https://github.com/pytorch/pytorch/actions/runs/14114171729/job/39545500161#logs)
`jq '[.layers[].size, .config.size] | add / 1024 / 1024'`

Example https://hud.pytorch.org/pytorch/pytorch/commit/6eb3c2e2822c50d8a87b43938a9cf7ef0561ede2#39520169577-box
![image](https://github.com/user-attachments/assets/d44ef415-6e48-41ef-ac83-f19bab47560c)

TODO:
* Figure out a way to verify that nccl was built + works properly when it is expected (this time i just checked torch.distributed.is_nccl_available)
* Merge the cusparse installation scripts
* Merge the cuda installation scripts
* Either split the nccl, cuda, and cusparse installations always, or make the always together in one bash script

distributed/test_distributed_spawn
Pull Request resolved: pytorch#150226
Approved by: https://github.com/seemethere, https://github.com/atalman
shubhambhokare1 and others added 26 commits April 9, 2025 16:03
…orch#149901)

- Allows opset_version to determine which onnx decomposition to choose
- Adds a cleanup function to modify the registry after it is built

Pull Request resolved: pytorch#149901
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
…e profile title (pytorch#150863)

While looking at enabling FR analysis for coalesced collectives, I found that for the slow-path coalescing (cols which are not all-gather, all-reduce or reduce-scatter), we still record start event for them. This is wrong and we should do the same thing as endEvent recodring.

And I made the profiler title more visible when we pass in the opType for coalesced all-gather and reduce-scatter.

Pull Request resolved: pytorch#150863
Approved by: https://github.com/eqy, https://github.com/d4l3k, https://github.com/kwen2501
…r coalesce collectives (pytorch#150881)

Trying to make the code of FR analysis more reusable and modularized. So we split core error analysis logic into separate functions.

This PR mostly is shuffle around the code a bit.

Differential Revision: [D72690120](https://our.internmc.facebook.com/intern/diff/D72690120)
Pull Request resolved: pytorch#150881
Approved by: https://github.com/wz337
tracing_state_functions references some torch functions from submodules like `torch.onnx.is_in_onnx_export` that could trigger module initialization & circular imports. I turned the mapping into a function so that the dictionary is not initialized at torch import.

(discovered in pytorch#149646)

Pull Request resolved: pytorch#150325
Approved by: https://github.com/zou3519
This PR creates two utils for generating a schema for hops from example inputs and use base hop as an exmaple.
1. HopArgumentInfoGen creates an argument or an output schema with mutation information.
2. CFuncitonSchemaGen piece together the argument info of inputs and outputs and produces torch._C.FunctionSchema.

is_write attribute of argument info can be computed. Note that the is_write annotation only works when the inputs are flattened (e.g. cannot support mutation inside tuple). We need special handling the case where we have tuple inputs like cond.

Pull Request resolved: pytorch#149688
Approved by: https://github.com/zou3519
)

If a tag is not specified on a custom operator, then inductor will
assume that it needs exact strides.

Test Plan:
- tests + CI

Pull Request resolved: pytorch#150511
Approved by: https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: pytorch#150495, pytorch#148104
The util converts a list of placements in the traditional DTensor format
(e.g. [_StridedShard(0), Shard(0)], where list position is mesh_dim and sharding
is always applied left-to-right (from dim 0 to higher dims))

to a more explicitly ordered format, also replacing '_StridedShard' with
simple 'Shard' placements in the process.
(e.g. the above becomes [(1, Shard(0)), (0, Shard(0)] where the first
item in the tuple is the mesh_dim and the ordering of the tuples is the
sharding order.

This is useful so far as a helper for fixing local shape computation for
strided sharding in the uneven shape case, in the following PR- but may
also be useful more broadly if we can use explicit orderings to simplify
other parts of DTensor logic.

This skips implementing some combinations of _StridedSharding that are
not currently used in the wild today, but could be supported easily.

Pull Request resolved: pytorch#150493
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
…t exported programs (pytorch#150651)

Summary:
Sometimes we get `MetadataMismatchError` in aoti compilation because draft export uses the flag below to infer the fake kernel when there’s a mismatch, but aoti doesn’t have this flag turned on.

https://fburl.com/code/9qzytl6q
 torch._functorch.config.generate_fake_kernels_from_real_mismatches

If we set this flag to True, then aoti compilation would work.

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_runtime_asserts
```

Differential Revision: D72345085

Pull Request resolved: pytorch#150651
Approved by: https://github.com/angelayi
+ a few small fixes (don't error out on 0-element tensors, a few more checks for contiguous outputs, more threads for better perf).

Pull Request resolved: pytorch#150813
Approved by: https://github.com/xw285cornell
Summary: as title, a refactor is very needed I think .... or at least unify internal/external AOTI wrapper hipification method

Test Plan: P1780296121

Differential Revision: D72683568

Pull Request resolved: pytorch#150893
Approved by: https://github.com/davidberard98
Summary:
When we divide a FakeTensor by an integer using the fast op implementation, the type promotion should be `ELEMENTWISE_TYPE_PROMOTION_KIND.INT_TO_FLOAT` so we get a float when dividing an int FakeTensor by an integer.

```
FAST = get_fast_op_impls()
fast_div = FAST[torch.ops.aten.div.Tensor]
fast_div(fake_tensor, some_int)
```

Test Plan:
```
python test/test_fake_tensor.py -k test_fast_div
```

Differential Revision: D72667430

Pull Request resolved: pytorch#150874
Approved by: https://github.com/angelayi
This adds lazy initialization support to ProcessGroupGloo via `TORCH_GLOO_LAZY_INIT` or via `create_device(..., lazy_init=True)`

This is still a draft PR as there's one race condition when doing coalesced operations that needs to be fixed upstream in Gloo first. Depends on pytorch/gloo#427 landing first

This also updates the gloo submodule to include the required changes.

Test plan:

added lazy init test variants

```
pytest -v test/distributed/test_c10d_gloo.py -k Lazy
```

Pull Request resolved: pytorch#150801
Approved by: https://github.com/fduwjj
Instead of using refine_dynamic_shapes_from_suggested_fixes to fix ConstraintViolationErrors in draft-export, we can just convert the dims to Dim.AUTO, which is less error prone
Pull Request resolved: pytorch#150876
Approved by: https://github.com/pianpwk
Test Plan: Sandcastle

Reviewed By: wenxin0319

Pull Request resolved: pytorch#150802
Approved by: https://github.com/Skylion007
<img width="1503" alt="Screenshot 2025-04-09 at 9 07 13 AM" src="https://github.com/user-attachments/assets/e16f31b0-c5dc-4dd6-8adb-aac11ed988db" />

PR https://hud.pytorch.org/pr/148104
which is acceptable but we have to update this to avoid  flakiness in the future .

Pull Request resolved: pytorch#150937
Approved by: https://github.com/zou3519
Summary: typedef is prone to name collision. Explicitly spell out the actual aten types, needed for the libtorch-free codegen.

Pull Request resolved: pytorch#150657
Approved by: https://github.com/malfet
…orch#145834)

## Improvements to `docstring_linter`

* Add a "grandfather list" of existing undocumented classes and functions (`--grandfather`, `--grandfather-tolerance`, `--no-grandfather`, `--write-grandfather`)
* In classes, now just one of the class itself or its `__init__()` method needs to be documented (`--lint-init` turns the old behavior back on)
* Now classes and functions defined local to other functions do not need to be documented (`--lint-local` turns the old behavior back on)
* New `--report` flag produces a compact report of long, undocumented classes or function definitions: see attached example run over all pytorch: [pytorch-docs.json](https://github.com/user-attachments/files/18455981/pytorch-docs.json)

## Help text

```
$ python tools/linter/adapters/docstring_linter.py --help
usage: docstring_linter.py [-h] [-l] [-v] [--grandfather GRANDFATHER] [--grandfather-tolerance GRANDFATHER_TOLERANCE] [--lint-init]
                           [--lint-local] [--lint-protected] [--max-class MAX_CLASS] [--max-def MAX_DEF]
                           [--min-docstring MIN_DOCSTRING] [--no-grandfather] [--report] [--write-grandfather]
                           [files ...]

`docstring_linter` reports on long functions, methods or classes without docstrings

positional arguments:
  files                 A list of files or directories to lint

optional arguments:
  -h, --help            show this help message and exit
  -l, --lintrunner      Run for lintrunner and print LintMessages which aren't edits
  -v, --verbose         Print more debug info
  --grandfather GRANDFATHER, -g GRANDFATHER
                        Set the grandfather list
  --grandfather-tolerance GRANDFATHER_TOLERANCE, -t GRANDFATHER_TOLERANCE
                        Tolerance for grandfather sizes, in percent
  --lint-init, -i       Lint __init__ and class separately
  --lint-local, -o      Lint definitions inside other functions
  --lint-protected, -p  Lint functions, methods and classes that start with _
  --max-class MAX_CLASS, -c MAX_CLASS
                        Maximum number of lines for an undocumented class
  --max-def MAX_DEF, -d MAX_DEF
                        Maximum number of lines for an undocumented function
  --min-docstring MIN_DOCSTRING, -s MIN_DOCSTRING
                        Minimum number of characters for a docstring
  --no-grandfather, -n  Disable the grandfather list
  --report, -r          Print a report on all classes and defs
  --write-grandfather, -w
                        Rewrite the grandfather list
```

---

Pull Request resolved: pytorch#145834
Approved by: https://github.com/amjames, https://github.com/eellison
world_size = int(os.getenv("WORLD_SIZE", 4)) in subsequent lines indicate the tests in this file do not only require > 1 GPU, but at least 4 GPUs.  skip_if_lt_x_gpu(4) does not properly skip this on a platform with 2 GPUs.

skip_if_lt_x_gpu being broken, potentially related to a similar issue: pytorch#146094

Pull Request resolved: pytorch#148578
Approved by: https://github.com/atalman
@pytorchmergebot pytorchmergebot force-pushed the yucai-intel-depthwise_conv-xpu branch from e631a0d to b347f0c Compare April 10, 2025 02:21
yucai-intel pushed a commit that referenced this pull request Jun 24, 2025
Use uint64_t index types to avoid
```
 torch_np/numpy_tests/core/test_einsum.py::TestEinsum::test_einsum_broadcast /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24: runtime error: signed integer overflow: 9223365439786057728 + 13194139533312 cannot be represented in type 'long'
    #0 0x7f30d26166ba in std::enable_if<std::is_same_v<long, long>, void>::type at::native::cpublas::(anonymous namespace)::gemm_notrans_<long, long, long>(long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24
    #1 0x7f30d26166ba in void at::native::cpublas::(anonymous namespace)::gemm_core_<long, long, long>(at::native::TransposeType, at::native::TransposeType, long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:451:12
    #2 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const::'lambda2'()::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3
    #3 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3
```

Pull Request resolved: pytorch#154809
Approved by: https://github.com/soulitzer
yucai-intel pushed a commit that referenced this pull request Jun 24, 2025
Vibe-coded with Codex, after collecting a backtrace, see https://chatgpt.com/s/cd_68438be8a1248191adbfa0a5f000e60b

Even though, check for empty tensor list exists in `at::cat` crash might happens while resolving named dimension to position, by calling `dimname_to_position(tensors[0], dim)`, see backtrace below
```
(lldb) up
frame #1: 0x00000001101146dc libtorch_cpu.dylib`at::TensorBase::has_names(this=0x0000000000000000) const at TensorBase.h:559:10
   556 	  bool has_names() const {
   557 	    // If a user is using unnamed tensors, then we can short-circuit right here.
   558 	    // Otherwise, impl::has_names attempts to retrieve names.
-> 559 	    if (!impl_->has_named_tensor_meta()) {
   560 	      return false;
   561 	    }
   562 	    return impl::has_names(unsafeGetTensorImpl());
(lldb) up
frame #2: 0x00000001101144c4 libtorch_cpu.dylib`at::dimname_to_position(tensor=0x0000000000000000, dim=Dimname @ 0x000000016fdfe348) at NamedTensorUtils.cpp:23:3
   20  	int64_t dimname_to_position(const Tensor& tensor, Dimname dim) {
   21  	  TORCH_CHECK(dim.type() != NameType::WILDCARD,
   22  	      "Please look up dimensions by name, got: name = None.");
-> 23  	  TORCH_CHECK(tensor.has_names(),
   24  	      "Name ", dim, " not found in ", toDimnameRepr(tensor), ".");
   25  	  const auto names = tensor.names();
   26
```

TODOs:
 - May be move test from `test_tensor_creation.py` to OpInfo (not sure which one is more readable)
 - Replace  `TORCH_CHECK` with `TORCH_CHECK_VALUE` and adjust unit tests

Fixes pytorch#155306
Pull Request resolved: pytorch#155383
Approved by: https://github.com/cyyever, https://github.com/ezyang
ghstack dependencies: pytorch#155382
pytorchmergebot pushed a commit that referenced this pull request Oct 22, 2025
)

Summary:
This diff fixes two things which come up when testing a tgif-published pt2 model remote net:
1) Updates isSameDevice to handle meta device to avoid this error:
```
what():  Unsupported device typemeta and meta
Exception raised from isSameDevice at fbcode/caffe2/torch/nativert/executor/PlacementUtils.cpp:20
```

2. Updates xl weight v2 loading logic in Weights.cpp to handle non-TBE xl-weights. Today, we enforce the device is the same for an old weight and new weight when replacing with ModelRunnerAdapter.setAttr(). However, the way we replace non-TBE xl weights is to find any weights on "meta" device and then replace them with their correct weight with real device from xl_weights folder. Therefore, the new weight and old weight will always have different devices and the device check is invalid. I don't think we've run into this so far bc non-TBE xl weights have not been thoroughly tested until now.

Test Plan:
Run MRS you model merge net, which uses non-TBE xl weights. Confirm that before change #1 we get error:
```
Unsupported device typemeta and meta
```
Then after change #1 and before change #2 we get:
```
what():  Mismatched device for merge.user_tower.linear.weight: meta vs cpu
Exception raised from validateValue at fbcode/caffe2/torch/nativert/executor/Weights.cpp:374
```
After change run is successful
Command:
```
MODEL_ENTITY_ID=921242082
SNAPSHOT_ID=1269
module_name=merge
SAMPLE_INPUT_DIR=/data/users/georgiaphillips/models/921242082/${SNAPSHOT_ID}/${module_name}_archive/package/data/sample_inputs
buck2 run mode/dev-nosan -c fbcode.nvcc_arch=h100,a100 -c fbcode.enable_gpu_sections=true caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.${module_name} --moduleName=${module_name} --submodToDevice="merge|cuda0"  --benchmarkEnableProfiling=false --disableStaticRuntime=true --doNotRandomizeSampleInputs=true --benchmarkDontRebatchSamples=true --pytorch_predictor_sigmoid_static_dispatch_enable=false --pytorch_predictor_sigmoid_graph_passes_enable=false --sampleInputFilePath=${SAMPLE_INPUT_DIR}/${module_name}.pt
```

Rollback Plan:

Differential Revision: D80713052

Pull Request resolved: pytorch#162842
Approved by: https://github.com/henryoier
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.