[RFC] Scalable vectors in TIR #104
Conversation
Tagging some people who have been involved in related discussions before: @tqchen @kparzysz-quic @masahi
Some quick comments
One possible way to think about SVE is perhaps drawing inspiration from CUDA programming, where each thread corresponds to one element in the vector lane, and there are ways to distinguish between a normal register (that is shared across threads) and a vector register (thread-local storage per thread). Having one special SVE vector dtype is a fine compromise in the vector case, since we only need to tell the difference between a normal scalar reg and a vector reg.
Thanks for your comments @tqchen, much appreciated! I want to ask for some clarifications and expand on some of the points you made, based on my understanding. TL;DR:
Here's a small LLVM example with scalable vectors that adds two vectors (without the cleanup loop); that is similar to the LLVM IR we need to lower to.
I'll assume that you meant the intrinsics like the ones defined in https://github.com/apache/tvm/blob/main/include/tvm/tir/builtin.h - I could see
I'll assume there that you are referring to whether it's better to use
Do you mean lowering loops into something out of which we can create the SVE vectors in the codegen? It is something we can think about, however, it is not clear to me why we would want to treat vectorizing for SVE differently to Neon. The decision to vectorize would still need to be made in the scheduling, and during the TIR passes we would have an awkward situation where some vector operations are represented as ramps and others as hypothetical vectors that only come into existence during codegen. We'd miss out on the optimisations and simplifications in the lowering pipeline. Can you give an example of the more complex operation you are referring to?
I am not familiar with CUDA programming - can you point me to a relevant reference?
It might be useful to also bring some of this discussion to the forums. Here is a quick related sketch of GPU-related models, starting from the original program:

```
for y in range(64):
    for x in range(64):
        C[y, x] = A[y, x] * (B[y] + 1)
```

Say we are interested in this original program. In normal GPU programming terminology, we will map the compute of x to "threads".

S0: GPU style

```
for y in range(64):
    for x in range(64 // n):
        for tid in T.scalable_vectorized_as_threads(n):
            a0: local = A[y, tid + n * x]
            b0: shared = B[y]
            b1: shared = b0 + 1
            c0: local = a0 * b0
            C[y, tid + n * 4 * i] = c0
```

The above code is a rough sketch of what it might look like. Now, it might also be possible to produce a similar, more "vector-view" version using the following rule:
S1: Vector style

```
# note vscale = n
for y in range(64):
    for x in range(64 // n):
        with T.sve_scope(n) as tid:
            a0: vector<vscale> = A[y, tid + n * x]
            b0: scalar = B[y]
            b1: vector<vscale> = b0 + 1
            c0: scalar = a0 * b0
            C[y, tid + n * 4 * i] = c0
```

They are not that different. But one thing is true: we do need to be able to identify the vector dtype differently from the scalar dtype (or, in the case of GPU programming, local from shared). Being able to mark a dtype as ScalableVectorMark seems to serve that purpose.
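To make the "mark the dtype" idea concrete, here is a minimal runnable sketch (purely illustrative - the class and field names are assumptions, not TVM's DataType or an actual ScalableVectorMark implementation) of a dtype that records whether its lane count is fixed or scalable:

```python
# Minimal illustrative sketch: a dtype that knows whether its lane count is a
# fixed constant or a multiple of a runtime vscale.
from dataclasses import dataclass

@dataclass(frozen=True)
class SketchDType:
    code: str                # element type, e.g. "float32"
    lanes: int = 1           # lane multiplier
    scalable: bool = False   # True => actual lanes are lanes * vscale at runtime

    def __str__(self) -> str:
        if self.lanes == 1 and not self.scalable:
            return self.code
        suffix = f"{self.lanes}xvscale" if self.scalable else str(self.lanes)
        return f"{self.code}x{suffix}"

print(SketchDType("float32", 4))                 # float32x4        (fixed vector)
print(SketchDType("float32", 4, scalable=True))  # float32x4xvscale (scalable vector)
```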
BTW, after writing it down, we can find that perhaps it is not necessary (for S1) to explicitly introduce a special vscale. Another approach is that we can mark an SVE scope and use a normal tvm variable:

```
# note vscale = n
n = T.let(call(tvm.builtin.vscale(), ()))
for y in range(64):
    for x in range(64 // n):
        with T.sve_scope(n) as tid:
            a0: vector<vscale> = A[y, tid + n * x]
            b0: scalar = B[y]
            b1: vector<vscale> = b0 + 1
            c0: scalar = a0 * b0
            C[y, tid + n * 4 * i] = c0
```

This circles back to our earlier questions about how to deal with vscale. Generalizing things a bit, say we are looking into higher-dimensional instructions (e.g. SME); likely we need two or more variables (instead of a single vscale). Introducing a new variable node for each can become less tractable, but the reality is that we just need to be able to know that they are variables, and be able to track them through context, so having a var with an annotation somewhere likely can serve similar purposes.
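And a small runnable sketch of the "track them through context" idea (an illustration only - the names are made up, not TVM API), where scalable extents are ordinary variables registered in a scope:

```python
# Illustrative sketch: scalable extents as ordinary named variables that a
# scope/context keeps track of, rather than dedicated IR nodes.
from contextlib import contextmanager

_scalable_vars = []            # stack of sets of variables currently marked scalable

@contextmanager
def scalable_scope(*names):
    _scalable_vars.append(set(names))
    try:
        yield
    finally:
        _scalable_vars.pop()

def is_scalable(name: str) -> bool:
    return any(name in frame for frame in _scalable_vars)

with scalable_scope("vscale_rows", "vscale_cols"):   # e.g. two extents for an SME tile
    assert is_scalable("vscale_rows") and is_scalable("vscale_cols")
assert not is_scalable("vscale_rows")
```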
@tqchen Thanks for elaborating on the GPU programming model, I see the parallels between programming for a variable number of threads and vectors with unknown lengths. The S1 option looks quite similar to what is described in this RFC, except that it uses scoping instead of marking the variable. I should mention some of the technical goals we want to achieve that I have not mentioned much before:
Not really a technical goal, but it would be nice to reuse as much of the current TVM infrastructure as possible, e.g. all the arith rewrite rules also apply (except the ones that use the vector length as part of the simplification). Speaking about reuse...
Thanks for pointing this out! I'll do some further experimentation, but that combination of
In SME we target the outer product engine by addressing the same SVE vectors, so there is still just one vscale. Maybe a few more words on SME, processor states etc... Our thinking so far has been influenced by the support of these extensions in LLVM. While for SVE all generic LLVM intrinsics are supported, there are various optimisations, and it is pretty much treated just like another set of vector registers, SME is going to be targeted through AArch64-specific intrinsics only. So for SVE we'd like to continue using the optimisations at the LLVM stage and deal in TVM with the things LLVM can't do, like high level loop reordering and tuning support. In SME, however, the plan is to use tensorize with a microkernel-style approach. The SME code would also need to execute in the streaming mode, so using the context infra there is definitely something to consider. I'll be away next week, but will look into making changes to the current proposal with the points we have agreed on so far after that. Also cc @neildhickey and his more substantial GPU experience.
Thanks for bringing this up again. A few suggestions to make it more general:
For dealing with an unknown vector length and simultaneously allowing specific lengths per use-site we could either...
Thanks for your comments @kparzysz-quic! Some clarifying questions and thoughts:
Happy to include it, but I'd like to understand better the value it would add. AFAIK the 4 represents min_vector_length / size_of_the_data_type. If we follow that philosophy and mimic LLVM's vscale... I'm mostly looking at it from the point of view of SVE, so I'm interested to learn if there is a case for it for other scalable architecture extensions out there.
Agreed! This might require its own mini-RFC.
Option 2 is what we propose in this RFC. From some prototyping experience, it would let us use all the current infrastructure for vectors in TVM, and the LLVM codegen pretty much "just works", with ca 10 lines to map...
I'm back from holiday and want to get this RFC moving again! Thanks for all the good discussion so far, I've made some changes to the RFC:
Sorry for the delay... What I'm aiming at is to be able to lower the TIR to a generic CPU, that is, to an architecture that does not support SVE. The TIR will need to have some default lowering in CodeGenLLVM/CodeGenCPU, so being able to do that is important. For that, we should be able to assume that... What I wrote earlier about...
Could it instead be in a target-dependent lowering pass? That is, since a lowering pass after... I'd like to avoid adding more complexity to the codegen.
Sure. My idea is to have a single SVE-aware vectorization pass in TVM, and then be able to utilize it for all targets. I'm particularly interested in predication. How the codegen is done doesn't matter much.
Right, I see... Would we get any benefit from mapping the scalable TIR vectors to fixed length LLVM vectors for targets that don't support scalable vectors? At least for Arm's SVE implementations, all access to scalable vectors should be intentional, in this RFC proposal directed by target dependent schedules (SVE is not preferable over fixed length vectors in all cases). I think if I'm compiling code with scalable vectors to a target that doesn't support it, I'd rather it errored out, since something has gone wrong somewhere. I was wondering if there is a case for schedules that would apply to all scalable architectures? My intuition would say no, since the implementations are sufficiently different, but it would be interesting to hear what others think.
Yes that's a good point. I'll have to think about it a bit more, but I tend to agree. Besides the case you mentioned, I can think of some additional upsides - it will help with reliably handling the scalable vectors in the TVM passes since checking if something is
I suppose this is also related to whether we want to implicitly convert to/from scalable vectors. I think it is a cool idea, maybe as an optional (command line triggered) feature. Regarding predication... in my mind the changes to support predication are necessary for SVE, but in terms of the code changes, tangential. So change...
I guess we could pass an argument to the vectorizer indicating whether to generate SVE-friendly code. If this is limited to emitting additional TIR builtins, then I'm ok with that. I just want to be able to reuse as much of the vectorization code as possible between SVE and non-SVE targets. As far as predication goes, you're right---it's somewhat independent from SVE. To take full advantage of SVE we'd need to be able to vectorize loops with an iteration count that is not known at compile time, which is the part I'm interested in. Are you planning to implement that in the near future, or is this a longer-term goal?
I feel extending DLDataType to represent scalable vectors explicitly would be a more robust design than depending on interpreting -1 in a special way for the lanes parameter. Is there any technical reason blocking us from extending DLDataType to have a dedicated field for this?
DLDataType comes from dlpack, not TVM. Changing it may affect the ABI of any function accepting or returning a value of that type, and will affect the memory layout of a DLTensor (and likely more). As a consequence, code built with an older TVM will not be compatible with that built with a newer TVM, plus it will have an impact on any other project using dlpack. Changing it is not impossible, but we should be careful about it.
Agreeing with @kparzysz-quic, changes that update the DLDataType would be quite disruptive. One way to limit the scope of the change might be to introduce a distinction between the runtime data type and the compile-time representation of data types.
@kparzysz-quic I'm somewhat confused about the meaning of "non-SVE targets" there - do you mean targets that don't support VLA programming at all or do you mean other scalable vector architectures like RVV? If it's the latter, then yes, ideally we'd converge to a design that works for all TVM users.
Vectorizing a loop with a compile-time-unknown iteration count is a core part of this proposal - see the code examples in the RFC.
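To make the unknown-trip-count point concrete, here is a small runnable Python model (an illustration of the idea, not TIR or the RFC's actual lowering) of a loop vectorized over a length only known at runtime, with the tail handled by a predicate instead of a scalar cleanup loop:

```python
# Illustrative model only: each outer iteration processes `vl` lanes, and a
# per-lane predicate masks off the out-of-bounds tail elements.
def predicated_vector_add(a, b, vl):
    n = len(a)                                      # trip count unknown until runtime
    out = [0.0] * n
    for base in range(0, n, vl):                    # one "vector" step per vl elements
        mask = [base + i < n for i in range(vl)]    # predicate covering the tail
        for lane in range(vl):
            if mask[lane]:                          # predicated lane
                out[base + lane] = a[base + lane] + b[base + lane]
    return out

assert predicated_vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], vl=4) == [11, 22, 33, 44, 55]
```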
Regarding changing the DLDataType: one of the main problems we have with using -1 to denote scalable vectors is that it doesn't capture all the information, e.g. if we want to set the lanes to a multiple of vscale, -1 alone doesn't say what the multiplier is.
How do you feel about extending runtime::DataType instead?
I think assuming a single vector width (vscale) and using the lanes field for it should be fine. If we want to go beyond a single symbolic variable, having some explicit loop might be better.
I think there's a confusion about the difference between what we have referred to as vscale and vfactor.

For reference, this is how LLVM represents vectors (copied from the documentation): `< <# elements> x <elementtype> >` for fixed-length vectors and `< vscale x <# elements> x <elementtype> >` for scalable vectors. A concrete example of a scalable vector: `<vscale x 4 x i32>` or `<vscale x 16 x i8>`. To construct these vectors we need to know the minimum vector length (SVE's 128 bits used in these examples) and the size of the data type of the vector elements (32 bits or 8 bits in these examples).

**Vscale**

This would mirror LLVM's vscale.

Pros

Cons

**Vfactor**

This was proposed in the first version of this RFC. A TVM vector that would map to a hardware vector would then have vfactor lanes; in this case the constant is implicitly absorbed into vfactor.

Pros

Cons

**The arbitrarily long vectors**

This is the "vectors with multiple vector width" that @tqchen mentioned. It refers to there being no restrictions on the length of the TIR vectors, and subsequently the LLVM vectors, in TVM. I've seen such vectors coming out of TVM's codegen and have always wondered if this is a feature or a (mostly harmless) side effect. LLVM itself deals with it by breaking these vectors down into a string of vector instructions that match the hardware length. SVE support in LLVM can also do that for SVE vectors, so in theory we could create arbitrarily long scalable vectors as well. So the question there is if we want to support creating these vectors in TVM. If we do, ...
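To illustrate the difference between the two namings, a rough sketch (assuming SVE's 128-bit minimum register length; not part of the RFC text) of how the lane count of a hardware-sized vector would be spelled in each scheme:

```python
# How the lane count of a "hardware-sized" vector would be spelled in each scheme.
SVE_MIN_BITS = 128  # assumption: SVE's minimum register length

def lanes_vscale_style(dtype_bits: int) -> str:
    # vscale style: the element count of the minimum-length register is explicit,
    # mirroring LLVM's <vscale x k x ty>.
    k = SVE_MIN_BITS // dtype_bits
    return f"{k} * vscale"

def lanes_vfactor_style(dtype_bits: int) -> str:
    # vfactor style: the constant k is absorbed into a per-dtype vfactor.
    return "vfactor"

print(lanes_vscale_style(32), "->", "<vscale x 4 x i32>")   # 4 * vscale
print(lanes_vscale_style(8), "->", "<vscale x 16 x i8>")    # 16 * vscale
print(lanes_vfactor_style(32))                              # vfactor
```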
Thanks @ekalda for the nice work on the proposal, permit a few personal points of view supporting the initiative:

Pros
Personal note: I would keep going (a +1 ✌️) to align with LLVM concepts regarding the vscale. From an ASIC point of view, in the very CPU design, there is a clear trend that these single-shot atomic "reductors" are becoming increasingly parametrizable w.r.t. data (the veclen/lanes concept), easily trading between bandwidth needs and specific data access in their hottest possible pipeline path. There is also the "v" RISC-V extension that I think is well aligned with these recent concepts (if not, they were even the first to introduce these), so it looks like it is becoming a de facto thing in SIMD design trends. Update: As a last one, there would even be an interesting, quite elegant way of aligning even the classical x86 internals with the...
Regarding the changes required to support scalability in the data type, I've been prototyping adding a new way of marking the data type as scalable. However, I've run into what I believe is an issue when accessing data types at compile-time across the FFI boundary between Python and C++. I wonder if there could be something I've missed here or if there are any other suggestions? Are there any rules for using runtime::DataType across the FFI boundary?
Just to circle back here a bit. The main root issue is that we are using runtime::DataType, which is supposed to be concrete throughout the TIR nodes. This places restrictions on what we can normally represent. A more comprehensive update would change the PrimExpr's dtype field to also be an object, as per StructInfo in Relax. That would require a bit more thinking, which likely can get around the issues mentioned in the thread (of passing around runtime::DataType, which is not an object). I think in the short term making the protocol of lanes = -1 and lanes = -8 (for vscale(8)) may not be a bad idea. The main reason is that I cannot think of another possible use of the lanes field other than for SVE.
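A small sketch of that negative-lanes protocol (illustrative only - the helper names are made up, not TVM API):

```python
# lanes >= 1  -> ordinary fixed-length vector with that many lanes
# lanes == -k -> scalable vector with k * vscale lanes (so -8 means vscale * 8)
def encode_lanes(multiplier: int, scalable: bool) -> int:
    return -multiplier if scalable else multiplier

def decode_lanes(lanes: int):
    if lanes >= 1:
        return ("fixed", lanes)
    return ("scalable", -lanes)   # multiplier of vscale

assert decode_lanes(encode_lanes(4, scalable=False)) == ("fixed", 4)
assert decode_lanes(encode_lanes(8, scalable=True)) == ("scalable", 8)
```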
@cbalint13 @tqchen Thank you for your input! This thread has been dormant for a bit, but we're still on it!
Thanks for sharing this, a really nice presentation! I'm trying to think how RVV's features will align with this RFC proposal... I think LLVM can be a good source of inspiration there :) Based on my (quite basic) understanding of RVV, there are two features that need consideration:
1. Addressing several vectors at once
2. Predication

* ... implement both
Given SVE is only a compile-time concept, we likely don't need a DLDataType counterpart if we remove the runtime data type from the compile-time representation.
Happy new year everyone! 🎉 Here's the SVE prototype, as promised - apache/tvm#16347. It's made by @lhutton1, @neildhickey and me. @tqchen @cbalint13 @Lunderberg @kparzysz-quic et al please have a look!
A change that has not yet been included in the prototype is the predicate representation on buffer loads/stores in TVMScript programs. This was briefly referenced in the RFC. So far we have explored the following options:
1. In Python, keyword arguments within subscripts are not supported. Without a keyword argument, e.g. ...
2. When this approach is used to represent a buffer store (the expression is to the left of an assignment), it creates invalid Python code: "cannot assign to a function call".
3. This is the only syntactically valid approach. However, the predicate is now associated with the buffer itself, as opposed to the buffer load/store.

I'm curious to hear from folks more familiar with TVMScript if there are any other options we've not considered?
If predication is involved, maybe we can explicitly do A.store(...), where predicate can be a kwarg?
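A tiny runnable mock of that kwarg spelling (an illustration of the suggestion, not the actual TVMScript API):

```python
# Mock buffer whose store() takes the predicate as a keyword argument, so the
# predicated form works for stores too (it's a call, not an assignment target).
class MockBuffer:
    def __init__(self, n):
        self.data = [0.0] * n

    def store(self, values, start, predicate=None):
        # Write `values` starting at `start`, skipping lanes masked off by the predicate.
        for i, v in enumerate(values):
            if predicate is None or predicate[i]:
                self.data[start + i] = v

A = MockBuffer(8)
A.store([1.0, 2.0, 3.0, 4.0], 4, predicate=[True, True, True, False])
assert A.data[4:] == [1.0, 2.0, 3.0, 0.0]
```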
This RFC is to add support for vector length agnostic programming in the TVM stack.
Also add a note about expressing scalable lanes in runtime::DataType as -1 * lanes.
Thanks @tqchen for the good suggestion, I included it into the RFC text. I also included a note about the "-8" decision regarding runtime::DataType.
Thanks everyone for all the good discussion so far! ❤️ We've had this RFC public for over 4 months now and the prototype up for a few weeks, and from what I can see there are currently no outstanding issues here - hence we'd like to proceed with merging this RFC next week. I'll then create a tracking issue and we'll upstream the contents of the prototype in logical chunks (with some more substantial testing).
Thanks for working through this. One final comment, on enabling scalable vector support in the meta schedule.
Thanks @tqchen, good point! I updated the Future Possibilities section with some ideas for enabling the scalable vector support in the meta schedule. |
Thanks @ekalda for the work in this RFC, and all who joined the discussion to review it.
Given there is some alignment and no new blocking items spotted, I'll merge this and we can tackle any outstanding items in the scope of the tracking issue to be raised. Thanks again!
This commit extends the functionality of the SME dense and matmul schedules to support operations with fp16 inputs and an fp32 output, where `transpose_a=False` and `transpose_b=True`. For convenience, it also adds a utility called `get_vscale_factor` which creates the correct multiplier for `vscale` given a data type, reflecting ideas from an early design of the [SVE](apache/tvm-rfcs#104) RFC. Change-Id: I8c00bc6baf2df6015fa41200a238781126c73589