
Conversation

@ekalda
Contributor

@ekalda ekalda commented Aug 24, 2023

This RFC is to add support for vector length agnostic programming in the TVM stack.

@ekalda
Contributor Author

ekalda commented Aug 24, 2023

Tagging some people who have been involved in related discussions before: @tqchen @kparzysz-quic @masahi

@tqchen
Member

tqchen commented Aug 24, 2023

Some quick comments

  • I think we should use tir intrinsics (as opposed to a new node, which would add extra burden to the IR)

  • In general, it might be useful to know the information that a value is a multiple of something (e.g. 128), so having something like x * 128 might help

  • I would still love us to think about tensorization support in the codegen with some form of loop annotation (without explicit vector dtypes), as it will generalize to more complex operations.

One possible way to think about SVE is perhaps drawing inspiration from CUDA programming, where each thread corresponds to one element in the vector lane, and ways to distinguish between normal registers (that are shared across threads) and vector registers (thread-local storage per thread).

Having one special sve vector dtype is a fine compromise in the vector case, since we only need to tell the difference between a normal scalar reg and a vector reg

@ekalda
Contributor Author

ekalda commented Aug 30, 2023

Thanks for your comments @tqchen, much appreciated! I want to ask for some clarifications and expand on some of the points you made, based on my understanding.

TL;DR:

  • We need to be able to express vscale dependent extents in the TIR For nodes
  • Aside from predication, SVE vectors are not much different from fixed length vectors, especially in terms of how they are represented in LLVM. The existing TVM infrastructure lends itself quite well to scalable vector support.

Here's a small LLVM IR example with scalable vectors that adds two vectors (without the cleanup loop):

entry:
  tail call void @llvm.memset.p0.i64(ptr noundef nonnull align 4 dereferenceable(4000) %agg.result, i8 0, i64 4000, i1 false)
  %0 = tail call i64 @llvm.vscale.i64()
  %.neg = mul nuw nsw i64 %0, 1016
  %n.vec = and i64 %.neg, 1000
  %1 = tail call i64 @llvm.vscale.i64()
  %2 = shl nuw nsw i64 %1, 2
  %3 = tail call i64 @llvm.vscale.i64()
  %4 = shl nuw nsw i64 %3, 2
  %5 = tail call i64 @llvm.vscale.i64()
  %6 = shl nuw nsw i64 %5, 2
  %7 = tail call i64 @llvm.vscale.i64()
  %8 = shl nuw nsw i64 %7, 3
  br label %vector.body

vector.body:
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
  %9 = getelementptr inbounds i32, ptr %arr0, i64 %index
  %wide.load = load <vscale x 4 x i32>, ptr %9, align 4
  %10 = getelementptr inbounds i32, ptr %9, i64 %2
  %wide.load9 = load <vscale x 4 x i32>, ptr %10, align 4
  %11 = getelementptr inbounds i32, ptr %arr1, i64 %index
  %wide.load10 = load <vscale x 4 x i32>, ptr %11, align 4
  %12 = getelementptr inbounds i32, ptr %11, i64 %4
  %wide.load11 = load <vscale x 4 x i32>, ptr %12, align 4
  %13 = add nsw <vscale x 4 x i32> %wide.load10, %wide.load
  %14 = add nsw <vscale x 4 x i32> %wide.load11, %wide.load9
  %15 = getelementptr inbounds [1000 x i32], ptr %agg.result, i64 0, i64 %index
  store <vscale x 4 x i32> %13, ptr %15, align 4
  %16 = getelementptr inbounds i32, ptr %15, i64 %6
  store <vscale x 4 x i32> %14, ptr %16, align 4
  %index.next = add nuw i64 %index, %8
  %17 = icmp eq i64 %index.next, %n.vec
  br i1 %17, label %middle.block, label %vector.body

That is similar to the LLVM IR we need to lower to.

I think we should use tir intrinsics (as opposed to a new node, which would add extra burden to the IR)

I'll assume that you meant intrinsics like the ones defined in https://github.com/apache/tvm/blob/main/include/tvm/tir/builtin.h - I could see vscale or similar being defined as an intrinsic since it is something that just needs to be matched to llvm.vscale in the codegen. However, from a bit of experimentation, the main problem I see there is around expressing vscale dependent arithmetic. When we map fixed shape tensors to scalable vectors, the extent of the For node will become an expression involving vscale, so we need to be able to include vscale in arithmetic expressions. It looks like the intrinsics are passed around as Call or Op nodes, which don't mix well with PrimExpr. In that sense, another node seems like a much less invasive change. Let me know if I have missed something there.
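For concreteness, a rough sketch of the kind of TIR we would like to be able to express (hypothetical TVMScript - T.vscale here stands for whichever construct we settle on, and the factor 4 assumes float32 on SVE):

@T.prim_func
def main(A: T.Buffer((17,), "float32"), B: T.Buffer((17,), "float32")):
    # The outer extent is an arithmetic expression involving vscale, so
    # whatever represents vscale has to compose with ordinary PrimExpr maths.
    for i_0 in range(T.ceildiv(17, 4 * T.vscale())):
        for i_1 in T.vectorized(4 * T.vscale()):
            if i_0 * 4 * T.vscale() + i_1 < 17:
                B[i_0 * 4 * T.vscale() + i_1] = A[i_0 * 4 * T.vscale() + i_1]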

In general, it might be useful to know the information that a value is a multiple of something (e.g. 128), so having something like x * 128 might help

I'll assume there that you are referring to whether it's better to use vfactor or something like 4 * vscale. Happy to go with vscale, I don't have a strong preference there.

I would still love us to think about tensorization support in the codegen with some form of loop annotation (without explicit vector dtypes), as it will generalize to more complex operations.

Do you mean lowering loops into something like

@T.prim_func
def main(A: T.Buffer((50, 50), "float32"), B: T.Buffer((50, 50), "float32")):
    for i in T.VectorizeScalable(50):
        for j in range(50):
            B_1 = T.Buffer((2500,), data=B.data)
            A_1 = T.Buffer((2500,), data=A.data)
            B_1[i * 50 + j] = A_1[i * 50 + j]

out of which we can create the SVE vectors in the codegen? It is something we can think about; however, it is not clear to me why we would want to treat vectorizing for SVE differently to Neon. The decision to vectorize would still need to be made in the scheduling, and during the TIR passes we would have an awkward situation where some vector operations are represented as ramps and others as hypothetical vectors that only come into existence during codegen. We'd miss out on the optimisations and simplifications in the lowering pipeline. Can you give an example of the more complex operation you are referring to?

One possible way to think about SVE is perhaps drawing inspiration from CUDA programming,

I am not familiar with CUDA programming - can you point me to a relevant reference?

@tqchen
Member

tqchen commented Aug 30, 2023

It might be useful to also bring some of the discussion to the forums. Here is a quick related sketch of GPU-related models.

for y in range(64):
  for x in range(64):
      C[y, x] = A[y, x] * (B[y] + 1)

Say we are interested in the original program. In normal GPU programming terminology, we will map the compute of x to "threads", where tid is the thread index. In GPU programming there are also different memory scopes (I am using CUDA terminology here):

  • local: the variable is local to each thread
  • shared: the variable is "shared" across threads; concurrently writing different values to the same shared variable is somewhat undefined.
  • warp shuffle: sometimes we might need to exchange data (e.g. take a sum) across the threads, and it is done through shuffle instructions (like warp.all_reduce).

S0: GPU style

for y in range(64):
  for x in range(64 // n):
    for tid in T.scalable_vectorized_as_threads(n):
      a0: local = A[y, tid + n * x]
      b0: shared = B[y]
      b1: shared =  b0 + 1
      c0: local = a0 * b1
      C[y, tid + n * x] = c0

The above code is a rough sketch of what it might look like. Now, it might also be possible to produce a similar, more "vector-view" version using the following rule:

  • local <=> vector
  • shared <=> normal register

S1: Vector style

# note vscale = n
for y in range(64):
  for x in range(64 // n):
    with T.sve_scope(n) as tid:
      a0: vector<vscale> = A[y, tid + n * x]
      b0: scalar = B[y]
      b1: scalar = b0 + 1
      c0: vector<vscale> = a0 * b1
      C[y, tid + n * x] = c0

They are not that different. But one thing is true: we do need to be able to identify the vector dtype differently from the scalar dtype (or, in the case of GPU programming, local from shared). Being able to mark a dtype as ScalableVectorMark seems to serve that purpose.

@tqchen
Member

tqchen commented Aug 30, 2023

BTW, after writing it down, we can find that perhaps it is not necessary (for S1) to explicitly introduce a special vscale. Another approach is that we can mark an SVE scope and use a normal TVM variable n to mark the SVE extent.

# note vscale = n
n = T.let(call(tvm.builtin.vscale(), ()))

for y in range(64):
  for x in range(64 // n):
    with T.sve_scope(n) as tid:
      a0: vector<vscale> = A[y, tid + n * x]
      b0: scalar = B[y]
      b1: scalar = b0 + 1
      c0: vector<vscale> = a0 * b1
      C[y, tid + n * x] = c0

This circles back to our questions about how to deal with vscale. My feeling is that having a special intrin marking it (and using a call) would be useful once per function (or SVE scope). Then we can reuse the normal arithmetic analysis built for integer variables, without worrying too much about a special vscale.

Generalizing things a bit, say we are looking into higher dimensional instructions (e.g. SME), we likely need two or more variables (instead of a single vscale). Introducing a new variable node for each can become less tractable, but the reality is that we just need to know that they are variables and be able to track them through context, so having a var with an annotation somewhere can likely serve similar purposes.

@ekalda
Contributor Author

ekalda commented Sep 1, 2023

@tqchen Thanks for elaborating on the GPU programming model, I see the parallels between programming for a variable number of threads and for vectors with unknown lengths. The S1 option looks quite similar to what is described in this RFC, except it uses scoping instead of marking the variable with T.Vectorized. What do you see as the benefits of using the scoping?

I should mention some of the technical goals we want to achieve that I have not mentioned a lot before:

  • Ability to mix fixed length and scalable vectors in the same PrimFuncs
  • To make scalable vectors a natural part of TVM's powerful scheduling, namely the various combinations of splitting, reordering loops and vectorizing
  • There are cases where we want to use scalable vectors and cases where we don't, depending on the details of the specific hardware - so getting to the point where we can use the tuner in that decision making would be great

Not really a technical goal, but it would be nice to reuse as much of the current TVM infrastructure as possible. For example, all the arith rewrite rules still apply (except the ones that use the vector length as part of the simplification), and with the addition of mapping vscale to llvm.vscale, the LLVM codegen currently supports simple contiguous unpredicated loads, stores and binary operations pretty much out of the box.

Speaking about reuse...

n = T.let(call(tvm.builtin.vscale(), ()))

Thanks for pointing this out! I'll do some further experimentation, but that combination of call and let seems to be sufficient to realize our goals. I don't want to introduce a new node if there is a decent existing alternative.

Generalizing things a bit, say we are looking into higher dimensional instructions(e.g. SME), likely we need two or more variables (instead of a single vscale).

In SME we target the outer product engine by addressing the same SVE vectors, so there is still just one vscale in the program. (Technically there is the streaming mode, which implies a different scalable vector length, but that is controlled by a processor state, so the different lengths are not exposed to the software). In general though, I think it is good to have a way to express different scalable vector lengths in the same code; it would make the implementation more general.

Maybe a few more words on SME, processor states etc... Our thinking so far has been influenced by the support for these extensions in LLVM. While for SVE all generic LLVM intrinsics are supported, there are various optimisations, and it is pretty much treated just like another set of vector registers; SME, on the other hand, is going to be targeted through AArch64-specific intrinsics only. So for SVE we'd like to continue using the optimisations at the LLVM stage and deal in TVM with the things LLVM can't do, like high level loop reordering and tuning support. For SME, however, the plan is to use tensorize with a microkernel style approach. The SME code would also need to execute in streaming mode, so using the context infra there is definitely something to consider.

I'll be away next week, but after that I will look into making changes to the current proposal based on the points we have agreed on so far. Also cc @neildhickey, who has more substantial GPU experience.

@kparzysz-quic

Thanks for bringing this up again.

A few suggestions to make it more general:

  1. Add a parameter to tir.vscale to state the minimal assumed vector length. For AArch64 SVE it will be 128 (bits), but some other non-SVE architecture can provide a different value (via a target hook, or something like that). This way more targets can take advantage of this.
  2. The special case of lanes == -1 in Ramp does not easily extend to multiple parameters, but it could be handled in some ways...
  3. If you plan to include predication eventually, that would be something that a lot of targets could use. The LLVM intrinsics for predicated operations do not explicitly require SVE, they can be used with fixed-sized vectors as well.

For dealing with unknown vector lengths while simultaneously allowing specific lengths per use-site, we could either:

  1. Require that if Ramp/Broadcast has lanes == -1, then the base/value member must be a TIR intrinsic specifying the vscale for the value. E.g. Ramp(tir.vscale(128, base), stride, -1) or Broadcast(tir.vscale(256, value), -1).
  2. Extend Ramp and Broadcast to take lanes as PrimExpr, with restrictions on what that expression can contain.

@ekalda
Contributor Author

ekalda commented Sep 13, 2023

Thanks for your comments @kparzysz-quic! Some clarifying questions and thoughts:

Add a parameter to tir.vscale to state the minimal assumed vector length. For AArch64 SVE it will be 128 (bits), but some other non-SVE architecture can provide a different value (via a target hook, or something like that).

Happy to include it, but I'd like to understand better the value it would add. AFAIK llvm::vscale does not have the minimum vector length associated with it; it's encoded in the "multiplier", e.g. in

%wide.load11 = load <vscale x 4 x i32>, ptr %12

the 4 represents min_vector_length / size_of_the_data_type. If we follow that philosophy and mimic LLVM's vscale in TIR, then it will be the responsibility of the author of the target specific schedule to set that multiplier correctly. It would be different if we opted for something like vfactor instead of vscale (as originally proposed in the RFC), since vfactor would essentially represent the number of elements in a vector, which would depend on the minimum length.
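To spell that relationship out (illustrative numbers only, using SVE's figures):

min_vector_bits = 128                      # SVE's architectural minimum vector length
elt_bits = 32                              # i32 / float32 elements
multiplier = min_vector_bits // elt_bits   # = 4, as in <vscale x 4 x i32>
# number of lanes in a hardware vector = multiplier * vscale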

I'm mostly looking at it from the point of view of SVE, so I'm interested to learn if there is a case for it in other scalable architecture extensions out there.

If you plan to include predication eventually, that would be something that a lot of targets could use. The LLVM intrinsics for predicated operations do not explicitly require SVE, they can be used with fixed-sized vectors as well.

Agreed! This might require its own mini-RFC.

For dealing with unknown vector lengths while simultaneously allowing specific lengths per use-site, we could either:

  1. Require that if Ramp/Broadcast has lanes == -1, then the base/value member must be a TIR intrinsic specifying the vscale for the value. E.g. Ramp(tir.vscale(128, base), stride, -1) or Broadcast(tir.vscale(256, value), -1).
  2. Extend Ramp and Broadcast to take lanes as PrimExpr, with restrictions on what that expression can contain.

Option 2 is what we propose in this RFC. From some prototyping experience, it would let us use all the current infrastructure for vectors in TVM, and the LLVM codegen pretty much "just works", with ca 10 lines to map tir.vscale to llvm::vscale (that applies to simple consecutive loads and stores; it's a bit more complex for things like ramps with stride != 1). I'm not in favour of exposing -1 to the user in any form, e.g. from TVMScript or just from printing TIR - it is not a particularly intuitive interface. The only reason for -1 is the DLPack standard, for which we need a way to express scalable vectors. Another idea to handle this would be to add a new field to DLDataType, e.g. bool is_scalable, but I'm not sure how feasible changing that standard is.

@ekalda
Contributor Author

ekalda commented Oct 6, 2023

I'm back from holiday and want to get this RFC moving again! Thanks for all the good discussion so far, I've made some changes to the RFC:

  • Use vscale directly instead of vfactor, and use a TIR intrinsic to represent vscale instead of introducing a new node
  • Opt for predication instead of a cleanup loop

@kparzysz-quic

Sorry for the delay... What I'm aiming at is to be able to lower the TIR to a generic CPU, that is to an architecture that does not support SVE. The TIR will need to have some default lowering in CodeGenLLVM/CodeGenCPU, so being able to do that is important. For that, we should be able to assume that vscale is 1. The vscale would simply be an indicator to the codegen (in TVM) that the code may be lowered to SVE.

What I wrote earlier about vfactor was that the value of it depended on the data type for which it was calculated. If you're sticking with vscale, then it may seem like we don't need it, but the issue is with using "x * vscale" as an idiom: if you have several occurrences of "4 * vscale" in an expression, it may end up being rearranged to something like "(4*vi + 4) * vscale", or "ceildiv(128, 4 * vscale)" may end up being "ceildiv(32, vscale)". So, instead of "x * vscale", I suggest "vscale(x)".

@Lunderberg
Contributor

What I'm aiming at is to be able to lower the TIR to a generic CPU, that is to an architecture that does not support SVE. The TIR will need to have some default lowering in CodeGenLLVM/CodeGenCPU, so being able to do that is important.

Could it instead be in a target-dependent lowering pass? That is, since a lowering pass after BindTarget (here in driver_api.cc) would know whether the target CPU supports SVE or not, we could make a pass that either returns the IRModule unmodified for CPUs that support SVE, or converts it to non-SVE instructions otherwise.

I'd like to avoid adding more complexity to the CodeGenLLVM and CodeGenCPU steps, as it is more difficult to test than IRModule to IRModule transformations.

@kparzysz-quic

Could it instead be in a target-dependent lowering pass?

Sure. My idea is to have a single SVE-aware vectorization pass in TVM, and then be able to utilize it for all targets. I'm particularly interested in predication. How the codegen is done doesn't matter much.

@ekalda
Contributor Author

ekalda commented Oct 9, 2023

What I'm aiming at is to be able to lower the TIR to a generic CPU, that is to an architecture that does not support SVE. The TIR will need to have some default lowering in CodeGenLLVM/CodeGenCPU, so being able to do that is important. For that, we should be able to assume that vscale is 1. The vscale would simply be an indicator to the codegen (in TVM) that the code may be lowered to SVE.

Right, I see... Would we get any benefit from mapping the scalable TIR vectors to fixed length LLVM vectors for targets that don't support scalable vectors? At least for Arm's SVE implementations, all use of scalable vectors should be intentional - in this RFC proposal it is directed by target dependent schedules (SVE is not preferable over fixed length vectors in all cases). I think if I'm compiling code with scalable vectors for a target that doesn't support them, I'd rather it errored out, since something has gone wrong somewhere.

I was wondering if there is a case for schedules that would apply to all scalable architectures? My intuition would say no, since the implementations are sufficiently different, but it would be interesting to hear what others think.

If you're sticking with vscale, then it may seem like we don't need it, but the issue is with using "x * vscale" as an idiom: if you have several occurrences of "4 * vscale" in an expression, it may end up being rearranged to something like "(4*vi + 4) * vscale", or "ceildiv(128, 4 * vscale)" may end up being "ceildiv(32, vscale)". So, instead of "x * vscale", I suggest "vscale(x)".

Yes, that's a good point. I'll have to think about it a bit more, but I tend to agree. Besides the case you mentioned, I can think of some additional upsides - it will help with reliably handling the scalable vectors in the TVM passes, since checking if something is vscale is easier than checking if it is an expression involving vscale. It also makes it easier to enforce that if lanes in the ramp is not an integer, it is vscale and not just anything. It shouldn't create significantly more complexity for the codegen either (we just need to emit an extra multiply when encountering the vscale). So I think it would give us a more robust implementation.
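As a small sketch of that robustness argument (hypothetical helpers - they assume a tir.vscale intrinsic exists and that Ramp lanes can be a PrimExpr, neither of which is in TVM today):

import tvm

def lanes_is_vscale_call(lanes):
    # vscale(4) form: a single Call node, so validation is one check
    return isinstance(lanes, tvm.tir.Call) and lanes.op.name == "tir.vscale"

def lanes_contains_vscale(lanes):
    # 4 * vscale() form: the call can be buried anywhere in the expression,
    # so every pass that cares has to walk the whole tree
    found = [False]

    def visit(node):
        if isinstance(node, tvm.tir.Call) and node.op.name == "tir.vscale":
            found[0] = True

    tvm.tir.stmt_functor.post_order_visit(lanes, visit)
    return found[0]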

Could it instead be in a target-dependent lowering pass? That is, since a lowering pass after BindTarget (here in driver_api.cc) would know whether the target CPU supports SVE or not, we could make a pass that either returns the IRModule unmodified for CPUs that support SVE, or converts it to non-SVE instructions otherwise.

I suppose this is also related to whether we want to implicitly convert to/from scalable vectors. I think it is a cool idea - maybe an optional (command line triggered) IRModule -> IRModule pass to turn the fixed length vectors into scalable vectors (or vice versa) that users can experiment with without having to write schedules. I think this would be a future improvement; the goal of this RFC is to add the tools to the toolbox, give TVM users access to scalable vectors and unblock SME (which will bring very significant performance improvements).

Regarding predication... In my mind the changes to support predication are necessary for SVE, but in terms of the code changes they are tangential. So: change the BufferLoad and BufferStore nodes to accept a predicate, and change the LoopVectorizer such that instead of scalarising a loop it can't exactly vectorize, it creates predicated buffer operations. I haven't implemented it and I'm not much of a BufferLoad expert, so I might be missing something there, but to me it looks like predication could be used without any SVE infra.
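As a rough sketch of what that could produce (hypothetical syntax - T.vscale, T.get_active_lane_mask and the trailing predicate on the buffer accesses are all proposals, not existing TVMScript), vectorizing a 17-element loop with scalable vectors and a predicate instead of a scalar cleanup loop might look like:

for i_o in range(T.ceildiv(17, 4 * T.vscale())):
    base = i_o * 4 * T.vscale()
    # mask off the lanes that would run past the end of the buffer
    pred = T.get_active_lane_mask(base, 17)
    B[T.ramp(base, 1, 4 * T.vscale()), pred] = A[T.ramp(base, 1, 4 * T.vscale()), pred]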

@kparzysz-quic

I guess we could pass an argument to the vectorizer indicating whether to generate SVE-friendly code. If this is limited to emitting additional TIR builtins, then I'm ok with that. I just want to be able to reuse as much of the vectorization code as possible between SVE and non-SVE targets.

As far as predication goes, you're right---it's somewhat independent from SVE. To take full advantage of SVE we'd need to be able to vectorize loops with iteration count that is not known at compile time, which is the part I'm interested in. Are you planning to implement that in the near future, or is this a longer-term goal?

@neildhickey

Another idea to handle this would be to add a new field to DLDataType, e.g. bool is_scalable, but I'm not sure how feasible changing that standard is.

I feel extending DLDataType to represent scalable vectors explicitly would be a more robust design than depending on interpreting -1 in a special way for the lanes parameter. Is there any technical reason blocking us from extending DLDataType to have an is_scalable field, allowing us to maintain the meaning of the lanes field as the number of lanes?

<vscale x 4 x float> can then be encoded by setting the is_scalable field and setting the lanes field to 4, and we do not need to introduce any special handling.
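For illustration, a Python mock-up of what such an extension could look like (DLDataType itself is a C struct defined in dlpack; the names here are hypothetical):

from dataclasses import dataclass

@dataclass
class ScalableDLDataType:          # hypothetical extension, for illustration only
    code: int                      # e.g. kDLFloat == 2
    bits: int                      # e.g. 32
    lanes: int                     # keeps its usual meaning
    is_scalable: bool = False      # proposed new field

# <vscale x 4 x float> would then be:
vec_f32 = ScalableDLDataType(code=2, bits=32, lanes=4, is_scalable=True)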

@kparzysz-quic

Is there any technical reason blocking us from extending DLDataType to have an is_scalable field, allowing us to maintain the meaning of the lanes field as the number of lanes?

DLDataType comes from dlpack, not TVM. Changing it may affect the ABI of any function accepting or returning a value of that type, and will affect the memory layout of a DLTensor (and likely more). As a consequence, code built with an older TVM will not be compatible with that built with a newer TVM, plus it will have an impact on any other project using dlpack.

Changing it is not impossible, but we should be careful about it.

@Lunderberg
Contributor

Agreeing with @kparzysz-quic, changes that update the DLDataType would need to be approached very cautiously. I usually lean toward allowing short-term breakages if they lead to better long-term code health, but updating the DLDataType would be very wide reaching, even for my tastes.

One way to limit the scope of the change might be to introduce a distinction between the runtime DLDataType and some new compile-time data type. This would be analogous to the distinction between the runtime DLTensor and the compile-time tir::Buffer. It would still involve massive changes inside of TVM, but it would preserve the runtime types, avoiding the ABI breakage of compiled functions.

@ekalda
Contributor Author

ekalda commented Oct 11, 2023

I guess we could pass an argument to the vectorizer indicating whether to generate SVE-friendly code. If this is limited to emitting additional TIR builtins, then I'm ok with that. I just want to be able to reuse as much of the vectorization code as possible between SVE and non-SVE targets.

@kparzysz-quic I'm somewhat confused about the meaning of "non-SVE targets" there - do you mean targets that don't support VLA programming at all or do you mean other scalable vector architectures like RVV? If it's the latter, then yes, ideally we'd converge to a design that works for all TVM users.

To take full advantage of SVE we'd need to be able to vectorize loops with iteration count that is not known at compile time, which is the part I'm interested in. Are you planning to implement that in the near future, or is this a longer-term goal?

Vectorizing a loop with a compile-time-unknown iteration count is a core part of this proposal - see the code examples in the RFC.

@ekalda
Contributor Author

ekalda commented Oct 11, 2023

Regarding changing the DLDataType, I can see how it could have a wide, disruptive impact. Scalable vectors are here to stay though, so it could be a way to future-proof the DLPack standard? 🤷‍♀️

One of the main problems we have with using -1 to denote scalable vectors is that it doesn't capture all the information. E.g. if we want to set the lanes of the Ramp to 8 * vscale() (or vscale(8)), what is lanes in DataType/DLDataType going to be?

  1. -1: it's impossible to recover from this what the "multiplier" was (in this example 8), so the runtime data type does not have all the information. Quite a lot of the stack (especially codegen) relies on checking the lanes from the DataType object.
  2. -8: captures all the information, but we think it's awful 🙈

One way to limit the scope of the change might be to introduce a distinction between the runtime DLDataType, and some new compile-time data-type.

How do you feel about extending tvm::runtime::DataType to hold some information about the scalability? From what I can see, it is just a wrapper around DLDataType plus some helper functions. We could add a field there to indicate if it is a scalable vector. How disruptive would that be? I suppose we'd have the problem of the DLDataType not holding the information about the scalability, but I'm not sure if that would be very important.

@tqchen
Member

tqchen commented Oct 11, 2023

I think assuming a single vector width (vscale) and using kScalableVectorMark=-1 to mark it would be a good tradeoff, given it may not be that useful to create vectors with multiple vector widths anyway for optimization reasons.

If we want to go beyond a single symbolic variable, having some explicit loop might be better

@ekalda
Contributor Author

ekalda commented Oct 12, 2023

I think there's some confusion about the difference between what we have referred to as vscale and vfactor. I'll try to summarise the difference and the respective pros and cons.

For reference, this is how LLVM represents vectors (copied from the documentation):

< <# elements> x <elementtype> >          ; Fixed-length vector
< vscale x <# elements> x <elementtype> > ; Scalable vector

A concrete example of a scalable vector:

<vscale x 4 x float>

or

<vscale x 16 x i8> 

To construct these vectors we need to know the minimum vector length (SVE's 128 bits in these examples) and the size of the data type of the vector elements (32 bits or 8 bits in these examples).

Vscale

This would mirror LLVM's vscale intrinsic, so if we had a TIR intrinsic with the same meaning, a TVM vector of floats that would exactly map to a hardware vector would look like

ramp(base, stride, 4 * vscale)   # or vscale(4) depending on which UI we want to go for

Pros

  1. When eyeballing the TIR, the meaning of the vscale intrinsic is intuitive since it matches LLVM
  2. It makes translating the expressions involving vscale that exist outside of the vectors in codegen very easy since we just have to map tir.vscale -> llvm.vscale
  3. Since we can pull the information about the vector element data type from the ramp node, we can deduce the minimum vector length from the multiplier
  4. Makes it simpler to support arbitrarily long vectors*

Cons

  1. Representing lanes in runtime data type is very awkward (see the comments above)
  2. It's harder to place restrictions on what ramp->lanes can be, so it can get accidentally set to something nonsensical. This could be alleviated by using vscale(4) though, as recommended by @kparzysz-quic

Vfactor

This was proposed in the first version of this RFC. A TVM vector that would map to a hardware vector would be:

ramp(base, stride, vfactor)

In this case the constant is implicitly absorbed into vfactor and will be deduced during codegen. The minimum vector length should be known to the backend specific codegen and the data type size can be pulled from the data type of the elements in the vector.

Pros

  1. Simpler to use in the scheduling, you don't have to worry about data type size and minimum vector length
  2. Less visual clutter
  3. Easier to create a robust implementation since we can enforce that if lanes of the ramp is not int, it is vfactor (unless we go to the territory of arbitrarily long vectors*)
  4. DLDataType representation is less of an issue, we can just go for -1

Cons

  1. We don't know the implicit data type of vfactor that is outside of the vector (this is a big problem)

*The arbitrarily long vectors

This is the "vectors with multiple vector widths" that @tqchen mentioned. It refers to there being no restrictions on the length of the TIR vectors (and subsequently the LLVM vectors) in TVM. I've seen things like

<1024 x float>

coming out of TVM's codegen. I've always wondered if this is a feature or a (mostly harmless) side effect. LLVM itself deals with it by breaking these vectors down into a string of vector instructions that match the hardware length. SVE support in LLVM can also do that for SVE vectors, so in theory we could create vectors like

<vscale x 512 x float>

So the question there is whether we want to support creating these vectors in TVM. If we do, the vscale approach would be more appropriate. I agree though that it is probably not particularly useful. So it depends how much we care about feature parity between the vector types there.

@cbalint13

cbalint13 commented Oct 20, 2023

Thanks @ekalda for the nice work on the proposal, permit a few personal points of view supporting the initiative:

Pros

  1. When eyeballing the TIR, the meaning of the vscale intrinsic is intuitive since it matches LLVM
  2. It makes translating the expressions involving vscale that exist outside of the vectors in codegen very easy since we just have to map tir.vscale -> llvm.vscale
  3. Since we can pull the information about the vector element data type from the ramp node, we can deduce the minimum vector length from the multiplier
  4. Makes it simpler to support arbitrarily long vectors*

Cons

  1. Representing lanes in runtime data type is very awkward (see the comments above)
  • I don't see the lanes information being awkward; it is already happening for classical x86, see: x86 unrolled tensorizers
  • Also, given the lanes information, now even the schedulers start to be aware of this, see this recent fragment: x86 proposal
  2. It's harder to place restrictions on what ramp->lanes can be, so it can get accidentally set to something nonsensical. This could be alleviated by using vscale(4) though, as recommended by @kparzysz-quic
ramp(base, stride, vfactor)

Cons

  1. We don't know the implicit data type of vfactor that is outside of the vector (this is a big problem)
  • Why not have both the vfactor (abstract) concept along with vscale (real), where vfactor would be a "virtual" teller of how a single true-type vscale ramps? This makes the "implicit data type" known on one hand, and it would also be expressive enough for "vectors with multiple vector widths".

Personal note:

I would keep going (a +1 ✌️) with aligning with LLVM concepts regarding the vscale type, even at the price of having a native data type implemented from the very bottom of the dlpack stack up to the top TVM endings of the LLVM emitters.

From an ASIC point of view, in CPU design itself, there is a clear trend that these single-shot atomic "reductors" are becoming increasingly parametrizable w.r.t. data (the veclen/lanes concept), easily trading between bandwidth needs and specific data access in their hottest possible pipeline path.

There is also the "v" RISC-V extension, which I think is well aligned with these recent concepts (if anything, they were even the first to introduce them), so it looks like this is becoming a de facto thing in SIMD design trends.

Update:

As a last point, there would even be an interesting, quite elegant way of aligning the classical x86 internals with the vscale concept (as a kind of backward compatibility), considering the historical expansion of the x86 SIMD subsets (sse3, avx2, avx512, +amx?) over the expansions of the _m128, _m256, _m512 widths (reflecting inside multiples of i8 integers, or f16 in some cases) in all current TVM scheduler implementations.

@lhutton1
Contributor

Regarding the changes required to support scalability in the data type, I've been prototyping adding a new scalable_ attribute to DataType (which wraps DLDataType).

However, I've run into what I believe is an issue when accessing data types at compile time across the FFI boundary between Python and C++. TVMArgValue and TVMRetValue may have a value stored as a DLDataType. Storing a scalable DataType as a DLDataType will mean that we lose information about the scalability (assuming we don't want to alter DLPack, or use the negative lanes < -1 approach). For the limited number of test cases I've written, I've worked around this limitation by forcing DataType to be stored as a string across the boundary. But this feels a bit wrong.

I wonder if there could be something I've missed here or if there are any other suggestions? Are there any rules for using string, DataType and DLDataType interchangeably?

@cbalint13

cbalint13 commented Dec 6, 2023

FYI,
@ekalda, @lhutton1, @tqchen
A comprehensive presentation on SVE design, both on RISC-V and Arm, from the perspective of LLVM.
The presentation captures all the design details of the SVE rationale in LLVM, including arch comparisons.
https://youtu.be/-ox8iJmbp0c?feature=shared (Vector Codegen / Luke Lau)

@tqchen
Member

tqchen commented Dec 7, 2023

Just to circle back here a bit: the main root issue is that we are using runtime::DataType, which is supposed to be concrete throughout the TIR nodes.

This places restrictions on what we can normally represent. A more comprehensive update would change the PrimExpr's dtype field to also be an object, as per StructInfo in Relax. That would require a bit more thinking, but it likely can get around the issues mentioned in the thread (of passing around runtime::DataType, which is not an object).

I think in the short term, making the protocol of lanes = -1 and lanes = -8 (for vscale(8)) may not be a bad idea. The main reason is that I cannot think of another possible use of the lanes field other than for SVE.
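A sketch of that protocol (illustration only - the exact encoding was still under discussion at this point):

# lanes >= 1 : ordinary fixed-length vector with `lanes` lanes
# lanes == -k: scalable vector with k * vscale lanes, e.g. vscale(8) -> -8
def encode_lanes(multiplier, is_scalable):
    return -multiplier if is_scalable else multiplier

def decode_lanes(lanes):
    # returns (multiplier, is_scalable)
    return (-lanes, True) if lanes < 0 else (lanes, False)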

@ekalda
Contributor Author

ekalda commented Dec 8, 2023

@cbalint13 @tqchen Thank you for your input! This thread has been dormant for a bit, but we're still on it!

A comprehensive presentation on SVE design, both on RISC-V and Arm, from the perspective of LLVM.
The presentation captures all the design details of the SVE rationale in LLVM, including arch comparisons.
https://youtu.be/-ox8iJmbp0c?feature=shared (Vector Codegen / Luke Lau)

Thanks for sharing this, a really nice presentation! I'm trying to think how RVV's features will align with this RFC proposal... I think LLVM can be a good source of inspiration there :) Based on my (quite basic) understanding of RVV, there are two features that need consideration:

1. Addressing several vectors at once (LMUL)
This has been resolved in LLVM by encoding the LMUL value into the multiplier of vscale. Since this proposal follows the LLVM convention in expressing the scalable length, it can easily be adopted in TVM. As it currently stands, it will be up to the schedule author to do the maths and figure out the correct multiplier. It would be even easier if we implemented both vscale and vfactor...*

2. Predication
If I understood it correctly, there are two ways of setting the active lanes:

  1. By providing a bitmask as a predicate to the operation - I'd expect the LLVM RVV backend to support llvm.get_active_lane_mask for that purpose, so for this case the current proposal should work
  2. By setting the VL register to the number of active lanes - I suppose that's the feature you mentioned in your previous comment, @cbalint13? I can think of a few options there:
    1. If it is more like a status register that will apply to several instructions, we can use pragmas/TensorIR block attributes
    2. If it is not an expensive operation, we can add an optional argument to ramps and broadcasts to indicate the active lanes
    3. The RISC-V backend in TVM should have all the information to translate tir.get_active_lane_mask to appropriate LLVM intrinsics that set the VL register.

* ... implement both vscale and vfactor

Why not have both the vfactor (abstract) concept along with vscale (real), where vfactor would be a "virtual" teller of how a single true-type vscale ramps? This makes the "implicit data type" known on one hand, and it would also be expressive enough for "vectors with multiple vector widths".

Sorry I missed this before! That's a good point, I think there would be benefits in having both of these available. It would certainly make expressing multiples of vector length simpler, e.g.

Ramp(base, 1, 2 * vfactor)

would imply LMUL = 2. If we want to keep vfactor as a user facing convenience function, we could do the translation from vfactor to n * vscale in the Ramp constructor so we wouldn't need to teach the TVM internal passes how to deal with it.
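As a sketch of the maths a schedule author (or a vfactor-style helper) would have to do - assuming LLVM's conventions of a 128-bit vscale granule for SVE and a 64-bit granule for RVV; the helper name is hypothetical:

def vscale_multiplier(elt_bits, lmul=1, granule_bits=128):
    # Number of elements per unit of vscale for a group of `lmul` registers.
    # SVE float32:                 128 / 32 = 4  -> ramp(base, 1, 4 * vscale)
    # RVV float32, LMUL=2:      2 * 64 / 32 = 4  -> ramp(base, 1, 4 * vscale)
    return lmul * granule_bits // elt_bits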


@tqchen

Just to circle back here a bit: the main root issue is that we are using runtime::DataType, which is supposed to be concrete throughout the TIR nodes.

This places restrictions on what we can normally represent. A more comprehensive update would change the PrimExpr's dtype field to also be an object, as per StructInfo in Relax. That would require a bit more thinking, but it likely can get around the issues mentioned in the thread (of passing around runtime::DataType, which is not an object).

I think I see what you mean and I agree - if we had something of a base type Object representing the data type, that would give us much more freedom in expressing the compile time data type. I see how this would be a pretty invasive change though. I'm also not sure how this would interoperate with the DLDataType dependent runtime implementation (but I also don't know the runtime implementation very well).

I think in the short term, making the protocol of lanes = -1 and lanes = -8 (for vscale(8)) may not be a bad idea. The main reason is that I cannot think of another possible use of the lanes field other than for SVE.

I'm fine with going with this option (especially if we manage to hide the -8 from the user), as it is probably the least invasive yet sturdy option that will allow us to achieve our goals. @lhutton1 has been prototyping an additional field in runtime::DataType, but it's a bit of a can of worms (as per his post above).

We intend to upload a draft prototype soon, then you guys will have something more concrete to look at :)

@tqchen
Member

tqchen commented Dec 8, 2023

I'm also not sure how this would interoperate with the DLDataType dependent runtime implementation (but I also don't know the runtime implementation very well).

Given SVE is only a compile-time concept, we likely don't need a DLDataType counterpart if we remove the runtime data type from the compile-time repr.

@ekalda
Contributor Author

ekalda commented Jan 4, 2024

Happy new year everyone! 🎉 Here's the SVE prototype, as promised - apache/tvm#16347. It's made by @lhutton1, @neildhickey and me.

@tqchen @cbalint13 @Lunderberg @kparzysz-quic et al please have a look!

@lhutton1
Contributor

lhutton1 commented Jan 8, 2024

A change that has not yet been included in the prototype is the predicate representation on buffer loads/stores in TVMScript programs. This was briefly referenced in the RFC.

So far we have explored the following options:

1. A[i:i+4, predicate=T.get_active_lane_mask(0, 4)]

In Python, keyword arguments within subscripts are not supported. Without a keyword argument, e.g. [i:i+4, T.get_active_lane_mask(0, 4)], it wouldn't be easy to tell (when reading the TVMScript) whether the last argument refers to another dimension of the buffer or to a predicate.

2. A[i:i+4](predicate=T.get_active_lane_mask(0, 4))

When this approach is used to represent a buffer store (the expression is to the left of an assignment), it creates invalid python code: "cannot assign to a function call".

3. A(predicate=T.get_active_lane_mask(0, 4))[i:i+4]

This is the only syntactically valid approach. However, the predicate is now associated with the buffer itself, as opposed to the buffer load/store.

I'm curious to hear from folks more familiar with TVMScript if there are any other options we've not considered?

@tqchen
Member

tqchen commented Jan 8, 2024

if predication is involved, maybe we can explicitly do A.store(...)? where predicate can be a kwarg
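A sketch of how that could read (hypothetical TVMScript - the predicate keyword, T.vscale, T.get_active_lane_mask and the exact vload/vstore spelling were still being settled at this point):

# predicate passed as a keyword argument to explicit load/store methods,
# avoiding the subscript ambiguities discussed above
pred = T.get_active_lane_mask(i, n)
value = A.vload([T.ramp(i, 1, 4 * T.vscale())], predicate=pred)
B.vstore([T.ramp(i, 1, 4 * T.vscale())], value, predicate=pred)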

This RFC is to add support for vector length agnostic programming in the TVM stack.
Also add a note about expressing scalable lanes in runtime::DataType as -1 * lanes.
@ekalda
Contributor Author

ekalda commented Jan 11, 2024

if predication is involved, maybe we can explicitly do A.store(...)? where predicate can be a kwarg

Thanks @tqchen for the good suggestion - I included it in the RFC text (as an extension to vload and vstore).

I also included a note about the "-8" decision regarding runtime::DataType.

@ekalda
Contributor Author

ekalda commented Jan 17, 2024

Thanks everyone for all the good discussion so far! ❤️ We've had this RFC public for over 4 months now and the prototype up for a few weeks, and from what I can see, there are currently no outstanding issues here - hence we'd like to proceed with merging this RFC next week. I'll then create a tracking issue and we'll upstream the contents of the prototype in logical chunks (with some more substantial testing).

@tqchen
Member

tqchen commented Jan 17, 2024

Thanks for working through this. One final comment, on Exposing scalable vectors to tuning: let us discuss it through MetaSchedule, as that is a more synergistic approach to tuning moving forward and also works well with the IRModule/PrimFunc system. There are no blocking items here though.

@ekalda
Contributor Author

ekalda commented Jan 18, 2024

Thanks @tqchen, good point! I updated the Future Possibilities section with some ideas for enabling the scalable vector support in the meta schedule.

Contributor

@leandron leandron left a comment


Thanks @ekalda for the work on this RFC, and to all who joined the discussion to review it.

Given there is some alignment and no new blocking items have been spotted, I'll merge this, and we can tackle any outstanding items in the scope of the tracking issue to be raised. Thanks again!

@leandron leandron merged commit 1eeb3ff into apache:main Jan 23, 2024
@ekalda ekalda deleted the sve-rfc2 branch January 23, 2024 14:37
@ekalda
Contributor Author

ekalda commented Jan 23, 2024

The tracking issue

lhutton1 added a commit to lhutton1/tvm that referenced this pull request May 8, 2024
This commit extends the functionality of the SME dense and matmul
schedules to support operations with fp16 inputs and an fp32 output,
where `transpose_a=False` and `transpose_b=True`.

For convenience, it also adds a utility called `get_vscale_factor`
which created the correct multiplier for `vscale` given a data type,
reflecting ideas from an early design of the
[SVE](apache/tvm-rfcs#104) RFC.

Change-Id: I8c00bc6baf2df6015fa41200a238781126c73589
lhutton1 added a commit to lhutton1/tvm that referenced this pull request May 15, 2024
This commit extends the functionality of the SME dense and matmul
schedules to support operations with fp16 inputs and an fp32 output,
where `transpose_a=False` and `transpose_b=True`.

For convenience, it also adds a utility called `get_vscale_factor`
which created the correct multiplier for `vscale` given a data type,
reflecting ideas from an early design of the
[SVE](apache/tvm-rfcs#104) RFC.

Change-Id: I8c00bc6baf2df6015fa41200a238781126c73589
ekalda pushed a commit to apache/tvm that referenced this pull request May 28, 2024
This commit extends the functionality of the SME dense and matmul
schedules to support operations with fp16 inputs and an fp32 output,
where `transpose_a=False` and `transpose_b=True`.

For convenience, it also adds a utility called `get_vscale_factor`
which created the correct multiplier for `vscale` given a data type,
reflecting ideas from an early design of the
[SVE](apache/tvm-rfcs#104) RFC.
