
Commit eec8f6b

committed
rewrite stuff
1 parent 2082681 commit eec8f6b

File tree

1 file changed

+90
-106
lines changed


content/post/multiply.md

Lines changed: 90 additions & 106 deletions
Original file line number | Diff line number | Diff line change
@@ -16,21 +16,16 @@ counter as the output to, say, an XOR instruction. Or an AND instruction.
1616
Or a multiply instruction.
1717

1818
<a name="instructions"></a>
19-
The ARM7TDMI has six different multiply instructions. They are:
20-
- u32 = u32 x u32
21-
- u64 = u32 x u32
22-
- i64 = i32 x i32
23-
- u32 = u32 x u32 + u32
24-
- u64 = u32 x u32 + u64
25-
- i64 = i32 x i32 + i64
19+
Multiplication on the ARM7TDMI has a few neat features. You can multiply two 32-bit operands together to produce a 64-bit result. You can also optionally do a multiply-add, which adds a third 64-bit operand to the 64-bit result within the same instruction. Additionally, you can choose to treat the two 32-bit operands as either signed or unsigned.
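
To make that concrete, here is a minimal C model of the widest form described above (a 64-bit multiply-add, signed or unsigned). The function name `multiply_long_model` and the `is_signed` flag are hypothetical names made up for this sketch. It only models the arithmetic result, not the flags or the timing.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical reference model: 32 x 32 -> 64 multiply with a 64-bit
// accumulate. Pass acc = 0 for a plain multiply.
uint64_t multiply_long_model(uint32_t a, uint32_t b, uint64_t acc,
                             bool is_signed) {
    if (is_signed) {
        // Reinterpret both operands as signed before widening, so the
        // sign bit is extended into the upper 32 bits of the product.
        int64_t product = (int64_t)(int32_t)a * (int64_t)(int32_t)b;
        return (uint64_t)product + acc;
    }
    // Unsigned: widen with zeros, then multiply.
    return (uint64_t)a * (uint64_t)b + acc;
}
```

The same bit pattern gives different 64-bit results depending on signedness: `0xFFFFFFFF * 2` is `0x1FFFFFFFE` when unsigned, but `-2` when signed.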
2620

27-
Why are we talking about these instructions? Well the ARM7TDMI's multiplication instructions have a pretty interesting side effect. Here the manual says that
28-
after a multiplication instruction executes, the carry flag is `UNPREDICTABLE`.
21+
Why are we talking about the multiplication instructions? Well, the ARM7TDMI's multiplication instructions have a pretty interesting side effect. Here the manual says that
22+
after a multiplication instruction executes, the carry flag is set to a "meaningless value".
2923

3024
![An image of the ARM7TDMI manual explaining that the carry and overflow flags are UNPREDICTABLE after a multiply instruction.](../../manual.png)
25+
3126
<small>A short description of carry and overflow flags after a multiplication instruction from the ARM7TDMI manual. <sup>[[1](#cite1)]</sup></small>
3227

33-
As if anything else in this god forsaken CPU was predictable. What this means is that software cannot and
28+
What this means is that software cannot and
3429
should not rely on the value of the carry flag after multiplication executes. It can be set to anything. Any
3530
value. 0, 1, a horse, whatever. This has been a source of memes in the emulator development community for a few years -
3631
people would frequently joke about how the implementation of the carry flag may as well be `cpu.flags.c =
@@ -223,15 +218,23 @@ Using CSAs, the ARM7TDMI can sum up the addends together much faster. <sup>[[4,
223218
224219
# Parallelism
225220
Until now, we've mostly treated "generate the addends" and "add the addends" as two separate, entirely
226-
discrete steps of the algorithm. But, turns out, we can do both of these steps _at the same time_. We
227-
know we can only add 4 addends per cycle, so what if we generate 4 addends per cycle, and compress
228-
them using four CSAs to generate only two addends? So, we pipe 4 CSAs into each other, allowing us to process 6 `N`-bit inputs into two `N + 8` bit outputs. The reason the outputs are of size `N + 8` can be derived from [the equation above](#finaleq) - each addend is shifted left by 2 more than the previous addend.
221+
discrete steps of the algorithm. Can we do them at the same time? Turns out, yes! We can generate some number of addends per cycle, and add them together using CSAs in the same cycle. We repeat this process until we've added up all our addends, and then we can send the results from the CSA to the ALU to be added together.
222+
223+
This is what the ARM7TDMI does - it generates 4 addends per cycle, and compresses
224+
them using four CSAs to generate only two addends.
225+
226+
<center>
227+
228+
![A diagram of the CSA array described above.](../../diagram.png)
229+
230+
</center>
229231
230232
Each cycle, we read 8 bits from the <span style="color:#3a7dc9"> **multiplier**</span>, and with it, we generate 4 addends. We then
231233
feed them into 4 of the 6 inputs of this CSA array, and when we have our 2 results, feed those
232234
2 results back to the very top of the CSA array for the next cycle. On the first cycle of the algorithm, we can initialize those 2 inputs to the
233235
CSA array with `0`s.
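
The loop described above can be sketched in C. This is a deliberately simplified model and not the hardware's algorithm: it assumes plain unsigned 2-bit digits instead of Booth recoding, uses full 64-bit CSAs, and the names `CsaPair`, `csa`, and `multiply_sketch` are hypothetical. What it does show is the shape of the process: 4 addends per cycle, compressed through 4 CSAs, with the two results fed back in as inputs for the next cycle.

```c
#include <stdint.h>

struct CsaPair { uint64_t sum; uint64_t carry; };

// A carry-save adder: three inputs in, two outputs out.
// Invariant: sum + carry == a + b + c (mod 2^64).
static struct CsaPair csa(uint64_t a, uint64_t b, uint64_t c) {
    struct CsaPair out;
    out.sum = a ^ b ^ c;
    out.carry = ((a & b) | (b & c) | (a & c)) << 1;
    return out;
}

// seed is the initial partial sum; pass 0 for a plain multiply.
uint64_t multiply_sketch(uint32_t multiplicand, uint32_t multiplier,
                         uint64_t seed) {
    struct CsaPair partial = { seed, 0 };

    for (int cycle = 0; cycle < 4; cycle++) {
        // Each cycle consumes 8 bits of the multiplier as four 2-bit digits.
        for (int i = 0; i < 4; i++) {
            int shift = cycle * 8 + i * 2;
            uint64_t digit = (multiplier >> shift) & 3;
            uint64_t addend = ((uint64_t)multiplicand * digit) << shift;
            // The previous sum and carry are two of the three CSA inputs.
            partial = csa(partial.sum, partial.carry, addend);
        }
    }

    // The final carry-propagating add, which the ALU would perform.
    return partial.sum + partial.carry;
}
```

Only the very last step needs a real carry-propagating addition. Everything inside the loop is bitwise logic, which is why the CSA array is fast.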
234236
237+
<a name="trick"> </a>
235238
A clever trick can be done here. The ARM7TDMI [supports multiply accumulates](#instructions), which perform multiplication and addition in one instruction. We can implement multiply accumulate by initializing one
236239
of those two inputs with the accumulate value, and get multiply accumulate without extra cycles. This trick is what the
237240
ARM7TDMI employs to do multiply accumulate. (This ends up being a moot point, because the CPU is stupid and can only read two register values per cycle. So, using an accumulate causes the CPU to take
@@ -270,7 +273,7 @@ struct CSAOutput perform_csa_array(u64 partial_sum, u64 partial_carry,
270273
csa_output.carry &= 0x1FFFFFFFFULL;
271274
272275
struct CSAOutput result = perform_csa(csa_output.output,
273-
addends.m[i].recoded_output & 0x1FFFFFFFFULL, csa_output.carry);
276+
addends.m[i].recoded_output & 0x3FFFFFFFFULL, csa_output.carry);
274277
275278
// Inject the carry caused by booth recoding
276279
result.carry <<= 1;
@@ -382,123 +385,104 @@ But that's not all. Remember the carry flag from earlier? With this simple chang
382385

383386
# Mathematical Black Magic
384387

385-
It feels like we are finally making some sort of progress, however my algorithm still failed to calculate the carry flag properly around 15% of the time, and failed way more than that on 64-bit and signed multiplies. It was around this time that I found two patents, that almost _entirely_ explained the algorithm. No idea how these hadn't been found up until this point, but they were quite illuminating. <sup>[[5](#cite5)], [[6](#cite6)]</sup>
386-
387-
After reading the patents, it turns out my implementation of the CSA array is slightly flawed (see [`perform_csa_array`](#perform_csa_array) above). In particular, that function uses CSAs with a width of _64_ bits. That's way too large and wastes space on the chip - the actual hardware gets away with only using _31_.
388+
We have a few remaining issues with our implementation of `perform_csa_array`. Let's discuss them one at a time.
388389

389-
Another difference is that my algorithm has no way yet of supporting long accumulate values. Sure, I can initialize the partial output with the accumulate value, but the partial output is only 32 bits wide.
390+
## Handling 64-bit Accumulates
390391

392+
First of all, we don't know how to handle 64-bit accumulates yet. We know how to handle 32-bit accumulates - just [initialize the partial sum with the value of the accumulator](#trick). We can use a similar trick for 64-bit ones. First, we can initialize the partial sum with the bottom 33 bits of the 64-bit accumulate. Why 33? Wasn't the partial sum 32 bits wide? Well, if we make the partial sum 33 bits wide, we can also handle both unsigned and signed multiplication by zero-extending or sign-extending appropriately. More on this in the next section.
391393

392-
Turns out, the patents describe a way to deal with both of these issues at once, using some mathematical trickery. Pretty much the entire rest of this section is derived from one of the patents <sup>[[5, pp. 14-17](#cite5)]</sup>. This is the hardest part of the algorithm, so hang in there.
394+
We take the remaining 31 bits of the acc and drip-feed them, 2 bits per CSA, like so:
393395

394-
Roughly, on each CSA, we want to add three numbers together to produce two numbers. Let's give these five numbers some names. Define `S` to be a 33-bit value representing the previous CSA's sum (even though the actual sum is 32-bits, adding an extra bit allows us to handle both signed and unsigned multiplication), `C` to be a 33-bit value representing the previous CSA's carry, and `S'` and `C'` to be 33-bit values representing the resulting CSA sum / carry. Finally, define `X` to be a 34-bit value containing the current addend. Then we have:
396+
```c
397+
// Contains the current high 31 bits of the acc.
398+
// This is shifted by 2 after each CSA.
399+
u64 acc_shift_register = 0;
395400

396-
$$
397-
S', C' = S + C + X
398-
$$
401+
struct CSAOutput perform_csa_array(u64 partial_sum, u64 partial_carry,
402+
struct RecodedMultiplicands addends) {
403+
struct CSAOutput csa_output = { partial_sum, partial_carry };
404+
struct CSAOutput final_csa_output = { 0, 0 };
399405

400-
This, mathematically speaking, can be represented as a 65-bit addition. The reason why is that `X` can be left-shifted by as little as 0, and as much as 32.
406+
for (int i = 0; i < 4; i++) {
407+
// ... omitted
401408

402-
Now, if we define `i` to be a number from `[0 - 3]` representing the CSA's position in the CSA array, we can divide the 65 bit CSA addition region into five chunks:
403-
- Lower: A region of size `2i` that represents `final_csa_output`. This region is unaffected by future CSAs, since all future addends are multiplied by at least `2^(2*i)`.
404-
- TransL: The two bits of CSA `#i` that will become Lower bits in CSA `#(i + 1)`.
405-
- Active: The 31-bit region where, including TransL, the actual CSA will be performed. Active itself is 31-bits wide, but with TransL, this is 33-bits.
406-
- TransH: The two bits of CSA `#i` that will become Active bits in CSA `#(i + 1)`
407-
- High: Contains values that have not yet been put into the CSA.
409+
result.output |= (acc_shift_register & 3) << 31;
410+
acc_shift_register >>= 2;
411+
}
408412

409-
Define `A` to be a `65-bit` accumulate (even though the actual accumulate is 64-bits, adding an extra bit allows us to handle both signed and unsigned accumulates). Define `SL` and `CL` to be the analogue of `final_csa_output` in (see [`the code snippet above`](#perform_csa_array) above). Finally, define `XC` to be the carry flag produced by booth recoding. Then, we can model the addition of the 3 operands in the CSA as follows:
413+
final_csa_output.output |= csa_output.output << 8;
414+
final_csa_output.carry |= csa_output.carry << 8;
410415

411-
| Region: | High | TransH | Active | TransL | Lower |
412-
| -- | - | - | - | - | - |
413-
| Size: | 30 - 2i | 2 | 31 | 2 | 2i |
414-
| Operand #1: | 0 | 0 | S[32:2] | S[1:0] | SL[2i:0]
415-
| Operand #2: | C[32], ..., C[32] | C[32] C[32] | C[32:2] | C[1:0] | CL[2i:0]
416-
| Operand #3: | X[33], ..., X[33] | X[33] X[33] | X[32:2] | X[1:0] | 0
417-
| Result Sum: | 0 | S'[32:31] | S'[30:0] | SL[2i+2:2i] | SL[2i:0]
418-
| Result Carry: | C'[32], ..., C'[32] | C'[32:31] | C'[30:0] | CL[2i+2], XC (aka CL[2i+1]) | CL[2i:0]
416+
return final_csa_output;
417+
}
418+
```
419419
420-
Seriously, take time to make sure you understand this table. It represents the CSA that we want to be able to perform.
420+
You can think of this trick conceptually as us initializing all 64 bits of `csa_output.output` to the acc, instead of just the bottom 32 bits. <sup>[[5 p. 14](#cite5)]</sup>
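
As a sanity check on the bookkeeping, here is a small standalone sketch of the drip-feed in isolation. It assumes 16 CSA steps in total (4 CSAs per cycle over 4 cycles) and the hypothetical name `reassemble_acc`. It splits the acc into the low 33 bits and a 31-bit shift register, then hands out 2 bits per step and checks that every bit of the acc ends up where it started.

```c
#include <assert.h>
#include <stdint.h>

// Rebuilds a 64-bit acc from its low 33 bits plus a shift register that
// releases the high 31 bits two at a time, one pair per CSA step.
uint64_t reassemble_acc(uint64_t acc) {
    uint64_t rebuilt = acc & 0x1FFFFFFFFULL;   // bottom 33 bits, bits 32..0
    uint64_t shift_register = acc >> 33;       // top 31 bits, bits 63..33

    for (int step = 0; step < 16; step++) {
        // The next two acc bits land directly above the bits placed so far.
        // On the last step only one bit is left, so nothing overflows.
        rebuilt |= (shift_register & 3) << (33 + 2 * step);
        shift_register >>= 2;
    }

    // After 16 steps all 31 high bits have been consumed.
    assert(shift_register == 0);
    return rebuilt;
}
```

For any input, `reassemble_acc(x) == x`, which is the "initialize all 64 bits" view of the trick stated in code.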
421421
422-
Here's a simple way to implement long accumulates. 33 bits of the `A` will be placed in `S` as initialization. Meanwhile, we can shove the other `31` bits, two bits per CSA, into the high region of addend #1.
422+
## Handling Signed Multiplication
423423
424-
| Region: | High | TransH | Active | TransL | Lower |
425-
| - | - | - | - | - | - |
426-
| Size: | 30 - 2i | 2 | 31 | 2 | 2i |
427-
| Operand #1: |0 | A[2i+35 : 2i+34] | S[32:2] | S[1:0] | SL[2i:0]
428-
| Operand #2: | C[32], ..., C[32] | C[32] C[32] | C[32:2] | C[1:0] | CL[2i:0]
429-
| Operand #3: | X[33], ..., X[33] | X[33] X[33] | X[32:2] | X[1:0] | 0
430-
| Result Sum: | 0 |S'[32:31] | S'[30:0] | SL[2i+2:2i] | SL[2i:0]
431-
| Result Carry: | C'[32], ..., C'[32] | C'[32:31] | C'[30:0] | CL[2i+2], XC (aka CL[2i+1]) | CL[2i:0]
424+
Turns out this algorithm doesn't support signed multiplication yet either. To implement this, we need to take a closer look at the CSA.
432425
433-
We can ignore the Lower column, since the result there is always the same as the addends. We can also ignore the TransL and Active columns, as the operation in those two columns can be implemented using a simple 33-bit CSA (and we already have shown how to do so in [`perform_csa_array`](#perform_csa_array) above). This leaves:
426+
The CSA in its current form takes in three 33-bit inputs and produces two 33-bit outputs. One of these inputs, however, is actually supposed to be *34* bits (ha, lied to you all again). Specifically, `addends.m[i].recoded_output`. The recoded output is derived from a 32-bit <span style="color:#DC6A76"> **multiplicand**</span>, which, when [booth recoded](#finaleq), can be multiplied by at most `2`, giving it a size of 33 bits. However, because we support both signed and unsigned multiplies, this value needs to be 34 bits - the extra bit, as mentioned earlier, allows us to choose to either zero-extend or sign-extend the number to handle both signed and unsigned multiplication elegantly.
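
Here is a small sketch of why 34 bits is enough. It assumes the usual radix-4 Booth digits of -2 to 2, and the helper names `recode_to_34_bits` and `sign_extend_34` are hypothetical. The multiplicand is first widened (zero-extended or sign-extended), multiplied by the digit, and truncated to 34 bits; sign-extending those 34 bits recovers the exact value in every case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MASK_34 0x3FFFFFFFFULL

// Widen the multiplicand according to signedness, scale it by a Booth
// digit, and keep only the low 34 bits.
uint64_t recode_to_34_bits(uint32_t multiplicand, bool is_signed,
                           int booth_digit) {
    assert(booth_digit >= -2 && booth_digit <= 2);
    int64_t extended = is_signed ? (int64_t)(int32_t)multiplicand
                                 : (int64_t)multiplicand;
    return (uint64_t)(extended * booth_digit) & MASK_34;
}

// Interpret a 34-bit pattern as a two's complement number.
int64_t sign_extend_34(uint64_t value) {
    uint64_t sign_bit = 1ULL << 33;
    value &= MASK_34;
    return (int64_t)((value ^ sign_bit) - sign_bit);
}
```

The worst case is an unsigned `0xFFFFFFFF` times `-2`, which is just above `-2^33` and therefore still fits in 34 bits of two's complement.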
434427
435-
| Region: | High | TransH
436-
| - | - | - |
437-
| Size: | 30 - 2i | 2
438-
| Operand #1: |0 | A[2i+35 : 2i+34] | S[32:2] |
439-
| Operand #2: | C[32], ..., C[32] | C[32] C[32] | C[32:2] | C[1:0] | CL[2i:0]
440-
| Operand #3: | X[33], ..., X[33] | X[33] X[33] | X[32:2] | X[1:0] | 0
441-
| Result Sum: | 0 |S'[32:31] | S'[30:0] | SL[2i+2:2i] | SL[2i:0]
442-
| Result Carry: | C'[32], ..., C'[32] | C'[32:31] | C'[30:0] | CL[2i+2], XC (aka CL[2i+1]) | CL[2i:0]
428+
Let's take a look at the other two of the CSA's addends as well. `csa_output.carry`, a 33 bit number, also needs to be properly sign extended. However, `csa_output.output` does _not_ need to be sign extended, since `csa_output.output` is technically already a 65 bit number that was fully initialized with the acc.
443429
444-
We can do some trickery to replace Operand #2 with one row of all ones, and another row with just `!C[N]`. Convince yourself why this is mathematically OK.
430+
Let's summarize the bit widths so far:
431+
- `csa_output.output`: 65
432+
- `csa_output.carry`: 33
433+
- `addends.m[i].recoded_output`: 34
445434
446-
| Region: | High | TransH
447-
| - | - | - |
448-
| Size: | 30 - 2i | 2
449-
| Operand #1: |0 | A[2i+35 : 2i+34] | S[32:2] |
450-
| Operand #2: | 1, ..., 1 | 1 1 | C[32:2] | C[1:0] | CL[2i:0]
451-
| Operand #2.5: | 0 | 0 !C[32] | C[32:2] | C[1:0] | CL[2i:0]
452-
| Operand #3: | X[33], ..., X[33] | X[33] X[33] | X[32:2] | X[1:0] | 0
453-
| Result Sum: | 0 |S'[32:31] | S'[30:0] | SL[2i+2:2i] | SL[2i:0]
454-
| Result Carry: | C'[32], ..., C'[32] | C'[32:31] | C'[30:0] | CL[2i+2], XC (aka CL[2i+1]) | CL[2i:0]
435+
In order to implement signed multiplication, we need to sign-extend all 3 of these numbers to the full 65 bits. How can we do so? Well, `csa_output.output` is already 65 bits, so that one is done for us. What about the other two? For now, I will use the following shortened forms for readability:
436+
- `csa_output.output` will be referred to as `S`
437+
- `csa_output.carry` will be referred to as `C`
438+
- `addends.m[i].recoded_output` will be referred to as `X`
455439
456-
Do the same to Operand #3:
457440
458-
| Region: | High | TransH
459-
| - | - | - |
460-
| Size: | 30 - 2i | 2
461-
| Operand #1: |0 | A[2i+35 : 2i+34] | S[32:2] |
462-
| Operand #2: | 1, ..., 1 | 1 1 | C[32:2] | C[1:0] | CL[2i:0]
463-
| Operand #2.5: | 0 | 0 !C[32] | C[32:2] | C[1:0] | CL[2i:0]
464-
| Operand #3: | 1, ..., 1 | 1 1 | C[32:2] | C[1:0] | CL[2i:0]
465-
| Operand #3.5: | 0 | 0 !X[33] | C[32:2] | C[1:0] | CL[2i:0]
466-
| Result Sum: | 0 |S'[32:31] | S'[30:0] | SL[2i+2:2i] | SL[2i:0]
467-
| Result Carry: | C'[32], ..., C'[32] | C'[32:31] | C'[30:0] | CL[2i+2], XC (aka CL[2i+1]) | CL[2i:0]
468-
469-
Now, Operands #2 and #3 can be added together, being replaced by a new `Operand #4`.
470-
471-
| Region: | High | TransH
472-
| - | - | - |
473-
| Size: | 30 - 2i | 2
474-
| Operand #1: |0 | A[2i+35 : 2i+34] | S[32:2] |
475-
| Operand #2.5: | 0 | 0 !C[32] | C[32:2] | C[1:0] | CL[2i:0]
476-
| Operand #3.5: | 0 | 0 !X[33] | C[32:2] | C[1:0] | CL[2i:0]
477-
| Operand #4: | 1, ..., 1 | 1 0 | C[32:2] | C[1:0] | CL[2i:0]
478-
| Result Sum: | 0 |S'[32:31] | S'[30:0] | SL[2i+2:2i] | SL[2i:0]
479-
| Result Carry: | C'[32], ..., C'[32] | C'[32:31] | C'[30:0] | CL[2i+2], XC (aka CL[2i+1]) | CL[2i:0]
441+
Here's a helpful visualization of these desired 65-bit numbers, after they've been sign extended:
442+
| addend | bits 64-35 | bit 34 | bit 33 | bit 32 | bits 31-0 |
443+
| -- | - | - | - | - | - |
444+
| `csa_output.output` | S[64..35] | S[34] | S[33] | S[32] | S[31..0] |
445+
| `csa_output.carry` | C[32], ..., C[32] | C[32] | C[32] | C[32] | C[31..0] |
446+
| `addends.m[i].recoded_output` | X[33], ..., X[33] | X[33] | X[33] | X[32] | X[31..0] |
480447
481-
Now we can define `S'` as:
482-
`S'[32:31] = A[2i + 34] + !C[32] + !X[33]`
448+
We can do a magic trick here. We can replace the `csa_output.carry` row with a row of ones, and `!C[32]`. Convince yourself that this is mathematically okay:
483449
484-
We can now remove `S'` and the bits used to calculate it. Let's see what's left.
450+
| addend | bits 64-35 | bit 34 | bit 33 | bit 32 | bits 31-0 |
451+
| -- | - | - | - | - | - |
452+
| `csa_output.output` | S[64..35] | S[34] | S[33] | S[32] | S[31..0] |
453+
| `csa_output.carry` | 0, ..., 0 | 0 | !C[32] | C[32] | C[31..0] |
454+
| `magic trick` | 1, ..., 1 | 1 | 1 | 0 | 0 |
455+
| `addends.m[i].recoded_output` | X[33], ..., X[33] | X[33] | X[33] | X[32] | X[31..0] |
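
One way to convince yourself: write $c$ for `C[32]` and treat everything as 65-bit arithmetic, so sums wrap modulo $2^{65}$. The symbol $H$ is shorthand introduced here for "ones in bits 64 down to 33", not notation from the patent.

$$
H = \sum_{k=33}^{64} 2^k = 2^{65} - 2^{33}
$$

The sign extension contributes $c \cdot H$. The replacement rows contribute $(1 - c) \cdot 2^{33} + H$. When $c = 1$ both are simply $H$. When $c = 0$ the replacement is

$$
2^{33} + 2^{65} - 2^{33} = 2^{65} \equiv 0 \pmod{2^{65}}
$$

which matches the all-zero sign extension. So the two forms agree in both cases.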
485456
486-
| Region: | High | TransH
487-
| - | - | - |
488-
| Size: | 30 - 2i | 2
489-
| Operand #1: |0 | A[2i+35] 0 | S[32:2] |
490-
| Operand #4: | 1, ..., 1 | 1 0 | C[32:2] | C[1:0] | CL[2i:0]
491-
| Result Carry: | C'[32], ..., C'[32] | C'[32:31] | C'[30:0] | CL[2i+2], XC (aka CL[2i+1]) | CL[2i:0]
457+
Let's do it again, this time to `X`:
492458
493-
Meaning `C'[32] = !A[2i+35]`.
459+
| addend | bits 64-35 | bit 34 | bit 33 | bit 32 | bits 31-0 |
460+
| -- | - | - | - | - | - |
461+
| `csa_output.output` | S[64..35] | S[34] | S[33] | S[32] | S[31..0] |
462+
| `csa_output.carry` | 0, ..., 0 | 0 | !C[32] | C[32] | C[31..0] |
463+
| `magic trick` | 1, ..., 1 | 1 | 1 | 0 | 0 |
464+
| `addends.m[i].recoded_output` | 0, ..., 0 | 0 | !X[33] | X[32] | X[31..0] |
465+
| `another magic trick` | 1, ..., 1 | 1 | 1 | 0 | 0 |
494466
467+
Now we add the magic tricks together:
495468
469+
| addend | bits 64-35 | bit 34 | bit 33 | bit 32 | bits 31-0 |
470+
| -- | - | - | - | - | - |
471+
| `csa_output.output` | S[64..35] | S[34] | S[33] | S[32] | S[31..0] |
472+
| `csa_output.carry` | 0, ..., 0 | 0 | !C[32] | C[32] | C[31..0] |
473+
| `addends.m[i].recoded_output` | 0, ..., 0 | 0 | !X[33] | X[32] | X[31..0] |
474+
| `combined magic tricks` | 1, ..., 1 | 1 | 0 | 0 | 0 |
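
The whole rewrite can be checked numerically. The sketch below is a model under assumptions: it works modulo 2^64 rather than in the 65 bits used by the tables (the identity is the same, only the top bit is dropped), and `bit_at`, `sum_with_sign_extension`, and `sum_with_magic_trick` are hypothetical names. The first function adds the three rows with honest sign extension; the second adds the rewritten rows plus the combined constant.

```c
#include <stdint.h>

#define BITS_32_TO_0 0x1FFFFFFFFULL        // bits 32..0
#define BITS_33_TO_0 0x3FFFFFFFFULL        // bits 33..0
#define ONES_FROM_34 (~BITS_33_TO_0)       // bits 63..34, bit 33 clear

static uint64_t bit_at(uint64_t value, int n) { return (value >> n) & 1; }

// S + sign-extended 33-bit C + sign-extended 34-bit X.
uint64_t sum_with_sign_extension(uint64_t s, uint64_t c, uint64_t x) {
    uint64_t c_ext = (c & BITS_32_TO_0) | (bit_at(c, 32) ? ~BITS_32_TO_0 : 0);
    uint64_t x_ext = (x & BITS_33_TO_0) | (bit_at(x, 33) ? ~BITS_33_TO_0 : 0);
    return s + c_ext + x_ext;
}

// The same sum using the rows of the final table: !C[32] and !X[33]
// placed at bit 33, plus one constant row of ones starting at bit 34.
uint64_t sum_with_magic_trick(uint64_t s, uint64_t c, uint64_t x) {
    uint64_t c_row = (c & BITS_32_TO_0) + ((uint64_t)!bit_at(c, 32) << 33);
    uint64_t x_row = (x & BITS_32_TO_0) + ((uint64_t)!bit_at(x, 33) << 33);
    return s + c_row + x_row + ONES_FROM_34;
}
```

The two functions return the same value for every input, which is the claim the tables make: no bit above position 33 depends on `C` or `X` anymore.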
496475
497-
And with that, we managed to go from using 64 bits of CSA, to only 33. <sup>[[5 pp. 14-17](#cite5)]</sup> Our final algorithm for the CSAs is as follows:
498476
477+
And we've done it - we removed all the repeated instances of `C[32]` and `X[33]`, using some mathematical black magic. <sup>[[5 pp. 14-17](#cite5)]</sup> The result:
499478
500479
<a name="perform_csa_array2"></a>
480+
501481
```C
482+
// Contains the current high 31 bits of the acc.
483+
// This is shifted by 2 after each CSA.
484+
u64 acc_shift_register = 0;
485+
502486
struct CSAOutput perform_csa_array(u64 partial_sum, u64 partial_carry,
503487
struct RecodedMultiplicands addends[4]) {
504488
struct CSAOutput csa_output = { partial_sum, partial_carry };
@@ -507,7 +491,7 @@ struct CSAOutput perform_csa_array(u64 partial_sum, u64 partial_carry,
507491
for (int i = 0; i < 4; i++) {
508492
csa_output.output &= 0x1FFFFFFFFULL;
509493
csa_output.carry &= 0x1FFFFFFFFULL;
510-
494+
511495
struct CSAOutput result = perform_csa(csa_output.output,
512496
addends.m[i].recoded_output & 0x1FFFFFFFFULL, csa_output.carry);
513497
@@ -528,9 +512,9 @@ struct CSAOutput perform_csa_array(u64 partial_sum, u64 partial_carry,
528512
result.output >>= 2;
529513
result.carry >>= 2;
530514
531-
// Perform the magic described in the tables for the handling of TransH
532-
// and High. acc_shift_register contains the upper 31 bits of the acc
533-
// in its lower bits.
515+
// Perform the magic described in the tables for the sign extension
516+
// of csa_output.carry and the recoded addend. Remember that bits 0-1
517+
        // of the acc_shift_register are bits 33-34 of S.
534518
u64 magic = bit(acc_shift_register, 0) +
535519
!bit(csa_output.carry, 32) + !bit(addends.m[i].recoded_output, 33);
536520
result.output |= magic << 31;
