content/post/multiply.md: 22 additions & 14 deletions
@@ -220,7 +220,7 @@ Using CSAs, the ARM7TDMI can sum up the addends together much faster. <sup>[[4,
220
220
Until now, we've mostly treated "generate the addends" and "add the addends" as two separate, entirely
221
221
discrete steps of the algorithm. Can we do them at the same time? Turns out, yes! We can generate some number of addends per cycle, and add them together using CSAs in the same cycle. We repeat this process until we've added up all our addends, and then we can send the results from the CSA to the ALU to be added together.
222
222
223
-
This is what the ARM7TDMI does - it generates 4 addends per cycle, and compresses
223
+
This is what the ARM7TDMI does - it generates four addends per cycle, and compresses
224
224
them using four CSAs to generate only two addends.
225
225
226
226
<center>
@@ -230,7 +230,7 @@ them using four CSAs to generates only two addends.
230
230
</center>
231
231
232
232
Each cycle, we read 8 bits from the <span style="color:#3a7dc9"> **multiplier**</span>, and with it, we generate 4 addends. We then
233
-
feed them into 4 of the 6 outputs of this CSA array, and when we have our 2 results, feed those
233
+
feed them into 4 of the 6 inputs of this CSA array, and when we have our 2 results, feed those
234
234
2 results back to the very top of the CSA array for the next cycle. On the first cycle of the algorithm, we can initialize those 2 inputs to the
235
235
CSA array with `0`s.
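To make the feedback loop a bit more concrete, here's a minimal sketch of one cycle of such a CSA array. It is not the code from this post: `csa_result_t`, `csa`, and `csa_array_cycle` are hypothetical names, everything is simplified to plain 32-bit values, and the 2-bit shifting between stages that the real hardware performs is left out. It only shows the shape of the idea: 2 fed-back values plus 4 new addends go in, 2 values come out, and the total sum is preserved.

```c
#include <stdint.h>

// Result of one carry-save adder (CSA): two values whose sum
// equals the sum of the three inputs (modulo 2^32).
typedef struct {
    uint32_t output; // per-bit sum, with no carry propagation
    uint32_t carry;  // per-bit carries, already shifted into position
} csa_result_t;

// Hypothetical helper: compress three addends into two.
static csa_result_t csa(uint32_t a, uint32_t b, uint32_t c) {
    csa_result_t r;
    r.output = a ^ b ^ c;                          // sum bit of each column
    r.carry  = ((a & b) | (b & c) | (a & c)) << 1; // carry bit of each column
    return r;
}

// Hypothetical helper: one cycle of a chain of four CSAs.
// `prev` holds the 2 results fed back from the previous cycle
// (both 0 on the first cycle); `addends` holds the 4 new addends.
static csa_result_t csa_array_cycle(csa_result_t prev, const uint32_t addends[4]) {
    csa_result_t r = prev;
    for (int i = 0; i < 4; i++) {
        // Each CSA takes the running output/carry pair plus one new addend.
        r = csa(r.output, r.carry, addends[i]);
    }
    return r;
}
```

Adding `output` and `carry` of the final result in a regular adder (the ALU, in the ARM7TDMI's case) gives the same value as adding all six inputs directly.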
236
236
@@ -389,7 +389,7 @@ We have a few remaining issues with our implementation of `perform_csa_array`, l
389
389
390
390
## Handling 64-bit Accumulates
391
391
392
-
First of all, we don't know how to handle 64-bit accumulates yet. We know how to handle 32-bit accumulates - just [initialize the partial sum with the value of the accumulator](#trick). We can use a similar trick for 64-bit ones. First, we can initialize the partial sum with the bottom 33 bits of the 64 bit accumulate. Why 33? I thought the partial sum was 32 bits wide? Well, if we make the width of the partial sum 33 bits, we'd also be able to handle unsigned and signed multiplication by zero / sign extending appropriately. More on this in the next section.
392
+
First of all, we don't know how to handle 64-bit accumulates yet. We know how to handle 32-bit accumulates - just [initialize the partial sum with the value of the accumulator](#trick). We can use a similar trick for 64-bit ones. First, we can initialize the partial sum with the bottom 33 bits of the 64 bit accumulate. Why 33? I thought the partial sum was 32 bits wide? Well, if we make the width of the partial sum 33 bits, we'd also be able to handle unsigned and signed multiplication by zero / sign extending appropriately. This way, our algorithm itself only needs to be able to perform signed multiplication, and our choice of zero-extension or sign-extension at initialization will handle the rest. More on this in the next section.
393
393
394
394
We take the remaining 31 bits of the acc and drip-feed them, 2 bits per CSA, like so:
And we've done it - we removed all the repeated instances of `C[32]` and `X[33]`, using some mathematical black magic. <sup>[[5 pp. 14-17](#cite5)]</sup> The result:
479
+
And we've done it - we removed all the repeated instances of `C[32]` and `X[33]`, using some mathematical black magic. <sup>[[5 pp. 14-17](#cite5)]</sup> This means that all we need to do to handle sign extension is the following two operations:
Note that <span style="color:#3a7dc9"> **multiplier**</span> is a signed 33-bit number. Now here's the main issue with early termination. After every cycle of booth's algorithm, a total of 41 bits of result are produced. To convince yourself of this, look at the final two lines of [`perform_csa_array`](#perform_csa_array2). The bottom eight bits contain the result of each CSA, and the upper 33 bits above those 8 contain `csa_output`. After every cycle of booth's algorithm, the bottom eight bits are fed into a result register, since the _next_ cycle of booth's algorithm cannot change the value of those bottom eight bits. The upper 33 bits become the next input into the next cycle of booth's algorithm. Something like this:
555
+
Note that <span style="color:#3a7dc9"> **multiplier**</span> is a signed 33-bit number. After every cycle of Booth's algorithm, the bottom eight bits are fed into a result register, since the _next_ cycle of Booth's algorithm cannot change the value of those bottom eight bits. The remaining upper bits become the input to the next cycle of Booth's algorithm. Something like this:
549
556
550
557
551
558
```c
@@ -581,23 +588,23 @@ do {
581
588
partial_sum = u128_ror(partial_sum, 8);
582
589
partial_carry = u128_ror(partial_carry, 8);
583
590
584
-
// ASR = Arithmetic Shift Right for 33-bit numbers
591
+
// ASR == Arithmetic Shift Right for 33-bit numbers
585
592
multiplier = asr_33(multiplier, 8);
586
593
} while (!should_terminate(multiplier, flavor));
587
594
588
595
partial_sum.lo |= csa_output.output;
589
596
partial_carry.lo |= csa_output.carry;
590
597
```
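The loop above leans on `asr_33`, an arithmetic shift right that treats bit 32 as the sign bit. Here's one possible way to write such a helper. This is a sketch under my own assumptions, not necessarily how the full code does it: I'm assuming the 33-bit value lives in the low 33 bits of a `uint64_t`, that the shift amount is between 0 and 32, and `MASK_33` is a name I made up for illustration.

```c
#include <stdint.h>

// Assumed representation: a 33-bit value stored in the low 33 bits
// of a uint64_t, with bit 32 acting as the sign bit.
#define MASK_33 ((UINT64_C(1) << 33) - 1)

// Arithmetic shift right for 33-bit numbers (assumes 0 <= shift <= 32).
static uint64_t asr_33(uint64_t value, unsigned shift) {
    value &= MASK_33;                  // ignore anything above bit 32
    uint64_t shifted = value >> shift; // logical shift first

    if (value & (UINT64_C(1) << 32)) {
        // Negative input: fill the vacated top `shift` bits with ones
        // so the sign is preserved.
        shifted |= MASK_33 & ~(MASK_33 >> shift);
    }
    return shifted;
}
```

For example, shifting the most negative 33-bit value (only bit 32 set) right by 8 leaves bits 24 through 32 set, which is that value divided by 256.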
591
598
592
-
Since `partial_sum` and `partial_carry` are shift registers that get rotated with each iteration of booths algorithm, we need to rotate them again after the algorithm ends in order to correct them to their proper values. Thankfully, the ARM7TDMI has something called a barrel shifter. The barrel shifter is a nifty piece of hardware that allows the CPU to perform an arbitrary shift/rotate before an ALU operation, all in one cycle. Since we plan to add `partial_sum` and `partial_carry` in the ALU, we may as well use the barrel shifter to rotate one of those two operands, with no additional cost.
599
+
Since `partial_sum` and `partial_carry` are shift registers that get rotated with each iteration of Booth's algorithm, we need to rotate them again after the algorithm ends in order to correct them to their proper values. Thankfully, the ARM7TDMI has something called a barrel shifter. The barrel shifter is a nifty piece of hardware that allows the CPU to perform an arbitrary shift/rotate before an ALU operation, all in one cycle. Since we plan to add `partial_sum` and `partial_carry` in the ALU, we may as well use the barrel shifter to rotate one of those two operands, with no additional cost. The other operand ends up requiring special hardware to rotate, since the barrel shifter only operates on one value per cycle.
593
600
594
-
For long (64-bit) multiplies, two right rotations (known on the CPU as RORs) occur, since the ALU can only add 32-bits at a time and so must be used twice.
601
+
For long (64-bit) multiplies, two right rotations (known on the CPU as RORs) occur, since the ALU can only add 32 bits at a time and so the ALU / barrel shifter must be used twice.
595
602
596
603
Spoiler alert, the value of the carry flag after a multiply instruction comes from the carryout of this barrel shifter.
597
604
598
605
So, what rotation values does the ARM7TDMI use? According to one of the patents, for an unsigned multiply, all (1 for 32-bit multiplies or 2 for 64-bit ones) uses of the barrel shifter do this: <sup>[[6, p. 9](#cite6)]</sup>
599
606
600
-
| # Iterations | Type | Rotation |
607
+
| # Iterations of Booths | Type | Rotation |
601
608
| - | - | - |
602
609
| 1 |ROR|22 |
603
610
| 2 |ROR|14 |
@@ -606,24 +613,24 @@ So, what rotation values does the ARM7TDMI use? According to one of the patents,
606
613
607
614
Signed multiplies differ from unsigned multiplies in their **second** barrel shift. The second one for signed multiplies uses Arithmetic Shift Rights (ASRs) and looks like this: <sup>[[6, p. 9](#cite6)]</sup>
608
615
609
-
| # Iterations | Type | Rotation |
616
+
| # Iterations of Booths | Type | Rotation |
610
617
| - | - | - |
611
618
| 1 |ASR|22 |
612
619
| 2 |ASR|14 |
613
620
| 3 |ASR|6 |
614
621
| 4 |ROR|30 |
615
622
616
-
I'm not going to lie, I couldn't make sense of these rotation values. At all. Maybe they were wrong, since they patents already had a couple major errors at this point. No idea. Turns out it doesn't _really_ matter for calculating the carry flag of a multiply instruction. Why? Well, observe what hapens when the ARM7TDMI does a `ROR` or `ASR`:
623
+
I'm not going to lie, I couldn't make sense of these rotation values. At all. Maybe they were wrong, since the patents already had a couple of major errors at this point. No idea. Turns out it doesn't _really_ matter for calculating the carry flag of a multiply instruction. Why? Well, observe what happens when the ARM7TDMI does a `ROR` or `ASR`:
617
624
618
625
Code from fleroviux's wonderful NanoBoyAdvance. <sup>[[7]](#cite7)</sup>
Note that in both ROR and ASR the carry will always be set to the last bit of the `operand` to be shifted out. i.e., if I rotate a value by `n`, then the carry will always be bit `n - 1` of the `operand`, since that was the last bit to be rotated out. Same goes for ASR.
672
+
Note that in both ROR and ASR the carry will always be set to the last bit of the `operand` to be shifted out. i.e., if I rotate a value by `n`, then the carry will always be bit `n - 1` of the `operand` before rotation, since that was the last bit to be rotated out. Same goes for ASR.
665
673
666
674
So, _it doesn't matter_ if I don't use the same rotation values as the patents. No matter the rotation value, as long as the output from _my_ barrel shifter is the same as the output from the _ARM7TDMI's_ barrel shifter, the last bit to be shifted out must be the same, and therefore the carry flag must _also_ be the same.
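The "carry is bit `n - 1` of the operand" observation can be sketched in a few lines. This is a stripped-down illustration, not NanoBoyAdvance's code and not the hardware's exact behavior: `ror_with_carry` and `asr_with_carry` are hypothetical names, and I'm assuming shift amounts between 1 and 31 so the ARM special cases for 0 and 32+ don't come up.

```c
#include <stdbool.h>
#include <stdint.h>

// Rotate right by n (assumes 1 <= n <= 31).
// The carry-out is the last bit rotated out: bit n - 1 of the operand.
static uint32_t ror_with_carry(uint32_t operand, unsigned n, bool *carry_out) {
    *carry_out = (operand >> (n - 1)) & 1;
    return (operand >> n) | (operand << (32 - n));
}

// Arithmetic shift right by n (assumes 1 <= n <= 31).
// Same carry rule: the last bit shifted out is bit n - 1 of the operand.
static uint32_t asr_with_carry(uint32_t operand, unsigned n, bool *carry_out) {
    *carry_out = (operand >> (n - 1)) & 1;
    uint32_t shifted = operand >> n;
    if (operand & 0x80000000u) {
        // Negative input: fill the vacated top n bits with ones.
        shifted |= ~(0xFFFFFFFFu >> n);
    }
    return shifted;
}
```

With an operand that has only bit 21 set, a `ROR` by 22 reports a carry of 1, since bit 21 is the last bit to fall off the bottom.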
667
675
@@ -754,7 +762,7 @@ if (is_long(flavor)) {
754
762
755
763
Anyway, that's basically it. What a meme. If you're interested in the full code, take a look [here](https://github.com/zaydlang/multiplication-algorithm/tree/master).
756
764
757
-
# Works Cited
765
+
# References
758
766
759
767
<a name="cite1"></a>
760
768
[1] “Advanced RISC Machines ARM ARM 7TDMI Data Sheet,” 1995. Accessed: Oct. 21, 2024. [Online]. Available: https://www.dwedit.org/files/ARM7TDMI.pdf