- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 476
implement SIMD UniformInt #561
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| I don’t know wasm well enough to know why the test is failing | 
| 
 And it worked yesterday 😢. Seems to be rust-lang/rust#52353 | 
| I've got a  The performance seems very hardware dependent | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great to me FWIW!!
        
          
                src/distributions/uniform.rs
              
                Outdated
          
        
      | // The `range == 0` case is complicated with multiple lanes, | ||
| // avoid it. | ||
| assert!(range.eq($unsigned::splat(0)).none(), | ||
| "Uniform::new_inclusive called with the full range"); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One way to deal actually implement this would be to set zone to whatever here for lanes where range==0. Then in sample, change the return statement to:
return range.ne($unsigned::splat(0)).select(self.low + $ty::from(hi), v);It obviously adds some overhead, but so does our handling in the non-simd case.
It's possible that the overhead is bigger here though. In particular I'm not sure if the $unsigned::splat(0) call will get compiled into something fast. It would for non-simd types, but I don't know SIMD instructions well enough to know if that's the case here.
If there's no way to make the splat operation fast, you could also try:
return range.ne(hi).select(self.low + $ty::from(hi), v);        
          
                src/distributions/uniform.rs
              
                Outdated
          
        
      | fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> Self::X { | ||
| let range = $unsigned::from(self.range); | ||
| // TODO: ensure these casts are a no-op | ||
| let zone = $unsigned::from($signed::from(self.zone)); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think simply doing
$unsigned::from($signed::from(self.zone));should work. We need multiple casts in the non-simd case because we're also changing the bit-width, but that's not the case here AFAICT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Err.. I mean
$unsigned::from(self.zone);        
          
                src/distributions/uniform.rs
              
                Outdated
          
        
      | $( | ||
| let v: &[($ty, $ty)] = &[ | ||
| ($ty::splat(0), $ty::splat(10)), | ||
| ($ty::splat(10), $ty::splat(127)), | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you fold this into the existing macro by using FloatSIMDUtils::cast_from_int and FloatSIMDUtils::all_le?
Or rather, replace the existing macro with your new macro, and just add the scalar types to it.
I guess we'd still have to keep the full-range tests separate unless we make the changes above.
| I should also add that my approval is entirely non-official. It just means that I think the code looks good 😃 | 
| Here's the  Currently investigating the  | 
| Added code to handle the full range case. Builds fail because of the stdsimd changes. I can fix that when those changes stabilize. There seems to be no significant change in benchmarks  name                        no_full_range_support ns/iter  full_range_support ns/iter  diff ns/iter   diff %  speedup
+simd::distr_uniform_i16x16  13,230 (2418 MB/s)             12,070 (2651 MB/s)                -1,160   -8.77%   x 1.10 
+simd::distr_uniform_i16x2   8,769 (456 MB/s)               7,062 (566 MB/s)                  -1,707  -19.47%   x 1.24 
+simd::distr_uniform_i16x32  43,309 (1477 MB/s)             35,561 (1799 MB/s)                -7,748  -17.89%   x 1.22 
+simd::distr_uniform_i16x4   7,993 (1000 MB/s)              6,250 (1280 MB/s)                 -1,743  -21.81%   x 1.28 
+simd::distr_uniform_i16x8   3,695 (4330 MB/s)              3,409 (4693 MB/s)                   -286   -7.74%   x 1.08 
-simd::distr_uniform_i32x16  46,434 (1378 MB/s)             54,196 (1180 MB/s)                 7,762   16.72%   x 0.86 
-simd::distr_uniform_i32x2   6,253 (1279 MB/s)              7,861 (1017 MB/s)                  1,608   25.72%   x 0.80 
-simd::distr_uniform_i32x4   7,447 (2148 MB/s)              8,922 (1793 MB/s)                  1,475   19.81%   x 0.83 
+simd::distr_uniform_i32x8   18,559 (1724 MB/s)             17,130 (1868 MB/s)                -1,429   -7.70%   x 1.08 
+simd::distr_uniform_i64x2   6,651 (2405 MB/s)              6,313 (2534 MB/s)                   -338   -5.08%   x 1.05 
-simd::distr_uniform_i64x4   15,708 (2037 MB/s)             18,424 (1736 MB/s)                 2,716   17.29%   x 0.85 
+simd::distr_uniform_i64x8   27,984 (2287 MB/s)             27,929 (2291 MB/s)                   -55   -0.20%   x 1.00 
+simd::distr_uniform_i8x16   15,296 (1046 MB/s)             12,748 (1255 MB/s)                -2,548  -16.66%   x 1.20 
+simd::distr_uniform_i8x2    10,303 (194 MB/s)              8,399 (238 MB/s)                  -1,904  -18.48%   x 1.23 
+simd::distr_uniform_i8x32   27,191 (1176 MB/s)             25,844 (1238 MB/s)                -1,347   -4.95%   x 1.05 
+simd::distr_uniform_i8x4    10,192 (392 MB/s)              10,137 (394 MB/s)                    -55   -0.54%   x 1.01 
+simd::distr_uniform_i8x64   146,547 (436 MB/s)             120,560 (530 MB/s)               -25,987  -17.73%   x 1.22 
-simd::distr_uniform_i8x8    11,385 (702 MB/s)              13,822 (578 MB/s)                  2,437   21.41%   x 0.82  | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! I'll defer to others for the test stuff.
        
          
                src/distributions/uniform.rs
              
                Outdated
          
        
      | let ints_to_reject = (unsigned_max - range + 1) % modulo; | ||
| // `unsigned_max - 0 + 1 % x` => `0 % x` => `0` so | ||
| // the zone for the full range case is `unsigned_max` which | ||
| // means only one sample is needed. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any value is ok since the sample code does (rand * range).low <= zone. But since range is 0 the low part will always be zero. So the test will always pass. But unsigned_max here is as good as any other value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unsigned_max is nice as it allows a faster select. That and "range is 0" should probably be documented over this though.
| let my_uniform = Uniform::new_inclusive(low, high); | ||
| for _ in 0..1000 { | ||
| let v: $ty = rng.sample(my_uniform); | ||
| assert!($le(low, v) && $le(v, high)); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could introduce something like FloatSIMDUtils and FloatAsSIMD::splat, but for integer types. That would allow a lot of this test code to be written more nicely.
But I'll defer to @dhardy and @pitdicker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could, but what's written here is sufficient.
| Updated for  | 
| If we want to include benchmarks, 16-bit, 32-bit, and 64-bit lane widths will likely have very different performance on most hardware. At the very least doing benchmarks for  | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As has already been said, this looks really good! Thanks @TheIronBorn!
There's no sample_single implementation; possibly it could benefit from the same range << range.leading_zeros() optimisation, but at this stage it doesn't seem important.
A couple of little points about arithmetic here — if it's wrapping by default (unlike other Rust numeric types) then there are no issues. (But are there plans to change this in the future?)
        
          
                src/distributions/uniform.rs
              
                Outdated
          
        
      |  | ||
| #[inline] // if the range is constant, this helps LLVM to do the | ||
| // calculations at compile-time. | ||
| // TODO: test this ^ for SIMD | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you planning on testing? It seems fine. Note that #[inline] is needed to allow inlining in external crates, unless using (full) link-time-optimisation, so although the compiler might inline this anyway in internal tests, it wouldn't otherwise from a separate lib/bin.
| "Uniform::new_inclusive called with `low > high`"); | ||
| let unsigned_max = ::core::$u_scalar::MAX; | ||
|  | ||
| let range: $unsigned = ((high - low) + 1).cast(); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is SIMD arithmetic wrapping by default?
| // replacing 0 with `unsigned_max` allows a faster `select` | ||
| // with bitwise OR | ||
| let modulo = not_full_range.select(range, $unsigned::splat(unsigned_max)); | ||
| let ints_to_reject = (unsigned_max - range + 1) % modulo; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If range is 0, this will overflow — or wrap to 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wrap to 0
        
          
                src/distributions/uniform.rs
              
                Outdated
          
        
      | // bulk implementation | ||
| ($(($unsigned:ident, $signed:ident),)+ $u_scalar:ident) => { | ||
| $( | ||
| uniform_simd_int_impl!($unsigned, $unsigned, $signed, $u_scalar); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The $signed parameter appears to never be used.
| let my_uniform = Uniform::new_inclusive(low, high); | ||
| for _ in 0..1000 { | ||
| let v: $ty = rng.sample(my_uniform); | ||
| assert!($le(low, v) && $le(v, high)); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could, but what's written here is sufficient.
| There is no easy SIMD leading zeros. I implemented one on my other branch using a bit twiddling hack but its performance seems too hardware dependent | 
| All  | 
| Could you add a note then where we need to use wrapping? Might help avoid bugs in the future. | 
| Assembly inspection confirms range calculations are done at compile-time | 
| 
 (512-bit vectors will still need the  | 
|  | 
| Turns out a scalar leading zeros implementation is still fast enough. I'll investigate if there's a similar slowdown for  Edit: perhaps only because I was using flat vectors,  | 
| Scalar leading zeros is still faster even when the compiler can't (theoretically) optimize it to a single  I'm currently investigating some of the other methods listed here http://www.pcg-random.org/posts/bounded-rands.html. SIMD division is extremely slow so I expect only the bitmask method will be comparable. | 
| One optimization that might be worthwhile here is to check for small ranges. I.e. when the range can fit in 16 bits, don't use the full 32 and a widening multiply, but just a mask and normal multiply. I imagine this to be a advantage in the common case, because ranges are rarely the size an integer can hold. | 
| 
 Edit: oh are you suggesting switching from  | 
| That gives a large speedup in an ideal situation (most likely though because it needs only half as many random bits). A proper implementation would be difficult and likely slow because some lanes may still have a large range. We could apply it only when all the lanes are small. (less speedup with fewer lanes, lower bit requirement is trumped by overhead)  | 
| It might be a nice optimization for shuffles or other cases where we know more about the range. | 
| The bitmask method seems better  name                        wide_mul ns/iter    bitmask_simple ns/iter  diff ns/iter   diff %  speedup 
-simd_uniform_i16x16         13,282 (2467 MB/s)  42,491 (771 MB/s)             29,209  219.91%   x 0.31 
-simd_uniform_i16x2          4,446 (921 MB/s)    18,856 (217 MB/s)             14,410  324.11%   x 0.24 
-simd_uniform_i16x32         38,002 (1724 MB/s)  87,600 (748 MB/s)             49,598  130.51%   x 0.43 
-simd_uniform_i16x4          4,656 (1759 MB/s)   20,315 (403 MB/s)             15,659  336.32%   x 0.23 
-simd_uniform_i16x8          6,408 (2556 MB/s)   24,405 (671 MB/s)             17,997  280.85%   x 0.26 
+simd_uniform_i32x16         50,561 (1296 MB/s)  26,833 (2442 MB/s)           -23,728  -46.93%   x 1.88 
-simd_uniform_i32x2          5,257 (1558 MB/s)   5,411 (1513 MB/s)                154    2.93%   x 0.97 
+simd_uniform_i32x4          11,600 (1412 MB/s)  8,789 (1864 MB/s)             -2,811  -24.23%   x 1.32 
+simd_uniform_i32x8          19,774 (1657 MB/s)  14,907 (2198 MB/s)            -4,867  -24.61%   x 1.33 
+simd_uniform_i64x2          9,799 (1672 MB/s)   8,965 (1827 MB/s)               -834   -8.51%   x 1.09 
+simd_uniform_i64x4          16,291 (2011 MB/s)  14,064 (2329 MB/s)            -2,227  -13.67%   x 1.16 
+simd_uniform_i64x8          30,050 (2180 MB/s)  28,011 (2339 MB/s)            -2,039   -6.79%   x 1.07 
+simd_uniform_i8x16          16,832 (973 MB/s)   12,637 (1296 MB/s)            -4,195  -24.92%   x 1.33 
+simd_uniform_i8x2           5,822 (351 MB/s)    5,363 (381 MB/s)                -459   -7.88%   x 1.09 
+simd_uniform_i8x32          24,748 (1324 MB/s)  17,520 (1870 MB/s)            -7,228  -29.21%   x 1.41 
+simd_uniform_i8x4           7,536 (543 MB/s)    6,640 (616 MB/s)                -896  -11.89%   x 1.13 
+simd_uniform_i8x64          147,202 (445 MB/s)  109,336 (599 MB/s)           -37,866  -25.72%   x 1.35 
+simd_uniform_i8x8           9,910 (826 MB/s)    9,794 (836 MB/s)                -116   -1.17%   x 1.01 
+simd_uniform_single_i16x16  82,276 (398 MB/s)   63,487 (516 MB/s)            -18,789  -22.84%   x 1.30 
+simd_uniform_single_i16x2   22,122 (185 MB/s)   16,473 (248 MB/s)             -5,649  -25.54%   x 1.34 
-simd_uniform_single_i16x32  175,026 (374 MB/s)  404,042 (162 MB/s)           229,016  130.85%   x 0.43 
+simd_uniform_single_i16x4   35,129 (233 MB/s)   21,397 (382 MB/s)            -13,732  -39.09%   x 1.64 
+simd_uniform_single_i16x8   27,776 (589 MB/s)   26,615 (615 MB/s)             -1,161   -4.18%   x 1.04 
+simd_uniform_single_i32x16  81,288 (806 MB/s)   53,903 (1215 MB/s)           -27,385  -33.69%   x 1.51 
+simd_uniform_single_i32x2   9,129 (897 MB/s)    8,398 (975 MB/s)                -731   -8.01%   x 1.09 
+simd_uniform_single_i32x4   17,449 (938 MB/s)   14,151 (1157 MB/s)            -3,298  -18.90%   x 1.23 
+simd_uniform_single_i32x8   40,181 (815 MB/s)   32,625 (1004 MB/s)            -7,556  -18.80%   x 1.23 
+simd_uniform_single_i64x2   24,001 (682 MB/s)   11,544 (1419 MB/s)           -12,457  -51.90%   x 2.08 
+simd_uniform_single_i64x4   40,520 (808 MB/s)   23,478 (1395 MB/s)           -17,042  -42.06%   x 1.73 
+simd_uniform_single_i64x8   85,092 (770 MB/s)   48,127 (1361 MB/s)           -36,965  -43.44%   x 1.77 
+simd_uniform_single_i8x16   66,271 (247 MB/s)   54,895 (298 MB/s)            -11,376  -17.17%   x 1.21 
+simd_uniform_single_i8x2    23,645 (86 MB/s)    18,929 (108 MB/s)             -4,716  -19.95%   x 1.25 
-simd_uniform_single_i8x32   139,309 (235 MB/s)  348,528 (94 MB/s)            209,219  150.18%   x 0.40 
+simd_uniform_single_i8x4    28,646 (142 MB/s)   23,976 (170 MB/s)             -4,670  -16.30%   x 1.19 
-simd_uniform_single_i8x64   967,238 (67 MB/s)   986,099 (66 MB/s)             18,861    1.95%   x 0.98 
+simd_uniform_single_i8x8    29,513 (277 MB/s)   27,354 (299 MB/s)             -2,159   -7.32%   x 1.08 benchmarks here https://github.com/TheIronBorn/rand/blob/simd/benches/simd_uniform_int.rs. (The single modulo and double division methods are too slow) The bitmask method slows down when the range is far from the next power of two, as is happening with the  Using a better range:  name                 wide_mul ns/iter    bitmask_simple ns/iter  diff ns/iter   diff %  speedup 
+simd_uniform_i16x16  17,330 (1890 MB/s)  13,225 (2477 MB/s)            -4,105  -23.69%   x 1.31 
+simd_uniform_i16x2   7,075 (578 MB/s)    5,299 (772 MB/s)              -1,776  -25.10%   x 1.34 
+simd_uniform_i16x32  51,725 (1267 MB/s)  29,134 (2249 MB/s)           -22,591  -43.68%   x 1.78 
+simd_uniform_i16x4   6,961 (1176 MB/s)   5,676 (1443 MB/s)             -1,285  -18.46%   x 1.23 
-simd_uniform_i16x8   9,380 (1746 MB/s)   11,650 (1406 MB/s)             2,270   24.20%   x 0.81 There is an more complex bitmask method which tries to use the remaining bits of the random integer upon rejection. If it could be optimized better, it may be a further improvement. | 
| The biased widening multiply method is much faster so it might be worth providing an extra, faster version. | 
| @TheIronBorn let me know when you think this is ready for merging — I'm not sure if you're still tweaking. | 
| If we provide an API which allows  | 
| Still tweaking but I think either the widening multiply method or the bitmask method would work well. The bitmask method seems best for  | 
| I think this is fine to merge now,  | 
| When the range is flat (i.e.  If you think a varying range could be used often, do speak up. | 
| @TheIronBorn could you squash some of the commits together please (i.e. eliminate most of the fix/refactor commits)? | 
| Squashed | 
Not sure what we should do about benchmarks. We really ought to have some benchmarks for
UniformIntbut there are 36 different vectors we could benchmark.