two: Optimize computation of idx2 further #530

sjakobi · 2025-10-21T10:56:02Z

This reduces the core size for the inner loop of two by 10 terms.

sjakobi · 2025-10-21T11:17:25Z

To my surprise, this actually seems to make fromList (which uses unsafeInsert) slower!

Before:

$ fine-grained --stdev 0.6 -p fromList.Int -p '$NF == "1000" || $NF == "10000"'
All
  HashMap.Strict
    fromList
      Int
        1000:  OK
          37.5 μs ± 226 ns
        10000: OK
          540  μs ± 5.5 μs
  HashSet
    fromList
      Int
        1000:  OK
          37.7 μs ± 206 ns
        10000: OK
          546  μs ± 3.5 μs

After:

$ fine-grained --stdev 0.6 -p fromList.Int -p '$NF == "1000" || $NF == "10000"'
All
  HashMap.Strict
    fromList
      Int
        1000:  OK
          38.3 μs ± 426 ns
        10000: OK
          551  μs ± 4.0 μs
  HashSet
    fromList
      Int
        1000:  OK
          39.1 μs ± 325 ns
        10000: OK
          560  μs ± 5.6 μs

(This is with GHC-9.12.2 on a ThinkPad with an Intel i5-8265U CPU)

I don't understand why, yet.

sjakobi · 2025-10-21T11:30:07Z

Core, STG, Cmm and ASM all look slightly improved. I will run the benchmarks on a different machine too.

treeowl · 2025-10-21T11:40:46Z

Interesting. Could you explain how this works? Also, what's the Core look like? I wonder if the new one made branch prediction less reliable or if it's just alignment nonsense.

sjakobi · 2025-10-22T11:31:19Z

So mask h s is just unsafeShiftL 1 (index h s). Therefore, when i1 > i2, we also have bp1 > bp2 and we can avoid recomputing the index.

idx2 is 0 or 1 depending on the comparison, so we can use the result of the comparison directly instead of branching on it.

Here's the Core diff for two.go

 Rec {
--- RHS size: {terms: 105, types: 165, coercions: 0, joins: 0/2}
+-- RHS size: {terms: 95, types: 165, coercions: 0, joins: 0/2}
 $wgo
   :: forall k v s.
      Int#
@@ -11131,18 +11131,18 @@ $wgo
       (t2 :: HashMap k v)
       (eta [OS=OneShot] :: State# s) ->
       let {
-        bp2 :: Word#
+        bp2# :: Word#
         [LclId]
-        bp2
+        bp2#
           = uncheckedShiftL#
               1## (word2Int# (and# (uncheckedShiftRL# ww2 ww) 31##)) } in
       let {
-        bp1 :: Word#
+        bp1# :: Word#
         [LclId]
-        bp1
+        bp1#
           = uncheckedShiftL#
               1## (word2Int# (and# (uncheckedShiftRL# ww1 ww) 31##)) } in
-      case eqWord# bp1 bp2 of {
+      case eqWord# bp1# bp2# of {
         __DEFAULT ->
           case k1 of conrep { __DEFAULT ->
           case newSmallArray#
@@ -11150,20 +11150,12 @@ $wgo
           of
           { (# ipv, ipv1 #) ->
           case writeSmallArray#
-                 @Lifted
-                 @s
-                 @(HashMap k v)
-                 ipv1
-                 (<#
-                    (word2Int# (and# (uncheckedShiftRL# ww1 ww) 31##))
-                    (word2Int# (and# (uncheckedShiftRL# ww2 ww) 31##)))
-                 t2
-                 ipv
+                 @Lifted @s @(HashMap k v) ipv1 (ltWord# bp1# bp2#) t2 ipv
           of s'
           { __DEFAULT ->
           case unsafeFreezeSmallArray# @Lifted @s @(HashMap k v) ipv1 s' of
           { (# ipv2, ipv3 #) ->
-          (# ipv2, or# bp1 bp2, ipv3 #)
+          (# ipv2, or# bp1# bp2#, ipv3 #)
           }
           }
           }
@@ -11177,7 +11169,7 @@ $wgo
           { (# ipv, ipv1 #) ->
           case unsafeFreezeSmallArray# @Lifted @s @(HashMap k v) ipv1 ipv of
           { (# ipv2, ipv3 #) ->
-          (# ipv2, bp1, ipv3 #)
+          (# ipv2, bp1#, ipv3 #)
           }
           }
           }

I wouldn't expect much of a speedup from this change, since the operations avoided are very cheap, but I didn't expect a slow-down.

two: Optimize computation of idx2 further

8eef567

This reduces the core size for the inner loop of two by 10 terms.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

two: Optimize computation of idx2 further #530

two: Optimize computation of idx2 further #530

Uh oh!

sjakobi commented Oct 21, 2025

Uh oh!

sjakobi commented Oct 21, 2025 •

edited

Loading

Uh oh!

sjakobi commented Oct 21, 2025

Uh oh!

treeowl commented Oct 21, 2025

Uh oh!

sjakobi commented Oct 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

two: Optimize computation of idx2 further #530

Are you sure you want to change the base?

two: Optimize computation of idx2 further #530

Uh oh!

Conversation

sjakobi commented Oct 21, 2025

Uh oh!

sjakobi commented Oct 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sjakobi commented Oct 21, 2025

Uh oh!

treeowl commented Oct 21, 2025

Uh oh!

sjakobi commented Oct 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sjakobi commented Oct 21, 2025 •

edited

Loading