Minor performance regression going to dotnet core 3.1 from .NET 4.8 #31613

@mrange

Hi.

I saw a performance regression going from .NET 4.8 to dotnet core 3.1. It's small, so in practice it might not hurt most users, but I thought it better to create an issue than to keep mum.

I noticed it while discussing my other issue, #2191, so the code is similar. I don't think this one is tail call related, but I don't know for sure.

The regression shows up when setting up a simple push stream pipeline:

// Minimalistic PushStream
//  A PushStream accepts a receiver function that will be called
//  with each value in the PushStream
type 'T PushStream = ('T -> unit) -> unit

module PushStream =
  let inline zero      ()       = LanguagePrimitives.GenericZero
  let inline push      r v      = r v

  // Creates a PushStream with all integers from b to e (inclusive)
  let inline fromRange b e    r = for i = b to e do push r i
  // Maps all values in ps using mapping function f
  let inline map       f   ps r = ps (fun v -> push r (f v))
  // Filters all values in ps using filter function f
  let inline filter    f   ps r = ps (fun v -> if f v then push r v)
  // Sums all values in ps
  let inline sum           ps   = let mutable s = zero () in ps (fun v -> s <- s + v); s

[<DisassemblyDiagnoser>]
type Benchmarks () =
  [<Params (10000, 100)>] 
  member val public Count = 100 with get, set

  [<Benchmark>]
  member x.SimplePushStreamTest () =
    PushStream.fromRange  0 x.Count
    |> PushStream.map     int64
    |> PushStream.filter  (fun v -> (v &&& 1L) = 0L)
    |> PushStream.map     ((+) 1L)
    |> PushStream.sum
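
To make it easier to relate the pipeline to the jitted code further down, here is a hand-expanded sketch of the same computation, with one receiver closure per stage. This is an illustration, not the compiler's actual output: the names `simplePushStream`, `sumReceiver`, `addOneReceiver`, `filterReceiver` and `toInt64Receiver` are made up for this example, and the explicit `ref` cell is an assumption about how the captured accumulator behaves.

```fsharp
// Hand-expanded sketch of the benchmark pipeline (hypothetical names, not compiler output).
// Each stage is a receiver closure that forwards its value to the next stage.
let simplePushStream (count : int) : int64 =
  // Explicit ref cell standing in for the mutable accumulator captured in PushStream.sum
  let s = ref 0L
  // PushStream.sum: accumulate every value that arrives
  let sumReceiver (v : int64) = s.Value <- s.Value + v
  // PushStream.map ((+) 1L): add one, then forward
  let addOneReceiver (v : int64) = sumReceiver (v + 1L)
  // PushStream.filter: forward even values only
  let filterReceiver (v : int64) = if (v &&& 1L) = 0L then addOneReceiver v
  // PushStream.map int64: widen, then forward
  let toInt64Receiver (v : int) = filterReceiver (int64 v)
  // PushStream.fromRange 0 count: push every integer in the inclusive range
  for i = 0 to count do toInt64Receiver i
  s.Value

printfn "%d" (simplePushStream 100) // prints 2601
```

For `Count = 100` the range 0..100 contains 51 even numbers whose sum is 2550; adding 1 to each gives 2601.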

BenchmarkDotNet reports:

$ dotnet run -c Release -f netcoreapp3.1 --filter '*' --runtimes net48 netcoreapp3.1
...
|               Method |       Runtime |     Toolchain | Count |        Mean |     Error |    StdDev | Ratio | RatioSD | Code Size |
|--------------------- |-------------- |-------------- |------ |------------:|----------:|----------:|------:|--------:|----------:|
| SimplePushStreamTest |      .NET 4.8 |         net48 |   100 |    400.6 ns |   3.92 ns |   3.67 ns |  1.00 |    0.00 |     272 B |
| SimplePushStreamTest | .NET Core 3.1 | netcoreapp3.1 |   100 |    439.3 ns |   4.35 ns |   4.07 ns |  1.10 |    0.02 |     273 B |
|                      |               |               |       |             |           |           |       |         |           |
| SimplePushStreamTest |      .NET 4.8 |         net48 | 10000 | 33,542.5 ns | 143.25 ns | 133.99 ns |  1.00 |    0.00 |     272 B |
| SimplePushStreamTest | .NET Core 3.1 | netcoreapp3.1 | 10000 | 39,449.8 ns | 259.08 ns | 242.35 ns |  1.18 |    0.01 |     273 B |

dotnet core 3.1 takes about 10% longer than .NET 4.8 for Count = 100 and about 18% longer for Count = 10000.
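
The Ratio column can be reproduced from the Mean column. A minimal sketch, with the mean values copied from the table and a helper name (`ratio`) invented for this example:

```fsharp
// Mean values in ns, copied from the BenchmarkDotNet table above.
// ratio is a hypothetical helper: .NET Core 3.1 mean divided by .NET 4.8 mean.
let ratio (core31 : float) (net48 : float) = core31 / net48

printfn "Count=100:   %.2f" (ratio 439.3 400.6)      // prints 1.10
printfn "Count=10000: %.2f" (ratio 39449.8 33542.5)  // prints 1.18
```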

I dug a bit into the jitted assembly and found the following differences:

--- dotnetcore.asm
+++ net48.asm
@@ -1,4 +1,4 @@
-; dotnet core 3.1
+; .net v48
 
 ; PushStream.fromRange  0 x.Count
 LOOP:
@@ -12,7 +12,6 @@
 jne     LOOP
 
 ; PushStream.map     int64
-nop     dword ptr [rax+rax]
 mov     rcx,qword ptr [rcx+8]
 movsxd  rdx,edx
 mov     rax,qword ptr [rcx]
@@ -21,8 +20,7 @@
 jmp     rax
 
 ; PushStream.filter  (fun v -> (v &&& 1L) = 0L)
-nop     dword ptr [rax+rax]
-mov     eax,edx
+mov     rax,rdx
 test    al,1
 jne     BAILOUT
 mov     rcx,qword ptr [rcx+8]
@@ -35,7 +33,6 @@
 ret
 
 ; PushStream.map     ((+) 1L)
-nop     dword ptr [rax+rax]
 mov     rcx,qword ptr [rcx+8]
 inc     rdx
 mov     rax,qword ptr [rcx]
@@ -44,11 +41,9 @@
 jmp     rax
 
 ; PushStream.sum
-nop     dword ptr [rax+rax]
 mov     rax,qword ptr [rcx+8]
 mov     rcx,rax
 add     rdx,qword ptr [rax+8]
 mov     qword ptr [rcx+8],rdx
 xor     eax,eax
 ret
-

It seems that in dotnet core there's an extra nop at the start of each method. I suspected tiered compilation, but after much messing about trying to disable it, either it's unrelated or I didn't manage to disable it.
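
One way to check whether a tiered compilation setting actually reached the process that runs the benchmark is to print what that process sees. The sketch below is an assumption-laden illustration rather than something taken from the repository: `tieredCompilationSettings` is a made-up helper, and it only reads the two documented knobs, the `COMPlus_TieredCompilation` environment variable and the `System.Runtime.TieredCompilation` runtimeconfig property. It does not prove which tier the JIT used.

```fsharp
open System

// Hypothetical helper: lists the tiered compilation knobs visible to the current process.
// It only reports configuration values; it cannot tell which JIT tier produced the code.
let tieredCompilationSettings () =
  let describe (v : obj) = if isNull v then "<not set>" else string v
  [ "COMPlus_TieredCompilation (environment)",
      describe (box (Environment.GetEnvironmentVariable "COMPlus_TieredCompilation"))
    "System.Runtime.TieredCompilation (runtimeconfig)",
      describe (AppContext.GetData "System.Runtime.TieredCompilation") ]

for (name, value) in tieredCompilationSettings () do
  printfn "%s = %s" name value
```

Since BenchmarkDotNet normally runs each job in a separately generated process, calling something like this from inside the benchmark class (for example from a setup method) is likely more telling than calling it from `main`.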

It surprises me that the nop adds this much overhead but I can't spot anything else of significance.

The code is here: https://github.com/mrange/TryNewDisassembler/tree/fsharpPerformanceRegression

And here:

module PerformanceRegression =
  open System
  open System.Linq
  open System.Diagnostics

  // Minimalistic PushStream
  //  A PushStream accepts a receiver function that will be called
  //  with each value in the PushStream
  type 'T PushStream = ('T -> unit) -> unit

  module PushStream =
    let inline zero      ()       = LanguagePrimitives.GenericZero
    let inline push      r v      = r v

    // Creates a PushStream with all integers from b to e (inclusive)
    let inline fromRange b e    r = for i = b to e do push r i
    // Maps all values in ps using mapping function f
    let inline map       f   ps r = ps (fun v -> push r (f v))
    // Filters all values in ps using filter function f
    let inline filter    f   ps r = ps (fun v -> if f v then push r v)
    // Sums all values in ps
    let inline sum           ps   = let mutable s = zero () in ps (fun v -> s <- s + v); s

  module Tests =
    open BenchmarkDotNet.Attributes
    open BenchmarkDotNet.Configs
    open BenchmarkDotNet.Jobs
    open BenchmarkDotNet.Horology
    open BenchmarkDotNet.Running
    open BenchmarkDotNet.Diagnostics.Windows.Configs

    [<DisassemblyDiagnoser>]
    type Benchmarks () =
      [<Params (10000, 100)>] 
      member val public Count = 100 with get, set

      [<Benchmark>]
      member x.SimplePushStreamTest () =
        PushStream.fromRange  0 x.Count
        |> PushStream.map     int64
        |> PushStream.filter  (fun v -> (v &&& 1L) = 0L)
        |> PushStream.map     ((+) 1L)
        |> PushStream.sum

    let run argv = 
      let job = Job.Default
                    .WithWarmupCount(30)
                    .WithIterationTime(TimeInterval.FromMilliseconds(250.0)) // the default is 0.5s per iteration, which is slightly too much for us
                    .WithMinIterationCount(15)
                    .WithMaxIterationCount(20)
                    .AsDefault()
      let config = DefaultConfig.Instance.AddJob(job)
      let b = BenchmarkSwitcher [|typeof<Benchmarks>|]
      let summary = b.Run(argv, config)
      printfn "%A" summary

// Run with: dotnet run -c Release -f netcoreapp3.1 --filter '*' --runtimes net48 netcoreapp3.1
[<EntryPoint>]
let main argv =
  PerformanceRegression.Tests.run argv
  0

category:cq
theme:optimization
skill-level:intermediate
cost:medium
