
Conversation

@AlgorithmsAreCool
Collaborator

This PR closes #116. It surfaces Span overloads for most operations and should improve both read and write performance, primarily by reducing GC allocations.
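For context, the difference a Span overload makes for callers can be sketched like this (a hypothetical usage sketch; `tx`, `db`, and the exact `Get` signature are illustrative, not the library's confirmed API):

```csharp
// Before: each lookup allocates a key array on the heap.
byte[] keyArray = BitConverter.GetBytes(42L);
var result = tx.Get(db, keyArray);

// After: the key can live on the stack, so the lookup itself allocates nothing.
Span<byte> keySpan = stackalloc byte[sizeof(long)];
BitConverter.TryWriteBytes(keySpan, 42L);
var result2 = tx.Get(db, keySpan);
```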

@AlgorithmsAreCool
Collaborator Author

Before:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.836 (1909/November2018Update/19H2)
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.5.20279.10
  [Host]     : .NET Core 3.1.5 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.27001), X64 RyuJIT
  DefaultJob : .NET Core 3.1.5 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.27001), X64 RyuJIT

| Method | OpsPerTransaction | ValueSize | KeyOrder | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| Read | 1 | 8 | Sequential | 424.5 ns | 8.15 ns | 7.62 ns | 0.0710 | - | - | 336 B |
| Read | 1 | 64 | Sequential | 429.0 ns | 8.51 ns | 13.00 ns | 0.0830 | - | - | 392 B |
| Read | 1 | 256 | Sequential | 441.4 ns | 7.69 ns | 7.19 ns | 0.1240 | - | - | 584 B |
| Read | 100 | 8 | Sequential | 12,507.1 ns | 244.10 ns | 372.77 ns | 0.7324 | - | - | 3504 B |
| Read | 100 | 64 | Sequential | 15,080.6 ns | 288.58 ns | 364.96 ns | 1.9226 | - | - | 9104 B |
| Read | 100 | 256 | Sequential | 16,760.2 ns | 331.27 ns | 515.74 ns | 6.0120 | - | - | 28304 B |
| Read | 1000 | 8 | Sequential | 173,016.9 ns | 3,397.00 ns | 8,332.91 ns | 6.8359 | - | - | 32304 B |
| Read | 1000 | 64 | Sequential | 179,190.1 ns | 3,554.55 ns | 5,428.18 ns | 18.5547 | - | - | 88304 B |
| Read | 1000 | 256 | Sequential | 201,845.5 ns | 3,963.95 ns | 6,512.88 ns | 59.5703 | - | - | 280304 B |

After:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.836 (1909/November2018Update/19H2)
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.5.20279.10
  [Host]     : .NET Core 3.1.5 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.27001), X64 RyuJIT
  DefaultJob : .NET Core 3.1.5 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.27001), X64 RyuJIT

| Method | OpsPerTransaction | ValueSize | KeyOrder | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| Read | 1 | 8 | Sequential | 426.7 ns | 8.46 ns | 11.00 ns | 0.0710 | - | - | 336 B |
| Read | 1 | 64 | Sequential | 437.0 ns | 8.76 ns | 15.79 ns | 0.0830 | - | - | 392 B |
| Read | 1 | 256 | Sequential | 445.2 ns | 8.53 ns | 10.79 ns | 0.1240 | - | - | 584 B |
| Read | 100 | 8 | Sequential | 12,377.9 ns | 236.80 ns | 324.14 ns | 0.7324 | - | - | 3504 B |
| Read | 100 | 64 | Sequential | 14,276.2 ns | 281.75 ns | 438.66 ns | 1.9226 | - | - | 9104 B |
| Read | 100 | 256 | Sequential | 16,474.7 ns | 324.40 ns | 485.55 ns | 6.0120 | - | - | 28304 B |
| Read | 1000 | 8 | Sequential | 165,299.1 ns | 3,158.93 ns | 3,243.99 ns | 6.8359 | - | - | 32304 B |
| Read | 1000 | 64 | Sequential | 179,787.0 ns | 3,389.49 ns | 3,170.53 ns | 18.5547 | - | - | 88305 B |
| Read | 1000 | 256 | Sequential | 220,032.9 ns | 5,844.55 ns | 17,048.84 ns | 59.5703 | - | - | 280304 B |

After (with the new non-allocating API):

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.836 (1909/November2018Update/19H2)
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.5.20279.10
  [Host]     : .NET Core 3.1.5 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.27001), X64 RyuJIT
  DefaultJob : .NET Core 3.1.5 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.27001), X64 RyuJIT

| Method | OpsPerTransaction | ValueSize | KeyOrder | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| Read | 1 | 8 | Sequential | 449.6 ns | 8.88 ns | 14.83 ns | 0.0644 | - | - | 304 B |
| Read | 1 | 64 | Sequential | 435.0 ns | 8.72 ns | 21.88 ns | 0.0644 | - | - | 304 B |
| Read | 1 | 256 | Sequential | 415.3 ns | 7.60 ns | 8.44 ns | 0.0644 | - | - | 304 B |
| Read | 100 | 8 | Sequential | 12,363.8 ns | 240.95 ns | 337.78 ns | 0.0610 | - | - | 304 B |
| Read | 100 | 64 | Sequential | 13,812.4 ns | 272.65 ns | 505.38 ns | 0.0610 | - | - | 304 B |
| Read | 100 | 256 | Sequential | 14,835.0 ns | 293.48 ns | 521.66 ns | 0.0610 | - | - | 304 B |
| Read | 1000 | 8 | Sequential | 164,105.6 ns | 3,256.16 ns | 5,529.21 ns | - | - | - | 304 B |
| Read | 1000 | 64 | Sequential | 172,565.8 ns | 3,365.86 ns | 5,715.49 ns | - | - | - | 304 B |
| Read | 1000 | 256 | Sequential | 173,051.3 ns | 3,410.59 ns | 7,557.63 ns | - | - | - | 304 B |

@AlgorithmsAreCool
Collaborator Author

I need to do another pass over the documentation and do the write benchmarks.

I'll take care of those tomorrow; it's late.

```csharp
_currentValue = default;
var findKey = new MDBValue(new Span<byte>(key));
return mdb_cursor_get(_handle, ref findKey, ref _currentValue, operation) == 0;
#warning should this update _currentKey?
```
Owner


Yes, it looks like I introduced this in my first pass at changing things. Good catch.

Collaborator Author


Question: for these Gets, if the key doesn't exist, should the cursor data be updated?
The LMDB docs don't actually say what gets returned when the key isn't found.

Owner


I believe it's using search under the covers, and if my memory is correct it will position the cursor at the entry just below a key "higher" than the one you looked for, if that makes sense.

Collaborator Author


That makes sense to me. One last update on the way, then.

@CoreyKaylor
Owner

CoreyKaylor commented Jun 28, 2020

Seriously, well done.

@AlgorithmsAreCool
Collaborator Author

I've spent some time today seeing if there was any low-hanging fruit in terms of perf.

In the table below:

  • Read is a normal TryRead into a preallocated buffer
  • ReadNoCopy is a read with the value copy omitted
  • ReadMinimal is raw usage of the LmdbMethods, to see how much overhead the library's safety layer was adding

The good news is that at peak we are hitting 5M+ reads per second, depending on batch and value sizes. This translates to read throughput of over 1.2 GB/s. Not too shabby. Also, the overhead of copying the value into an array is pretty low, which hurts my case for #121.

The not-so-great news is that the overhead of the transaction is felt more than I expected. The common pattern of a single read per transaction is giving up about half of its performance to library overhead. I'm a little shocked by this result, because the overhead was only about 15% or so when using sequential keys and a smaller database. This test uses random keys and 256K entries in the DB instead of just 1000.

But even the "worst case" is still ~2M reads per second. That ain't bad at all. Also, even though we are allocating ~300 bytes per read, the allocations are all clustered in Gen 0, so the GC isn't incurring much pain from them.

I'll probably create another issue to poke at it some more after I address feedback and remove those warnings I forgot about in this one.

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.836 (1909/November2018Update/19H2)
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.5.20279.10
  [Host]     : .NET Core 3.1.5 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.27001), X64 RyuJIT
  DefaultJob : .NET Core 3.1.5 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.27001), X64 RyuJIT

| Method | OpsPerTransaction | DatabaseSize | ValueSize | KeyOrder | Mean | Error | StdDev | Throughput | Memory Throughput | Median | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Read | 1 | 262144 | 8 | Random | 513.3 ns | 9.67 ns | 14.17 ns | 1,948,162.52 / s | 15,585,300.18 MB/s | 513.3 ns | 2.05 | 0.07 | 0.0639 | - | - | 304 B |
| ReadNoCopy | 1 | 262144 | 8 | Random | 497.6 ns | 9.95 ns | 18.93 ns | 2,009,765.73 / s | 16,078,125.87 MB/s | 491.2 ns | 1.98 | 0.07 | 0.0639 | - | - | 304 B |
| ReadMinimal | 1 | 262144 | 8 | Random | 250.9 ns | 5.05 ns | 7.71 ns | 3,985,211.15 / s | 31,881,689.24 MB/s | 251.1 ns | 1.00 | 0.00 | - | - | - | - |
| Read | 1 | 262144 | 64 | Random | 492.7 ns | 9.69 ns | 13.59 ns | 2,029,661.31 / s | 129,898,323.92 MB/s | 488.3 ns | 2.08 | 0.07 | 0.0639 | - | - | 304 B |
| ReadNoCopy | 1 | 262144 | 64 | Random | 492.4 ns | 9.53 ns | 13.35 ns | 2,030,757.10 / s | 129,968,454.16 MB/s | 495.6 ns | 2.08 | 0.08 | 0.0639 | - | - | 304 B |
| ReadMinimal | 1 | 262144 | 64 | Random | 237.8 ns | 4.77 ns | 6.36 ns | 4,205,150.29 / s | 269,129,618.80 MB/s | 237.8 ns | 1.00 | 0.00 | - | - | - | - |
| Read | 1 | 262144 | 256 | Random | 498.5 ns | 9.97 ns | 14.61 ns | 2,005,850.63 / s | 513,497,761.70 MB/s | 503.6 ns | 2.04 | 0.10 | 0.0639 | - | - | 304 B |
| ReadNoCopy | 1 | 262144 | 256 | Random | 494.2 ns | 9.80 ns | 16.63 ns | 2,023,338.34 / s | 517,974,614.30 MB/s | 495.7 ns | 2.00 | 0.13 | 0.0644 | - | - | 304 B |
| ReadMinimal | 1 | 262144 | 256 | Random | 252.6 ns | 5.03 ns | 12.24 ns | 3,958,301.20 / s | 1,013,325,108.30 MB/s | 254.3 ns | 1.00 | 0.00 | - | - | - | - |
| Read | 100 | 262144 | 8 | Random | 19,900.9 ns | 389.37 ns | 594.61 ns | 5,024,900.04 / s | 40,199,200.35 MB/s | 20,074.8 ns | 1.26 | 0.04 | 0.0610 | - | - | 304 B |
| ReadNoCopy | 100 | 262144 | 8 | Random | 19,488.5 ns | 377.90 ns | 912.68 ns | 5,131,219.32 / s | 41,049,754.55 MB/s | 19,426.4 ns | 1.20 | 0.07 | 0.0610 | - | - | 304 B |
| ReadMinimal | 100 | 262144 | 8 | Random | 15,874.3 ns | 298.27 ns | 319.14 ns | 6,299,496.24 / s | 50,395,969.92 MB/s | 15,858.7 ns | 1.00 | 0.00 | - | - | - | - |
| Read | 100 | 262144 | 64 | Random | 20,338.6 ns | 402.07 ns | 774.65 ns | 4,916,754.83 / s | 314,672,309.30 MB/s | 20,428.5 ns | 1.18 | 0.05 | 0.0610 | - | - | 304 B |
| ReadNoCopy | 100 | 262144 | 64 | Random | 18,794.6 ns | 373.90 ns | 445.10 ns | 5,320,667.67 / s | 340,522,731.20 MB/s | 18,747.5 ns | 1.10 | 0.05 | 0.0610 | - | - | 304 B |
| ReadMinimal | 100 | 262144 | 64 | Random | 17,299.3 ns | 344.10 ns | 584.31 ns | 5,780,576.01 / s | 369,956,864.43 MB/s | 17,583.2 ns | 1.00 | 0.00 | - | - | - | - |
| Read | 100 | 262144 | 256 | Random | 21,189.2 ns | 417.04 ns | 390.10 ns | 4,719,374.46 / s | 1,208,159,862.97 MB/s | 21,199.8 ns | 1.20 | 0.03 | 0.0610 | - | - | 304 B |
| ReadNoCopy | 100 | 262144 | 256 | Random | 20,382.1 ns | 402.55 ns | 523.43 ns | 4,906,272.15 / s | 1,256,005,670.03 MB/s | 20,381.3 ns | 1.15 | 0.03 | 0.0610 | - | - | 304 B |
| ReadMinimal | 100 | 262144 | 256 | Random | 17,678.7 ns | 350.80 ns | 344.53 ns | 5,656,513.21 / s | 1,448,067,381.00 MB/s | 17,541.7 ns | 1.00 | 0.00 | - | - | - | - |
| Read | 1000 | 262144 | 8 | Random | 271,680.7 ns | 5,404.78 ns | 8,880.21 ns | 3,680,791.47 / s | 29,446,331.78 MB/s | 270,716.9 ns | 1.09 | 0.04 | - | - | - | 304 B |
| ReadNoCopy | 1000 | 262144 | 8 | Random | 257,954.2 ns | 4,940.35 ns | 5,073.38 ns | 3,876,657.19 / s | 31,013,257.49 MB/s | 257,343.4 ns | 1.06 | 0.04 | - | - | - | 304 B |
| ReadMinimal | 1000 | 262144 | 8 | Random | 248,480.0 ns | 4,924.56 ns | 8,227.83 ns | 4,024,468.93 / s | 32,195,751.46 MB/s | 247,993.7 ns | 1.00 | 0.00 | - | - | - | - |
| Read | 1000 | 262144 | 64 | Random | 279,058.9 ns | 5,554.03 ns | 7,024.07 ns | 3,583,473.35 / s | 229,342,294.41 MB/s | 281,347.1 ns | 1.06 | 0.04 | - | - | - | 304 B |
| ReadNoCopy | 1000 | 262144 | 64 | Random | 271,230.2 ns | 5,180.11 ns | 4,845.48 ns | 3,686,904.78 / s | 235,961,906.02 MB/s | 272,830.3 ns | 1.03 | 0.02 | - | - | - | 304 B |
| ReadMinimal | 1000 | 262144 | 64 | Random | 263,690.2 ns | 5,026.03 ns | 4,701.35 ns | 3,792,328.96 / s | 242,709,053.15 MB/s | 265,527.0 ns | 1.00 | 0.00 | - | - | - | - |
| Read | 1000 | 262144 | 256 | Random | 300,485.7 ns | 5,819.14 ns | 5,975.82 ns | 3,327,944.95 / s | 851,953,908.31 MB/s | 302,726.5 ns | 1.09 | 0.04 | - | - | - | 304 B |
| ReadNoCopy | 1000 | 262144 | 256 | Random | 284,745.1 ns | 5,492.79 ns | 6,105.23 ns | 3,511,912.41 / s | 899,049,576.07 MB/s | 287,330.3 ns | 1.03 | 0.02 | - | - | - | 304 B |
| ReadMinimal | 1000 | 262144 | 256 | Random | 276,339.5 ns | 5,475.50 ns | 5,858.72 ns | 3,618,736.54 / s | 926,396,553.86 MB/s | 278,851.8 ns | 1.00 | 0.00 | - | - | - | 1 B |
| Read | 10000 | 262144 | 8 | Random | 3,156,183.7 ns | 28,290.50 ns | 26,462.95 ns | 3,168,383.35 / s | 25,347,066.80 MB/s | 3,156,604.1 ns | 1.09 | 0.01 | - | - | - | 306 B |
| ReadNoCopy | 10000 | 262144 | 8 | Random | 3,082,680.0 ns | 26,803.50 ns | 25,072.01 ns | 3,243,930.62 / s | 25,951,444.96 MB/s | 3,075,674.0 ns | 1.07 | 0.01 | - | - | - | 304 B |
| ReadMinimal | 10000 | 262144 | 8 | Random | 2,886,256.1 ns | 28,983.67 ns | 24,202.68 ns | 3,464,696.05 / s | 27,717,568.39 MB/s | 2,885,249.2 ns | 1.00 | 0.00 | - | - | - | 5 B |
| Read | 10000 | 262144 | 64 | Random | 3,978,827.5 ns | 29,298.17 ns | 27,405.53 ns | 2,513,303.23 / s | 160,851,406.60 MB/s | 3,974,537.5 ns | 1.06 | 0.01 | - | - | - | 314 B |
| ReadNoCopy | 10000 | 262144 | 64 | Random | 3,869,549.2 ns | 24,773.98 ns | 23,173.60 ns | 2,584,280.35 / s | 165,393,942.24 MB/s | 3,865,378.9 ns | 1.03 | 0.01 | - | - | - | 306 B |
| ReadMinimal | 10000 | 262144 | 64 | Random | 3,761,864.1 ns | 26,050.39 ns | 21,753.25 ns | 2,658,256.56 / s | 170,128,420.15 MB/s | 3,758,513.3 ns | 1.00 | 0.00 | - | - | - | 5 B |
| Read | 10000 | 262144 | 256 | Random | 4,671,714.2 ns | 35,431.55 ns | 31,409.15 ns | 2,140,541.91 / s | 547,978,729.99 MB/s | 4,671,530.5 ns | 1.10 | 0.02 | - | - | - | 304 B |
| ReadNoCopy | 10000 | 262144 | 256 | Random | 4,362,596.4 ns | 27,258.76 ns | 24,164.19 ns | 2,292,212.97 / s | 586,806,520.99 MB/s | 4,352,195.3 ns | 1.03 | 0.01 | - | - | - | 310 B |
| ReadMinimal | 10000 | 262144 | 256 | Random | 4,231,405.8 ns | 45,450.64 ns | 40,290.81 ns | 2,363,280.78 / s | 604,999,879.20 MB/s | 4,225,128.1 ns | 1.00 | 0.00 | - | - | - | 37 B |

@CoreyKaylor
Owner

This is all good news. Sounds like there might be more to "squeeze" out, but we're in a much better place with these changes than we were. Looking forward to getting these changes in. Thanks for all the effort here.

@CoreyKaylor
Owner

I'm planning on pulling this in tomorrow unless you have any objections?

@AlgorithmsAreCool
Collaborator Author

Yeah, I don't object. I can land anything else in another PR. Sorry about the delay here.

@CoreyKaylor
Owner

Added you as a collaborator if you're interested. All I ask is bigger changes get discussed first. Regarding time, I get it, nothing to be sorry for.

@CoreyKaylor CoreyKaylor merged commit 7958629 into CoreyKaylor:master Jul 6, 2020
```csharp
}
else
{
    fixed (byte* contiguousValuesPtr = new byte[overallLength])
```


Not sure if new buffers are created elsewhere, or indeed how often you'd expect this unoptimised path to be hit, but maybe it could use pooled memory instead of creating a new byte array, e.g. using ArrayPool from System.Buffers?

(just a random user happy to see Span support, having a quick browse through the PR 😄)
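For reference, the suggestion above might look something like this (a sketch only; `overallLength` and the surrounding method shape are stand-ins for the actual library code):

```csharp
using System;
using System.Buffers;

// Slow path, sketched with a pooled buffer instead of `new byte[overallLength]`.
byte[] rented = ArrayPool<byte>.Shared.Rent(overallLength);
try
{
    // Rent can return a larger array than requested, so work with an
    // exact-length slice before pinning it for the native call.
    Span<byte> contiguousValues = rented.AsSpan(0, overallLength);
    // ... copy the values into contiguousValues, then pin and call into LMDB ...
}
finally
{
    ArrayPool<byte>.Shared.Return(rented);
}
```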

Collaborator Author


The previous implementation allocated a lot more also, so this should be a net improvement either way.

As for pooling vs allocs, I've actually been kinda frustrated at how good the .NET GC has become 😅. I haven't benchmarked this method specifically, but it is probably still improvable, the question is by how much.

If this path is important to your use case I can create an issue for it to investigate.


@cocowalla commented Jul 7, 2020


I haven't actually benchmarked it (or even tried the new Spanified APIs yet); I was just looking through the PR and happened to notice this, and it's similar to a pattern I use a lot (fast path using stackalloc for small buffers, slow path using ArrayPool for larger buffers). But for sure, as-is it's a huge net benefit 😃
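The pattern described above can be sketched roughly like this (an illustrative sketch, not the library's code; the threshold, method name, and C# 8+ conditional-stackalloc syntax are assumptions):

```csharp
using System;
using System.Buffers;

static void ProcessBuffer(int length)
{
    const int StackAllocThreshold = 256;

    byte[] rented = null;
    // Small buffers live on the stack; larger ones come from the shared pool.
    Span<byte> buffer = length <= StackAllocThreshold
        ? stackalloc byte[StackAllocThreshold]
        : (rented = ArrayPool<byte>.Shared.Rent(length));
    try
    {
        // Both stackalloc and Rent may hand back more space than needed,
        // so slice to the exact length before use.
        Span<byte> slice = buffer.Slice(0, length);
        // ... use slice ...
    }
    finally
    {
        if (rented != null)
            ArrayPool<byte>.Shared.Return(rented);
    }
}
```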


Development

Successfully merging this pull request may close these issues.

Re-evaluate public API for possibly using Span overloads
