
1️⃣🐝🏎️ The One Billion Row Challenge

At the beginning of January 2024, Gunnar Morling launched The One Billion Row Challenge (1BRC) for the Java community on his blog and GitHub. The challenge aimed to find the Java code that processes one billion rows from a text file and aggregates the values in the fastest time possible. Many developers submitted solutions before Morling closed the challenge at the end of the month and published the final leaderboards. After news of the challenge spread, many people built their own solutions using other languages, databases, and tools.

Basics of running the challenge

  1. Generate the measurements file with 1B rows (just once).
  2. Calculate the min, max & average values per station from the measurements file (a naive baseline sketch follows this list).
    • Measure the time needed for reading the file and calculating the averages. Outputting the result values is not part of the challenge.
  3. Optimize the heck out of it to speed up your code!
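
A naive baseline for step 2 might look like the sketch below: read the file line by line with a StreamReader and keep min/max/sum/count per station in a dictionary. The file name, the tuple accumulator, and the output format are illustrative assumptions, not this repository's code; the snippet uses .NET 5-style top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

// Naive baseline: read line by line and aggregate min/max/sum/count per station.
var stats = new Dictionary<string, (double Min, double Max, double Sum, long Count)>();

using var reader = new StreamReader("measurements.txt");
string? line;
while ((line = reader.ReadLine()) != null)
{
    int sep = line.IndexOf(';');
    string station = line[..sep];
    double value = double.Parse(line[(sep + 1)..], CultureInfo.InvariantCulture);

    stats[station] = stats.TryGetValue(station, out var s)
        ? (Math.Min(s.Min, value), Math.Max(s.Max, value), s.Sum + value, s.Count + 1)
        : (value, value, value, 1L);
}

// Printing is only for verification; the challenge times reading and aggregating.
foreach (var (station, s) in stats)
    Console.WriteLine($"{station}={s.Min:F1}/{s.Sum / s.Count:F1}/{s.Max:F1}");
```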

Rules and limits

  • No external library dependencies may be used.
  • Implementations must be provided as a single source file.
  • The computation must happen at application runtime, i.e. you cannot process the measurements file at build time and just bake the result into the binary.
  • Input value ranges are as follows:
    • Station name: non-null UTF-8 string of min length 1 character and max length 100 bytes, containing neither ; nor \n characters (i.e. this could be 100 one-byte characters, or 50 two-byte characters, etc.).
    • Temperature value: non-null double between -99.9 (inclusive) and 99.9 (inclusive), always with one fractional digit.
  • There is a maximum of 10,000 unique station names.
  • Line endings in the file are \n characters on all platforms.
  • Implementations must not rely on specifics of a given data set, e.g. any valid station name as per the constraints above and any data distribution (number of measurements per station) must be supported.
  • The rounding of output values must be done using the semantics of IEEE 754 rounding-direction "roundTowardPositive" (a sketch of this follows the list).
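
Read literally, roundTowardPositive means rounding up toward positive infinity, i.e. a ceiling on the tenths grid. A minimal sketch of that reading in C# (the helper name is mine, not from this repository, and whether the challenge applies this to every intermediate value is outside the scope of the sketch):

```csharp
using System;

// IEEE 754 rounding-direction "roundTowardPositive" at one fractional digit:
// scale to tenths, round up toward positive infinity, scale back.
static double RoundTowardPositive(double value) => Math.Ceiling(value * 10.0) / 10.0;

Console.WriteLine(RoundTowardPositive(0.15));   // 0.2
Console.WriteLine(RoundTowardPositive(-0.15));  // -0.1 (still rounded toward +infinity)
```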

Attention:

  • The original generation used a list of weather station names and selected 10k random names from it. The shortest name I found is 3 bytes long and the longest is 24 bytes long.
    • This results in a line length of 3-24 bytes for the name + 1 byte separator + 3-5 bytes for the temperature + 1 byte newline = 8-31 bytes per line. For one billion rows the total file size is 8-31 GB, roughly 19.5 GB at the midpoint.
    • The original post contained a warning that the generated file would be approximately 12 GB in size, which would mean that most names are 5-9 bytes long (see the size estimate after this list).
  • The extended challenge added generic weather station names of 1-100 bytes in length.
    • This results in a line length of 1-100 bytes for the name + 1 byte separator + 3-5 bytes for the temperature + 1 byte newline = 6-107 bytes per line. For one billion rows the total file size is 6-107 GB, roughly 56.5 GB at the midpoint.
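
As a rough cross-check of the ~12 GB figure, the expected file size can be estimated from an assumed average name width; both averages below are assumptions for illustration, not measured values.

```csharp
using System;

// Rough size estimate for one billion rows of "<name>;<temperature>\n".
const long rows = 1_000_000_000;
const int avgNameBytes = 6;   // assumed; ~6-byte names would explain the ~12 GB estimate
const int avgTempBytes = 4;   // assumed; values such as "12.3" or "-5.0" average about 4 bytes

long bytesPerLine = avgNameBytes + 1 /* ';' */ + avgTempBytes + 1 /* '\n' */;
Console.WriteLine($"{rows * bytesPerLine / 1_000_000_000.0:F1} GB"); // 12.0 GB
```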

Setup

Old notebook

  • Intel Core i5-4200U - max. 2.3 GHz
  • DDR3 - 8GB RAM - 1600 MHz
  • Toshiba MQ01ABF050 - Read 100 MB/s Write 96 MB/s
  • Visual Studio 2019 (Version 16.11.44)
  • .NET Framework 4.7.2
  • .NET 5 (Core)

Writing & reading a file

There are a number of classes and APIs that could be used to read & write data to & from a file:

  • FileInfo.CreateText
  • File.Create
  • File.CreateText
  • File.ReadAllText
  • StreamWriter
  • StreamReader
  • BinaryWriter
  • BinaryReader
  • FileStream
  • Memory mapped files
  • P/invoke native methods

After an initial test with reduced data, the FileStream with default settings was the clear winner. Because I wanted to understand why that was the case, I took a deep dive into the documentation, the code, and performance tests for a couple of days. The collected information and results can be found here.
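
That kind of initial comparison can be reproduced with a simple Stopwatch micro-benchmark. The sketch below contrasts just two of the candidates, StreamReader.ReadLine versus FileStream.Read into a byte buffer, on a reduced test file; the file name and buffer size are assumptions, not the setup used for the numbers in this README.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Quick micro-benchmark sketch: compare two read strategies on a reduced test file.
const string path = "measurements_small.txt";

var sw = Stopwatch.StartNew();
long lines = 0;
using (var reader = new StreamReader(path))
{
    while (reader.ReadLine() != null) lines++;
}
Console.WriteLine($"StreamReader.ReadLine: {lines:N0} lines in {sw.Elapsed}");

sw.Restart();
long totalBytes = 0;
var buffer = new byte[64 * 1024];   // 64 KB caller buffer; FileStream keeps its default settings
using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
{
    int read;
    while ((read = stream.Read(buffer, 0, buffer.Length)) > 0) totalBytes += read;
}
Console.WriteLine($"FileStream.Read: {totalBytes:N0} bytes in {sw.Elapsed}");
```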

The logic to generate and read the measurements file is not complicated, but it can take a lot of time depending on the implementation. The following tables show a step-by-step refactoring of the generating and reading code and highlight the performance improvement achieved by each step.

Generating the measurements file

| Step | Duration (mm:ss.fff) | File size in GB | MB/s | Improvement in % | File |
| --- | --- | --- | --- | --- | --- |
| StreamWriter.WriteLine(line as string) in .NET Framework (name length 7) | 11:03.084 | 10.720 | 16.17 | baseline | File |
| FileStream.Write(line as byte array) | 12:01.301 | 16.750 | 23.22 | 42.19 | File |
| FileStream.Write(block filled with multiple lines) | 11:04.100 | 13.201 | 19.88 | 22.94 | File |
| Reduced array creation & copy when combining the line | 10:00.579 | 13.201 | 21.98 | 35.93 | File |
| Temperature generation writes bytes directly to block | 03:17.209 | 11.595 | 58.79 | 263.57 | File |
| Parallel execution | 03:06.660 | 13.195 | 70.69 | 337.17 | File |
| Copy name with Marshal.Copy instead of Array.Copy | 02:49.013 | 13.195 | 78.07 | 382.81 | File |
| Replaced real random with iteration through the values | 02:40.766 | 13.033 | 81.07 | 401.36 | File |
| Refactored code (parallel line generation & async write) | 02:39.172 | 13.200 | 82.93 | 412.86 | File |
| Allocate slices, store pointers & P/invoke CopyMemory | 02:35.141 | 13.200 | 85.08 | 426.16 | File |
| P/invoke CreateFile, WriteFile & CloseHandle | 02:26.106 | 13.200 | 90.34 | 458.69 | File |
| Removed slices & write directly to block (added fixed list of 10k names) | 02:52.751 | 15.157 | 87.74 | 442.61 | File |
| Copied current code to .NET 5, changed to safe version & use spans | 02:46.481 | 14.882 | 89.39 | 452.81 | File |
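
Several of the steps above boil down to the same idea: build many lines into one reusable byte block and hand the whole block to FileStream.Write in a single call. A much simplified sketch of that pattern follows; the station list, temperature generation, block size, and row count are placeholders, not the repository's code.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;

// Simplified "block filled with multiple lines" writer; all constants are illustrative.
string[] stations = { "Hamburg", "Bulawayo", "Palembang" };
var rng = new Random();
var block = new byte[1 << 20];     // 1 MB block, reused for every FileStream.Write call
long rowsToWrite = 10_000_000;

using var stream = new FileStream("measurements.txt", FileMode.Create, FileAccess.Write);
int offset = 0;
for (long i = 0; i < rowsToWrite; i++)
{
    // Build one "<name>;<temperature>\n" line and append its UTF-8 bytes to the block.
    double temp = Math.Round(rng.NextDouble() * 199.8 - 99.9, 1);
    string line = $"{stations[rng.Next(stations.Length)]};{temp.ToString("F1", CultureInfo.InvariantCulture)}\n";

    if (offset + Encoding.UTF8.GetByteCount(line) > block.Length)
    {
        stream.Write(block, 0, offset);   // flush the full block with a single write
        offset = 0;
    }
    offset += Encoding.UTF8.GetBytes(line, 0, line.Length, block, offset);
}
if (offset > 0) stream.Write(block, 0, offset);   // flush the remainder
```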

Reading the file

| Step | Duration (mm:ss.fff) | File size in GB | MB/s | Improvement in % | File |
| --- | --- | --- | --- | --- | --- |
| StreamReader.ReadLine(line as string) in .NET Framework | 08:14.673 | 14.882 | 30.08 | baseline | File |
| FileStream.ReadByte (single byte) | 08:04.929 | 14.882 | 30.69 | 2.03 | File |
| Parse temperature to int instead of double | 07:41.859 | 14.882 | 32.22 | 7.11 | File |
| Faster int parse | 04:42.568 | 14.882 | 52.67 | 75.10 | File |
| Faster dictionary lookup | 03:10.418 | 14.882 | 78.15 | 159.81 | File |
| FileStream.Read (byte array) | 02:48.808 | 14.882 | 88.16 | 193.08 | File |
| P/invoke ReadFile & CloseHandle | 02:48.665 | 14.882 | 88.23 | 193.32 | File |
| Parallel execution | 02:45.487 | 14.882 | 89.93 | 198.97 | File |
| Copied current code to .NET 5, changed to safe version & use spans | 02:52.967 | 15.157 | 87.63 | 191.32 | File |
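
Two of the bigger wins in this table come from treating the temperature as an integer number of tenths instead of a double and from hand-rolling the parser. Since every value is guaranteed to have exactly one fractional digit, the decimal point can simply be skipped while accumulating digits. The sketch below shows that idea; the method name and span-based signature are mine, not the repository's.

```csharp
using System;
using System.Text;

static class TemperatureParser
{
    // Parses UTF-8 bytes such as "-12.3" into tenths of a degree (-123).
    // Relies on the challenge guarantee of exactly one fractional digit.
    public static int ParseTenths(ReadOnlySpan<byte> bytes)
    {
        int sign = 1, i = 0;
        if (bytes[0] == (byte)'-') { sign = -1; i = 1; }

        int value = 0;
        for (; i < bytes.Length; i++)
        {
            if (bytes[i] == (byte)'.') continue;    // skip the decimal point
            value = value * 10 + (bytes[i] - '0');  // accumulate the digit
        }
        return sign * value;
    }
}

// Example: TemperatureParser.ParseTenths(Encoding.UTF8.GetBytes("-12.3")) returns -123.
// Sums and min/max stay in integer arithmetic and are divided by 10.0 only for the output.
```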

Conclusion

After a lot of tests and refactoring, the final .NET Framework version with unsafe code and the .NET 5 version with safe code both reached close to 90% of the possible write & read speed. With a bit more tweaking a few more seconds could perhaps be shaved off, but right now the result is only about 10 seconds away from the maximum write & read speed, which might only be achievable in theory. I might have to test this on a better device with more I/O speed in the future, but for now it is enough. Upgrading to a newer version of .NET should also improve performance.

It was a fun challenge and I enjoyed learning a bit more about the different approaches to, and pitfalls of, reading & writing files.
