At the beginning of January 2024, Gunnar Morling launched The One Billion Row Challenge (1BRC) for the Java community on his blog and GitHub. The challenge aimed to find Java code that processed one billion rows from a text file and aggregated the values in the fastest time possible. Many developers submitted solutions until Morling closed the challenge at the end of the month and published the final leaderboards. After news of the challenge spread, many people built their own solutions using other languages, databases, and tools.
- Generate the measurements file with 1B rows (just once).
- Calculate the min, max & average values from the measurements file.
- Measure the time needed for reading the file and calculating the average. Output of the result values is not part of the challenge.
- Optimize the heck out of it to speed up your code!
- No external library dependencies may be used.
- Implementations must be provided as a single source file.
- The computation must happen at application runtime, i.e. you cannot process the measurements file at build time and just bake the result into the binary.
- Input value ranges are as follows:
  - Station name: non null UTF-8 string of min length 1 character and max length 100 bytes, containing neither `;` nor `\n` characters (i.e. this could be 100 one-byte characters, or 50 two-byte characters, etc.).
  - Temperature value: non null double between -99.9 (inclusive) and 99.9 (inclusive), always with one fractional digit.
- There is a maximum of 10,000 unique station names.
- Line endings in the file are `\n` characters on all platforms.
- Implementations must not rely on specifics of a given data set, e.g. any valid station name as per the constraints above and any data distribution (number of measurements per station) must be supported.
- The rounding of output values must be done using the semantics of IEEE 754 rounding-direction "roundTowardPositive".
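To make the aggregation and rounding rules above concrete, here is a minimal C# sketch, assuming the temperatures are already parsed as double; the type and member names are illustrative, not taken from any submitted solution:

```csharp
using System;
using System.Collections.Generic;

// Minimal sketch of the required aggregation. StationStats is an
// illustrative name, not part of the challenge code.
class StationStats
{
    public double Min = double.MaxValue;
    public double Max = double.MinValue;
    public double Sum;
    public long Count;

    public void Add(double value)
    {
        if (value < Min) Min = value;
        if (value > Max) Max = value;
        Sum += value;
        Count++;
    }

    // IEEE 754 "roundTowardPositive" to one fractional digit:
    // always round up to the next tenth, e.g. -1.15 -> -1.1.
    // (Beware: the multiplied value is still subject to binary
    // floating-point representation.)
    public static double RoundTowardPositive(double value) =>
        Math.Ceiling(value * 10.0) / 10.0;

    public override string ToString() =>
        $"{Min:F1}/{RoundTowardPositive(Sum / Count):F1}/{Max:F1}";
}
```

A `Dictionary<string, StationStats>` keyed by station name then drives the per-line `Add` calls; min and max need no extra rounding because the input values already carry exactly one fractional digit.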
Attention:
- The original generator used a list of weather station names and selected 10k random names from it. The shortest name I found is 3 bytes long and the longest is 24 bytes.
- This results in a line range of 3-24 bytes name + 1 byte separator + 3-5 bytes for the temperature ("0.0" to "-99.9") + 1 byte new line = 8-31 bytes per line. The total file size is 8-31 GB, average 19.5 GB.
- The original post contained a warning that the generated file would be approximately 12 GB in size, which would mean that most names are 5-7 bytes long.
- The extended challenge added generic weather station names with 1-100 bytes in length.
- This results in a line range of 1-100 bytes name + 1 byte separator + 3-5 bytes for the temperature + 1 byte new line = 6-107 bytes per line. The total file size is 6-107 GB, average 56.5 GB.
Old notebook
- Intel Core i5-4200U - max. 2.3 GHz
- DDR3 - 8GB RAM - 1600 MHz
- Toshiba MQ01ABF050 - Read 100 MB/s, Write 96 MB/s
- Visual Studio 2019 (Version 16.11.44)
- .NET Framework 4.7.2
- .NET 5 (Core)
There are a couple of classes that could be used to read & write data to & from a file:
- FileInfo.CreateText
- File.Create
- File.CreateText
- File.ReadAllText
- StreamWriter
- StreamReader
- BinaryWriter
- BinaryReader
- FileStream
- Memory mapped files
- P/invoke native methods
After an initial test with reduced data, the FileStream with default settings was the clear winner. Because I wanted to understand why that was the case, I took a deep dive into the documentation, code, and performance tests for a couple of days. The collected information and results can be found here.
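For orientation, a baseline block read with FileStream and default settings could look like the following sketch; the 1 MB block size and the newline counting are placeholders for the real parsing:

```csharp
using System;
using System.IO;

static class BaselineReader
{
    // Sketch: read the file in large blocks through a FileStream with
    // default settings; counting '\n' bytes stands in for real parsing.
    public static long CountLines(string path)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            var block = new byte[1 << 20]; // 1 MB user-side block
            long lines = 0;
            int read;
            while ((read = stream.Read(block, 0, block.Length)) > 0)
                for (int i = 0; i < read; i++)
                    if (block[i] == (byte)'\n')
                        lines++;
            return lines;
        }
    }
}
```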
The logic to generate and read the measurements file is not complicated, but depending on the implementation it can take a lot of time. The following tables show a step-by-step refactoring of the generation and reading code and highlight the performance improvement of each step.
Generating the measurements file

| Step | Duration (mm:ss:ms) | File size in GB | MB/s | Improvement in % | Source |
|---|---|---|---|---|---|
| StreamWriter.WriteLine(line as string) in .NET Framework (name length 7) | 11:03:084 | 10.720 | 16.17 | baseline | File |
| FileStream.Write(line as byte array) | 12:01:301 | 16.750 | 23.22 | 42.19 | File |
| FileStream.Write(block filled with multiple lines) | 11:04:100 | 13.201 | 19.88 | 22.94 | File |
| Reduced array creation & copy when combining the line | 10:00:579 | 13.201 | 21.98 | 35.93 | File |
| Temperature generation writes bytes directly to block | 03:17:209 | 11.595 | 58.79 | 263.57 | File |
| Parallel execution | 03:06:660 | 13.195 | 70.69 | 337.17 | File |
| Copy name with Marshal.Copy instead of Array.Copy | 02:49:013 | 13.195 | 78.07 | 382.81 | File |
| Replaced real random with iteration through the values | 02:40:766 | 13.033 | 81.07 | 401.36 | File |
| Refactored code (parallel line generation & async write) | 02:39:172 | 13.200 | 82.93 | 412.86 | File |
| Allocate slices, store pointers & P/invoke CopyMemory | 02:35:141 | 13.200 | 85.08 | 426.16 | File |
| P/invoke CreateFile, WriteFile & CloseHandle | 02:26:106 | 13.200 | 90.34 | 458.69 | File |
| Removed slices & write directly to block (added fixed list of 10k names) | 02:52:751 | 15.157 | 87.74 | 442.61 | File |
| Copied current code to .NET 5, changed to safe version & use spans | 02:46:481 | 14.882 | 89.39 | 452.81 | File |
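The biggest single win in the table above was "Temperature generation writes bytes directly to block". The idea, as a simplified sketch (the method name and buffer handling are mine): write the digits straight into a byte buffer instead of formatting strings, and flush the buffer in large chunks:

```csharp
using System;

static class LineWriter
{
    // Sketch: append one "name;temperature\n" line directly into a
    // byte block. The temperature is passed as tenths of a degree
    // (e.g. -123 for "-12.3"), so no string or double formatting is
    // involved. Returns the new write position inside the block.
    public static int AppendLine(byte[] block, int pos, byte[] nameBytes, int tenths)
    {
        Array.Copy(nameBytes, 0, block, pos, nameBytes.Length);
        pos += nameBytes.Length;
        block[pos++] = (byte)';';
        if (tenths < 0) { block[pos++] = (byte)'-'; tenths = -tenths; }
        int whole = tenths / 10;
        if (whole >= 10) block[pos++] = (byte)('0' + whole / 10); // tens digit
        block[pos++] = (byte)('0' + whole % 10);                  // ones digit
        block[pos++] = (byte)'.';
        block[pos++] = (byte)('0' + tenths % 10);                 // fractional digit
        block[pos++] = (byte)'\n';
        return pos;
    }
}
```

Once the block is nearly full it can be written out with a single FileStream.Write (or later a single WriteFile) call.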
Reading the measurements file

| Step | Duration (mm:ss:ms) | File size in GB | MB/s | Improvement in % | Source |
|---|---|---|---|---|---|
| StreamReader.ReadLine(line as string) in .NET Framework | 08:14:673 | 14.882 | 30.08 | baseline | File |
| FileStream.ReadByte (single byte) | 08:04:929 | 14.882 | 30.69 | 2.03 | File |
| Parse temperature to int instead of double | 07:41:859 | 14.882 | 32.22 | 7.11 | File |
| Faster int parse | 04:42:568 | 14.882 | 52.67 | 75.10 | File |
| Faster dictionary look-up | 03:10:418 | 14.882 | 78.15 | 159.81 | File |
| FileStream.Read (byte array) | 02:48:808 | 14.882 | 88.16 | 193.08 | File |
| P/invoke ReadFile & CloseHandle | 02:48:665 | 14.882 | 88.23 | 193.32 | File |
| Parallel execution | 02:45:487 | 14.882 | 89.93 | 198.97 | File |
| Copied current code to .NET 5, changed to safe version & use spans | 02:52:967 | 15.157 | 87.63 | 191.32 | File |
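The "parse temperature to int" and "faster int parse" steps can be sketched like this: because the format is guaranteed (optional '-', one or two integer digits, '.', exactly one fractional digit), the value can be read as tenths of a degree without double.Parse. This is an illustrative sketch, not the exact code behind the table entries:

```csharp
using System;

static class TemperatureParser
{
    // Sketch: parse a temperature like "-12.3" into tenths (-123),
    // relying on the guaranteed format instead of double.Parse.
    public static int ParseTenths(ReadOnlySpan<byte> bytes)
    {
        int i = 0, sign = 1;
        if (bytes[0] == (byte)'-') { sign = -1; i = 1; }
        int value = bytes[i++] - (byte)'0';
        if (bytes[i] != (byte)'.')                    // second integer digit
            value = value * 10 + (bytes[i++] - (byte)'0');
        i++;                                           // skip '.'
        value = value * 10 + (bytes[i] - (byte)'0');   // fractional digit
        return sign * value;
    }
}
```

Working in tenths also keeps the min/max/sum aggregation in integer arithmetic until the final output.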
After a lot of tests and refactoring, the final .NET Framework version with unsafe code and the .NET 5 version with safe code both reached close to 90% of the possible write & read speed. With a bit more tweaking a couple more seconds could perhaps be shaved off, but right now the result is only about 10 seconds away from the maximum write & read speed, which might only be reachable in theory anyway. I may test this on a faster device with more I/O speed in the future, but for now this is enough. Upgrading to newer versions of .NET should also bring further performance improvements.
It was a fun challenge, and I enjoyed learning a bit more about the different approaches to, and pitfalls of, reading & writing files.