use mmap to read files when possible #106

@StefanKarpinski

It occurs to me that we could mmap files that are mmappable when reading them, which could improve performance quite drastically. We'd have to test it to make sure it doesn't have major issues, but it could be a big win.

One really great example of where this could be an enormous performance improvement is using a regex to scan a file. The normal approach is to read in a chunk at a time and then scan that with the regex. However, that has issues if you can't be sure that every match falls entirely within a single chunk. For example, when reading line-by-line and matching a regex that can only match within a line, this is fine, as long as the lines don't get prohibitively long. If the scan is through the entire file, e.g. in the case where the regex is being used to split the file into chunks for consumption, then this can get tricky. Using the mmap approach, you can just map the file and let the regex engine go to work: the kernel will handle getting the data when the regex engine is ready for it!
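To make the regex case concrete, here is a minimal sketch in Python (chosen only for illustration, not a proposal for the actual implementation). It relies on the standard `mmap` and `re` modules; the function name `split_mapped_file` and the default delimiter are made up for this example.

```python
import mmap
import re


def split_mapped_file(path, delimiter=rb"\n\s*\n"):
    """Split a file into chunks on a regex by scanning a memory mapping.

    Hypothetical helper for illustration: the regex engine walks the
    mapping directly, and the kernel pages data in as it is touched, so
    there is no read buffer whose boundary a match could straddle.
    """
    pattern = re.compile(delimiter)
    with open(path, "rb") as f:
        # Length 0 maps the whole file. Note that mmap raises ValueError
        # for an empty file, so callers need a fallback for that case.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            chunks = []
            start = 0
            for match in pattern.finditer(mapped):
                # Slicing a mapping copies the bytes out, so the chunks
                # stay valid after the mapping is closed.
                chunks.append(mapped[start:match.start()])
                start = match.end()
            chunks.append(mapped[start:])
            return chunks
```

Calling `split_mapped_file("data.txt")` would return the blank-line-separated blocks of a hypothetical `data.txt` as a list of `bytes`. The delimiter can span any number of bytes without special handling, which is exactly what is awkward with chunked reads.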

It would be nice to be able to apply the same trick to streamed data, which cannot be mmapped. However, I think a related trick might work: memory-protect the page after the last valid data that has been read, and trap accesses to that page. Then if the regex engine (or whatever is doing the reading) reads past the valid buffer, you can go ahead and read more data on demand. This allows a very similar trick to be done even with streamed input.

Labels

performance (Must go faster), speculative (Whether the change will be implemented is speculative)