use mmap to read files when possible

It occurs to me that we could mmap files that support it when reading them, which might improve performance substantially. We'd have to test it to make sure it doesn't cause major problems, but it could be a big win.

One case where this could be an enormous performance improvement is scanning a file with a regex. The normal approach is to read in a chunk at a time and scan each chunk with the regex. However, that breaks down if you can't be sure that every match lies entirely within a single chunk. When reading line by line and matching a regex that can only match within a line, this is fine, as long as the lines don't get prohibitively long. If the scan runs through the entire file, e.g. when the regex is being used to split the file into chunks for consumption, a match can straddle a buffer boundary and things get tricky. With the mmap approach, you can just map the file and let the regex engine go to work: the kernel pages the data in as the regex engine touches it!
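To make the idea concrete, here is a minimal sketch in Python, chosen only for illustration since the issue does not name an implementation language. The helper name `scan_mapped` is hypothetical. It relies on the standard `mmap` and `re` modules, where a read-only mapping can be passed directly to a bytes regex.

```python
import mmap
import os
import re


def scan_mapped(path, pattern):
    """Return (start, end) byte offsets of every match of `pattern` in the file.

    `pattern` must be a bytes regex (e.g. rb"\n{2,}"), because a mapping
    exposes raw bytes rather than decoded text.
    """
    # An empty file cannot be mapped with length 0, so handle it up front.
    if os.path.getsize(path) == 0:
        return []
    with open(path, "rb") as f:
        # Length 0 maps the whole file. No read() calls are issued here;
        # the kernel pages data in as the regex engine touches it.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The regex sees one contiguous buffer, so a match can never
            # straddle a chunk boundary the way it can with manual buffering.
            return [m.span() for m in re.finditer(pattern, mm)]
```

For example, `scan_mapped("records.txt", rb"\n{2,}")` would return the offsets of every run of blank lines, which a caller could use to split the file into records. The file name here is a placeholder, and whether this beats chunked reads in practice is exactly the thing that would need benchmarking.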

It would be nice to apply the same trick to streamed data, which cannot be mmapped. I think a related trick might work: memory-protect the page just after the last valid data read so far and trap accesses to that page. Then if the regex engine (or whatever is doing the reading) reads past the end of the valid buffer, the trap handler can read more data on demand. This would allow a very similar trick even with streamed input.


use mmap to read files when possible #106
