Skip to content

Conversation

marschall
Copy link
Contributor

TransactionAwareBufferedWriter offers a number of optimization
potentials. First it buffers at the char level. Buffering at the byte
level instead saves about 50% memory usage in common cases. Second it
does not overwrite any of the #write(String) methods leading to
unnecessary intermediate copies.

  • buffer in a ByteArrayOutputStream instead of StringBuilder
  • overwrite #write(String) methods to avoid copies

Together these two changes should help to reduce both live set size and
allocation rate.

Issue: BATCH-2434
https://jira.spring.io/browse/BATCH-2434

@marschall
Copy link
Contributor Author

I have signed and agree to the terms of the SpringSource Individual Contributor License Agreement.

@fmbenhassine
Copy link
Contributor

Hi @marschall ,

Thank you for this PR!

Buffering at the byte level instead saves about 50% memory usage in common cases

How did you measure this? I would like to be able to measure the improvement of this PR before applying any change.

@marschall
Copy link
Contributor Author

Hi @marschall ,

Thank you for this PR!

Buffering at the byte level instead saves about 50% memory usage in common cases

How did you measure this? I would like to be able to measure the improvement of this PR before applying any change.

TransactionAwareBufferedWriter uses a StringBuilder to buffer. StringBuilder uses a char[] to buffer. A char is 16 bit so we end up using 16 bit for every character.
With the change TransactionAwareBufferedWriter uses ByteArrayOutputStream to buffer. ByteArrayOutputStream uses a byte[] to buffer. When a 1 byte encoding is used like ASCII or Latin-1 we end up using only 8 bit for every character. Even if we use UFT-8 most characters end up only using 8 bit.

This changes a bit with Java 9+ where StringBuilder uses a byte[] with a flexible encoding. When the whole string is Latin-1 then only a single byte is used for storing a character, otherwise two bytes are used as before.

Doing a heap dump while a transaction is active and looking at the retained heap of a TransactionAwareBufferedWriter should reveal the differences on Java 8. On Java 9+ a single non Latin-1 character like € has to be in the buffer, otherwise they should be equal.

@fmbenhassine
Copy link
Contributor

Thank you for your answer, I was aware of the theory and just wanted to know how you practically measured the 50% improvement in terms of memory usage.

I see the PR is opened against the 3.0.x branch, can you please rebase it on the latest master?

I will add some inline comments about the changes.

@marschall marschall changed the base branch from 3.0.x to master August 11, 2019 13:28
@marschall marschall force-pushed the BATCH-2434 branch 3 times, most recently from abcccb2 to 694b99e Compare August 11, 2019 13:49
this.writer = new OutputStreamWriter(outputStream, encoding);
}

int length() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is used in org.springframework.batch.support.transaction.TransactionAwareBufferedWriter#getBufferSize. According to its Javadoc, the method should return the size of unflushed buffered data. However, the method length flushes the writer before returning the size of the outputStream. Do we need to flush the writer to get the length of the buffer? A get method should be side-effect free IMO. I tried to remove the writer.flush but this makes several tests to fail. Wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that OutputStreamWriter has an internal buffer in sun.nio.cs.StreamEncoder. If we want to know how many bytes we will write this will be the bytes in the ByteArrayOutputStream plus the bytes in the OutputStreamWriter / sun.nio.cs.StreamEncoder. There is no way of accessing the number of bytes in the internal buffer in OutputStreamWriter / sun.nio.cs.StreamEncoder. The only way to be sure is to flush those bytes to the ByteArrayOutputStream.

@fmbenhassine
Copy link
Contributor

fmbenhassine commented Apr 29, 2020

@marschall Thank you for all these updates! While I like the improvement suggested here and the build is passing with this PR, I still have a couple of concerns:

  1. According to its javadoc, TransactionAwareBufferedWriter is designed to be a wrapper for a file channel and the fact that we are changing its constructor to accept an output stream instead of a file channel makes me nervous (even if an output stream has a file channel associated with it). I feel like we can implement the improvement without introducing this API breaking change (that we are trying to make non breaking by deprecating the current constructor in favor of the new one)
  2. TransactionAwareBufferedWriter is used in all file item writers (flat, json and xml files). So any hidden regression related to this can have a huge impact. The reason I'm pointing that out is that while we have some pretty good unit tests for this class, I don't see any integration test covering the "file write failed + restart" scenario where Spring Batch should truncate the file to the last known offset before starting to write items from where it left off in the previous failed run. I can take care of creating such a test suite (a test for each file type json/flat/xml + different encodings: single byte and multi-byte), but I really want to make sure there are no regressions at this level before merging this PR.

@marschall What do you think?

I also would like to have a second opinion on that PR, @mminella What are your thoughts? Anything I could have missed or overlooked?

Thank you both.

@marschall
Copy link
Contributor Author

@benas can you be a bit more specific what changes you would like to see or which approach you would prefer? A deprecated constructor is provided for backwards comparability. Should I change the class Javadoc as well?


private FileChannel channel;

private OutputStream outputStream;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would still require a FileOutputStream given the dependency on the FileChannel so I'd make this be a FileOutputStream

@fmbenhassine
Copy link
Contributor

fmbenhassine commented May 6, 2020

We have been discussing this with Michael and here are our thoughts. Since the PR contains two improvements, I will address them separately.

1. buffer in a ByteArrayOutputStream instead of StringBuilder

This changes a bit with Java 9+ where StringBuilder uses a byte[] with a flexible encoding.

Indeed, this is basically JEP 254. Since v4.3 will be the last version in the v4 line (as announced last year) and since the next major version will have a Java9+ baseline anyway, we think the deprecation + additional complexity here is not worth it for the remaining short lifetime of v4.

2. overwrite #write(String) methods to avoid copies

This enhancement makes sense and could be merged. I tried it here and the build passes without any regression.

@marschall Could you please update the PR to keep only the second enhancement? It should be good to merge with that. Thank you upfront.

@fmbenhassine fmbenhassine added the status: waiting-for-triage Issues that we did not analyse yet label May 6, 2020
@marschall
Copy link
Contributor Author

@benas yes I can do this

@fmbenhassine fmbenhassine added status: waiting-for-reporter Issues for which we are waiting for feedback from the reporter and removed status: waiting-for-triage Issues that we did not analyse yet labels May 6, 2020
TransactionAwareBufferedWriter offers a number of optimization
potentials. First it creates an unnecessary local, temporary char[] in

 * avoid local, temporary char[] in #write(char[], int, int)
 * overwrite #write(String) methods to avoid copies

Together these two changes should help to reduce allocation rate.

Issue: BATCH-2434
https://jira.spring.io/browse/BATCH-2434
@marschall
Copy link
Contributor Author

I pushed the new version.

@fmbenhassine fmbenhassine linked an issue May 7, 2020 that may be closed by this pull request
@fmbenhassine
Copy link
Contributor

LGTM. Rebased and merged as 10fd371. Thank you very much for your contribution!

@fmbenhassine fmbenhassine added has: backports Legacy label from JIRA. Superseded by "for: backport-to-x.x.x" and removed status: waiting-for-reporter Issues for which we are waiting for feedback from the reporter labels May 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

has: backports Legacy label from JIRA. Superseded by "for: backport-to-x.x.x" pr-for: enhancement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve TransactionAwareBufferedWriter efficiency [BATCH-2434]

3 participants