
Conversation

@vsaase (Contributor) commented Jul 31, 2021

Hi,

this adds support for lossless decoding. As this is a special need, mostly in medical imaging, I tried to touch as little code as possible.
Performance is a bit low at the moment; help would be appreciated here.

If you think this is too special a use case, I might also publish it in its own crate.

@197g (Member) left a comment

I'm pretty happy with this overall, except that the numerical part of the prediction isn't very SIMD-friendly (which is probably the reason for the slowness). This is necessarily a breaking change due to the additional pixel formats being introduced, which makes it easy to justify the MSRV bump. Just one question: can you specify precisely where the 1.40 requirement comes from?

    &mut self,
    frame: &FrameInfo,
    scan: &ScanInfo,
) -> Result<(Option<Marker>, Option<Vec<Vec<u16>>>)> {
Member

I can't see any return path making use of Option<Vec<_>> here. In any case, when types become this complicated I've found it better to create a very simple struct, if only for the ability to properly document each portion of the return type as field documentation.
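As an illustration only (the struct and field names here are hypothetical, not from this PR), such a type could look like this, mirroring the tuple above:

    /// Result of decoding one lossless scan; mirrors the tuple return type so
    /// that each part of the return value carries its own documentation.
    /// `Marker` is the crate's existing marker type.
    struct LosslessScanResult {
        /// Marker that terminated the scan, if one was encountered.
        marker: Option<Marker>,
        /// Decoded samples, one buffer per component in the scan.
        data: Option<Vec<Vec<u16>>>,
    }

The function signature would then become -> Result<LosslessScanResult>.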

Contributor Author

It's the same return type that the decode_scan function in decoder.rs has. However, I have now removed the Option around Vec<Vec<u16>>.

Member

Gotcha, I see why it was initially chosen. That said, this increases the benefit of defining a new type for it, since the type could then be shared between both methods.

Comment on lines 85 to 126
for (i, component) in components.iter().enumerate() {
    let dc_table = self.dc_huffman_tables[scan.dc_table_indices[i]]
        .as_ref()
        .unwrap();
    let value = huffman.decode(reader, dc_table)?;
    let diff = match value {
        0 => 0,
        1..=15 => huffman.receive_extend(reader, value)? as i32,
        16 => 32768,
        _ => {
            // Section F.1.2.1.1
            // Table F.1
            return Err(Error::Format(
                "invalid DC difference magnitude category".to_owned(),
            ));
        }
    };
    if mcu_x > 0 {
        ra[i] = results[i][mcu_y * frame.image_size.width as usize + mcu_x - 1];
    }
    if mcu_y > 0 {
        rb[i] = results[i][(mcu_y - 1) * frame.image_size.width as usize + mcu_x];
        if mcu_x > 0 {
            rc[i] = results[i]
                [(mcu_y - 1) * frame.image_size.width as usize + (mcu_x - 1)];
        }
    }
    let prediction = predict(
        ra[i] as i32,
        rb[i] as i32,
        rc[i] as i32,
        scan.predictor_selection,
        scan.point_transform,
        frame.precision,
        mcu_x,
        mcu_y,
        self.restart_interval > 0
            && mcus_left_until_restart == self.restart_interval - 1,
    );
    let result = ((prediction + diff) & 0xFFFF) as u16; // modulo 2^16
    results[i].push(result << scan.point_transform);
}
Member

This is probably the cause of the slowness: there are countless bounds-checked indexing operations here, and none of these computations can be effectively transformed into SIMD loops. We can work on this later, but a comment would make it clear we're aware of it.
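For context, these are the lossless predictors from Table H.1 of the JPEG specification that the predict call above chooses between. The function below is a simplified sketch (point transform, first-row/first-column handling and restarts are omitted), not the code in this PR; it also shows that several selections use only one or two of ra/rb/rc:

    /// Simplified sketch of the lossless-JPEG predictors (ITU-T T.81, Table H.1).
    fn predict_simplified(ra: i32, rb: i32, rc: i32, selection: u8) -> i32 {
        match selection {
            1 => ra,
            2 => rb,
            3 => rc,
            4 => ra + rb - rc,
            5 => ra + ((rb - rc) >> 1),
            6 => rb + ((ra - rc) >> 1),
            7 => (ra + rb) >> 1,
            _ => 0, // selection 0: no prediction (hierarchical mode only)
        }
    }

Since the selection is fixed for the whole scan, this per-pixel match (and the loads feeding it) is a natural candidate for hoisting out of the inner loop.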

@vsaase (Contributor Author) Jul 31, 2021

Thanks, I added a comment. There is probably some redundancy in the index checking and in the predictor selection. However, I think this prediction scheme is generally not well suited to parallelization, since almost all computations depend on the immediately preceding results when iterating in the standard order. There might be some potential performance gain from iterating diagonally (computing diagonals with mcu_x + mcu_y = const), as predictions are done from top left to bottom right.
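As a rough illustration of that iteration order (a sketch only, assuming the DC differences have already been entropy-decoded into a buffer; the bitstream itself still has to be read sequentially):

    /// Visit all (mcu_x, mcu_y) positions along anti-diagonals mcu_x + mcu_y == d.
    /// Each pixel depends only on its left, top and top-left neighbours, which lie
    /// on earlier diagonals, so the positions within one diagonal are independent.
    fn for_each_antidiagonal(width: usize, height: usize, mut visit: impl FnMut(usize, usize)) {
        for d in 0..width + height - 1 {
            let y_start = d.saturating_sub(width - 1);
            let y_end = d.min(height - 1);
            for mcu_y in y_start..=y_end {
                visit(d - mcu_y, mcu_y);
            }
        }
    }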

@197g (Member) Jul 31, 2021

I'm pretty confident that we could already exploit instruction-level parallelism if the prediction loop were on its own and the selection of the algorithm happened beforehand. Note that for some of the predictor paths we effectively ignore 2-3 of the values passed in. In other words, those are useless loads, and they occupy register space.

Additionally, this interleaving of decoding and computation does not mix well with bounds checks, since it makes it harder for the compiler to notice that all accesses are in bounds. If the length used in the bounds check for the access to results[i] were a constant by that point, this would be easier. See here for a similar change that had up to 30% throughput gains.

There are some predictors that parallelize well, such as Rb and Rc, which only depend on the previous line, and that line is finished completely beforehand. Lastly, a data dependency doesn't rule out the use of SIMD entirely, although at that point it would probably need to be done by hand instead of relying on auto-vectorization. In images with wide enough lines (greater than the SIMD width) only the input ra changes. We can calculate the prediction of a chunk of pixels at once using an assumed input of 0, then iterate over the pixels and, once we know the true input, update the prediction. This still makes part of the computation parallel. For example, for predictors of the Ra + Rb - Rc form, the update is simply a delayed add of ra. This should provide additional throughput overall.
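To make the first point concrete, here is a minimal sketch (names and buffering assumed, not the PR's code) of a specialized per-row loop once the differences are decoded and the predictor is known up front. For the Rb predictor this is a plain element-wise pass over the previous row, which auto-vectorizes well:

    /// Sketch only: reconstruct one row from already-decoded differences using
    /// the Rb predictor (prediction = sample directly above). There is no data
    /// dependency within the row, so the compiler can vectorize this loop.
    fn reconstruct_row_rb(prev_row: &[u16], diffs: &[i32], out: &mut [u16]) {
        debug_assert!(prev_row.len() == diffs.len() && diffs.len() == out.len());
        for ((out, &rb), &diff) in out.iter_mut().zip(prev_row).zip(diffs) {
            // wrapping to 16 bits, matching the modulo 2^16 arithmetic above
            *out = (rb as i32 + diff) as u16;
        }
    }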

@vsaase (Contributor Author) Aug 1, 2021

Thanks for the ideas. By far the most common predictor seems to be Ra (the default output of the DCMTK library's JPEG encoder). In the first column it is always Rb. Thus we can first compute the first column and then compute all the rows in parallel. I see 45% better performance with this.
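A rough sketch of that scheme (illustrative only, with names and the initial prediction simplified; point transform and restart markers are ignored):

    /// Sketch: reconstruct an image whose scan uses predictor Ra, where the first
    /// column falls back to Rb. `diffs` holds already entropy-decoded differences
    /// in row-major order.
    fn reconstruct_ra(diffs: &[i32], width: usize, height: usize) -> Vec<u16> {
        let mut out = vec![0u16; width * height];
        // First column: each sample predicts from the one above (Rb), so it is
        // computed sequentially from top to bottom.
        let mut prev: u16 = 1 << 15; // initial prediction 2^(P-1) for P = 16, ignoring Pt
        for y in 0..height {
            prev = (prev as i32 + diffs[y * width]) as u16;
            out[y * width] = prev;
        }
        // Remaining columns: each row is an independent running sum of its
        // differences starting from its first-column value, so the rows can be
        // processed in parallel (e.g. with rayon).
        for y in 0..height {
            let row = &mut out[y * width..(y + 1) * width];
            for x in 1..width {
                row[x] = (row[x - 1] as i32 + diffs[y * width + x]) as u16;
            }
        }
        out
    }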

@vsaase (Contributor Author) commented Aug 1, 2021

Just one question, can you specify precisely where the 1.40 requirement comes from?

This is not related to my changes. When running cargo check on the master branch with the current version 1.34.2 under Windows, I get this error in crossbeam-utils, which is a dependency of rayon:

    error[E0161]: cannot move a value of type dyn std::ops::FnOnce() + std::marker::Send: the size of dyn std::ops::FnOnce() + std::marker::Send cannot be statically determined
      --> C:\Users\vsaas\.cargo\registry\src\gitproxy.zycloud.tk-1ecc6299db9ec823\crossbeam-utils-0.8.5\src\thread.rs:449:44

I thought I also saw some issues with 1.39, but that may have been related to an old Cargo.lock file.
Now the lowest working version seems to be 1.36, so I changed it to that.

@197g (Member) left a comment

Regarding version changes, we'll do a big bump to 1.48 soon. I'm marking this as approved. There may be some other changes in the next few days and, depending on the order, they may create a merge conflict with this one. But don't worry, another maintainer or I should take care of that; just leave the fork open, and we should have push access to that specific branch.

@197g merged commit 413869b into image-rs:master on Oct 5, 2021
@Shnatsel mentioned this pull request on May 13, 2022
wartmanm pushed a commit to wartmanm/jpeg-decoder that referenced this pull request on Sep 15, 2022