-
-
Couldn't load subscription status.
- Fork 888
Add premultiplied bulk pixel operations #847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To validate the usability and correctness of the new API we need:
- Tests
- Dogfeed it into an existing image processor (eg. resize) and check the effects on performance & code clarity.
We also need to make sure things can be kept maintainable when we also add simdified srgb companding.
| { | ||
| ref Vector4 sp = ref Unsafe.Add(ref sourceRef, i); | ||
| ref TPixel dp = ref Unsafe.Add(ref destRef, i); | ||
| Vector4Utils.UnPremultiply(ref sp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this mixed loop as default generic implementation faster than a sequence of bulk conversions + premultiplications? Could be only checked by a benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logic would dictate yes but I can add a rough benchmark to test I suppose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CPU caches and pipelines follow a different kind of logic sometimes :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is most likely OK for the default path, but we should be aware that there is some kind of individual threshold for the number of operations it is OK to put inside a loop without making it slower than doing separate bulk operations instead.
Codecov Report
@@ Coverage Diff @@
## master #847 +/- ##
==========================================
+ Coverage 88.89% 89.01% +0.12%
==========================================
Files 1014 1015 +1
Lines 44295 44586 +291
Branches 3209 3226 +17
==========================================
+ Hits 39376 39690 +314
+ Misses 4198 4175 -23
Partials 721 721
Continue to review full report at Codecov.
|
|
@antonfirsov Agree with all the above. Forgot to add WIP. I want as many eyes on this as possible as I develop it and hopefully some assistance as I'm off to Seattle on Sunday. |
|
@antonfirsov We're doing something weird in Our pattern is.
When it should be (requires change to companding algorithm) or |
| namespace SixLabors.ImageSharp.Tests.PixelFormats | ||
| { | ||
| public class PixelBlenderTests | ||
| // Copyright (c) Six Labors and contributors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixing line endings only.
|
Ready for review. Coverage is good. Codecov just can't compare as the master build failed due to System.Drawing. |
|
@JimBobSquarePants gonna review this extensively in the evening or tomorrow. Won't be easy and quick round, I can see many points where perf. could be hurt. |
|
@antonfirsov I think you'll be pleasantly surprised actually. Here's an (unplugged laptop cos it's 6am, I'm jetlagged, no-one else is awake, the cable is in a room where others are sleeping) benchmark where I would expect a potential hit (Resize + Companding) BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17763.379 (1809/October2018Update/Redstone5)
Intel Core i7-6600U CPU 2.60GHz (Skylake), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=2.2.202
[Host] : .NET Core 2.1.9 (CoreCLR 4.6.27414.06, CoreFX 4.6.27415.01), 64bit RyuJIT
Clr : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3362.0
Core : .NET Core 2.1.9 (CoreCLR 4.6.27414.06, CoreFX 4.6.27415.01), 64bit RyuJIT
IterationCount=3 LaunchCount=1 WarmupCount=3
Method | Job | Runtime | SourceSize | DestSize | Mean | Error | StdDev | Ratio | RatioSD | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
----------------------------------------- |----- |-------- |----------- |--------- |----------:|-----------:|-----------:|------:|--------:|------------:|------------:|------------:|--------------------:|
SystemDrawing | Clr | Clr | 3032 | 400 | 115.72 ms | 19.455 ms | 1.0664 ms | 1.00 | 0.00 | - | - | - | 1638 B |
'ImageSharp, MaxDegreeOfParallelism = 1' | Clr | Clr | 3032 | 400 | 135.83 ms | 355.538 ms | 19.4882 ms | 1.17 | 0.18 | - | - | - | 16384 B |
'ImageSharp, MaxDegreeOfParallelism = 4' | Clr | Clr | 3032 | 400 | 58.82 ms | 10.872 ms | 0.5959 ms | 0.51 | 0.01 | - | - | - | 23666 B |
'ImageSharp, MaxDegreeOfParallelism = 8' | Clr | Clr | 3032 | 400 | 66.90 ms | 82.627 ms | 4.5291 ms | 0.58 | 0.04 | - | - | - | 23666 B |
| | | | | | | | | | | | | |
SystemDrawing | Core | Core | 3032 | 400 | 134.72 ms | 127.207 ms | 6.9727 ms | 1.00 | 0.00 | - | - | - | 96 B |
'ImageSharp, MaxDegreeOfParallelism = 1' | Core | Core | 3032 | 400 | 122.37 ms | 16.997 ms | 0.9317 ms | 0.91 | 0.05 | - | - | - | 15704 B |
'ImageSharp, MaxDegreeOfParallelism = 4' | Core | Core | 3032 | 400 | 55.14 ms | 37.837 ms | 2.0740 ms | 0.41 | 0.01 | - | - | - | 19592 B |
'ImageSharp, MaxDegreeOfParallelism = 8' | Core | Core | 3032 | 400 | 55.13 ms | 7.451 ms | 0.4084 ms | 0.41 | 0.02 | - | - | - | 20028 B | |
and disable parallel benchmarks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really hoped that my review will find you in the morning after a good sleep, but looks like it's quite impossible now 😄
Good job! There is no regression for the default path, the companding is actually a bit faster on full CLR (Core: same in my benchmarks).
So it wasn't actually easy to catch you but I made it! You wired out the optimizations for the non-Rgba32 pixel types. I pushed benchmark changes to enable the testing of resize for Bgra32.
Let's find a design that keeps Vector4Converters.RgbaCompatible in the loop!
Resize_Bicubic_Bgra32
Before
Method | Job | Runtime | SourceSize | DestSize | Mean | Error | StdDev | Ratio | RatioSD | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
----------------------------------------- |----- |-------- |----------- |--------- |----------:|----------:|----------:|------:|--------:|------------:|------------:|------------:|--------------------:|
SystemDrawing | Clr | Clr | 3032 | 400 | 118.08 ms | 19.353 ms | 1.0608 ms | 1.00 | 0.00 | - | - | - | 1638 B |
'ImageSharp, MaxDegreeOfParallelism = 1' | Clr | Clr | 3032 | 400 | 105.21 ms | 5.493 ms | 0.3011 ms | 0.89 | 0.01 | - | - | - | 45875 B |
| | | | | | | | | | | | | |
SystemDrawing | Core | Core | 3032 | 400 | 118.96 ms | 49.772 ms | 2.7282 ms | 1.00 | 0.00 | - | - | - | 96 B |
'ImageSharp, MaxDegreeOfParallelism = 1' | Core | Core | 3032 | 400 | 98.88 ms | 7.899 ms | 0.4329 ms | 0.83 | 0.02 | - | - | - | 44504 B |
After
Method | Job | Runtime | SourceSize | DestSize | Mean | Error | StdDev | Ratio | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
----------------------------------------- |----- |-------- |----------- |--------- |---------:|----------:|----------:|------:|------------:|------------:|------------:|--------------------:|
SystemDrawing | Clr | Clr | 3032 | 400 | 116.8 ms | 9.454 ms | 0.5182 ms | 1.00 | - | - | - | 1638 B |
'ImageSharp, MaxDegreeOfParallelism = 1' | Clr | Clr | 3032 | 400 | 122.7 ms | 26.933 ms | 1.4763 ms | 1.05 | - | - | - | 16384 B |
| | | | | | | | | | | | |
SystemDrawing | Core | Core | 3032 | 400 | 116.6 ms | 1.634 ms | 0.0895 ms | 1.00 | - | - | - | 96 B |
'ImageSharp, MaxDegreeOfParallelism = 1' | Core | Core | 3032 | 400 | 111.8 ms | 3.167 ms | 0.1736 ms | 0.96 | - | - | - | 15704 B |
| /// <returns>The <see cref="Vector4"/> representing the linear channel values.</returns> | ||
| [MethodImpl(InliningOptions.ShortMethod)] | ||
| public static Vector4 Expand(Vector4 vector) => new Vector4(Expand(vector.X), Expand(vector.Y), Expand(vector.Z), vector.W); | ||
| public static Vector4 Expand(ref Vector4 vector) => vector = new Vector4(Expand(vector.X), Expand(vector.Y), Expand(vector.Z), vector.W); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is slower now because of the additional ctr invocation + copy. Setting coordinates on vector one by one is faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least that's an easy fix.
| { | ||
| ref Vector4 sp = ref Unsafe.Add(ref sourceRef, i); | ||
| ref TPixel dp = ref Unsafe.Add(ref destRef, i); | ||
| Vector4Utils.UnPremultiply(ref sp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is most likely OK for the default path, but we should be aware that there is some kind of individual threshold for the number of operations it is OK to put inside a loop without making it slower than doing separate bulk operations instead.
| /// <param name="configuration">A <see cref="Configuration"/> to configure internal operations</param> | ||
| /// <param name="sourcePixels">The <see cref="Span{T}"/> to the source colors.</param> | ||
| /// <param name="destVectors">The <see cref="Span{T}"/> to the destination vectors.</param> | ||
| internal virtual void ToPremultipliedVector4( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unlike ToVector4() this is only implemented by Rgba32.PixelOperations. Currently, using this method in ResizeProcessor wires out the optimizations for the remaining "RGBA - like" pixel types (Argb32, Bgra32, Rgb24, Bgr24). (The optimized variants are delegating to Vector4Converters.RgbaCompatible.ToVector4()).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, the fix should be fairly trivial then. We simply need to ensure that all the overloads we use in Rgba32.PixelOperations are also overloaded in our RgbaCompatible generated code.
| /// <returns>The <see cref="Vector4"/> representing the linear channel values.</returns> | ||
| [MethodImpl(InliningOptions.ShortMethod)] | ||
| public static Vector4 Expand(Vector4 vector) => new Vector4(Expand(vector.X), Expand(vector.Y), Expand(vector.Z), vector.W); | ||
| public static void Expand(ref Vector4 vector) => vector = new Vector4(Expand(vector.X), Expand(vector.Y), Expand(vector.Z), vector.W); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unnecessary constructor invocation + copy is still there. What I meant is:
void Expand(ref Vector4 v)
{
v.X =Expand(v.X);
v.Y =Expand(v.Y);
v.Z =Expand(v.Z);
v.W =Expand(v.W);
}It's unlikely that JIT is capable to optimize the simple version. Try a benchmark for a proof/disproof! 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I got it. Already committed a change locally. 👍
|
@antonfirsov Good spot with the API. That's much better now! Just realised we somehow use 1/4 of the bytes than previously also. 😄 BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17763.379 (1809/October2018Update/Redstone5)
Intel Core i7-6600U CPU 2.60GHz (Skylake), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=2.2.202
[Host] : .NET Core 2.1.9 (CoreCLR 4.6.27414.06, CoreFX 4.6.27415.01), 64bit RyuJIT
Clr : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3362.0
Core : .NET Core 2.1.9 (CoreCLR 4.6.27414.06, CoreFX 4.6.27415.01), 64bit RyuJIT
IterationCount=3 LaunchCount=1 WarmupCount=3
Method | Job | Runtime | SourceSize | DestSize | Mean | Error | StdDev | Ratio | RatioSD | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
----------------------------------------- |----- |-------- |----------- |--------- |----------:|----------:|----------:|------:|--------:|------------:|------------:|------------:|--------------------:|
SystemDrawing | Clr | Clr | 3032 | 400 | 106.93 ms | 5.611 ms | 0.3076 ms | 1.00 | 0.00 | - | - | - | 1638 B |
'ImageSharp, MaxDegreeOfParallelism = 1' | Clr | Clr | 3032 | 400 | 100.97 ms | 37.292 ms | 2.0441 ms | 0.94 | 0.02 | - | - | - | 16384 B |
| | | | | | | | | | | | | |
SystemDrawing | Core | Core | 3032 | 400 | 118.23 ms | 24.147 ms | 1.3236 ms | 1.00 | 0.00 | - | - | - | 96 B |
'ImageSharp, MaxDegreeOfParallelism = 1' | Core | Core | 3032 | 400 | 93.31 ms | 12.256 ms | 0.6718 ms | 0.79 | 0.00 | - | - | - | 15704 B | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much better, thanks!
The From*** overloads are still missing for the premultiplied/compand stuff in RGBA-like types. (FromPremultipliedVector4, FromPremultipliedScaledVector4, FromCompandedScaledVector4, FromCompandedPremultipliedScaledVector4)
I strongly suggest to deal with all this boilerplate in this (or a follow-up) PR to avoid maintenance hell. I think enum flags could do the job without hurting performance. For bulk operations the overhead should be negligable.
enum PixelConversionModifiers
{
None = 0,
Scaled = 1 << 0,
Premultiplied = 1 << 1,
SRgbCompanded = 1 << 2
}This will add a lot of freedom to juggle with optimization techniques! (I have a few basic points in my mind.)
| } | ||
|
|
||
| internal static Vector4[] CreateExpectedVector4Data(TPixel[] source) | ||
| public delegate void RefAction<T1>(ref T1 arg1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! 👍
I deliberately didn't add those overloads because they expect If you can think of some workaround please push because I couldn't figure one out and chose to wait for us to optimize the default implementations. |
1.
|
|
@antonfirsov I did actually consider making the I don't know if I have the chops to do the flags based refactoring work on my own (I'd really like to reduce the API surface) so am happy to wait. I have my eye on a couple of other things in the codebase that I can fix up in the interim. |
|
Closing in favour of #869 |
Prerequisites
Description
This adds new methods to
PixelOperationswhich allows optimized bulk conversion to and from premultiplied pixels.I've managed to merge the generic operations into a single loop but not the Rgba SIMD optimized versions. It would be ace if someone could manage it but I don't think we have the tools as is.