Skip to content

Conversation

@JimBobSquarePants
Copy link
Member

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following matches the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

This adds new methods to PixelOperations which allows optimized bulk conversion to and from premultiplied pixels.

I've managed to merge the generic operations into a single loop but not the Rgba SIMD optimized versions. It would be ace if someone could manage it but I don't think we have the tools as is.

Copy link
Member

@antonfirsov antonfirsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To validate the usability and correctness of the new API we need:

  • Tests
  • Dogfeed it into an existing image processor (eg. resize) and check the effects on performance & code clarity.

We also need to make sure things can be kept maintainable when we also add simdified srgb companding.

{
ref Vector4 sp = ref Unsafe.Add(ref sourceRef, i);
ref TPixel dp = ref Unsafe.Add(ref destRef, i);
Vector4Utils.UnPremultiply(ref sp);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this mixed loop as default generic implementation faster than a sequence of bulk conversions + premultiplications? Could be only checked by a benchmark.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic would dictate yes but I can add a rough benchmark to test I suppose.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CPU caches and pipelines follow a different kind of logic sometimes :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is most likely OK for the default path, but we should be aware that there is some kind of individual threshold for the number of operations it is OK to put inside a loop without making it slower than doing separate bulk operations instead.

@codecov
Copy link

codecov bot commented Mar 15, 2019

Codecov Report

Merging #847 into master will increase coverage by 0.12%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #847      +/-   ##
==========================================
+ Coverage   88.89%   89.01%   +0.12%     
==========================================
  Files        1014     1015       +1     
  Lines       44295    44586     +291     
  Branches     3209     3226      +17     
==========================================
+ Hits        39376    39690     +314     
+ Misses       4198     4175      -23     
  Partials      721      721
Impacted Files Coverage Δ
...lFormats/Utils/Vector4Converters.RgbaCompatible.cs 94.91% <100%> (+1.89%) ⬆️
...rp/PixelFormats/Utils/Vector4Converters.Default.cs 100% <100%> (ø) ⬆️
...xelFormats/PixelOperations/PixelOperationsTests.cs 100% <100%> (ø) ⬆️
...ng/Processors/Transforms/Resize/ResizeProcessor.cs 96.47% <100%> (+0.02%) ⬆️
...tions/Generated/Rgb24.PixelOperations.Generated.cs 90.62% <100%> (+25.85%) ⬆️
...ions/Generated/Bgra32.PixelOperations.Generated.cs 91.81% <100%> (+0.64%) ⬆️
...ions/Generated/Argb32.PixelOperations.Generated.cs 83.63% <100%> (+1.28%) ⬆️
...ats/PixelImplementations/Rgba32.PixelOperations.cs 100% <100%> (ø) ⬆️
...Tests/PixelFormats/PixelOperationsTests.Blender.cs 100% <100%> (ø) ⬆️
...mageSharp/ColorSpaces/Companding/SRgbCompanding.cs 100% <100%> (ø) ⬆️
... and 9 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 45b7490...c5ff624. Read the comment docs.

@JimBobSquarePants JimBobSquarePants changed the title Add premultiplied bulk pixel operations WIP: Add premultiplied bulk pixel operations Mar 15, 2019
@JimBobSquarePants
Copy link
Member Author

@antonfirsov Agree with all the above.

Forgot to add WIP. I want as many eyes on this as possible as I develop it and hopefully some assistance as I'm off to Seattle on Sunday.

@JimBobSquarePants
Copy link
Member Author

@antonfirsov We're doing something weird in ResizeProcessor when companding.

Our pattern is.

premultiply => expand => operate => unpremultiply => compress

When it should be (requires change to companding algorithm)
premultiply => expand => operate => compress => unpremultiply

or
expand => premultiply => operate => unpremultiply => compress

namespace SixLabors.ImageSharp.Tests.PixelFormats
{
public class PixelBlenderTests
// Copyright (c) Six Labors and contributors.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixing line endings only.

@JimBobSquarePants
Copy link
Member Author

Ready for review. Coverage is good. Codecov just can't compare as the master build failed due to System.Drawing.

@JimBobSquarePants JimBobSquarePants changed the title WIP: Add premultiplied bulk pixel operations Add premultiplied bulk pixel operations Mar 26, 2019
@antonfirsov
Copy link
Member

@JimBobSquarePants gonna review this extensively in the evening or tomorrow. Won't be easy and quick round, I can see many points where perf. could be hurt.

@JimBobSquarePants
Copy link
Member Author

@antonfirsov I think you'll be pleasantly surprised actually. Here's an (unplugged laptop cos it's 6am, I'm jetlagged, no-one else is awake, the cable is in a room where others are sleeping) benchmark where I would expect a potential hit (Resize + Companding)

BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17763.379 (1809/October2018Update/Redstone5)
Intel Core i7-6600U CPU 2.60GHz (Skylake), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=2.2.202
  [Host] : .NET Core 2.1.9 (CoreCLR 4.6.27414.06, CoreFX 4.6.27415.01), 64bit RyuJIT
  Clr    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3362.0
  Core   : .NET Core 2.1.9 (CoreCLR 4.6.27414.06, CoreFX 4.6.27415.01), 64bit RyuJIT

IterationCount=3  LaunchCount=1  WarmupCount=3

                                   Method |  Job | Runtime | SourceSize | DestSize |      Mean |      Error |     StdDev | Ratio | RatioSD | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
----------------------------------------- |----- |-------- |----------- |--------- |----------:|-----------:|-----------:|------:|--------:|------------:|------------:|------------:|--------------------:|
                            SystemDrawing |  Clr |     Clr |       3032 |      400 | 115.72 ms |  19.455 ms |  1.0664 ms |  1.00 |    0.00 |           - |           - |           - |              1638 B |
 'ImageSharp, MaxDegreeOfParallelism = 1' |  Clr |     Clr |       3032 |      400 | 135.83 ms | 355.538 ms | 19.4882 ms |  1.17 |    0.18 |           - |           - |           - |             16384 B |
 'ImageSharp, MaxDegreeOfParallelism = 4' |  Clr |     Clr |       3032 |      400 |  58.82 ms |  10.872 ms |  0.5959 ms |  0.51 |    0.01 |           - |           - |           - |             23666 B |
 'ImageSharp, MaxDegreeOfParallelism = 8' |  Clr |     Clr |       3032 |      400 |  66.90 ms |  82.627 ms |  4.5291 ms |  0.58 |    0.04 |           - |           - |           - |             23666 B |
                                          |      |         |            |          |           |            |            |       |         |             |             |             |                     |
                            SystemDrawing | Core |    Core |       3032 |      400 | 134.72 ms | 127.207 ms |  6.9727 ms |  1.00 |    0.00 |           - |           - |           - |                96 B |
 'ImageSharp, MaxDegreeOfParallelism = 1' | Core |    Core |       3032 |      400 | 122.37 ms |  16.997 ms |  0.9317 ms |  0.91 |    0.05 |           - |           - |           - |             15704 B |
 'ImageSharp, MaxDegreeOfParallelism = 4' | Core |    Core |       3032 |      400 |  55.14 ms |  37.837 ms |  2.0740 ms |  0.41 |    0.01 |           - |           - |           - |             19592 B |
 'ImageSharp, MaxDegreeOfParallelism = 8' | Core |    Core |       3032 |      400 |  55.13 ms |   7.451 ms |  0.4084 ms |  0.41 |    0.02 |           - |           - |           - |             20028 B |

and disable parallel benchmarks
Copy link
Member

@antonfirsov antonfirsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really hoped that my review will find you in the morning after a good sleep, but looks like it's quite impossible now 😄

Good job! There is no regression for the default path, the companding is actually a bit faster on full CLR (Core: same in my benchmarks).

So it wasn't actually easy to catch you but I made it! You wired out the optimizations for the non-Rgba32 pixel types. I pushed benchmark changes to enable the testing of resize for Bgra32.

Let's find a design that keeps Vector4Converters.RgbaCompatible in the loop!

Resize_Bicubic_Bgra32

Before

                                   Method |  Job | Runtime | SourceSize | DestSize |      Mean |     Error |    StdDev | Ratio | RatioSD | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
----------------------------------------- |----- |-------- |----------- |--------- |----------:|----------:|----------:|------:|--------:|------------:|------------:|------------:|--------------------:|
                            SystemDrawing |  Clr |     Clr |       3032 |      400 | 118.08 ms | 19.353 ms | 1.0608 ms |  1.00 |    0.00 |           - |           - |           - |              1638 B |
 'ImageSharp, MaxDegreeOfParallelism = 1' |  Clr |     Clr |       3032 |      400 | 105.21 ms |  5.493 ms | 0.3011 ms |  0.89 |    0.01 |           - |           - |           - |             45875 B |
                                          |      |         |            |          |           |           |           |       |         |             |             |             |                     |
                            SystemDrawing | Core |    Core |       3032 |      400 | 118.96 ms | 49.772 ms | 2.7282 ms |  1.00 |    0.00 |           - |           - |           - |                96 B |
 'ImageSharp, MaxDegreeOfParallelism = 1' | Core |    Core |       3032 |      400 |  98.88 ms |  7.899 ms | 0.4329 ms |  0.83 |    0.02 |           - |           - |           - |             44504 B |

After

                                   Method |  Job | Runtime | SourceSize | DestSize |     Mean |     Error |    StdDev | Ratio | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
----------------------------------------- |----- |-------- |----------- |--------- |---------:|----------:|----------:|------:|------------:|------------:|------------:|--------------------:|
                            SystemDrawing |  Clr |     Clr |       3032 |      400 | 116.8 ms |  9.454 ms | 0.5182 ms |  1.00 |           - |           - |           - |              1638 B |
 'ImageSharp, MaxDegreeOfParallelism = 1' |  Clr |     Clr |       3032 |      400 | 122.7 ms | 26.933 ms | 1.4763 ms |  1.05 |           - |           - |           - |             16384 B |
                                          |      |         |            |          |          |           |           |       |             |             |             |                     |
                            SystemDrawing | Core |    Core |       3032 |      400 | 116.6 ms |  1.634 ms | 0.0895 ms |  1.00 |           - |           - |           - |                96 B |
 'ImageSharp, MaxDegreeOfParallelism = 1' | Core |    Core |       3032 |      400 | 111.8 ms |  3.167 ms | 0.1736 ms |  0.96 |           - |           - |           - |             15704 B |

/// <returns>The <see cref="Vector4"/> representing the linear channel values.</returns>
[MethodImpl(InliningOptions.ShortMethod)]
public static Vector4 Expand(Vector4 vector) => new Vector4(Expand(vector.X), Expand(vector.Y), Expand(vector.Z), vector.W);
public static Vector4 Expand(ref Vector4 vector) => vector = new Vector4(Expand(vector.X), Expand(vector.Y), Expand(vector.Z), vector.W);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is slower now because of the additional ctr invocation + copy. Setting coordinates on vector one by one is faster.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least that's an easy fix.

{
ref Vector4 sp = ref Unsafe.Add(ref sourceRef, i);
ref TPixel dp = ref Unsafe.Add(ref destRef, i);
Vector4Utils.UnPremultiply(ref sp);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is most likely OK for the default path, but we should be aware that there is some kind of individual threshold for the number of operations it is OK to put inside a loop without making it slower than doing separate bulk operations instead.

/// <param name="configuration">A <see cref="Configuration"/> to configure internal operations</param>
/// <param name="sourcePixels">The <see cref="Span{T}"/> to the source colors.</param>
/// <param name="destVectors">The <see cref="Span{T}"/> to the destination vectors.</param>
internal virtual void ToPremultipliedVector4(
Copy link
Member

@antonfirsov antonfirsov Mar 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike ToVector4() this is only implemented by Rgba32.PixelOperations. Currently, using this method in ResizeProcessor wires out the optimizations for the remaining "RGBA - like" pixel types (Argb32, Bgra32, Rgb24, Bgr24). (The optimized variants are delegating to Vector4Converters.RgbaCompatible.ToVector4()).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, the fix should be fairly trivial then. We simply need to ensure that all the overloads we use in Rgba32.PixelOperations are also overloaded in our RgbaCompatible generated code.

/// <returns>The <see cref="Vector4"/> representing the linear channel values.</returns>
[MethodImpl(InliningOptions.ShortMethod)]
public static Vector4 Expand(Vector4 vector) => new Vector4(Expand(vector.X), Expand(vector.Y), Expand(vector.Z), vector.W);
public static void Expand(ref Vector4 vector) => vector = new Vector4(Expand(vector.X), Expand(vector.Y), Expand(vector.Z), vector.W);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unnecessary constructor invocation + copy is still there. What I meant is:

void Expand(ref Vector4 v) 
{
    v.X =Expand(v.X);
    v.Y =Expand(v.Y);
    v.Z =Expand(v.Z);
    v.W =Expand(v.W);    
}

It's unlikely that JIT is capable to optimize the simple version. Try a benchmark for a proof/disproof! 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I got it. Already committed a change locally. 👍

@JimBobSquarePants
Copy link
Member Author

JimBobSquarePants commented Mar 27, 2019

@antonfirsov Good spot with the API. That's much better now! Just realised we somehow use 1/4 of the bytes than previously also. 😄

BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17763.379 (1809/October2018Update/Redstone5)
Intel Core i7-6600U CPU 2.60GHz (Skylake), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=2.2.202
  [Host] : .NET Core 2.1.9 (CoreCLR 4.6.27414.06, CoreFX 4.6.27415.01), 64bit RyuJIT
  Clr    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3362.0
  Core   : .NET Core 2.1.9 (CoreCLR 4.6.27414.06, CoreFX 4.6.27415.01), 64bit RyuJIT

IterationCount=3  LaunchCount=1  WarmupCount=3

                                   Method |  Job | Runtime | SourceSize | DestSize |      Mean |     Error |    StdDev | Ratio | RatioSD | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
----------------------------------------- |----- |-------- |----------- |--------- |----------:|----------:|----------:|------:|--------:|------------:|------------:|------------:|--------------------:|
                            SystemDrawing |  Clr |     Clr |       3032 |      400 | 106.93 ms |  5.611 ms | 0.3076 ms |  1.00 |    0.00 |           - |           - |           - |              1638 B |
 'ImageSharp, MaxDegreeOfParallelism = 1' |  Clr |     Clr |       3032 |      400 | 100.97 ms | 37.292 ms | 2.0441 ms |  0.94 |    0.02 |           - |           - |           - |             16384 B |
                                          |      |         |            |          |           |           |           |       |         |             |             |             |                     |
                            SystemDrawing | Core |    Core |       3032 |      400 | 118.23 ms | 24.147 ms | 1.3236 ms |  1.00 |    0.00 |           - |           - |           - |                96 B |
 'ImageSharp, MaxDegreeOfParallelism = 1' | Core |    Core |       3032 |      400 |  93.31 ms | 12.256 ms | 0.6718 ms |  0.79 |    0.00 |           - |           - |           - |             15704 B |

Copy link
Member

@antonfirsov antonfirsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better, thanks!

The From*** overloads are still missing for the premultiplied/compand stuff in RGBA-like types. (FromPremultipliedVector4, FromPremultipliedScaledVector4, FromCompandedScaledVector4, FromCompandedPremultipliedScaledVector4)

I strongly suggest to deal with all this boilerplate in this (or a follow-up) PR to avoid maintenance hell. I think enum flags could do the job without hurting performance. For bulk operations the overhead should be negligable.

enum PixelConversionModifiers
{
    None = 0,
    Scaled = 1 << 0,
    Premultiplied = 1 << 1,
    SRgbCompanded = 1 << 2
}

This will add a lot of freedom to juggle with optimization techniques! (I have a few basic points in my mind.)

}

internal static Vector4[] CreateExpectedVector4Data(TPixel[] source)
public delegate void RefAction<T1>(ref T1 arg1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! 👍

@JimBobSquarePants
Copy link
Member Author

The From*** overloads are still missing for the premultiplied/compand stuff in RGBA-like types. (FromPremultipliedVector4, FromPremultipliedScaledVector4, FromCompandedScaledVector4, FromCompandedPremultipliedScaledVector4)
I strongly suggest to deal with all this boilerplate in this (or a follow-up) PR to avoid maintenance hell. I think enum flags could do the job without hurting performance. For bulk operations the overhead should be negligable.

I deliberately didn't add those overloads because they expect ReadOnlySpan<Vector4> and we have no optimization path in the overloads that allows us intercept the copy and the Vector4 elements before passing to the TPixel. It's all bulk operations.

If you can think of some workaround please push because I couldn't figure one out and chose to wait for us to optimize the default implementations.

@antonfirsov
Copy link
Member

1. From***Vector4(...)

This is more a design question than a question of implementation tricks or workarounds. This is mostly a design PR, so we need to sort out design + API issues before merging.

I have proposal!

Observation

When an image processing code calls .From***Vector4(sourceVectors, destPixels ) it is a final phase of the given algorithm and the data in sourceVectors is no longer needed. To me it looks like this is an invariant which is true for the 100% of our current processors+encoders+decoders.

Conclusion

.From***Vector4(sourceVectors, destPixels ) could be a destructive operation and modifying the contents of the sourceVectors buffer is OK. It could be a Span<T> instead of ReadOnlySpan<T>. What we need is good naming and documentation, to make this obvious on the call site.

2. Eliminating boilerplate

The more I think about this topic the more it feels it should be the part of this design PR, to minimize the amount of technical debt in our codebase. With my flags proposal it's possible to collapse the current amount of 2*2³ = 16 operations (per PixelOperations<T> implementation!) into a pair of .FromVector4(...) + .ToVector4(...) functions. What do you think?

I would be happy to help you out with the implementation, but unfortunately it's not possible for me for another 2 weeks because of family happenings and a daily job business trip 😞

@JimBobSquarePants
Copy link
Member Author

@antonfirsov I did actually consider making the From**Vector4 methods destructive having come to the same conclusion as you but it left me with an uneasy feeling to do so. I could yet be convinced as the current format is far too limiting performance-wise.

I don't know if I have the chops to do the flags based refactoring work on my own (I'd really like to reduce the API surface) so am happy to wait. I have my eye on a couple of other things in the codebase that I can fix up in the interim.

@dlemstra @tocsoft What are your thoughts?

antonfirsov added a commit that referenced this pull request Apr 1, 2019
antonfirsov added a commit that referenced this pull request Apr 1, 2019
@JimBobSquarePants
Copy link
Member Author

Closing in favour of #869

@JimBobSquarePants JimBobSquarePants deleted the js/premultiplication branch September 3, 2019 11:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants