New find: Big training quality boost for Full Fine Tuning (Flux) #2114

araleza · 2025-06-07T18:31:03Z

araleza
Jun 7, 2025

I think I might have found something pretty special, and I'm seeing the best image quality I've ever seen from training Flux. So I thought I'd write about it here so you can try it out, if you're doing Full Fine Tuning (i.e. dreambooth). It involves literally a two code line change and adding a command line parameter to get it working.

So back when I was training SDXL, I found that setting the max gradient normal size had a very strong positive effect on training. I basically never trained SDXL without it after that. But Flux doesn't seem to work well with this setting (--max_grad_norm). In fact it apparently works so badly that a comment appears during training telling you to switch it off:

because max_grad_norm is set, clip_grad_norm is enabled. consider set to 0 / max_grad_norm

and even the README.md for sd-scripts example explicitly switches it off with --max_grad_norm 0.0.

I tried switching it on, and it does distort the output. But I really missed it from SDXL, so I looked into whether it could be made to work again. And yes, I've found a way to get it reactivated for sections of the Flux network, and in my tests, this has a large improvement in training quality! I'm seeing:

Stronger learning of trained objects
More stable images with fewer distortions
Able to learn in situations where the training images are brighter than is ideal

Okay, so if you want to try my change, it's pretty simple. Find this section in flux_train.py (around about line 520):

                        def grad_hook(parameter: torch.Tensor):
                            if accelerator.sync_gradients and args.max_grad_norm != 0.0:
                                accelerator.clip_grad_norm_(parameter, args.max_grad_norm)

and change this to:

                        def grad_hook(parameter: torch.Tensor):
                            i = parameter_optimizer_map[parameter]
                            if num_parameters_per_group[i] == 20:
                                if accelerator.sync_gradients and args.max_grad_norm != 0.0:
                                    accelerator.clip_grad_norm_(parameter, args.max_grad_norm)

Adding those two lines is all that's needed, along with passing in --max_grad_norm 1.5 on the command line. (--max_grad_norm 1.0 seems a little too strong).

How it works:

In short, when you look at the num_parameters_per_group[] array, it breaks down into three groups:

Index 0, which has 20 parameters
Indices 1-19, which have 24 parameters.
Indices 20-57, which have 8 parameters.

I think that the 24 and 8 parameter groups are the double and single blocks. My change stops the max_grad_norm from affecting them. But the parameter group in index 0 with 20 parameters seems to be the T5-XXL and CLIP-L management block, and applying a grad norm of 1.0 to them seems to have a very large positive effect on training.

My change only works if you're doing Full Fine Tuning, and I think that section of code assumes you're using --blockwise_fused_optimizer too, which you probably are. I haven't (currently) done a better implementation that would work if you're not using blockwise fused optimizers. It might be possible to adapt this change to work with LoRA too with some more work, as there are LoRA blocks that are applied to the same part of the model that is being clipped by my change.

Anyway, please do give this a try and let me know if you see the same great results I'm seeing.

ja1496 · 2025-08-16T18:38:59Z

ja1496
Aug 16, 2025

I am interested in this suggestion. Before I start practicing, could you please provide me with more training parameters for reference?

0 replies

araleza · 2025-08-16T23:39:20Z

araleza
Aug 16, 2025
Author

Hi @ja1496, since I wrote this, I've actually changed my mind about recommending this. The reason is that using --blockwise_fused_optimizer disables the Adafactor optimizer from doing stochastic rounding for bfloat16 training, which is a critical part of training, if you don't have a giant GPU to train on. I switched to using --fused_backward_pass, and that gives better quality than this change.

Maybe it's of interest to people looking into gradient clipping of models like Flux, who also have large GPUs and can train in float32 mode, but for me, I've stopped using this method now.

1 reply

ja1496 Aug 17, 2025

thanks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

New find: Big training quality boost for Full Fine Tuning (Flux) #2114

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 2 comments 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Uh oh!

New find: Big training quality boost for Full Fine Tuning (Flux) #2114

Uh oh!

Uh oh!

araleza Jun 7, 2025

Replies: 2 comments · 1 reply

Uh oh!

ja1496 Aug 16, 2025

Uh oh!

araleza Aug 16, 2025 Author

Uh oh!

ja1496 Aug 17, 2025

araleza
Jun 7, 2025

Replies: 2 comments 1 reply

ja1496
Aug 16, 2025

araleza
Aug 16, 2025
Author