-
I'm interested in this suggestion. Before I try it out, could you please share more of your training parameters for reference?
-
Hi @ja1496, since I wrote this I've actually changed my mind about recommending it. It may still be of interest to people looking into gradient clipping for models like Flux who have large GPUs and can train in float32 mode, but I've stopped using this method myself.
-
Hey all, (@kohya-ss)
I think I might have found something pretty special, and I'm seeing the best image quality I've ever seen from training Flux. So I thought I'd write about it here so you can try it out, if you're doing Full Fine Tuning (i.e. DreamBooth). It's literally a two-line code change plus one command-line parameter to get it working.
So back when I was training SDXL, I found that setting the max gradient norm had a very strong positive effect on training; I basically never trained SDXL without it after that. But Flux doesn't seem to work well with this setting (`--max_grad_norm`). In fact it apparently works so badly that a message appears during training telling you to switch it off, and even the sd-scripts example README.md explicitly disables it with `--max_grad_norm 0.0`.

I tried switching it on, and it does distort the output. But I really missed it from SDXL, so I looked into whether it could be made to work again. And yes, I've found a way to get it reactivated for part of the Flux network, and in my tests this gives a large improvement in training quality!
Okay, so if you want to try my change, it's pretty simple. Find this section in flux_train.py (around about line 520):
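The relevant bit is the per-parameter gradient hook that the blockwise fused optimizer path registers. I'm sketching it from memory rather than pasting it verbatim, so the exact names and layout may differ slightly in your checkout, but it looks roughly like this:

```python
# flux_train.py, blockwise fused optimizer setup (paraphrased sketch, not a verbatim copy).
# Each trainable parameter gets a hook along these lines; accelerator, args,
# parameter_optimizer_map, optimizer_hooked_count, num_parameters_per_group and
# optimizers are all defined earlier in the same section of the file.
def grad_hook(parameter: torch.Tensor):
    # global clipping: applied to every parameter when --max_grad_norm != 0
    if accelerator.sync_gradients and args.max_grad_norm != 0.0:
        accelerator.clip_grad_norm_(parameter, args.max_grad_norm)

    # step this block's optimizer once all of its parameters have gradients
    i = parameter_optimizer_map[parameter]
    optimizer_hooked_count[i] += 1
    if optimizer_hooked_count[i] == num_parameters_per_group[i]:
        optimizers[i].step()
        optimizers[i].zero_grad(set_to_none=True)
```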
and change this to:
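Again, take this as a sketch rather than an exact diff: the two added lines just gate the clipping on which parameter group the tensor belongs to, so only the index-0 group with its 20 parameters gets clipped.

```python
def grad_hook(parameter: torch.Tensor):
    if accelerator.sync_gradients and args.max_grad_norm != 0.0:
        # the two added lines: look up this parameter's group and only clip the
        # 20-parameter group at index 0 (the T5-XXL / CLIP-L management block);
        # the double/single blocks are left unclipped
        group = parameter_optimizer_map[parameter]
        if num_parameters_per_group[group] == 20:
            accelerator.clip_grad_norm_(parameter, args.max_grad_norm)

    i = parameter_optimizer_map[parameter]
    optimizer_hooked_count[i] += 1
    if optimizer_hooked_count[i] == num_parameters_per_group[i]:
        optimizers[i].step()
        optimizers[i].zero_grad(set_to_none=True)
```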
Adding those two lines is all that's needed, along with passing `--max_grad_norm 1.5` on the command line (`--max_grad_norm 1.0` seems a little too strong).

How it works:
In short, when you look at the `num_parameters_per_group[]` array, it breaks down into three groups:

- Index 0, which has 20 parameters.
- Indices 1-19, which have 24 parameters each.
- Indices 20-57, which have 8 parameters each.
I think the 24- and 8-parameter groups are the double and single blocks; my change stops max_grad_norm from affecting them. But the parameter group at index 0, with its 20 parameters, seems to be the T5-XXL and CLIP-L management block, and applying a grad norm of 1.0 to it seems to have a very large positive effect on training.
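If you want to double-check the grouping on your own setup before relying on it, a throwaway print in the same section will show the group sizes. This is just a debugging aid, not part of the change itself:

```python
# optional sanity check: dump the group sizes once, after the optimizers are built,
# to confirm which index holds the 20-parameter group in your build
for idx, count in enumerate(num_parameters_per_group):
    print(f"param group {idx}: {count} parameters")
```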
My change only works if you're doing Full Fine Tuning, and I think that section of code assumes you're using `--blockwise_fused_optimizer` too, which you probably are. I haven't (currently) done a better implementation that would work without the blockwise fused optimizers. It might also be possible to adapt this change to work with LoRA with some more work, since there are LoRA blocks applied to the same part of the model that my change clips.

Anyway, please do give this a try and let me know if you see the same great results I'm seeing.