MrDeepFakes Forums

Questions about new optimizer

avalentino93

DF Admirer
The optimizer now exists and I've been messing around with it, trying to figure it out. I don't understand whether it actually helps speed up the process, or whether it's meant to do more but slow the process down.

Specs:
1080SC - 8GB
8790k
32GB RAM

Current tests:
SAE
Batch - 4
Resolution - 256
Dims - Standard
Light Encoder - No
Multi - Yes

Optimizer Settings:
Mode 1 ~ fails with OOM
Mode 2 ~ 11-14GB RAM with 20% CPU and 50% GPU - around 1500ms
Mode 3 ~ 17-20GB RAM with about 80% CPU and 50% GPU

So to me it seems like the optimizer isn't there to speed things up in the sense most of us mean by "optimize". Instead, it appears to be for those of us who want to push the limits and run 256 resolution on an 8GB card, since if I choose mode 1 (no optimization) with the above settings, it fails.

Anyone else understanding this thing yet?
 

iperov

DF Enthusiast
Developer
The new optimizer brings deepfakes into a new era.
It allows you to train a bigger network on the same VRAM.

For example, 256 res with non-reduced dimensions (512-42) was impossible to train on 6GB before.
 

avalentino93

DF Admirer
iperov said:
The new optimizer brings deepfakes into a new era.
It allows you to train a bigger network on the same VRAM.

For example, 256 res with non-reduced dimensions (512-42) was impossible to train on 6GB before.

Right. I get that part. But my question is: at what cost?
For example, if I have a setup that can run 256 resolution with higher dims and batch sizes of 16-24 without the optimizer, should I use that instead? Or would the optimizer make it faster? Is the optimizer only there to compensate for limited specs?

If you used all the same settings and your hardware isn't an issue, which is fastest: 1, 2 or 3?
 

dpfks

DF Enthusiast
Staff member
Administrator
Verified Video Creator
User @"titan_rw" did a quick test, and came out with these results:

All done at BS=16. Test model was 160 res 640x48 dims, multiscale=y

mode1 = 1258ms/iter using 3gb ram, small cpu usage
mode2 = 1400ms/iter using 12gb ram, small cpu usage
mode3 = 1840ms/iter using 12gb ram, 80% cpu! usage

So from this small test, we can conclude that if you're happy with your current settings, mode 1 is the fastest. Of course this should be tested on the exact setup you have to confirm. So in a sense I guess "using optimizer only if you're limited by your hardware" is correct.
 

iperov

DF Enthusiast
Developer
dpfks said:
So in a sense I guess "use the optimizer only if you're limited by your hardware" is correct.

I am not limited by hardware with my 6GB.
But I cannot train 256 res with non-reduced dims and batch size.
If you want to spend more time to achieve better resolution, the optimizer modes are the solution.
 

titan_rw

DF Pleb
This is what I would call a memory optimizer.  Every case I've tested is slower with higher modes.  The advantage is that you can run a more complicated model on the same hardware.

I did some more testing at 256 res, and default NN size.

My 12 gig Titan can run native 256 res (mode 1):

mode 1 - bs6 - 1100ms/iter


I tried mode 2 on the Titan, but it only gained me bs 7 or something.  Not really worth it.  Mode 3 would definitely let me up the bs, but it would be so much slower I don't think the higher bs is worth the extra time per iteration.

For this card, at 256 res, and default NN size, I don't need the optimization modes.  Theoretically they'd let me run a bigger NN at the same bs.  I think it's still unknown if this is needed.


Where you need the optimization modes is on cards with less memory. A comparison with my 980 Ti (6 gigs):

mode 1 - oom (won't even start at bs 1)
mode 2 - bs2 - 1400ms
mode 2 - bs4 - 1700ms
mode 2 - bs6 - 2200ms (ran 10 iters, then OOM'd)
mode 3 - bs6 - 2500ms
mode 3 - bs8 - 2800ms
mode 3 - bs10 - 3100ms
mode 3 - bs12 - oom (won't start)

Here is my 6 gig card actually managing to do the same work as my 12 gig card, but it needs mode 3, which slows iterations from the Titan's 1100ms down to 2500ms. But it can do it, and the final quality will be the same. It'll just take longer.
 

iperov

DF Enthusiast
Developer
mode 3 - bs6 - 2500ms
mode 3 - bs8 - 2800ms
mode 3 - bs10 - 3100ms


bs6 - 416ms per sample
bs10 - 310ms per sample

bs10 is actually faster, because more samples per second are fed to the network.
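
In Python, the same per-sample arithmetic (a trivial sketch using the mode 3 timings quoted above; nothing DFL-specific):

# per-sample cost: ms per iteration divided by batch size
for bs, ms_iter in [(6, 2500), (8, 2800), (10, 3100)]:
    print(f"bs{bs}: {ms_iter // bs} ms/sample, {bs / (ms_iter / 1000):.2f} samples/sec")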
 

titan_rw

DF Pleb
iperov said:
mode 3 - bs6 - 2500ms
mode 3 - bs8 - 2800ms
mode 3 - bs10 - 3100ms


bs6 - 416ms per sample
bs10 - 310ms per sample

bs10 is actually faster, because more samples per second are fed to the network.

That was the case with my 980 Ti, which I rarely train on. It's mostly used for face extraction and conversion.

I ran some more tests on my Titan.  This is a 192 res model, mask on, multiscale on.  SAE-DF, default dims.


mode 1 bs 6 @ 1060ms = 176ms/sample
mode 1 bs 7 @ 1190ms = 170ms/sample
mode 1 bs 8 (oom)
mode 2 bs 8 @ 1750ms = 218ms/sample
mode 2 bs 10 @ 1920ms = 192ms/sample
mode 2 bs 12 @ 2150ms = 179ms/sample
mode 2 bs 14 @ 2300ms = 164ms/sample
mode 2 bs 16 @ 2450ms = 153ms/sample (oom'd)
mode 3 bs 18 @ 2980ms = 165ms/sample
mode 3 bs 20 (oom)

What's the best here? Mode 1 at bs 7, mode 2 at bs 14, and mode 3 at bs 18 are all very similar in ms/sample.

What I've been doing is running the first 50k iterations or so on mode 1, then switching to a higher mode and higher BS later.
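
The same ranking can be done with the ms-per-sample math from above; here's a quick sketch fed with the table's numbers (just arithmetic, nothing DFL-specific):

# (mode, batch size, ms per iteration) from the table above; the mode 2 / bs 16 run eventually OOM'd
runs = [
    (1, 6, 1060), (1, 7, 1190),
    (2, 8, 1750), (2, 10, 1920), (2, 12, 2150), (2, 14, 2300), (2, 16, 2450),
    (3, 18, 2980),
]

# sort by ms/sample, best throughput first
for mode, bs, ms_iter in sorted(runs, key=lambda r: r[2] / r[1]):
    print(f"mode {mode}, bs {bs}: {ms_iter / bs:.0f} ms/sample")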
 

Akanee

DF Pleb
Hi everybody!
I have a question about optimizer mode; currently I run mode 3.
But my question is, I have problems with 256 resolution faces.
How did you guys get the trainer (for me, it's SAE in DF mode) working?
Usually I use 128.

My specs are:
i7 7700K
GTX 1070
32GB RAM DDR4 3200MHz

Thanks for the help!
 

titan_rw

DF Pleb
Is it OOM-ing? If so, have you tried lowering the batch size to where it doesn't OOM?

Mode 3 combined with a really small batch size means it's going to take a very long time to train.
 

dpfks

DF Enthusiast
Staff member
Administrator
Verified Video Creator
Akanee said:
Hi everybody!
I have a question about optimizer mode; currently I run mode 3.
But my question is, I have problems with 256 resolution faces.
How did you guys get the trainer (for me, it's SAE in DF mode) working?
Usually I use 128.

My specs are:
i7 7700K
GTX 1070
32GB RAM DDR4 3200MHz

Thanks for the help!

Likely need to lower batch size and dims.
 

Akanee

DF Pleb
dpfks said:
Likely need to lower batch size and dims.

Batch size is already at 4 and still an OOM problem :O
 

dpfks

DF Enthusiast
Staff member
Administrator
Verified Video Creator
Akanee said:
Batch size is already at 4 and still an OOM problem :O

then lower your dims
 

Akanee

DF Pleb
dpfks said:
then lower your dims
It's working with that configuration.
@"dpfks" can you please check if my configuration is correct for photos?
Maybe I can change some parameters for a better result? You are the master at this point :)

== Model options:
== |== batch_size : 4
== |== sort_by_yaw : False
== |== random_flip : True
== |== resolution : 256
== |== face_type : f
== |== learn_mask : True
== |== optimizer_mode : 1
== |== archi : liae
== |== ae_dims : 64
== |== e_ch_dims : 42
== |== d_ch_dims : 21
== |== remove_gray_border : False
== |== multiscale_decoder : True
== |== pixel_loss : False
== |== face_style_power : 0.0
== |== bg_style_power : 0.0
== Running on:
== |== [0 : GeForce GTX 1070]
 

avalentino93

DF Admirer
Akanee said:
@"dpfks" can you please check if my configuration is correct for photos?

== |== resolution : 256
== |== ae_dims : 64
...

I don't really see 256 res at 64 ae_dims being worthwhile, imo. I'm running 224 res with 777 ae_dims and still debating what's better between higher res and lower dims. But 64 is insanely low.
 

dpfks

DF Enthusiast
Staff member
Administrator
Verified Video Creator
I agree, I never use 256, but then again I don't do just photos
 

Akanee

DF Pleb
dpfks said:
I agree, I never use 256, but then again I don't do just photos

You're all right.
I found that too low too.
I'll go back to 128, since I don't think there will be a big difference from 128 honestly.

I asked that question because with photos we run into a problem more often than with video: the close-up shot.
I mean, often in photos the face is very near the camera, and I always wonder whether SAE with DF or maybe LIAE will give good results in terms of definition or resolution.
It's like my nightmare, you know..
 

titan_rw

DF Pleb
I figured I'd throw this in here since this thread has a lot of info about the memory optimizer modes in DFL.

If you use mode 2 or 3, PCI-E bandwidth is king.  Especially when using mode 2.

I juggled the PCI-E cards around in my computer and got my Titan, which had been running at x8, up to x16. This got me 5-15% faster iterations, depending on batch size and on mode 2 vs 3.

This makes sense thinking back on it. From what I gather: mode 1 just runs everything in VRAM all the time. Mode 2 juggles data between VRAM and system RAM in order to allow bigger models or a higher batch size. Depending on the speed of the GPU, how much VRAM it has, and the size of the model, this swapping between system RAM and graphics RAM can be limited by the bus speed. Mode 3 seems less affected; I'm assuming it uses the CPU for some of the processing, so there's more time available for bus transfers.

I was seeing 80-90% peak bus usage at x8, but only 60-70% at x16. These are peaks, not sustained continuous usage. But if I'm seeing a 70% peak at x16, that would be the equivalent of a 140% peak at x8. Obviously that's not possible, so training stalls for a fraction of a second while the PCI-E transfer finishes.
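
For a rough sense of scale, here's a back-of-the-envelope PCIe sketch (theoretical PCIe 3.0 numbers; the 2GB-per-iteration transfer size is a made-up illustration, not something measured from DFL):

# ~985 MB/s of usable bandwidth per PCIe 3.0 lane (8 GT/s, 128b/130b encoding)
PCIE3_BYTES_PER_SEC_PER_LANE = 985e6

def transfer_ms(bytes_moved, lanes):
    # time to shuttle `bytes_moved` across the link, ignoring latency and contention
    return bytes_moved / (PCIE3_BYTES_PER_SEC_PER_LANE * lanes) * 1000

per_iter_bytes = 2e9  # hypothetical amount swapped between system RAM and VRAM each iteration
print(f"x8 : {transfer_ms(per_iter_bytes, 8):.0f} ms")   # ~254 ms
print(f"x16: {transfer_ms(per_iter_bytes, 16):.0f} ms")  # ~127 ms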
 

avalentino93

DF Admirer
Akanee said:
I asked that question because with photos we run into a problem more often than with video: the close-up shot. I mean, often in photos the face is very near the camera...

I do photos all the time. It's fucking tedious. However, I've found some things that help speed up the process and give better results.
I used to go photo by photo, resizing them all so they would fit within my 128-224 face resolutions. One by one by one, sometimes up to 4000 at a time. It would take me about 6-7 hours to go through 2000 photos. Way too long. Here is my process now.
Make 2-4 copies of each photo, with a naming convention something like:
photo1, photo1_1, photo1_2, photo1_3
photo2, photo2_1, photo2_2, etc.

So now you have 3-4 copies of each photo. In Premiere, import all of them and make a new sequence with just photoX.
Now make a new sequence after it with just photoX_1, then another with photoX_2, and so on, so that you have multiple sequences of the same photos.
Then for sequence 1, leave it. Sequence 2, select all frames and scale -50%. Sequence 3, select all and scale 80%. Sequence 4, select all and scale 125%. And so on.
Lastly, if you want, you can even make contrast, color, and denoise changes in yet more sequences.

Yes, this means you sometimes end up with 12,000-16,000 frames. Yes, it will take time finding faces, but far, far less time than you would spend manually manipulating each frame/photo. However, if you already have a well-trained src model, it doesn't matter. If your model is near perfect, it only takes maybe an hour of running for it to match the new dst. Now you convert all photos (yes, all 12k-20k) in both rct and then lct. What you end up with is the largest possible number of good photos you can use, with hardly any pre-processing or post-processing. Since some face sizes and colors work better than others, this method covers almost every possible output for each photo, ensuring you have a good end product.

tl;dr: probably just don't do photos, it's ridiculously tedious, but it teaches you a lot about everything.
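
If you'd rather script the copy-and-scale step instead of doing it in Premiere, here's a rough sketch in Python with Pillow (the folder names and scale factors are just placeholders for the idea above, not anyone's actual pipeline):

import os
from PIL import Image

SRC_DIR = "photos_in"           # hypothetical input folder of photos
DST_DIR = "photos_out"          # hypothetical output folder to run extraction on
SCALES = [1.0, 0.5, 0.8, 1.25]  # original, -50%, 80%, 125%, mirroring the sequences above

os.makedirs(DST_DIR, exist_ok=True)
for name in os.listdir(SRC_DIR):
    base, ext = os.path.splitext(name)
    if ext.lower() not in (".jpg", ".jpeg", ".png"):
        continue
    img = Image.open(os.path.join(SRC_DIR, name))
    for i, scale in enumerate(SCALES):
        # photo1.jpg, photo1_1.jpg, photo1_2.jpg, ... naming convention
        out_name = name if i == 0 else f"{base}_{i}{ext}"
        out = img if scale == 1.0 else img.resize((int(img.width * scale), int(img.height * scale)))
        out.save(os.path.join(DST_DIR, out_name))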

titan_rw said:
If you use mode 2 or 3, PCI-E bandwidth is king. Especially when using mode 2. ...

I'm still not understanding the iteration times. It seems like a while back I could push to batch size 20 at 128 res and still be at around 1,800ms. Ever since the optimizer (and yes, I realize the trade-off of offloading memory), even something like 145 res with optimizer 2 is sometimes like 3 seconds! I try to do everything I can to avoid the optimizer, but there doesn't seem to be a way to anymore.

My baseline default training model settings are now:
batch size: 4-5
resolution: 224
ae_dims: 777
e_ch_dims: 55
d_ch_dims: 33
optimizer: 2

iteration = 2,600ms

Sometimes, when I'm going to let it run for a few hours straight, I'll set it to a batch size of 6-7 and optimizer to 3, but then iterations are like 4-5 seconds.

I'm still trying to figure out whether (after say 25k iterations) larger batch sizes at slower iteration times are better, or smaller batch sizes at faster times. Hard to determine what's better.
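
Applying iperov's per-sample math from earlier in the thread at least gives a throughput comparison of those two setups (a rough sketch; the 4-5 second case is taken as ~4.5s and the 4-5 / 6-7 batch sizes as their midpoints):

# (average batch size, ms per iteration) for the two configurations described above
configs = {
    "mode 2, bs 4-5": (4.5, 2600),
    "mode 3, bs 6-7": (6.5, 4500),
}
for name, (bs, ms_iter) in configs.items():
    print(f"{name}: {ms_iter / bs:.0f} ms/sample")

That only measures raw throughput, though; whether the bigger batch itself trains better is a separate question.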
 