MrDeepFakes Forums


DeepFaceLab won't train

klemon

DF Vagrant
Hello, I was hoping to get some help.
I followed the tutorial video without error up until the training step, where I receive a bunch of error text that I think can be summarized as:

E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED

E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED



I'm completely lost when it comes to what software to install and how to type in bits of code, but I'm pretty sure I have the correct stuff installed:

CUDA 9.0, cuDNN 7.0.5 (for CUDA 9.0), TensorFlow 1.8, Python 3.6.0.


Thanks
 

Pocketspeed

DF Admirer
Verified Video Creator
If you are using DeepFaceLab, you don't need to install anything. It's a self-contained application. Just download the latest version, and follow @dpfks guide in the Guides section. 

Make sure you meet the necessary system requirements.
 

dpfks

DF Enthusiast
Staff member
Administrator
Verified Video Creator
That's right. To fix this, uninstall all previous CUDA installations on PC.

Do a fresh install of DFL (unzip a new copy)
 

klemon

DF Vagrant
dpfks said:
That's right. To fix this, uninstall all previous CUDA installations on PC.

Do a fresh install of DFL (unzip a new copy)
Oh wow, that worked perfectly, thank you very much. It was driving me crazy yesterday :)


Pocketspeed said:
If you are using DeepFaceLab, you don't need to install anything. It's a self-contained application. Just download the latest version, and follow @dpfks guide in the Guides section. 

Make sure you meet the necessary system requirements.

I thought it was, but I had everything downloaded for fakeapp2.22, so I thought I'd mention it in case it was relevant. Turns out it was causing the problem, hah. I have it working now though, thanks for the reply :)
 

klemon

DF Vagrant
Ah, the problem came back. I uninstalled CUDA from my PC and reinstalled the program as suggested by dpfks, and managed to train for 40,000+ iterations without any issue. I then saved the model, returned to it later and trained again for 15 minutes or so until white masks started appearing in the preview window. I tried closing and restarting but got the same error as before. I've tried deleting and reinstalling the same and different versions of DFL, but it's the same error message with each. I restarted my PC in between the installs as well in case that could help, but no change.

I read on @dpfks's profile that he created a video using a 2GB GTX 850M; I'm running an 8th-gen i7, a GTX 1080 and 16GB of RAM.

While the training was working, Task Manager didn't show my GPU going over 15% usage with a batch size of 4; my CPU was at around 15%-20% usage and my memory at 50%, give or take :/


I'm lost


Running trainer.

Loading model...

Model first run. Enter model options as default for each run.
Write preview history? (y/n ?:help skip:n) : y
Target iteration (skip:unlimited/default) :
0
Batch_size (?:help skip:0) : 4
Feed faces to network sorted by yaw? (y/n ?:help skip:n) :
n
Flip faces randomly? (y/n ?:help skip:y) :
y
Src face scale modifier % ( -30...30, ?:help skip:0) :
0
Use lightweight autoencoder? (y/n, ?:help skip:n) :
n
Use pixel loss? (y/n, ?:help skip: n/default ) :
n
Using TensorFlow backend.
Loading: 100%|#####################################################################################################################################################################################################################################################################################################################################################################################| 231/231 [00:00<00:00, 785.11it/s]
Loading: 100%|###################################################################################################################################################################################################################################################################################################################################################################################| 1023/1023 [00:01<00:00, 750.90it/s]
===== Model summary =====
== Model name: H64
==
== Current iteration: 0
==
== Model options:
== |== write_preview_history : True
== |== batch_size : 4
== |== sort_by_yaw : False
== |== random_flip : True
== |== lighter_ae : False
== |== pixel_loss : False
== Running on:
== |== [0 : GeForce GTX 1080]
=========================
Starting. Press "Enter" to stop training and save model.
2019-04-18 20:03:42.732070: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:42.738832: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:42.746186: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:42.752778: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:42.778086: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:42.784890: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:42.791580: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:42.798821: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:42.805658: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:42.811714: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:42.817774: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:42.824531: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:43.698430: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2019-04-18 20:03:43.701544: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2019-04-18 20:03:43.704184: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2019-04-18 20:03:43.707206: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
Error: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node model_1/conv2d_1/convolution}} = Conv2D[T=DT_FLOAT, _class=["loc:@train...propFilter"], data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 2, 2], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/Adam/gradients/model_1/conv2d_1/convolution_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv2d_1/kernel/read)]]
[[{{node loss/model_3_loss_1/Mean_3/_571}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_4386_loss/model_3_loss_1/Mean_3", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Traceback (most recent call last):
File "F:\DeepFaceLabCUDA9.2SSE\_internal\DeepFaceLab\mainscripts\Trainer.py", line 93, in trainerThread
iter, iter_time = model.train_one_iter()
File "F:\DeepFaceLabCUDA9.2SSE\_internal\DeepFaceLab\models\ModelBase.py", line 362, in train_one_iter
losses = self.onTrainOneIter(sample, self.generator_list)
File "F:\DeepFaceLabCUDA9.2SSE\_internal\DeepFaceLab\models\Model_H64\Model.py", line 85, in onTrainOneIter
total, loss_src_bgr, loss_src_mask, loss_dst_bgr, loss_dst_mask = self.ae.train_on_batch( [warped_src, target_src_full_mask, warped_dst, target_dst_full_mask], [target_src, target_src_full_mask, target_dst, target_dst_full_mask] )
File "F:\DeepFaceLabCUDA9.2SSE\_internal\python-3.6.8\lib\site-packages\keras\engine\training.py", line 1217, in train_on_batch
outputs = self.train_function(ins)
File "F:\DeepFaceLabCUDA9.2SSE\_internal\python-3.6.8\lib\site-packages\keras\backend\tensorflow_backend.py", line 2715, in __call__
return self._call(inputs)
File "F:\DeepFaceLabCUDA9.2SSE\_internal\python-3.6.8\lib\site-packages\keras\backend\tensorflow_backend.py", line 2675, in _call
fetched = self._callable_fn(*array_vals)
File "F:\DeepFaceLabCUDA9.2SSE\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1439, in __call__
run_metadata_ptr)
File "F:\DeepFaceLabCUDA9.2SSE\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node model_1/conv2d_1/convolution}} = Conv2D[T=DT_FLOAT, _class=["loc:@train...propFilter"], data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 2, 2], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/Adam/gradients/model_1/conv2d_1/convolution_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv2d_1/kernel/read)]]
[[{{node loss/model_3_loss_1/Mean_3/_571}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_4386_loss/model_3_loss_1/Mean_3", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Done.
Press any key to continue . . .
 

dpfks

DF Enthusiast
Staff member
Administrator
Verified Video Creator
Uninstall all previous CUDA installs. If you see CUDA under Control Panel -> Uninstall a program, uninstall it.

Then download the latest CUDA 9.2 build of DFL and do a fresh extraction. You can then take your old workspace folder and use it to replace the one in the new extract.

Then try again.
 

fresh_gumbo

I have had this issue a few times in the past. If there are no CUDA installs causing issues and your drivers are up to date, I add the word OLD to the current DFL folder name, re-extract the newest DFL installer, reboot the machine, copy the workspace folder from the OLD folder into the one just extracted, then go back to training. This workaround is a pain in the arse, but it seems to solve the issue pretty well long term. Hope this helps.
 

klemon

DF Vagrant
I made one last attempt before going to bed yesterday and it just started working. The only thing I changed was the batch size from 4 to 42, using the exact same DFL extract that gave me the error message. Crazy. dpfks's instructions seem to have worked great, they just took a while to kick in, if that's a thing lol


Thanks for all the help guys :)
 

klemon

DF Vagrant
Yeah, it's back again :/ This seems like a super temperamental program tbh. I've made a couple of vids fine, faster than I expected, but then the program randomly stops wanting to train and gives me the same error message, with zero changes to software or settings.

I started a new project in the same extract of DFL and encountered the error again. I fixed it for a while by copying over the encoder and data files (it only worked when I copied both) from a past model, which starts me off at 20,000 iterations and trains at about a third to a quarter of the original speed; the previews show it working with the new images. It managed about 15,000 iterations in 8 hours, when the exact same files trained for nearly 60,000 over the same amount of time the day before.

Then it broke again :/ I tried @"fresh_gumbo"'s solutions but had no luck.


I think I'm just going to give up for a while and maybe try again in the future, unfortunately.
 

dpfks

DF Enthusiast
Staff member
Administrator
Verified Video Creator
You stated you were using these before: cuda 9.0, cudnn 7.0.5 (for cuda 9.0), tensorflow 1.8, python 3.6.0.

I recommended uninstalling system installs for CUDA, but please just uninstall all system installs of CUDA + Python (I know for sure those gave me errors)

Then just use a fresh unzipped copy of DFL
 

fresh_gumbo

@"klemon" OK, so I have found a weird thing that may be a coincidence. I had porn tabs open looking for a model and noticed the error coming up, and it was driving me insane. I closed Firefox (there were a lot of tabs) and it ran perfectly. I am not sure if it has to do with the amount of memory Firefox uses with 15 tabs open or, less likely, something on one of those tabs was interfering. P.S. There's a new version out for a fresh install; this may just solve the issue.
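
CUBLAS_STATUS_ALLOC_FAILED generally means cuBLAS could not allocate resources on the GPU, so another program holding GPU memory (a browser, for instance) is at least a plausible cause. One way to test that theory is to look at how much GPU memory is already in use before starting the trainer. Below is a minimal Python sketch, not part of DFL, assuming the NVIDIA driver's `nvidia-smi` tool is on the PATH. The helper names `parse_gpu_memory` and `gpu_memory_report` and the 80% warning threshold are made up for illustration.

```python
import subprocess

# Ask nvidia-smi for used and total memory (in MiB), one CSV line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Turn lines like '512, 8192' into a list of (used, total) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        gpus.append((used, total))
    return gpus


def gpu_memory_report(csv_text, warn_fraction=0.8):
    """Return one readable line per GPU, flagging GPUs that are mostly full.

    warn_fraction is an arbitrary illustrative threshold, not a DFL setting.
    """
    lines = []
    for index, (used, total) in enumerate(parse_gpu_memory(csv_text)):
        note = "  <- mostly in use already" if used / total >= warn_fraction else ""
        lines.append(f"GPU {index}: {used} / {total} MiB used{note}")
    return lines


if __name__ == "__main__":
    try:
        output = subprocess.run(QUERY, capture_output=True, text=True,
                                check=True).stdout
        print("\n".join(gpu_memory_report(output)))
    except (OSError, subprocess.CalledProcessError, ValueError) as exc:
        # nvidia-smi missing, failing, or printing something unexpected.
        print(f"Could not read GPU memory via nvidia-smi: {exc}")
```

Running it once with the browser open and once with it closed shows whether the amount of memory in use changes noticeably. It only reports usage and does not change anything.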
 

klemon

DF Vagrant
I think it's working now.

I uninstalled Python, so there's now no CUDA or Python showing on my PC's uninstall page. I moved out my workspace, deleted DFL and extracted a fresh copy of the most recent version, DeepFaceLabCUDA9.2SSE_build_04_23_2019. It worked for the first training attempt, then broke again: the same error message wouldn't let me start up the model for training again.

So I moved out my workspace, deleted DFL, restarted my PC, kept all browsers closed as fresh_gumbo suggested just in case, extracted a new DFL, moved my workspace back in and clicked train, which worked. I saved and closed it, reopened training, and it still works.

Hopefully it's all good this time with Python deleted. I'll test out the browser thing in more depth tomorrow when I restart the training, since last time it didn't break until a couple of days later.

Thanks again guys, I appreciate the patience :)
 