Mr DeepFakes Forums
klemon: DeepFaceLab won't train
#1
Hello, I was hoping to get some help.
I followed the tutorial video without error up until the training step, where I receive a lot of error text that I think can be summarized as follows:

E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED

E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED



I'm completely lost when it comes to what software to install and how to type in bits of code, but I'm pretty sure I have the correct versions installed:

CUDA 9.0, cuDNN 7.0.5 (for CUDA 9.0), TensorFlow 1.8, Python 3.6.0.


Thanks
#2
If you are using DeepFaceLab, you don't need to install anything. It's a self-contained application. Just download the latest version and follow @dpfks's guide in the Guides section.

Make sure you meet the necessary system requirements.
#3
That's right. To fix this, uninstall all previous CUDA installations on your PC.

Then do a fresh install of DFL (unzip a new copy).
#4
(04-17-2019, 11:13 PM) dpfks wrote: That's right. To fix this, uninstall all previous CUDA installations on PC.

Do a fresh install of DFL (unzip a new copy)

Oh wow, that worked perfectly, thank you very much. It was driving me crazy yesterday :)

(04-17-2019, 10:25 PM) Pocketspeed wrote: If you are using DeepFaceLab, you don't need to install anything. It's a self-contained application. Just download the latest version, and follow @dpfks guide in the Guides section.

Make sure you meet the necessary system requirements.

I thought it was, but I had everything downloaded for FakeApp 2.2 so I thought I'd mention it in case it was relevant. Turns out it was causing the problem, hah. I have it working now though, thanks for the reply :)
#5
Ah, the problem came back. I uninstalled CUDA from my PC and reinstalled the program as suggested by dpfks, and managed to train for 40,000+ iterations without any issue. I then saved the model, returned to it later, and trained again for 15 minutes or so until white masks started appearing in the preview window. I tried closing and restarting but got the same error as before. I've tried deleting and reinstalling the same and different versions of DFL, but I get the same error message with each. I restarted my PC in between the installs as well in case that could help, but no change.

I read on @dpfks's profile that he created a video using a 2 GB GTX 850M. I'm running an 8th-gen i7, a GTX 1080 and 16 GB of RAM.

While the training was working, Task Manager didn't show my GPU going over 15% usage with a batch size of 4. My CPU was at around 15%-20% usage and my memory at roughly 50% :/

I'm lost. Here is the full output:

Running trainer.

Loading model...

Model first run. Enter model options as default for each run.
Write preview history? (y/n ?:help skip:n) : y
Target iteration (skip:unlimited/default) :
0
Batch_size (?:help skip:0) : 4
Feed faces to network sorted by yaw? (y/n ?:help skip:n) :
n
Flip faces randomly? (y/n ?:help skip:y) :
y
Src face scale modifier % ( -30...30, ?:help skip:0) :
0
Use lightweight autoencoder? (y/n, ?:help skip:n) :
n
Use pixel loss? (y/n, ?:help skip: n/default ) :
n
Using TensorFlow backend.
Loading: 100%|##########| 231/231 [00:00<00:00, 785.11it/s]
Loading: 100%|##########| 1023/1023 [00:01<00:00, 750.90it/s]
===== Model summary =====
== Model name: H64
==
== Current iteration: 0
==
== Model options:
== |== write_preview_history : True
== |== batch_size : 4
== |== sort_by_yaw : False
== |== random_flip : True
== |== lighter_ae : False
== |== pixel_loss : False
== Running on:
== |== [0 : GeForce GTX 1080]
=========================
Starting. Press "Enter" to stop training and save model.
2019-04-18 20:03:42.732070: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:42.738832: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:42.746186: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:42.752778: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:42.778086: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:42.784890: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:42.791580: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:42.798821: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:42.805658: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:42.811714: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:42.817774: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:42.824531: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2019-04-18 20:03:43.698430: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2019-04-18 20:03:43.701544: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2019-04-18 20:03:43.704184: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2019-04-18 20:03:43.707206: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
Error: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node model_1/conv2d_1/convolution}} = Conv2D[T=DT_FLOAT, _class=["loc:@train...propFilter"], data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 2, 2], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/Adam/gradients/model_1/conv2d_1/convolution_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv2d_1/kernel/read)]]
[[{{node loss/model_3_loss_1/Mean_3/_571}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_4386_loss/model_3_loss_1/Mean_3", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Traceback (most recent call last):
File "F:\DeepFaceLabCUDA9.2SSE\_internal\DeepFaceLab\mainscripts\Trainer.py", line 93, in trainerThread
iter, iter_time = model.train_one_iter()
File "F:\DeepFaceLabCUDA9.2SSE\_internal\DeepFaceLab\models\ModelBase.py", line 362, in train_one_iter
losses = self.onTrainOneIter(sample, self.generator_list)
File "F:\DeepFaceLabCUDA9.2SSE\_internal\DeepFaceLab\models\Model_H64\Model.py", line 85, in onTrainOneIter
total, loss_src_bgr, loss_src_mask, loss_dst_bgr, loss_dst_mask = self.ae.train_on_batch( [warped_src, target_src_full_mask, warped_dst, target_dst_full_mask], [target_src, target_src_full_mask, target_dst, target_dst_full_mask] )
File "F:\DeepFaceLabCUDA9.2SSE\_internal\python-3.6.8\lib\site-packages\keras\engine\training.py", line 1217, in train_on_batch
outputs = self.train_function(ins)
File "F:\DeepFaceLabCUDA9.2SSE\_internal\python-3.6.8\lib\site-packages\keras\backend\tensorflow_backend.py", line 2715, in __call__
return self._call(inputs)
File "F:\DeepFaceLabCUDA9.2SSE\_internal\python-3.6.8\lib\site-packages\keras\backend\tensorflow_backend.py", line 2675, in _call
fetched = self._callable_fn(*array_vals)
File "F:\DeepFaceLabCUDA9.2SSE\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1439, in __call__
run_metadata_ptr)
File "F:\DeepFaceLabCUDA9.2SSE\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node model_1/conv2d_1/convolution}} = Conv2D[T=DT_FLOAT, _class=["loc:@train...propFilter"], data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 2, 2], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/Adam/gradients/model_1/conv2d_1/convolution_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv2d_1/kernel/read)]]
[[{{node loss/model_3_loss_1/Mean_3/_571}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_4386_loss/model_3_loss_1/Mean_3", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Done.
Press any key to continue . . .
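
The log above repeats the same two allocation failures many times, which makes it hard to see at a glance what actually went wrong. As a rough sketch (not part of DeepFaceLab; the function name `summarize_tf_errors` and the regular expression are illustrative assumptions based only on the line format shown in this thread), a few lines of Python can collapse such a log into one line per distinct error:

```python
import re
from collections import Counter

# Matches TensorFlow error lines in the format seen in this thread, e.g.
# "<timestamp>: E tensorflow/.../cuda_blas.cc:464] failed to create cublas handle: ..."
# The pattern is an assumption derived from that sample, not an official log spec.
TF_ERROR = re.compile(r"\bE (tensorflow/\S+?):\d+\] (.+)$")


def summarize_tf_errors(log_text):
    """Count identical TensorFlow 'E' lines, ignoring timestamps and line numbers."""
    counts = Counter()
    for line in log_text.splitlines():
        match = TF_ERROR.search(line.strip())
        if match:
            source_file, message = match.groups()
            counts[(source_file, message)] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "2019-04-18 20:03:42.732070: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] "
        "failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED\n"
        "2019-04-18 20:03:43.698430: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] "
        "Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED\n"
    )
    for (source_file, message), n in summarize_tf_errors(sample).items():
        print(f"{n} x {source_file}: {message}")
```

Applied to the output in this post, it would reduce the wall of text to two entries: the cuBLAS handle failure and the cuDNN handle failure.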
#6
Uninstall all previous CUDA installs. If you see CUDA listed under Control Panel -> Uninstall a program, uninstall it.

Then download the latest CUDA 9.2 build of DFL and do a fresh extraction. You can then take your old workspace folder and use it to replace the one in the new extract.

Then try again.
#7
No change, still getting the same error message.
#8
GPU driver's up to date?
#9
Yep, according to GeForce. Version 425.31.
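
Both errors in this thread end in ALLOC_FAILED, which generally means a resource allocation on the GPU failed, and the utilization percentage in Task Manager does not show how much GPU memory is already in use. One way to check that, sketched here under the assumption that `nvidia-smi` (shipped with the NVIDIA driver) is available on the PATH, is to query used and total memory before starting the trainer. The helper names `parse_memory_csv` and `query_gpu_memory` are hypothetical:

```python
import subprocess


def parse_memory_csv(csv_text):
    """Parse 'used, total' lines in MiB (one line per GPU) into (used, total) int pairs."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for used/total memory per GPU; assumes nvidia-smi is on PATH."""
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        universal_newlines=True,
    )
    return parse_memory_csv(output)
```

Calling `query_gpu_memory()` returns one `(used, total)` pair per GPU. If most of the memory is already used before training starts, another process is likely holding it, which would be worth ruling out before reinstalling anything else.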
