I just recently started reading about machine learning so maybe I can explain it a bit.
Basically, a neural network takes some inputs (src images in this case) and learns a function that produces a desired output (the dst image). This function gets more and more accurate the longer the network is trained. But the network also needs some way to know how close the output it generates is to the desired output. Essentially, it compares the output it produced to the desired output via what's called a cost (or loss) function.
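Here's a tiny toy example of what I mean. Mean squared error is one common cost function for images (I have no idea which loss DeepFaceLab actually uses, so treat this as a generic illustration with made-up numbers):

```python
import numpy as np

def mse_loss(predicted, target):
    """Mean squared error: average squared difference per pixel."""
    return np.mean((predicted - target) ** 2)

# Toy "images" as flat arrays of pixel values.
predicted = np.array([0.2, 0.8, 0.5])
target    = np.array([0.0, 1.0, 0.5])

print(mse_loss(predicted, target))  # ~0.0267 -- lower means closer to the target
```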
Training then tries to minimize the loss that this cost function reports. It uses something called a gradient, which you'd recognize if you've taken multivariable calculus. If not: the gradient tells you the direction in which the function's rate of change is steepest at a given input, so by stepping in the opposite direction you head downhill. The simple explanation is, it tells you how to change the inputs to reach a relative minimum of the loss.
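To make that concrete, here's gradient descent on a toy one-variable function, f(x) = x², instead of a real network. The gradient of x² is 2x, and we step against it to walk downhill (the function and starting point are just made up for illustration):

```python
def gradient(x):
    # Derivative of f(x) = x**2.
    return 2 * x

x = 5.0              # starting guess
learning_rate = 0.1  # step size

for _ in range(50):
    x -= learning_rate * gradient(x)  # step opposite the gradient

print(x)  # very close to 0.0, the minimum of x**2
```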
In these images, imagine x and y are pieces of information about the image that get adjusted (in reality there are way more than two), and z is the loss produced by changing them.
The image on the left is the ideal case: we adjust our inputs (x, y) until the resulting z (loss) sits at the lowest point (a relative minimum). The gradient tells us which way our inputs need to change.
The image on the right shows what happens when we follow the gradient too far at each step. If we make our steps too big, we keep overshooting the target; but if our steps are very small, it takes a long time to converge.
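You can see the overshooting in the same toy example just by making the step size too big. For f(x) = x², any learning rate above 1.0 makes each step land farther from the minimum than where it started:

```python
def gradient(x):
    return 2 * x

x = 5.0
learning_rate = 1.1  # too large for this function

for step in range(10):
    x -= learning_rate * gradient(x)
    print(step, x)  # -6.0, 7.2, -8.64, ... x flips sign and grows every step
```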
So I think what gradient clipping does is limit how far you follow that path at each step, not allowing any single step to be too big. I'm guessing the model corruption is a result of "gradient explosion" (the gradients growing so large that the weights blow up), which is why iperov chose to limit the gradient by "clipping" it.
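From what I've read, a common version is "clip by norm": if the gradient vector is longer than some threshold, rescale it to that length, so the direction is kept but the step stays bounded. No idea if this is exactly how iperov implements it, so this is just the general idea:

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """If the gradient is longer than max_norm, shrink it to max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)  # same direction, capped length
    return grad

huge_gradient = np.array([300.0, -400.0])         # norm = 500
print(clip_by_norm(huge_gradient, max_norm=1.0))  # [0.6 -0.8], norm = 1
```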
https://www.quora.com/What-is-gradient-clipping-and-why-is-it-necessary
I'm really new to this topic too, but it sounds like your 'steps' can get way too large, so instead of settling into a minimum the loss diverges or bounces around, which causes the corruption?
This site explains it pretty well, though it might be hard to follow if you have no background in calculus.
https://blog.paperspace.com/intro-to-optimization-in-deep-learning-gradient-descent/
I should note, I don't know how this process works for the combined image, but I believe this is what happens when the model generates its versions of the src and dst images. If anyone has a better understanding, please feel free to correct this; it's my best understanding from self-learning.