Adjoint Equation of a Linear System of Equations - by implicit derivative


How can you efficiently get the derivatives of a loss function with respect to some parameter vector if an intermediate stage of the computation involves the solution of a linear system of equations, i.e., an implicit problem? If you use Automatic Differentiation, you might be inclined to simply propagate through the solver of the linear system, e.g., an LU or QR decomposition for dense matrices or an iterative solver for sparse systems. However, this approach introduces problems that make it inefficient or even infeasible in practice (e.g., vanishing/exploding gradients and high memory requirements for storing intermediate values in reverse mode). The remedy is a clever bracketing that frames the task of taking derivatives as the solution of another linear system of equations, the adjoint system. This adjoint system is easy to set up and involves the transpose of the system matrix, which is, by the way, the reason the transpose of a matrix is also called its adjoint.
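As a rough illustration, here is a minimal NumPy sketch of that recipe. The concrete parameterization A(θ) = A0 + Σ θ_p A_p, b(θ) = b0 + Bθ, the quadratic loss, and all variable names are made up for this example; the point is only the sequence of steps: one forward solve, one adjoint solve with the transposed matrix, and an assembly of the gradient, checked here against finite differences.

```python
import numpy as np

# Minimal sketch of the adjoint recipe for A(theta) x = b(theta).
# The parameterization of A and b, the loss, and all names are illustrative only.
rng = np.random.default_rng(0)
N, P = 5, 3                                            # state and parameter dimensions

A0 = rng.normal(size=(N, N)) + 10.0 * np.eye(N)        # keep A(theta) well conditioned
A_parts = [rng.normal(size=(N, N)) for _ in range(P)]  # dA/dtheta_p
b0 = rng.normal(size=N)
B = rng.normal(size=(N, P))                            # db/dtheta
x_obs = rng.normal(size=N)
theta = rng.normal(size=P)

def assemble(th):
    A = A0 + sum(t * Ap for t, Ap in zip(th, A_parts))
    return A, b0 + B @ th

def loss(x):
    return 0.5 * np.sum((x - x_obs) ** 2)              # J(x); dJ/dx = (x - x_obs)^T

# 1) Forward solve: A x = b
A, b = assemble(theta)
x = np.linalg.solve(A, b)

# 2) Adjoint solve with the transposed system matrix: A^T lambda = (dJ/dx)^T
lam = np.linalg.solve(A.T, x - x_obs)

# 3) Assemble the gradient: dJ/dtheta_p = lambda^T (db/dtheta_p - (dA/dtheta_p) x)
grad = np.array([lam @ (B[:, p] - A_parts[p] @ x) for p in range(P)])

# Finite-difference check
eps = 1e-6
fd = np.array([
    (loss(np.linalg.solve(*assemble(theta + eps * np.eye(P)[p]))) - loss(x)) / eps
    for p in range(P)
])
print(grad)
print(fd)  # should agree with grad to several digits
```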


Timestamps:
00:00 Introduction
01:50 Sensitivities
03:04 Implicit Relations vs. Automatic Differentiation
03:57 Dimensions of the variables
04:22 A (scalar-valued) loss function
05:00 Example for a loss function
05:39 Solution also depends on parameters
06:14 Gradient as Total Derivative
07:37 Gradient is a row vector
09:05 The difficult quantity
09:39 Implicit Derivation
11:06 A naive approach
13:41 Problem of the naive approach
16:49 Remedy: Adjoint Method
19:04 Clever Bracketing
19:54 The adjoint variable
21:18 The adjoint system
22:13 Similar Complexity
22:44 Dimension of the adjoint
23:24 Strategy for loss gradient
25:03 Important finding
25:52 When to use adjoint?
26:33 How to get the other derivatives?
27:55 Outlook: Nested linear systems
28:09 Outro
Comments

This is great! Working on a package that uses the adjoint method for implicit differentiation and this video explained things really well :D

SultanAlHassanieh

Amazing! I was confused while reading the Neural ODEs paper about how exactly they do the adjoint method, and your two videos on this helped a lot in understanding it better. Much appreciated! I wonder if it is possible to derive an explicit system of ODEs that can model a given simple ResNet or feedforward network.

MrRajiv

Cool! Is there an example or code for discrete adjoint CFD shape optimization using automatic differentiation, like SU2, but an instructional version?

yixincfd

Thank you so much for providing such a wonderful series, it is really insightful! Would you be so kind as to elaborate a bit on two aspects?

1. The original computational issue with dx/dθ (starting at 14:20). It is understandable that the naive calculation of dx/dθ is much more complicated than solving x = A^{-1}b, since dx/dθ involves a matrix-matrix product while the latter is just a matrix-vector product. But what puzzles me a bit is how to view it as a "batch linear system of equations": in my naive thinking, since the inverse A^{-1}(θ) only needs to be calculated once, that step is typically O(N^3) and does not involve P at all. The rest is a multiplication of an N×N matrix with the N×P matrix in the bracket, and only there does the dimensionality of θ start to affect the complexity. This does not seem to explain where the O(PN^3) at 15:55 comes from.

2. It seems the key insight is simply to switch the order of computation by evaluating the vector-Jacobian product first, which mitigates the complexity of the original Jacobian-vector product formulation. If that is the case, what is the purpose of defining the additional "adjoint" variable in this specific system? It seems all it does is remind us to compute a vector-Jacobian product instead of a Jacobian-vector product, which could also be achieved by a careful ordering of the matrix multiplications.

willtsing
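A small NumPy sketch of the two bracketings this question refers to (all sizes and names are made up): both orders give the same gradient, but the naive order solves a batch system with P right-hand sides, while the adjoint order solves a single system with the transposed matrix, independent of P.

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 400, 300
A = rng.normal(size=(N, N)) + 10.0 * np.eye(N)
rhs = rng.normal(size=(N, P))      # stands in for db/dtheta - (dA/dtheta) x
dJdx = rng.normal(size=N)          # the row vector dJ/dx

# Naive bracketing: first dx/dtheta = A^{-1} rhs (a batch solve with P columns),
# then contract with dJ/dx.
grad_naive = dJdx @ np.linalg.solve(A, rhs)

# Adjoint bracketing: one solve with A^T for lambda, then a cheap matrix-vector product.
lam = np.linalg.solve(A.T, dJdx)
grad_adjoint = lam @ rhs

print(np.allclose(grad_naive, grad_adjoint))  # True: only the cost differs
```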

The video is excellent. Can you suggest how to use the adjoint method for FEA static problems when the stiffness matrix depends on a large number of parameters? Are there any available MATLAB codes for the adjoint method for linear systems?

navaneethavg

Hi. I have a question regarding the use of the Leibniz rule. When you formulate the problem as a constrained optimization with an ODE constraint, you take the total derivative of the Lagrangian with respect to the parameter \theta. By the Leibniz rule, you brought the derivative d/d\theta inside the integral, but when it is brought inside, it should become a partial derivative with respect to \theta, should it not?

tuo

I had a question about the equation dx/dθ = A^{-1}(db/dθ - (dA/dθ)x). Ideally, dx/dθ should be a function of θ alone (with no remaining x-terms). However, the x multiplying dA/dθ does not cancel out, leaving dx/dθ in terms of both x and θ. How do we deal with this? Are we supposed to use observed values (xObs) in place of a purely symbolic x here? If x remains in the solution, it is not very useful. Thanks.

AnthonyDeesePhD

Hello, thanks for your video. I still have one question: could you please give an example of how to get ∂J/∂θ? J is a function of θ, right?

Recordingization

Do you have any video about second- or arbitrary-order adjoints?

juandavidnavarro

For a FEM simulation, can the parameters be the mesh nodes or some matrices built from the mesh nodes?

kapilkhanal

I have a question regarding your statement that lambda needs to be calculated only once. Lambda depends on A, which depends on theta. Usually, theta is found iteratively for optimization problems, which means that theta changes in each iteration, which also causes A to change, and hence lambda. Of course, if A doesn't depend on theta, then we don't have this problem. What do you think?

usefulknowledge

I still don't understand the relationship between θ and A. Could you please provide a simple example? For instance, can I take A(θ) to be the rotation matrix with rows [cos(θ), 0, sin(θ)], [0, 1, 0], and [-sin(θ), 0, cos(θ)]?

Recordingization

Solving a batch of linear equations does not take O(PN^3). You do the LU decomposition once, which is O(2/3 N^3), and then for each of the P columns you do a forward and a backward substitution, which is O(PN^2) in total.

prateekpatel
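A minimal sketch of that factor-once argument, assuming SciPy is available (scipy.linalg.lu_factor / lu_solve); the matrix and right-hand sides are made up:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(2)
N, P = 500, 200
A = rng.normal(size=(N, N)) + 10.0 * np.eye(N)
RHS = rng.normal(size=(N, P))      # P right-hand sides

lu, piv = lu_factor(A)             # LU factorization once: O(2/3 N^3)
X = lu_solve((lu, piv), RHS)       # forward/backward substitution per column: O(P N^2)

print(np.allclose(A @ X, RHS))     # True
```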

Can you point me to resources on this?

usefulknowledge