Adjoint Equation of a Linear System of Equations - by implicit derivative


How can you efficiently get the derivatives of a loss function with respect to some parameter vector if an intermediate stage of the computation involves the solution of a linear system of equations, i.e., an implicit problem? If you use Automatic Differentiation, you might be inclined to simply propagate through the solver of the linear system, e.g., an LU or QR decomposition for dense matrices or an iterative solver for sparse systems. However, this approach introduces problems that make it inefficient or even infeasible in practice (e.g., vanishing/exploding gradients and high memory requirements for storing intermediate values in reverse mode). The remedy is a clever bracketing that frames the task of taking derivatives as the solution of another linear system of equations, the adjoint system. This adjoint system is easy to set up and involves the transpose of the system matrix, which is, by the way, the reason the transpose of a matrix is also called its adjoint.
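As a rough illustration, here is a minimal NumPy sketch of that recipe. The concrete parameterization A(θ) = A0 + Σ θ_p A_p, b(θ) = b0 + Bθ, the quadratic loss, and all variable names are made up for this example; the point is only the sequence of steps: one forward solve, one adjoint solve with the transposed matrix, and an assembly of the gradient, checked here against finite differences.

```python
import numpy as np

# Minimal sketch of the adjoint recipe for A(theta) x = b(theta).
# The parameterization of A and b, the loss, and all names are illustrative only.
rng = np.random.default_rng(0)
N, P = 5, 3                                            # state and parameter dimensions

A0 = rng.normal(size=(N, N)) + 10.0 * np.eye(N)        # keep A(theta) well conditioned
A_parts = [rng.normal(size=(N, N)) for _ in range(P)]  # dA/dtheta_p
b0 = rng.normal(size=N)
B = rng.normal(size=(N, P))                            # db/dtheta
x_obs = rng.normal(size=N)
theta = rng.normal(size=P)

def assemble(th):
    A = A0 + sum(t * Ap for t, Ap in zip(th, A_parts))
    return A, b0 + B @ th

def loss(x):
    return 0.5 * np.sum((x - x_obs) ** 2)              # J(x); dJ/dx = (x - x_obs)^T

# 1) Forward solve: A x = b
A, b = assemble(theta)
x = np.linalg.solve(A, b)

# 2) Adjoint solve with the transposed system matrix: A^T lambda = (dJ/dx)^T
lam = np.linalg.solve(A.T, x - x_obs)

# 3) Assemble the gradient: dJ/dtheta_p = lambda^T (db/dtheta_p - (dA/dtheta_p) x)
grad = np.array([lam @ (B[:, p] - A_parts[p] @ x) for p in range(P)])

# Finite-difference check
eps = 1e-6
fd = np.array([
    (loss(np.linalg.solve(*assemble(theta + eps * np.eye(P)[p]))) - loss(x)) / eps
    for p in range(P)
])
print(grad)
print(fd)  # should agree with grad to several digits
```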


Timestamps:
00:00 Introduction
01:50 Sensitivities
03:04 Implicit Relations vs. Automatic Differentiation
03:57 Dimensions of the variables
04:22 A (scalar-valued) loss function
05:00 Example for a loss function
05:39 Solution also depends on parameters
06:14 Gradient as Total Derivative
07:37 Gradient is a row vector
09:05 The difficult quantity
09:39 Implicit Derivation
11:06 A naive approach
13:41 Problem of the naive approach
16:49 Remedy: Adjoint Method
19:04 Clever Bracketing
19:54 The adjoint variable
21:18 The adjoint system
22:13 Similar Complexity
22:44 Dimension of the adjoint
23:24 Strategy for loss gradient
25:03 Important finding
25:52 When to use adjoint?
26:33 How to get the other derivatives?
27:55 Outlook: Nested linear systems
28:09 Outro
Comments

This is great! Working on a package that uses the adjoint method for implicit differentiation and this video explained things really well :D

SultanAlHassanieh

Amazing! I was confused while reading the Neural ODEs paper about how exactly they do the adjoint method, and your two videos on this helped a lot in understanding it better. Much appreciated! I wonder if it is possible to derive an explicit system of ODEs that can model a given simple ResNet or feedforward network.

MrRajiv

Cool! Is there an example or code for discrete adjoint CFD shape optimization using automatic differentiation, like SU2, but an instructional version?

yixincfd

Thank you so much for providing such a wonderful series, it is really insightful! Would you be so kind as to elaborate a bit on two aspects?

1. The original computational issue with dx/dθ (starting at 14:20). It is understandable that the naive calculation of dx/dθ is much more complicated than solving x = A^{-1}b, since dx/dθ involves a matrix-matrix product while the latter is just a matrix-vector product. But what puzzles me a bit is how to view it as a "batch linear system of equations": in my naive thinking, since the inverse A^{-1}(θ) only needs to be calculated once, that step is typically O(N^3) and does not involve P at all. The rest is a multiplication of an N×N matrix with the N×P matrix in the bracket, and only there does the dimensionality of θ start to affect the complexity. This does not seem to explain where the O(PN^3) at 15:55 comes from.

2. It seems the key insight is simply to switch the order of computation by evaluating the vector-Jacobian product first, which mitigates the complexity of the original Jacobian-vector product formulation. If that is the case, what is the purpose of defining the additional "adjoint" variable in this specific system? It seems all it does is remind us to compute a vector-Jacobian product instead of a Jacobian-vector product, which could also be achieved by a careful ordering of the matrix multiplications.

willtsing
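A small NumPy sketch of the two bracketings this question refers to (all sizes and names are made up): both orders give the same gradient, but the naive order solves a batch system with P right-hand sides, while the adjoint order solves a single system with the transposed matrix, independent of P.

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 400, 300
A = rng.normal(size=(N, N)) + 10.0 * np.eye(N)
rhs = rng.normal(size=(N, P))      # stands in for db/dtheta - (dA/dtheta) x
dJdx = rng.normal(size=N)          # the row vector dJ/dx

# Naive bracketing: first dx/dtheta = A^{-1} rhs (a batch solve with P columns),
# then contract with dJ/dx.
grad_naive = dJdx @ np.linalg.solve(A, rhs)

# Adjoint bracketing: one solve with A^T for lambda, then a cheap matrix-vector product.
lam = np.linalg.solve(A.T, dJdx)
grad_adjoint = lam @ rhs

print(np.allclose(grad_naive, grad_adjoint))  # True: only the cost differs
```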

The video is excellent. Can you suggest how to use the adjoint method for FEA static problems when the stiffness matrix depends on a large number of parameters? Are there any available MATLAB codes for the adjoint method for linear systems?

navaneethavg

Hi. I have a question regarding the use of the Leibniz rule. When you formulate the problem as a constrained optimization with an ODE constraint, you take the total derivative of the Lagrangian with respect to the parameter \theta. By the Leibniz rule, you brought the derivative d/d\theta inside the integral, but when it is brought inside, it should become a partial derivative with respect to \theta, should it not?

tuo

I had a question about the equation dx/dθ = A^{-1}(db/dθ - (dA/dθ)x). Ideally, dx/dθ should be a function of θ alone (with no remaining x-terms). However, the x multiplying dA/dθ does not cancel out, leaving dx/dθ in terms of both x and θ. How do we deal with this? Are we supposed to use observed values (xObs) in place of a purely symbolic x here? If x remains in the solution, it is not very useful. Thanks.

AnthonyDeesePhD

Hello, thanks for your video. I still have one question: could you please give an example of how to get ∂J/∂θ? J is a function of θ, right?

Recordingization

Do you have any video about second- or arbitrary-order adjoints?

juandavidnavarro

For a FEM simulation, can the parameters be the mesh nodes or some matrices built from the mesh nodes?

kapilkhanal

I have a question regarding your statement that lambda needs to be calculated only once. Lambda depends on A, which depends on theta. Usually, theta is found iteratively for optimization problems, which means that theta changes in each iteration, which also causes A to change, and hence lambda. Of course, if A doesn't depend on theta, then we don't have this problem. What do you think?

usefulknowledge

I still don't understand the relationship between θ and A. Could you please provide a simple example? For instance, can I take A(θ) to be the rotation matrix with rows [cos(θ), 0, sin(θ)], [0, 1, 0], and [-sin(θ), 0, cos(θ)]?

Recordingization

Solving a batch of linear equations does not take O(PN^3). You do the LU decomposition once, which is O(2/3 N^3), and then for each of the P columns you do a forward and a backward substitution, which is O(PN^2) in total.

prateekpatel
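A minimal sketch of that factor-once argument, assuming SciPy is available (scipy.linalg.lu_factor / lu_solve); the matrix and right-hand sides are made up:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(2)
N, P = 500, 200
A = rng.normal(size=(N, N)) + 10.0 * np.eye(N)
RHS = rng.normal(size=(N, P))      # P right-hand sides

lu, piv = lu_factor(A)             # LU factorization once: O(2/3 N^3)
X = lu_solve((lu, piv), RHS)       # forward/backward substitution per column: O(P N^2)

print(np.allclose(A @ X, RHS))     # True
```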

Can you point me to resources on this?

usefulknowledge