What is my misconception about these two Matrix Calculus formulas for the differential?

If $X$ is a matrix of variables, $g(X)$ is a scalar-valued function of $X$, and $\langle \cdot,\ \cdot \rangle_F$ is the Frobenius Inner Product, then $dg = \langle \nabla g,\ dX \rangle_F$. Some examples I've seen derived are $d(\|X\|_F) = \left\langle \frac{X}{\|X\|_F},\ dX \right\rangle_F$ and $d(\vec v^T X \vec w) = \langle \vec v \vec w^T,\ dX \rangle_F$.
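As a quick numeric sanity check of these two example formulas (a minimal sketch assuming NumPy; the test matrix, vectors, and perturbation are arbitrary choices, not from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 2))          # arbitrary non-square test matrix
dX = 1e-6 * rng.standard_normal((3, 2))  # small, independent perturbations
v, w = rng.standard_normal(3), rng.standard_normal(2)

frob = lambda A, B: np.sum(A * B)        # Frobenius inner product

# d(||X||_F) vs. <X/||X||_F, dX>_F  (np.linalg.norm on a matrix defaults to Frobenius)
print(np.linalg.norm(X + dX) - np.linalg.norm(X), frob(X / np.linalg.norm(X), dX))

# d(v^T X w) vs. <v w^T, dX>_F
print(v @ (X + dX) @ w - v @ X @ w, frob(np.outer(v, w), dX))
```

Both pairs agree to within the size of the second-order terms.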

In general (and thus in particular if $g(X)$ is still scalar-valued), if $J(g)$ is the Jacobian Matrix for $g(X)$, then $dg = J(g)\,dX$.

However, the Frobenius Inner Product returns a scalar, so the RHS of the first equation, and thus its LHS, must be a scalar. Meanwhile, for all $X$ which are "non-trivial matrices" ($2\times 2$ or larger), each of $J(g)$ and $dX$ should also be a non-trivial matrix, and since the matrix multiplication of two non-trivial matrices is never a scalar, the RHS of the second equation, and thus its LHS, must not be a scalar. Therefore $dg$ is both a scalar and not a scalar.

I have not referenced the fact that $\nabla g = (J(g))^T$. This is also true of course, but I can't even get the shapes to make sense. The problem seems to be that the Frobenius Inner Product returns a very different shape than matrix multiplication. What is my misconception here?
 
and since the matrix multiplication of two non-trivial matrices is never a scalar
Where does matrix multiplication come from? You call $X$ a matrix, but only use it as a vector in your formulae.
 
I am definitely thinking of $X$ as $2\times 2$ or larger and not necessarily square. I did forget to specify that by $\nabla$, I mean the more general version of the Gradient used in Matrix Calculus, which returns a matrix if appropriate, so that may be the issue.
 
You can think of $X$ as higher order tensors for all I care. But I still don't see where in your post you actually used matrix properties of $X$, as opposed to "plain" vector properties?
 
If you frame my post as an (obviously hopeless) attempt to show a contradiction between the two formulas, then I have used matrix properties of $X$ where I say "since the matrix multiplication of two non-trivial matrices is never a scalar, the RHS of the second equation, and thus its LHS, must not be a scalar." For example, because $X$ is a non-trivial matrix, it can be inferred that $dX$ is also a non-trivial matrix, and (unlike for vectors) there is nothing of any shape that could multiply $dX$ on the left which would output a scalar.
 
If $X$ is a matrix of variables, $g(X)$ is a scalar-valued function of $X$, ...

e.g. $g = \det$

... and $\langle \cdot,\ \cdot \rangle_F$ is the Frobenius Inner Product, ...

which means $\bigl\langle X, Y \bigr\rangle_F = \sum_{ij} X_{ij} Y_{ij}$

... then $dg = \langle \nabla g,\ dX \rangle_F$.

What do you mean by $dX$? If $g : \mathbb{R}^{n^2} \to \mathbb{R}$ then $dg : \mathbb{R}^{n^2} \to \mathbb{R}$. What is the difference between $dg$ and $\nabla g$? How can the RHS depend on $X$ whereas the LHS does not?
 
I have used matrix properties of $X$ where I say "since the matrix multiplication of two non-trivial matrices is never a scalar"
Yes, you said that, but all your formulae treat $X$ as a "plain" vector.
 
If the problem purported is that my use of $X$ is inconsistent with $X$ being a matrix, then I think this is just false. The Frobenius Inner Product takes matrices as input, the Frobenius Norm takes matrices as input, I clarified that I'm using the Matrix Calculus extension of $\nabla$, $\vec v^T X \vec w$ is a bilinear form, etc. If you think I have still made a mistake, feel free to pinpoint it.

If the problem purported is that my use of $X$ does not require $X$ to be a matrix, then I think this would be irrelevant even if true. The formulas are supposed to hold for all shapes of $X$ (at least up to matrices, if not higher order tensors). Therefore, they should hold for whatever particular shape I think of $X$ having.
 
e.g. $g = \det$
Indeed
which means $\bigl\langle X, Y \bigr\rangle_F = \sum_{ij} X_{ij} Y_{ij}$
Indeed
What do you mean by $dX$? If $g : \mathbb{R}^{n^2} \to \mathbb{R}$ then $dg : \mathbb{R}^{n^2} \to \mathbb{R}$. What is the difference between $dg$ and $\nabla g$? How can the RHS depend on $X$ whereas the LHS does not?
By $dX$, I mean the differential of the matrix $X$. Similarly, $dg$ is the differential of $g(X)$, whereas $\nabla g$ is the Gradient of $g(X)$. Both the RHS and LHS of both equations depend on $X$; I just left the dependence of $g$ on $X$ implicit in the formulas since I had specified it earlier.
 
Still, what is $dX$? Is it the matrix with $x_{ij}$ at position $(i,j)$ and zero elsewhere, i.e. the Jacobi matrix of its coordinate functions? What are the variables?

$g$ is a function from a Euclidean space into the field of (I assume) real numbers, which would save us from dealing with conjugates. This means the differential $dg$ of $g$ is the same, a function from the Euclidean space to the real numbers, the Jacobi matrix, only with the size $n^2 \times 1$. I assume we can write it as $dg = \bigl\langle \nabla g, X \bigr\rangle_F$ since it is all about arrangement and nothing seriously happens.

Maybe I didn't get your question, and an example would be helpful. Say we use $n = 2$, the variables $x, y, u, v$ to avoid confusion with the indices of $X$, and the determinant
$$g(X) = \det(X) = \det\begin{pmatrix} X_{11} & X_{12}\\ X_{21} & X_{22}\end{pmatrix}.$$ Then
$$g(x,y,u,v) = g\begin{pmatrix} x & y\\ u & v\end{pmatrix} = xv - yu$$ and
$$J_X(g) = \nabla_X(g) = \left(\dfrac{\partial g}{\partial x}(X)\, , \,\dfrac{\partial g}{\partial y}(X)\, , \,\dfrac{\partial g}{\partial u}(X)\, , \,\dfrac{\partial g}{\partial v}(X)\right) = \left(X_{22}\, , \,-X_{21}\, , \,-X_{12}\, , \,X_{11}\right).$$
Do we agree so far?
 
As I understand, $dX$ is a matrix shaped the same as $X$ with elements each approaching $0$, possibly independently (i.e., a computer approximation might assign the elements i.i.d. randomly). We can indeed assume that $g$ is a function from matrices (with real elements) to real numbers, but the equation should be $dg = \bigl\langle \nabla g,\ dX \bigr\rangle_F$ rather than $dg = \bigl\langle \nabla g,\ X \bigr\rangle_F$.

The example seems close to right so far; $X = \begin{pmatrix} x & y\\ u & v\end{pmatrix}$ and $g(X) = \det(X) = xv - yu$, but I think it should be that the Jacobian and the (generalized) Gradient are each other's transposes. In particular I have that $\nabla_X(g) = \det(X)(X^{-1})^T$, so I guess $J_X(g)$ would be $\det(X)X^{-1}$.
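Both descriptions of the gradient of $\det$ can be spot-checked numerically (a sketch assuming NumPy; the matrix and step size are arbitrary): the entrywise finite-difference gradient matches $\det(X)(X^{-1})^T$, which for the $2\times 2$ case is the cofactor matrix $\begin{pmatrix} v & -u\\ -y & x\end{pmatrix}$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((2, 2))
h = 1e-7

# entrywise finite-difference gradient of det at X
G = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        E = np.zeros((2, 2))
        E[i, j] = h
        G[i, j] = (np.linalg.det(X + E) - np.linalg.det(X)) / h

print(G)                                      # numeric gradient
print(np.linalg.det(X) * np.linalg.inv(X).T)  # det(X) (X^{-1})^T, the same matrix
```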
 
As I understand, $dX$ is a matrix shaped the same as $X$ with elements each approaching $0$, possibly independently (i.e., a computer approximation might assign the elements i.i.d. randomly).

The shape is not the question. The variables are! If you expect $dX$ not to be straight away zero, then its matrix entries have to be functions with variables along which we can differentiate them. One possibility is to consider each matrix entry to be a coordinate function:
$$X_{11} = x\, , \,X_{12} = y\, , \,X_{21} = u\, , \,X_{22} = v.$$ Another possibility would be to consider the domain from which the matrices are taken as a manifold and consider paths on this manifold. In this case, we have
$$X_{11} = x(t)\, , \,X_{12} = y(t)\, , \,X_{21} = u(t)\, , \,X_{22} = v(t)$$ where $t \mapsto X = X(t)$ is the parameterization of such a path and the parameter $t$ the variable along which we differentiate. In this case
$$dX = \begin{pmatrix} x'(t) & y'(t)\\ u'(t) & v'(t)\end{pmatrix} dt.$$
However, I suspect that we have the first case here: four variables $x, y, u, v$, the coordinates of the matrix.
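For the second, path-based reading, a small sketch (assuming NumPy; the path below is a hypothetical choice, purely for illustration) shows $\frac{d}{dt}\det(X(t))$ agreeing with the gradient paired against $X'(t)$, i.e. with $dX = X'(t)\,dt$:

```python
import numpy as np

# a hypothetical path t -> X(t) and its entrywise derivative X'(t)
X  = lambda t: np.array([[np.cos(t), t**2], [1.0, np.exp(t)]])
dX = lambda t: np.array([[-np.sin(t), 2*t], [0.0, np.exp(t)]])

t, h = 0.7, 1e-7
lhs = (np.linalg.det(X(t + h)) - np.linalg.det(X(t))) / h  # d/dt det(X(t))

grad = np.linalg.det(X(t)) * np.linalg.inv(X(t)).T         # gradient of det at X(t)
rhs = np.sum(grad * dX(t))                                 # <grad, X'(t)>_F
print(lhs, rhs)                                            # agree to ~1e-7
```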

We can indeed assume that $g$ is a function from matrices (with real elements) to real numbers, but the equation should be $dg = \bigl\langle \nabla g,\ dX \bigr\rangle_F$ rather than $dg = \bigl\langle \nabla g,\ X \bigr\rangle_F$
I do not understand this formula without knowing what $g$ and $dX$ are.
The example seems close to right so far; $X = \begin{pmatrix} x & y\\ u & v\end{pmatrix}$ and $g(X) = \det(X) = xv - yu$, but I think it should be that the Jacobian and the (generalized) Gradient are each other's transposes. In particular I have that $\nabla_X(g) = \det(X)(X^{-1})^T$, so I guess $J_X(g)$ would be $\det(X)X^{-1}$.

You have to distinguish between the directions along which we differentiate and the location at which the derivative is evaluated! The gradient, and so the Jacobian, is evaluated at a certain position. Do not use the same letters for variables and location; this is confusing.

Maybe I should have been clearer and should have written

$$J_P(g) = \nabla_P(g) = \left(\dfrac{\partial g}{\partial x}(P)\, , \,\dfrac{\partial g}{\partial y}(P)\, , \,\dfrac{\partial g}{\partial u}(P)\, , \,\dfrac{\partial g}{\partial v}(P)\right) = \left(P_{22}\, , \,-P_{21}\, , \,-P_{12}\, , \,P_{11}\right),$$ where $P$ is the location where we evaluate the derivative, a point $P$.

This means for my example that
$$d_P\det = \bigl\langle \nabla_P(\det), dX \bigr\rangle_F = \left(P_{22}\, , \,-P_{21}\, , \,-P_{12}\, , \,P_{11}\right)\cdot \begin{pmatrix} dx\\ dy\\ du\\ dv\end{pmatrix} = P_{22}\,dx - P_{21}\,dy - P_{12}\,du + P_{11}\,dv.$$ Of course, we can read this as a function of location $P \longmapsto d_P\det$. And then we get to the point where confusion with derivatives often arises. The coordinates $P_{ij}$ of the location become variables again, and people write them as such, neglecting the changed meaning. Therefore the formula becomes
$$X = \begin{pmatrix} x & y\\ u & v\end{pmatrix} \longmapsto d_X(\det) = v\,dx - u\,dy - y\,du + x\,dv.$$
The case $g = \|\,.\,\|_F$ with $\|X\|_F = \sqrt{x^2 + y^2 + u^2 + v^2}$ is probably easier. Let's see.
$$\nabla_P\left(\|\,.\,\|_F\right) = \left(\left.\dfrac{\partial}{\partial x}\right|_P \|\,.\,\|_F\, , \,\left.\dfrac{\partial}{\partial y}\right|_P \|\,.\,\|_F\, , \,\left.\dfrac{\partial}{\partial u}\right|_P \|\,.\,\|_F\, , \,\left.\dfrac{\partial}{\partial v}\right|_P \|\,.\,\|_F\right) = \frac{1}{\|P\|_F}\left(P_{11}\, , \,P_{12}\, , \,P_{21}\, , \,P_{22}\right)$$ and
$$d_P\left(\|\,.\,\|_F\right) = \bigl\langle \nabla_P\left(\|\,.\,\|_F\right), dX \bigr\rangle_F = \frac{1}{\|P\|_F}\left(P_{11}\, , \,P_{12}\, , \,P_{21}\, , \,P_{22}\right)\begin{pmatrix} dx\\ dy\\ du\\ dv\end{pmatrix} = \frac{P_{11}\,dx + P_{12}\,dy + P_{21}\,du + P_{22}\,dv}{\|P\|_F}.$$
Here the point of evaluation does matter: the gradient at $P$ is $P/\|P\|_F$, which recovers the formula $d(\|X\|_F) = \langle X/\|X\|_F,\ dX \rangle_F$ from the original post.
 
I think I have proceeded a bit differently (without vectorizing as you appear to), but I did get the same answer for the Frobenius Inner Product. Zooming out a bit, I've made some progress on an answer to the original question. My two formulas for $dg$ seem to differ only by composition with a trace. In other words, $dg = \langle \nabla g,\ dX \rangle_F$ and $dg = \operatorname{trace}(J(g)\,dX)$ agree! The only mystery now is that I have not seen $dg = \operatorname{trace}(J(g)\,dX)$ explicated as a correct formula, whereas I have seen $dg = J(g)\,dX$.
 
Where do you see a trace? The Frobenius product considers all matrix entries. Traces are obtained if we differentiate the determinant and evaluate it at the identity matrix; see my formula in post #12. If
$$P = I = \begin{pmatrix} 1 & 0\\ 0 & 1\end{pmatrix}$$ then
$$d_I(\det) = 1\cdot dx - 0\cdot dy - 0\cdot du + 1\cdot dv = dx + dv = \operatorname{trace}(dX).$$
This is why we get the tangential space of $\operatorname{SL}(2) = \left\{X \in \mathbb{M}(2,\mathbb{R})\,|\,\det(X) = 1\right\}$ as
$$\mathfrak{sl}(2) = \left\{dX \in \mathbb{M}(2,\mathbb{R})\,|\,d_I(\det(X)) = d(1) = 0 = \operatorname{trace}(dX)\right\},$$ the vector space of $2\times 2$ matrices with vanishing trace.
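A one-line numeric check of the identity-matrix case (a sketch assuming NumPy; $A$ and $t$ are arbitrary): for small $t$, $\det(I + tA) \approx 1 + t\operatorname{trace}(A)$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 2))
t = 1e-7

# d/dt det(I + tA) at t = 0 should equal trace(A)
print((np.linalg.det(np.eye(2) + t * A) - 1.0) / t, np.trace(A))
```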
 
I don't think the trace is tied to the determinant in this specific example. It seems to appear generally (at least for all square $X$), and can be seen with very little reference to the underlying calculus problem. Say we have some $X$ and $g(X)$ as before. If $X$ is square, then the Jacobian and Gradient are each also square, all three of the same size (the Gradient is the Matrix Calculus version of the Gradient as defined here, which is the transpose of the Jacobian). Call the Jacobian $J_X(g) = \begin{pmatrix} J_{11} & J_{12}\\ J_{21} & J_{22}\end{pmatrix}$ and the differential $dX = \begin{pmatrix} dX_{11} & dX_{12}\\ dX_{21} & dX_{22}\end{pmatrix}$.

The first formula is $dg = \bigl\langle \nabla_X(g),\ dX \bigr\rangle_F$. Transposing the Jacobian and then taking the Frobenius Inner Product yields $dg = J_{11}\,dX_{11} + J_{12}\,dX_{21} + J_{21}\,dX_{12} + J_{22}\,dX_{22}$.

The second formula is $dg = J_X(g)\,dX$. Carrying out the matrix multiplication yields $$\begin{pmatrix} J_{11}\,dX_{11} + J_{12}\,dX_{21} & J_{11}\,dX_{12} + J_{12}\,dX_{22}\\ J_{21}\,dX_{11} + J_{22}\,dX_{21} & J_{21}\,dX_{12} + J_{22}\,dX_{22}\end{pmatrix}.$$ The first result appears as the trace of this second result.
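This observation, $\bigl\langle \nabla g,\ dX \bigr\rangle_F = \operatorname{trace}(J(g)\,dX)$ with $J(g) = (\nabla g)^T$, holds for any square matrices and can be spot-checked directly (a sketch assuming NumPy; the two random matrices merely stand in for the gradient and the differential):

```python
import numpy as np

rng = np.random.default_rng(3)
grad = rng.standard_normal((2, 2))  # stands in for the gradient matrix
dX   = rng.standard_normal((2, 2))  # stands in for the differential
J = grad.T                          # Jacobian as the gradient's transpose

print(np.sum(grad * dX))            # <grad, dX>_F
print(np.trace(J @ dX))             # trace(J dX): the same number
```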
 
The link helps a lot to clarify the language that is used in your source. I wouldn't write $df = f(x+dx) - f(x)$ since it is a bit sloppy, but OK.

Now that I can look up what you actually mean, can we restart the discussion? What is your question? To answer the only sentence with a question mark in your post #1 ...
The problem seems to be that the Frobenius Inner Product returns a very different shape than matrix multiplication. What is my misconception here?
... I did the following calculation to verify the formula $\bigl\langle A, B \bigr\rangle_F = \operatorname{tr}(A^TB)$.

$$\begin{array}{ll} (A^TB)_{ij} &= \displaystyle{\sum_{k=1}^n (A^T)_{ik} B_{kj} = \sum_{k=1}^n A_{ki} B_{kj}}\\[12pt] \operatorname{tr}(A^TB) &= \displaystyle{\sum_{m=1}^n (A^TB)_{mm} = \sum_{m=1}^n \left(\sum_{k=1}^n A_{km} B_{km}\right) = \bigl\langle A, B \bigr\rangle_F} \end{array}$$
The notation $\operatorname{vec}(A)$ is simply the vector we obtain from reading the matrix from left to right, row by row, e.g. $\operatorname{vec}(X) = (x, y, u, v)$ in case $X = \begin{pmatrix} x & y\\ u & v\end{pmatrix}$.
The Frobenius product of two matrices $A, B$ is thus the vector product $\operatorname{vec}(A)^T \cdot \operatorname{vec}(B)$ because the vector product matches the indices:
$$\vec{v}^T \cdot \vec{w} = \sum_{k=1}^{n} v_k w_k$$ or in our case
$$\operatorname{vec}(A)^T \cdot \operatorname{vec}(B) = \sum_{(i,j)=(1,1)}^{(n,n)} A_{ij} B_{ij}.$$ Please note that the "T" on a matrix means the transposed matrix, whereas the "T" on a vector only means that we write it as a row. Vectors without that "T" are column vectors.

Does that answer your question about the connection between the ordinary matrix product and the Frobenius product?
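The chain of identities $\bigl\langle A, B \bigr\rangle_F = \operatorname{tr}(A^TB) = \operatorname{vec}(A)^T \operatorname{vec}(B)$ in a short sketch (assuming NumPy; note that NumPy's `flatten` is row-major by default, matching the row-by-row vec used above):

```python
import numpy as np

rng = np.random.default_rng(4)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))

print(np.sum(A * B))              # <A, B>_F
print(np.trace(A.T @ B))          # tr(A^T B)
print(A.flatten() @ B.flatten())  # vec(A)^T vec(B), row-by-row vec
```

All three print the same number.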
 
Now that I can look up what you actually mean, can we restart the discussion? What is your question?
Okay, I think we have most of the grounding now. Starting again, we have two formulas, $dg = \bigl\langle \nabla_X(g),\ dX \bigr\rangle_F$ and $dg = J_X(g)\,dX$ (the latter being the generalization of $df = f'(x)\,dx$ from ordinary calculus).

What I believe you have shown is that $dg = \bigl\langle \nabla_X(g),\ dX \bigr\rangle_F$ is equivalent to $dg = \operatorname{vec}(J_X(g))^T\,\operatorname{vec}(dX)$, where I'm amending your definition of vectorization to stacking the columns of a matrix into a column vector (I suspect this is what you wanted anyhow, or else some of your vector products appear to be outer products rather than inner products). I think the last two subquestions to answer the main question are...

1) What justifies vectorization in this way? I could buy the idea of vectorizing both sides of an equation, i.e., $dg = J_X(g)\,dX\ \implies\ \operatorname{vec}(dg) = \operatorname{vec}(J_X(g)\,dX)$. I've done this in other contexts. But $dg = J_X(g)\,dX\ \implies\ dg = \operatorname{vec}(J_X(g))^T\,\operatorname{vec}(dX)$ is mysterious.

2) We still have the apparent contradiction that $dg$ is both a scalar and a matrix. Even vectorizing $dg$ wouldn't make it a scalar as needed. Are there two genuinely different concepts of $dg$ here, and if so, how is each to be interpreted?
 

It is the same resulting number, only written differently.

$$\bigl\langle A, B \bigr\rangle_F = \sum_{i=1}^n \sum_{j=1}^n A_{ij} B_{ij}$$
and
$$\begin{array}{ll} \operatorname{vec}(A)^T \cdot \operatorname{vec}(B) &= (A_{11}, A_{12}, \ldots, A_{1n}, A_{21}, A_{22}, \ldots, A_{2n}, \ldots, A_{n1}, A_{n2}, \ldots, A_{nn}) \cdot \begin{pmatrix} B_{11}\\ B_{12}\\ \vdots\\ B_{1n}\\ \vdots\\ \vdots\\ B_{nn}\end{pmatrix} \\[12pt] &= A_{11}B_{11} + A_{12}B_{12} + \ldots + A_{1n}B_{1n} + \ldots + A_{n1}B_{n1} + A_{n2}B_{n2} + \ldots + A_{nn}B_{nn} \\[12pt] &= \displaystyle{\sum_{i=1}^n \sum_{j=1}^n A_{ij} B_{ij}} \end{array}$$
2) We still have the apparent contradiction that $dg$ is both a scalar and a matrix. Even vectorizing $dg$ wouldn't make it a scalar as needed. Are there two genuinely different concepts of $dg$ here, and if so, how is each to be interpreted?

Yes, but this is only because the notation isn't stringent. We have to decide what $dg$ means!

If we have a differentiable function $g : \mathbb{R}^{n^2} \longrightarrow \mathbb{R}$, then $dg$ is usually the linear function from one tangent space to the other, the derivative or the differential form here; in our case
$$dg : T_p\left(\mathbb{R}^{n^2}\right) = \mathbb{R}^{n^2} \longrightarrow T_{g(p)}\left(\mathbb{R}\right) = \mathbb{R}.$$ This means that $dg$ is a linear transformation from an $n^2$-dimensional vector space into a one-dimensional vector space.
Hence, we can write it as an $n^2 \times 1$ matrix, or rearranged as an $n\times n$ matrix. This is a matter of convenience and, given that this is about programming, a matter of index management.

Now, why can it be seen as a scalar? Well, it isn't, but the images under $dg$ are scalars. If we evaluate the derivative at a certain point, $g(p)$, and apply a direction - note that derivatives are always directional - say the direction $dX$, which lives in the $n^2$-dimensional vector space where our matrices live, then we get
$$D_{g(p)}(g)\cdot dX = \bigl\langle \nabla_{g(p)}\, , \,dX \bigr\rangle \in \mathbb{R}.$$
Abbreviating the derivative $D$ of $g$ at the location $g(p)$ in direction $dX$ by simply writing it $dg = D_{g(p)}(g)$, or even $dg = D_{g(p)}(g)\cdot dX$, is sloppy. It should at least be something like $dg(X)$ or $d_X g$.

In the end, it is the same question as whether a notation like $f(x)$ means the function or the resulting number.
Look at the beginning of section 5.1 in your source where they defined $df$. They have smuggled a $[dX]$ into the end of the line, indicating the multiplication with $dX$, which makes it a number.

Ask 3 scientists about what a derivative is and you get 5 different answers and 8 different notations, at least.
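One way to hold the two readings of $dg$ apart is to code them (a sketch assuming NumPy, using $g = \det$ as above): at a point $P$, the differential is a linear map on directions; only its image at a particular direction $dX$ is a scalar.

```python
import numpy as np

def dg_at(P):
    """The differential of det at the point P: a linear map on directions."""
    grad = np.linalg.det(P) * np.linalg.inv(P).T
    return lambda dX: np.sum(grad * dX)  # direction -> scalar

rng = np.random.default_rng(5)
P, dX = rng.standard_normal((2, 2)), rng.standard_normal((2, 2))

dg = dg_at(P)  # dg itself is a function (the linear map)
print(dg(dX))  # its value at a direction is a scalar
```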
 

Alright, I think I understand it about as well as it can be understood. Thank you for the help!
 