What is my misconception about these two Matrix Calculus formulas for the differential?

If [imath]X[/imath] is a matrix of variables, [imath]g(X)[/imath] is a scalar-valued function of [imath]X[/imath], and [imath]<\cdot,\ \cdot>_F[/imath] is the Frobenius Inner Product, then [imath]dg\ =\ <\nabla g,\ dX>_F[/imath]. Some examples I've seen derived are [imath]d(||X||_F)\ =\ <\frac{X}{||X||_F},\ dX>_F[/imath] and [imath]d(\vec v^T X \vec w)\ =\ <\vec v \vec w^T,\ dX>_F[/imath].
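Both examples are easy to sanity-check numerically. A minimal NumPy sketch (the random test data and the helper frob are my own illustrative choices), comparing each finite difference against its claimed gradient formula:

[code]
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 2))           # a non-square matrix of variables
dX = 1e-6 * rng.standard_normal((3, 2))   # a small perturbation playing the role of dX
v = rng.standard_normal(3)
w = rng.standard_normal(2)

def frob(A, B):
    # Frobenius inner product <A, B>_F = sum_ij A_ij B_ij
    return np.sum(A * B)

# d(||X||_F) ~ <X/||X||_F, dX>_F, to first order in dX
print(np.linalg.norm(X + dX) - np.linalg.norm(X))   # finite difference
print(frob(X / np.linalg.norm(X), dX))              # gradient formula

# d(v^T X w) ~ <v w^T, dX>_F, exact here since v^T X w is linear in X
print(v @ (X + dX) @ w - v @ X @ w)                 # finite difference
print(frob(np.outer(v, w), dX))                     # gradient formula
[/code]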

In general (and thus if [imath]g(X)[/imath] is still scalar-valued in particular), if [imath]J(g)[/imath] is the Jacobian Matrix for [imath]g(X)[/imath], then [imath]dg\ =\ J(g)dX[/imath].

However, the Frobenius Inner Product returns a scalar, so the RHS of the first equation, and thus its LHS, must be a scalar. Meanwhile, for any [imath]X[/imath] which is a "non-trivial matrix" ([imath]2[/imath] by [imath]2[/imath] or larger), each of [imath]J(g)[/imath] and [imath]dX[/imath] should also be a non-trivial matrix, and since the matrix multiplication of two non-trivial matrices is never a scalar, the RHS of the second equation, and thus its LHS, must not be a scalar. Therefore [imath]dg[/imath] is both a scalar and not a scalar.

I have not referenced the fact that [imath]\nabla g = (J(g))^T[/imath]. This is also true of course, but I can't even get the shapes to make sense. The problem seems to be that the Frobenius Inner Product returns a very different shape than matrix multiplication. What is my misconception here?
 
and since the matrix multiplication of two non-trivial matrices is never a scalar
Where does matrix multiplication come from? You call [imath]X[/imath] a matrix, but only use it as a vector in your formulae.
 
I am definitely thinking of [imath]X[/imath] as [imath]2[/imath] by [imath]2[/imath] or larger and not necessarily square. I did forget to specify that by [imath]\nabla[/imath], I mean the more general version of the Gradient used in Matrix Calculus which returns a matrix if appropriate, so that may be the issue.
 
You can think of [imath]X[/imath] as higher order tensors for all I care. But I still don't see where in your post you actually used matrix properties of [imath]X[/imath], as opposed to "plain" vector properties?
 
If you frame my post as an (obviously hopeless) attempt to show a contradiction between the two formulas, then I have used matrix properties of [imath]X[/imath] where I say "since the matrix multiplication of two non-trivial matrices is never a scalar, the RHS of the second equation, and thus its LHS, must not be a scalar." For example, because [imath]X[/imath] is a non-trivial matrix, it can be inferred that [imath]dX[/imath] is also a non-trivial matrix, and (unlike for vectors) there is nothing of any shape that could multiply [imath]dX[/imath] on the left which would output a scalar.
 
If [imath]X[/imath] is a matrix of variables, [imath]g(X)[/imath] is a scalar-valued function of [imath]X[/imath], ...

e.g. [imath] g=\det [/imath]

... and [imath]<\cdot,\ \cdot>_F[/imath] is the Frobenius Inner Product, ...

which means [imath] \bigl\langle X,Y \bigr\rangle_F=\sum_{ij}X_{ij}Y_{ij} [/imath]

... then [imath]dg\ =\ <\nabla g,\ dX>_F[/imath].

What do you mean by [imath] dX [/imath]? If [imath] g\, : \,\mathbb{R}^{n^2}\to \mathbb{R} [/imath] then [imath] dg\, : \,\mathbb{R}^{n^2}\to \mathbb{R} [/imath]. What is the difference between [imath] dg [/imath] and [imath] \nabla g [/imath]? How can the RHS depend on [imath] X [/imath] whereas the LHS does not?
 
I have used matrix properties of [imath]X[/imath] where I say "since the matrix multiplication of two non-trivial matrices is never a scalar"
Yes, you said that, but all your formulae treat [imath]X[/imath] as a "plain" vector.
 
If the purported problem is that my use of [imath]X[/imath] is inconsistent with [imath]X[/imath] being a matrix, then I think this is just false. The Frobenius Inner Product takes matrices as input, the Frobenius Norm takes matrices as input, I clarified that I'm using the Matrix Calculus extension of [imath]\nabla[/imath], [imath]\vec v^T X \vec w[/imath] is a bilinear form, etc. If you think I have still made a mistake, feel free to pinpoint it.

If the purported problem is that my use of [imath]X[/imath] does not require [imath]X[/imath] to be a matrix, then I think this would be irrelevant even if true. The formulas are supposed to hold for all shapes of [imath]X[/imath] (at least up to matrices, if not higher order tensors). Therefore, they should hold for whatever particular shape I think of [imath]X[/imath] as having.
 
e.g. [imath] g=\det [/imath]
Indeed
which means [imath] \bigl\langle X,Y \bigr\rangle_F=\sum_{ij}X_{ij}Y_{ij} [/imath]
Indeed
What do you mean by [imath] dX [/imath]? If [imath] g\, : \,\mathbb{R}^{n^2}\to \mathbb{R} [/imath] then [imath] dg\, : \,\mathbb{R}^{n^2}\to \mathbb{R} [/imath]. What is the difference between [imath] dg [/imath] and [imath] \nabla g [/imath]? How can the RHS depend on [imath] X [/imath] whereas the LHS does not?
By [imath]dX[/imath], I mean the differential of the matrix [imath]X[/imath]. Similarly, [imath]dg[/imath] is the differential of [imath]g(X)[/imath], whereas [imath]\nabla g[/imath] is the Gradient of [imath]g(X)[/imath]. Both the RHS and LHS of both equations depend on [imath]X[/imath]; I just left the dependence of [imath]g[/imath] on [imath]X[/imath] implicit in the formulas since I had specified it earlier.
 
Still, what is [imath] dX? [/imath] Is it the matrix with [imath] x_{ij}[/imath] at position [imath] (i,j) [/imath] and zero elsewhere, i.e. the Jacobi matrix of its coordinate functions? What are the variables?

[imath] g [/imath] is a function from a Euclidean space into the field of (I assume) real numbers, which saves us from dealing with conjugates. This means the differential [imath] dg [/imath] of [imath] g [/imath] is the same kind of object, a function from the Euclidean space to the real numbers, given by the Jacobi matrix, here of size [imath] 1\times n^2. [/imath] I assume we can write it as [imath] dg=\bigl\langle \nabla g,X \bigr\rangle_F [/imath] since it is all about arrangement and nothing seriously happens.

Maybe I didn't get your question, and an example would be helpful. Say we use [imath] n=2, [/imath] the variables [imath] x,y,u,v [/imath] to avoid confusion with the indices of [imath] X, [/imath] and the determinant
[math] g(X)=\det(X)=\det\left(\begin{pmatrix}X_{11}&X_{12}\\X_{21}&X_{22}\end{pmatrix}\right) .[/math]Then
[math] g(x,y,u,v)=g\begin{pmatrix}x&y\\u&v\end{pmatrix}= xv-yu [/math]and
[math] J_X(g)=\nabla_X(g)=\left(\dfrac{\partial g}{\partial x}(X)\, , \,\dfrac{\partial g}{\partial y}(X) \, , \, \dfrac{\partial g}{\partial u}(X) \, , \, \dfrac{\partial g}{\partial v}(X)\right) =\left(X_{22}\, , \,-X_{21}\, , \,-X_{12}\, , \,X_{11}\right).[/math]
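That row can be confirmed with finite differences; a quick NumPy sketch (the random point [imath] P [/imath] and the step size are arbitrary choices of mine):

[code]
import numpy as np

rng = np.random.default_rng(1)
P = rng.standard_normal((2, 2))   # the point of evaluation
eps = 1e-6

# finite-difference partials of det at P, one coordinate at a time
grad = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        E = np.zeros((2, 2))
        E[i, j] = eps
        grad[i, j] = (np.linalg.det(P + E) - np.linalg.det(P)) / eps

print(grad.reshape(-1))                                  # numeric partials in order x, y, u, v
print(np.array([P[1, 1], -P[1, 0], -P[0, 1], P[0, 0]]))  # closed form (P22, -P21, -P12, P11)
[/code]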
Do we agree so far?
 
As I understand, [imath]dX[/imath] is a matrix shaped the same as [imath]X[/imath] with elements each approaching [imath]0[/imath], possibly independently (e.g., a computer approximation might assign the elements I.I.D. randomly). We can indeed assume that [imath]g[/imath] is a function from matrices (with real elements) to real numbers, but the equation should be [imath]dg = \bigl\langle \nabla g,\ dX \bigr\rangle_F[/imath] rather than [imath]dg = \bigl\langle \nabla g,\ X \bigr\rangle_F[/imath].

The example seems close to right so far; [imath]X = \begin{pmatrix}x & y \\ u & v\end{pmatrix}[/imath] and [imath]g(X) = \det(X) = xv - yu[/imath], but I think it should be that the Jacobian and the (generalized) Gradient are each other's transposes. In particular I have that [imath]\nabla_X (g) = \det(X)(X^{-1})^T[/imath], so I guess [imath]J_X (g)[/imath] would be [imath]\det(X)X^{-1}[/imath].
 
As I understand, [imath]dX[/imath] is a matrix shaped the same as [imath]X[/imath] with elements each approaching [imath]0[/imath], possibly independently (i.e., a computer approximation might assign the elements I.I.D. randomly).

The shape is not the question. The variables are! If you expect [imath] dX [/imath] not to be straight away zero, then its matrix entries have to be functions with variables along which we can differentiate them. One possibility is to consider each matrix entry to be a coordinate function:
[math] X_{11}=x\, , \,X_{12}=y\, , \,X_{21}=u\, , \,X_{22}=v. [/math]Another possibility would be to consider the domain from which the matrices are taken as a manifold and consider paths on this manifold. In this case, we have
[math] X_{11}=x(t)\, , \,X_{12}=y(t)\, , \,X_{21}=u(t)\, , \,X_{22}=v(t) [/math]where [imath] t\mapsto X=X(t) [/imath] is the parameterization of such a path and the parameter [imath] t [/imath] the variable along which we differentiate. In this case
[math] dX=\begin{pmatrix}x'(t)&y'(t)\\u'(t)&v'(t)\end{pmatrix}dt. [/math]
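To illustrate the path picture in code, a small NumPy sketch (the particular path [imath] X(t) [/imath] is a made-up example) comparing the numeric derivative of [imath] \det(X(t)) [/imath] with the chain rule [imath] \bigl\langle \nabla\det, X'(t) \bigr\rangle_F [/imath]:

[code]
import numpy as np

# a path t -> X(t) and its entrywise derivative X'(t)
X  = lambda t: np.array([[np.cos(t), t], [t**2, np.exp(t)]])
dX = lambda t: np.array([[-np.sin(t), 1.0], [2*t, np.exp(t)]])

t, h = 0.7, 1e-6
P = X(t)
grad = np.array([[P[1, 1], -P[1, 0]],
                 [-P[0, 1], P[0, 0]]])                   # gradient of det at P
print((np.linalg.det(X(t + h)) - np.linalg.det(P)) / h)  # numeric d(det(X(t)))/dt
print(np.sum(grad * dX(t)))                              # chain rule via Frobenius product
[/code]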
However, I suspect that we have the first case here: four variables [imath] x,y,u,v, [/imath] the coordinates of the matrix.

We can indeed assume that [imath]g[/imath] is a function from matrices (with real elements) to real numbers, but the equation should be [imath]dg = \bigl\langle \nabla g,\ dX \bigr\rangle_F[/imath] rather than [imath]dg = \bigl\langle \nabla g,\ X \bigr\rangle_F[/imath]
I do not understand this formula without knowing what [imath] g [/imath] and [imath] dX [/imath] are.
The example seems close to right so far; [imath]X = \begin{pmatrix}x & y \\ u & v\end{pmatrix}[/imath] and [imath]g(X) = \det(X) = xv - yu[/imath], but I think it should be that the Jacobian and the (generalized) Gradient are each other's transposes. In particular I have that [imath]\nabla_X (g) = \det(X)(X^{-1})^T[/imath], so I guess [imath]J_X (g)[/imath] would be [imath]\det(X)X^{-1}[/imath].

You have to distinguish between the directions along which we differentiate and the location at which the derivative is evaluated! The gradient and so the Jacobian is evaluated at a certain position. Do not use the same letters for variables and location, this is confusing.

Maybe I should have been clearer and should have written

[math] J_P(g)=\nabla_P(g)=\left(\dfrac{\partial g}{\partial x}(P)\, , \,\dfrac{\partial g}{\partial y}(P) \, , \, \dfrac{\partial g}{\partial u}(P) \, , \, \dfrac{\partial g}{\partial v}(P)\right) =\left(P_{22}\, , \,-P_{21}\, , \,-P_{12}\, , \,P_{11}\right).[/math]where [imath] P [/imath] is the location where we evaluate the derivative, a point [imath] P. [/imath]

This means for my example that
[math]\begin{array}{lll} d_P\det=\bigl\langle \nabla_P(\det),dX \bigr\rangle_F&= \bigl\langle \nabla_P(\det),dX \bigr\rangle=\left(P_{22}\, , \,-P_{21}\, , \,-P_{12}\, , \,P_{11}\right)\cdot \begin{pmatrix}dx\\dy\\du\\dv\end{pmatrix}=P_{22}dx-P_{21}dy-P_{12}du+P_{11}dv \end{array} .[/math]Of course, we can read this as a function of location [imath] P\longmapsto d_P\det. [/imath] And then we get to the point where confusion with derivatives often arises. The coordinates [imath] P_{ij} [/imath] of the location become variables again, and people write them as such, neglecting the changed meaning. Therefore the formula becomes
[math] X=\begin{pmatrix}x&y\\u&v\end{pmatrix} \longmapsto d_X(\det)=v\,dx-u\,dy-y\,du+x\,dv.[/math]
The case [imath] g=\| \, . \, \|_F [/imath] with [imath] \| \, X \, \|_F=\sqrt{x^2+y^2+u^2+v^2} [/imath] works the same way. Let's see.
[math] \nabla_P(\| \, . \, \|_F)=\left(\left.\dfrac{\partial }{\partial x}\right|_P\| \, . \, \|_F \, , \,\left.\dfrac{\partial }{\partial y}\right|_P\| \, . \, \|_F \, , \,\left.\dfrac{\partial }{\partial u}\right|_P\| \, . \, \|_F \, , \,\left.\dfrac{\partial }{\partial v}\right|_P\| \, . \, \|_F \right)=\dfrac{1}{\| \, P \, \|_F}\left(P_{11}\, , \,P_{12}\, , \,P_{21}\, , \,P_{22}\right) [/math]and
[math] d_P\left(\| \, . \, \|_F\right) =\bigl\langle \nabla_P\left(\| \, . \, \|_F\right),dX \bigr\rangle_F=\dfrac{1}{\| \, P \, \|_F}\left(P_{11}\, , \,P_{12}\, , \,P_{21}\, , \,P_{22}\right)\begin{pmatrix}dx\\dy\\du\\dv\end{pmatrix}=\dfrac{P_{11}\,dx+P_{12}\,dy+P_{21}\,du+P_{22}\,dv}{\| \, P \, \|_F}.[/math]
Here the point of evaluation [imath] P [/imath] remains in the result, and rearranging the gradient into matrix shape recovers the formula [imath] d(\|X\|_F)= \bigl\langle X/\|X\|_F,\ dX \bigr\rangle_F [/imath] from post #1.
 
I think I have proceeded a bit differently (without vectorizing as you appear to), but I did get the same answer for the Frobenius Inner Product. Zooming out a bit, I've made some progress on an answer to the original question. My two formulas for [imath]dg[/imath] seem to differ only by composition with a trace. In other words, [imath]dg\ =\ <\nabla g,\ dX>_F[/imath] and [imath]dg\ =\ trace(J(g)dX)[/imath] agree! The only mystery now is that I have not seen [imath]dg\ =\ trace(J(g)dX)[/imath] explicated as a correct formula, whereas I have seen [imath]dg\ =\ J(g)dX[/imath].
 
Where do you see a trace? The Frobenius product considers all matrix entries. Traces are obtained if we differentiate the determinant and evaluate it at the identity matrix, see my formula in post #12. If
[math] P=I=\begin{pmatrix}1&0\\0&1\end{pmatrix} [/math]then
[math] d_I(\det)=1\cdot dx-0\cdot dy-0\cdot du +1\cdot dv=dx+dv=\operatorname{trace}(dX). [/math]
This is why we get the tangent space of [imath] \operatorname{SL}(2)=\left\{X\in \mathbb{M}(2,\mathbb{R})\,|\,\det(X)=1\right\} [/imath] as
[math] \mathfrak{sl}(2)=\left\{dX\in \mathbb{M}(2,\mathbb{R})\,|\, d_I(\det(X))=d(1)=0=\operatorname{trace}(dX)\right\}, [/math]the vector space of [imath] 2\times 2 [/imath] matrices with vanishing trace.
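Numerically, with a small perturbation of the identity (a NumPy sketch; the test data is my own):

[code]
import numpy as np

rng = np.random.default_rng(5)
dX = 1e-6 * rng.standard_normal((2, 2))      # a small perturbation of the identity

print(np.linalg.det(np.eye(2) + dX) - 1.0)   # d(det) at I, as a finite difference
print(np.trace(dX))                          # trace(dX), agrees to first order
[/code]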
 
I don't think the trace is tied to the determinant in this specific example. It seems to appear generally (at least for all square [imath]X[/imath]), and can be seen with very little reference to the underlying calculus problem. Say we have some [imath]X[/imath] and [imath]g(X)[/imath] as before. Then if [imath]X[/imath] is square, the Jacobian and Gradient are each also square, all three of the same size (the Gradient is the Matrix Calculus version of the Gradient as defined here, which is the transpose of the Jacobian). Call the Jacobian [imath]J_{X}(g) = \begin{pmatrix}J_{11} & J_{12} \\ J_{21} & J_{22}\end{pmatrix}[/imath] and the differential [imath]dX = \begin{pmatrix}dX_{11} & dX_{12} \\ dX_{21} & dX_{22}\end{pmatrix}[/imath].

The first formula is [imath]dg = \bigl\langle \nabla_X (g),\ dX \bigr\rangle_F[/imath]. Transposing the Jacobian and then taking the Frobenius Inner Product yields [imath]dg = J_{11}\ dX_{11} + J_{12}\ dX_{21} + J_{21}\ dX_{12} + J_{22}\ dX_{22}[/imath].

The second formula is [imath]dg = J_X(g)\ dX[/imath]. Doing the matrix multiplication yields [imath]\begin{pmatrix}J_{11}\ dX_{11} + J_{12}\ dX_{21} & J_{11}\ dX_{12} + J_{12}\ dX_{22} \\ J_{21}\ dX_{11} + J_{22}\ dX_{21} & J_{21}\ dX_{12} + J_{22}\ dX_{22}\end{pmatrix}[/imath]. The first result can be seen in the trace of this second result.
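This identity [imath]\bigl\langle \nabla_X (g),\ dX \bigr\rangle_F = trace(J_X(g)\ dX)[/imath] is easy to check numerically, e.g. for [imath]g = \det[/imath] with the gradient [imath]\det(X)(X^{-1})^T[/imath] mentioned earlier in the thread (a NumPy sketch; the test matrices are arbitrary):

[code]
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((2, 2))
dX = 1e-6 * rng.standard_normal((2, 2))

grad = np.linalg.det(X) * np.linalg.inv(X).T     # gradient of det: det(X) (X^{-1})^T
J = grad.T                                       # Jacobian = transpose of the gradient

print(np.sum(grad * dX))                         # <grad g, dX>_F
print(np.trace(J @ dX))                          # trace(J(g) dX), the same number
print(np.linalg.det(X + dX) - np.linalg.det(X))  # both match dg to first order
[/code]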
 
The link helps a lot to clarify the language that is used in your source. I wouldn't write [imath] df =f(x+dx)-f(x)[/imath] since it is a bit sloppy, but ok.

Now that I can look up what you actually mean, can we restart the discussion? What is your question? To answer the only sentence with a question mark in your post #1 ...
The problem seems to be that the Frobenius Inner Product returns a very different shape than matrix multiplication. What is my misconception here?
... I did the following calculation to verify the formula [imath] \bigl\langle A,B \bigr\rangle_F=\operatorname{tr}(A^TB). [/imath]

[math]\begin{array}{lll} (A^TB)_{ij}&=\displaystyle{\sum_{k=1}^n (A^T)_{ik}B_{kj}=\sum_{k=1}^n A_{ki}B_{kj}}\\[12pt] tr(A^TB)&=\displaystyle{\sum_{m=1}^n(A^TB)_{mm}=\sum_{m=1}^n\left(\sum_{k=1}^n A_{km}B_{km}\right)=\bigl\langle A,B \bigr\rangle_F } \end{array}[/math]
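The same identity as a two-line NumPy check (arbitrary test matrices):

[code]
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

print(np.sum(A * B))        # <A, B>_F
print(np.trace(A.T @ B))    # trace(A^T B), the same number
[/code]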
The notation [imath] \operatorname{vec}(A) [/imath] denotes the vector we obtain from reading the matrix from left to right, and row by row, e.g. [math] \operatorname{vec}(X)=(x,y,u,v) \text{ in case }X=\begin{pmatrix}x&y\\u&v\end{pmatrix}.[/math]
The Frobenius product of two matrices [imath] A,B [/imath] is thus the inner product [imath] \operatorname{vec}(A)^T\cdot \operatorname{vec}(B) [/imath] because this product matches up the indices:
[math] \vec{v}^T\cdot \vec{w}=\displaystyle{\sum_{k=1}^{n}v_k w_k} [/math]or in our case
[math] \operatorname{vec}(A)^T\cdot \operatorname{vec}(B)=\displaystyle{\sum_{(i,j)=(1,1)}^{(n,n)}A_{ij} B_{ij}} [/math]Please note that the "T" at a matrix means the transposed matrix, whereas the "T" at a vector only means that we write it as a row. Vectors without that "T" are column vectors.
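In NumPy the row-by-row vec used here is the default (C-order) reshape, so the identity can be checked directly (arbitrary test matrices):

[code]
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

print(A.reshape(-1) @ B.reshape(-1))   # vec(A)^T vec(B), with row-major vec as above
print(np.sum(A * B))                   # <A, B>_F, the same number
[/code]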

Does that answer your question about the connection of the ordinary matrix product and the Frobenius product?
 
Now that I can look up what you actually mean, can we restart the discussion? What is your question?
Okay, I think we have most of the grounding now. Starting again, we have two formulas, [imath]dg = \bigl\langle \nabla_X (g),\ dX \bigr\rangle_F[/imath] and [imath]dg = J_X (g)\ dX[/imath] (the latter being the generalization of [imath]df\ =\ f'(x)\ dx[/imath] from ordinary calculus).

What I believe you have shown is that [imath]dg = \bigl\langle \nabla_X (g),\ dX \bigr\rangle_F[/imath] is equivalent to [imath]dg = vec(J_X (g))^T\ vec(dX)[/imath], where I'm amending your definition of vectorization to stacking the columns of a matrix into a column vector (I suspect this is what you wanted anyhow, or else some of your vector products appear to be outer products rather than inner products). I think the last two subquestions to answer the main question are...

1) What justifies vectorization in this way? I could buy the idea of vectorizing both sides of an equation, i.e., [imath]dg = J_X (g)\ dX\ \implies\ vec(dg) = vec(J_X (g)\ dX)[/imath]. I've done this in other contexts. But [imath]dg = J_X (g)\ dX\ \implies\ dg = vec(J_X (g))^T\ vec(dX)[/imath] is mysterious.

2) We still have the apparent contradiction that [imath]dg[/imath] is both a scalar and a matrix. Even vectorizing [imath]dg[/imath] wouldn't make it a scalar as needed. Are there two genuinely different concepts of [imath]dg[/imath] here, and if so, how is each to be interpreted?
 
1) What justifies vectorization in this way? I could buy the idea of vectorizing both sides of an equation, i.e., [imath]dg = J_X (g)\ dX\ \implies\ vec(dg) = vec(J_X (g)\ dX)[/imath]. I've done this in other contexts. But [imath]dg = J_X (g)\ dX\ \implies\ dg = vec(J_X (g))^T\ vec(dX)[/imath] is mysterious.

It is the same resulting number, only written differently.

[math]\begin{array}{lll} \bigl\langle A,B \bigr\rangle_F&=\displaystyle{ \sum_{i=1}^n \sum_{j=1}^n A_{ij} B_{ij} } \end{array}[/math]
and
[math]\begin{array}{lll} \operatorname{vec}(A)^T\cdot \operatorname{vec}(B)&=(A_{11},A_{12},\ldots,A_{1n},A_{21},A_{22},\ldots,A_{2n},\ldots,A_{n1},A_{n2},\ldots,A_{nn}) \cdot \begin{pmatrix}B_{11}\\B_{12}\\ \vdots \\B_{1n}\\ \vdots\\ \vdots \\B_{nn}\end{pmatrix} \\ &\\ &=A_{11}B_{11}+A_{12}B_{12}+\ldots+A_{1n}B_{1n}+\ldots+A_{n1}B_{n1}+A_{n2}B_{n2}+\ldots+A_{nn}B_{nn}\\[12pt] &=\displaystyle{ \sum_{i=1}^n \sum_{j=1}^n A_{ij} B_{ij} } \end{array}[/math]
2) We still have the apparent contradiction that [imath]dg[/imath] is both a scalar and a matrix. Even vectorizing [imath]dg[/imath] wouldn't make it a scalar as needed. Are there two genuinely different concepts of [imath]dg[/imath] here, and if so, how is each to be interpreted?

Yes, but this is only because the notation isn't rigorous. We have to decide what [imath] dg [/imath] means!

If we have a differentiable function [imath] g\, : \,\mathbb{R}^{n^2}\longrightarrow \mathbb{R} [/imath] then [imath] dg [/imath] is usually the linear function from one tangent space to the other, the derivative or the differential form here, in our case
[math] dg\, : \,T_p\left(\mathbb{R}^{n^2}\right)=\mathbb{R}^{n^2}\longrightarrow T_{g(p)}\left(\mathbb{R}\right)=\mathbb{R} [/math]This means that [imath] dg [/imath] is a linear transformation from an [imath] n^2 [/imath]-dimensional vector space into a one-dimensional vector space.
Hence, we can write it as a [imath] 1\times n^2 [/imath] matrix, or rearranged as an [imath] n\times n [/imath] matrix. This is a matter of convenience, and given that this is about programming, a matter of index management.
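In code, this rearrangement really is pure bookkeeping. A small sketch (the numbers are made up) showing the same derivative stored as a [imath] 1\times n^2 [/imath] row and as an [imath] n\times n [/imath] matrix:

[code]
import numpy as np

row = np.array([4.0, -3.0, -2.0, 1.0])   # 1 x n^2 form, variables ordered x, y, u, v
mat = row.reshape(2, 2)                  # the same data rearranged as an n x n matrix
dX = np.array([[0.1, 0.2],
               [0.3, 0.4]])

print(row @ dX.reshape(-1))              # row form applied to vec(dX)
print(np.sum(mat * dX))                  # matrix form via the Frobenius product, same number
[/code]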

Now, why can it be seen as a scalar? Well, it isn't, but the images under [imath] dg [/imath] are scalars. If we evaluate the derivative at a certain point [imath] p [/imath] and apply a direction - note that derivatives are always directional - say the direction [imath] dX [/imath] which lives in the [imath] n^2 [/imath]-dimensional vector space where our matrices live, then we get
[math] D_{p}(g)\cdot dX=\bigl\langle \nabla_{p}\,g\, , \,dX \bigr\rangle \in \mathbb{R}.[/math]
Abbreviating the derivative [imath] D [/imath] of [imath] g [/imath] at the location [imath] p [/imath] in direction [imath] dX [/imath] by simply writing it [imath] dg=D_{p}(g) [/imath] or even [imath] dg=D_{p}(g)\cdot dX [/imath] is sloppy. It should at least be something like [imath] dg(X) [/imath] or [imath] d_Xg. [/imath]

In the end, it is the same question as whether a notation like [imath] f(x) [/imath] means the function or the resulting number.
Look at the beginning of section 5.1 in your source where they defined [imath] df.[/imath] They have smuggled a [imath] [dX] [/imath] into the end of the line, indicating the multiplication with [imath] dX [/imath], which makes it a number.

Ask 3 scientists about what a derivative is and you get 5 different answers and 8 different notations, at least.
 

Alright, I think I understand it about as well as it can be understood. Thank you for the help!
 