Derivatives in Higher Dimensions

Directional and partial derivatives

Suppose $f \colon \R^d \to \R^n$ is a function. How do we define its derivative? Our old definition \begin{equation} \lim_{h \to 0} \frac{f(a + h) - f(a)}{h} \end{equation} does not make sense, as it is a ratio of vectors.

Definition 1 (Directional derivative). Let $U \subset \R^d$ be a domain, $f:U \to \R$ be a function, and $v \in \R^d - \set{0}$ be a vector. We define the directional derivative of $f$ in the direction $v$ at a point $a \in U$ by \begin{equation} D_v f(a) \defeq \frac{d}{dt} f( a + tv ) \Bigr|_{t = 0} \end{equation}
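For instance, if $f(x, y) = x y^2$ and $v = (v_1, v_2)$, then \begin{equation} D_v f(a) = \frac{d}{dt} (a_1 + t v_1)(a_2 + t v_2)^2 \Bigr|_{t = 0} = v_1 a_2^2 + 2 a_1 a_2 v_2 \,. \end{equation}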

Remark 2. Other notations include \begin{equation} D_v f(a) = Df(a)v = Df_a v = \grad_v f(a) = \partial_v f(a) \end{equation}

Definition 3 (Partial derivatives). For $i \in \set{1, \dots, d}$, we define the $i^\text{th}$ partial derivative of $f$ (denoted by $\partial_i f$) by \begin{equation} \partial_i f(a) = D_{e_i} f(a) \,, \end{equation} where $e_i$ is the $i^{\text{th}}$ elementary basis vector.

Remark 4. Other notations include $\frac{\partial f}{\partial x_i}(a)$ or $\partial_{x_i} f(a)$

Example 5. If $f(x) = \abs{x}$, then $\partial_i f(x) = x_i / \abs{x}$ and $\partial_v f(x) = x \cdot v / \abs{x}$, provided $x \neq 0$.
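To verify the first formula, note \begin{equation} \partial_i f(x) = \frac{d}{dt} \abs{x + t e_i} \Bigr|_{t = 0} = \frac{d}{dt} \sqrt{ \abs{x}^2 + 2 t x_i + t^2 } \Bigr|_{t = 0} = \frac{x_i}{\abs{x}} \,. \end{equation} The same computation with $e_i$ replaced by $v$ gives the second formula.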

The derivative

Neither of the above notions, however, gives the full picture. To obtain the “right” notion of differentiability, we recall that in one dimension, the derivative gave us a linear approximation of the function. Whatever gives us this in higher dimensions should be the derivative.

Definition 6 (Derivative). We say $f$ is differentiable at $a$ if there exists a linear transformation $T\colon \R^d \to \R^n$ such that \begin{equation} \lim_{h \to 0} \frac{f(a + h) - f(a) - Th}{\abs{h}} = 0\,. \end{equation} In this case, the linear transformation $T$ is called the derivative of $f$ at $a$, and denoted by $Df_a$.
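For example, if $f(x) = \abs{x}^2$ and $T h = 2 a \cdot h$, then \begin{equation} f(a + h) - f(a) - Th = \abs{a + h}^2 - \abs{a}^2 - 2 a \cdot h = \abs{h}^2 \,, \end{equation} and $\abs{h}^2 / \abs{h} = \abs{h} \to 0$ as $h \to 0$. Thus $f$ is differentiable at every $a \in \R^d$ with $Df_a h = 2 a \cdot h$.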

Remark 7 (Uniqueness). One can directly check that the derivative of $f$, if it exists, is unique. That is, if $T_1$, $T_2$ are two linear transformations that are both the derivative of $f$ (according to the definition), then $T_1 = T_2$.

Example 8. If $T \colon \R^d \to \R^n$ is a linear transformation, then $D T_a = T$.

Problem 9. If $f$ is differentiable at $a$, show that $f$ is continuous at $a$.

Lemma 10. Let $U \subseteq \R^d$ be a domain, $f \colon U \to \R^n$ be a function, and $a \in U$. The function $f$ is differentiable at $a$ if and only if there exists a linear transformation $T\colon \R^d \to \R^n$ and a function $e \colon \R^d \to \R^n$ such that \begin{equation} f(a + h) = f(a) + Th + e(h) \quad\text{and}\quad \lim_{h \to 0} \frac{\abs{ e(h) }}{\abs{h}} = 0 \,. \end{equation}

The Jacobian

Proposition 11. If $f$ is differentiable at $a$, then all the directional derivatives $D_v f(a)$ exist. Further, \begin{equation}\label{e:DfJac}\tag{J} Df_a = \begin{pmatrix} \uparrow & \uparrow & \cdots & \uparrow\\ \partial_1 f(a) & \partial_2 f(a) & \cdots & \partial_d f(a)\\ \downarrow & \downarrow & \cdots & \downarrow\\ \end{pmatrix} = \begin{pmatrix} \partial_1 f_1(a) & \partial_2 f_1(a) & \cdots & \partial_d f_1(a)\\ \partial_1 f_2(a) & \partial_2 f_2(a) & \cdots & \partial_d f_2(a)\\ \vdots & \vdots & \ddots & \vdots\\ \partial_1 f_n(a) & \partial_2 f_n(a) & \cdots & \partial_d f_n(a)\\ \end{pmatrix} \end{equation} and \begin{equation} D_v f(a) = Df_a v = \sum_{i = 1}^d v_i \partial_i f(a). \end{equation}

Definition 12 (Jacobian). The matrix on the right of \eqref{e:DfJac} is called the Jacobian matrix.
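In practice, the $i^\text{th}$ column $\partial_i f(a)$ of the Jacobian matrix can be approximated by a difference quotient in the direction $e_i$, which gives a convenient numerical sanity check. Here is a minimal sketch in Python (the map $f$, the step size, and the tolerance below are illustrative choices).

import numpy as np

def f(x):
    # Illustrative map f: R^2 -> R^3
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def jacobian_fd(f, a, h=1e-6):
    # Approximate each column partial_i f(a) by a central difference along e_i.
    a = np.asarray(a, dtype=float)
    cols = []
    for i in range(a.size):
        e = np.zeros_like(a)
        e[i] = 1.0
        cols.append((f(a + h * e) - f(a - h * e)) / (2 * h))
    return np.column_stack(cols)

a = np.array([1.0, 2.0])
analytic = np.array([[a[1], a[0]],
                     [np.cos(a[0]), 0.0],
                     [0.0, 2 * a[1]]])
print(np.allclose(jacobian_fd(f, a), analytic, atol=1e-6))  # True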

Remark 13. Proposition 11 shows that the unique linear transformation in the definition of the derivative must be the Jacobian matrix. However, if the Jacobian matrix exists, it need not be the derivative of $f$.

Problem 14. Let $f(x, y) = x^2 y / (x^2 + y^2)$ if $(x, y) \neq 0$, and $f(0) = 0$. Show that $f$ is continuous at $0$, and for every $v \in \R^2 - \set{0}$, $D_v f(0)$ exists. However, $f$ is not differentiable at $0$.

Example 15. Let $f(x, y) = x^2 y / (x^4 + y^2)$ if $(x, y) \neq 0$, and $f(0) = 0$. Then for every $v \in \R^2 - \set{0}$, $D_v f(0)$ exists, but $f$ is not differentiable (or even continuous) at $0$.
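To see this, note that for $v = (v_1, v_2) \neq 0$ and $t \neq 0$, \begin{equation} \frac{f(tv) - f(0)}{t} = \frac{1}{t} \cdot \frac{t^3 v_1^2 v_2}{t^4 v_1^4 + t^2 v_2^2} = \frac{v_1^2 v_2}{t^2 v_1^4 + v_2^2} \,, \end{equation} which converges to $v_1^2 / v_2$ as $t \to 0$ if $v_2 \neq 0$, and is identically $0$ if $v_2 = 0$. Thus $D_v f(0)$ exists for every $v \neq 0$. On the other hand, along the parabola $y = x^2$ we have $f(x, x^2) = x^4 / (2 x^4) = 1/2$ for $x \neq 0$, so $f$ does not converge to $f(0) = 0$ at the origin.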

Theorem 16. If all partial derivatives of $f$ exist in a neighbourhood of $a$, and are continuous at $a$, then $f$ is differentiable at $a$.

Example 17. Let $f \colon \R^2 \to \R$ be defined by $f(x) = \abs{x}^2 \sin( 1/ \abs{x})$ when $x \neq 0$, and $f(0) = 0$. Then $f$ is differentiable on all of $\R^2$ (including $x = 0$), and hence all partial derivatives of $f$ exist at all points in $\R^2$. However, $\partial_1 f$ and $\partial_2 f$ are not continuous at $x = 0$.
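Indeed, for $x \neq 0$ a direct computation gives \begin{equation} \partial_1 f(x) = 2 x_1 \sin( 1 / \abs{x} ) - \frac{x_1}{\abs{x}} \cos( 1 / \abs{x} ) \,, \end{equation} which along the positive $x_1$-axis (where $x = (t, 0)$, $t > 0$) equals $2 t \sin(1/t) - \cos(1/t)$, and so has no limit as $t \to 0^+$. At the origin, however, $\partial_1 f(0) = \lim_{t \to 0} f(t e_1) / t = \lim_{t \to 0} t \sin( 1/\abs{t} ) = 0$. The same reasoning applies to $\partial_2 f$.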

The gradient

Definition 18. Suppose $f \colon \R^d \to \R$ is differentiable. The gradient of $f$, denoted by $\grad f$, is defined by \begin{equation} \grad f = (Df)^T = \begin{pmatrix} \partial_1 f\\ \partial_2 f\\ \vdots \\ \partial_d f \end{pmatrix} \,. \end{equation}

Proposition 19. Suppose $f$ is differentiable at $a$ and $\grad f(a) \neq 0$. Among all $v \in \R^d$ with $\abs{v} = 1$, the directional derivative $D_v f(a)$ is maximised when $v = \grad f(a) / \abs{\grad f(a)}$ and minimised when $v = -\grad f(a) / \abs{\grad f(a)}$.
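This follows from the Cauchy–Schwarz inequality: since $D_v f(a) = \grad f(a) \cdot v$, we have \begin{equation} \abs{ D_v f(a) } \leq \abs{ \grad f(a) } \abs{v} = \abs{ \grad f(a) } \,, \end{equation} with equality precisely when $v$ is parallel to $\grad f(a)$.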

Remark 20 (Gradient descent). Suppose you have a function $f$ which you want to minimize numerically. Start with a (possibly bad) guess $x_0$ for the minimum. The gradient descent algorithm chooses successive approximations by moving in the direction in which $f$ decreases the most. By Proposition 19 this is the direction directly against $\grad f$. Explicitly, choose the next approximation by \begin{equation} x_{n+1} = x_n - \gamma_n \grad f(x_n)\,, \end{equation} for some small $\gamma_n > 0$. Choosing the step size $\gamma_n$ requires some care; one popular choice is \begin{equation*} \gamma_n = \frac{ (x_n - x_{n-1}) \cdot ( \grad f(x_n) - \grad f(x_{n-1}) ) }{ \abs{ \grad f(x_n) - \grad f(x_{n-1}) }^2 }\,, \end{equation*} which guarantees convergence to a local minimum, under certain assumptions on $f$.
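Here is a minimal sketch of this scheme in Python, using the above step size rule; the quadratic test function, initial step, iteration cap and stopping tolerance are all illustrative choices.

import numpy as np

def grad_f(x):
    # Gradient of the illustrative test function f(x) = |x - c|^2 / 2
    c = np.array([3.0, -1.0])
    return x - c

def gradient_descent(grad_f, x0, gamma0=0.1, steps=100):
    # Iterate x_{n+1} = x_n - gamma_n grad f(x_n), with the step size
    # rule from Remark 20 (falling back to gamma0 if it is undefined).
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev - gamma0 * grad_f(x_prev)
    for _ in range(steps):
        g, g_prev = grad_f(x), grad_f(x_prev)
        dg = g - g_prev
        gamma = ((x - x_prev) @ dg) / (dg @ dg) if dg @ dg > 1e-16 else gamma0
        x_prev, x = x, x - gamma * g      # move against the gradient
        if np.linalg.norm(g) < 1e-10:     # stop once the gradient is tiny
            break
    return x

print(gradient_descent(grad_f, np.zeros(2)))  # approximately [ 3. -1.]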

Rules for differentiation

Sums, products, quotients

The one variable calculus rules for differentiation of sums, products and quotients (when they make sense) are still valid in higher dimensions.

Proposition 21. Let $f, g: \R^d \to \R$ be two differentiable functions.

  1. $f + g$ is differentiable and $D(f + g) = Df + Dg$.
  2. $fg$ is differentiable and $D(fg) = f Dg + g Df$.
  3. At points where $g \neq 0$, $f/g$ is also differentiable and \begin{equation} D\paren[\Big]{ \frac{f}{g} } = \frac{g Df - f Dg}{g^2} \end{equation}
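For instance, with $f(x, y) = x$ and $g(x, y) = y$, the product rule gives $D(fg) = D(xy) = (y, x)$, which indeed equals $f\, Dg + g\, Df = x \, (0, 1) + y \, (1, 0)$.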

Proposition 22. Let $f \colon \R^d \to \R^n$ be a function, $a \in \R^d$. The function $f$ is differentiable at $a$ if and only if for every $i \in \set{1, \dots, n}$, the coordinate function $f_i \colon \R^d \to \R$ is differentiable at $a$. In this case \begin{equation} Df(a) = \begin{pmatrix} \leftarrow & Df_1(a) & \rightarrow\\ \leftarrow & Df_2(a) & \rightarrow\\ \vdots & \vdots & \vdots\\ \leftarrow & Df_n(a) & \rightarrow \end{pmatrix} \end{equation}

Note: Here we use $Df(a)$ to denote the derivative of $f$ at $a$, and $Df_i(a)$ to denote the derivative of the $i$-th coordinate function $f_i$ at the point $a$.

Proposition 23. If $f, g \colon \R^d \to \R^n$ are two differentiable functions, then $f \cdot g$ is differentiable and \begin{equation} \grad(f \cdot g) = (Df)^T g + (Dg)^T f \,. \end{equation}

These follow in a manner very similar to the one variable versions, and are left for you to verify. The one rule that is a little different in this context is the differentiation of composites.

Chain rule

Theorem 24 (Chain rule). Let $U \subseteq \R^m$, $V \subseteq \R^n$ be domains, $g:U \to V$, $f:V \to \R^d$ be two differentiable functions. Then $f \circ g:U \to \R^d$ is also differentiable and \begin{equation} D(f \circ g)_a = (D f_{g(a)}) (Dg_a) \end{equation}

Note $Df_{g(a)}$ and $Dg_a$ are both matrices, and the product above is their matrix product.

Proof sketch. Since $f, g$ are differentiable we know there exist functions $e_1$ and $e_2$ such that \begin{equation} g(a + h) = g(a) + Dg_a h + e_2(h) \quad\text{and}\quad f( g(a) + h) = f(g(a)) + Df_{g(a)} h + e_1(h)\,, \end{equation} with $\lim_{h \to 0} e_i(h) / \abs{h} = 0$. Consequently, \begin{align} f(g(a + h)) &= f\paren[\big]{ g(a) + Dg_a h + e_2(h)} \\ &= f( g( a) ) + Df_{g(a)} \paren[\big]{ D g_a h + e_2(h)} + e_1 \paren[\big]{ D g_a h + e_2(h)} \\ &= f(g(a)) + \paren[\big]{Df_{g(a)} \, Dg_a} h + e_3(h)\,, \end{align} where \begin{equation} e_3(h) = Df_{g(a)} e_2(h) + e_1 \paren[\big]{ D g_a h + e_2(h)}\,. \end{equation} Now to finish the proof one only needs to show $\lim_{h \to 0} e_3(h) / \abs{h} = 0$, which can be done directly from the $\epsilon$-$\delta$ definition.

Remark 25. In terms of partial derivatives, $\partial_j (f \circ g)_i$ is the entry in the $i$-th row and $j$-th column of the matrix $D(f\circ g)$. Hence \begin{equation} \partial_j (f \circ g)_i = \paren[\big]{ (Df_g) (Dg) }_{i,j} = \sum_{k = 1}^n \partial_k f_i \Bigr|_{g} \partial_j g_k \,. \end{equation} While this can be derived immediately by multiplying the matrices $Df_g$ and $Dg$, it arises often enough that it is worth directly remembering.
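As an illustration, let $g(s, t) = (st, s + t)$ and $f(x, y) = x y$, so that $(f \circ g)(s, t) = s^2 t + s t^2$. The chain rule gives \begin{equation} D(f \circ g) = (Df_g)(Dg) = \begin{pmatrix} s + t & st \end{pmatrix} \begin{pmatrix} t & s\\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 2 s t + t^2 & s^2 + 2 s t \end{pmatrix} \,, \end{equation} which agrees with differentiating $s^2 t + s t^2$ directly.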

Problem 26. Let $F(x, y) = x y$. Given two differentiable functions $f, g\colon \R \to \R$, let $\gamma\colon \R \to \R^2$ be defined by $\gamma(t) =(f(t), g(t))$. Observe $\frac{d}{dt}(fg) = D(F \circ \gamma)$. Compute the right hand side using the chain rule, and use it to derive the product rule.

Problem 27. Derive the quotient rule from the chain rule using a method similar to the previous problem.

Problem 28. Compute $\partial_x (x^x)$.

Problem 29. Recall the fundamental theorem of calculus says that if $f$ is a continuous function then $\partial_t \int_0^t f(s) \, ds = f(t)$. The Leibniz rule says that if $g$ and $\partial_x g$ are continuous, then $\partial_x \int_a^b g(x, t) \, dt = \int_a^b \partial_x g(x, t) \, dt$. Using these, compute $\partial_t \int_0^t \exp(-(t - s)^2) \, ds$.

Coordinate changes

Suppose $y_1$, …, $y_n$ are differentiable functions of $x_1$, …, $x_m$. Suppose further $z_1$, …, $z_d$ are differentiable functions of $y_1$, …, $y_n$. The chain rule tells us \begin{equation} \frac{\partial z_i}{\partial x_j} = \sum_{k = 1}^n \frac{\partial z_i}{\partial y_k} \frac{\partial y_k}{\partial x_j} \,. \end{equation} This is often used when changing coordinates (something we will see in more detail later).

Problem 30 (Polar coordinates). Let $U = \R^2 - \set{(x, 0) \st x \leq 0}$, and $V = \set{ (r, \theta) \st r > 0, \theta \in (-\pi, \pi)}$. Given a differentiable function $f$ defined on $U$, we treat it as a function of the coordinates $x$ and $y$. Using the relation $x = r \cos\theta$ and $y = r \sin \theta$ for $(r, \theta) \in V$, we can now treat $f$ as a function of $r$ and $\theta$.

  1. Express $\partial_r f$ and $\partial_\theta f$ in terms of $\partial_x f$, $\partial_y f$, $r$ and $\theta$.
  2. Express $r, \theta$ in terms of $x$ and $y$.
  3. Suppose now $g$ is a differentiable function defined on $V$, which we treat as a function of $r$ and $\theta$. Using the previous part, we can treat $g$ as a function of $x$ and $y$. Compute $\partial_x g$ and $\partial_y g$ in terms of $\partial_r g$, $\partial_\theta g$, $x$ and $y$.

Example 31. Let $u = x^2 + y^2$ and $v = y/x$. Explicitly express $u, v$ in terms of $r$ and $\theta$ (the polar coordinate variables) and compute $\partial_r u$, $\partial_\theta u$, $\partial_r v$ and $\partial_\theta v$ directly. Verify that this agrees with the formulae in Problem 30.

Example 32. Consider the function $g$ defined in polar coordinates by $g(r, \theta) = r \theta$. Compute $\partial_x g$, $\partial_y g$ both by directly expressing $g$ in terms of $x$ and $y$ and differentiating, and by using the transformation formulae in Problem 30, and verify they agree.

Problem 33. If the polar coordinates $r, \theta$ are functions of variables $s$ and $t$, compute $\partial_s f$ in terms of $\partial_x f$, $\partial_y f$, $r$, $\theta$, $\partial_s r$ and $\partial_s \theta$.