Optimality conditions

Let $\cC\subset\reals^d$ be closed, and $f_0:\reals^d\to\reals$ be a continuously differentiable function. Consider the (possibly non-convex) general optimization problem

\[\begin{aligned} &\min\nolimits_{x\in\reals^d} & & f_0(x) \\ &\subjectto & & x\in\cC \end{aligned}.\]

Intuitively, if a point $x_*$ is a (local) minimum, then

\[\langle\partial f_0(x_*),v\rangle \ge 0\text{ in all "feasible directions" }v\]

If $x_*$ is in the interior of $\cC$, then the set of “feasible directions” is $\reals^d$, and we recover the usual zero-gradient (necessary) optimality condition:

\[\langle\partial f_0(x_*),v\rangle \ge 0\text{ for all }v\in\reals^d\iff\partial f_0(x_*) = 0.\]

If $x_*$ is on the boundary of $\cC$, then the set of “feasible directions” is intuitively the set of all directions that point “inward” from $x_*$. As we shall see, this is the set of tangent vectors.

Tangent vectors

At first blush, we consider defining tangent vectors at $x\in\cC$ as directions $v$ such that $x + \eta d\in\cC$ for all small enough step sizes $\eta$. Although this definition works well for $\cC$ defined by linear constraints, it is too restrictive for $\cC$ with curved boundaries. For example, if $\cC$ is curve in $\reals^2$, then it may not be possible to move in any direction and remain in $\cC$.

Tangent vectors: A vector $v\in\reals^d$ is tangent to $\cC$ at $x\in\cC$ iff there are sequences $(x_n)\subset\cC$, $(x_n) \to x$ and $(\eta_n)\searrow 0$ such that

$\frac{1}{\eta_n}(x_n - x) \to v$ or $x_n - x = \eta_nv + o(\eta_n)$.

Intuitively, $(x_n)$ traces out a curve passing thru $x$, and the line segment $x + \eta v$ is tangent to this curve. Compared to the initial (overly restrictive) definition of tangent vector, this definition adds a $o(\eta_n)$ fudge factor.

We note that any direction $v$ such that $x + \eta v\in\cC$ for all small enough $\eta$ is a tangent vector. Indeed, let $\eta_n$ be a sequence of small enough step sizes that converges to zero and $x_n \triangleq x + \eta_nv$. The assumption $x + \eta v\in\cC$ for all small enough $\eta$ ensures $x_n$ is in $\cC$. We have $x_n - x = \eta_nv$.

We also note that this definition of tangent vector is coincides with the tangent space of a smooth manifold. Recall the tangent space of a smooth manifold $\cM$ at $x\in\cM$ consists of the derivatives (at $x$) of all smooth curves in $\cM$ passing thru $x$. Let $\gamma:[-\delta,\delta]\to\cC$ be a curve in $\cC$ such that $x = \gamma(0)$. To see that $\dot{\gamma}(0)$ is a tangent vector, let $\eta_n\searrow 0$ and $x_n\triangleq\gamma(\eta_n)$. We have

\[x_n - x = \gamma(\eta_n) - \gamma(0) = \eta_n\dot{\gamma}(0) + o(\eta_n),\]

where we used the definition of the derivative in the second step.

Finally, we claim that the set of all tangent vectors at $x\in\cC$ is a closed cone called the tangent cone $T_{\cC}(x)$. Recall a subset of $\reals^d$ is a cone iff it is closed under non-negative scalar multiplication: if $x\in\cC$, then $\alpha x\in\cC$ for any $\alpha \ge 0$. The proof of this claim is elementary, and we leave the details as an exercise to the reader.

Optimality conditions for the general optimization problem

The normal vectors at a point $x\in\cC$ are the vectors that make an obtuse angle with all tangent vectors: $\langle u,v\rangle \le 0$ for all $v\in T_{\cC}(x)$. It is not hard to check that the set of all normal vectors at a point is a closed convex cone. We note that the tangent cone of a non-convex set may not the convex, but the normal cone is generally convex.

Optimality conditions for the general optimization problem. If $x_*$ is a local minimum, then $-\partial f_0(x_*)\in N_{\cC}(x_*)$. This is equivalent to

\[\langle\partial f_0(x_*),v\rangle \ge 0\text{ for all }v\in T_{\cC}(x_*).\]

To see this, let $v\in T_{\cC}(x_*)$ be an arbitrary tangent vector. There are $(x_n)\in\cC$, $x_n\to x$ and $\eta_n\searrow 0$ such that $\frac{1}{\eta_n}(x_n - x_*) \to v$. We have

\[\begin{aligned} 0 &\le f_0(x_n) - f_0(x_*) \\ &= f_0(x_* + \eta_n v) - f_0(x_*) + O(\|x_n - (x_* + \eta_nv)\|) \\ &= f_0(x_* + \eta_n v) - f_0(x_*) + o(\eta_n), \end{aligned}\]

where the first step is the (local) optimality of $x_*$, the second step is the smoothness of continuously differentiable functions, and the third step is the definition of tangent vector. We divide both sides by $\eta_n$ and take limits to obtain

\[0 \le \frac{1}{\eta_n}(f_0(x_* + \eta_n v) - f_0(x_*)) \to \langle\partial f_0(x_*),v\rangle.\]

This optimality condition is intuitive: if $x_*$ is a local minimum, then there is no direction $v$ that is (i) tangent to the feasible set and (ii) a (strict) descent direction.

Optimality conditions for convex problems. The main results here are

$\def\cone{\mathop{\rm cone}}T_{\cC}(x) = \cone(\cC - x_*)$, where $\cone$ is the conic hull;
$N_{\cC}(x) = \{u\in\reals^d\mid\langle u,x'-x\rangle \le 0\text{ for all }x'\in\cC\}$;
The optimality condition $-\partial f_0(x_*)\in N_{\cC}(x_*)$ is necessary and sufficient.

At a high-level, the KKT conditions are an instance of the (necessary) optimality condition for the general optimization problem to non-linear optimization problems in standard form. To keep things simple, we consider a non-linear optimization with only inequality constraints:

\[\begin{aligned} & \min\nolimits_{x\in\reals^d} && f_0(x) \\ & \subjectto && \{f_i(x) \le 0\}_{i=1}^m \end{aligned}.\]

Tangent vectors of optimization problems in “standard” form

To keep things simple, we consider a non-linear optimization with only inequality constraints:

\[\begin{aligned} & \min\nolimits_{x\in\reals^d} && f_0(x) \\ & \subjectto && \{f_i(x) \le 0\}_{i=1}^m \end{aligned}.\]

Let $\cC$ be the feasible set of the preceding optimization problem:

\[\cC \triangleq \{x\in\reals^d\mid\{f_i(x) \le 0\}_{i=1}^m\}.\]

At first blush, we guess that the tangent cone at $x\in\cC$ is

\[T_{\cC}(x) = \{v\in\reals^d:\{\langle\partial f_i(x),v\rangle\le 0\}_{i\in\cA}\},\]

where $\cA \triangleq \{i\in[m]\mid f_i(x) = 0\}$ is the set of indicies of the active constraints at $x$. This guess is motivated by the observation that we must have $\langle\partial f_i(x),v\rangle \le 0\rangle$ for all $i\in\cA$ or moving in the direction $v$ will violate an active inequality constraint. Unfortunately, this guess is not restrictive enough: there are pathological cases in which there are $v$’s that satisfy the preceding guess, but are not tangent vectors.

Consider the set $\cC \triangleq \{x\in\reals^2\mid\frac12\|x + e_1\|_2^2 \le \frac12, \frac12\|x - e_1\|_2^2 \le \frac12\}$, where $e_1 \triangleq (1,0)$. This is the intersection of two disks of radius 1: one centered at $e_1$ and another centered at $-e_1$. The two disks only intersect at the origin. Thus $\cC = \{0\}$, and $T_{\cC}(0) = \{0\}$. On the other hand, the preceding guess is

\[T_{\cC}(x) = \{v\in\reals^2\mid\langle e_1,v\rangle = 0\}.\]

To rule out such pathological cases, we impose constraint qualification (CQ). The standard CQ that we impose in this class is Slater’s CQ, but there are many alternatives.

The KKT conditions

Recall the optimality condition of the general optimization problem: if $x_*$ is a local minimum, then $\langle\partial f_0(x_*),v\rangle \ge 0$ for all $v\in T_{\cC}(x_*)$. In light of the preceding characterization of $T_{\cC}(x_*)$, this is equivalent to there is no $v\in\reals^d$ such that

\[\begin{aligned} \{\langle\partial f_i(x_*),v\rangle &\le 0\}_{i\in\cA}, \\ -\langle\partial f_0(x_*),v\rangle &< 0. \end{aligned}\]

Consider the $v$ in the preceding system of inequalities as (the normal vector) of a hyperplane thru the origin. The optimality condition has a geometric interpretation: there is no hyperplane (thru the origin) that separates $-\partial f_0(x_*)$ from $\{\partial f_i(x_*)\}_{i\in\cA}$. This implies $\partial f_0(x_*)$ is in the conic hull of $\{\partial f_i(x_*)\}_{i\in\cA}$: i.e. there is $\lambda\in\reals_+^m$ such that

\[\textstyle-\partial f_0(x_*) = \sum_{i\in\cA}\lambda_i\partial f_i(x_*).\]

This is almost the stationarity condition in the KKT conditions. We pad $\lambda$ with zero entries for the constraints that are inactive to get

\[\textstyle-\partial f_0(x_*) = \sum_{i=1}^m\lambda_i\partial f_i(x_*)\]

to get the stationarity condition. The complementary slackness and dual feasibility conditions follows from the construction of $\lambda$. Finally, the primal feasibility condition is a consequence of the fact that $x_*$ is a local minimum (hence it’s feasible).

Posted on February 17, 2025 from Ann Arbor, MI.