1. Introduction

In A short note on support vector machine (part 1), we started from the motivation for SVM, formulated it as an optimization problem and ended by a QP solution of the optimizaiton problem. Taking from there, we will discuss the dual version of SVM. This post aims to answer the following question.

  1. Motivation

Usually the first question I asked myself before studying any new knowledge is that why do I need to learn this? (even why does this staff exisit?). Back to the SVM, we have already had a solution, why do we further bother ourselves?

The answer is as follows: in the original SVM, we have number of variables \(d\) (or called features under machine learning settings) and number of observations \(N\). If d becomes very large (such as features transformations), the original SVM turns to be hard to solve. Therefore, we have to reformulate the original one into a new one with N variables and (N+1) constraints.

Another reason is the dual probelm is an unconstrained optimizatio problem with repsect to \(w\) and \(b\) from which we can obtain some interesting property to get the optimal solution more easily.

  1. Lagrange Function - From Constrained Problem to Unconstrained Probelm

In most optimization problem, we always want to reformulate the problem into an easier and equivalent one. Up to now, the orginal problem is in the following format

min \(\frac{1}{2}{w^{T}w}\)
s.t. \(y_{i}(w^{T}x_{i}+b)>=1\)

By using Lagrange multiplier, we can reformulate it into an unconstrained version as

\(\underset{w,b}{\text{min}}\Big( \underset{\alpha_{i}\geq0}{\text{max}} \ \frac{1}{2}{w^{T}w}+\sum_{i}^{N}\alpha_{i}(1-y_{i}(w^{T}x_{i}+b))\Big)\)

What we did ?

This is fairly straightforward, we simply

Why they are equivalent problem ?

To answer this question, we need to take a look at the original problem to see what it needs. Then if we can reason tha thet reformulated problem provides the same, then they are equivalent.

Now we can say after “prove” the equivalence, our SVM is now as

\(\underset{w,b}{\text{min}}\Big( \underset{\alpha_{i}\geq0}{\text{max}} \ \frac{1}{2}{w^{T}w}+\sum_{i}^{N}\alpha_{i}(1-y_{i}(w^{T}x_{i}+b))\Big)\)
  1. Two More Steps before We See the Benefits

The current formulation is hard to be expressed in algebraic mathemtical programming language to obtain a solution. We still have to further simplify it.

The first trick is remove the inner maximation operation and replace it with a random \(\alpha^{'}\)(a vector of length N). Then we can get the following inequility:

\(\underset{w,b}{\text{min}}\Big( \underset{\alpha_{i}\geq0}{\text{max}} \ \frac{1}{2}{w^{T}w}+\sum_{i}^{N}\alpha_{i}(1-y_{i}(w^{T}x_{i}+b))\Big) \geq \underset{w,b}{\text{min}}\Big(\frac{1}{2}{w^{T}w}+\sum_{i}^{N}\alpha_{i}^{'}(1-y_{i}(w^{T}x_{i}+b))\Big)\)

The logic is that the maxmimized value (left hand side(LHS)) is always greater than or equal to a random value (right hand side(RHS))

The second trick comes this way. For each feasible \(\alpha\), we can have an inequality as above. If we “combine” them together, we can conclude the LHS is greater than or equal to the maximium value of RHS. The logic is LHS is greater than or equal to every RHS therefore it must be greater than or equal to the maximum RHS. If you buy this logic, we’ll reach the following:

\(\underset{w,b}{\text{min}}\Big( \underset{\alpha_{i}\geq0}{\text{max}} \ \frac{1}{2}{w^{T}w}+\sum_{i}^{N}\alpha_{i}(1-y_{i}(w^{T}x_{i}+b))\Big) \geq \underset{\alpha_{i}\geq0}{\text{max}}\ \underset{w,b}{\text{min}}\Big(\frac{1}{2}{w^{T}w}+\sum_{i}^{N}\alpha_{i}(1-y_{i}(w^{T}x_{i}+b))\Big)\)

Now, we have a lowerbound (RHS) of the orginal problem (LHS). It swaps the minimization and maximization order. The RHS problem is called the dual problem of the LHS.

What is the benefit ?

If we take a look at the dual problem, the inner minimization is an unconstrained problem (with respect to \(w\) and \(b\) which can be solved much more easily than the original constrained one. In addition, we can obtain some useful properties from this unconstrained version such as the gradient is equal to 0 for the optimal \(w\) and \(b\)

Secondly as said before, the original problem (the first formula in this post) has d (d+1 if we include b) variables and N constraints. This would be troublesome to solve if d becomes large (think of feature transformation into infinite space). We’ll see in the next post that the dual problem can be further simplifed to reach a version only involvs \(\alpha_{i}\) as optimization variables. The number of features (d) in SVM will become irrelavent.(Isn’t this great for infinite feature transformation ?)