1. Under the Hard margin SVM objective, it says “parameter b is optimised indirectly by influencing constraints”. Is this “optimisation of parameter b” referring to the separating hyperplane or to regularisation?

    Both w and b are parameters of the SVM model (we need both to specify a hyperplane), and training an SVM means finding a good pair (w, b) such that some objective function is maximised under some constraints. In hard margin SVM, that means maximising the margin subject to the constraint of no misclassification. It turns out that the “maximise margin” part can be expressed purely in terms of w, and b only comes in when making sure that the “no misclassification” constraint is satisfied.

    I know the slides use the word “regularisation” in describing the objective of SVM, but I think that is a little confusing. Just think of the $1/||w||$ term as an objective function that measures how well a particular choice of (w, b) performs for the given training data.
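    To make this concrete, one standard way of writing the hard margin problem (equivalent to maximising the margin $1/||w||$) is

    $$\min_{w, b} \ \tfrac{1}{2}||w||^2 \quad \text{subject to} \quad y_i(w \cdot x_i + b) \ge 1 \ \text{ for all } i.$$

    The objective involves only $w$; $b$ shows up only inside the constraints, which is exactly the sense in which it is “optimised indirectly”.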
  2. Also under the Hard margin SVM objective, what is the “data” here when it says “data-independent regularisation term”? I thought the data was the training data, but then I think $1/||w||$ is related to the training data, since scaling changes $y_i(w x_i + b)$. I am confused about why it is considered a “data-independent regularisation term” after scaling the closest distance to $1/||w||$.

    That slide is just trying to express the fact that once the constraints are satisfied, maximising the $1/||w||$ objective no longer cares about the training data. It is the “no misclassification” constraints that depend on the training data. So the training process can be thought of (not the actual implementation, though) as first using the training data to find all (w, b) that satisfy the $y_i (w x_i + b) \ge 1$ constraints, and then finding the (w, b) with the smallest $||w||$ in that set.
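    Here is a quick numeric sketch of that two-step picture on a made-up toy dataset (not from the slides): pick any separating (w, b), rescale it so the tightest constraint becomes exactly 1, and note that the resulting margin $1/||w||$ is then read off from w alone.

    ```python
    import numpy as np

    # Toy, made-up linearly separable data (just for illustration).
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
    y = np.array([1, 1, -1, -1])

    def rescale(w, b):
        """Rescale (w, b) so the closest point has y_i (w . x_i + b) = 1."""
        m = np.min(y * (X @ w + b))   # smallest functional margin; must be > 0
        return w / m, b / m

    # Start from any separating hyperplane, e.g. w = (1, 1), b = 0.
    w, b = rescale(np.array([1.0, 1.0]), 0.0)

    print(np.min(y * (X @ w + b)))    # 1.0: the tightest constraint holds with equality
    print(1.0 / np.linalg.norm(w))    # the margin, computed from w alone
    ```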
  3. Under the Lagrange duality, where does the lambda come from, and how did it get into the model here?

    This is indeed a whole topic on its own. I think this article explains the origin of the dual variable fairly well. You might want to go through the first section carefully while keeping a simple example in mind, e.g. try to work through the article with the following optimisation problem:

    minimise $x^2$ subject to $-x + 3 \le 0$

    As you mentioned, this might be a little out of scope of this subject.
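    In case it helps, here is how that toy problem plays out (just a sketch, glossing over why the dual optimum matches the primal one). The Lagrangian is

    $$L(x, \lambda) = x^2 + \lambda(-x + 3), \quad \lambda \ge 0.$$

    Minimising over $x$ gives $x = \lambda/2$, so the dual function is $g(\lambda) = -\lambda^2/4 + 3\lambda$. Maximising $g$ over $\lambda \ge 0$ gives $\lambda^* = 6$ and $g(6) = 9$, which matches the primal optimum $x^* = 3$ with value $9$. The $\lambda_i$ in the SVM dual arise the same way: one multiplier per constraint $y_i(w x_i + b) \ge 1$.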

  4. In the Kernel example, what does the transformation in the last line of the calculation mean?

    Notice that it is a function that takes the feature vector $(x_1, x_2)$ with only 2 dimensions and produces another vector with 6 dimensions, which includes a (scaled) copy of the original $(x_1, x_2)$ components but also other quadratic components. In practice, this means we transform our original dataframe with 2 columns into a new dataframe with 6 columns, and it is this new dataframe that we feed into the SVM so that the SVM can do linear separation in this larger 6-dimensional space.

    But we can do away with actually computing this new dataframe and feeding a much larger dataframe into the SVM by using a kernel. The observation is that we only ever need to compute dot products of the new 6-dimensional vectors, not the vectors themselves. And we observe in this calculation that computing the dot product in the transformed space is the same as evaluating the kernel in the original space, which is much cheaper.
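    A small numeric check of that identity, assuming the example on the slides is the degree-2 polynomial kernel $k(x, z) = (1 + x \cdot z)^2$ (adjust the feature map if your slides use a slightly different kernel):

    ```python
    import numpy as np

    def phi(x):
        """Explicit 6-dimensional feature map for the degree-2 polynomial
        kernel (1 + x.z)^2 on 2-dimensional inputs."""
        x1, x2 = x
        return np.array([1.0,
                         np.sqrt(2) * x1, np.sqrt(2) * x2,      # scaled copy of (x1, x2)
                         x1**2, x2**2, np.sqrt(2) * x1 * x2])   # quadratic components

    def k(x, z):
        """Kernel evaluated directly in the original 2-dimensional space."""
        return (1.0 + np.dot(x, z)) ** 2

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])

    print(np.dot(phi(x), phi(z)))   # dot product in the 6-dimensional space
    print(k(x, z))                  # same number, without ever building phi(x)
    ```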

  5. What is an infinite dimensional vector space?

    It is what it says it is: a vector space with infinite dimension. To make sense of this you would need to know the definition of vector spaces and their dimension. Roughly, a (real) vector space is any set of things that you know how to add, and know how to scale by a real number. The dimension of the vector space is the number of “coordinates” needed to specify a vector in this set. 2-tuples form a 2D vector space, 3-tuples form a 3D one, etc. You can perhaps imagine an infinite list of numbers forming an infinite-dimensional space (ask yourself: do you know how to add two such lists? do you know how to scale one such list by a real number? If so, then you likely have a vector space).

    A non-trivial example relevant to things we discussed is that any (sufficiently nice) function can be expressed by its Taylor series $f(x) = a_0 + a_1 x + a_2 x^2 + \dots$

    You can think of this as using an infinite list of numbers $(a_0, a_1, a_2, \dots)$ to specify a function. For example, the list $(1, 2, 1, 0, 0, 0, \dots)$ specifies the function $f(x) = 1 + 2x + x^2$. And to specify all such functions, you do need an infinite list, so this vector space is infinite-dimensional.
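    To make the “add” and “scale” operations on these coefficient lists concrete, here is a tiny sketch (truncating the infinite lists to a finite prefix just so they fit in an array):

    ```python
    import numpy as np

    # Represent f(x) = a_0 + a_1 x + a_2 x^2 + ... by its coefficient list.
    f = np.array([1.0, 2.0, 1.0, 0.0, 0.0, 0.0])   # f(x) = 1 + 2x + x^2
    g = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])   # g(x) = x^3

    def evaluate(coeffs, x):
        """Evaluate the function given by the coefficient list at x."""
        return sum(a * x**i for i, a in enumerate(coeffs))

    # The two vector space operations: adding the "vectors" adds the functions,
    # scaling a "vector" scales the function.
    h = f + 3.0 * g                                   # h(x) = 1 + 2x + x^2 + 3x^3
    print(evaluate(h, 2.0))                           # 33.0
    print(evaluate(f, 2.0) + 3.0 * evaluate(g, 2.0))  # same number
    ```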

    I am skipping over lots of details here, but this is mostly covered in any linear algebra course / book.