Perceptron algorithm on a training set
- Consider running the Perceptron algorithm on a training set S arranged in a certain order. Now suppose we run it with the same initial weights and on the same training set but in a different order, S′. Does Perceptron make the same number of mistakes? Does it end up with the same final weights? If so, prove it. If not, give a counterexample, i.e. an S and S′ where order matters.
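To make the question concrete, the following minimal sketch (not part of the original problem) runs a standard Perceptron with zero initial weights on a small, purely hypothetical dataset under two orderings and reports the mistake counts and final weights; the dataset, the no-bias convention, and the cycle-until-convergence loop are all assumptions.

```python
import numpy as np

def perceptron(examples):
    """Classic Perceptron: zero initial weights, no bias term, cycle through the
    data until a full pass makes no mistakes; return (final weights, #mistakes)."""
    w = np.zeros(len(examples[0][0]))
    mistakes = 0
    changed = True
    while changed:
        changed = False
        for x, y in examples:
            if y * np.dot(w, x) <= 0:      # current w misclassifies (x, y)
                w = w + y * x              # Perceptron update
                mistakes += 1
                changed = True
    return w, mistakes

# Hypothetical linearly separable data; the specific points are illustrative only.
S = [(np.array([1.0, 0.0]),  1),
     (np.array([0.0, 1.0]), -1),
     (np.array([1.0, 1.0]),  1)]
S_prime = [S[2], S[0], S[1]]               # same set, different presentation order

print(perceptron(S))                        # run on order S
print(perceptron(S_prime))                  # run on order S'; compare both outputs
```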
- We have mainly focused on squared loss, but there are other interesting losses in machine learning. Consider the following loss function, which we denote by ϕ(z) = max(0, −z). Let S be a training set (x^1, y^1), . . . , (x^m, y^m), where each x^i ∈ R^n and y^i ∈ {−1, 1}. Consider running stochastic gradient descent (SGD) to find a weight vector w that minimizes (1/m) Σ_{i=1}^m ϕ(y^i · w^T x^i). Explain the explicit relationship between this algorithm and the Perceptron algorithm. Recall that for SGD, the update rule when the i-th example is picked at random is w_new = w_old − η ∇ϕ(y^i · w^T x^i).
Note: You do not need to be overly concerned about the non-differentiability of ϕ at 0, so you can ignore this point when calculating the gradient for this problem.
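As an illustrative aside (not the requested answer), the sketch below computes a single SGD step for ϕ(z) = max(0, −z): the step leaves w unchanged whenever y^i · w^T x^i > 0 and otherwise adds η y^i x^i, so with η = 1 it matches the Perceptron update on mistakes. The concrete vectors and learning rate are assumptions.

```python
import numpy as np

def sgd_step(w, x, y, eta):
    """One SGD step on phi(z) = max(0, -z) evaluated at z = y * w.x.
    Subgradient: dphi/dz = 0 for z > 0 and -1 for z < 0 (the kink at z = 0 is
    ignored, as the note allows), so grad_w = 0 if y * w.x > 0, else -y * x."""
    z = y * np.dot(w, x)
    if z > 0:
        grad = np.zeros_like(w)            # correctly classified: no change
    else:
        grad = -y * x                      # mistake (or boundary): move toward y * x
    return w - eta * grad

# Hypothetical example; with eta = 1 the update on a mistake is w + y * x,
# which is exactly the Perceptron update.
w = np.zeros(3)
x = np.array([0.5, -1.0, 2.0])
print(sgd_step(w, x, y=1, eta=1.0))
```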
- Here we will give an illustrative example of a weak learner for a simple concept class. Let the
domain be the real line, R, and let C refer to the concept class of “3-piece classifiers”, which
are functions of the following form: for θ1 < θ2 and b ∈ {−1, 1}, h_{θ1,θ2,b}(x) is b if x ∈ [θ1, θ2] and −b otherwise. In other words, they take a certain Boolean value inside a certain interval and the opposite value everywhere else. For example, h_{10,20,1}(x) would be +1 on [10, 20], and −1 everywhere else. Let H refer to the simpler class of “decision stumps”, i.e. functions h_{θ,b} such that h(x) is b for all x ≤ θ and −b otherwise.
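Both classes are easy to write down directly; the sketch below is simply a transcription of the definitions above (the function and variable names are mine).

```python
def three_piece(theta1, theta2, b):
    """h_{theta1, theta2, b}: returns b on [theta1, theta2] and -b elsewhere."""
    assert theta1 < theta2 and b in (-1, 1)
    return lambda x: b if theta1 <= x <= theta2 else -b

def stump(theta, b):
    """Decision stump h_{theta, b}: returns b for x <= theta and -b otherwise."""
    assert b in (-1, 1)
    return lambda x: b if x <= theta else -b

h = three_piece(10, 20, 1)     # the example from the text: +1 on [10, 20], -1 elsewhere
print(h(15), h(25))            # 1 -1
```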
(a) Show formally that for any distribution on R (assume finite support, for simplicity; i.e.,
assume the distribution is bounded within [−B, B] for some large B) and any unknown
labeling function c ∈ C that is a 3-piece classifier, there exists a decision stump h ∈ H
that has error at most 1/3, i.e. P[h(x) ≠ c(x)] ≤ 1/3.
(b) Describe a simple, efficient procedure for finding a decision stump that minimizes error with respect to a finite training set of size m. Such a procedure is called an empirical risk minimizer (ERM). (An illustrative sketch of one possible procedure appears after part (c).)
(c) Give a short intuitive explanation for why we should expect that we can easily pick m sufficiently large that the training error is a good approximation of the true error, i.e. why we can ensure generalization. (Your answer should relate to what we have gained in moving to the simpler class H.)
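As a hedged illustration of what part (b) is asking for, here is a brute-force ERM sketch over decision stumps: it tries one candidate threshold below all points, one between each pair of consecutive sorted points, and one above all points, together with both signs, and keeps the pair with lowest training error. The candidate-threshold scheme and the toy data (labeled by h_{10,20,1}) are assumptions, not the required answer.

```python
def stump_error(theta, b, samples):
    """Training error of the stump h(x) = b if x <= theta else -b on (x, y) pairs."""
    return sum(1 for x, y in samples if (b if x <= theta else -b) != y) / len(samples)

def erm_stump(samples):
    """Brute-force ERM over decision stumps, O(m^2) as written; with sorting and
    running error counts the same search can be done in O(m log m)."""
    xs = sorted(x for x, _ in samples)
    # Candidate thresholds: below all points, midpoints between neighbors, above all points.
    candidates = [xs[0] - 1.0] + [(lo + hi) / 2.0 for lo, hi in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    return min(((theta, b) for theta in candidates for b in (-1, 1)),
               key=lambda p: stump_error(p[0], p[1], samples))

# Toy training set labeled by the 3-piece classifier h_{10,20,1} from the text.
samples = [(5, -1), (12, 1), (18, 1), (25, -1), (30, -1)]
theta, b = erm_stump(samples)
print(theta, b, stump_error(theta, b, samples))   # best stump and its training error
```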
Sample Solution