Perceptron algorithm on a training set
- Consider running the Perceptron algorithm on a training set S arranged in a certain order. Now suppose we run it with the same initial weights and on the same training set but in a different order, S′. Does Perceptron make the same number of mistakes? Does it end up with the same final weights? If so, prove it. If not, give a counterexample, i.e. an S and S′ where order matters.
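To make the question concrete, the following minimal sketch (not part of the original problem) runs a standard Perceptron with zero initial weights on a small, purely hypothetical dataset under two orderings and reports the mistake counts and final weights; the dataset, the no-bias convention, and the cycle-until-convergence loop are all assumptions.

```python
import numpy as np

def perceptron(examples):
    """Classic Perceptron: zero initial weights, no bias term, cycle through the
    data until a full pass makes no mistakes; return (final weights, #mistakes)."""
    w = np.zeros(len(examples[0][0]))
    mistakes = 0
    changed = True
    while changed:
        changed = False
        for x, y in examples:
            if y * np.dot(w, x) <= 0:      # current w misclassifies (x, y)
                w = w + y * x              # Perceptron update
                mistakes += 1
                changed = True
    return w, mistakes

# Hypothetical linearly separable data; the specific points are illustrative only.
S = [(np.array([1.0, 0.0]),  1),
     (np.array([0.0, 1.0]), -1),
     (np.array([1.0, 1.0]),  1)]
S_prime = [S[2], S[0], S[1]]               # same set, different presentation order

print(perceptron(S))                        # run on order S
print(perceptron(S_prime))                  # run on order S'; compare both outputs
```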
- We have mainly focused on squared loss, but there are other interesting losses in machine learning. Consider the following loss function, which we denote by ϕ(z) = max(0, −z). Let S be a training set (x^1, y^1), . . . , (x^m, y^m), where each x^i ∈ R^n and y^i ∈ {−1, 1}. Consider running stochastic gradient descent (SGD) to find a weight vector w that minimizes (1/m) Σ_{i=1}^m ϕ(y^i · w^T x^i). Explain the explicit relationship between this algorithm and the Perceptron algorithm. Recall that for SGD, the update rule when the i-th example is picked at random is w_new = w_old − η ∇ϕ(y^i · w^T x^i).
Note: You do not need to be overly concerned about the non-differentiability of ϕ at 0, so you can ignore this point when calculating the gradient for this problem.
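As an illustrative aside (not the requested answer), the sketch below computes a single SGD step for ϕ(z) = max(0, −z): the step leaves w unchanged whenever y^i · w^T x^i > 0 and otherwise adds η y^i x^i, so with η = 1 it matches the Perceptron update on mistakes. The concrete vectors and learning rate are assumptions.

```python
import numpy as np

def sgd_step(w, x, y, eta):
    """One SGD step on phi(z) = max(0, -z) evaluated at z = y * w.x.
    Subgradient: dphi/dz = 0 for z > 0 and -1 for z < 0 (the kink at z = 0 is
    ignored, as the note allows), so grad_w = 0 if y * w.x > 0, else -y * x."""
    z = y * np.dot(w, x)
    if z > 0:
        grad = np.zeros_like(w)            # correctly classified: no change
    else:
        grad = -y * x                      # mistake (or boundary): move toward y * x
    return w - eta * grad

# Hypothetical example; with eta = 1 the update on a mistake is w + y * x,
# which is exactly the Perceptron update.
w = np.zeros(3)
x = np.array([0.5, -1.0, 2.0])
print(sgd_step(w, x, y=1, eta=1.0))
```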
- Here we will give an illustrative example of a weak learner for a simple concept class. Let the
domain be the real line, R, and let C refer to the concept class of “3-piece classifiers”, which
are functions of the following form: for θ1 < θ2 and b ∈ {−1, 1}, h_{θ1,θ2,b}(x) is b if x ∈ [θ1, θ2] and −b otherwise. In other words, they take a certain Boolean value inside a certain interval and the opposite value everywhere else. For example, h_{10,20,1}(x) would be +1 on [10, 20], and −1 everywhere else. Let H refer to the simpler class of “decision stumps”, i.e. functions h_{θ,b} such that h(x) is b for all x ≤ θ and −b otherwise.
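Both classes are easy to write down directly; the sketch below is simply a transcription of the definitions above (the function and variable names are mine).

```python
def three_piece(theta1, theta2, b):
    """h_{theta1, theta2, b}: returns b on [theta1, theta2] and -b elsewhere."""
    assert theta1 < theta2 and b in (-1, 1)
    return lambda x: b if theta1 <= x <= theta2 else -b

def stump(theta, b):
    """Decision stump h_{theta, b}: returns b for x <= theta and -b otherwise."""
    assert b in (-1, 1)
    return lambda x: b if x <= theta else -b

h = three_piece(10, 20, 1)     # the example from the text: +1 on [10, 20], -1 elsewhere
print(h(15), h(25))            # 1 -1
```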
(a) Show formally that for any distribution on R (assume finite support, for simplicity; i.e.,
assume the distribution is bounded within [−B, B] for some large B) and any unknown
labeling function c ∈ C that is a 3-piece classifier, there exists a decision stump h ∈ H
that has error at most 1/3, i.e. P[h(x) ≠ c(x)] ≤ 1/3.
(b) Describe a simple, efficient procedure for finding a decision stump that minimizes error with respect to a finite training set of size m. Such a procedure is called an empirical risk minimizer (ERM). (An illustrative sketch of one possible procedure appears after part (c).)
(c) Give a short intuitive explanation for why we should expect that we can easily pick m sufficiently large that the training error is a good approximation of the true error, i.e. why we can ensure generalization. (Your answer should relate to what we have gained in moving to the simpler class H.)
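As a hedged illustration of what part (b) is asking for, here is a brute-force ERM sketch over decision stumps: it tries one candidate threshold below all points, one between each pair of consecutive sorted points, and one above all points, together with both signs, and keeps the pair with lowest training error. The candidate-threshold scheme and the toy data (labeled by h_{10,20,1}) are assumptions, not the required answer.

```python
def stump_error(theta, b, samples):
    """Training error of the stump h(x) = b if x <= theta else -b on (x, y) pairs."""
    return sum(1 for x, y in samples if (b if x <= theta else -b) != y) / len(samples)

def erm_stump(samples):
    """Brute-force ERM over decision stumps, O(m^2) as written; with sorting and
    running error counts the same search can be done in O(m log m)."""
    xs = sorted(x for x, _ in samples)
    # Candidate thresholds: below all points, midpoints between neighbors, above all points.
    candidates = [xs[0] - 1.0] + [(lo + hi) / 2.0 for lo, hi in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    return min(((theta, b) for theta in candidates for b in (-1, 1)),
               key=lambda p: stump_error(p[0], p[1], samples))

# Toy training set labeled by the 3-piece classifier h_{10,20,1} from the text.
samples = [(5, -1), (12, 1), (18, 1), (25, -1), (30, -1)]
theta, b = erm_stump(samples)
print(theta, b, stump_error(theta, b, samples))   # best stump and its training error
```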
Sample Solution