More on regression. Gradient descent. Classification
Recall from last time
• The problem of supervised learning: given data D ⊆ X × Y,
find a hypothesis h : X → Y which approximates the given
data well
• Supervised learning algorithms make specific choices about the
hypothesis class, the error function used to evaluate the
approximation, and the algorithm for error minimization
• Linear regression:
– Consider h to be a linear function
– Consider minimizing the mean squared error between h and
the true values on data set D
– Compute the gradient of the MSE and set it to 0
• We obtain a closed-form solution for the parameters
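
A minimal sketch of that closed-form solution (assuming NumPy and a small synthetic dataset; all names here are illustrative, not from the notes). Setting the gradient of the MSE to 0 yields the normal equations, which can be solved directly:

```python
import numpy as np

# Synthetic data: y = 2x + 1 plus noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=20)
y = 2 * x + 1 + 0.1 * rng.standard_normal(20)

# Design matrix with a bias column, so h(x) = w0 + w1 * x
X = np.column_stack([np.ones_like(x), x])

# Gradient of MSE = 0 gives the normal equations:
#   w = (X^T X)^{-1} X^T y
# (solved here without forming an explicit inverse)
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # approximately [1, 2]
```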
Overfitting
• A general, HUGELY IMPORTANT problem for all machine
learning algorithms
• We can find a hypothesis that predicts the training data
perfectly but does not generalize well to new data
• E.g., a lookup table!
• We are seeing an instance here: if we have a lot of parameters,
the hypothesis "memorizes" the data points, but is wild
everywhere else.
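
To make the memorization concrete, here is an illustrative sketch (assuming NumPy and synthetic data): a degree-9 polynomial through 10 points drives the training error to essentially zero, yet its prediction at a fresh input can be wildly off.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(10)

for d in (1, 3, 9):
    # Degree 9 interpolates all 10 points exactly, so its training
    # MSE is ~0 (NumPy may warn that the fit is poorly conditioned)
    coeffs = np.polyfit(x_train, y_train, deg=d)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    # A fresh point between the training inputs probes generalization;
    # the true value sin(2*pi*0.55) is about -0.31
    x_new = 0.55
    print(d, train_mse, np.polyval(coeffs, x_new))
```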
Overfitting more formally
• Every hypothesis has a "true" error J*(h) (measured on all
possible data items we could ever encounter)
• Because we do not have all the data, we measure the error on
the training set JD(h)
• Suppose we compare hypotheses h1 and h2 on the training set,
and JD(h1) < JD(h2)
• If h2 is "truly" better, i.e. J*(h2) < J*(h1), our algorithm is
overfitting.
• We need theoretical and empirical methods to guard against it!
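
In symbols, one standard way to formalize these two errors (assuming squared error and data drawn from an underlying distribution P; this formalization is added here for clarity, not taken from the notes):

```latex
% True error: expected loss over the whole data distribution P
J^{*}(h) = \mathbb{E}_{(x,y) \sim P}\left[ \big(h(x) - y\big)^{2} \right]

% Training error: average loss over the finite training set D
J_{D}(h) = \frac{1}{|D|} \sum_{(x,y) \in D} \big(h(x) - y\big)^{2}
```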
Leave-one-out cross-validation
• How can we choose the best d for an order-d polynomial fit to
the data?
• Repeat the following procedure:
– Leave out one instance from the training set, to estimate the
true prediction error for the best order-d fit for
d ∈ {1, 2, . . . , 9}.
– Use all the other data items for finding w
– Measure the error on the instance left out
– This is an unbiased estimate of the true prediction error
• Choose the d with lowest average prediction error
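
A minimal sketch of this procedure (assuming NumPy and synthetic data; loocv_mse is an illustrative helper name, not from the notes):

```python
import numpy as np

def loocv_mse(x, y, d):
    """Leave-one-out estimate of prediction error for a degree-d fit."""
    errors = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i          # leave out instance i
        coeffs = np.polyfit(x[mask], y[mask], deg=d)  # fit w on the rest
        pred = np.polyval(coeffs, x[i])        # predict the held-out point
        errors.append((pred - y[i]) ** 2)
    return np.mean(errors)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 15))
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(15)

# Choose the degree with the lowest average leave-one-out error
best_d = min(range(1, 10), key=lambda d: loocv_mse(x, y, d))
print(best_d)
```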
Cross-validation
• A general procedure for estimating the true error of a predictor
• The data is split into three subsets:
– A training set used only to find the parameters w
– A validation set used to find the right hypothesis class (e.g.
the degree of the polynomial)
– A test set used to report the prediction error of the algorithm
• These sets must be disjoint!
• The process is repeated several times, and the results are
averaged to provide error estimates.
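
A minimal sketch of the three-way split for the polynomial example (assuming NumPy and synthetic data; the 60/20/20 proportions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 60))
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(60)

# Disjoint split: 60% train, 20% validation, 20% test
idx = rng.permutation(60)
tr, va, te = idx[:36], idx[36:48], idx[48:]

def mse(coeffs, i):
    return np.mean((np.polyval(coeffs, x[i]) - y[i]) ** 2)

# Training set: find the parameters w for each candidate class (degree d)
fits = {d: np.polyfit(x[tr], y[tr], deg=d) for d in range(1, 10)}
# Validation set: choose the hypothesis class
best_d = min(fits, key=lambda d: mse(fits[d], va))
# Test set: report the final prediction error, touched only once
print(best_d, mse(fits[best_d], te))
```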