1
00:00:00,820 --> 00:00:03,673
When performing linear regression, we have a number of data points.
2
00:00:03,673 --> 00:00:09,043
Let's say that we have 1, 2, 3 and so on up through M data points.
3
00:00:09,043 --> 00:00:12,013
Each data point has an output variable, Y,
4
00:00:12,013 --> 00:00:15,400
and a number of input variables, X1 through XN.
5
00:00:16,410 --> 00:00:20,400
So in our baseball example, Y is the lifetime number of home runs.
6
00:00:20,400 --> 00:00:24,370
And our X1 through XN are things like height and weight.
7
00:00:24,370 --> 00:00:27,762
Our one through M samples might be different baseball players.
8
00:00:27,762 --> 00:00:32,572
So maybe data point one is Derek Jeter, data point two is Barry Bonds, and
9
00:00:32,572 --> 00:00:34,870
data point M is Babe Ruth.
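To make this layout concrete, here is a minimal sketch in Python with NumPy; the heights, weights, and home-run totals are illustrative placeholders paired with the players named above, not verified statistics.

```python
import numpy as np

# M data points (rows), N input variables (columns).
# Columns stand for height (inches) and weight (pounds);
# the numbers are placeholders, not real player stats.
X = np.array([
    [75.0, 195.0],   # data point 1, e.g. Derek Jeter
    [74.0, 228.0],   # data point 2, e.g. Barry Bonds
    [74.0, 215.0],   # data point M, e.g. Babe Ruth
])

# Output variable Y: one value per data point
# (lifetime home runs; again, illustrative numbers).
y = np.array([260.0, 762.0, 714.0])

M, N = X.shape   # M = 3 data points, N = 2 input variables
```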
10
00:00:34,870 --> 00:00:37,823
Generally speaking, we are trying to predict the values of
11
00:00:37,823 --> 00:00:41,779
the output variable for each data point by multiplying the input variables by
12
00:00:41,779 --> 00:00:45,579
some set of coefficients that we're going to call theta 1 through theta N.
13
00:00:45,579 --> 00:00:49,105
Each theta, which from here on out we'll call the parameters or
14
00:00:49,105 --> 00:00:52,826
the weights of the model, tells us how important an input variable is
15
00:00:52,826 --> 00:00:55,589
when predicting a value for the output variable.
16
00:00:55,589 --> 00:00:57,381
So if theta 1 is very small,
17
00:00:57,381 --> 00:01:01,880
X1 must not be very important in general when predicting Y.
18
00:01:01,880 --> 00:01:03,940
Whereas if theta N is very large,
19
00:01:03,940 --> 00:01:07,300
then XN is generally a big contributor to the value of Y.
20
00:01:07,300 --> 00:01:10,420
This model is built in such a way that we can multiply each X by
21
00:01:10,420 --> 00:01:13,500
the corresponding theta, and sum them up to get Y.
22
00:01:13,500 --> 00:01:17,080
So that our final equation will look something like the equation down here.
23
00:01:17,080 --> 00:01:20,250
Theta 1 times X1 plus theta 2 times X2,
24
00:01:20,250 --> 00:01:23,948
all the way up to theta N times XN equals Y.
25
00:01:23,948 --> 00:01:26,670
And we'd want to be able to predict Y for each of our M data points.
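Under the equation just described (with no intercept term), predicting Y for all M data points at once is a row-wise multiply-and-sum, i.e. a matrix-vector product. A minimal sketch continuing the arrays above, where the theta values are arbitrary placeholders rather than learned weights:

```python
# Hypothetical weights theta_1 ... theta_N; real values would
# be learned from data, these are placeholders.
theta = np.array([4.0, 1.5])

# Predicted Y for all M data points: each row's inputs are
# multiplied by the corresponding thetas and summed,
# exactly as in the equation above.
y_pred = X @ theta
```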
26
00:01:27,780 --> 00:01:31,844
In this illustration, the dark blue points represent our observed data points,
27
00:01:31,844 --> 00:01:34,712
whereas the green line shows the predicted value of Y for
28
00:01:34,712 --> 00:01:37,598
every value of X given the model that we may have created.
29
00:01:37,598 --> 00:01:41,380
The best equation is the one that's going to minimize the difference across all
30
00:01:41,380 --> 00:01:44,160
data points between our predicted Y and our observed Y.
31
00:01:45,280 --> 00:01:48,876
What we need to do is find the thetas that produce the best predictions.
32
00:01:48,876 --> 00:01:52,960
That is, make these differences as small as possible.
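The video doesn't show how to search for these thetas; as one hedged illustration, NumPy's least-squares solver finds the weights that minimize exactly the sum of squared differences discussed next:

```python
# One way (among several) to find the thetas that minimize the
# squared differences between predicted and observed Y,
# reusing X and y from above.
best_theta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ best_theta   # predictions with the fitted weights
```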
33
00:01:52,960 --> 00:01:56,650
If we wanted to create a value that describes the total error of our model,
34
00:01:56,650 --> 00:01:58,380
we'd probably sum up the errors.
35
00:01:58,380 --> 00:02:02,440
That is, sum over all of our data points, from i equals 1 to M,
36
00:02:02,440 --> 00:02:05,290
the predicted Y minus the actual Y.
37
00:02:05,290 --> 00:02:08,320
However, since these errors can be both negative and
38
00:02:08,320 --> 00:02:11,570
positive, if we simply sum them up, we could have
39
00:02:11,570 --> 00:02:17,040
a total error term that's very close to 0, even if our model is very wrong.
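A tiny numeric check of this cancellation problem, with made-up errors:

```python
# Two predictions that are each off by 10, in opposite
# directions: the raw errors sum to exactly 0, hiding the
# fact that the model is wrong on every point.
errors = np.array([10.0, -10.0])
print(errors.sum())   # 0.0
```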
40
00:02:17,040 --> 00:02:20,710
In order to correct this, rather than simply adding up the error terms,
41
00:02:20,710 --> 00:02:23,100
we're going to add the square of the error terms.
42
00:02:23,100 --> 00:02:26,940
This guarantees that the contribution of each individual error term,
43
00:02:26,940 --> 00:02:29,680
Y predicted minus Y actual, squared, is never negative.
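A sketch of that squared-error total, reusing y_pred and y from the snippets above:

```python
# Sum of squared errors: squaring makes every term
# non-negative, so errors can no longer cancel out.
errors = y_pred - y                    # predicted Y minus actual Y
total_squared_error = np.sum(errors ** 2)
```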
44
00:02:30,920 --> 00:02:33,620
Why don't we make sure the distinction between input variables and
45
00:02:33,620 --> 00:02:34,830
output variables is clear.