# Linear Regression

Machine Learning course notes

### Cost Function

• Hypothesis function:

$h_\theta(x) = \theta_0 + \theta_1 x$

• Cost function:

$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

### Cost Function - Intuition(1)

Here the training set is the three points $(1,1)$, $(2,2)$, $(3,3)$, so $m = 3$:

• When $\theta_1=1$, $h_\theta(x) = x$, so $J(1) = \frac{1}{2m}(0^2+0^2+0^2)=0$
• When $\theta_1=0.5$, $h_\theta(x) = 0.5x$, so $J(0.5) = \frac{1}{2m}\left((0.5-1)^2+(1-2)^2+(1.5-3)^2\right)\approx 0.58$
• When $\theta_1=0$, $h_\theta(x) = 0$, so $J(0) = \frac{1}{2m}(1^2+2^2+3^2)\approx 2.33$
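A quick arithmetic check of the three values above (a minimal sketch; the training points $(1,1),(2,2),(3,3)$ are inferred from the computations):

```python
# Cost J(theta1) for h(x) = theta1 * x on the training set (1,1), (2,2), (3,3)
def cost(theta1):
    xs = [1, 2, 3]
    ys = [1, 2, 3]
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

print(cost(1.0))   # 0.0
print(cost(0.5))   # ~0.58
print(cost(0.0))   # ~2.33
```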

### Cost Function - Intuition(2)

Taking any color and going along the ‘circle’, one would expect to get the same value of the cost function. For example, the three green points found on the green line above have the same value for J(θ0,θ1) and as a result, they are found along the same line. The circled x displays the value of the cost function for the graph on the left when θ0 = 800 and θ1= -0.15

Taking another h(x) and plotting its contour plot, one gets the following graphs:

When θ0 = 360 and θ1 = 0, the value of J(θ0,θ1) in the contour plot gets closer to the center thus reducing the cost function error. Now giving our hypothesis function a slightly positive slope results in a better fit of the data.

The graph above minimizes the cost function as much as possible, and consequently the results for θ1 and θ0 tend to be around 0.12 and 250 respectively. Plotting those values on the graph to the right seems to put our point in the center of the innermost ‘circle’.

• Octave Demo

```matlab
function J = computeCost(X, y, theta)
% Compute the linear-regression cost J(theta) for design matrix X and targets y

m = length(y);                        % number of training examples

predictions = X * theta;              % m x 1 vector of predictions
sqrErrors = (predictions - y) .^ 2;   % squared errors
J = 1 / (2 * m) * sum(sqrErrors);

end
```


• Method 1: pick arbitrary initial values ${\theta }_{0},{\theta }_{1}$, e.g. ${\theta }_{0}=1,{\theta }_{1}=1$

• Method 2: repeatedly change ${\theta }_{0},{\theta }_{1}$ so that $J\left({\theta }_{0},{\theta }_{1}\right)$ decreases along the gradient direction, until a minimum is reached

• Graphical intuition

• Gradient descent:
• := denotes assignment: a := b assigns the value of b to a, and similarly a := a + 1. So := is assignment in the programming sense, whereas = is a truth assertion: a = b states that the value of a is b
• α is the learning rate, i.e. the step size of gradient descent; $\frac{\partial}{\partial {\theta }_{j}}J\left({\theta }_{0},{\theta }_{1}\right)$ is the partial derivative of J with respect to ${\theta}_{j}$

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$
• Understanding gradient descent

For the single-parameter case, the update is

$\theta_1 := \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_1)$

At a local minimum the derivative is zero, so the update leaves $\theta_1$ unchanged:

$\theta_1 := \theta_1 - \alpha \cdot 0$

repeat until convergence: {

$\quad \theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$

$\quad \theta_1 := \theta_1 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$

}
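The update rule above can be sketched in plain Python (a minimal illustration; the toy dataset $(1,1),(2,2),(3,3)$ and the learning rate are assumptions, not part of the notes):

```python
# Batch gradient descent for h(x) = theta0 + theta1 * x,
# applying the two simultaneous updates from the formula above.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]
m = len(xs)
alpha = 0.1
theta0, theta1 = 0.0, 0.0

for _ in range(1000):
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / m
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m
    # simultaneous update: both gradients use the old theta values
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(theta0, theta1)  # converges toward theta0 ≈ 0, theta1 ≈ 1
```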

### Multiple features

| Size (x) | Price (y) |
| --- | --- |
| 2104 | 460 |
| 1035 | 224 |
| 868 | 230 |
| 642 | 126 |

Here x is the size of the house and y is its price, and the one-variable linear regression hypothesis above is $h_\theta(x) = \theta_0 + \theta_1 x$. However, many factors affect the price, such as the number of bedrooms, the number of floors, and so on:

| Size (x1) | Number of bedrooms (x2) | Number of floors (x3) | Price (y) |
| --- | --- | --- | --- |
| 2104 | 5 | 2 | 460 |
| 1035 | 4 | 1 | 224 |
| 868 | 3 | 2 | 230 |
| 642 | 2 | 1 | 126 |

$h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n$

Notation:

• $x_j^{(i)}$ = value of feature $j$ in the $i^{th}$ training example
• $x^{(i)}$ = the input (features) of the $i^{th}$ training example
• $m$ = the number of training examples
• $n$ = the number of features

For example, $x^{(2)}$ is the feature vector of the second training example:

$x^{(2)} = \begin{bmatrix} 1035 \\ 4 \\ 1 \end{bmatrix}$

and ${x}_{3}^{(2)}$ denotes the third element of that vector:

$x_3^{(2)} = 1$

$h_\theta(x) = 80 + 0.1 x_1 + 0.01 x_2 + 3 x_3 - 2 x_4$

For convenience of notation, define ${x}_{0} = 1$, so that

$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$

$x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}$

$h_\theta(x) = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta^T x$
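The vectorized form $h_\theta(x) = \theta^T x$ can be sketched with NumPy (a minimal illustration; the coefficients reuse the example hypothesis above, and the feature values, including the fourth feature, are made up):

```python
import numpy as np

# theta and x as 1-D arrays; x[0] = 1 is the bias feature
theta = np.array([80.0, 0.1, 0.01, 3.0, -2.0])
x = np.array([1.0, 2104.0, 5.0, 2.0, 1.0])  # x0=1, size, bedrooms, floors, hypothetical x4

h = theta @ x  # inner product theta^T x
print(h)
```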

### Gradient Descent for Multiple variables

$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$

Taking the partial derivative of $J$ with respect to ${\theta }_{j}$ gives:

repeat until convergence: {

$\quad \theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) \cdot x_0^{(i)}$

$\quad \theta_1 := \theta_1 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) \cdot x_1^{(i)}$

$\quad \theta_2 := \theta_2 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) \cdot x_2^{(i)}$

$\quad \cdots$

$\quad \theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) \cdot x_j^{(i)}$

}
• Octave demo of the gradient computation for linear regression

```matlab
function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
% Run gradient descent, taking num_iters steps with learning rate alpha

% Initialize some useful values
m = length(y);                     % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters

    num_features = size(X, 2);
    h = X * theta;                 % predictions for all examples

    for j = 1:num_features
        x = X(:, j);
        theta(j) = theta(j) - alpha * (1/m) * sum((h - y) .* x);
    end

    % Save the cost J in every iteration
    J_history(iter) = computeCostMulti(X, y, theta);

end

end
```
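The per-feature loop above can be collapsed into a single matrix operation. Below is a NumPy sketch of the same update (an illustration, not part of the course materials; the toy data is an assumption):

```python
import numpy as np

def gradient_descent_multi(X, y, theta, alpha, num_iters):
    """Vectorized batch gradient descent: theta -= alpha/m * X^T (X theta - y)."""
    m = len(y)
    J_history = np.zeros(num_iters)
    for it in range(num_iters):
        h = X @ theta                                   # predictions for all examples
        theta = theta - (alpha / m) * (X.T @ (h - y))   # update every theta_j at once
        J_history[it] = np.sum((X @ theta - y) ** 2) / (2 * m)
    return theta, J_history

# Toy data generated from y = 1 + 2*x1, with a leading column of ones for x0
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
theta, J_history = gradient_descent_multi(X, y, np.zeros(2), alpha=0.05, num_iters=5000)
print(theta)  # approaches [1, 2]
```

Replacing the inner `for j` loop with `X.T @ (h - y)` computes all partial derivatives in one pass and keeps the update exactly simultaneous.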


### Feature Scaling

Idea: Make sure features are on a similar scale.

E.g.: $x_1$ = size (0–2000 feet²), $x_2$ = number of bedrooms (1–5).

If instead we use $x_1$ = size/2000 and $x_2$ = (number of bedrooms)/5, the contour plot becomes close to circular and gradient descent converges much faster. To speed up convergence, each feature value (each $x_i$) is usually rescaled into some comparable interval, e.g. $0\le {x}_{1}\le 3$, $-2\le {x}_{2}\le 0.5$, and so on.

### Mean normalization

Replace ${x}_{i}$ with $\frac{x_i - \mu_i}{s_i}$ to make features have approximately zero mean (in effect, this normalizes each feature), where:

• ${\mu }_{i}$ is the average of all the values of feature $i$
• ${s}_{i}$ is the range of feature $i$, i.e. (max − min)

Note that dividing by the range, or dividing by the standard deviation, give different results. The quizzes in this course use range - the programming exercises use standard deviation.

For example, if xi represents housing prices with a range of 100 to 2000 and a mean value of 1000, then

$x_i := \frac{\mathrm{price} - 1000}{1900}$

• μ is the mean value of the feature (1000 here)
• $s = \mathrm{max} - \mathrm{min}$ (1900 here)
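A minimal sketch of mean normalization in Python (the price list is made up so that its mean and range match the 1000 and 1900 of the example above):

```python
# Mean normalization: (x - mu) / s, with s = max - min
prices = [100.0, 500.0, 900.0, 1500.0, 2000.0]  # hypothetical prices; mean 1000, range 1900

mu = sum(prices) / len(prices)
s = max(prices) - min(prices)
normalized = [(p - mu) / s for p in prices]

print(mu, s)
print(normalized)  # zero-mean values, spread within about [-0.5, 0.5]
```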

### Learning Rate

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$

• Summary:
  - if α is too small: slow convergence
  - if α is too large: J(θ) may not decrease on every iteration; may not converge
  - to choose α, try: …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
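The effect of α can be sketched numerically (an illustration on an assumed toy dataset, not from the notes): with a small α the cost decreases every iteration, while an overly large α makes it blow up.

```python
import numpy as np

def run_gd(alpha, iters=20):
    """Return the cost history of vectorized gradient descent on a tiny dataset."""
    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([1.0, 2.0, 3.0])
    m = len(y)
    theta = np.zeros(2)
    costs = []
    for _ in range(iters):
        theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
        costs.append(np.sum((X @ theta - y) ** 2) / (2 * m))
    return costs

small = run_gd(0.1)   # J decreases steadily
large = run_gd(1.0)   # J grows: this alpha overshoots the minimum on this data
print(small[0], small[-1])
print(large[0], large[-1])
```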

### Polynomial regression

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n$

We can improve our features and the form of our hypothesis function in a couple different ways. We can combine multiple features into one. For example, we can combine x1 and x2 into a new feature x3 by taking x1⋅x2.

We can also change the shape of the hypothesis curve by using powers of a single feature, e.g. a quadratic or cubic function:

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 \quad \text{or} \quad h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3$
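Creating polynomial features reduces this to ordinary multiple-feature linear regression. A minimal sketch (the data column is an assumption):

```python
import numpy as np

# Build a design matrix [1, x, x^2, x^3] from a single feature column,
# so cubic regression becomes plain linear regression in the new features.
x = np.array([1.0, 2.0, 3.0, 4.0])
X_poly = np.column_stack([np.ones_like(x), x, x**2, x**3])

print(X_poly)
# Note: x, x^2, and x^3 live on very different scales, so feature scaling
# (previous sections) matters even more for polynomial regression.
```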

### Normal Equation

$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

Setting every partial derivative to zero:

$\frac{\partial}{\partial \theta_j} J(\theta) = 0 \quad \text{(for every } j)$

$X = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_n^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)} & \cdots & x_n^{(m)} \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}, \quad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}$

• $X$ is an $m \times (n+1)$ matrix
• $\theta$ is an $(n+1) \times 1$ vector
• $Y$ is an $m \times 1$ vector

• Identity matrix $E$: $\mathrm{AE}=\mathrm{EA}=A$
• Matrix inverse ${A}^{\mathrm{-1}}$: $A$ must be square, and $A{A}^{\mathrm{-1}}={A}^{\mathrm{-1}}A=E$

Starting from $X\theta = Y$:

1. First make the matrix to the left of θ square. This is done by multiplying both sides by ${X}^{T}$, giving ${X}^{T}X\theta ={X}^{T}Y$

2. Turn the part to the left of θ into an identity matrix, so that only θ remains on the left: $\left({X}^{T}X\right)^{-1}{X}^{T}X\theta =\left({X}^{T}X\right)^{-1}{X}^{T}Y$

3. Since $\left({X}^{T}X\right)^{-1}{X}^{T}X=E$, the equation becomes $\theta =\left({X}^{T}X\right)^{-1}{X}^{T}Y$, which is the Normal Equation.

| Gradient Descent | Normal Equation |
| --- | --- |
| Need to choose α | No need to choose α |
| Needs many iterations | No need to iterate |
| $O\left(k{n}^{2}\right)$ | $O\left({n}^{3}\right)$, need to calculate inverse of ${X}^{T}X$ |
| Works well when n is large | Slow if n is very large |

When implementing the normal equation in Octave we want to use the `pinv` function rather than `inv`. The `pinv` function will give you a value of θ even if ${X}^{T}X$ is not invertible.
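A NumPy sketch of the normal equation (an illustration with assumed toy data; `np.linalg.pinv` plays the role of Octave's `pinv`):

```python
import numpy as np

# Toy data generated from y = 1 + 2*x1 + 3*x2, with a leading column of ones for x0
X = np.array([
    [1.0, 1.0, 2.0],
    [1.0, 2.0, 1.0],
    [1.0, 3.0, 3.0],
    [1.0, 4.0, 2.0],
])
y = np.array([9.0, 8.0, 16.0, 15.0])

# theta = (X^T X)^{-1} X^T y, computed with the pseudo-inverse so the
# formula still yields a result when X^T X is not invertible
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)  # ≈ [1, 2, 3]
```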

If ${X}^{T}X$ is noninvertible, the common causes might be:

• Redundant features, where two features are very closely related (i.e. they are linearly dependent)
• Too many features (e.g. m ≤ n). In this case, delete some features or use “regularization” (to be explained in a later lesson).

Solutions to the above problems include deleting a feature that is linearly dependent with another or deleting one or more features when there are too many features.