Linear Regression


Machine Learning course notes

Cost Function

  • Hypothesis function: $h_\theta(x) = \theta_0 + \theta_1 x$

How do we compute the parameters θ? They are obtained by minimizing the cost function.

  • Cost function: $J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

The goal expressed by this formula is to find values of $\theta_0,\theta_1$ that minimize $J(\theta_0,\theta_1)$; the factor of 1/2 is included to make the derivative simpler.

Cost Function - Intuition (1)

For the hypothesis function $h_\theta(x) = \theta_0 + \theta_1 x$: when $\theta_0 = 0$ it simplifies to $h_\theta(x) = \theta_1 x$, and the cost function simplifies to $J(\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(\theta_1 x^{(i)} - y^{(i)}\right)^2$.

Suppose we have a training set: (1,1), (2,2), (3,3).

  • When $\theta_1=1$: $h_\theta(x) = x$, so $J(1) = \frac{1}{2m}(0^2+0^2+0^2)=0$
  • When $\theta_1=0.5$: $h_\theta(x) = 0.5x$, so $J(0.5) = \frac{1}{2m}((0.5-1)^2+(1-2)^2+(1.5-3)^2)\approx 0.58$
  • When $\theta_1=0$: $h_\theta(x) = 0$, so $J(0) = \frac{1}{2m}(1^2+2^2+3^2)\approx 2.33$

Continuing in this way, different values of θ yield different values of J(θ), as shown in the figure below:

Our goal is to find the value of θ that minimizes J(θ). In the example above, J(θ) is clearly smallest at θ=1, which gives the hypothesis function $h_\theta(x) = x$.

Cost Function - Intuition (2)

We can use contour plots to look at the projection of $J(\theta_0,\theta_1)$ onto a two-dimensional plane.

For more on contour plots, see [this reference](https://nb.khanacademy.org/math/multivariable-calculus/thinking-about-multivariable-function/visualizing-scalar-valued-functions/v/contour-plots).

Taking any color and going along the ‘circle’, one would expect to get the same value of the cost function. For example, the three green points found on the green line above have the same value for J(θ0,θ1) and as a result, they are found along the same line. The circled x displays the value of the cost function for the graph on the left when θ0 = 800 and θ1= -0.15

Taking another h(x) and plotting its contour plot, one gets the following graphs:

When θ0 = 360 and θ1 = 0, the value of J(θ0,θ1) in the contour plot gets closer to the center thus reducing the cost function error. Now giving our hypothesis function a slightly positive slope results in a better fit of the data.

The graph above minimizes the cost function as much as possible, and consequently the resulting θ1 and θ0 tend to be around 0.12 and 250 respectively. Plotting those values on our graph to the right seems to put our point in the center of the innermost ‘circle’.

  • Octave Demo
function J = computeCost(X, y, theta)
%COMPUTECOST Compute the linear regression cost J(theta)

m = length(y);                        % number of training examples

predictions = X * theta;              % h_theta(x) for every example (m x 1)
sqrErrors = (predictions - y) .^ 2;   % squared errors
J = 1 / (2 * m) * sum(sqrErrors);     % J = 1/(2m) * sum of squared errors

end
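As a quick check, here is a usage sketch of computeCost with the toy training set (1,1), (2,2), (3,3) from the intuition section (the column of ones in X corresponds to $x_0 = 1$ for $\theta_0$):

X = [1 1; 1 2; 1 3];         % first column of ones for theta_0
y = [1; 2; 3];
computeCost(X, y, [0; 1])    % theta_1 = 1   -> J = 0
computeCost(X, y, [0; 0.5])  % theta_1 = 0.5 -> J ~= 0.58
computeCost(X, y, [0; 0])    % theta_1 = 0   -> J ~= 2.33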

Gradient descent

For the cost function $J(\theta_0,\theta_1)$: find $\theta_0,\theta_1$ that minimize $J(\theta_0,\theta_1)$.

  • Step 1. Start from arbitrary initial values of $\theta_0,\theta_1$, e.g. $\theta_0=1,\theta_1=1$.

  • Step 2. Keep changing $\theta_0,\theta_1$ so that $J(\theta_0,\theta_1)$ decreases along the gradient direction, until a minimum is found.

  • Graphical intuition


  • Gradient descent update rule:
    • := denotes assignment: a := b assigns the value of b to a, and similarly a := a + 1. So := is assignment in the programming sense, whereas = is a truth assertion: a = b asserts that the value of a is b.
    • $\alpha$ is the learning rate, i.e. the step size of gradient descent; $\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)$ is the partial derivative of J with respect to $\theta_j$.
$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)$
  • Understanding gradient descent

Gradient descent is a method for finding extrema of a multivariate function, which is why the formula takes the derivative with respect to each $\theta_j$: each $\theta_j$ is one parameter, i.e. one dimension. In the case above there are only two parameters, $\theta_0$ and $\theta_1$, so the descent happens along both directions at once, together pointing in the direction in which $J(\theta)$ decreases. The descent is an iterative process in which the parameters are updated simultaneously:
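Concretely, "simultaneous" means both parameters are updated from the same old values before either is overwritten ($\text{temp}_0$ and $\text{temp}_1$ below are just illustrative temporaries):

$\text{temp}_0 := \theta_0 - \alpha\frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)$

$\text{temp}_1 := \theta_1 - \alpha\frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)$

$\theta_0 := \text{temp}_0, \quad \theta_1 := \text{temp}_1$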

Before tackling the two-dimensional case, first assume $J(\theta)$ is one-dimensional, i.e. has only one parameter; the update rule above then simplifies to:

$\theta_1 := \theta_1 - \alpha\frac{d}{d\theta_1}J(\theta_1)$

The problem reduces to differentiating a function of one variable. Suppose $J(\theta)$ looks like the figure below: θ gradually converges toward the extremum, and once θ reaches it the derivative there is 0, so θ stops changing:

$\theta_1 := \theta_1 - \alpha \cdot 0$

With the one-dimensional case understood, we now apply gradient descent to $J(\theta_0,\theta_1)$. Taking the partial derivatives with respect to $\theta_0$ and $\theta_1$ gives:

repeat until convergence: {

$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$

$\theta_1 := \theta_1 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}$

}
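These updates follow from applying the chain rule to $J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$ with $h_\theta(x) = \theta_0 + \theta_1 x$:

$\frac{\partial}{\partial\theta_j}J = \frac{1}{2m}\sum_{i=1}^{m}2\left(h_\theta(x^{(i)}) - y^{(i)}\right)\frac{\partial h_\theta(x^{(i)})}{\partial\theta_j}, \qquad \frac{\partial h_\theta(x^{(i)})}{\partial\theta_0} = 1, \quad \frac{\partial h_\theta(x^{(i)})}{\partial\theta_1} = x^{(i)}$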

For linear regression, $J(\theta)$ is a convex function, so the updates above have no local extrema to get stuck in; there is only a single global minimum. Gradient descent usually descends quickly when far from the minimum but converges slowly near it. When the objective function is convex, the solution found by gradient descent is the global optimum; in general, however, gradient descent does not guarantee a global optimum.

Multiple features

The problem discussed in the previous sections was: given a sample data set of house sizes and prices, derive the function relating price to size:

| Size (x) | Price (y) |
| --- | --- |
| 2104 | 460 |
| 1035 | 224 |
| 868 | 230 |
| 642 | 126 |

Here x is the house size, y is the price, and the one-dimensional linear regression function above is:

$h_\theta(x) = \theta_0 + \theta_1 x$

However, many factors affect the price, such as the number of bedrooms, the number of floors, and so on:

| Size (x1) | Number of bedrooms (x2) | Number of floors (x3) | Price (y) |
| --- | --- | --- | --- |
| 2104 | 5 | 2 | 460 |
| 1035 | 4 | 1 | 224 |
| 868 | 3 | 2 | 230 |
| 642 | 2 | 1 | 126 |

When x is multi-dimensional, the formula becomes:

$h_\theta(x) = \sum_{j=0}^{n}\theta_j x_j = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_n x_n$

where:

$x_j^{(i)}$ = value of feature $j$ in the $i^{th}$ training example; $x^{(i)}$ = the input (features) of the $i^{th}$ training example; $m$ = the number of training examples; $n$ = the number of features.

For example, $x^{(2)}$ denotes the features of the second training example:

$x^{(2)} = \begin{bmatrix} 1035 \\ 4 \\ 1 \end{bmatrix}$

and $x_3^{(2)}$ denotes the third element of that vector:

$x_3^{(2)} = 1$
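As an illustration, assuming the table above is stored row-wise in an Octave matrix X (one row per training example, one column per feature):

% Columns: x1 (size), x2 (#bedrooms), x3 (#floors)
X = [2104 5 2;
     1035 4 1;
      868 3 2;
      642 2 1];
X(2, :)   % x^(2): features of the 2nd training example -> [1035 4 1]
X(2, 3)   % x_3^(2): the 3rd feature of the 2nd example -> 1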

Returning to the house-price example, suppose we obtain the following function:

$h_\theta(x) = 80 + 0.1x_1 + 0.01x_2 + 3x_3 - 2x_4$

Here $h_\theta(x)$ is the total price of the house and $\theta_0 = 80$ is its base price; $x_1$ is the size of the house and $\theta_1$ is the coefficient for $x_1$ computed via the cost function; similarly, $x_2$ is the number of bedrooms and $\theta_2$ is its coefficient, and so on.

In the formula

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_n x_n$

$x_0$ is defined to be 1, i.e. $x_0^{(i)} = 1$, so each sample and its corresponding parameters can be viewed as two vectors:

$x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}$

Both $x$ and $\theta$ are $(n+1)\times 1$ matrices (column vectors), and $\theta^T$ is $1\times(n+1)$, so the formula above can also be written as:

$h_\theta(x) = \begin{bmatrix} \theta_0 & \theta_1 & \dots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta^T x$

This model is also called multivariate linear regression.
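A minimal Octave sketch of the vectorized form $h_\theta(x) = \theta^T x$, using the hypothetical coefficients from the pricing example above and made-up feature values (x(1) plays the role of $x_0 = 1$):

theta = [80; 0.1; 0.01; 3; -2];   % [theta_0; theta_1; ...; theta_4] from the example
x     = [1; 1035; 4; 1; 30];      % [x_0; x_1; ...; x_4], feature values are illustrative
h = theta' * x                    % h_theta(x) = theta^T x, a scalar prediction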

Gradient Descent for Multiple variables

By analogy with the one-dimensional cost function, the cost function for multivariate linear regression is:

$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

The multivariate gradient descent update is similar to the earlier one:

$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)$

Taking the partial derivative with respect to each $\theta_j$ gives:

repeat until convergence: {

$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$

$\theta_1 := \theta_1 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_1^{(i)}$

$\theta_2 := \theta_2 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_2^{(i)}$

$\dots$

$\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$

}
  • Octave demo of the linear regression gradient step
function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
%GRADIENTDESCENTMULTI Performs gradient descent to learn theta
%   theta = GRADIENTDESCENTMULTI(X, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha

% Initialize some useful values
m = length(y);                 % number of training examples
J_history = zeros(num_iters, 1);
num_features = size(X, 2);     % n + 1, including the x_0 = 1 column

for iter = 1:num_iters

    % Predictions from the current theta, computed once per iteration so that
    % every theta(j) below is updated from the same (old) parameter values
    h = X * theta;

    for j = 1:num_features
        x = X(:, j);           % the j-th feature across all training examples
        theta(j) = theta(j) - alpha * (1/m) * sum((h - y) .* x);
    end

    % Save the cost J in every iteration
    J_history(iter) = computeCostMulti(X, y, theta);

end

end
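The inner loop over the features can equivalently be written as a single vectorized update; below is a sketch of the loop body in that form, assuming the same variables (X, y, theta, alpha, m, num_iters, J_history) as in gradientDescentMulti above:

for iter = 1:num_iters
    % All theta(j) updated at once: theta := theta - alpha/m * X' * (X*theta - y)
    theta = theta - (alpha / m) * (X' * (X * theta - y));
    J_history(iter) = computeCostMulti(X, y, theta);
end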

Feature Scaling

Idea: Make sure features are on a similar scale.

E.g.: $x_1$ = size (0-200 feet), $x_2$ = number of bedrooms (1-5)

In this situation the contour plot is a long, thin ellipse, as shown in the figure:

Without scaling, gradient descent on such contours is very slow. If we adjust $x_1$ and $x_2$ as follows:

$x_1$ = size/200 and $x_2$ = (number of bedrooms)/5, i.e. dividing each feature by its maximum value, then the contour plot becomes close to circular and gradient descent converges much faster. In general, to speed up convergence, every feature value $x_i$ is brought into a similar interval, e.g. $0 \le x_1 \le 3$, $-2 \le x_2 \le 0.5$, and so on.

Mean normalization

Replace $x_i$ with $x_i - \mu_i$ to make features have approximately zero mean. In effect, this normalizes each feature.

For example, $x_1 = (\text{size} - 1000)/2000$ and $x_2 = (\#\text{bedrooms} - 2)/5$.

Then $-0.5 \le x_1 \le 0.5$ and $-0.5 \le x_2 \le 0.5$.

  • $\mu_i$ is the average of all the values of feature $i$ over the training set
  • $s_i$ is the range of feature $i$, i.e. (max - min)

Note that dividing by the range, or dividing by the standard deviation, give different results. The quizzes in this course use range - the programming exercises use standard deviation.

For example, if xi represents housing prices with a range of 100 to 2000 and a mean value of 1000, then

$x_i := \frac{\text{price} - 1000}{1900}$
  • $\mu$ is the mean of the feature values (1000 in this example)
  • $s$ = max - min (1900 in this example)
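A minimal Octave sketch of mean normalization, assuming X stores one training example per row; the function name featureNormalize is illustrative, and the denominator here is the range (max - min) rather than the standard deviation used in the programming exercises:

function [X_norm, mu, s] = featureNormalize(X)
  % Subtract each feature's mean and divide by its range (max - min)
  mu = mean(X);              % 1 x n row vector of per-feature means
  s  = max(X) - min(X);      % 1 x n row vector of per-feature ranges
  X_norm = (X - mu) ./ s;    % broadcast across rows
end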

Learning Rate

$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)$

This section discusses how to choose α.

To find the minimum of $J(\theta)$, gradient descent iterates repeatedly; in theory, $J(\theta)$ should decrease steadily as the number of iterations grows, as shown in the figure:


But if α is chosen too large, the update may overshoot the minimum, so that $J(\theta)$ increases or fluctuates up and down as the number of iterations grows:


In both cases the remedy is to choose a smaller α.

  • Summary:
    • If α is too small: slow convergence.
    • If α is too large: $J(\theta)$ may not decrease on every iteration; may not converge.
    • To choose α, try: …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
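One practical way to pick α is to run a few iterations of gradient descent for several candidate values and plot J_history against the iteration count; a sketch, assuming X (with its column of ones), y, and gradientDescentMulti from above are in scope:

% Compare how fast J(theta) decreases for several learning rates
alphas = [0.001 0.01 0.1 1];
num_iters = 50;
for k = 1:length(alphas)
    theta0 = zeros(size(X, 2), 1);
    [theta, J_history] = gradientDescentMulti(X, y, theta0, alphas(k), num_iters);
    plot(1:num_iters, J_history); hold on;   % J should decrease on every iteration
end
xlabel('iteration'); ylabel('J(\theta)'); legend('0.001', '0.01', '0.1', '1');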

Polynomial regression

For the linear regression function:

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_n x_n$

each $x_i$ stands for one kind of feature. In some cases building the target function directly from these features is inconvenient, so we can consider redefining the feature values.

We can improve our features and the form of our hypothesis function in a couple different ways. We can combine multiple features into one. For example, we can combine x1 and x2 into a new feature x3 by taking x1⋅x2.

For example, we can combine two features into one, $x_3 = x_1 \cdot x_2$, and use $x_3$ as the feature for linear regression.

Alternatively, if there is only one feature and a linear function does not describe the data set well, we can consider a polynomial, e.g. a quadratic or cubic function:

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 \quad\text{or}\quad h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3$

We can set $x_2 = x_1^2$ and $x_3 = x_1^3$. One issue with this choice is that feature scaling becomes important: if the range of $x_1$ is [1, 1000], then the range of $x_2$ becomes [1, 1,000,000], and so on.
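A sketch of building polynomial features in Octave from a single original feature (here assumed to sit in column 2 of X, after the ones column); once X_poly is built, the multivariate machinery above applies unchanged, and feature scaling should be applied to the new columns for the reason just mentioned:

x1 = X(:, 2);                                     % the single original feature
X_poly = [ones(length(x1), 1), x1, x1.^2, x1.^3]; % columns: x_0, x_1, x_1^2, x_1^3
% Apply mean normalization / feature scaling to X_poly(:, 2:end) before gradient descent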

Normal Equation

For the cost function:

$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

The approach described earlier for minimizing $J(\theta)$ was gradient descent, which iteratively follows the partial derivatives with respect to each $\theta_j$. At the minimum, every one of these partial derivatives is zero:

$\frac{\partial}{\partial\theta_j}J(\theta) = 0 \quad \text{(for every } j\text{)}$

Besides gradient descent, there is another method called the Normal Equation, which requires no iteration and computes θ directly.

Suppose we have m samples and the feature vector has dimension n, so the samples are $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})\}$, where each $x^{(i)} = \{x_1^{(i)}, x_2^{(i)}, \dots, x_n^{(i)}\}$. With the linear regression function $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_n x_n$, we have:

$X = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} & \dots & x_n^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} & \dots & x_n^{(2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)} & \dots & x_n^{(m)} \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}, \quad Y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}$

where:

  • X is an $m \times (n+1)$ matrix
  • θ is an $(n+1) \times 1$ matrix (column vector)
  • Y is an $m \times 1$ matrix (column vector)

Let's look at an example:

If we want $h_\theta(x^{(i)}) = y^{(i)}$ for every example, i.e. $X\cdot\theta = Y$, recall the properties of the identity matrix and of the matrix inverse:

  • Identity matrix $E$: $AE = EA = A$
  • Matrix inverse $A^{-1}$: $A$ must be square, and $AA^{-1} = A^{-1}A = E$

Now look again at $X\cdot\theta = Y$. To solve for θ we need a few transformations:

  1. First make the matrix to the left of θ square by multiplying both sides by $X^T$: $X^TX\cdot\theta = X^TY$.

  2. Turn the part to the left of θ into an identity matrix, so that only θ remains: $(X^TX)^{-1}X^TX\cdot\theta = (X^TX)^{-1}X^TY$.

  3. Since $(X^TX)^{-1}X^TX = E$, this simplifies to $\theta = (X^TX)^{-1}X^TY$, which is the normal equation.

In Octave this is computed as: pinv(X'*X)*X'*Y
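A minimal end-to-end sketch using the housing table above (values as given; the extra column of ones supplies $x_0 = 1$):

% Design matrix with an intercept column of ones; y is the price column
X = [1 2104 5 2;
     1 1035 4 1;
     1  868 3 2;
     1  642 2 1];
y = [460; 224; 230; 126];
theta = pinv(X' * X) * X' * y    % normal equation: theta = (X^T X)^(-1) X^T y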

Which cases call for the normal equation, and which for gradient descent?

| Gradient Descent | Normal Equation |
| --- | --- |
| Need to choose alpha | No need to choose alpha |
| Needs many iterations | No need to iterate |
| $O(kn^2)$ | $O(n^3)$, need to calculate the inverse of $X^TX$ |
| Works well when n is large | Slow if n is very large |

When the number of features n is large (here, roughly n ≥ 1000), use gradient descent; below that, the normal equation is more convenient. When n is too large, computing the inverse of $X^TX$ becomes very slow.

When implementing the normal equation in Octave we want to use the pinv function rather than inv. The pinv function will give you a value of θ even if $X^TX$ is not invertible.

If $X^TX$ is noninvertible, the common causes might be:

  • Redundant features, where two features are very closely related (i.e. they are linearly dependent)
  • Too many features (e.g. m ≤ n). In this case, delete some features or use “regularization” (to be explained in a later lesson).

Solutions to the above problems include deleting a feature that is linearly dependent with another or deleting one or more features when there are too many features.