# Machine Learning (Andrew Ng)

## 1. Introduction

### 1.1 Welcome

Examples of machine learning applications:

• Handwriting recognition
• Spam classification
• Search engines
• Image processing

Why machine learning is needed:

• Data mining
  • Web click-stream data analysis
  • Workloads too large for humans to handle
• Applications that cannot be programmed by hand
  • Handwriting recognition
  • Computer vision
• Self-customizing programs
  • Recommender systems
• Understanding human learning (studying the brain)
• ……

### 1.2 What is Machine Learning?

1. Definitions of machine learning. There are two main ones:
• Arthur Samuel (1959). Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.

This definition is informal but was proposed earliest, and it came from a checkers novice who knew how to program. He wrote a program that was not explicitly programmed with every move; instead, the computer played against itself, continually evaluating board positions to judge which situations were more likely to lead to a win, accumulating experience as if it were learning. In the end, the program became a better checkers player than its author.

• Tom Mitchell (1998) Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Tom Mitchell's definition is more modern and formal. In the spam-filtering example, the email system keeps learning from the user's labels on emails (spam / not spam) and thereby improves its spam-filtering accuracy. The three letters in the definition stand for:

• T (Task): filtering spam emails.
• P (Performance): the accuracy with which the email system filters spam.
• E (Experience): the user's labeling of emails.

2. Different kinds of machine learning algorithms

1. Supervised learning

2. Unsupervised learning

Other related categories:

• Semi-supervised learning: sits between supervised and unsupervised learning.
• Recommendation algorithms: yes, the ones shopping sites use to recommend the same kind of item right after you buy one.
• Reinforcement learning: learns what actions to take from observation; each action affects the environment, and the environment's feedback in turn guides the learning algorithm.
• Transfer learning

### 1.3 Supervised Learning

1. Regression

A regression problem predicts a continuous-valued output.

In the housing-price example, a set of house-area data is given, and the goal is to predict the price of a house of any area. Similarly, given a photo–age dataset, predict the age of the person in a given photo.

The fitted curve can be chosen freely; it need not be a straight line and can be any kind of curve.

2. Classification

Q:You’re running a company, and you want to develop learning algorithms to address each of two problems. Problem 1:You have a large inventory of identical items. You want to predict how many of these items will sell over the next 3 months.

Problem 2: You’d like software to examine individual customer accounts, and for each account decide if it has been hacked/compromised. Should you treat these as classification or as regression problems?

A:Treat problem 1 as a regression problem, problem 2 as a classification problem.

Explanation: the first problem is a continuous prediction, since sales volume is a continuous quantity; whether an account has been hacked is a discrete yes/no question.

### 1.4 Unsupervised Learning

1. Clustering

• News aggregation
• Clustering individuals by DNA
• Astronomical data analysis
• Market segmentation
• Social network analysis

2. Non-clustering

• The cocktail party problem

// solving a system of linear equations

Of the following examples, which would you address using an unsupervised learning algorithm? (Check all that apply.)

Given email labeled as spam/not spam, learn a spam filter.

Given a set of news articles found on the web, group them into sets of articles about the same stories.

Given a database of customer data, automatically discover market segments and group customers into different market segments.

Given a dataset of patients diagnosed as either having diabetes or not, learn to classify new patients as having diabetes or not.

Answer: 2, 3

Quiz:

Quiz Question 1 A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E. Suppose we feed a learning algorithm a lot of historical weather data, and have it learn to predict weather. In this setting, what is E?

Answer The process of the algorithm examining a large amount of historical weather data.

Explanation T := The weather prediction task. P := The probability of it correctly predicting a future date’s weather. E := The process of the algorithm examining a large amount of historical weather data.

Question 2 Suppose you are working on weather prediction, and you would like to predict whether or not it will be raining at 5pm tomorrow. You want to use a learning algorithm for this. Would you treat this as a classification or a regression problem?

Explanation Classification is appropriate when we are trying to predict one of a small number of discrete-valued outputs, such as whether it will rain (which we might designate as class 0), or not (say class 1).

Question 3 Suppose you are working on stock market prediction, and you would like to predict the price of a particular stock tomorrow (measured in dollars). You want to use a learning algorithm for this. Would you treat this as a classification or a regression problem?

Explanation Regression is appropriate when we are trying to predict a continuous-valued output, such as the price of a stock (similar to the housing prices example in the lectures).

Question 4 Some of the problems below are best addressed using a supervised learning algorithm, and the others with an unsupervised learning algorithm. Which of the following would you apply supervised learning to? (Select all that apply.) In each case, assume some appropriate dataset is available for your algorithm to learn from.

Explanation Take a collection of 1000 essays written on the US Economy, and find a way to automatically group these essays into a small number of groups of essays that are somehow “similar” or “related”. := This is an unsupervised learning/clustering problem (similar to the Google News example in the lectures).

Given a large dataset of medical records from patients suffering from heart disease, try to learn whether there might be different clusters of such patients for which we might tailor separate treatments. := This can be addressed using an unsupervised learning, clustering, algorithm, in which we group patients into different clusters.

Given genetic (DNA) data from a person, predict the odds of him/her developing diabetes over the next 10 years. := This can be addressed as a supervised learning, classification, problem, where we can learn from a labeled dataset comprising different people’s genetic data, and labels telling us if they had developed diabetes.

Given 50 articles written by male authors, and 50 articles written by female authors, learn to predict the gender of a new manuscript’s author (when the identity of this author is unknown). := This can be addressed as a supervised learning, classification, problem, where we learn from the labeled data to predict gender.

In farming, given data on crop yields over the last 50 years, learn to predict next year’s crop yields. := This can be addressed as a supervised learning problem, where we learn from historical data (labeled with historical crop yields) to predict future crop yields.

Examine a large collection of emails that are known to be spam email, to discover if there are sub-types of spam mail. := This can be addressed using a clustering (unsupervised learning) algorithm, to cluster spam mail into sub-types.

Examine a web page, and classify whether the content on the web page should be considered “child friendly” (e.g., non-pornographic, etc.) or “adult.” := This can be addressed as a supervised learning, classification, problem, where we can learn from a dataset of web pages that have been labeled as “child friendly” or “adult.”

Examine the statistics of two football teams, and predicting which team will win tomorrow’s match (given historical data of teams’ wins/losses to learn from). := This can be addressed using supervised learning, in which we learn from historical records to make win/loss predictions.

Question 5 Which of these is a reasonable definition of machine learning?

Answer Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.

Explanation This was the definition given by Arthur Samuel (who had written the famous checkers playing, learning program).

## 2. Linear Regression with One Variable

### 2.1 Model Representation

1. Housing-price training set

m = number of training examples

x = input variable / feature

y = output variable / target

(x, y): a single training example. A superscript index, as in $(x^{(i)}, y^{(i)})$, denotes the i-th training example; $X$ and $Y$ are subsets of $\mathbb{R}$.

2. The problem-solving model

x: the feature / input variable; y: the output variable.

## 2.2 Cost Function

m: the total number of examples in the training set

y: the target / output variable

(x, y): an example in the training set

$(x^{(i)},y^{(i)})$: the i-th example in the training set

$\hat{y}$: the predicted value of y

The squared-error cost function is

$$J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$

The 1/2 coefficient does not change where the minimum lies; it is included to simplify the gradient-descent derivation, since differentiating the square produces a 2 that cancels it.
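As an illustration, the squared-error cost above can be sketched in Python/NumPy (the course itself uses Octave; the function name here is my own):

```python
import numpy as np

def compute_cost(X, y, theta):
    """Squared-error cost J(theta) = 1/(2m) * sum((X @ theta - y)^2).

    X: (m, n+1) design matrix whose first column is all ones.
    y: (m,) target vector.  theta: (n+1,) parameter vector.
    """
    m = len(y)
    residuals = X @ theta - y           # h_theta(x^(i)) - y^(i) for every i
    return (residuals @ residuals) / (2 * m)

# Tiny example: two points lying exactly on the line y = 2x.
X = np.array([[1.0, 1.0], [1.0, 2.0]])  # bias column + one feature
y = np.array([2.0, 4.0])
J = compute_cost(X, y, np.array([0.0, 2.0]))  # a perfect fit gives J = 0
```

A non-zero parameter choice, e.g. `theta = [0, 0]`, yields a positive cost, which gradient descent then drives down.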

## 2.3 Cost Function – Intuition I

• Hypothesis: $h_\theta(x)=\theta_0+\theta_1x$

• Parameters: $\theta_0,\ \theta_1$

• Cost function: $J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$

• Goal: $\min\limits_{\theta_0,\theta_1} J(\theta_0,\theta_1)$

## 2.4 Cost Function – Intuition II

Example: the point $\theta_0=360$, $\theta_1=0$ on the contour plot of $J$.

## 2.5 Gradient Descent

$:=$ : the assignment operator

$\theta_j$: the j-th parameter

$\alpha$: the learning rate

$J$ is a function; each $\theta_j$ is a value we are solving for.

• When the learning rate is too small:

Convergence is too slow and more iterations are needed.

• When the learning rate is too large:

Gradient descent may overshoot the minimum, and may even fail to converge.
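The effect of the learning rate can be seen on the one-dimensional cost $J(\theta)=\theta^2$, whose gradient is $2\theta$ (a sketch of my own, not course code):

```python
def gradient_descent_1d(alpha, theta=1.0, steps=50):
    """Minimize J(theta) = theta^2 starting from theta = 1.

    Each step applies theta := theta - alpha * J'(theta), with J'(theta) = 2*theta,
    so theta is multiplied by (1 - 2*alpha) every iteration.
    """
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

small = gradient_descent_1d(alpha=0.01)  # creeps toward 0; still far after 50 steps
good  = gradient_descent_1d(alpha=0.3)   # converges quickly
big   = gradient_descent_1d(alpha=1.1)   # |theta| grows every step: divergence
```

With `alpha = 1.1` the multiplier is $1-2.2=-1.2$, so each step overshoots the minimum and lands further away, exactly the divergent behavior described above.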

## 2.7 Gradient Descent For Linear Regression

Derivation of the cost-function derivatives used in linear regression, for $j=0$ and $j=1$:

$$\frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)=\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)$$

$$\frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)=\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)}$$

Quiz:

1.Let f be some function so that

f(θ0,θ1) outputs a number. For this problem,

f is some arbitrary/unknown smooth function (not necessarily the

cost function of linear regression, so f may have local optima).

Suppose we use gradient descent to try to minimize f(θ0,θ1) as a function of θ0 and θ1. Which of the

following statements are true? (Check all that apply.)

2. Suppose that for some linear regression problem (say, predicting housing prices as in the lecture), we have some training set, and for our training set we managed to find some θ0, θ1 such that J(θ0,θ1)=0.

Which of the statements below must then be true? (Check all that apply.)


# 4. Linear Regression with Multiple Variables

## 4.1 Multiple Features

n: the total number of features

$x^{(i)}$: the i-th row of the sample matrix, i.e. the i-th training example.

$x^{(i)}_j$: the j-th column of the i-th row of the sample matrix, i.e. the j-th feature of the i-th training example.

$\theta^T$: the transpose of the parameter vector

$\theta$: the parameter vector, the set of values we solve for

x: the feature vector of one example, an (n+1)-dimensional vector

$x_0$: set to 1 by convention, for computational convenience

## 4.2 Gradient Descent for Multiple Variables

With $h_{\theta}(x)=\theta^Tx$, the simultaneous parameter update has a vectorized implementation:

$$\theta:=\theta-\frac{\alpha}{m}X^T(X\theta-y)$$

$X$: the training-set data, an $m\times(n+1)$ matrix (including the constant feature $x_0=1$); each row is one training example's feature vector.
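The vectorized update can be sketched in Python/NumPy (illustrative names; the course uses Octave):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=500):
    """Vectorized gradient descent: theta := theta - (alpha/m) * X^T (X theta - y).

    X is an m x (n+1) design matrix with a leading column of ones (x_0 = 1),
    so every component of theta is updated simultaneously.
    """
    m, n1 = X.shape
    theta = np.zeros(n1)
    for _ in range(iters):
        theta -= (alpha / m) * (X.T @ (X @ theta - y))
    return theta

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # bias column + one feature
y = np.array([1.0, 3.0, 5.0])                        # exactly y = 1 + 2x
theta = gradient_descent(X, y)                       # converges near [1, 2]
```

Because the data lie exactly on a line, the recovered parameters approach the true intercept 1 and slope 2.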

## 4.4 Gradient Descent in Practice II – Learning Rate

• Judging convergence by plotting against iterations

  • The number of iterations needed cannot be known in advance
  • It is easy to plot the cost as a function of the iteration count
  • From the plot it is easy to estimate how many iterations are needed

• Automatic convergence test (comparing against a threshold)

  • The threshold is hard to choose
  • When the cost curve is nearly flat, convergence cannot be determined

## 4.6 Normal Equation

$$\theta=(X^TX)^{-1}X^Ty,\qquad X\in\mathbb{R}^{m\times(n+1)},\ y\in\mathbb{R}^{m}$$

$X^{-1}$: the inverse of matrix $X$. In Octave, the inv function computes a matrix inverse; the pinv function (pseudo-inverse) is similar.

X': in Octave, the transpose of matrix X, i.e. $X^T$
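The closed-form solution translates directly to Python/NumPy, where `np.linalg.pinv` plays the role of Octave's `pinv` (a sketch, not course code):

```python
import numpy as np

def normal_equation(X, y):
    """theta = pinv(X^T X) X^T y -- the closed-form least-squares solution.

    Using the pseudo-inverse keeps this well-defined even when X^T X is singular.
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # bias column + one feature
y = np.array([1.0, 3.0, 5.0])                        # exactly y = 1 + 2x
theta = normal_equation(X, y)                        # recovers [1, 2]
```

Unlike gradient descent, no learning rate or iteration count is needed, but the $(X^TX)^{-1}$ computation becomes expensive when the number of features is very large.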

## 4.7 Normal Equation Noninvertibility

(This part is optional.)

$X^TX$ may be non-invertible when:

• Features are linearly dependent

For example, including both a size measured in inches and a size measured in meters: the two features are linearly dependent.

• The number of features is greater than the number of training examples.

Remedies:

• Remove redundant/duplicate features
• Add more training examples
• Use regularization (covered later)

Quiz

You’d like to use polynomial regression to predict a student’s final exam score from their midterm exam score. Concretely, suppose you want to fit a model of the form $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$, where $x_1$ is the midterm score and $x_2$ is $(\text{midterm score})^2$. Further, you plan to use both feature scaling (dividing by the “max-min”, or range, of a feature) and mean normalization.

What is the normalized feature $x_1^{(3)}$? (Hint: midterm = 69, final = 78 is training example 4.) Please round off your answer to two decimal places and enter in the text box below.

A：Rather than use the current value of α, it’d be more promising to try a larger value of α (say α=1.0). B：α=0.3 is an effective choice of learning rate. C：Rather than use the current value of α, it’d be more promising to try a smaller value of α (say α=0.1).

A：X is 14×3, y is 14×1, θ is 3×3 B：X is 14×4, y is 14×4, θ is 4×4 C：X is 14×3, y is 14×1, θ is 3×1 D：X is 14×4, y is 14×1, θ is 4×1

A：Gradient descent, since it will always converge to the optimal θ. B：The normal equation, since gradient descent might be unable to find the optimal θ. C：The normal equation, since it provides an efficient way to directly find the solution. D：Gradient descent, since (X^T X)^{-1} will be very slow to compute in the normal equation.

A：It prevents the matrix $X^TX$ (used in the normal equation) from being non-invertible (singular/degenerate). B：It speeds up gradient descent by making it require fewer iterations to get to a good solution. C：It is necessary to prevent gradient descent from getting stuck in local optima. D：It speeds up solving for θ using the normal equation.

## 5. Octave Tutorial

### 5.1 Basic Operations

1. Basic arithmetic can be evaluated directly.
2. `~=` is "not equal".
3. `PS1('>> ');` changes the command prompt.
4. Variables can be declared and then accessed.
5. `disp()`: prints output; `sprintf` can be embedded for C-style formatting: `disp(sprintf('...', args))`.
6. Matrices: `A = [1 2; 3 4; 5 6]` builds a matrix; vectors work the same way.
7. `A = start:step:end` generates a vector: first number is the start value, second the step, third the end value.
8. `ones(1,3)` builds an all-ones matrix.
9. `zeros(1,3)` builds an all-zeros matrix.
10. `rand(1,3)` builds a matrix of uniform random numbers.
11. `randn(1,3)` builds a matrix of random numbers drawn from a distribution with variance 1.
12. `hist(v, nbins)` plots a histogram with the given number of bins.
13. `eye(n)` builds an identity matrix.

### 5.2 Moving Data Around

1. `size(A)`: returns the matrix dimensions; `size(A, d)` returns the number of elements along dimension d.
2. `length(A)`: returns the length of the largest dimension.
3. `cd`: change directory.
4. `load`: load a data file.
5. `who` (or `whos`): list the variables in memory.
6. Indexing: `A(i,j)` reads element (i, j); slices such as `A(2,:)`, `A(:,2)`, or `A([1 3],:)` can be read from or assigned to.
7. Saving: `save filename variable -ascii` stores a variable's data in a file.
8. Concatenation: `[A B]` joins horizontally, `[A; B]` vertically; `A = [A, newcolumn]` appends a new column.
9. `A(:)`: flattens A into a single column vector.

### 5.3 Computing on Data

1. `A*B`: matrix multiplication.
2. `A.*B`: element-wise product; `A./B`, `1./A`, `exp(v)`, `log(v)` are likewise element-wise.
3. `A'`: transpose.
4. `max()`: maximum; can also compare two matrices element-wise.
5. `find()`: returns the (row, column) indices of elements satisfying a condition.
6. `sum()`: sum; with a dimension argument, sums along that dimension.
7. `prod()`: product.
8. `floor()`: round down.
9. `ceil()`: round up.
10. `max(A,[],d)`: the maximum of each column (or row) along dimension d.
11. `flipud(A)`: flips a matrix upside down.
12. `pinv(A)`: the (pseudo-)inverse of a matrix.

### 5.4 Plotting Data

1. `plot(x,y)`: plots a curve (x and y are vectors of the same length); extra arguments set the color.
2. `xlabel('...')`: labels the x-axis.
3. `print`: exports the figure to an image file.
4. `figure(n)`: numbers a figure window.
5. `subplot(1,2,1)`: splits the figure into a grid and selects which cell to draw in.
6. `imagesc(A)`, `colorbar`, `colormap gray`: visualize a matrix as an image.

### 5.5 Control Statements

1. Ranges: `a:b` is the sequence of consecutive numbers from a to b; loop with `for i = indices ... end`.
2. `while` loops: `while condition ... end`.
3. `if condition ... end`.
4. Functions: defined in their own `.m` files and called by file name.

### 5.6 Linear Algebra

1. In Octave & Matlab, vector indices start at 1.

$$\sum^n_{j=0}\theta_jx_j=\theta^Tx$$
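The identity above is exactly why vectorization pays off: a single inner product replaces the explicit loop. A Python/NumPy sketch of the equivalence (values chosen arbitrarily):

```python
import numpy as np

theta = np.array([1.0, 2.0, 3.0])
x = np.array([1.0, 0.5, -1.0])   # x_0 = 1 by convention

# Explicit loop: sum_j theta_j * x_j
loop_sum = sum(theta[j] * x[j] for j in range(len(x)))

# Vectorized: theta^T x as one inner product
vectorized = theta @ x
```

Both evaluate to the same number; the vectorized form is shorter and runs in optimized native code instead of an interpreted loop.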

# 6. Logistic Regression

## 6.1 Classification

• Spam detection
• Financial fraud detection
• Tumor diagnosis

When $h_\theta(x)\geq0.5$, predict $y=1$, the positive class;

when $h_\theta(x)<0.5$, predict $y=0$, the negative class.

## 6.2 Hypothesis Representation

$h_\theta(x)=g(z)=g(\theta^Tx)$

The sigmoid function is a special case of the logistic function, given by $g(z)=\frac{1}{1+e^{-z}}$.
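A minimal Python sketch of the sigmoid and the resulting 0.5-threshold classifier (my own helper names, not course code):

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)); maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    """h_theta(x) = g(theta^T x); predict class 1 when h >= 0.5.

    Since g(z) >= 0.5 exactly when z >= 0, this is equivalent to
    checking the sign of theta^T x.
    """
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if sigmoid(z) >= 0.5 else 0
```

For example, with `theta = [0, 1]` and `x = [1, 2]` (so $\theta^Tx = 2 > 0$) the prediction is the positive class.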

## 6.3 Decision Boundary

$h_{\theta}(x)=g(\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_1^2+\theta_4x_2^2)$

## 6.5 Simplified Cost Function and Gradient Descent

Advanced optimization methods beyond gradient descent include:

• Newton's method & quasi-Newton methods

  • The DFP algorithm
  • BFGS
  • L-BFGS (limited-memory BFGS)
  • The method of Lagrange multipliers

Octave/Matlab wraps these advanced algorithms, making them easy to call.

## 6.7 Multiclass Classification: One-vs-all

$h_\theta^{(i)}(x)$: the probability that the output is $y=i$ (i.e. that the example belongs to class $i$)

$k$: the total number of classes; in the lecture's example, $k=3$

# 7. Regularization

## 7.1 The Problem of Overfitting

• Underfitting

The model cannot fit the training data well, and the error between predictions and actual values is large. This tends to happen when the model is too simple (too few features). It is like not paying attention in class: you understand nothing during the lesson and still understand nothing afterwards.

• Just right

The model gives reasonably correct results both on the training data and on data outside the training set.

• Overfitting

The model fits the training data very well or even perfectly, i.e. $J(\theta)\approx0$, but for new data outside the training set the prediction error is large; the model generalizes poorly. This tends to happen when the model is too complex (too many features). It is like following the teacher through every example in class, but being stumped by any new problem afterwards.

• Bias

The degree to which the model's predictions deviate from the true values. The larger the bias, the further the predictions stray from the truth. Low bias means the model reflects the training data well.

• Variance

The spread, or range of variation, of the model's predictions. The larger the variance, the more scattered the predictions, the more the fitted function fluctuates, and the worse the generalization. Low variance means the fitted curve is stable with little fluctuation.

Ways to address overfitting:

• Reduce the number of features

  • Manually select which features to keep
  • Use a model-selection algorithm to pick suitable features (e.g. PCA)
  • Reducing features risks discarding useful information

• Regularization

  • Keeps all the parameters (many useful features each influence the result slightly)
  • Reduces/penalizes the magnitude of the parameters, lessening each parameter's influence on the model
  • Works well when there are many parameters, each contributing only slightly to the model

## 7.2 Cost Function

$\lambda$: the regularization parameter

$\theta_0$: the bias parameter is not penalized

• If $\lambda$ is too large

  • The model underfits (the hypothesis may become nearly a straight line)
  • Overfitting is not fixed properly
  • Gradient descent may fail to converge

• If $\lambda$ is too small

  • Overfitting is not prevented (as if there were no regularization)

## 7.3 Regularized Linear Regression

In the regularized gradient-descent update, each $\theta_j$ (for $j\geq1$) is first shrunk by the factor $1-\frac{\alpha\lambda}{m}<1$.

If $m\leq n$, $X^TX$ may be non-invertible; with regularization ($\lambda>0$), however, the matrix used in the regularized normal equation is guaranteed to be invertible.
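The shrinkage factor can be seen in a tiny Python sketch of the per-parameter update (illustrative names; the data-gradient term is passed in separately):

```python
def regularized_update(theta_j, grad_j, alpha, lam, m):
    """One regularized gradient-descent step for j >= 1:

        theta_j := theta_j * (1 - alpha*lam/m) - alpha * grad_j

    theta_0 is updated without the (1 - alpha*lam/m) shrinkage factor.
    """
    return theta_j * (1 - alpha * lam / m) - alpha * grad_j

# With a zero data-gradient, the weight simply decays toward 0 each step.
t = 1.0
for _ in range(10):
    t = regularized_update(t, grad_j=0.0, alpha=0.1, lam=1.0, m=10)
```

Each iteration multiplies the weight by $1-\frac{0.1\cdot1}{10}=0.99$, which is the "penalize the magnitude" behavior described in 7.2.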

## 7.4 Regularized Logistic Regression

Quiz

You are training a classification model with logistic regression. Which of the following statements are true? Check all that apply.

• Introducing regularization to the model always results in equal or better performance on the training set.
• Adding many new features to the model helps prevent overfitting on the training set.
• Introducing regularization to the model always results in equal or better performance on examples not in the training set.
• Adding a new feature to the model always results in equal or better performance on the training set.

Answer: 4.

• Option 1: adding regularization does not always help; if λ is chosen too large the model underfits, which hurts performance on both the training set and unseen examples. Incorrect.
• Option 2: more features fit the training set better but make overfitting *more* likely, not less. Incorrect.
• Option 3: same reasoning as option 1. Incorrect.
• Option 4: a new feature improves (or at least preserves) the fit on the training set, though not necessarily on unseen examples. Correct.

Which of the following statements about regularization are true? Check all that apply.

• Using too large a value of λ can cause your hypothesis to overfit the data; this can be avoided by reducing λ.
• Using a very large value of λ cannot hurt the performance of your hypothesis; the only reason we do not set λ to be too large is to avoid numerical problems.
• Consider a classification problem. Adding regularization may cause your classifier to incorrectly classify some training examples (which it had correctly classified when not using regularization, i.e. when λ=0).
• Because logistic regression outputs values 0≤hθ(x)≤1, its range of output values can only be “shrunk” slightly by regularization anyway, so regularization is generally not helpful for it.

Answer: 3.

The regularized cost function is

$$J(\theta)=\frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^{n}\theta_j^2\right]$$

• Option 1: too large a λ causes underfitting, not overfitting; when λ is very large, $\theta_1,\ldots,\theta_n\approx0$, only $\theta_0$ remains, and the fit is nearly a straight line. Too *small* a λ is what causes overfitting. Incorrect.
• Option 2: same reasoning as option 1. Incorrect.
• Option 3: with a poorly chosen λ, training results can be worse than without regularization. Correct.
• Option 4: what regularization "shrinks" is θ; its purpose is to combat overfitting. Incorrect.

# 8. Neural Networks: Representation

## 8.2 Neurons and the Brain

The BrainPort system: helps blind people "see" through a camera and a sensor on the tongue.

## 8.3 Model Representation I

• $x_0$: the bias unit, with $x_0=1$
• $\Theta$: the weights, i.e. the parameters.
• The activation function $g$, e.g. the logistic function.
• Input layer: corresponds to the features $x$ in the training set
• Output layer: corresponds to the targets $y$ in the training set

$a_i^{(j)}$: the i-th activation unit in layer $j$

$\Theta^{(j)}$: the weight matrix mapping from layer $j$ to layer $j+1$.

$\Theta^{(j)}_{v,u}$: the weight mapping from unit $u$ in layer $j$ to unit $v$ in layer $j+1$

$s_j$: the number of activation units in layer $j$ (not counting the bias unit)

• There is a lot of notation; review it whenever you forget!
• Each unit feeds into every unit in the next layer (a matrix multiplication).
• If layer $j$ has $s_j$ units and layer $j+1$ has $s_{j+1}$ units, then $\Theta^{(j)}$ is an $s_{j+1}\times(s_j+1)$ weight matrix; the size of the weight matrix differs from layer to layer.
• The $+1$ comes from the bias unit: the output layer has no bias unit, while the input layer and hidden layers each gain one.
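The shape rule $s_{j+1}\times(s_j+1)$ and the role of the bias unit can be checked with a small Python/NumPy sketch (the helper names are my own):

```python
import numpy as np

def layer_weights(s_in, s_out, seed=0):
    """Weight matrix Theta mapping a layer of s_in units to one of s_out units.

    The extra +1 column multiplies the bias unit prepended to the input layer,
    so the shape is (s_out, s_in + 1).
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((s_out, s_in + 1))

def forward(theta, a):
    """One layer of forward propagation: prepend the bias, apply weights, squash."""
    a = np.concatenate(([1.0], a))             # add the bias unit a_0 = 1
    return 1.0 / (1.0 + np.exp(-(theta @ a)))  # sigmoid activation

Theta1 = layer_weights(3, 5)  # 3 input features -> 5 hidden units: shape (5, 4)
a2 = forward(Theta1, np.array([0.5, -1.0, 2.0]))
```

The hidden activations come out as a length-5 vector with every entry strictly between 0 and 1, as the sigmoid guarantees.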

## 8.4 Model Representation II

$m$: the number of training examples

$s_2$: the number of activation units in the second layer of the network

# 9. Neural Networks: Learning

## 9.1 Cost Function

• Binary classification (0/1)

There is only one output unit ($K=1$).

• Multiclass classification

There is more than one output unit ($K>1$).

$L$: the total number of layers in the network

$s_l$: the number of activation units in layer $l$ (not counting the bias unit)

$h_{\Theta}(x)_k$: the probability of the $k^{th}$ class, $P(y=k\mid x,\Theta)$

$K$: the number of units in the output layer, i.e. the number of classes

$y^{(i)}_k$: the $k$-th component of the $i$-th training example's label

$y$: a $K$-dimensional vector

## 9.2 Backpropagation Algorithm

1. For the given training set, initialize the per-layer error-accumulator matrices $\Delta$, i.e. set every $\Delta_{i,j}^{(l)}=0$ so that each $\Delta^{(l)}$ is an all-zeros matrix.

2. Then iterate over all training examples. For each example:

   1. Run forward propagation to obtain the prediction $a^{(L)}=h_{\Theta}(x)$.

   2. Run backpropagation, computing each layer's prediction error starting from the output layer, and use it to obtain the partial derivatives.

The output layer's error is the difference between the prediction and the training label:

$$\delta^{(L)}=a^{(L)}-y$$

The error of each hidden layer is computed from the error of the layer after it:

$$\delta^{(l)}=\left(\Theta^{(l)}\right)^T\delta^{(l+1)}\,.\!\ast\,g'(z^{(l)}),\qquad g'(z^{(l)})=a^{(l)}.\!\ast\,(1-a^{(l)})$$
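A minimal Python/NumPy sketch of these error computations for a three-layer network (names and shapes are illustrative, not the course's Octave code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_deltas(Theta1, Theta2, x, y):
    """Layer errors for a 3-layer network (input, one hidden layer, output).

    delta3 = a3 - y                                      (output-layer error)
    delta2 = ((Theta2^T delta3) .* a2 .* (1 - a2))[1:]   (hidden-layer error,
             dropping the bias component)
    """
    a1 = np.concatenate(([1.0], x))                   # input activations + bias
    a2 = np.concatenate(([1.0], sigmoid(Theta1 @ a1)))
    a3 = sigmoid(Theta2 @ a2)                         # network output h_Theta(x)
    delta3 = a3 - y
    delta2 = ((Theta2.T @ delta3) * a2 * (1.0 - a2))[1:]
    return delta2, delta3
```

The returned deltas are exactly what gets accumulated into the $\Delta^{(l)}$ matrices (via outer products with the activations) to form the partial derivatives.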

## 9.4 Implementation Note: Unrolling Parameters

Octave/Matlab code:

reshape(A,m,n): reshapes vector A into an m × n matrix.

## 9.5 Gradient Checking

Octave/Matlab code:

```matlab
epsilon = 1e-4;
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) += epsilon;
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end;
```

## 9.6 Random Initialization

Octave/Matlab code:

```matlab
Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11)  * (2 * INIT_EPSILON) - INIT_EPSILON;
```

rand(m,n): returns an m × n matrix of random numbers uniformly distributed in (0,1).

$\epsilon$: unrelated to the $\epsilon$ used in gradient checking; here it is just an arbitrary real number that sets the range of the initial weight values.

## 9.7 Putting It Together

1. Designing the network (to be expanded later)

• Pick the features and determine the dimension of the feature vector $x$, i.e. the number of input units.
• Identify the classes and determine the dimension of the prediction vector $h_{\Theta}(x)$, i.e. the number of output units.
• Decide how many hidden layers to use and how many hidden units each should have.

By default there is at least one hidden layer; there can be more. More layers generally mean better results but more computation.

2. Training the network

1. Randomly initialize the weight matrices.

2. Run forward propagation to compute the initial predictions.

3. Compute the value of the cost function.

4. Run backpropagation to compute the partial derivatives of the cost function.

5. Use gradient checking to verify the implementation, and remember to disable it afterwards.

6. Hand the cost function to an optimization routine to minimize it.

Because a neural network's cost function is non-convex, optimization is not guaranteed to converge to the global minimum; advanced optimization routines can ensure convergence to some local minimum.

### Question 1

You are training a three layer neural network and would like to use backpropagation to compute the gradient of the cost function. In the backpropagation algorithm, one of the steps is to update $\Delta^{(2)}_{ij}:=\Delta^{(2)}_{ij}+\delta^{(3)}_i\ast(a^{(2)})_j$ for every $i,j$. Which of the following is a correct vectorization of this step?

• $\Delta^{(2)}:=\Delta^{(2)}+(a^{(2)})^T\ast\delta^{(3)}$

• $\Delta^{(2)}:=\Delta^{(2)}+(a^{(3)})^T\ast\delta^{(2)}$

• $\Delta^{(2)}:=\Delta^{(2)}+\delta^{(3)}\ast(a^{(2)})^T$

• $\Delta^{(2)}:=\Delta^{(2)}+\delta^{(3)}\ast(a^{(3)})^T$

Answer: 3.

### Question 2

Suppose Theta1 is a 5x3 matrix, and Theta2 is a 4x6 matrix. You set thetaVec = [Theta1(:); Theta2(:)]. Which of the following correctly recovers Theta2?

• reshape(thetaVec(16:39),4,6)
• reshape(thetaVec(15:38),4,6)
• reshape(thetaVec(16:24),4,6)
• reshape(thetaVec(15:39),4,6)
• reshape(thetaVec(16:39),6,4)

Answer: 1 (Theta1 contributes 15 elements, so Theta2 occupies positions 16 through 16 + 24 − 1 = 39).

### Question 3

Let $J(\theta)=2\theta^3+2$. Let $\theta=1$, and $\epsilon=0.01$. Use the formula $\frac{J(\theta+\epsilon)-J(\theta-\epsilon)}{2\epsilon}$ to numerically compute an approximation to the derivative at $\theta=1$. What value do you get? (When $\theta=1$, the true/exact derivative is $\frac{dJ(\theta)}{d\theta}=6$.)

• 8
• 6
• 5.9998
• 6.0002

Answer: 4.

$$\frac{J(\theta+\epsilon)-J(\theta-\epsilon)}{2\epsilon}=\frac{2(\theta+\epsilon)^3+2-\left(2(\theta-\epsilon)^3+2\right)}{2\epsilon}$$

Substituting $\epsilon=0.01$ and $\theta=1$ and simplifying:

$$\frac{(\theta+\epsilon)^3-(\theta-\epsilon)^3}{\epsilon}=\frac{(1+0.01)^3-(1-0.01)^3}{0.01}=6.0002$$
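The same two-sided difference can be checked numerically in a few lines of Python (a sketch mirroring the gradient-checking formula from 9.5):

```python
def J(theta):
    """The cost from the question: J(theta) = 2*theta^3 + 2."""
    return 2 * theta ** 3 + 2

def numeric_derivative(J, theta, eps=0.01):
    """Two-sided difference: (J(theta+eps) - J(theta-eps)) / (2*eps)."""
    return (J(theta + eps) - J(theta - eps)) / (2 * eps)

approx = numeric_derivative(J, 1.0)   # 6.0002, vs. the exact derivative 6
```

Shrinking `eps` drives the approximation closer to the exact value 6, which is why gradient checking uses a small epsilon such as 1e-4.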

### Question 4

Which of the following statements are true? Check all that apply.

• Using a large value of λ cannot hurt the performance of your neural network; the only reason we do not set λ to be too large is to avoid numerical problems.
• Gradient checking is useful if we are using gradient descent as our optimization algorithm. However, it serves little purpose if we are using one of the advanced optimization methods (such as in fminunc).
• Using gradient checking can help verify if one’s implementation of backpropagation is bug-free.
• If our neural network overfits the training set, one reasonable step to take is to increase the regularization parameter λ.

Answer: 3, 4.

• Option 1: as seen in an earlier quiz, when λ is too large the fit collapses toward a straight line and the model underfits. Incorrect.
• Option 2: gradient checking verifies that the network's internal gradient computations are correct, regardless of which optimization algorithm is used; it applies equally to advanced methods such as fminunc and to plain gradient descent. Incorrect.
• Option 3: gradient checking exists precisely to verify the gradient computations. Correct.
• Option 4: recall why the regularization term was introduced: to combat overfitting. If λ is too small it is as if there were no regularization, the penalty is insufficient, and the model still overfits, so λ should be increased; conversely, if λ is too large the fit becomes a straight line and the model underfits. Neural networks behave the same as logistic regression here. Correct.

### Question 5

Which of the following statements are true? Check all that apply.

• Suppose that the parameter Θ(1) is a square matrix (meaning the number of rows equals the number of columns). If we replace Θ(1) with its transpose (Θ(1))T, then we have not changed the function that the network is computing.
• Suppose we have a correct implementation of backpropagation, and are training a neural network using gradient descent. Suppose we plot J(Θ) as a function of the number of iterations, and find that it is increasing rather than decreasing. One possible cause of this is that the learning rate α is too large.
• Suppose we are using gradient descent with learning rate α. For logistic regression and linear regression, J(Θ) was a convex optimization problem and thus we did not want to choose a learning rate α that is too large. For a neural network however, J(Θ) may not be convex, and thus choosing a very large value of α can only speed up convergence.
• If we are training a neural network using gradient descent, one reasonable “debugging” step to make sure it is working is to plot J(Θ) as a function of the number of iterations, and make sure it is decreasing (or at least non-increasing) after each iteration.

Answer: 2, 4.

• Option 1: consider the network's computation at layer 2: $a^{(2)}=g(z^{(2)})=g(\Theta^{(1)}a^{(1)})$. Replacing $\Theta^{(1)}$ with its transpose gives $g((\Theta^{(1)})^Ta^{(1)})\neq g(\Theta^{(1)}a^{(1)})$ in general, so the computed function changes. Incorrect.
• Option 2: when α is too large, gradient descent can overshoot the minimum, making the cost grow larger and larger. Correct.
• Option 3: even if a neural network's $J(\Theta)$ is not convex, that is no reason to pick a very large α, which can still cause divergence. Incorrect.
• Option 4: printing or plotting the cost at every iteration to check that it decreases is a sound way to verify that training is working. Correct.