Andrew Ng's Deep Learning Course 2 Post-Lecture Quizzes (docx version)

Practical aspects of deep learning

1.If you have 10,000,000 examples, how would you split the train/dev/test set?

□ 33% train, 33% dev, 33% test

□ 60% train, 20% dev, 20% test

□ 98% train, 1% dev, 1% test
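
For scale, here is a minimal numpy sketch of carving a large dataset into train/dev/test index sets; the 98/1/1 proportions below are just one illustrative choice, and all names are hypothetical placeholders.

```python
import numpy as np

m = 10_000_000                      # total number of examples
idx = np.random.permutation(m)      # shuffled example indices

n_train = int(0.98 * m)             # illustrative proportions only
n_dev = int(0.01 * m)

train_idx = idx[:n_train]
dev_idx = idx[n_train:n_train + n_dev]
test_idx = idx[n_train + n_dev:]    # remaining examples form the test set
```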

2.The dev and test set should:

□ Come from the same distribution

□ Come from different distributions

□ Be identical to each other (same (x,y) pairs)

□ Have the same number of examples

3.If your Neural Network model seems to have high bias, which of the following would be promising things to try? (Check all that apply.)

□ Add regularization

□ Get more test data

□ Increase the number of units in each hidden layer

□ Make the Neural Network deeper

□ Get more training data

4.You are working on an automated check-out kiosk for a supermarket, and are building a classifier for apples, bananas and oranges. Suppose your classifier obtains a training set error of 0.5%, and a dev set error of 7%. Which of the following are promising things to try to improve your classifier? (Check all that apply.)

□ Increase the regularization parameter lambda

□ Decrease the regularization parameter lambda

□ Get more training data

□ Use a bigger neural network

5.What is weight decay?

□ A regularization technique (such as L2 regularization) that results in gradient descent shrinking the weights on every iteration.

□ The process of gradually decreasing the learning rate during training.

□ Gradual corruption of the weights in the neural network if it is trained on noisy data.

□ A technique to avoid vanishing gradient by imposing a ceiling on the values of the weights.

6.What happens when you increase the regularization hyperparameter lambda?

□ Weights are pushed toward becoming smaller (closer to 0)

□ Weights are pushed toward becoming bigger (further from 0)

□ Doubling lambda should roughly result in doubling the weights

□ Gradient descent taking bigger steps with each iteration (proportional to lambda)
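
As a reference for questions 5 and 6, a minimal numpy sketch of how L2 regularization ("weight decay") enters the gradient-descent update; W, dW_unreg, alpha, lambd and m below are hypothetical placeholders.

```python
import numpy as np

m, alpha, lambd = 64, 0.1, 0.7       # hypothetical values
W = np.random.randn(5, 4) * 0.01     # a weight matrix
dW_unreg = np.random.randn(5, 4)     # gradient of the unregularized cost w.r.t. W

# L2 regularization adds (lambd / m) * W to the gradient of the cost...
dW = dW_unreg + (lambd / m) * W

# ...so the usual update
W = W - alpha * dW
# is algebraically the same as
#   W = (1 - alpha * lambd / m) * W - alpha * dW_unreg
# i.e. W is multiplied by a factor slightly less than 1 on every iteration,
# and a larger lambd pushes the weights closer to 0.
```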

7.With the inverted dropout technique, at test time:

□ You apply dropout (randomly eliminating units) and do not keep the 1/keep_prob factor in the calculations used in training.

□ You apply dropout (randomly eliminating units) but keep the 1/keep_prob factor in the calculations used in training.

□ You do not apply dropout (do not randomly eliminate units), but keep the 1/keep_prob factor in the calculations used in training.

□ You do not apply dropout (do not randomly eliminate units) and do not keep the 1/keep_prob factor in the calculations used in training.
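
A minimal numpy sketch of inverted dropout for one layer's activations (a3, keep_prob and the shapes are hypothetical placeholders): the 1/keep_prob scaling is applied during training, while at test time neither dropout nor the scaling factor is used.

```python
import numpy as np

keep_prob = 0.8                              # hypothetical keep probability
a3 = np.random.randn(50, 64)                 # hypothetical activations: 50 units, 64 examples

# Training time: drop units and rescale by 1/keep_prob (inverted dropout),
# so the expected value of the activations is unchanged.
d3 = np.random.rand(*a3.shape) < keep_prob   # keep mask
a3_train = (a3 * d3) / keep_prob

# Test time: no dropout and no 1/keep_prob factor; activations are used as-is.
a3_test = a3
```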

8.Increasing the parameter keep_prob from (say) 0.5 to 0.6 will likely cause the following: (Check the two that apply)

□ Increasing the regularization effect

□ Reducing the regularization effect

□ Causing the neural network to end up with a higher training set error

□ Causing the neural network to end up with a lower training set error

9.Which of these techniques are useful for reducing variance (reducing overfitting)? (Check all that apply.)

□ Xavier initialization

□ Data augmentation

□ Gradient Checking

□ Exploding gradient

□ L2 regularization

□ Vanishing gradient

□ Dropout

10.Why do we normalize the inputs x?

□ It makes the parameter initialization faster

□ It makes the cost function faster to optimize

□ Normalization is another word for regularization; it helps to reduce variance

□ It makes it easier to visualize the data
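
A minimal numpy sketch of normalizing the inputs, assuming the course convention of a data matrix X of shape (n_x, m) with one example per column; X below is a hypothetical placeholder.

```python
import numpy as np

X = np.random.randn(3, 100) * 10 + 5         # hypothetical inputs: 3 features, 100 examples

mu = np.mean(X, axis=1, keepdims=True)       # per-feature mean over the training set
sigma2 = np.var(X, axis=1, keepdims=True)    # per-feature variance over the training set
X_norm = (X - mu) / np.sqrt(sigma2 + 1e-8)   # zero mean, roughly unit variance per feature

# The same mu and sigma2 estimated on the training set should also be used
# to normalize the dev and test sets.
```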

Optimization algorithms

1.Which notation would you use to denote the 3rd layer’s activation when the input is the 7th example from the 8th minibatch?

□ a[8]{3}(7)

□ a[8]{7}(3)

□ a[3]{8}(7)

□ a[3]{7}(8)
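
For reference, the lecture notation (as used in this course) stacks three indices on an activation: square brackets for the layer, curly braces for the mini-batch, and parentheses for the training example:

```latex
a^{[l]\{t\}(i)} = \text{activation of layer } l \text{ on example } i \text{ from mini-batch } t
```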

2.Which of these statements about mini-batch gradient descent do you agree with?

□ One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent.

□ You should implement mini-batch gradient descent without an explicit for loop over different mini-batches, so that the algorithm processes all mini-batches at the same time (vectorization).

□ Training one epoch (one pass through the training set) using mini-batch gradient descent is faster than training one epoch using batch gradient descent.

3.Why is the best mini-batch size usually not 1 and not m, but instead something in-between?

□ If the mini-batch size is m, you end up with stochastic gradient descent, which is usually slower than mini-batch gradient descent.

□ If the mini-batch size is m, you end up with batch gradient descent, which has to process the whole training set before making progress.

□ If the mini-batch size is 1, you lose the benefits of vectorization across examples in the mini-batch.

□ If the mini-batch size is 1, you end up having to process the entire training set before making any progress.
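
A minimal numpy sketch of one epoch of mini-batch gradient descent; X, Y, batch_size and the commented forward_backward/update steps are hypothetical placeholders. The point is that an explicit for loop over mini-batches remains, while vectorization is used across the examples inside each mini-batch.

```python
import numpy as np

m, batch_size = 1000, 64
X = np.random.randn(2, m)                  # hypothetical inputs, shape (n_x, m)
Y = (np.random.rand(1, m) > 0.5) * 1.0     # hypothetical labels, shape (1, m)

permutation = np.random.permutation(m)     # shuffle once per epoch
X_shuf, Y_shuf = X[:, permutation], Y[:, permutation]

for t in range(0, m, batch_size):          # explicit loop over mini-batches
    X_batch = X_shuf[:, t:t + batch_size]  # vectorized across the examples in this batch
    Y_batch = Y_shuf[:, t:t + batch_size]
    # grads = forward_backward(parameters, X_batch, Y_batch)  # placeholder step
    # parameters = update(parameters, grads, learning_rate)   # placeholder step
```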

4.Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like this:
