Course1-week2-foundation of neural network


Week 2

Basics of Neural Network Programming

2.1 binary classification

$m$ training examples: $(x^{(1)}, y^{(1)}), \cdots, (x^{(m)}, y^{(m)})$

$X = \left[ \begin{matrix} \vdots & \vdots & \cdots & \vdots \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ \vdots & \vdots & \cdots & \vdots \\ \end{matrix} \right], \quad Y = \left[ \begin{matrix} y^{(1)} & y^{(2)} & \cdots & y^{(m)} \\ \end{matrix} \right]$

X.shape = (n_x, m), Y.shape = (1, m)

2.2 logistic regression

Logistic regression is a learning algorithm for supervised learning problems in which the output label $y$ is always either 0 or 1, that is, for binary classification problems.
Given an input feature vector $x$, for example an image that you want to recognize as either a cat picture or not a cat picture, you want the algorithm to output a prediction $\hat{y}$, which is your estimate of $y$. More precisely, you want $\hat{y}$ to be the probability that $y = 1$ given the input, $\hat{y} = P(y=1 | x)$. In other words, $\hat{y}$ tells you the chance that this is a cat picture.

$\hat{y} = \sigma(w^Tx+b) = \frac{1}{1 + e^{-(w^Tx + b)}}$
When you implement logistic regression, your job is to learn the parameters $W$ and $b$ so that $\hat{y}$ becomes a good estimate of the chance that $y$ equals $1$. In order to train $W$ and $b$, you need to define a cost function.
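
As a minimal sketch of the prediction step, the snippet below computes $\hat{y}$ for one example using the logistic (sigmoid) function $1/(1+e^{-z})$. The parameter values and the input are made up for illustration and are not from the course.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and input, chosen only for illustration.
w = np.array([[0.5], [-0.25]])   # shape (n_x, 1) with n_x = 2
b = 0.1
x = np.array([[2.0], [4.0]])     # one example, shape (n_x, 1)

z = (w.T @ x).item() + b         # w^T x + b as a plain float
y_hat = sigmoid(z)               # estimate of P(y = 1 | x)
print(z, y_hat)
```

Whatever the value of $z$, the output stays strictly between 0 and 1, which is what lets us read it as a probability.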

2.3 logistic regression cost function

Let’s see what loss function or error function we can use to measure how well our algorithm is doing.
One thing we could do is the following:

$L(\hat{y}, y) = \frac12(\hat{y} - y)^2$
It turns out that you could do this, but in logistic regression people don't usually do it, because when you come to learn the parameters, the optimization problem becomes non-convex, with multiple local optima, so gradient descent may not find the global optimum.
So what we use in logistic regression is actually the following loss function:

$L(\hat{y}, y) = -\bigg(y\log\hat{y} + (1-y)\log(1-\hat{y})\bigg)$
Here is some intuition for why this loss function makes sense. We want $L$ to be as small as possible, so let's look at the two cases.

- if $y = 1 \longrightarrow L(\hat{y}, y) = -\log\hat{y} \ \longrightarrow$ want $\log\hat{y}$ large, so want $\hat{y}$ large; because $\hat{y}$ can never be bigger than 1, you want $\hat{y}$ to be close to 1.
- if $y = 0 \longrightarrow L(\hat{y}, y) = -\log(1 - \hat{y}) \ \longrightarrow$ want $1 - \hat{y}$ large, so want $\hat{y}$ small; because $\hat{y}$ can never be smaller than 0, you want $\hat{y}$ to be close to 0. In other words, if $y = 0$, the loss function pushes the parameters to make $\hat{y}$ as close to zero as possible.

The loss function measures the error on a single training example. The cost function measures how well the parameters do on the whole training set, as the average of the individual losses:

$J(W, b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$
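
The loss and cost described above can be sketched in a few lines of NumPy. The function names `loss` and `cost` and the sample predictions are assumptions made for this illustration.

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss for one example (elementwise for arrays)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(Y_hat, Y):
    """Average of the per-example losses; Y_hat and Y have shape (1, m)."""
    return float(np.mean(loss(Y_hat, Y)))

# Hypothetical predictions and labels for m = 4 examples.
Y_hat = np.array([[0.9, 0.2, 0.6, 0.1]])
Y = np.array([[1, 0, 1, 0]])

print(loss(0.9, 1))    # small loss: prediction close to the label
print(loss(0.1, 1))    # large loss: prediction far from the label
print(cost(Y_hat, Y))  # mean loss over the four examples
```

Note that the loss is undefined when $\hat{y}$ is exactly 0 or 1; the sigmoid output never reaches those values in exact arithmetic, but floating point code may need to guard against it.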
  
It turns out that logistic regression can be viewed as a very small neural network; we will see how to view it that way as we go on.
  
2.4 gradient descent

Now let's talk about how you can use the gradient descent algorithm to train, or learn, the parameters $W$ and $b$ on your training set. The cost function measures how well your parameters $W$ and $b$ are doing on the training set, so what we want is to find the values of $W$ and $b$ that correspond to the minimum of the cost function $J(W, b)$. It turns out that this cost function $J$ is convex, so it is just a single big bowl (fig. 1). The fact that $J(W, b)$ as defined here is convex is one of the main reasons why we use this particular cost function for logistic regression. Because the function is convex, almost any initialization method works for logistic regression.
  
$\text{repeat} \ \{ \ w := w - \alpha \frac{dJ(w)}{dw} \ \}$
  
We repeat this update until the algorithm converges. Now let's make sure that this gradient descent update makes sense (fig. 2). If we start off with a large value of $w$, the derivative at that point is positive, and $w$ gets updated as $w$ minus the learning rate times a positive derivative, so you end up taking a step to the left. If we start off with a small value of $w$, the slope at that point is negative, so the gradient descent update subtracts $\alpha$ times a negative number, which increases $w$ step by step. So whether you initialize on the left or on the right, the gradient descent updates move you towards the global minimum.
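
To see the update rule in action, here is a small sketch on a one-dimensional convex function. The function $J(w) = (w - 3)^2$, the learning rate and the starting points are all assumptions chosen for illustration; they are not the logistic regression cost.

```python
def J(w):
    """A simple convex bowl with its minimum at w = 3 (illustrative only)."""
    return (w - 3.0) ** 2

def dJ_dw(w):
    """Derivative of J with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, alpha=0.1, steps=100):
    """Repeat w := w - alpha * dJ/dw for a fixed number of steps."""
    for _ in range(steps):
        w = w - alpha * dJ_dw(w)
    return w

w_from_right = gradient_descent(10.0)    # starts where the slope is positive
w_from_left = gradient_descent(-10.0)    # starts where the slope is negative
print(w_from_right, w_from_left)
```

Both runs end up near $w = 3$: from the right the positive derivative moves $w$ left, and from the left the negative derivative moves $w$ right.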
  
2.5 derivatives

For a straight line $f(a) = 3a$, the slope of the function, denoted $\frac{df(a)}{da}$, equals 3. This means that if we nudge $a$ to the right a little bit, $f(a)$ goes up by 3 times as much as we nudged $a$.

2.6 more derivatives example

The derivative of a function just means its slope, and the slope can be different at different points on the function. In our first example, $f(a) = 3a$ is a straight line, so the derivative is the same everywhere: it is 3 everywhere. But for the function $f(a) = a^2$, the slope varies (the derivative is $2a$), so the slope or derivative is different at different points on the curve.
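
The "nudge" picture can be checked numerically. The sketch below approximates the slope by moving $a$ slightly to the right; the helper name `slope` and the step size are assumptions for illustration.

```python
def slope(f, a, eps=1e-3):
    """Approximate df/da by nudging a slightly to the right."""
    return (f(a + eps) - f(a)) / eps

def line(a):
    return 3 * a

def square(a):
    return a ** 2

# The line has the same slope everywhere.
print(slope(line, 2.0), slope(line, 5.0))
# The parabola has a slope of roughly 2a, so it changes with a.
print(slope(square, 2.0), slope(square, 5.0))
```

The finite nudge makes the parabola's estimates slightly larger than $2a$; the gap shrinks as `eps` gets smaller.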
  
2.7 computation graph

2.8 derivatives with a computation graph
  
2.9 logistic regression gradient descent (just one training example)

We will now talk about how to compute the derivatives you need to implement gradient descent for logistic regression.

Focus on just one example:

$z = w^Tx + b$

$\hat{y} = a = \sigma(z)$

$L(\hat{y}, y) = -\big(y\log a + (1-y)\log(1-a)\big)$
  
In logistic regression, what we want to do is modify the parameters $w$ and $b$ in order to reduce the loss. We know how to compute the loss on a single training example. Now let's talk about how you can go backward to compute the derivatives.
  
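$da = \frac{dL(a, y)}{da} = -\frac{d}{da}\big(y\log a + (1-y)\log(1-a)\big) = -\frac{y}{a} + \frac{1-y}{1-a}$

$dz = \frac{dL(a, y)}{dz} = \frac{dL(a, y)}{da}\frac{da}{dz} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right)a(1-a) = \frac{-y(1-a) + a(1-y)}{a(1-a)}\,a(1-a) = a - y$

where:

$a = \sigma(z) = \frac{1}{1 + e^{-z}}$

$\frac{da}{dz} = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{-1 + 1 + e^{-z}}{(1+e^{-z})^2} = -\frac{1}{(1+e^{-z})^2} + \frac{1}{1+e^{-z}} = a - a^2 = a(1-a)$

$dw_1 = \frac{dL}{dw_1} = \frac{dL}{dz}\frac{dz}{dw_1} = x_1\,dz$

$dw_2 = \frac{dL}{dw_2} = \frac{dL}{dz}\frac{dz}{dw_2} = x_2\,dz$

$db = \frac{dL}{db} = \frac{dL}{dz}\frac{dz}{db} = dz$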
  
So if you want to do gradient descent with respect to just one example, you do the following:

$w_1 := w_1 - \alpha\,dw_1$

$w_2 := w_2 - \alpha\,dw_2$

$b := b - \alpha\,db$

This is one step with respect to a single example, but to train a logistic regression model you have more than one example.
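
The forward pass, backward pass and update for one example can be sketched as follows. The example values, the zero initialization and the learning rate are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical example with two features, and made-up starting parameters.
x1, x2, y = 1.0, 2.0, 1.0
w1, w2, b = 0.0, 0.0, 0.0
alpha = 0.1

# Forward pass: z, a and the loss on this example.
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)
loss_before = -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Backward pass: the derivatives derived above.
dz = a - y
dw1 = x1 * dz
dw2 = x2 * dz
db = dz

# One gradient descent update.
w1 = w1 - alpha * dw1
w2 = w2 - alpha * dw2
b = b - alpha * db

# Recompute the loss to check that the step reduced it.
a_after = sigmoid(w1 * x1 + w2 * x2 + b)
loss_after = -(y * np.log(a_after) + (1 - y) * np.log(1 - a_after))
print(loss_before, loss_after)
```

With these numbers the prediction starts at 0.5, and the update moves it towards the label $y = 1$, so the loss goes down.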
  
2.10 gradient descent on m examples

In the previous section we saw how to compute derivatives and implement gradient descent with respect to just one training example for logistic regression. Now we want to do it for $m$ examples.
$J (W, b) = \frac{1}{m} \sum_{i = 1}^{m} L (a^{(i)}, y^{(i)})$

$a^{(i)} = \sigma(z^{(i)}) = \sigma(w^Tx^{(i)} + b)$

When we have just one training example $(x^{(i)}, y^{(i)})$, we saw how to compute the derivatives $dw_1^{(i)}, dw_2^{(i)}$ and $db^{(i)}$.

The overall cost function is the average of the individual losses, so the derivative of the cost function with respect to $w_1$ is also the average of the derivatives of the individual losses with respect to $w_1$.

$\frac{\partial}{\partial w_1} J(W, b) = \frac{1}{m}\sum_{i=1}^{m} \frac{\partial}{\partial w_1} L(a^{(i)}, y^{(i)})$

One single step, or one single iteration, of gradient descent for logistic regression:
```
J = 0; dw_1 = 0; dw_2 = 0; db = 0
for i = 1 to m:
    z^(i) = w^T x^(i) + b
    a^(i) = sigmoid(z^(i))
    J += -(y^(i) log(a^(i)) + (1 - y^(i)) log(1 - a^(i)))
    dz^(i) = a^(i) - y^(i)
    dw_1 += x_1^(i) dz^(i)
    ...
    dw_n += x_n^(i) dz^(i)
    db += dz^(i)
J /= m
dw_1 /= m
dw_2 /= m
db /= m
w_1 = w_1 - alpha * dw_1
w_2 = w_2 - alpha * dw_2
...
b = b - alpha * db
```
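
The pseudocode above can be turned into a runnable sketch with explicit loops. The function name `gradient_step_loop`, the toy data and the learning rate are assumptions made for illustration; the returned cost is the one measured before the update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step_loop(w, b, X, Y, alpha):
    """One gradient descent iteration using explicit for loops.

    w: (n_x,), b: float, X: (n_x, m), Y: (1, m).
    """
    n_x, m = X.shape
    J, dw, db = 0.0, np.zeros(n_x), 0.0
    for i in range(m):                    # loop over training examples
        x_i, y_i = X[:, i], Y[0, i]
        a_i = sigmoid(np.dot(w, x_i) + b)
        J += -(y_i * np.log(a_i) + (1 - y_i) * np.log(1 - a_i))
        dz_i = a_i - y_i
        for j in range(n_x):              # loop over features
            dw[j] += x_i[j] * dz_i
        db += dz_i
    J, dw, db = J / m, dw / m, db / m
    return w - alpha * dw, b - alpha * db, J

# Hypothetical toy data: 2 features, 4 examples.
X = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 0.0, 1.0, 0.0]])
Y = np.array([[0, 0, 1, 1]])
w, b = np.zeros(2), 0.0
costs = []
for _ in range(100):
    w, b, J = gradient_step_loop(w, b, X, Y, alpha=0.5)
    costs.append(J)
print(costs[0], costs[-1])
```

This version has two nested loops, one over the $m$ examples and one over the features, which is what vectorization removes.
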
Next let's talk about vectorization, so that you can implement a single iteration of gradient descent without using any for loop.

2.11 vectorization
```
import numpy as np
import time

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# Vectorized dot product
tic = time.time()
c = np.dot(a, b)
toc = time.time()
print(c)
print('Vectorized  version ' + str(1000*(toc - tic)) + 'ms')

# Same dot product with an explicit for loop
c = 0
tic = time.time()
for i in range(1000000):
    c += a[i]*b[i]
toc = time.time()
print(c)
print('for loop', str(1000*(toc - tic)), 'ms')
```
```
249962.033774
Vectorized  version 2.5033950805664062ms
249962.033774
for loop 674.983024597168 ms
```
2.12 more vectorization examples

2.13 vectorizing logistic regression

We have talked about how vectorization lets you speed up your code significantly. Now we will talk about how to vectorize the implementation of logistic regression, so that you can process the entire training set, that is, implement a single iteration of gradient descent with respect to the entire training set, without using even a single explicit for loop.
  
$z^{(1)} = w^Tx^{(1)} + b$

$a^{(1)} = \sigma(z^{(1)})$

$\vdots$

$z^{(m)} = w^Tx^{(m)} + b$

$a^{(m)} = \sigma(z^{(m)})$
  
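Remember that we defined a matrix capital $X$ to be the training inputs, stacked together in different columns.
$X = \left[ \begin{matrix} \vdots & \vdots & \cdots & \vdots \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ \vdots & \vdots & \cdots & \vdots \end{matrix} \right]$

Next we will show how to compute $z^{(1)}, z^{(2)}, \cdots, z^{(m)}$ all in one step, with one line of code. Here $b$ is a single number that is added to every entry of the row vector.

$Z = [z^{(1)}, z^{(2)}, \cdots, z^{(m)}] = [w_1, w_2, \cdots, w_{n_x}] \left[ \begin{matrix} \vdots & \vdots & \cdots & \vdots \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ \vdots & \vdots & \cdots & \vdots \end{matrix} \right] + [b, b, \cdots, b] = [w^Tx^{(1)} + b, w^Tx^{(2)} + b, \cdots, w^Tx^{(m)} + b] = w^TX + b$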
```
Z = np.dot(w.T, X) + b
A = sigmoid(Z)
```
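
As a sanity check, the sketch below compares the one-line vectorized forward pass with an explicit loop over the examples. The `sigmoid` helper is not defined in the snippet above, so it is written out here; the sizes, the seed and the value of `b` are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)      # fixed seed for reproducibility
n_x, m = 3, 5                       # hypothetical sizes
X = rng.standard_normal((n_x, m))   # one example per column
w = rng.standard_normal((n_x, 1))
b = 0.5

# Vectorized: the scalar b is broadcast and added to every entry.
Z = np.dot(w.T, X) + b              # shape (1, m)
A = sigmoid(Z)

# Loop version, for comparison only.
A_loop = np.zeros((1, m))
for i in range(m):
    z_i = np.dot(w[:, 0], X[:, i]) + b
    A_loop[0, i] = sigmoid(z_i)

print(Z.shape, np.allclose(A, A_loop))
```
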
2.14 vectorizing logistic regression's gradient computation

You saw how to use vectorization to compute the predictions for an entire training set all at the same time. Now you will see how to use vectorization to also perform the gradient computations for all $m$ training examples, again all at the same time.
  
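$\left. \begin{matrix} dz^{(1)} = a^{(1)} - y^{(1)} \\ dz^{(2)} = a^{(2)} - y^{(2)} \\ \vdots \\ dz^{(m)} = a^{(m)} - y^{(m)} \end{matrix} \right\} \longrightarrow dZ = [dz^{(1)}, dz^{(2)}, \cdots, dz^{(m)}] = A - Y$

$\left. \begin{matrix} db = 0 \\ db \mathrel{+}= dz^{(1)} \\ db \mathrel{+}= dz^{(2)} \\ \vdots \\ db \mathrel{+}= dz^{(m)} \\ db \mathrel{/}= m \end{matrix} \right\} \longrightarrow db = \frac{1}{m}\sum_{i=1}^{m} dz^{(i)} = \frac{1}{m}\,\text{np.sum}(dZ)$

$\left. \begin{matrix} dw = 0 \\ dw \mathrel{+}= x^{(1)}dz^{(1)} \\ dw \mathrel{+}= x^{(2)}dz^{(2)} \\ \vdots \\ dw \mathrel{+}= x^{(m)}dz^{(m)} \\ dw \mathrel{/}= m \end{matrix} \right\} \longrightarrow dw = \frac{1}{m}X\,dZ^T = \frac{1}{m}[x^{(1)}, x^{(2)}, \cdots, x^{(m)}] \left[ \begin{matrix} dz^{(1)} \\ \vdots \\ dz^{(m)} \end{matrix} \right]$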
  
Implementing one iteration of gradient descent for logistic regression without using a single for loop:

$Z = w^TX + b$ (Z.shape = (1, m))

$A = \sigma(Z)$

$dZ = A - Y$

$dw = \frac{1}{m}X\,dZ^T$

$db = \frac{1}{m}\,\text{np.sum}(dZ)$

$w = w - \alpha\,dw$

$b = b - \alpha\,db$
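
Putting these lines together gives the following sketch of a fully vectorized iteration. The function names `gradient_step` and `cost`, the toy data and the learning rate are assumptions for illustration. A loop over iterations is still needed to run gradient descent for many steps.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, b, X, Y, alpha):
    """One vectorized gradient descent iteration.

    w: (n_x, 1), b: float, X: (n_x, m), Y: (1, m).
    """
    m = X.shape[1]
    Z = np.dot(w.T, X) + b           # (1, m)
    A = sigmoid(Z)                   # (1, m)
    dZ = A - Y                       # (1, m)
    dw = np.dot(X, dZ.T) / m         # (n_x, 1)
    db = np.sum(dZ) / m              # scalar
    return w - alpha * dw, b - alpha * db

def cost(w, b, X, Y):
    """Average cross-entropy loss over the training set."""
    A = sigmoid(np.dot(w.T, X) + b)
    return float(np.mean(-(Y * np.log(A) + (1 - Y) * np.log(1 - A))))

# Hypothetical toy data: 2 features, 4 examples.
X = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 0.0, 1.0, 0.0]])
Y = np.array([[0, 0, 1, 1]])
w, b = np.zeros((2, 1)), 0.0

cost_start = cost(w, b, X, Y)
for _ in range(200):                 # loop over iterations, not examples
    w, b = gradient_step(w, b, X, Y, alpha=0.5)
cost_end = cost(w, b, X, Y)
print(cost_start, cost_end)
```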
  
2.15 broadcasting in python
```
A = np.array([[56.0, 0.0, 4.4, 68.],
              [1.2, 104.0, 52.0, 8.0],
              [1.8, 135.0, 99.0, 0.9]])
print(A)
```
```
[[  56.     0.     4.4   68. ]
 [   1.2  104.    52.     8. ]
 [   1.8  135.    99.     0.9]]
```
```
cal = A.sum(axis = 0) # axis = 0 tells numpy to sum vertically (down each column)
print(cal)
```
```
[  59.   239.   155.4   76.9]
```
```
percentage = A/cal
print(percentage)
```
```
[[ 0.94915254  0.          0.02831403  0.88426528]
 [ 0.02033898  0.43514644  0.33462033  0.10403121]
 [ 0.03050847  0.56485356  0.63706564  0.01170351]]
```
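
The division `A/cal` works because NumPy broadcasts the length-4 vector across the three rows of `A`. The sketch below makes the shapes explicit; the `reshape` call, the factor of 100 and the `offset` array are additions for illustration and are not part of the example above.

```python
import numpy as np

A = np.array([[56.0, 0.0, 4.4, 68.0],
              [1.2, 104.0, 52.0, 8.0],
              [1.8, 135.0, 99.0, 0.9]])
cal = A.sum(axis=0)                        # shape (4,)

# (3, 4) / (1, 4): the single row is repeated down all 3 rows.
percentage = 100 * A / cal.reshape(1, 4)
print(percentage.sum(axis=0))              # each column sums to 100

# (3, 4) + (3, 1): the single column is repeated across all 4 columns.
offset = np.array([[100.0], [200.0], [300.0]])
print((A + offset).shape)
```

Calling `reshape(1, 4)` is redundant here, since `cal` already broadcasts correctly, but it makes the intended shape explicit.
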
2.16 a note on python or numpy vectors
```
a = np.random.rand(10)
print(a)
```
```
[ 0.22354426  0.90835372  0.70797423  0.37848066  0.76930812  0.21109645
  0.80027059  0.54396317  0.39067918  0.44721829]
```
```
a.shape
```
```
(10,)
```
This is called a rank 1 array in python, and it is neither a row vector nor a column vector.
```
print(a.T)
```
```
[ 0.22354426  0.90835372  0.70797423  0.37848066  0.76930812  0.21109645
  0.80027059  0.54396317  0.39067918  0.44721829]
```
```
print(np.dot(a, a.T))
```
```
3.44491371769
```
```
a = np.random.randn(5, 1)
print(a)
```
```
[[-1.43316675]
 [ 0.41149367]
 [ 1.29073252]
 [ 0.75121868]
 [-0.70654665]]
```
```
print(a.shape)
```
```
(5, 1)
```
```
print(a.T)
```
```
[[-1.43316675  0.41149367  1.29073252  0.75121868 -0.70654665]]
```
```
print(np.dot(a, a.T))
```
```
[[ 2.05396694 -0.58973904 -1.84983493 -1.07662164  1.01259917]
 [-0.58973904  0.16932704  0.53112826  0.30912173 -0.29073947]
 [-1.84983493  0.53112826  1.66599043  0.96962238 -0.91196273]
 [-1.07662164  0.30912173  0.96962238  0.5643295  -0.53077104]
 [ 1.01259917 -0.29073947 -0.91196273 -0.53077104  0.49920817]]
```
So what I recommend is that when you do your programming exercises, you simply do not use these rank 1 arrays. Instead, every time you create an array, commit to making it either a column vector or a row vector; the behavior of such vectors is easier to understand.

If you are not sure about the dimensions of one of your vectors, you can throw in an assertion statement:
```
assert a.shape == (5, 1)
```
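
If you already have a rank 1 array, one way to follow this advice is to reshape it explicitly. The sketch below uses a small made-up array to show how the shapes behave.

```python
import numpy as np

a = np.arange(5.0)            # rank 1 array, shape (5,)
col = a.reshape(5, 1)         # column vector, shape (5, 1)
row = a.reshape(1, 5)         # row vector, shape (1, 5)

assert a.T.shape == (5,)      # transposing a rank 1 array changes nothing
assert col.T.shape == (1, 5)  # transposing a column vector gives a row vector

inner = np.dot(row, col)      # (1, 5) x (5, 1) -> (1, 1)
outer = np.dot(col, row)      # (5, 1) x (1, 5) -> (5, 5)
print(inner.shape, outer.shape)
```

With explicit row and column vectors, the order of the arguments to `np.dot` decides whether you get an inner or an outer product, which matches the usual matrix rules.
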
2.17 explanation of logistic regression cost function

Why do we use this cost function for logistic regression?

$\hat{y} = \sigma(w^Tx + b)$

We said that we want to interpret $\hat{y}$ as the chance that $y = 1$ given the input features $x$.
So another way to say this is:
  
  $\hat{y} = p (y = 1 | x); 1 - \hat{y} = p (y = 0 | x)$
  
or we can say:

$\text{if } y = 1: \quad p(y|x) = \hat{y}$

$\text{if } y = 0: \quad p(y|x) = 1 - \hat{y}$

Next, what I'm going to do is take these two equations, which define $p(y|x)$ for the two cases $y = 0$ and $y = 1$, and summarize them in a single equation.

$p(y|x) = \hat{y}^{y}(1 - \hat{y})^{(1 - y)}$
  
It turns out that this one line summarizes the two equations above.
- suppose $y = 1$: $p(y|x) = \hat{y}^{1} \cdot (1 - \hat{y})^{0} = \hat{y}$
- suppose $y = 0$: $p(y|x) = \hat{y}^0 \cdot (1 - \hat{y})^{1 - 0} = 1 - \hat{y}$

Because the log function is strictly monotonically increasing, maximizing $\log p(y|x)$ gives the same result as maximizing $p(y|x)$.

$\log p(y|x) = \log\big(\hat{y}^{y}(1-\hat{y})^{(1-y)}\big) = y\log\hat{y} + (1-y)\log(1-\hat{y}) = -L(\hat{y}, y)$
  
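For the whole training set we can carry out maximum likelihood estimation: we want to find the parameters that maximize the chance of the observations in the training set. Assuming the training examples are independent, the probability of the whole set is the product of the individual probabilities, so:

$\log\prod_{i=1}^{m} p(y^{(i)}|x^{(i)}) = \sum_{i=1}^{m} \log p(y^{(i)}|x^{(i)}) = -\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$

So maximizing the likelihood is the same as minimizing the sum of the losses, and the cost function $J(W, b)$ is this sum scaled by $\frac{1}{m}$.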
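
The identity between the log-likelihood and the summed loss can be checked numerically. The predictions and labels below are made up for illustration.

```python
import numpy as np

# Hypothetical predictions and labels for m = 4 examples.
Y_hat = np.array([0.9, 0.2, 0.6, 0.1])
Y = np.array([1, 0, 1, 0])

# p(y|x) for each example, using the single-equation form.
p = Y_hat ** Y * (1 - Y_hat) ** (1 - Y)

# Log-likelihood of the whole set, assuming independent examples.
log_likelihood = np.sum(np.log(p))

# Per-example cross-entropy losses.
losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))

print(log_likelihood, -np.sum(losses))
```

The two printed numbers agree up to floating point error, which is the statement that maximizing the likelihood and minimizing the total loss are the same problem.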