Course1-week2-foundation of neural network

2023-10-11 04:38


Week 2

Basics of Neural Network Programming

2.1 binary classification

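$m$ training examples: $(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})$

$$X = \begin{bmatrix} \vdots & \vdots & \cdots & \vdots \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ \vdots & \vdots & \cdots & \vdots \end{bmatrix}, \quad Y = \begin{bmatrix} y^{(1)} & y^{(2)} & \cdots & y^{(m)} \end{bmatrix}$$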

X.shape = (n_x, m), Y.shape = (1, m)
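
As a small illustration of these shapes, the sketch below stacks a few made-up feature vectors as the columns of X. The values and the names `examples` and `labels` are assumptions chosen only for this example.

```python
import numpy as np

# Three hypothetical training examples, each with n_x = 2 features.
examples = [np.array([0.5, 1.0]), np.array([1.5, -0.5]), np.array([-1.0, 2.0])]
labels = [1, 0, 1]

# Each example becomes one column of X; the labels form a single row.
X = np.stack(examples, axis=1)          # shape (n_x, m)
Y = np.array(labels).reshape(1, -1)     # shape (1, m)

print(X.shape, Y.shape)
```

Keeping one example per column is what later allows a single matrix product to process the whole training set.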

2.2 logistic regression

Logistic regression is a learning algorithm for supervised learning problems in which every output label $y$ is either 0 or 1, that is, for binary classification.
Given an input feature vector $x$, for example an image that you want to classify as a cat picture or not a cat picture, you want the algorithm to output a prediction $\hat{y}$, which is your estimate of $y$. More precisely, you want $\hat{y}$ to be the probability $P(y = 1 \mid x)$: the chance that this is a cat picture.

$$\hat{y} = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$$

When you implement logistic regression, your job is to learn the parameters $w$ and $b$ so that $\hat{y}$ becomes a good estimate of the probability that $y$ equals 1. To train $w$ and $b$, you need to define a cost function.
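
A minimal sketch of this prediction step, assuming NumPy. The function name `predict_proba` and the parameter values are made up for illustration and are not part of the course material.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """Return y_hat = sigmoid(w^T x + b) for one feature vector x."""
    return sigmoid(np.dot(w, x) + b)

# Made-up parameters and input, for illustration only.
w = np.array([0.5, -0.25])
b = 0.1
x = np.array([2.0, 4.0])

y_hat = predict_proba(w, b, x)   # z = 1.0 - 1.0 + 0.1 = 0.1
print(y_hat)
```

Because the sigmoid maps any real number into the interval (0, 1), the output can be read as a probability.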

2.3 logistic regression cost function

Let’s see what loss function or error function we can use to measure how well our algorithm is doing.
One thing we could do is the following:

$$L(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2$$

It turns out that you could do this, but in logistic regression people don’t usually do it, because when you come to learn the parameters, the optimization problem becomes non-convex, with multiple local optima. So gradient descent may not find the global optimum.
What we actually use in logistic regression is the following loss function:

$$L(\hat{y}, y) = -\left(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right)$$

Here’s some intuition for why this loss function makes sense.
We want $L$ to be as small as possible. To see why this makes sense, let’s look at the two cases.

  1. If $y = 1$: $L(\hat{y}, y) = -\log \hat{y}$. You want $\log \hat{y}$ to be large, and because $\hat{y}$ can never be bigger than 1, you want $\hat{y}$ to be close to 1.
  2. If $y = 0$: $L(\hat{y}, y) = -\log(1 - \hat{y})$. You want $1 - \hat{y}$ to be large, so you want $\hat{y}$ to be small, and because $\hat{y}$ can never be smaller than 0, you want it close to 0. In other words, if $y = 0$, the loss function pushes the parameters to make $\hat{y}$ as close to zero as possible.

Cost function:

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$$

It turns out that logistic regression can be viewed as a very small neural network. Next, let’s see how we can view logistic regression that way.
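
The loss and cost above can be sketched in a few lines of NumPy. The predictions and labels below are made-up numbers, and the function names `loss` and `cost` are chosen only for this example.

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss for a single example."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(Y_hat, Y):
    """Average loss over all m examples; Y_hat and Y have shape (1, m)."""
    return float(np.mean(loss(Y_hat, Y)))

# A confident correct prediction has a small loss,
# a confident wrong prediction has a large one.
print(loss(0.9, 1), loss(0.1, 1))

# Made-up predictions and labels for m = 3 examples.
Y_hat = np.array([[0.9, 0.2, 0.6]])
Y = np.array([[1, 0, 1]])
print(cost(Y_hat, Y))
```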

2.4 gradient descent

Now let’s talk about how you can use the gradient descent algorithm to train, or learn, the parameters $w$ and $b$ on your training set. The cost function measures how well your parameters $w$ and $b$ are doing on the training set, so what we want is to find the values of $w$ and $b$ that correspond to the minimum of the cost function $J(w, b)$. It turns out that this cost function $J$ is convex, so it is just a single big bowl (fig. 1). The fact that $J(w, b)$ as defined here is convex is one of the main reasons we use this particular cost function for logistic regression. Because of this, almost any initialization method works for logistic regression.

$$\text{repeat } \left\{\; w := w - \alpha \frac{dJ(w)}{dw} \;\right\}$$

You repeat this update until the algorithm converges. Now let’s make sure that this gradient descent update makes sense (fig. 2). If we start off with a large value of $w$, the derivative there is positive, and $w$ is updated as $w$ minus the learning rate $\alpha$ times a positive derivative, so you take a step to the left. If we start off with a small value of $w$, the slope at that point is negative, so the update subtracts $\alpha$ times a negative number and $w$ slowly increases. So whether you initialize on the left or on the right, gradient descent moves you towards the global minimum.
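
To make the update rule concrete, here is a minimal sketch on a one-dimensional convex function. The cost $J(w) = (w - 3)^2$, the learning rate and the number of steps are assumptions chosen for illustration; they are not the logistic regression cost.

```python
def J(w):
    """A made-up convex cost with its minimum at w = 3."""
    return (w - 3.0) ** 2

def dJ_dw(w):
    """Derivative of J with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, alpha=0.1, steps=200):
    """Repeat w := w - alpha * dJ/dw a fixed number of times."""
    for _ in range(steps):
        w = w - alpha * dJ_dw(w)
    return w

w_from_right = gradient_descent(10.0)   # starts right of the minimum
w_from_left = gradient_descent(-10.0)   # starts left of the minimum
print(w_from_right, w_from_left)        # both end up close to 3
```

Starting on either side of the bowl leads to the same minimum, which is the behavior described above.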

2.5 derivatives

For the straight line $f(a) = 3a$, the slope of $f$ is written $\frac{df(a)}{da}$ and equals 3. This means that if we nudge $a$ to the right by a little bit, $f(a)$ goes up by 3 times the amount we nudged $a$.

2.6 more derivatives example

The derivative of a function is just its slope, and the slope can be different at different points on the function. In our first example, $f(a) = 3a$ is a straight line, so the derivative is the same everywhere: it is 3 everywhere. But for the function $f(a) = a^2$, the slope varies, so the derivative is different at different points on the curve.
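
The "nudge" idea can be tried numerically. This is a rough sketch: the helper name `slope` and the nudge size are assumptions, and the result is only an approximation of the true derivative.

```python
def slope(f, a, nudge=1e-6):
    """Estimate df/da by nudging a slightly to the right."""
    return (f(a + nudge) - f(a)) / nudge

def line(a):
    return 3 * a

def square(a):
    return a ** 2

for a in (1.0, 2.0, 5.0):
    # The line has slope 3 everywhere; the parabola's slope is about 2a.
    print(a, slope(line, a), slope(square, a))
```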

2.7 computation graph

2.8 derivatives with a computation graph

2.9 logistic regression gradient descent (just one training example)

We will now see how to compute the derivatives you need to implement gradient descent for logistic regression.

Focus on just one example:

$$z = w^T x + b$$

$$\hat{y} = a = \sigma(z)$$

$$L(a, y) = -\left(y \log a + (1 - y) \log(1 - a)\right)$$

In logistic regression we want to modify the parameters $w$ and $b$ in order to reduce the loss. We know how to compute the loss on a single training example. Now let’s talk about how to go backward to compute the derivatives.


$$da = \frac{dL(a, y)}{da} = -\frac{d}{da}\left(y \log a + (1 - y) \log(1 - a)\right) = -\frac{y}{a} + \frac{1 - y}{1 - a}$$

$$dz = \frac{dL(a, y)}{dz} = \frac{dL(a, y)}{da} \cdot \frac{da}{dz} = \left(-\frac{y}{a} + \frac{1 - y}{1 - a}\right) a(1 - a) = \frac{-y(1 - a) + a(1 - y)}{a(1 - a)} \cdot a(1 - a) = a - y$$

$$a = \sigma(z) = \frac{1}{1 + e^{-z}}$$

$$\frac{da}{dz} = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{-1 + 1 + e^{-z}}{(1 + e^{-z})^2} = -\frac{1}{(1 + e^{-z})^2} + \frac{1}{1 + e^{-z}} = a - a^2 = a(1 - a)$$

$$dw_1 = \frac{dL}{dw_1} = \frac{dL}{dz} \cdot \frac{dz}{dw_1} = x_1 \, dz$$

$$dw_2 = \frac{dL}{dw_2} = \frac{dL}{dz} \cdot \frac{dz}{dw_2} = x_2 \, dz$$

$$db = \frac{dL}{db} = \frac{dL}{dz} \cdot \frac{dz}{db} = dz$$

So if you want to do gradient descent with respect to just one example, what you do is the following:

$$w_1 := w_1 - \alpha \, dw_1$$

$$w_2 := w_2 - \alpha \, dw_2$$

$$b := b - \alpha \, db$$

This is one step with respect to a single example, but to train a logistic regression model you have more than one example.
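
The derivatives above can be checked numerically. The sketch below uses a made-up example with two features, compares $dw_1 = x_1 \, dz$ against a finite-difference estimate, and then takes one update step. All values and the function names are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    a = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

def gradients(w, b, x, y):
    """Backward pass for one example: dz = a - y, dw = x * dz, db = dz."""
    a = sigmoid(np.dot(w, x) + b)
    dz = a - y
    return x * dz, dz

# Made-up example with two features.
w, b = np.array([0.3, -0.2]), 0.1
x, y = np.array([1.0, 2.0]), 1.0

dw, db = gradients(w, b, x, y)

# Finite-difference estimate of dL/dw_1 to compare against dw[0].
eps = 1e-6
numeric_dw1 = (loss(w + np.array([eps, 0.0]), b, x, y) - loss(w, b, x, y)) / eps
print(dw[0], numeric_dw1)

# One gradient descent step on this single example.
alpha = 0.1
w_new, b_new = w - alpha * dw, b - alpha * db
print(loss(w, b, x, y), loss(w_new, b_new, x, y))   # the loss goes down
```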

2.10 gradient descent on m examples

Previously we saw how to compute derivatives and implement gradient descent with respect to just one training example for logistic regression. Now we want to do it for $m$ examples.

$$a^{(i)} = \sigma(z^{(i)}) = \sigma(w^T x^{(i)} + b)$$

When we have just one training example $(x^{(i)}, y^{(i)})$, we saw how to compute the derivatives $dw_1^{(i)}$, $dw_2^{(i)}$ and $db^{(i)}$.

The overall cost function is the average of the individual losses, so the derivative of the cost function with respect to $w_1$ is also the average of the derivatives of the individual losses with respect to $w_1$.

$$\frac{\partial}{\partial w_1} J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial w_1} L(a^{(i)}, y^{(i)})$$

One single step, that is one iteration, of gradient descent for logistic regression:

J = 0; dw_1 = 0; dw_2 = 0; db = 0
for i = 1 to m:
    z^(i) = w^T x^(i) + b
    a^(i) = sigmoid(z^(i))
    J += -(y^(i) log(a^(i)) + (1 - y^(i)) log(1 - a^(i)))
    dz^(i) = a^(i) - y^(i)
    dw_1 += x_1^(i) dz^(i)
    ...
    dw_n += x_n^(i) dz^(i)    # one line per feature; n = 2 in this example
    db += dz^(i)
J /= m
dw_1 /= m
dw_2 /= m
db /= m
w_1 = w_1 - alpha * dw_1
w_2 = w_2 - alpha * dw_2
b = b - alpha * db
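
A runnable version of this loop might look like the sketch below. The data set, the learning rate and the function name `gradient_step_loop` are assumptions for illustration. The per-feature accumulators are kept in one array `dw` for brevity, while the loop over the $m$ examples stays explicit.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step_loop(w, b, X, Y, alpha):
    """One gradient descent iteration using an explicit loop over examples."""
    n_x, m = X.shape
    J, dw, db = 0.0, np.zeros(n_x), 0.0
    for i in range(m):
        x_i, y_i = X[:, i], Y[0, i]
        a_i = sigmoid(np.dot(w, x_i) + b)
        J += -(y_i * np.log(a_i) + (1 - y_i) * np.log(1 - a_i))
        dz_i = a_i - y_i
        dw += x_i * dz_i     # accumulates dw_1, ..., dw_n together
        db += dz_i
    J, dw, db = J / m, dw / m, db / m
    return w - alpha * dw, b - alpha * db, J

# Made-up data: m = 4 examples with n_x = 2 features.
X = np.array([[1.0, 2.0, -1.0, -2.0],
              [0.5, 1.5, -0.5, -1.5]])
Y = np.array([[1, 1, 0, 0]])

w, b = np.zeros(2), 0.0
costs = []
for _ in range(100):
    w, b, J = gradient_step_loop(w, b, X, Y, alpha=0.5)
    costs.append(J)
print(costs[0], costs[-1])   # the cost decreases over the iterations
```

Note that the returned `J` is the cost measured before the parameters are updated, so `costs[0]` is the cost at the all-zero initialization.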


Next let’s talk about vectorization, so that you can implement a single iteration of gradient descent without using any for loops.

2.11 vectorization

import numpy as np
import time

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# Vectorized version: a single call to np.dot
tic = time.time()
c = np.dot(a, b)
toc = time.time()
print('Vectorized  version ' + str(1000*(toc - tic)) + 'ms')

# Non-vectorized version: an explicit Python for loop
c = 0
tic = time.time()
for i in range(1000000):
    c += a[i]*b[i]
toc = time.time()
print('for loop', str(1000*(toc - tic)), 'ms')

Output:

Vectorized  version 2.5033950805664062ms
for loop 674.983024597168 ms

2.12 more vectorization examples

2.13 vectorizing logistic regression

We have talked about how vectorization lets you speed up your code significantly. Now we will see how to vectorize the implementation of logistic regression, so that you can process the entire training set, that is, implement a single iteration of gradient descent over the entire training set, without using even a single explicit for loop.

$$z^{(1)} = w^T x^{(1)} + b$$

$$a^{(1)} = \sigma(z^{(1)})$$

$$\vdots$$

$$z^{(m)} = w^T x^{(m)} + b$$

$$a^{(m)} = \sigma(z^{(m)})$$

Remember that we defined the matrix capital $X$ to be the training inputs stacked together in different columns.

Next we will show how to compute $z^{(1)}, z^{(2)}, \cdots, z^{(m)}$ all in one step, with one line of code.

$$Z = [z^{(1)}, z^{(2)}, \cdots, z^{(m)}] = [w_1, w_2, \cdots, w_{n_x}] \begin{bmatrix} \vdots & \vdots & \cdots & \vdots \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ \vdots & \vdots & \cdots & \vdots \end{bmatrix} + [b, b, \cdots, b] = [w^T x^{(1)} + b, \; w^T x^{(2)} + b, \; \cdots, \; w^T x^{(m)} + b] = w^T X + b$$

Z = np.dot(w.T, X) + b    # b is a scalar; NumPy broadcasts it to every entry
A = sigmoid(Z)
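
As a quick check that the one-line version matches the per-example computation, here is a small sketch with randomly generated parameters and data. The sizes, the seed and the `sigmoid` helper are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up parameters and data: n_x = 3 features, m = 5 examples.
rng = np.random.default_rng(0)
w = rng.standard_normal((3, 1))   # column vector, shape (n_x, 1)
b = 0.5
X = rng.standard_normal((3, 5))   # shape (n_x, m)

# Vectorized: all m activations in two lines.
Z = np.dot(w.T, X) + b            # shape (1, m)
A = sigmoid(Z)

# Loop version, one example at a time, for comparison.
A_loop = np.zeros((1, 5))
for i in range(5):
    z_i = np.dot(w[:, 0], X[:, i]) + b
    A_loop[0, i] = sigmoid(z_i)

print(A.shape, np.allclose(A, A_loop))
```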

2.14 vectorizing logistic regression’s gradient computation

You saw how to use vectorization to compute the predictions for an entire training set all at the same time. Now you will see how to use vectorization to also perform the gradient computations for all $m$ training examples, again all at the same time.

$$\left. \begin{aligned} dz^{(1)} &= a^{(1)} - y^{(1)} \\ dz^{(2)} &= a^{(2)} - y^{(2)} \\ &\;\;\vdots \\ dz^{(m)} &= a^{(m)} - y^{(m)} \end{aligned} \right\} \longrightarrow dZ = [dz^{(1)}, dz^{(2)}, \cdots, dz^{(m)}] = A - Y$$



$$\left. \begin{aligned} db &= 0 \\ db &\mathrel{+}= dz^{(1)} \\ db &\mathrel{+}= dz^{(2)} \\ &\;\;\vdots \\ db &\mathrel{+}= dz^{(m)} \end{aligned} \right\} \longrightarrow db = \frac{1}{m} \sum_{i=1}^{m} dz^{(i)} = \frac{1}{m} \, \texttt{np.sum}(dZ)$$

$$\left. \begin{aligned} dw &= 0 \\ dw &\mathrel{+}= x^{(1)} dz^{(1)} \\ dw &\mathrel{+}= x^{(2)} dz^{(2)} \\ &\;\;\vdots \\ dw &\mathrel{+}= x^{(m)} dz^{(m)} \end{aligned} \right\} \longrightarrow dw = \frac{1}{m} X \, dZ^T = \frac{1}{m} \left[ x^{(1)}, x^{(2)}, \cdots, x^{(m)} \right] \begin{bmatrix} dz^{(1)} \\ \vdots \\ dz^{(m)} \end{bmatrix}$$

Implement one iteration of gradient descent for logistic regression without using a single for loop:

$$\begin{aligned} Z &= w^T X + b \quad (\texttt{Z.shape} = (1, m)) \\ A &= \sigma(Z) \\ dZ &= A - Y \\ dw &= \frac{1}{m} X \, dZ^T \\ db &= \frac{1}{m} \, \texttt{np.sum}(dZ) \\ w &= w - \alpha \, dw \\ b &= b - \alpha \, db \end{aligned}$$
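
The steps above translate almost line by line into NumPy. The following is a minimal sketch, with made-up data and hypothetical function names `gradient_step` and `cost`, rather than a definitive implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, b, X, Y, alpha):
    """One vectorized gradient descent iteration.

    Shapes: w (n_x, 1), X (n_x, m), Y (1, m); b is a scalar.
    """
    m = X.shape[1]
    Z = np.dot(w.T, X) + b        # (1, m)
    A = sigmoid(Z)
    dZ = A - Y                    # (1, m)
    dw = np.dot(X, dZ.T) / m      # (n_x, 1)
    db = np.sum(dZ) / m
    return w - alpha * dw, b - alpha * db

def cost(w, b, X, Y):
    A = sigmoid(np.dot(w.T, X) + b)
    return float(np.mean(-(Y * np.log(A) + (1 - Y) * np.log(1 - A))))

# Made-up data: m = 4 examples with n_x = 2 features.
X = np.array([[1.0, 2.0, -1.0, -2.0],
              [0.5, 1.5, -0.5, -1.5]])
Y = np.array([[1, 1, 0, 0]])

w, b = np.zeros((2, 1)), 0.0
cost_before = cost(w, b, X, Y)
for _ in range(100):
    w, b = gradient_step(w, b, X, Y, alpha=0.5)
print(cost_before, cost(w, b, X, Y))   # the cost decreases
```

A single iteration contains no loop over examples or features. The remaining `for` loop only repeats the iteration, which is still needed to run gradient descent for several steps.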

2.15 broadcasting in python

A = np.array([[56.0, 0.0, 4.4, 68.0],
              [1.2, 104.0, 52.0, 8.0],
              [1.8, 135.0, 99.0, 0.9]])
print(A)
[[  56.     0.     4.4   68. ]
 [   1.2  104.    52.     8. ]
 [   1.8  135.    99.     0.9]]

cal = A.sum(axis = 0)  # axis = 0 tells Python to sum vertically, down each column
print(cal)
[  59.   239.   155.4   76.9]

percentage = A/cal  # cal is broadcast across the three rows of A
print(percentage)
[[ 0.94915254  0.          0.02831403  0.88426528]
 [ 0.02033898  0.43514644  0.33462033  0.10403121]
 [ 0.03050847  0.56485356  0.63706564  0.01170351]]
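
The division above relies on broadcasting a length-4 array against a (3, 4) matrix. The sketch below makes the shapes explicit with `reshape` and also shows the column-vector case. The names `cal_row`, `row_totals` and `row_share` are made up for this example.

```python
import numpy as np

A = np.array([[56.0, 0.0, 4.4, 68.0],
              [1.2, 104.0, 52.0, 8.0],
              [1.8, 135.0, 99.0, 0.9]])

cal = A.sum(axis=0)                  # shape (4,)
cal_row = cal.reshape(1, 4)          # explicit row vector, shape (1, 4)

# (3, 4) / (1, 4): the single row is stretched over all 3 rows of A.
percentage = 100 * A / cal_row
print(percentage.sum(axis=0))        # every column sums to 100

# Broadcasting a column vector (3, 1) against the same matrix.
row_totals = A.sum(axis=1, keepdims=True)   # shape (3, 1)
row_share = A / row_totals                   # each row now sums to 1
print(row_share.sum(axis=1))
```

Calling `reshape` is cheap, so being explicit about the intended shape costs little and makes the broadcasting easier to follow.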

2.16 a note on python or numpy vectors

a = np.random.rand(10)
print(a)
[ 0.22354426  0.90835372  0.70797423  0.37848066  0.76930812  0.21109645
  0.80027059  0.54396317  0.39067918  0.44721829]

This is called a rank 1 array in Python, and it is neither a row vector nor a column vector. Its transpose looks exactly the same as the array itself:

print(a.T)
[ 0.22354426  0.90835372  0.70797423  0.37848066  0.76930812  0.21109645
  0.80027059  0.54396317  0.39067918  0.44721829]
print(np.dot(a, a.T))    # prints a single number (the inner product), not a matrix

Compare this with an explicit column vector:

a = np.random.randn(5, 1)
print(a)
[[-1.43316675]
 [ 0.41149367]
 [ 1.29073252]
 [ 0.75121868]
 [-0.70654665]]
print(a.shape)
(5, 1)
print(a.T)
[[-1.43316675  0.41149367  1.29073252  0.75121868 -0.70654665]]
print(np.dot(a, a.T))    # outer product, shape (5, 5)
[[ 2.05396694 -0.58973904 -1.84983493 -1.07662164  1.01259917]
 [-0.58973904  0.16932704  0.53112826  0.30912173 -0.29073947]
 [-1.84983493  0.53112826  1.66599043  0.96962238 -0.91196273]
 [-1.07662164  0.30912173  0.96962238  0.5643295  -0.53077104]
 [ 1.01259917 -0.29073947 -0.91196273 -0.53077104  0.49920817]]

So what I recommend is that when you do your programming exercises, you simply do not use these rank 1 arrays. Instead, every time you create an array, commit to making it either a column vector or a row vector; the behavior of such vectors is easier to understand.

If you are not sure about the dimensions of a vector, you can throw in an assertion statement:

assert(a.shape == (5, 1))
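
The difference between the three kinds of arrays shows up directly in their shapes. In this small sketch the seed and the variable names `rank1`, `column` and `row` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

rank1 = rng.standard_normal(5)        # shape (5,): rank 1 array
column = rank1.reshape(5, 1)          # shape (5, 1): column vector
row = rank1.reshape(1, 5)             # shape (1, 5): row vector

print(rank1.shape, rank1.T.shape)      # transposing a rank 1 array changes nothing
print(np.dot(rank1, rank1.T).shape)    # (): a scalar
print(np.dot(column, column.T).shape)  # (5, 5): outer product
print(np.dot(row, row.T).shape)        # (1, 1): inner product kept as a matrix

assert column.shape == (5, 1)
assert row.shape == (1, 5)
```

If a rank 1 array does turn up, `reshape` converts it into an explicit row or column vector.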

2.17 explanation of logistic regression cost function

Why do we use this cost function for logistic regression?

$$\hat{y} = \sigma(w^T x + b)$$

We said that we want to interpret $\hat{y}$ as the chance that $y = 1$ given the input features $x$.
Another way to say this is:

$$\hat{y} = p(y = 1 \mid x)$$

or we can say:

$$\begin{aligned} \text{if } y = 1: &\quad p(y \mid x) = \hat{y} \\ \text{if } y = 0: &\quad p(y \mid x) = 1 - \hat{y} \end{aligned}$$

Next, we take these two equations, which define $p(y \mid x)$ for the two cases $y = 0$ and $y = 1$, and summarize them in a single equation.

$$p(y \mid x) = \hat{y}^{\,y} (1 - \hat{y})^{(1 - y)}$$

It turns out that this one line summarizes the two equations above.

• Suppose $y = 1$: $p(y \mid x) = \hat{y}^{1} \cdot (1 - \hat{y})^{0} = \hat{y}$
• Suppose $y = 0$: $p(y \mid x) = \hat{y}^{0} \cdot (1 - \hat{y})^{1 - 0} = 1 - \hat{y}$

Because the log function is strictly monotonically increasing, maximizing $\log p(y \mid x)$ gives the same result as maximizing $p(y \mid x)$.

$$\log p(y \mid x) = \log\left(\hat{y}^{\,y} (1 - \hat{y})^{(1 - y)}\right) = y \log \hat{y} + (1 - y) \log(1 - \hat{y}) = -L(\hat{y}, y)$$

For the whole training set we can carry out maximum likelihood estimation: we want to find the parameters that maximize the probability of the observations in the training set. Treating the training examples as independent, the probability of the whole set is the product of the individual probabilities.

$$\log \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}) = \sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}) = -\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$$

So maximizing the likelihood is the same as minimizing the sum of the losses, and scaling that sum by $\frac{1}{m}$ gives the cost function $J(w, b)$.
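
The identity between the log-likelihood and the negative sum of losses can be verified on a handful of numbers. The predictions and labels below are made up for illustration.

```python
import numpy as np

# Made-up predictions and labels for m = 4 examples.
y_hat = np.array([0.9, 0.2, 0.7, 0.4])
y = np.array([1, 0, 1, 0])

# p(y | x) = y_hat^y * (1 - y_hat)^(1 - y), one value per example.
p = y_hat ** y * (1 - y_hat) ** (1 - y)

# Cross-entropy loss per example.
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

log_likelihood = np.sum(np.log(p))
print(log_likelihood, -np.sum(loss))   # the two numbers agree
```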
