1. Objective

In logistic regression, the goal is to update the parameters $W$ and $b$ so as to minimize the loss function.

To achieve this, we need to:

  • Compute predictions (forward pass)
  • Compute derivatives (backward pass)
  • Update parameters using gradient descent

2. Forward Propagation

For a single training example with two features $x_1, x_2$:

Step 1: Linear combination

$$z = w_1 x_1 + w_2 x_2 + b$$

Step 2: Activation (sigmoid)

$$a = \hat{y} = \sigma(z)$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.

Step 3: Loss function

$$L(a, y) = -\left[ y \log(a) + (1 - y)\log(1 - a) \right]$$

This completes the forward pass.
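The three forward steps can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `sigmoid` and `forward` and the sample input values are assumptions chosen for this example.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w1, w2, b, y):
    """Return (z, a, loss) for a single training example."""
    z = w1 * x1 + w2 * x2 + b                                # Step 1: linear combination
    a = sigmoid(z)                                           # Step 2: activation
    loss = -(y * math.log(a) + (1 - y) * math.log(1 - a))    # Step 3: loss
    return z, a, loss

# Hypothetical inputs: all parameters start at zero, label y = 1.
z, a, loss = forward(x1=1.0, x2=2.0, w1=0.0, w2=0.0, b=0.0, y=1)
print(z, a, loss)
```

With all parameters at zero, $z = 0$ and $a = 0.5$, so the loss equals $\log 2$ regardless of the feature values.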


3. Backward Propagation (Derivatives)

We now compute derivatives starting from the loss and moving backward through the computation graph.

Step 1: Derivative with respect to $a$

$$\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}$$

(In implementation, this is stored as dA)

Step 2: Derivative with respect to $z$

Using the chain rule:

$$\frac{\partial L}{\partial z} = a - y$$

(This simplification is the key result of the derivation.)

This comes from combining:

  • $\frac{\partial L}{\partial a}$
  • $\frac{\partial a}{\partial z} = a(1-a)$
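Written out in full, the product of these two factors collapses as follows (a worked version of the chain-rule step, using only the expressions already given):

$$
\begin{aligned}
\frac{\partial L}{\partial z}
  &= \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \\
  &= \left( -\frac{y}{a} + \frac{1-y}{1-a} \right) a(1-a) \\
  &= -y(1-a) + (1-y)\,a \\
  &= a - y
\end{aligned}
$$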

Step 3: Derivatives with respect to parameters

$$\frac{\partial L}{\partial w_1} = x_1 \cdot (a - y)$$

$$\frac{\partial L}{\partial w_2} = x_2 \cdot (a - y)$$

$$\frac{\partial L}{\partial b} = a - y$$

In implementation:

  • $dW_1 = x_1 \cdot dz$
  • $dW_2 = x_2 \cdot dz$
  • $db = dz$
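As a rough sketch, the backward pass for one example reduces to a handful of multiplications once the activation is known. The function name `backward` and the sample values are assumptions made for illustration; the variable names mirror the notation above.

```python
def backward(x1, x2, a, y):
    """Return (dW1, dW2, db) for a single training example."""
    dz = a - y        # derivative of the loss with respect to z
    dW1 = x1 * dz     # derivative with respect to w1
    dW2 = x2 * dz     # derivative with respect to w2
    db = dz           # derivative with respect to b
    return dW1, dW2, db

# Hypothetical values: activation a = 0.5, true label y = 1.
dW1, dW2, db = backward(x1=1.0, x2=2.0, a=0.5, y=1)
print(dW1, dW2, db)
```

Here $dz = -0.5$, so each gradient is negative, which means a gradient descent step would increase the parameters and push the prediction toward the label $y = 1$.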

4. Gradient Descent Update

After computing gradients, update parameters:

$$w_1 := w_1 - \alpha \cdot dW_1$$

$$w_2 := w_2 - \alpha \cdot dW_2$$

$$b := b - \alpha \cdot db$$

Where:

  • $\alpha$ is the learning rate
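A single update step might look like the sketch below. The function name `update`, the gradient values, and the learning rate of 0.1 are all assumptions picked for illustration, not recommended settings.

```python
def update(w1, w2, b, dW1, dW2, db, alpha):
    """Apply one gradient descent step and return the new parameters."""
    w1 = w1 - alpha * dW1
    w2 = w2 - alpha * dW2
    b = b - alpha * db
    return w1, w2, b

# Hypothetical gradients and learning rate.
w1, w2, b = update(w1=0.0, w2=0.0, b=0.0,
                   dW1=-0.5, dW2=-1.0, db=-0.5, alpha=0.1)
print(w1, w2, b)
```

Because the gradients in this example are negative, subtracting $\alpha$ times each gradient moves every parameter in the positive direction.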

5. Summary of Computation Flow

Forward pass:

$$x \rightarrow z \rightarrow a \rightarrow L$$

Backward pass:

$$L \rightarrow a \rightarrow z \rightarrow (w, b)$$

This follows the computation graph:

  • Forward: left → right
  • Backward: right → left

6. Key Insight

  • The backward pass applies the chain rule
  • The derivative simplifies to:

$$dz = a - y$$

This simplification is crucial for efficient implementation.
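One way to gain confidence in the simplified derivative is to compare it against a numerical estimate. The sketch below uses a central finite difference; the helper `loss_from_z`, the test point `z = 0.3`, and the step size `eps` are assumptions chosen for this example.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def loss_from_z(z, y):
    """Loss expressed directly as a function of z."""
    a = sigmoid(z)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

z, y, eps = 0.3, 1, 1e-6

# Central finite-difference estimate of dL/dz.
numeric = (loss_from_z(z + eps, y) - loss_from_z(z - eps, y)) / (2 * eps)

# Closed-form result: dz = a - y.
analytic = sigmoid(z) - y

print(numeric, analytic)
```

The two values should agree to several decimal places, which is a quick sanity check that $dz = a - y$ is the correct derivative.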


7. Limitation of This Version

This derivation applies to a single training example.

In practice:

  • We use a dataset with $m$ examples
  • Gradients are averaged over all examples
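A minimal sketch of the averaged version, assuming a plain loop over the examples. The function name `averaged_gradients` and the tiny two-example dataset are hypothetical and serve only to show the averaging step.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def averaged_gradients(X, Y, w1, w2, b):
    """Average the per-example gradients over all m examples."""
    m = len(X)
    dW1 = dW2 = db = 0.0
    for (x1, x2), y in zip(X, Y):
        a = sigmoid(w1 * x1 + w2 * x2 + b)   # forward pass for this example
        dz = a - y                            # per-example derivative
        dW1 += x1 * dz                        # accumulate gradients
        dW2 += x2 * dz
        db += dz
    return dW1 / m, dW2 / m, db / m           # divide by m to average

# Hypothetical dataset with m = 2 examples.
X = [(1.0, 2.0), (3.0, 4.0)]
Y = [1, 0]
dW1, dW2, db = averaged_gradients(X, Y, w1=0.0, w2=0.0, b=0.0)
print(dW1, dW2, db)
```

With zero parameters both activations are 0.5, so the two per-example values of $dz$ are $-0.5$ and $0.5$. They cancel in `db` but not in `dW1` and `dW2`, because each is weighted by a different feature value.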