1. Objective

Previously, gradient descent was derived for a single training example.
In practice, we train logistic regression using m training examples.

The goal is:

  • Compute gradients over the entire dataset
  • Update parameters $W$ and $b$ to minimize the cost function


2. Cost Function Definition

The cost function is defined as the average of the loss over all training examples:

$$J(W, b) = \frac{1}{m} \sum_{i=1}^{m} L(a^{(i)}, y^{(i)})$$

Where:

  • $a^{(i)} = \sigma(z^{(i)})$
  • $z^{(i)} = W^T x^{(i)} + b$
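
The definition above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes $L$ is the cross-entropy loss used in the loop of Section 4, and the names `sigmoid`, `cost`, `X`, and `y` as well as the sample data are made up for the example.

```python
import math

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def cost(W, b, X, y):
    """Average loss J(W, b) over the m examples in X.

    X is a list of m feature lists and y a list of m labels in {0, 1}
    (an assumed data layout, chosen only for this sketch).
    """
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        z_i = sum(w_j * x_j for w_j, x_j in zip(W, x_i)) + b  # z = W^T x + b
        a_i = sigmoid(z_i)                                     # a = sigma(z)
        # Cross-entropy loss for this example (assumed form of L).
        total += -(y_i * math.log(a_i) + (1 - y_i) * math.log(1 - a_i))
    return total / m

# Hypothetical toy data: 3 examples, 2 features each.
X = [[0.5, 1.5], [1.0, -1.0], [-2.0, 0.3]]
y = [1, 0, 1]

# With all parameters at zero every a_i is 0.5, so J is log(2), about 0.693.
print(cost([0.0, 0.0], 0.0, X, y))
```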

3. Key Idea: Averaging Gradients

For multiple examples, the gradient of the cost function is the average of the per-example loss gradients. This follows because $J$ is an average of the losses and differentiation is linear.

For example:

$$\frac{\partial J}{\partial w_1} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial L^{(i)}}{\partial w_1}$$

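Here $L^{(i)}$ is shorthand for $L(a^{(i)}, y^{(i)})$, the loss on example $i$. The same identity holds for every component of $W$ and for $b$.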
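
One way to convince yourself of the averaging identity is a numerical check. The sketch below estimates the derivative with respect to $w_1$ by central finite differences, once for the full cost and once per example, and compares the two. The cross-entropy form of the loss, the helper `d_dw1`, and the toy data are all assumptions made for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(W, b, x, y):
    """Cross-entropy loss on a single example (assumed form of L)."""
    a = sigmoid(sum(w_j * x_j for w_j, x_j in zip(W, x)) + b)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def cost(W, b, X, Y):
    """Average of the per-example losses."""
    return sum(loss(W, b, x, y) for x, y in zip(X, Y)) / len(X)

def d_dw1(f, W, eps=1e-6):
    """Central finite-difference estimate of df/dw_1 at W (hypothetical helper)."""
    W_plus = [W[0] + eps] + W[1:]
    W_minus = [W[0] - eps] + W[1:]
    return (f(W_plus) - f(W_minus)) / (2 * eps)

# Hypothetical toy data and parameters.
X = [[0.5, 1.5], [1.0, -1.0], [-2.0, 0.3]]
Y = [1, 0, 1]
W, b = [0.1, -0.2], 0.05

# Left-hand side: derivative of the full cost J.
grad_of_cost = d_dw1(lambda V: cost(V, b, X, Y), W)

# Right-hand side: average of the per-example derivatives.
avg_of_grads = sum(
    d_dw1(lambda V, x=x, y=y: loss(V, b, x, y), W) for x, y in zip(X, Y)
) / len(X)

print(grad_of_cost, avg_of_grads)  # the two estimates should agree closely
```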


4. Algorithm Structure

To implement gradient descent:

Step 1: Initialize variables

  • $J = 0$
  • $dW = 0$
  • $db = 0$

Step 2: Loop over training examples

For each example $i = 1$ to $m$:

Forward computation:

$$z^{(i)} = W^T x^{(i)} + b$$
$$a^{(i)} = \sigma(z^{(i)})$$

Compute loss contribution:

$$J \mathrel{+}= -\left[ y^{(i)} \log(a^{(i)}) + (1 - y^{(i)}) \log(1 - a^{(i)}) \right]$$


Backward computation:

$$dz^{(i)} = a^{(i)} - y^{(i)}$$
$$dW \mathrel{+}= x^{(i)} \cdot dz^{(i)}$$
$$db \mathrel{+}= dz^{(i)}$$


5. Averaging Step

After processing all training examples:

$$dW = \frac{1}{m} dW$$
$$db = \frac{1}{m} db$$
$$J = \frac{1}{m} J$$
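
The initialize, accumulate, and average steps can be sketched as a single function. This is an illustrative plain-Python version with both loops written out explicitly. The function name `compute_cost_and_gradients` and the list-of-lists layout for `X` are assumptions, not part of the notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def compute_cost_and_gradients(W, b, X, y):
    """Return (J, dW, db) for m examples with n features each.

    X is a list of m feature lists and y a list of m labels in {0, 1}
    (an assumed layout for this sketch).
    """
    m, n = len(X), len(W)

    # Step 1: initialize the accumulators.
    J = 0.0
    dW = [0.0] * n
    db = 0.0

    # Step 2: loop over training examples.
    for i in range(m):
        # Forward computation: z = W^T x + b, a = sigma(z).
        z_i = b
        for j in range(n):
            z_i += W[j] * X[i][j]
        a_i = sigmoid(z_i)

        # Loss contribution of this example.
        J += -(y[i] * math.log(a_i) + (1 - y[i]) * math.log(1 - a_i))

        # Backward computation: accumulate gradient contributions.
        dz_i = a_i - y[i]
        for j in range(n):          # explicit loop over features
            dW[j] += X[i][j] * dz_i
        db += dz_i

    # Averaging step: divide every accumulator by m.
    J /= m
    dW = [g / m for g in dW]
    db /= m
    return J, dW, db

# Hypothetical toy data: 2 examples, 2 features.
J, dW, db = compute_cost_and_gradients([0.0, 0.0], 0.0, [[1.0, 2.0], [3.0, 4.0]], [1, 0])
print(J, dW, db)  # J is log(2), about 0.693; dW is [0.5, 0.5]; db is 0.0
```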


6. Parameter Update

Apply gradient descent:

$$W := W - \alpha \cdot dW$$
$$b := b - \alpha \cdot db$$
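
Repeating the update rule for several iterations gives a complete, if slow, training loop. The sketch below is one possible arrangement. The learning rate `alpha=0.1`, the iteration count, the zero initialization, and the toy data are arbitrary choices made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradients(W, b, X, y):
    """Loop-based cost and averaged gradients (J, dW, db)."""
    m, n = len(X), len(W)
    J, dW, db = 0.0, [0.0] * n, 0.0
    for x_i, y_i in zip(X, y):
        a_i = sigmoid(sum(w * x for w, x in zip(W, x_i)) + b)
        J += -(y_i * math.log(a_i) + (1 - y_i) * math.log(1 - a_i))
        dz_i = a_i - y_i
        for j in range(n):
            dW[j] += x_i[j] * dz_i
        db += dz_i
    return J / m, [g / m for g in dW], db / m

def gradient_descent(X, y, alpha=0.1, num_iterations=100):
    """Run gradient descent from zero-initialized parameters (an assumption)."""
    W, b = [0.0] * len(X[0]), 0.0
    history = []                      # cost before each update, for inspection
    for _ in range(num_iterations):
        J, dW, db = gradients(W, b, X, y)
        history.append(J)
        W = [w - alpha * g for w, g in zip(W, dW)]   # W := W - alpha * dW
        b = b - alpha * db                           # b := b - alpha * db
    return W, b, history

# Hypothetical toy data: 4 examples, 2 features.
X = [[0.5, 1.5], [1.0, -1.0], [-2.0, 0.3], [1.5, 2.0]]
y = [1, 0, 0, 1]
W, b, history = gradient_descent(X, y)
print(history[0], history[-1])  # the cost should be lower at the end than at the start
```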


7. Important Implementation Detail

  • $dW$ and $db$ act as accumulators
  • They sum contributions from all training examples
  • After averaging, they represent gradients of the full cost function

8. Computational Limitation

This implementation uses:

  • A loop over the $m$ training examples
  • A loop over the features, to update each component of $dW$

Explicit loops like these make the computation slow when the dataset or the number of features is large.


9. Motivation for Vectorization

To improve efficiency:

  • Avoid explicit loops
  • Use vectorized operations

Vectorization allows:

  • Faster computation
  • Better scalability for large datasets
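
As a preview of what a vectorized version might look like, the sketch below uses NumPy to replace both explicit loops with array operations. It is an assumption-laden illustration rather than the method these notes go on to derive: the examples are stored as rows of `X` with shape `(m, n)`, and the function name `compute_cost_and_gradients_vectorized` is made up for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost_and_gradients_vectorized(W, b, X, y):
    """Return (J, dW, db) without explicit Python loops.

    Assumed shapes: X is (m, n), W is (n,), y is (m,), b is a scalar.
    """
    m = X.shape[0]
    A = sigmoid(X @ W + b)            # all m activations at once
    J = -np.mean(y * np.log(A) + (1 - y) * np.log(1 - A))
    dZ = A - y                        # all m values of dz^(i)
    dW = X.T @ dZ / m                 # sum of x^(i) * dz^(i) over i, divided by m
    db = np.mean(dZ)                  # average of dz^(i)
    return J, dW, db

# Hypothetical toy data: 2 examples, 2 features.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 0.0])
J, dW, db = compute_cost_and_gradients_vectorized(np.zeros(2), 0.0, X, y)
print(J, dW, db)  # J is about 0.693 (log 2); dW is close to [0.5, 0.5]; db is close to 0
```

The matrix product `X.T @ dZ` performs the accumulation over examples and the loop over features in one call, which is why no accumulator variables appear.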

10. Key Takeaways

  1. Gradient over dataset = average of individual gradients
  2. Accumulate gradients over all examples
  3. Normalize by dividing by $m$
  4. Update parameters using gradient descent
  5. Loop-based implementation is inefficient → vectorization needed