1. Objective
Previously, gradient descent was derived for a single training example.
In practice, we train logistic regression using m training examples.
The goal is:
Compute gradients over the entire dataset
Update parameters $w$ and $b$ to minimize the cost function
2. Cost Function Definition
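The cost function is defined as the average of the loss over all training examples:
$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(a^{(i)}, y^{(i)}\right)$$
Where:
- $a^{(i)} = \sigma(z^{(i)}) = \sigma(w^T x^{(i)} + b)$ is the prediction for example $i$
- $y^{(i)}$ is the true label for example $i$
- $\mathcal{L}(a, y) = -\left[y \log a + (1 - y) \log(1 - a)\right]$ is the logistic loss for a single example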
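As a minimal sketch of this definition, the cost can be computed with an explicit loop in plain Python. The function names `sigmoid`, `loss`, and `cost` are illustrative choices, and the code assumes the standard cross-entropy loss for logistic regression with labels in {0, 1}.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def loss(a, y):
    """Cross-entropy loss for one example (assumed form of the loss)."""
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def cost(w, b, X, Y):
    """Average loss over all m examples; X is a list of feature lists."""
    m = len(X)
    total = 0.0
    for x, y in zip(X, Y):
        z = sum(wj * xj for wj, xj in zip(w, x)) + b  # z = w^T x + b
        total += loss(sigmoid(z), y)
    return total / m
```

With all parameters at zero every prediction is 0.5, so the cost equals log 2 regardless of the data, which makes a convenient sanity check.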
3. Key Idea: Averaging Gradients
For multiple examples:
The gradient of the cost function is the average of gradients from each example
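For example:
$$\frac{\partial J}{\partial w_1} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial w_1} \mathcal{L}\left(a^{(i)}, y^{(i)}\right)$$
This applies to all parameters $w_j$ and $b$.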
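The averaging property follows from the linearity of differentiation. The short derivation below assumes the sigmoid activation and cross-entropy loss, for which the single-example derivative with respect to $z$ is $a - y$; here $x_j^{(i)}$ denotes feature $j$ of example $i$.

$$
\begin{aligned}
\frac{\partial J}{\partial w_j}
  &= \frac{\partial}{\partial w_j} \left[ \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(a^{(i)}, y^{(i)}\right) \right]
   = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial w_j} \mathcal{L}\left(a^{(i)}, y^{(i)}\right)
   = \frac{1}{m} \sum_{i=1}^{m} \left(a^{(i)} - y^{(i)}\right) x_j^{(i)} \\
\frac{\partial J}{\partial b}
  &= \frac{1}{m} \sum_{i=1}^{m} \left(a^{(i)} - y^{(i)}\right)
\end{aligned}
$$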
4. Algorithm Structure
To implement gradient descent:
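Step 1: Initialize variables
$$J = 0, \quad dw_j = 0 \ (j = 1, \dots, n), \quad db = 0$$
Here $n$ is the number of features.
Step 2: Loop over training examples
For each example $i = 1$ to $m$:
Forward computation:
$$z^{(i)} = w^T x^{(i)} + b, \qquad a^{(i)} = \sigma(z^{(i)})$$
Compute loss contribution:
$$J \mathrel{+}= -\left[y^{(i)} \log a^{(i)} + (1 - y^{(i)}) \log(1 - a^{(i)})\right]$$
Backward computation:
$$dz^{(i)} = a^{(i)} - y^{(i)}$$
$$dw_j \mathrel{+}= x_j^{(i)} \, dz^{(i)} \quad (j = 1, \dots, n)$$
$$db \mathrel{+}= dz^{(i)}$$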
5. Averaging Step
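After processing all training examples, divide each accumulator by $m$:
$$J := \frac{J}{m}, \qquad dw_j := \frac{dw_j}{m}, \qquad db := \frac{db}{m}$$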
6. Parameter Update
Apply gradient descent with learning rate $\alpha$:
$$w_j := w_j - \alpha \, dw_j, \qquad b := b - \alpha \, db$$
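Putting the initialization, per-example loop, averaging, and update together, one gradient descent step might look like the sketch below. The function name `gradient_descent_step`, the list-based data layout, and the learning-rate argument `alpha` are assumptions made for illustration; the loss is assumed to be the standard cross-entropy.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_descent_step(w, b, X, Y, alpha):
    """One loop-based gradient descent step over m examples with n features."""
    m, n = len(X), len(w)

    # Step 1: initialize the accumulators
    J = 0.0
    dw = [0.0] * n
    db = 0.0

    # Step 2: loop over training examples
    for i in range(m):
        x, y = X[i], Y[i]
        # Forward computation
        z = sum(w[j] * x[j] for j in range(n)) + b
        a = sigmoid(z)
        # Loss contribution
        J += -(y * math.log(a) + (1 - y) * math.log(1 - a))
        # Backward computation
        dz = a - y
        for j in range(n):  # second loop, over features
            dw[j] += x[j] * dz
        db += dz

    # Averaging step
    J /= m
    dw = [g / m for g in dw]
    db /= m

    # Parameter update
    w_new = [w[j] - alpha * dw[j] for j in range(n)]
    b_new = b - alpha * db
    return w_new, b_new, J
```

The returned `J` is the cost at the parameters passed in, before the update. Calling the function repeatedly and feeding the new `w` and `b` back in gives the full training loop.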
7. Important Implementation Detail
- $dw_j$ and $db$ act as accumulators
- They sum contributions from all training examples
- After averaging, they represent gradients of the full cost function
8. Computational Limitation
This implementation uses:
- A loop over training examples
- A loop over features
This leads to inefficient computation for large datasets
9. Motivation for Vectorization
To improve efficiency:
Avoid explicit loops
Use vectorized operations
Vectorization allows:
- Faster computation
- Better scalability for large datasets
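To illustrate what removing the loops can look like, here is a hedged NumPy sketch of a single gradient descent step. The function name `vectorized_step` and the layout of `X` as an (m, n) array with one example per row are assumptions for this example; other layouts, such as one example per column, work equally well with the transposes adjusted.

```python
import numpy as np

def vectorized_step(w, b, X, Y, alpha):
    """One gradient descent step with no explicit Python loops.

    Assumed shapes: X is (m, n), Y is (m,), w is (n,), b is a scalar.
    """
    m = X.shape[0]
    Z = X @ w + b                      # all z^(i) at once
    A = 1.0 / (1.0 + np.exp(-Z))       # all predictions a^(i)
    J = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    dZ = A - Y                         # all dz^(i)
    dw = X.T @ dZ / m                  # averaged gradient for every w_j
    db = np.sum(dZ) / m                # averaged gradient for b
    return w - alpha * dw, b - alpha * db, J
```

The matrix products replace both the loop over examples and the loop over features, so the per-step work is handed to optimized array routines instead of the Python interpreter.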
10. Key Takeaways
- Gradient over dataset = average of individual gradients
- Accumulate gradients over all examples
- Normalize by dividing by $m$
- Update parameters using gradient descent
- Loop-based implementation is inefficient → vectorization needed
