1. Objective of Training
In logistic regression, we define:
- A loss function for a single training example
- A cost function $J(W, b)$ for the entire training set
The cost function is defined as the average of the loss over all training examples:

$$J(W, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big)$$

where $m$ is the number of training examples, $\hat{y}^{(i)}$ is the prediction for the $i$-th example, and $\mathcal{L}$ is the single-example loss.
The goal of training is:
Find parameters $W$ and $b$ that minimize the cost function $J(W, b)$
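To make this concrete, here is a minimal NumPy sketch of the standard cross-entropy loss for one example and the averaged cost $J(W, b)$; the helper names and array shapes are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def loss(y_hat, y):
    # Cross-entropy loss for a single example:
    # L(y_hat, y) = -[y*log(y_hat) + (1 - y)*log(1 - y_hat)]
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(w, b, X, y):
    # Cost J(w, b): the average of the loss over all m training examples.
    y_hat = 1 / (1 + np.exp(-(X @ w + b)))  # sigmoid of the linear output
    return np.mean(loss(y_hat, y))
```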
2. Interpretation of the Cost Function
The cost function can be interpreted as a surface over the parameter space:
- Horizontal axes: the parameters $W$ and $b$
- Vertical axis: the value of $J(W, b)$
Each point corresponds to a specific value of the cost function.
The minimum point on this surface represents the optimal parameters.
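One way to visualize this surface is to evaluate $J$ over a grid of $(W, b)$ values for a one-feature model. A hedged sketch, reusing the illustrative `cost` helper above and inventing a tiny dataset purely for demonstration:

```python
# Tiny single-feature dataset, purely for illustration.
X = np.array([[0.5], [1.5], [2.0], [3.0]])
y = np.array([0.0, 1.0, 0.0, 1.0])

ws = np.linspace(-4, 4, 50)
bs = np.linspace(-4, 4, 50)
# J[i, j] is the height of the cost surface at (w = ws[j], b = bs[i]).
J = np.array([[cost(np.array([w]), b, X, y) for w in ws] for b in bs])

i, j = np.unravel_index(J.argmin(), J.shape)
print(f"grid minimum near w = {ws[j]:.2f}, b = {bs[i]:.2f}")
```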
3. Convexity of the Cost Function
For logistic regression, the cost function is convex.
This means:
- The surface has a single global minimum
- There are no local minima
This property is critical because it ensures that optimization algorithms can reliably find the best solution.
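For reference, the standard definition (a general mathematical fact, not specific to these notes): a function $f$ is convex if for all parameter vectors $\theta_1, \theta_2$ and every $\lambda \in [0, 1]$,

$$f\big(\lambda \theta_1 + (1 - \lambda)\,\theta_2\big) \;\le\; \lambda f(\theta_1) + (1 - \lambda) f(\theta_2)$$

Geometrically, the chord between any two points on the surface never lies below the surface, which is what rules out isolated local minima.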
4. Gradient Descent Algorithm
Gradient descent is an iterative optimization algorithm used to minimize the cost function.
The algorithm updates the parameters step by step in the direction that reduces the cost.
5. Parameter Update Rule
The update rule for gradient descent is:

$$W := W - \alpha \frac{\partial J(W, b)}{\partial W}$$
$$b := b - \alpha \frac{\partial J(W, b)}{\partial b}$$

Where:
- $\alpha$: the learning rate
- $\frac{\partial J(W, b)}{\partial W}$, $\frac{\partial J(W, b)}{\partial b}$: the partial derivatives (gradients) of the cost with respect to each parameter
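As a minimal sketch, one step of this rule in Python (the gradients `dw` and `db` are assumed to be computed elsewhere; see the full loop after section 7):

```python
def update(w, b, dw, db, alpha):
    # Move each parameter a step of size alpha against its gradient.
    w = w - alpha * dw
    b = b - alpha * db
    return w, b
```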
6. Meaning of the Gradient
The gradient represents the slope of the cost function at the current parameter values.
- If the slope is positive → decrease the parameter
- If the slope is negative → increase the parameter
This ensures that each update moves the parameters in the direction of steepest descent.
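For example, if the current slope is $\frac{\partial J}{\partial W} = 2$ (positive) and $\alpha = 0.1$, the update $W := W - 0.1 \cdot 2$ lowers $W$ by $0.2$; a slope of $-2$ would instead raise $W$ by $0.2$. In both cases the update moves $W$ downhill on the cost surface.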
7. Iterative Process
The gradient descent algorithm follows these steps:
- Initialize parameters $W$ and $b$ (commonly to zero)
- Compute the gradient of the cost function
- Update parameters using the update rule
- Repeat until convergence
Over multiple iterations, the parameters gradually move toward the minimum of the cost function.
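Putting the steps together, here is a self-contained sketch of the loop for logistic regression. The gradient formulas used ($\frac{\partial J}{\partial W} = \frac{1}{m} X^\top (\hat{y} - y)$ and $\frac{\partial J}{\partial b} = \frac{1}{m} \sum_i (\hat{y}^{(i)} - y^{(i)})$) are the standard ones for the cross-entropy cost; the fixed iteration count stands in for a real convergence check.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, y, alpha=0.1, num_iters=1000):
    """Gradient descent for logistic regression.

    X: (m, n) feature matrix; y: (m,) labels in {0, 1}.
    Returns learned weights w (n,) and bias b (scalar).
    """
    m, n = X.shape
    w = np.zeros(n)  # step 1: initialize parameters (commonly to zero)
    b = 0.0
    for _ in range(num_iters):
        y_hat = sigmoid(X @ w + b)      # forward pass: current predictions
        dw = (X.T @ (y_hat - y)) / m    # step 2: gradient w.r.t. w
        db = np.mean(y_hat - y)         #         gradient w.r.t. b
        w -= alpha * dw                 # step 3: update parameters
        b -= alpha * db
    return w, b                         # step 4: repeat until done
```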
8. Role of Learning Rate (α)
The learning rate controls the size of each update step.
- If $\alpha$ is too large → updates may overshoot the minimum
- If $\alpha$ is too small → convergence becomes very slow
Choosing an appropriate learning rate is essential for efficient training.
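Both failure modes are easy to see on a toy one-dimensional cost, $J(w) = w^2$ with gradient $2w$ (an illustrative stand-in for the logistic cost, not taken from the notes):

```python
def descend(alpha, w=1.0, steps=50):
    # Gradient descent on J(w) = w**2, whose gradient is 2*w.
    for _ in range(steps):
        w -= alpha * 2 * w
    return w

print(descend(1.1))    # too large: iterates overshoot and diverge
print(descend(0.001))  # too small: barely moves after 50 steps
print(descend(0.1))    # reasonable: converges close to the minimum at 0
```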
9. Behavior Under Convexity
Because the cost function in logistic regression is convex:
- Gradient descent converges to the same global minimum regardless of the starting point
- The choice of initialization therefore does not significantly affect the final result
This makes the optimization process stable and predictable.
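A minimal empirical check, assuming the `sigmoid` helper and the tiny `X`, `y` dataset from the sketches above: runs started from very different points should report (nearly) the same final parameters.

```python
def train_from(w, b, alpha=0.1, num_iters=20000):
    # Same loop as train(), but starting from a caller-chosen point.
    for _ in range(num_iters):
        y_hat = sigmoid(X @ w + b)
        w = w - alpha * (X.T @ (y_hat - y)) / len(y)
        b = b - alpha * np.mean(y_hat - y)
    return w, b

print(train_from(np.array([5.0]), 5.0))    # the two prints should agree
print(train_from(np.array([-5.0]), -5.0))  # up to numerical tolerance
```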
10. Partial Derivative Notation
When the cost function depends on multiple variables (such as $W$ and $b$):
- Partial derivatives are used: $\frac{\partial J}{\partial W}$, $\frac{\partial J}{\partial b}$
In implementation:
- `dw`: gradient with respect to $W$
- `db`: gradient with respect to $b$
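This matches the variable names `dw` and `db` used in the training-loop sketch after section 7, where each holds the computed gradient just before the parameter update.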
