1. Objective of Training

In logistic regression, we define:

  • A loss function for a single training example
  • A cost function $J(W, b)$ for the entire training set

The cost function is defined as the average of the loss over all training examples:

$J(W, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$

The goal of training is:

Find parameters $W$ and $b$ that minimize the cost function $J(W, b)$
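
As a concrete illustration, here is a minimal NumPy sketch of the cost computation defined above, assuming the standard logistic-regression (cross-entropy) loss $L(\hat{y}, y) = -\big(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\big)$ and a sigmoid output; the function names and array shapes are illustrative, not taken from the source.

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) activation: maps any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(W, b, X, Y):
    """Average cross-entropy loss J(W, b) over m training examples.

    X: inputs of shape (n_features, m); Y: labels of shape (1, m).
    """
    m = X.shape[1]
    Y_hat = sigmoid(np.dot(W.T, X) + b)                          # predictions, shape (1, m)
    losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))  # per-example loss
    return np.sum(losses) / m                                    # J(W, b)
```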


2. Interpretation of the Cost Function

The cost function can be interpreted as a surface over the parameter space:

  • Horizontal axes: parameters $W$ and $b$
  • Vertical axis: value of $J(W, b)$

Each point $(W, b)$ corresponds to a specific value of the cost function.

The minimum point on this surface represents the optimal parameters.


3. Convexity of the Cost Function

For logistic regression, the cost function $J(W, b)$ is convex.

This means:

  • The surface has a single global minimum
  • There are no local minima other than the global minimum

This property is critical because it guarantees that gradient descent, with a suitable learning rate, converges toward the global minimum rather than getting stuck in a poor local solution.


4. Gradient Descent Algorithm

Gradient descent is an iterative optimization algorithm used to minimize the cost function.

The algorithm updates the parameters step by step in the direction that reduces the cost.


5. Parameter Update Rule

The update rule for gradient descent is:

$W := W - \alpha \frac{\partial J}{\partial W}$
$b := b - \alpha \frac{\partial J}{\partial b}$

Where:

  • $\alpha$: learning rate
  • $\frac{\partial J}{\partial W}$, $\frac{\partial J}{\partial b}$: partial derivatives (gradients)
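
In code, a single gradient-descent update might look like the following minimal sketch (NumPy-style; the variable names and the placeholder gradient values are assumptions, not from the source):

```python
import numpy as np

alpha = 0.01                       # learning rate

W = np.zeros((2, 1))               # current parameters (toy shapes)
b = 0.0

dW = np.array([[0.1], [-0.3]])     # placeholder for dJ/dW (normally computed from the data)
db = 0.05                          # placeholder for dJ/db

# One update step, mirroring W := W - alpha * dJ/dW and b := b - alpha * dJ/db
W = W - alpha * dW
b = b - alpha * db
```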

6. Meaning of the Gradient

The gradient represents the slope of the cost function at the current parameter values.

  • If the slope is positive → decrease the parameter
  • If the slope is negative → increase the parameter

This ensures that each update moves the parameters in the direction of steepest descent.
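
As a concrete (hypothetical) numeric example: if the current slope is $\frac{\partial J}{\partial W} = 2$ (positive) and $\alpha = 0.1$, the update gives $W := W - 0.1 \times 2$, lowering $W$ by $0.2$ and moving it toward the minimum; with a slope of $-2$, the same rule would raise $W$ by $0.2$.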


7. Iterative Process

The gradient descent algorithm follows these steps:

  1. Initialize parameters $W$ and $b$ (commonly to zero)
  2. Compute the gradient of the cost function
  3. Update parameters using the update rule
  4. Repeat until convergence

Over multiple iterations, the parameters gradually move toward the minimum of the cost function.
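
The loop below sketches these four steps for logistic regression, assuming the cross-entropy loss and the standard vectorized gradient formulas; names such as train and num_iterations, and the simplification of running a fixed number of iterations instead of checking convergence, are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, alpha=0.01, num_iterations=1000):
    """Gradient descent for logistic regression.

    X: shape (n_features, m); Y: shape (1, m) with labels in {0, 1}.
    """
    n, m = X.shape
    W = np.zeros((n, 1))                        # step 1: initialize parameters
    b = 0.0

    for _ in range(num_iterations):             # step 4: repeat (fixed iteration budget here)
        Y_hat = sigmoid(np.dot(W.T, X) + b)     # forward pass: predictions, shape (1, m)

        dW = np.dot(X, (Y_hat - Y).T) / m       # step 2: gradient of J with respect to W
        db = np.sum(Y_hat - Y) / m              # step 2: gradient of J with respect to b

        W = W - alpha * dW                      # step 3: parameter update
        b = b - alpha * db

    return W, b
```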


8. Role of Learning Rate (α)

The learning rate controls the size of each update step.

  • If $\alpha$ is too large → updates may overshoot the minimum
  • If $\alpha$ is too small → convergence becomes very slow

Choosing an appropriate learning rate is essential for efficient training.
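
A common practical approach (standard practice, not stated in the source) is to try a few learning rates spanning several orders of magnitude and keep the one whose cost decreases fastest without diverging. A minimal sketch on toy data, reusing the train and compute_cost helpers from the earlier snippets:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100))                          # toy inputs: 2 features, 100 examples
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)          # toy labels in {0, 1}

for alpha in [0.001, 0.01, 0.1, 1.0]:
    W, b = train(X, Y, alpha=alpha, num_iterations=1000)
    print(f"alpha={alpha}: final cost = {compute_cost(W, b, X, Y):.4f}")
```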


9. Behavior Under Convexity

Because the cost function in logistic regression is convex:

  • Gradient descent will converge to the same global minimum
  • The initialization point does not significantly affect the final result

This makes the optimization process stable and predictable.


10. Partial Derivative Notation

When the cost function depends on multiple variables (such as $W$ and $b$):

  • Partial derivatives are used: $\frac{\partial J}{\partial W}$, $\frac{\partial J}{\partial b}$

In implementation:

  • dW: gradient with respect to $W$
  • db: gradient with respect to $b$
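
For completeness, a small self-contained sketch of how dW and db are computed for logistic regression with the cross-entropy loss (toy values and shapes chosen to match the earlier snippets; this is an illustration, not code from the source):

```python
import numpy as np

# Toy data: 2 features, 4 examples.
X = np.array([[1.0, 2.0, -1.0, 0.5],
              [0.0, 1.0,  1.5, -2.0]])
Y = np.array([[1.0, 1.0, 0.0, 0.0]])          # true labels, shape (1, 4)
Y_hat = np.array([[0.8, 0.6, 0.3, 0.2]])      # predictions from the forward pass
m = X.shape[1]

dW = np.dot(X, (Y_hat - Y).T) / m             # gradient w.r.t. W, same shape as W: (2, 1)
db = np.sum(Y_hat - Y) / m                    # gradient w.r.t. b, a scalar
```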