1. Objective of Training

In logistic regression, we define:

  • A loss function for a single training example
  • A cost function $J(W, b)$ for the entire training set

The cost function is defined as the average of the loss over all training examples:

$J(W, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$

The goal of training is:

Find parameters $W$ and $b$ that minimize the cost function $J(W, b)$
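
As a concrete illustration, here is a minimal NumPy sketch of the cost computation defined above, assuming the standard logistic-regression (cross-entropy) loss $L(\hat{y}, y) = -\big(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\big)$ and a sigmoid output; the function names and array shapes are illustrative, not taken from the source.

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) activation: maps any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(W, b, X, Y):
    """Average cross-entropy loss J(W, b) over m training examples.

    X: inputs of shape (n_features, m); Y: labels of shape (1, m).
    """
    m = X.shape[1]
    Y_hat = sigmoid(np.dot(W.T, X) + b)                          # predictions, shape (1, m)
    losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))  # per-example loss
    return np.sum(losses) / m                                    # J(W, b)
```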


2. Interpretation of the Cost Function

The cost function can be interpreted as a surface over the parameter space:

  • Horizontal axes: parameters $W$ and $b$
  • Vertical axis: value of $J(W, b)$

Each point $(W, b)$ corresponds to a specific value of the cost function.

The minimum point on this surface represents the optimal parameters.


3. Convexity of the Cost Function

For logistic regression, the cost function $J(W, b)$ is convex.

This means:

  • The surface has a single global minimum
  • There are no local minima other than the global minimum

This property is critical because it guarantees that gradient descent, with a suitable learning rate, converges toward the global minimum rather than getting stuck in a poor local solution.


4. Gradient Descent Algorithm

Gradient descent is an iterative optimization algorithm used to minimize the cost function.

The algorithm updates the parameters step by step in the direction that reduces the cost.


5. Parameter Update Rule

The update rule for gradient descent is:

$W := W - \alpha \frac{\partial J}{\partial W}$
$b := b - \alpha \frac{\partial J}{\partial b}$

Where:

  • $\alpha$: learning rate
  • $\frac{\partial J}{\partial W}$, $\frac{\partial J}{\partial b}$: partial derivatives (gradients)
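
In code, a single gradient-descent update might look like the following minimal sketch (NumPy-style; the variable names and the placeholder gradient values are assumptions, not from the source):

```python
import numpy as np

alpha = 0.01                       # learning rate

W = np.zeros((2, 1))               # current parameters (toy shapes)
b = 0.0

dW = np.array([[0.1], [-0.3]])     # placeholder for dJ/dW (normally computed from the data)
db = 0.05                          # placeholder for dJ/db

# One update step, mirroring W := W - alpha * dJ/dW and b := b - alpha * dJ/db
W = W - alpha * dW
b = b - alpha * db
```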

6. Meaning of the Gradient

The gradient represents the slope of the cost function at the current parameter values.

  • If the slope is positive → decrease the parameter
  • If the slope is negative → increase the parameter

This ensures that each update moves the parameters in the direction of steepest descent.
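
As a concrete (hypothetical) numeric example: if the current slope is $\frac{\partial J}{\partial W} = 2$ (positive) and $\alpha = 0.1$, the update gives $W := W - 0.1 \times 2$, lowering $W$ by $0.2$ and moving it toward the minimum; with a slope of $-2$, the same rule would raise $W$ by $0.2$.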


7. Iterative Process

The gradient descent algorithm follows these steps:

  1. Initialize parameters $W$ and $b$ (commonly to zero)
  2. Compute the gradient of the cost function
  3. Update parameters using the update rule
  4. Repeat until convergence

Over multiple iterations, the parameters gradually move toward the minimum of the cost function.
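
The loop below sketches these four steps for logistic regression, assuming the cross-entropy loss and the standard vectorized gradient formulas; names such as train and num_iterations, and the simplification of running a fixed number of iterations instead of checking convergence, are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, alpha=0.01, num_iterations=1000):
    """Gradient descent for logistic regression.

    X: shape (n_features, m); Y: shape (1, m) with labels in {0, 1}.
    """
    n, m = X.shape
    W = np.zeros((n, 1))                        # step 1: initialize parameters
    b = 0.0

    for _ in range(num_iterations):             # step 4: repeat (fixed iteration budget here)
        Y_hat = sigmoid(np.dot(W.T, X) + b)     # forward pass: predictions, shape (1, m)

        dW = np.dot(X, (Y_hat - Y).T) / m       # step 2: gradient of J with respect to W
        db = np.sum(Y_hat - Y) / m              # step 2: gradient of J with respect to b

        W = W - alpha * dW                      # step 3: parameter update
        b = b - alpha * db

    return W, b
```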


8. Role of Learning Rate (α)

The learning rate controls the size of each update step.

  • If $\alpha$ is too large → updates may overshoot the minimum
  • If $\alpha$ is too small → convergence becomes very slow

Choosing an appropriate learning rate is essential for efficient training.
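
A common practical approach (standard practice, not stated in the source) is to try a few learning rates spanning several orders of magnitude and keep the one whose cost decreases fastest without diverging. A minimal sketch on toy data, reusing the train and compute_cost helpers from the earlier snippets:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100))                          # toy inputs: 2 features, 100 examples
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)          # toy labels in {0, 1}

for alpha in [0.001, 0.01, 0.1, 1.0]:
    W, b = train(X, Y, alpha=alpha, num_iterations=1000)
    print(f"alpha={alpha}: final cost = {compute_cost(W, b, X, Y):.4f}")
```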


9. Behavior Under Convexity

Because the cost function in logistic regression is convex:

  • Gradient descent will converge to the same global minimum
  • The initialization point does not significantly affect the final result

This makes the optimization process stable and predictable.


10. Partial Derivative Notation

When the cost function depends on multiple variables (such as $W$ and $b$):

  • Partial derivatives are used: $\frac{\partial J}{\partial W}$, $\frac{\partial J}{\partial b}$

In implementation:

  • dW: gradient with respect to $W$
  • db: gradient with respect to $b$
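
For completeness, a small self-contained sketch of how dW and db are computed for logistic regression with the cross-entropy loss (toy values and shapes chosen to match the earlier snippets; this is an illustration, not code from the source):

```python
import numpy as np

# Toy data: 2 features, 4 examples.
X = np.array([[1.0, 2.0, -1.0, 0.5],
              [0.0, 1.0,  1.5, -2.0]])
Y = np.array([[1.0, 1.0, 0.0, 0.0]])          # true labels, shape (1, 4)
Y_hat = np.array([[0.8, 0.6, 0.3, 0.2]])      # predictions from the forward pass
m = X.shape[1]

dW = np.dot(X, (Y_hat - Y).T) / m             # gradient w.r.t. W, same shape as W: (2, 1)
db = np.sum(Y_hat - Y) / m                    # gradient w.r.t. b, a scalar
```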