Modern AI Fundamentals

4.2 Evaluating and Tuning Models

Training a model is only the first step in the machine learning workflow; evaluating how well it performs on unseen data and tuning it for better results are just as important.

In this section, we introduce common metrics (accuracy, precision, recall, F1 score), explain overfitting vs. underfitting, and show you basic hyperparameter tuning techniques so you can refine your models systematically.

1. Accuracy, Precision, Recall, F1 Score

When you’re dealing with classification problems (like detecting spam vs. not spam), numerical metrics show how well your model is doing.

Below are four common ones, presented conceptually:

  1. Accuracy

    • The fraction of predictions your model gets right overall.

    • When to Use: It’s a quick measure for balanced datasets where classes appear roughly in equal proportions (e.g., 50% spam, 50% not spam).

    • Pitfalls: If you have a dataset with 95% “not spam” and 5% “spam,” a naive model that always predicts “not spam” would be 95% accurate—yet entirely miss the spam class.

  2. Precision

    • Out of all the items labeled “positive” (spam), how many are actually correct?

    • Interpretation: High precision means when the model says “this is spam,” it’s usually right.

    • Use Case: Important in scenarios where false positives are costly (e.g., you don’t want to flag legitimate emails as spam).

  3. Recall

    • Out of all the items that are truly “positive” (spam), how many did the model correctly identify?

    • Interpretation: High recall means it’s catching most of the spam, but it might label some non-spam as spam in the process.

    • Use Case: Important if missing positives is costly (e.g., you don’t want to let harmful emails slip through).

  4. F1 Score

    • The harmonic mean of precision and recall; a balance between the two.

    • Interpretation: Helps you find a middle ground if you don’t want to overly sacrifice one metric for the other.

    • Use Case: A great all-around measure for imbalanced datasets.

Example in Code (using scikit-learn terminology):

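Below is a minimal sketch using scikit-learn's metric functions; the y_true and y_pred lists are made-up spam (1) vs. not-spam (0) labels, chosen only for illustration.

Python3

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels: 1 = spam, 0 = not spam
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))   # fraction of all predictions that are correct
print("Precision:", precision_score(y_true, y_pred))  # of everything predicted as spam, how much truly is
print("Recall:   ", recall_score(y_true, y_pred))     # of all true spam, how much was caught
print("F1 score: ", f1_score(y_true, y_pred))         # harmonic mean of precision and recall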

2. Overfitting vs. Underfitting & Cross-Validation

Overfitting and underfitting reflect two common pitfalls:

  • Overfitting: Your model performs extremely well on the training data but fails to generalize to unseen data. It has effectively “memorized” training examples rather than learning general patterns.

  • Underfitting: Your model hasn’t captured the underlying trends enough, resulting in poor performance on both training and test sets.

How to Spot Them

  • If your training accuracy is much higher than test accuracy, that’s a red flag for overfitting.

  • If both training and test accuracies are equally low, your model is likely underfitting (too simplistic or not well-tuned).

Cross-Validation Basics

  • A technique to better evaluate model stability by splitting your data into multiple folds (e.g., 5 parts). The model is trained on 4 folds and validated on the remaining 1, repeated for each fold.

  • Why It Helps:

    • You get multiple performance estimates instead of just one.

    • Reduces the variance in your performance measure and provides a more robust sense of how the model might fare on real-world data.

Example:

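A minimal sketch of 5-fold cross-validation using scikit-learn's cross_val_score; the Iris dataset and decision tree classifier are stand-in choices used only to make the snippet runnable.

Python3

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=42)

# 5-fold cross-validation: train on 4 folds, validate on the held-out fold, repeat 5 times
scores = cross_val_score(model, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:  ", scores.mean())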

3. Hyperparameter Tuning

Most models have hyperparameters—settings you choose before training (e.g., the maximum depth of a decision tree, the number of neighbors in k-NN, learning rate for neural nets).

Finding the best combination of these can significantly improve your model’s performance.

Simple Approaches

  1. Grid Search

    • How It Works:

      1. You define a grid of possible values for hyperparameters. For example: {'max_depth': [3, 5, 7], 'min_samples_split': [2, 5, 10]}

      2. The algorithm tries every possible combination using cross-validation and reports which set yields the best average performance.

    • Pros: Systematic, guarantees you’ll find the best combination out of the grid.

    • Cons: Can be slow if the grid is large.

Example:

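A minimal sketch using scikit-learn's GridSearchCV with the grid from the example above; the decision tree and Iris data are placeholder choices.

Python3

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Grid of candidate hyperparameter values
param_grid = {'max_depth': [3, 5, 7], 'min_samples_split': [2, 5, 10]}

# Evaluate every combination (3 x 3 = 9 here) with 5-fold cross-validation
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print("Best CV score:  ", grid.best_score_)
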
  2. Random Search

    • How It Works: Picks random combinations of hyperparameters from given ranges for a fixed number of iterations.

    • Pros: Faster and often nearly as good as an exhaustive grid search, especially if your parameter space is large.

    • Cons: Results can vary slightly each run, but generally it’s more efficient.

Example:

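A minimal sketch using scikit-learn's RandomizedSearchCV; the parameter ranges, the number of iterations, the decision tree, and the Iris data are all illustrative assumptions.

Python3

from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Ranges to sample from instead of an exhaustive grid
param_distributions = {
    'max_depth': [3, 4, 5, 6, 7, 8, 9],
    'min_samples_split': [2, 3, 5, 7, 10],
}

# Try 10 random combinations, each scored with 5-fold cross-validation
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions,
    n_iter=10,
    cv=5,
    random_state=42,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV score:  ", search.best_score_)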
