Alexandre Levret demonstrates how developers can fine-tune GPT-4o using Azure AI Foundry to enhance image classification accuracy, providing a hands-on comparison against conventional CNNs and exploring practical trade-offs in performance, cost, and development workflow.

Fine-Tuning GPT-4o for Image Classification on Azure AI Foundry

Author: Alexandre Levret

This guide shows how to apply the latest Vision-Language Models (VLM) on Azure for image classification, even if you don’t have deep learning expertise. By fine-tuning GPT-4o on Azure OpenAI, you can boost accuracy for tasks like dog breed identification, leveraging modern cloud-based AI.

Overview

Dataset: Stanford Dogs (120 breeds, thousands of images)
Goal: Compare out-of-the-box GPT-4o (zero-shot), fine-tuned GPT-4o, and a traditional lightweight CNN
Workflow:
1. Data preparation
2. Batch inference with Azure OpenAI
3. Model fine-tuning with Vision Fine-Tuning API
4. Evaluation of results
5. Cost, latency, and accuracy analysis

All code, scripts, and templates are available in the GitHub repository.

1. What is Image Classification?

Image classification is a core computer vision task—grouping images into categories, like identifying dog breeds. Traditionally, this used Convolutional Neural Networks (CNNs), but since the advent of Large Language Models (LLMs) with vision capabilities, Vision-Language Models can now deliver powerful classification and reasoning with text and images.

Anyone can access these models via apps or APIs. For example, upload an image and ask, “What is the dog’s breed in the picture?” The model understands both the image and the prompt.

2. Deploying a Vision-Language Model on Azure

With Azure AI Foundry, you have access to many models, including the latest GPT-4o that supports both batch inference and vision fine-tuning. This enables:

Batch API: Efficient, lower-cost large-scale inference
Vision Fine-Tuning API: Task-specific adaptation of base models

Batch Inference Example

You format requests in JSONL:

{"model": "gpt-4o-batch", "messages": [ {"role": "system", "content":"Classify the following input image into one of the following categories: [Affenpinscher, Afghan Hound, ... , Yorkshire Terrier]." }, {"role": "user", "content": [ {"type": "image_url", "image_url": {"url": "b64", "detail": "low"}} ]} ]}

Specify model version and task
Send base64-encoded images
Omit the correct label to simulate inference

Batch API saves 50% compared to synchronous inference, with longer latency (up to 24 hours, not guaranteed).

3. Fine-Tuning with Vision Fine-Tuning API

Fine-tuning adapts the pre-trained model for your specific task. Azure OpenAI supports various methods, including Supervised Fine-Tuning (SFT).

You provide new labeled data in JSONL format (image-text pairs)
Configure hyperparameters (batch size, learning rate, epochs, etc.)
Example training entry:

{"messages": [ {"role": "system", "content": "Classify the following input image into one of the following categories: [Affenpinscher, Afghan Hound, ... , Yorkshire Terrier]."}, {"role": "user", "content": [{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<encoded_image>", "detail": "low"}}]}, {"role": "assistant", "content": "Springer Spaniel"} ]}

Selected hyperparameters used in the fine-tuning run:
- Batch size: 6
- Learning rate: 0.5
- Epochs: 2
- Seed: 42
More details at Microsoft Learn

The process returns log files and visualizations of training progress (loss per step, etc.) in the Azure AI Foundry portal.

4. Baseline: Training a Classic CNN

Alongside the VLM approach, the guide also implements a lightweight CNN, trained on the same subset. The CNN achieves a mean accuracy of 61.67%—a useful reference, though it’s not state-of-the-art and requires more setup and maintenance.

Mean accuracy: 61.67%
Training time: Under 30 minutes
Deployment: Local or via Azure Machine Learning

5. Results: Accuracy, Latency, and Cost

Aspect	Base GPT-4o (Zero-Shot)	Fine-Tuned GPT-4o	CNN Baseline
Mean accuracy	73.67%	82.67% (+9pp vs base)	61.67% (-12pp vs base)
Mean latency	1665ms	1506ms (-9.6%)	~tens of ms
Cost	Inference only ($)	Training+Hosting+Inf ($$)	Local or AML ($)

Key observations:

Fine-tuned GPT-4o achieves highest accuracy and slightly lower latency relative to zero-shot
Zero-shot is quickest to production, but less accurate
CNN is cheapest in inference, but much less accurate and requires engineering work

6. Next Steps

Explore other datasets and tasks (OCR, multi-modal prompts)
Experiment with different parameters
Integrate the fine-tuned model into your pipeline
Inspect and modify the codebase at the GitHub repository

Take advantage of Azure AI Foundry Models for model selection and secure enterprise-grade deployments.

References & Further Reading

This post appeared first on “Microsoft AI Foundry Blog”. Read the entire article here