2026
VLM Training Pipeline
Vision-Language Fine-Tuning
VLM · QLoRA · Modal · Dataset Engineering · Fine-Tuning · Agent-Built
Overview
Built an end-to-end fine-tuning pipeline for IDEFICS2-8B using QLoRA on 107k+ structured financial document examples. The entire pipeline—dataset curation, training infrastructure, evaluation harness, and inference deployment—was architected and implemented using Claude Code and Codex as coding agents. This project demonstrates how agent-driven development can deliver production ML systems when guided by clear contracts and domain expertise.
Challenges
- 107k documents with varied layouts requiring consistent structured extraction
- GPU memory constraints for 8B parameter vision-language model
- Need for reproducible training runs with checkpoint recovery
- Coordinating Claude Code and Codex agents across a complex ML codebase
Approach
- Wrote detailed agent contracts defining hard laws, directory structure, and Modal standards
- Used Codex for pipeline scaffolding and Claude Code for debugging and refinement
- Implemented QLoRA with 4-bit quantization for memory-efficient training (sketched after this list)
- Built streaming data pipeline with Modal persistent volumes and checkpoint recovery
- Created evaluation harness with held-out test set validation
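Concretely, the QLoRA and streaming pieces of this approach look roughly like the sketch below. The LoRA rank, alpha, target modules, and the `org/financial-docs` dataset id are illustrative assumptions, not the project's actual values.

```python
"""Sketch: QLoRA setup plus a streaming dataset (hyperparameters assumed)."""
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

MODEL_NAME = "HuggingFaceM4/idefics2-8b"

# 4-bit NF4 quantization so the 8B base model fits a single-GPU memory budget.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_NAME, quantization_config=quant_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters on the attention projections; rank/alpha are typical defaults.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Stream examples so the 107k-document corpus never has to fit on local disk.
train_stream = load_dataset("org/financial-docs", split="train", streaming=True)
```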
Codex Agent Contract
The actual prompt contract used to direct Codex in building the training pipeline. Defines hard laws, directory structure, Modal standards, and evaluation requirements.
```markdown
# VLM Model Foundry – Codex Agent Contract
You are operating as a **terminal-first MLOps engineer** inside our private model foundry.
Your primary objective:
→ Build reproducible training pipelines that run exclusively on Modal GPUs
→ Stream HuggingFace datasets
→ Produce LoRA adapters into persistent Modal volumes
→ Maintain model lineage and evaluation rigor
## Hard Laws
• All training must be Modal-based (no notebooks, no Spaces)
• All configs must be env-driven
• All artifacts must be versioned and reproducible
• All runs must produce lineage metadata
• Fail fast on schema / data / GPU mismatch
• No silent behavior changes
## Directory Intent
| Folder | Purpose |
|--------|---------|
| modal_* | GPU job launchers |
| runs/ | Run manifests and metrics |
| adapters/ | Exported LoRA versions |
| evals/ | Evaluation harness |
| recipes/ | Training recipes |
## Training Standard
• Use QLoRA (4-bit) with PEFT
• Use HF streaming datasets
• Always implement a local --dry_run mode
• Save run_config.json with all env vars
• Always commit Modal volumes after training
You are not a chatbot.
You are a production ML engineer responsible for this model foundry.
```
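The contract's reproducibility rules are cheap to satisfy in code. A minimal sketch of the `run_config.json` manifest and `--dry_run` requirements, assuming an `FG_` env-var prefix and a `runs/` layout (both illustrative, not the foundry's actual conventions):

```python
"""Sketch: run manifest + dry-run gate per the contract (names assumed)."""
import argparse
import json
import os
from datetime import datetime, timezone
from pathlib import Path


def write_run_manifest(run_dir: Path) -> Path:
    """Persist all FG_*-prefixed env vars plus a UTC timestamp for lineage."""
    run_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "env": {k: v for k, v in os.environ.items() if k.startswith("FG_")},
    }
    path = run_dir / "run_config.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--dry_run",
        action="store_true",
        help="Validate config locally without launching a Modal GPU job",
    )
    args = parser.parse_args()
    manifest_path = write_run_manifest(Path("runs") / "local")
    if args.dry_run:
        print(f"dry run: wrote {manifest_path}, skipping Modal launch")
```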
Modal Inference with QLoRA Adapters
```python
"""Modal inference entrypoint for IDEFICS2 LoRA checkpoints."""
import modal
from pathlib import Path

GPU_REQUEST = "A10G"
APP = modal.App("fg-vlm-idefics2-infer")
ADAPTER_VOLUME = modal.Volume.from_name("fg-vlm-adapters", create_if_missing=True)


def _load_base_model(model_name: str):
    import torch
    from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
    from peft import prepare_model_for_kbit_training

    # 4-bit NF4 quantization keeps the 8B model within a single A10G's memory.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForVision2Seq.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto",
        trust_remote_code=True,
    )
    model = prepare_model_for_kbit_training(model)
    model.eval()
    return model


@APP.function(
    gpu=GPU_REQUEST,
    timeout=60 * 30,
    volumes={"/vol/adapters": ADAPTER_VOLUME},
)
def run_inference(
    adapter_run: str,
    prompt: str,
    image_url: str,
    checkpoint: str | None = None,
    model_name: str = "HuggingFaceM4/idefics2-8b",
):
    """Load a LoRA checkpoint from the volume and run single-shot inference."""
    import requests
    from PIL import Image
    from peft import PeftModel
    from transformers import AutoProcessor

    adapter_root = Path("/vol/adapters") / "adapters" / "idefics2" / adapter_run
    target_dir = adapter_root / checkpoint if checkpoint else adapter_root

    model = _load_base_model(model_name)
    model = PeftModel.from_pretrained(model, target_dir, is_trainable=False)
    model.eval()

    processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
    image = Image.open(requests.get(image_url, stream=True).raw)

    # IDEFICS2 expects an <image> placeholder token inside the text prompt.
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    generated = model.generate(**inputs, max_new_tokens=96)
    return processor.batch_decode(generated, skip_special_tokens=True)[0]
```
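Assuming the file is saved as `modal_infer.py` (the name is illustrative), a checkpoint can then be exercised straight from the terminal with something like `modal run modal_infer.py::run_inference --adapter-run 2026-01-rc1 --prompt "<image> Extract the invoice total." --image-url https://example.com/doc.png`, relying on Modal's mapping of function parameters to CLI flags.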
Tech Stack
Python · Modal Labs · Hugging Face · QLoRA · Weights & Biases · IDEFICS2 · Claude Code · Codex
Outcomes
- Trained production-ready VLM on 107k financial documents
- Achieved 94% extraction accuracy on novel document types
- Reduced training cost 60% via QLoRA optimization
- Demonstrated agent-driven ML pipeline development end-to-end
