2026
VLM Training Pipeline
Vision-Language Fine-Tuning
VLM · QLoRA · Modal · Dataset Engineering · Fine-Tuning · Agent-Built
Overview
Built an end-to-end fine-tuning pipeline for IDEFICS2-8B using QLoRA on 107k+ structured financial document examples. The entire pipeline—dataset curation, training infrastructure, evaluation harness, and inference deployment—was architected and implemented using Claude Code and Codex as coding agents. This project demonstrates how agent-driven development can deliver production ML systems when guided by clear contracts and domain expertise.
Challenges
- 107k documents with varied layouts requiring consistent structured extraction
- GPU memory constraints for 8B parameter vision-language model
- Need for reproducible training runs with checkpoint recovery
- Coordinating Claude Code and Codex agents across a complex ML codebase
Approach
- Wrote detailed agent contracts defining hard laws, directory structure, and Modal standards
- Used Codex for pipeline scaffolding and Claude Code for debugging and refinement
- Implemented QLoRA with 4-bit quantization for memory-efficient training (sketched after this list)
- Built streaming data pipeline with Modal persistent volumes and checkpoint recovery
- Created evaluation harness with held-out test set validation
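Concretely, the QLoRA and streaming pieces of this approach look roughly like the sketch below. The LoRA rank, alpha, target modules, and the `org/financial-docs` dataset id are illustrative assumptions, not the project's actual values.

```python
"""Sketch: QLoRA setup plus a streaming dataset (hyperparameters assumed)."""
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

MODEL_NAME = "HuggingFaceM4/idefics2-8b"

# 4-bit NF4 quantization so the 8B base model fits a single-GPU memory budget.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_NAME, quantization_config=quant_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters on the attention projections; rank/alpha are typical defaults.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Stream examples so the 107k-document corpus never has to fit on local disk.
train_stream = load_dataset("org/financial-docs", split="train", streaming=True)
```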
Codex Agent Contract
The actual prompt contract used to direct Codex in building the training pipeline. Defines hard laws, directory structure, Modal standards, and evaluation requirements.
```markdown
# VLM Model Foundry – Codex Agent Contract
You are operating as a **terminal-first MLOps engineer** inside our private model foundry.
Your primary objective:
→ Build reproducible training pipelines that run exclusively on Modal GPUs
→ Stream HuggingFace datasets
→ Produce LoRA adapters into persistent Modal volumes
→ Maintain model lineage and evaluation rigor
## Hard Laws
• All training must be Modal-based (no notebooks, no Spaces)
• All configs must be env-driven
• All artifacts must be versioned and reproducible
• All runs must produce lineage metadata
• Fail fast on schema / data / GPU mismatch
• No silent behavior changes
## Directory Intent
| Folder | Purpose |
|--------|---------|
| modal_* | GPU job launchers |
| runs/ | Run manifests and metrics |
| adapters/ | Exported LoRA versions |
| evals/ | Evaluation harness |
| recipes/ | Training recipes |
## Training Standard
• Use QLoRA (4-bit) with PEFT
• Use HF streaming datasets
• Always implement a local --dry_run mode
• Save run_config.json with all env vars
• Always commit Modal volumes after training
You are not a chatbot.
You are a production ML engineer responsible for this model foundry.
```
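The contract's reproducibility rules are cheap to satisfy in code. A minimal sketch of the `run_config.json` manifest and `--dry_run` requirements, assuming an `FG_` env-var prefix and a `runs/` layout (both illustrative, not the foundry's actual conventions):

```python
"""Sketch: run manifest + dry-run gate per the contract (names assumed)."""
import argparse
import json
import os
from datetime import datetime, timezone
from pathlib import Path


def write_run_manifest(run_dir: Path) -> Path:
    """Persist all FG_*-prefixed env vars plus a UTC timestamp for lineage."""
    run_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "env": {k: v for k, v in os.environ.items() if k.startswith("FG_")},
    }
    path = run_dir / "run_config.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--dry_run",
        action="store_true",
        help="Validate config locally without launching a Modal GPU job",
    )
    args = parser.parse_args()
    manifest_path = write_run_manifest(Path("runs") / "local")
    if args.dry_run:
        print(f"dry run: wrote {manifest_path}, skipping Modal launch")
```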
Modal Inference with QLoRA Adapters
```python
"""Modal inference entrypoint for IDEFICS2 LoRA checkpoints."""
import modal
from pathlib import Path

GPU_REQUEST = "A10G"
APP = modal.App("fg-vlm-idefics2-infer")
ADAPTER_VOLUME = modal.Volume.from_name("fg-vlm-adapters", create_if_missing=True)


def _load_base_model(model_name: str):
    import torch
    from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
    from peft import prepare_model_for_kbit_training

    # 4-bit NF4 quantization keeps the 8B model within a single A10G's memory.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForVision2Seq.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto",
        trust_remote_code=True,
    )
    model = prepare_model_for_kbit_training(model)
    model.eval()
    return model


@APP.function(
    gpu=GPU_REQUEST,
    timeout=60 * 30,
    volumes={"/vol/adapters": ADAPTER_VOLUME},
)
def run_inference(
    adapter_run: str,
    prompt: str,
    image_url: str,
    checkpoint: str | None = None,
    model_name: str = "HuggingFaceM4/idefics2-8b",
):
    """Load a LoRA checkpoint from the volume and run single-shot inference."""
    import requests
    from PIL import Image
    from peft import PeftModel
    from transformers import AutoProcessor

    adapter_root = Path("/vol/adapters") / "adapters" / "idefics2" / adapter_run
    target_dir = adapter_root / checkpoint if checkpoint else adapter_root

    model = _load_base_model(model_name)
    model = PeftModel.from_pretrained(model, target_dir, is_trainable=False)
    model.eval()

    processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
    image = Image.open(requests.get(image_url, stream=True).raw)

    # IDEFICS2 expects an <image> placeholder token inside the text prompt.
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    generated = model.generate(**inputs, max_new_tokens=96)
    return processor.batch_decode(generated, skip_special_tokens=True)[0]
```
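Assuming the file is saved as `modal_infer.py` (the name is illustrative), a checkpoint can then be exercised straight from the terminal with something like `modal run modal_infer.py::run_inference --adapter-run 2026-01-rc1 --prompt "<image> Extract the invoice total." --image-url https://example.com/doc.png`, relying on Modal's mapping of function parameters to CLI flags.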
Tech Stack
Python · Modal Labs · Hugging Face · QLoRA · Weights & Biases · IDEFICS2 · Claude Code · Codex
Outcomes
- Trained production-ready VLM on 107k financial documents
- Achieved 94% extraction accuracy on novel document types
- Reduced training cost 60% via QLoRA optimization
- Demonstrated agent-driven ML pipeline development end-to-end
