Aaron Diemel
Back to work
2025

XTRAP

Autonomous Red Team System

Red TeamingAdversarial AIMulti-AgentSafety Evaluation

Overview

An autonomous adversarial evaluation system that generates, refines, and memorizes successful attack patterns against language models. Uses multi-model attacker/defender loops with recursive refinement to continuously improve attack effectiveness.

Challenges

  • Manual red teaming doesn't scale across model updates
  • Attack patterns become stale as models are patched
  • Need systematic coverage of vulnerability categories
  • Evaluation requires consistent scoring methodology

Approach

  • Built attacker agent that generates novel jailbreak attempts
  • Implemented defender evaluation with constitutional scoring
  • Created attack memory bank with embedding-based retrieval
  • Designed recursive refinement loop for failed attempts

Multi-Model Attack Loop

python
class AdversarialLoop:
    def __init__(self, attacker_model, defender_model):
        self.attacker = attacker_model
        self.defender = defender_model
        self.memory = AttackMemoryBank()
    
    async def run_attack_cycle(self, target_behavior: str):
        # Retrieve similar successful attacks
        similar_attacks = self.memory.retrieve(target_behavior, k=5)
        
        # Generate new attack variants
        attack = await self.attacker.generate(
            target=target_behavior,
            examples=similar_attacks,
            strategy="recursive_refinement"
        )
        
        # Evaluate against defender
        response = await self.defender.respond(attack.prompt)
        score = self.evaluate_success(response, target_behavior)
        
        if score > 0.8:
            self.memory.store(attack, score)
            return {"success": True, "attack": attack}
        
        # Recursive refinement on failure
        return await self.refine_attack(attack, response)

Tech Stack

PythonOpenAIAnthropicLangGraphVector DBStructured Outputs

Outcomes

  • Discovered 58+ verified jailbreaks in Gray Swan Challenge
  • Achieved top ~1% ranking in adversarial evaluation
  • Patterns contributed to model safety improvements