Aaron Diemel
2024–Present

AI Safety Research

Adversarial Testing & Evaluation

AI Safety · Red Teaming · Jailbreak Analysis · Model Evaluation

Overview

Ongoing research in adversarial testing and model safety evaluation. Contributed to pre-release testing for major AI labs; findings have been cited in official documentation and have led to production patches.

Challenges

  • Identifying novel attack vectors before public deployment
  • Operating under formal disclosure protocols
  • Systematic evaluation across model capabilities
  • Documenting findings for reproducibility

Approach

  • Developed systematic jailbreak taxonomy and testing methodology
  • Built automated evaluation harnesses for consistent scoring
  • Maintained responsible disclosure timelines with labs
  • Created detailed technical writeups for each finding
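The evaluation-harness approach above can be sketched in miniature. Everything here is a hypothetical illustration, not the actual tooling: the model call is stubbed out, and the keyword-based refusal check stands in for the grader models a real harness would use. The structure shown is probes organized by taxonomy category, scored consistently, and aggregated into per-category rates.

```python
# Minimal sketch of an automated jailbreak-evaluation harness.
# Hypothetical names throughout; the model call is a local stub.
from dataclasses import dataclass


@dataclass
class Probe:
    """One adversarial prompt plus the taxonomy category it tests."""
    category: str  # e.g. "direct", "role-play", "encoding"
    prompt: str


def stub_model(prompt: str) -> str:
    """Stand-in for a real provider API call (OpenAI/Anthropic/xAI)."""
    return "I can't help with that." if "UNSAFE" in prompt else "Sure, here you go."


def is_refusal(response: str) -> bool:
    """Naive keyword scorer; a real harness would use a grader model."""
    return any(k in response.lower() for k in ("can't", "cannot", "won't"))


def run_suite(probes: list[Probe]) -> dict[str, float]:
    """Score every probe and return the refusal rate per category."""
    scored = [(p, is_refusal(stub_model(p.prompt))) for p in probes]
    rates: dict[str, float] = {}
    for cat in {p.category for p in probes}:
        subset = [refused for p, refused in scored if p.category == cat]
        rates[cat] = sum(subset) / len(subset)
    return rates


probes = [
    Probe("direct", "UNSAFE request A"),
    Probe("direct", "benign request"),
    Probe("role-play", "UNSAFE request B"),
]
print(run_suite(probes))  # refusal rate keyed by taxonomy category
```

Keeping probes tagged by category is what makes scoring reproducible: reruns against a patched model produce directly comparable per-category rates rather than anecdotal single transcripts.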

Tech Stack

Python · OpenAI API · Anthropic API · xAI API · Custom Evaluation Tools

Outcomes

  • Top ~1% in Gray Swan AI Jailbreak Challenge (~58 verified)
  • Work cited in OpenAI o3-mini system card
  • Selected for Anthropic constitutional classifier testing
  • Grok 3 vulnerabilities patched by xAI following responsible disclosure