AI Evaluation Case Studies
A documented record of AI model errors — covering safety risks, reasoning failures, factual hallucinations, and visual misidentifications.
AI Safety Case: Unsafe JavaScript Generated for Browser Execution
Model: Grok | Date: October 2025
While troubleshooting list visibility on X (Twitter), the model generated executable JavaScript involving internal APIs and session data.
WHAT HAPPENED
- Model provided JavaScript intended for direct browser execution
- Code called internal X API endpoints using authenticated session credentials
- Accessed CSRF token data from cookies
- No warning was given about the security context
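The patterns listed above can be sketched as a simple pre-display check. This is a minimal illustration only: the pattern list, the function names, and the example snippet are all hypothetical, and the snippet is not the code Grok produced. A real safeguard would need far more than regular expressions.

```python
import re

# Hypothetical pattern list for illustration; not a complete security scanner.
RISKY_PATTERNS = {
    "cookie_access": re.compile(r"document\.cookie"),
    "csrf_token": re.compile(r"csrf", re.IGNORECASE),
    "credentialed_request": re.compile(r"credentials\s*:\s*['\"]include['\"]"),
}


def flag_risky_javascript(snippet: str) -> list[str]:
    """Return the names of the risky patterns found in a JavaScript snippet."""
    return [name for name, pattern in RISKY_PATTERNS.items() if pattern.search(snippet)]


def needs_security_warning(snippet: str) -> bool:
    """True if the snippet touches cookies, CSRF tokens, or credentialed requests."""
    return bool(flag_risky_javascript(snippet))


# Made-up snippet with the same risk profile as the case above (assumption).
example = """
const token = document.cookie.split("csrf=")[1];
fetch("https://example.invalid/internal/lists", {
  credentials: "include",
  headers: { "x-csrf-token": token },
});
"""

print(flag_risky_javascript(example))
```

A check like this would not make the code safe; it only marks output that should carry a warning about running it in a logged-in session.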
RESPONSE & OUTCOME
- Recognized the instruction as unsafe; avoided executing in a logged-in session
- Used an incognito window for controlled testing instead
- Model acknowledged the mistake, apologized, and updated its guidance
- Confirmed the unsafe output was generated across parallel sessions
KEY INSIGHT
The model generated executable code without accounting for user security context — demonstrating a failure to enforce safeguards around authenticated environments.
IMPACT
A user following the original instructions could have exposed sensitive session data or interacted with private APIs in an unsafe manner.
EVIDENCE
- Screenshot: Code provided by model
- Screenshot: User challenge
- Screenshot: Model apology
- Screenshot: Final corrected stance
AI Evaluation Case: Model Resistance & Reasoning Correction
Model: Grok | Date: October 2025
During a team comparison (PSG vs. Barcelona vs. Chelsea), the model relied on simplified outcome-based reasoning. I challenged this using a "play-to-the-level" framework and cross-domain analogy.
WHAT HAPPENED
- Model prioritized isolated results (e.g., tournament outcomes) over competitive context
- Underweighted opponent quality and strength of schedule
- Produced a ranking that didn't reflect performance at the elite level
- Model resisted correction, reframing arguments to defend its original stance
RESPONSE & OUTCOME
- Introduced a "play-to-the-level" framework comparing opponent quality across competitions
- Used an analogy to shift reasoning: elite teams, like elite athletes such as Roger Federer, must be evaluated relative to the level of their competition, not just by outcomes
- After sustained challenge, the model acknowledged gaps in its logic and revised its ranking
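One way to read the "play-to-the-level" idea is as an opponent-weighted score. The sketch below is an assumption about how such a weighting could work, not the exact framework used in the conversation. The teams, points, and opponent ratings are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Match:
    points: int             # 3 for a win, 1 for a draw, 0 for a loss
    opponent_rating: float  # hypothetical opponent strength between 0 and 1


def raw_points_per_match(matches: list[Match]) -> float:
    """Outcome-only view: average points, ignoring who the opponent was."""
    return sum(m.points for m in matches) / len(matches)


def level_adjusted_score(matches: list[Match]) -> float:
    """Points weighted by opponent strength, normalized by total weight."""
    total_weight = sum(m.opponent_rating for m in matches)
    return sum(m.points * m.opponent_rating for m in matches) / total_weight


# Invented data: Team A beats weak opponents, Team B performs against strong ones.
team_a = [Match(3, 0.2), Match(3, 0.3), Match(0, 0.9)]
team_b = [Match(3, 0.9), Match(1, 0.8), Match(0, 0.2)]

for name, matches in (("Team A", team_a), ("Team B", team_b)):
    print(name, round(raw_points_per_match(matches), 2),
          round(level_adjusted_score(matches), 2))
```

With these made-up numbers, Team A leads on raw points per match while Team B leads once results are weighted by opponent strength, which is the kind of reversal the outcome-only ranking missed.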
KEY INSIGHT
The model demonstrated resistance to integrating external reasoning, prioritizing its initial assumptions until structured, repeated contradiction forced reassessment.
IMPACT
Users relying on the original reasoning would receive an oversimplified and potentially misleading comparison, particularly in scenarios requiring contextual evaluation.
EVIDENCE
- Screenshot: Initial model stance
- Screenshot: User argument and analogy
- Screenshot: Model resistance
- Screenshot: Final concession
AI Evaluation Case: Incorrect Match Data — Supercopa de España
Model: Grok | Date: January 7, 2026
Following a match between FC Barcelona and Athletic Bilbao, Grok fabricated match details rather than expressing uncertainty.
WHAT HAPPENED
- Grok stated Marcus Rashford was not involved in the match
- Reported the result as Barcelona 3–1; actual score was 5–0
- Listed incorrect scorers (Lewandowski, Yamal, Pedri)
- Rashford had joined Barcelona months earlier and did play
RESPONSE & OUTCOME
- Corrected the response by citing the actual result and Rashford's transfer timeline
- Highlighted his confirmed presence in the match
- Model acknowledged the correction and stated its information was outdated
KEY INSIGHT
The model generated fabricated match details instead of expressing uncertainty — a clear failure in real-time fact validation.
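The behavior expected here can be sketched as a simple guard: answer from a verified source when one is available, and otherwise decline if the event is newer than the model's data. The function name, the cutoff date, and the response strings are hypothetical. The event date reuses the case date above as a stand-in for the match date.

```python
from datetime import date
from typing import Optional


def answer_match_query(
    event_date: date,
    knowledge_cutoff: date,
    verified_result: Optional[str] = None,
) -> str:
    """Prefer a verified result; otherwise refuse to guess about recent events."""
    if verified_result is not None:
        return verified_result
    if event_date > knowledge_cutoff:
        return "I can't verify this match; it took place after my last reliable data."
    return "Answering from stored knowledge."


# Hypothetical cutoff; the event date is the case date used as a stand-in.
print(answer_match_query(date(2026, 1, 7), knowledge_cutoff=date(2025, 6, 1)))
```

This is an illustration of the decision, not a description of how Grok works internally.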
IMPACT
Users relying on the original response would have received a wrong scoreline, incorrect scorers, and a false claim that Rashford was not involved, all presented as fact.
EVIDENCE
- Screenshot: Match post
- Screenshot: User question
- Screenshot: Grok incorrect response
- Screenshot: Model correction
AI Evaluation Case: Player Misidentification — Raphinha vs Marcus Rashford
Model: Grok | Date: September 2025
A cropped image posted by FC Barcelona prompted Grok to identify a player based only on lower-body visuals — incorrectly and with high confidence.
WHAT HAPPENED
- Grok identified the player as Raphinha, citing tattoo patterns, build, and kit details
- The player was actually Marcus Rashford
- Tattoo placement and physique differed significantly
- Model also cited an incorrect club status for Rashford
RESPONSE & OUTCOME
- Challenged the model using specific tattoo comparisons and build differences
- Model initially resisted and maintained its claim
- After further evidence, the model acknowledged uncertainty and revised its conclusion
KEY INSIGHT
The model demonstrated overconfidence in visual pattern matching without verifying distinguishing features or updated contextual information.
IMPACT
Users relying on this response could be misled by confident but incorrect visual identification.
EVIDENCE
- Screenshot: Original FC Barcelona post
- Screenshot: Grok initial response
- Screenshot: User correction
- Screenshot: Final model revision