AI Evaluation Case Studies

A documented record of AI model errors — covering safety risks, reasoning failures, factual hallucinations, and visual misidentifications.

AI Safety Case: Unsafe JavaScript Generated for Browser Execution

Model: Grok | Date: October 2025

While I was troubleshooting list visibility on X (Twitter), the model generated executable JavaScript that used internal APIs and session data.

WHAT HAPPENED

Model provided JavaScript intended for direct browser execution
Code called internal X API endpoints using authenticated session credentials
Accessed CSRF token data from cookies
No warning was given about the security context

RESPONSE & OUTCOME

Recognized the instruction as unsafe; avoided executing it in a logged-in session
Used an incognito window for controlled testing instead
Model acknowledged the mistake, apologized, and updated its guidance
Confirmed that parallel sessions produced the same unsafe output

KEY INSIGHT

The model generated executable code without accounting for user security context — demonstrating a failure to enforce safeguards around authenticated environments.
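
The missing safeguard can be illustrated with a minimal sketch of a pre-execution check. It scans a generated snippet for the behaviors described in this case: cookie access, CSRF token handling, and credentialed requests. The pattern list, the function name flag_risky_snippet, and the example snippet (including its endpoint path) are all hypothetical; this is not the code the model produced, and a real review would need far more than keyword matching.

```python
import re

# Hypothetical, deliberately small pattern list (an assumption for illustration).
RISKY_PATTERNS = {
    "reads cookies": re.compile(r"document\.cookie"),
    "references a CSRF token": re.compile(r"csrf", re.IGNORECASE),
    "sends a credentialed request": re.compile(r"credentials\s*:\s*['\"]include['\"]"),
}

# Illustrative snippet only; the endpoint path is a placeholder, not a real API.
EXAMPLE_SNIPPET = """
const token = document.cookie.match(/csrf_token=([^;]+)/)[1];
fetch("/internal/api/lists", {credentials: "include", headers: {"x-csrf-token": token}});
"""


def flag_risky_snippet(js_source: str) -> list[str]:
    """Return a label for every risky pattern found in the JavaScript source."""
    return [
        label
        for label, pattern in RISKY_PATTERNS.items()
        if pattern.search(js_source)
    ]


if __name__ == "__main__":
    for warning in flag_risky_snippet(EXAMPLE_SNIPPET):
        print("WARNING: snippet", warning)
```

Any non-empty result would be a reason to attach a warning, or to test only in a logged-out or incognito window, as was done here.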

IMPACT

A user following the original instructions could have exposed sensitive session data or interacted with private APIs in an unsafe manner.

EVIDENCE

Screenshot: Code provided by model
Screenshot: User challenge
Screenshot: Model apology
Screenshot: Final corrected stance

AI Evaluation Case: Model Resistance & Reasoning Correction

Model: Grok | Date: October 2025

During a team comparison (PSG vs. Barcelona vs. Chelsea), the model relied on simplified outcome-based reasoning. I challenged this using a "play-to-the-level" framework and cross-domain analogy.

WHAT HAPPENED

Model prioritized isolated results (e.g., tournament outcomes) over competitive context
Underweighted opponent quality and strength of schedule
Produced a ranking that didn't reflect performance at the elite level
Model resisted correction, reframing arguments to defend its original stance

RESPONSE & OUTCOME

Introduced a "play-to-the-level" framework comparing opponent quality across competitions
Used an analogy to shift the reasoning: elite teams, like elite athletes such as Roger Federer, must be evaluated relative to the level of their competition, not just by outcomes
After sustained challenge, the model acknowledged gaps in its logic and revised its ranking
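
One way to make the "play-to-the-level" idea concrete is to weight each result by the strength of the opponent. The sketch below is an assumption about how such a weighting could work, not the exact framework used in the conversation. The Match type, the 0-to-1 opponent_rating scale, and the two fictional teams are hypothetical and carry no real data.

```python
from dataclasses import dataclass


@dataclass
class Match:
    points: int             # 3 for a win, 1 for a draw, 0 for a loss
    opponent_rating: float  # hypothetical strength rating in (0, 1]


def raw_points_per_match(matches: list[Match]) -> float:
    """Outcome-only view: average points, ignoring who the opponent was."""
    return sum(m.points for m in matches) / len(matches)


def level_adjusted_score(matches: list[Match]) -> float:
    """Share of strength-weighted points earned out of those available."""
    earned = sum(m.points * m.opponent_rating for m in matches)
    available = sum(3 * m.opponent_rating for m in matches)
    return earned / available


# Fictional teams: A beats weak opponents but loses to a strong one,
# B beats a strong opponent and draws another.
team_a = [Match(3, 0.2), Match(3, 0.2), Match(0, 0.9)]
team_b = [Match(3, 0.9), Match(0, 0.2), Match(1, 0.8)]

if __name__ == "__main__":
    for name, matches in (("A", team_a), ("B", team_b)):
        print(name, round(raw_points_per_match(matches), 2),
              round(level_adjusted_score(matches), 2))
```

In this made-up data, Team A leads on raw points while Team B leads once opponent strength is counted, which is the kind of reversal an outcome-only ranking hides.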

KEY INSIGHT

The model demonstrated resistance to integrating external reasoning, prioritizing its initial assumptions until structured, repeated contradiction forced reassessment.

IMPACT

Users relying on the original reasoning would receive an oversimplified and potentially misleading comparison, particularly in scenarios requiring contextual evaluation.

EVIDENCE

Screenshot: Initial model stance
Screenshot: User argument and analogy
Screenshot: Model resistance
Screenshot: Final concession

AI Evaluation Case: Incorrect Match Data — Supercopa de España

Model: Grok | Date: January 7, 2026

Following a match between FC Barcelona and Athletic Bilbao, Grok fabricated match details rather than expressing uncertainty.

WHAT HAPPENED

Grok stated Marcus Rashford was not involved in the match
Reported the result as Barcelona 3–1; actual score was 5–0
Listed incorrect scorers (Lewandowski, Yamal, Pedri)
Rashford had joined Barcelona months earlier and did play

RESPONSE & OUTCOME

Corrected the response by citing the actual result and Rashford's transfer timeline
Highlighted his confirmed presence in the match
Model acknowledged the correction and stated its information was outdated

KEY INSIGHT

The model generated fabricated match details instead of expressing uncertainty — a clear failure in real-time fact validation.
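
The validation step that was missing can be sketched as a comparison between the model's claims and a verified record. The values below come from this case; the function name check_claims, the field names, and the idea of a trusted "verified" record are assumptions made for illustration. The actual scorers are not listed in this write-up, so that field is left out of the verified record and reported as unverified rather than guessed.

```python
def check_claims(claimed: dict, verified: dict) -> tuple[dict, list]:
    """Compare claimed fields against a verified record.

    Returns (mismatched, unverified):
      mismatched maps a field to (claimed value, verified value) when they differ;
      unverified lists claimed fields with no verified value to check against.
    """
    mismatched = {
        field: (claimed[field], verified[field])
        for field in claimed
        if field in verified and claimed[field] != verified[field]
    }
    unverified = [field for field in claimed if field not in verified]
    return mismatched, unverified


# What the model stated.
claimed = {
    "score": "3-1",
    "rashford_played": False,
    "scorers": ["Lewandowski", "Yamal", "Pedri"],
}

# What this write-up confirms (hypothetical trusted record).
verified = {"score": "5-0", "rashford_played": True}

if __name__ == "__main__":
    mismatched, unverified = check_claims(claimed, verified)
    print("Mismatched:", mismatched)
    print("Unverified:", unverified)
```

A response pipeline built on a check like this could state uncertainty for any unverified field instead of presenting it as fact.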

IMPACT

Users relying on the original response would receive a fabricated score, incorrect scorers, and a false claim about a player's involvement, all presented as fact.

EVIDENCE

Screenshot: Match post
Screenshot: User question
Screenshot: Grok incorrect response
Screenshot: Model correction

AI Evaluation Case: Player Misidentification — Raphinha vs. Marcus Rashford

Model: Grok | Date: September 2025

A cropped image posted by FC Barcelona prompted Grok to identify a player based only on lower-body visuals — incorrectly and with high confidence.

WHAT HAPPENED

Grok identified the player as Raphinha, citing tattoo patterns, build, and kit details
The player was actually Marcus Rashford
Tattoo placement and physique differed significantly
Model also cited an incorrect club status for Rashford

RESPONSE & OUTCOME

Challenged the model using specific tattoo comparisons and build differences
Model initially resisted and maintained its claim
After continued evidence, it acknowledged uncertainty and revised its conclusion

KEY INSIGHT

The model demonstrated overconfidence in visual pattern matching without verifying distinguishing features or updated contextual information.

IMPACT

Users relying on this response could be misled by confident but incorrect visual identification.

EVIDENCE

Screenshot: Original FC Barcelona post
Screenshot: Grok initial response
Screenshot: User correction
Screenshot: Final model revision