AI Evaluation Case Studies

A documented record of AI model errors — covering safety risks, reasoning failures, factual hallucinations, and visual misidentifications.

AI Safety Case: Unsafe JavaScript Generated for Browser Execution

Model: Grok | Date: October 2025

While troubleshooting list visibility on X (Twitter), the model generated executable JavaScript involving internal APIs and session data.

WHAT HAPPENED

  • Model provided JavaScript intended for direct browser execution
  • Code called internal X API endpoints using authenticated session credentials
  • Code read the CSRF token from browser cookies
  • No warning was given about the security context

RESPONSE & OUTCOME

  • I recognized the instruction as unsafe and did not run the code in a logged-in session
  • I used an incognito window for controlled testing instead
  • The model acknowledged the mistake, apologized, and updated its guidance
  • I confirmed that the unsafe output was generated across parallel sessions

KEY INSIGHT

The model generated executable code without accounting for user security context — demonstrating a failure to enforce safeguards around authenticated environments.
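The kind of safeguard that was missing can be sketched as a simple pre-check on generated JavaScript. This is a minimal illustration, not a description of any real safety layer: the pattern list, the function name `flag_risky_javascript`, and the sample snippet are all assumptions made for this example, and the snippet is not the model's actual output.

```python
import re

# Illustrative patterns only (assumed for this sketch); a real review
# would need a fuller, maintained list.
RISKY_PATTERNS = {
    "reads cookies": re.compile(r"document\.cookie"),
    "handles a CSRF token": re.compile(r"csrf", re.IGNORECASE),
    "sends session credentials": re.compile(r"credentials\s*:\s*['\"]include['\"]"),
}


def flag_risky_javascript(snippet: str) -> list[str]:
    """Return the label of every illustrative risk pattern found in the snippet."""
    return [label for label, pattern in RISKY_PATTERNS.items() if pattern.search(snippet)]


# Hypothetical snippet written for this example, not copied from the model.
example = """
const token = document.cookie.match(/csrf=([^;]+)/)[1];
fetch("/internal/api/lists", {credentials: "include", headers: {"x-csrf-token": token}});
"""

for finding in flag_risky_javascript(example):
    print("WARNING: snippet", finding)
```

A check like this would not make the code safe, but any hit is a reasonable trigger for a warning about running the snippet in a logged-in browser session.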

IMPACT

A user following the original instructions could have exposed sensitive session data or interacted with private APIs in an unsafe manner.

EVIDENCE

  • Screenshot: Code provided by model
  • Screenshot: User challenge
  • Screenshot: Model apology
  • Screenshot: Final corrected stance

AI Evaluation Case: Model Resistance & Reasoning Correction

Model: Grok | Date: October 2025

During a team comparison (PSG vs. Barcelona vs. Chelsea), the model relied on simplified outcome-based reasoning. I challenged this using a "play-to-the-level" framework and cross-domain analogy.

WHAT HAPPENED

  • Model prioritized isolated results (e.g., tournament outcomes) over competitive context
  • Underweighted opponent quality and strength of schedule
  • Produced a ranking that didn't reflect performance at elite level
  • Model resisted correction, reframing arguments to defend its original stance

RESPONSE & OUTCOME

  • Introduced a "play-to-the-level" framework comparing opponent quality across competitions
  • Used an analogy to shift the reasoning: elite teams, like elite athletes such as Roger Federer, must be evaluated relative to the level of their competition, not just by outcomes
  • After sustained challenge, the model acknowledged gaps in its logic and revised its ranking
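One way to make the "play-to-the-level" idea concrete is to weight each result by the strength of the opponent. The sketch below is only an assumed formalization for illustration: the `Match` type, the 0 to 1 strength scale, the weighting rule, and both team records are invented here and do not reproduce the actual argument or any real results.

```python
from dataclasses import dataclass


@dataclass
class Match:
    points: int               # 3 for a win, 1 for a draw, 0 for a loss
    opponent_strength: float  # hypothetical rating in (0, 1]; 1.0 = elite opponent


def raw_points_per_match(matches: list[Match]) -> float:
    """Outcome-only view: average points, ignoring who the opponent was."""
    return sum(m.points for m in matches) / len(matches)


def level_adjusted_score(matches: list[Match]) -> float:
    """Average points with each match weighted by opponent strength."""
    total_weight = sum(m.opponent_strength for m in matches)
    return sum(m.points * m.opponent_strength for m in matches) / total_weight


# Invented records for two placeholder teams; not real results.
team_a = [Match(3, 0.3), Match(3, 0.3), Match(3, 0.4), Match(0, 0.9)]
team_b = [Match(3, 0.9), Match(1, 0.9), Match(3, 0.8), Match(0, 0.4)]

print("raw:     ", raw_points_per_match(team_a), raw_points_per_match(team_b))
print("adjusted:", level_adjusted_score(team_a), level_adjusted_score(team_b))
```

With these invented numbers, Team A leads on raw points but Team B leads once opponent strength is weighted, which is the kind of reversal an outcome-only ranking hides.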

KEY INSIGHT

The model demonstrated resistance to integrating external reasoning, prioritizing its initial assumptions until structured, repeated contradiction forced reassessment.

IMPACT

Users relying on the original reasoning would receive an oversimplified and potentially misleading comparison, particularly in scenarios requiring contextual evaluation.

EVIDENCE

  • Screenshot: Initial model stance
  • Screenshot: User argument and analogy
  • Screenshot: Model resistance
  • Screenshot: Final concession

AI Evaluation Case: Incorrect Match Data — Supercopa de España

Model: Grok | Date: January 7, 2026

Following a match between FC Barcelona and Athletic Bilbao, Grok fabricated match details rather than expressing uncertainty.

WHAT HAPPENED

  • Grok stated Marcus Rashford was not involved in the match
  • Reported the result as Barcelona 3–1; actual score was 5–0
  • Listed incorrect scorers (Lewandowski, Yamal, Pedri)
  • Rashford had joined Barcelona months earlier and did play

RESPONSE & OUTCOME

  • I corrected the response by citing the actual result and Rashford's transfer timeline
  • I highlighted his confirmed presence in the match
  • Model acknowledged the correction and stated its information was outdated

KEY INSIGHT

The model generated fabricated match details instead of expressing uncertainty — a clear failure in real-time fact validation.
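The behavior expected here, hedging when an event falls outside what the model can know, can be sketched as a small decision rule. Everything in this block is an assumption for illustration: the function `answer_or_hedge`, its three return labels, and the cutoff date are hypothetical and say nothing about how Grok actually works.

```python
from datetime import date


def answer_or_hedge(event_date: date, knowledge_cutoff: date, has_live_source: bool) -> str:
    """Decide whether to state event details or express uncertainty."""
    if event_date <= knowledge_cutoff:
        # The event falls inside the period the model could have learned about.
        return "answer"
    if has_live_source:
        # Details can be checked against a retrieved, citable source.
        return "answer_with_citation"
    # No basis for a score or scorer list, so say so instead of guessing.
    return "express_uncertainty"


# The cutoff is a made-up placeholder, not Grok's actual cutoff.
# The event date is the date of this case, used as a stand-in for the match date.
decision = answer_or_hedge(
    event_date=date(2026, 1, 7),
    knowledge_cutoff=date(2025, 6, 1),
    has_live_source=False,
)
print(decision)
```

Under these assumed inputs the rule selects the uncertainty branch, which is the response the model should have given in place of an invented score and scorer list.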

IMPACT

Users relying on the original response would have received a wrong score, wrong scorers, and a false claim that Rashford was not involved, all stated as fact.

EVIDENCE

  • Screenshot: Match post
  • Screenshot: User question
  • Screenshot: Grok incorrect response
  • Screenshot: Model correction

AI Evaluation Case: Player Misidentification — Raphinha vs Marcus Rashford

Model: Grok | Date: September 2025

A cropped image posted by FC Barcelona prompted Grok to identify a player based only on lower-body visuals — incorrectly and with high confidence.

WHAT HAPPENED

  • Grok identified the player as Raphinha, citing tattoo patterns, build, and kit details
  • The player was actually Marcus Rashford
  • Tattoo placement and physique differed significantly
  • Model also cited incorrect club status for Rashford

RESPONSE & OUTCOME

  • Challenged the model using specific tattoo comparisons and build differences
  • Model initially resisted and maintained its claim
  • After continued evidence, it acknowledged uncertainty and revised its conclusion

KEY INSIGHT

The model demonstrated overconfidence in visual pattern matching without verifying distinguishing features or updated contextual information.

IMPACT

Users relying on this response could be misled by confident but incorrect visual identification.

EVIDENCE

  • Screenshot: Original FC Barcelona post
  • Screenshot: Grok initial response
  • Screenshot: User correction
  • Screenshot: Final model revision