ELO Ratings Questions
Key Argument
- Thesis: Using ELO for AI agent evaluation = measuring noise
- Problem: Wrong evaluators, wrong metrics, wrong assumptions
- Solution: Quantitative assessment frameworks
The Comparison (00:00-02:00)
Chess ELO
- FIDE arbiters: 120hr training
- Binary outcome: win/loss
- Test-retest: r=0.95
- Cohen's κ=0.92
AI Agent ELO
- Random users: Google engineer? CS student? 10-year-old?
- Undefined dimensions: accuracy? style? speed?
- Test-retest: r=0.31 (coin flip)
- Cohen's κ=0.42
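For readers unfamiliar with the mechanics being criticized, here is a minimal sketch of the standard Elo update. The K-factor of 32 and the function names are illustrative assumptions, not values taken from the episode or from any particular leaderboard.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return both new ratings after one comparison.

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    k=32 is an assumed K-factor chosen only for illustration.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two equally rated agents; A wins a single preference vote.
    print(update(1500.0, 1500.0, 1.0))
```

The update only sees `score_a`, so it treats every vote as a clean win or loss, whoever cast it and on whatever criterion. That is the gap the chess comparison above points at.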
Cognitive Bias Cascade (02:00-03:30)
- Anchoring: 34% rating variance in first 3 seconds
- Confirmation: 78% selective attention to preferred features
- Dunning-Kruger: d=1.24 effect size
- Result: Circular preferences (A>B>C>A)
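A circular preference such as A>B>C>A can be checked mechanically. The sketch below assumes pairwise results have already been reduced to a set of `(winner, loser)` pairs; that data shape and the function name are hypothetical.

```python
from itertools import permutations


def find_preference_cycles(beats):
    """Return every 3-cycle (a, b, c) where a beats b, b beats c, and c beats a.

    beats: a set of (winner, loser) pairs from pairwise comparisons.
    Each cycle is reported once, starting from its smallest member.
    """
    items = sorted({x for pair in beats for x in pair})
    cycles = []
    for a, b, c in permutations(items, 3):
        if a < b and a < c and (a, b) in beats and (b, c) in beats and (c, a) in beats:
            cycles.append((a, b, c))
    return cycles


if __name__ == "__main__":
    votes = {("A", "B"), ("B", "C"), ("C", "A")}
    print(find_preference_cycles(votes))
```

A transitive set of results returns an empty list. Any non-empty result means the votes cannot be explained by a single consistent ordering of those items.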
The Quantitative Alternative (03:30-05:00)
Objective Metrics
- McCabe complexity ≤20
- Test coverage ≥80%
- Big O notation comparison
- Self-admitted technical debt
- Reliability: r=0.91 (objective metrics) vs r=0.42 (preference votes)
- Effect size: d=2.18
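As a rough illustration of an automated gate built on the thresholds above, the sketch below approximates McCabe complexity for Python source using the standard `ast` module. It is not the TDG or PMAT implementation: the set of node types counted and the `passes_gate` helper are assumptions, and the count covers the whole source string rather than each function separately.

```python
import ast

# Node types treated as decision points. This is an approximation of
# McCabe's metric, not a full implementation.
_BRANCH_NODES = (
    ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp, ast.comprehension,
)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra paths.
            complexity += len(node.values) - 1
    return complexity


def passes_gate(source: str, coverage_pct: float,
                max_complexity: int = 20, min_coverage: float = 80.0) -> bool:
    """Hypothetical gate: complexity <= 20 and test coverage >= 80%."""
    return cyclomatic_complexity(source) <= max_complexity and coverage_pct >= min_coverage


if __name__ == "__main__":
    sample = (
        "def f(x):\n"
        "    if x > 0 and x < 10:\n"
        "        return 1\n"
        "    for i in range(x):\n"
        "        pass\n"
        "    return 0\n"
    )
    print(cyclomatic_complexity(sample), passes_gate(sample, coverage_pct=85.0))
```

The coverage figure is passed in as a number here; in practice it would come from whatever coverage tool the project already runs.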
Dream Scenario vs Reality (05:00-06:00)
Dream
- World's best engineers
- Annotated metrics
- Standardized criteria
Reality
- Random internet users
- No expertise verification
- Subjective preferences
Key Statistics
| Metric | Chess | AI Agents |
|---|---|---|
| Inter-rater reliability | κ=0.92 | κ=0.42 |
| Test-retest | r=0.95 | r=0.31 |
| Temporal drift | ±10 pts | ±150 pts |
| Hurst exponent | 0.89 | 0.31 |
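The two reliability figures in the table can be computed from raw ratings. The sketch below uses the textbook definitions of Cohen's κ (agreement between two raters, corrected for chance) and Pearson's r (the usual test-retest statistic). The function names and toy data are illustrative and do not reproduce the numbers above.

```python
import math
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("raters must label the same non-empty set of items")
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def pearson_r(xs, ys):
    """Pearson correlation between two rating sessions over the same items."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    # Toy data: which of two agents each rater preferred on four prompts.
    print(cohens_kappa(["A", "A", "B", "B"], ["A", "B", "A", "B"]))
    print(pearson_r([1, 2, 3], [2, 4, 6]))
```

In the toy example the raters agree on half the prompts, which is exactly the chance rate for their label frequencies, so κ is zero even though raw agreement is 50%.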
Takeaways
- Stop: Using preference votes as quality metrics
- Start: Automated complexity analysis
- ROI: 4.7 months to break even
Citations Mentioned
- Kapoor et al. (2025): "AI Agents That Matter" - κ=0.42 finding
- Santos et al. (2022): Technical Debt Grading validation
- Regan & Haworth (2011): Chess arbiter reliability κ=0.92
- Chapman & Johnson (2002): 34% anchoring effect
Quotable Moments
"You can't rate chess with basketball fans"
"0.31 reliability? That's a coin flip with extra steps"
"Every preference vote is a data crime"
"The psychometrics are screaming"
Resources
- Technical Debt Grading (TDG) Framework
- PMAT (Pragmatic AI Labs MCP Agent Toolkit)
- McCabe Complexity Calculator
- Cohen's Kappa Calculator