Semantic Anchor Evaluation Report
Multiple-choice recognition test across 10 LLMs — 191 questions, 61 anchors
Model Summary
| Model | Score | Questions | Result file |
| claude-haiku-4-5-20251001 | 98% | 191 | pilot-20260326-104417_claude-haiku-4-5-20251001.json |
| claude-opus-4-6 | 100% | 191 | pilot-20260326-100007_claude-opus-4-6.json |
| claude-sonnet-4-20250514 | 99% | 191 | pilot-20260324-174404.json |
| devstral-2512 | 96% | 191 | pilot-20260326-073241_devstral-2512.json |
| gpt-4o | 98% | 191 | pilot-20260324-192413.json |
| gpt-5.4-2026-03-05 | 100% | 191 | pilot-20260326-110102_gpt-5.4-2026-03-05.json |
| gpt-5.4-mini-2026-03-17 | 97% | 191 | pilot-20260326-123346_gpt-5.4-mini-2026-03-17.json |
| mistral-large-2512 | 96% | 191 | pilot-20260324-190600.json |
| mistral-medium-2508 | 85% | 191 | pilot-20260326-070127_mistral-medium-2508.json |
| mistral-small-2603 | 74% | 191 | pilot-20260326-074132_mistral-small-2603.json |
Heatmap: Anchor × Model
| Anchor / Question | claude-haiku-4-5-20251001 | claude-opus-4-6 | claude-sonnet-4-20250514 | devstral-2512 | gpt-4o | gpt-5.4-2026-03-05 | gpt-5.4-mini-2026-03-17 | mistral-large-2512 | mistral-medium-2508 | mistral-small-2603 |
| adr-according-to-nygard | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 92% | 50% | 83% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% | 75% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 25% | ✓ |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 75% | 75% |
| arc42 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 46% | ✓ |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% | ✓ |
| consistency-language | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 0% | ✓ |
| consistency-variant-1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 25% | ✓ |
| consistency-variant-2 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% | ✓ |
| consistency-variant-3 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% | ✓ |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ |
| atam | ✓ | ✓ | ✓ | 83% | ✓ | ✓ | ✓ | ✓ | 58% | 92% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% | 75% |
| application-paraphrase | ✓ | ✓ | ✓ | 50% | ✓ | ✓ | ✓ | ✓ | 50% | ✓ |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ |
| bdd-given-when-then | ✓ | ✓ | ✓ | 50% | ✓ | ✓ | ✓ | 83% | 33% | 92% |
| application-anchor | ✓ | ✓ | ✓ | 50% | ✓ | ✓ | ✓ | ✓ | 50% | ✓ |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% | 25% | 75% |
| recognition | ✓ | ✓ | ✓ | 0% | ✓ | ✓ | ✓ | ✓ | 25% | ✓ |
| bem-methodology | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 42% | ✓ |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% | ✓ |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 25% | ✓ |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% | ✓ |
| bluf | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ |
| c4-diagrams | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 25% | ✓ |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% | ✓ |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 0% | ✓ |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 25% | ✓ |
| chain-of-thought | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 58% | ✓ |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% | ✓ |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% | ✓ |
| clean-architecture | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| control-chart-shewhart | ✓ | ✓ | ✓ | ✓ | 92% | ✓ | ✓ | ✓ | 92% | ✓ |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | 75% | ✓ | ✓ | ✓ | ✓ | ✓ |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| conventional-commits | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| cqrs | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 92% | 75% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 75% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| cynefin-framework | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 83% | ✓ | 92% | ✓ |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% | ✓ | ✓ | ✓ |
| definition-of-done | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 83% | 67% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 50% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 75% |
| devils-advocate | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 58% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| diataxis-framework | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 92% | 58% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 25% |
| docs-as-code | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 92% | 83% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 50% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| domain-driven-design | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 67% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| ears-requirements | 58% | ✓ | ✓ | 83% | 92% | ✓ | 75% | 83% | ✓ | 17% |
| application-anchor | 75% | ✓ | ✓ | 75% | ✓ | ✓ | 50% | ✓ | ✓ | 0% |
| application-paraphrase | 50% | ✓ | ✓ | 75% | 75% | ✓ | 75% | 75% | ✓ | 50% |
| recognition | 50% | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ | 0% |
| event-driven-architecture | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 83% | 75% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 75% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 75% |
| fagan-inspection | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 83% | 58% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 25% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 75% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| feynman-technique | 92% | ✓ | 67% | 83% | 67% | ✓ | 92% | 92% | 92% | 75% |
| application-anchor | ✓ | ✓ | ✓ | 75% | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| application-paraphrase | 75% | ✓ | 0% | 75% | 0% | ✓ | 75% | 75% | 75% | 75% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| five-whys | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 92% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| fowler-patterns | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 92% | 75% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 75% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| gherkin | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 83% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| github-flow | ✓ | ✓ | 92% | 92% | 92% | ✓ | ✓ | 92% | ✓ | 42% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| application-paraphrase | ✓ | ✓ | 75% | 75% | 75% | ✓ | ✓ | 75% | ✓ | 0% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| gutes-deutsch-wolf-schneider | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 83% | 92% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 75% |
| hexagonal-architecture | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 92% | 92% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| iec-61508-sil-levels | 83% | ✓ | ✓ | 83% | 92% | ✓ | 92% | 83% | 92% | 33% |
| application-anchor | 50% | ✓ | ✓ | 75% | ✓ | ✓ | 75% | 50% | ✓ | 0% |
| application-paraphrase | ✓ | ✓ | ✓ | 75% | 75% | ✓ | ✓ | ✓ | ✓ | 0% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ |
| impact-mapping | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 25% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| invest | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 25% |
| iso-25010 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 83% | ✓ | 75% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ | 50% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ | 75% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| jobs-to-be-done | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 92% | 83% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ |
| lasr | 83% | ✓ | ✓ | 83% | 92% | ✓ | ✓ | 75% | ✓ | 58% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| recognition | 50% | ✓ | ✓ | 50% | 75% | ✓ | ✓ | 25% | ✓ | 50% |
| linddun | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 92% | 83% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 50% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| llm-evaluations | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 83% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| madr | ✓ | ✓ | ✓ | 83% | ✓ | ✓ | 92% | ✓ | 75% | 67% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| recognition | ✓ | ✓ | ✓ | 50% | ✓ | ✓ | 75% | ✓ | 25% | 75% |
| mece | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 92% | 83% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 50% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| morphological-box | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 92% | 67% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 25% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| moscow | ✓ | ✓ | ✓ | ✓ | 92% | ✓ | ✓ | 75% | 92% | 75% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | 75% | ✓ | ✓ | 25% | 75% | 50% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| mutation-testing | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 92% | 58% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 75% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 25% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| nelson-rules | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 92% | ✓ | 92% | 83% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 50% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ | ✓ | ✓ |
| owasp-top-10 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 92% | 67% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 75% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| plain-english-strunk-white | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 58% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| prd | 83% | ✓ | ✓ | 83% | 92% | ✓ | ✓ | 67% | 83% | 58% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| recognition | 50% | ✓ | ✓ | 50% | 75% | ✓ | ✓ | 0% | 50% | 25% |
| problem-space-nvc | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 83% | ✓ | 50% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ | 25% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ | 50% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| property-based-testing | ✓ | ✓ | ✓ | 92% | 83% | ✓ | ✓ | ✓ | 92% | 67% |
| application-anchor | ✓ | ✓ | ✓ | 75% | 75% | ✓ | ✓ | ✓ | ✓ | 25% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | 75% | ✓ | ✓ | ✓ | ✓ | ✓ |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 75% |
| pyramid-principle | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 92% | 83% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 75% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| semantic-versioning | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 92% | 75% | 83% | 92% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 50% | 75% | ✓ |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 75% | 75% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| socratic-method | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 92% | 67% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 25% |
| sota | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 83% | 75% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 50% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 75% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| spc | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 92% | 50% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 25% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| stride | 92% | ✓ | ✓ | ✓ | ✓ | ✓ | 92% | ✓ | 92% | 67% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| recognition | 75% | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ | 75% | 50% |
| swot | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 83% | 67% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 25% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 75% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ |
| tdd-chicago-school | ✓ | ✓ | ✓ | 92% | 92% | ✓ | 83% | ✓ | ✓ | 67% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| recognition | ✓ | ✓ | ✓ | 75% | 75% | ✓ | 50% | ✓ | ✓ | 50% |
| tdd-london-school | 93% | ✓ | ✓ | 89% | 89% | ✓ | 75% | ✓ | 71% | 54% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 75% |
| consistency-language | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% | ✓ | 75% | 50% |
| consistency-variant-1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 50% |
| consistency-variant-2 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ | ✓ | 75% |
| consistency-variant-3 | 50% | ✓ | ✓ | 25% | 50% | ✓ | 0% | ✓ | 0% | 25% |
| recognition | ✓ | ✓ | ✓ | ✓ | 75% | ✓ | ✓ | ✓ | 75% | 50% |
| testing-pyramid | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 83% | 92% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 75% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ |
| timtowtdi | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 83% | 83% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | ✓ |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 75% |
| todotxt-flavoured-markdown | ✓ | ✓ | ✓ | 83% | ✓ | ✓ | 92% | 83% | 92% | 75% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% |
| recognition | ✓ | ✓ | ✓ | 50% | ✓ | ✓ | 75% | 50% | 75% | 75% |
| user-story-mapping | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 92% | 75% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 75% | 75% |
| wardley-mapping | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 42% |
| application-anchor | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
| application-paraphrase | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 25% |
| recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 50% |
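The anchor rows in the heatmap appear to be the unweighted mean of the question rows listed beneath them, rounded to a whole percent. The sketch below reproduces three anchor cells from their question rows. It assumes that ✓ stands for 100%; the function name `anchor_score` is illustrative and is not taken from the evaluation tooling.

```python
def anchor_score(question_scores):
    """Mean of the per-question pass rates, rounded to a whole percent."""
    return round(sum(question_scores) / len(question_scores))


# Question rows copied from the heatmap (assumption: a check mark means 100%).
arc42_mistral_medium = [75, 50, 0, 25, 50, 50, 75]           # anchor row shows 46%
tdd_london_claude_haiku = [100, 100, 100, 100, 100, 50, 100]  # anchor row shows 93%
bdd_devstral = [50, 100, 0]                                   # anchor row shows 50%

for name, scores in [
    ("arc42 / mistral-medium-2508", arc42_mistral_medium),
    ("tdd-london-school / claude-haiku-4-5-20251001", tdd_london_claude_haiku),
    ("bdd-given-when-then / devstral-2512", bdd_devstral),
]:
    print(f"{name}: {anchor_score(scores)}%")
```

Anchors with seven question rows (arc42 and tdd-london-school) weight each row by 1/7 rather than 1/3, which is why their anchor cells show values such as 46% or 93% that a three-row anchor never produces.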
Control Questions
| Control | claude-haiku-4-5-20251001 | claude-opus-4-6 | claude-sonnet-4-20250514 | devstral-2512 | gpt-4o | gpt-5.4-2026-03-05 | gpt-5.4-mini-2026-03-17 | mistral-large-2512 | mistral-medium-2508 | mistral-small-2603 |
| negative-control | 100% | 100% | 100% | 75% | 100% | 100% | 100% | 75% | 100% | 50% |
| sanity-check | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
Failures Detail
claude-haiku-4-5-20251001: 9 failures
ears-requirements/application-anchor 75%
ears-requirements/application-paraphrase 50%
ears-requirements/recognition 50%
feynman-technique/application-paraphrase 75%
iec-61508-sil-levels/application-anchor 50%
lasr/recognition 50%
prd/recognition 50%
stride/recognition 75%
tdd-london-school/consistency-variant-3 50%
claude-opus-4-6: no failures
claude-sonnet-4-20250514: 2 failures
feynman-technique/application-paraphrase 0%
github-flow/application-paraphrase 75%
devstral-2512: 17 failures
atam/application-paraphrase 50%
bdd-given-when-then/application-anchor 50%
bdd-given-when-then/recognition 0%
ears-requirements/application-anchor 75%
ears-requirements/application-paraphrase 75%
feynman-technique/application-anchor 75%
feynman-technique/application-paraphrase 75%
github-flow/application-paraphrase 75%
iec-61508-sil-levels/application-anchor 75%
iec-61508-sil-levels/application-paraphrase 75%
lasr/recognition 50%
madr/recognition 50%
prd/recognition 50%
property-based-testing/application-anchor 75%
tdd-chicago-school/recognition 75%
tdd-london-school/consistency-variant-3 25%
todotxt-flavoured-markdown/recognition 50%
gpt-4o: 13 failures
control-chart-shewhart/application-paraphrase 75%
ears-requirements/application-paraphrase 75%
feynman-technique/application-paraphrase 0%
github-flow/application-paraphrase 75%
iec-61508-sil-levels/application-paraphrase 75%
lasr/recognition 75%
moscow/application-paraphrase 75%
prd/recognition 75%
property-based-testing/application-anchor 75%
property-based-testing/application-paraphrase 75%
tdd-chicago-school/recognition 75%
tdd-london-school/consistency-variant-3 50%
tdd-london-school/recognition 75%
gpt-5.4-2026-03-05: no failures
gpt-5.4-mini-2026-03-17: 14 failures
cynefin-framework/recognition 50%
ears-requirements/application-anchor 50%
ears-requirements/application-paraphrase 75%
feynman-technique/application-paraphrase 75%
iec-61508-sil-levels/application-anchor 75%
madr/recognition 75%
nelson-rules/recognition 75%
semantic-versioning/application-anchor 75%
stride/recognition 75%
tdd-chicago-school/recognition 50%
tdd-london-school/consistency-language 50%
tdd-london-school/consistency-variant-2 75%
tdd-london-school/consistency-variant-3 0%
todotxt-flavoured-markdown/recognition 75%
mistral-large-2512: 17 failures
adr-according-to-nygard/recognition 75%
bdd-given-when-then/application-paraphrase 50%
ears-requirements/application-paraphrase 75%
ears-requirements/recognition 75%
feynman-technique/application-paraphrase 75%
github-flow/application-paraphrase 75%
iec-61508-sil-levels/application-anchor 50%
iso-25010/application-anchor 75%
iso-25010/application-paraphrase 75%
lasr/recognition 25%
moscow/application-paraphrase 25%
prd/recognition 0%
problem-space-nvc/application-anchor 75%
problem-space-nvc/application-paraphrase 75%
semantic-versioning/application-anchor 50%
semantic-versioning/application-paraphrase 75%
todotxt-flavoured-markdown/recognition 50%
mistral-medium-2508: 77 failures
adr-according-to-nygard/application-anchor 50%
adr-according-to-nygard/application-paraphrase 25%
adr-according-to-nygard/recognition 75%
arc42/application-anchor 75%
arc42/application-paraphrase 50%
arc42/consistency-language 0%
arc42/consistency-variant-1 25%
arc42/consistency-variant-2 50%
arc42/consistency-variant-3 50%
arc42/recognition 75%
atam/application-anchor 50%
atam/application-paraphrase 50%
atam/recognition 75%
bdd-given-when-then/application-anchor 50%
bdd-given-when-then/application-paraphrase 25%
bdd-given-when-then/recognition 25%
bem-methodology/application-anchor 50%
bem-methodology/application-paraphrase 25%
bem-methodology/recognition 50%
bluf/application-anchor 75%
bluf/application-paraphrase 75%
bluf/recognition 75%
c4-diagrams/application-anchor 50%
c4-diagrams/application-paraphrase 0%
c4-diagrams/recognition 25%
chain-of-thought/application-anchor 50%
chain-of-thought/application-paraphrase 75%
chain-of-thought/recognition 50%
control-chart-shewhart/application-anchor 75%
cqrs/application-anchor 75%
cynefin-framework/application-anchor 75%
definition-of-done/application-paraphrase 75%
definition-of-done/recognition 75%
diataxis-framework/recognition 75%
docs-as-code/application-paraphrase 75%
event-driven-architecture/application-paraphrase 75%
event-driven-architecture/recognition 75%
fagan-inspection/application-anchor 75%
fagan-inspection/application-paraphrase 75%
feynman-technique/application-paraphrase 75%
fowler-patterns/application-anchor 75%
gutes-deutsch-wolf-schneider/application-anchor 75%
gutes-deutsch-wolf-schneider/recognition 75%
hexagonal-architecture/application-paraphrase 75%
iec-61508-sil-levels/recognition 75%
jobs-to-be-done/recognition 75%
linddun/application-paraphrase 75%
madr/recognition 25%
mece/application-anchor 75%
morphological-box/application-anchor 75%
moscow/application-paraphrase 75%
mutation-testing/application-anchor 75%
nelson-rules/application-anchor 75%
owasp-top-10/application-anchor 75%
prd/recognition 50%
property-based-testing/recognition 75%
pyramid-principle/application-anchor 75%
semantic-versioning/application-anchor 75%
semantic-versioning/application-paraphrase 75%
socratic-method/application-anchor 75%
sota/application-anchor 75%
sota/application-paraphrase 75%
spc/application-anchor 75%
stride/recognition 75%
swot/application-paraphrase 75%
swot/recognition 75%
tdd-london-school/application-paraphrase 75%
tdd-london-school/consistency-language 75%
tdd-london-school/consistency-variant-1 75%
tdd-london-school/consistency-variant-3 0%
tdd-london-school/recognition 75%
testing-pyramid/application-paraphrase 75%
testing-pyramid/recognition 75%
timtowtdi/application-paraphrase 75%
timtowtdi/recognition 75%
todotxt-flavoured-markdown/recognition 75%
user-story-mapping/recognition 75%
mistral-small-2603: 115 failures
adr-according-to-nygard/application-anchor 75%
adr-according-to-nygard/recognition 75%
atam/application-anchor 75%
bdd-given-when-then/application-paraphrase 75%
conventional-commits/application-anchor 50%
conventional-commits/application-paraphrase 75%
cqrs/application-anchor 75%
cqrs/application-paraphrase 50%
definition-of-done/application-anchor 75%
definition-of-done/application-paraphrase 50%
definition-of-done/recognition 75%
devils-advocate/application-anchor 75%
devils-advocate/application-paraphrase 50%
devils-advocate/recognition 50%
diataxis-framework/application-anchor 75%
diataxis-framework/application-paraphrase 75%
diataxis-framework/recognition 25%
docs-as-code/application-paraphrase 50%
domain-driven-design/application-paraphrase 50%
domain-driven-design/recognition 50%
ears-requirements/application-anchor 0%
ears-requirements/application-paraphrase 50%
ears-requirements/recognition 0%
event-driven-architecture/application-anchor 75%
event-driven-architecture/application-paraphrase 75%
event-driven-architecture/recognition 75%
fagan-inspection/application-anchor 25%
fagan-inspection/application-paraphrase 75%
fagan-inspection/recognition 75%
feynman-technique/application-anchor 50%
feynman-technique/application-paraphrase 75%
five-whys/application-anchor 75%
fowler-patterns/application-anchor 75%
fowler-patterns/recognition 50%
gherkin/application-anchor 75%
gherkin/recognition 75%
github-flow/application-anchor 50%
github-flow/application-paraphrase 0%
github-flow/recognition 75%
gutes-deutsch-wolf-schneider/recognition 75%
hexagonal-architecture/recognition 75%
iec-61508-sil-levels/application-anchor 0%
iec-61508-sil-levels/application-paraphrase 0%
impact-mapping/application-anchor 50%
impact-mapping/application-paraphrase 25%
impact-mapping/recognition 75%
invest/recognition 25%
iso-25010/application-anchor 50%
iso-25010/application-paraphrase 75%
jobs-to-be-done/application-anchor 50%
lasr/application-anchor 50%
lasr/application-paraphrase 75%
lasr/recognition 50%
linddun/application-paraphrase 50%
llm-evaluations/application-anchor 75%
llm-evaluations/recognition 75%
madr/application-anchor 50%
madr/application-paraphrase 75%
madr/recognition 75%
mece/application-anchor 50%
morphological-box/application-anchor 25%
morphological-box/application-paraphrase 75%
moscow/application-paraphrase 50%
moscow/recognition 75%
mutation-testing/application-anchor 75%
mutation-testing/application-paraphrase 25%
mutation-testing/recognition 75%
nelson-rules/application-anchor 50%
owasp-top-10/application-anchor 75%
owasp-top-10/application-paraphrase 50%
owasp-top-10/recognition 75%
plain-english-strunk-white/application-anchor 50%
plain-english-strunk-white/application-paraphrase 75%
plain-english-strunk-white/recognition 50%
prd/application-anchor 50%
prd/recognition 25%
problem-space-nvc/application-anchor 25%
problem-space-nvc/application-paraphrase 50%
problem-space-nvc/recognition 75%
property-based-testing/application-anchor 25%
property-based-testing/recognition 75%
pyramid-principle/application-anchor 75%
pyramid-principle/application-paraphrase 75%
semantic-versioning/application-paraphrase 75%
socratic-method/application-paraphrase 75%
socratic-method/recognition 25%
sota/application-anchor 50%
sota/application-paraphrase 75%
spc/application-anchor 25%
spc/application-paraphrase 75%
spc/recognition 50%
stride/application-anchor 50%
stride/recognition 50%
swot/application-anchor 25%
swot/application-paraphrase 75%
tdd-chicago-school/application-paraphrase 50%
tdd-chicago-school/recognition 50%
tdd-london-school/application-anchor 50%
tdd-london-school/application-paraphrase 75%
tdd-london-school/consistency-language 50%
tdd-london-school/consistency-variant-1 50%
tdd-london-school/consistency-variant-2 75%
tdd-london-school/consistency-variant-3 25%
tdd-london-school/recognition 50%
testing-pyramid/application-paraphrase 75%
timtowtdi/application-anchor 75%
timtowtdi/recognition 75%
todotxt-flavoured-markdown/application-anchor 75%
todotxt-flavoured-markdown/application-paraphrase 75%
todotxt-flavoured-markdown/recognition 75%
user-story-mapping/application-paraphrase 50%
user-story-mapping/recognition 75%
wardley-mapping/application-anchor 50%
wardley-mapping/application-paraphrase 25%
wardley-mapping/recognition 50%
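The failure lists can be cross-checked against the Model Summary. For the models checked here, the summary score is consistent with weighting all 191 questions equally and subtracting each failed question's shortfall from 100%. The sketch below is a minimal illustration under that assumption; `overall_score` is a hypothetical helper and is not part of the evaluation tooling, and the control questions are assumed not to count towards the 191.

```python
TOTAL_QUESTIONS = 191  # from the report header


def overall_score(failed_scores, total_questions=TOTAL_QUESTIONS):
    """Overall percentage, given the pass rates of the questions below 100%.

    Every question not listed is assumed to have scored 100%.
    """
    shortfall = sum(100 - score for score in failed_scores)
    return round(100 - shortfall / total_questions)


# Pass rates copied from the failure lists in this report.
claude_sonnet_failures = [0, 75]
claude_haiku_failures = [75, 50, 50, 75, 50, 50, 50, 75, 50]

print(overall_score(claude_sonnet_failures))  # summary shows 99%
print(overall_score(claude_haiku_failures))   # summary shows 98%
```

Because a single question is worth roughly half a percentage point, a model can appear in the failure list several times and still round to a summary score in the high nineties.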