Semantic Anchor Evaluation Report

Multiple-choice recognition test across 10 LLMs — 191 questions, 61 anchors

Legend: ≥80% · 50–79% · <50%

Model Summary

claude-haiku-4-5-20251001 · 98% · 191 questions · pilot-20260326-104417_claude-haiku-4-5-20251001.json
claude-opus-4-6 · 100% · 191 questions · pilot-20260326-100007_claude-opus-4-6.json
claude-sonnet-4-20250514 · 99% · 191 questions · pilot-20260324-174404.json
devstral-2512 · 96% · 191 questions · pilot-20260326-073241_devstral-2512.json
gpt-4o · 98% · 191 questions · pilot-20260324-192413.json
gpt-5.4-2026-03-05 · 100% · 191 questions · pilot-20260326-110102_gpt-5.4-2026-03-05.json
gpt-5.4-mini-2026-03-17 · 97% · 191 questions · pilot-20260326-123346_gpt-5.4-mini-2026-03-17.json
mistral-large-2512 · 96% · 191 questions · pilot-20260324-190600.json
mistral-medium-2508 · 85% · 191 questions · pilot-20260326-070127_mistral-medium-2508.json
mistral-small-2603 · 74% · 191 questions · pilot-20260326-074132_mistral-small-2603.json
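
The summary percentages are consistent with a plain mean of the question-level scores over all 191 questions, where every question not listed under Failures Detail counts as 100%. The report does not state this formula, so the sketch below is an assumption checked only against the published numbers; the function name `overall_accuracy` is hypothetical.

```python
def overall_accuracy(failed_scores, total_questions):
    """Mean question score in percent, assuming unlisted questions scored 100."""
    perfect = total_questions - len(failed_scores)
    return (100 * perfect + sum(failed_scores)) / total_questions


# Below-100% question scores listed for claude-haiku-4-5-20251001.
haiku_failures = [75, 50, 50, 75, 50, 50, 50, 75, 50]
# Below-100% question scores listed for claude-sonnet-4-20250514.
sonnet_failures = [0, 75]

print(round(overall_accuracy(haiku_failures, 191)))   # 98
print(round(overall_accuracy(sonnet_failures, 191)))  # 99
```

Both results match the summary values for those two models.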

Heatmap: Anchor × Model

Columns, in order: claude-haiku-4-5-20251001 · claude-opus-4-6 · claude-sonnet-4-20250514 · devstral-2512 · gpt-4o · gpt-5.4-2026-03-05 · gpt-5.4-mini-2026-03-17 · mistral-large-2512 · mistral-medium-2508 · mistral-small-2603

Each row lists only the cells below 100%, in column order; a row with no values had no cell below 100%. The model behind each value can be read from Failures Detail.

adr-according-to-nygard · 92% · 50% · 83%
  application-anchor · 50% · 75%
  application-paraphrase · 25%
  recognition · 75% · 75% · 75%
arc42 · 46%
  application-anchor · 75%
  application-paraphrase · 50%
  consistency-language · 0%
  consistency-variant-1 · 25%
  consistency-variant-2 · 50%
  consistency-variant-3 · 50%
  recognition · 75%
atam · 83% · 58% · 92%
  application-anchor · 50% · 75%
  application-paraphrase · 50% · 50%
  recognition · 75%
bdd-given-when-then · 50% · 83% · 33% · 92%
  application-anchor · 50% · 50%
  application-paraphrase · 50% · 25% · 75%
  recognition · 0% · 25%
bem-methodology · 42%
  application-anchor · 50%
  application-paraphrase · 25%
  recognition · 50%
bluf · 75%
  application-anchor · 75%
  application-paraphrase · 75%
  recognition · 75%
c4-diagrams · 25%
  application-anchor · 50%
  application-paraphrase · 0%
  recognition · 25%
chain-of-thought · 58%
  application-anchor · 50%
  application-paraphrase · 75%
  recognition · 50%
clean-architecture
  application-anchor
  application-paraphrase
  recognition
control-chart-shewhart · 92% · 92%
  application-anchor · 75%
  application-paraphrase · 75%
  recognition
conventional-commits · 75%
  application-anchor · 50%
  application-paraphrase · 75%
  recognition
cqrs · 92% · 75%
  application-anchor · 75% · 75%
  application-paraphrase · 50%
  recognition
cynefin-framework · 83% · 92%
  application-anchor · 75%
  application-paraphrase
  recognition · 50%
definition-of-done · 83% · 67%
  application-anchor · 75%
  application-paraphrase · 75% · 50%
  recognition · 75% · 75%
devils-advocate · 58%
  application-anchor · 75%
  application-paraphrase · 50%
  recognition · 50%
diataxis-framework · 92% · 58%
  application-anchor · 75%
  application-paraphrase · 75%
  recognition · 75% · 25%
docs-as-code · 92% · 83%
  application-anchor
  application-paraphrase · 75% · 50%
  recognition
domain-driven-design · 67%
  application-anchor
  application-paraphrase · 50%
  recognition · 50%
ears-requirements · 58% · 83% · 92% · 75% · 83% · 17%
  application-anchor · 75% · 75% · 50% · 0%
  application-paraphrase · 50% · 75% · 75% · 75% · 75% · 50%
  recognition · 50% · 75% · 0%
event-driven-architecture · 83% · 75%
  application-anchor · 75%
  application-paraphrase · 75% · 75%
  recognition · 75% · 75%
fagan-inspection · 83% · 58%
  application-anchor · 75% · 25%
  application-paraphrase · 75% · 75%
  recognition · 75%
feynman-technique · 92% · 67% · 83% · 67% · 92% · 92% · 92% · 75%
  application-anchor · 75% · 50%
  application-paraphrase · 75% · 0% · 75% · 0% · 75% · 75% · 75% · 75%
  recognition
five-whys · 92%
  application-anchor · 75%
  application-paraphrase
  recognition
fowler-patterns · 92% · 75%
  application-anchor · 75% · 75%
  application-paraphrase
  recognition · 50%
gherkin · 83%
  application-anchor · 75%
  application-paraphrase
  recognition · 75%
github-flow · 92% · 92% · 92% · 92% · 42%
  application-anchor · 50%
  application-paraphrase · 75% · 75% · 75% · 75% · 0%
  recognition · 75%
gutes-deutsch-wolf-schneider · 83% · 92%
  application-anchor · 75%
  application-paraphrase
  recognition · 75% · 75%
hexagonal-architecture · 92% · 92%
  application-anchor
  application-paraphrase · 75%
  recognition · 75%
iec-61508-sil-levels · 83% · 83% · 92% · 92% · 83% · 92% · 33%
  application-anchor · 50% · 75% · 75% · 50% · 0%
  application-paraphrase · 75% · 75% · 0%
  recognition · 75%
impact-mapping · 50%
  application-anchor · 50%
  application-paraphrase · 25%
  recognition · 75%
invest · 75%
  application-anchor
  application-paraphrase
  recognition · 25%
iso-25010 · 83% · 75%
  application-anchor · 75% · 50%
  application-paraphrase · 75% · 75%
  recognition
jobs-to-be-done · 92% · 83%
  application-anchor · 50%
  application-paraphrase
  recognition · 75%
lasr · 83% · 83% · 92% · 75% · 58%
  application-anchor · 50%
  application-paraphrase · 75%
  recognition · 50% · 50% · 75% · 25% · 50%
linddun · 92% · 83%
  application-anchor
  application-paraphrase · 75% · 50%
  recognition
llm-evaluations · 83%
  application-anchor · 75%
  application-paraphrase
  recognition · 75%
madr · 83% · 92% · 75% · 67%
  application-anchor · 50%
  application-paraphrase · 75%
  recognition · 50% · 75% · 25% · 75%
mece · 92% · 83%
  application-anchor · 75% · 50%
  application-paraphrase
  recognition
morphological-box · 92% · 67%
  application-anchor · 75% · 25%
  application-paraphrase · 75%
  recognition
moscow · 92% · 75% · 92% · 75%
  application-anchor
  application-paraphrase · 75% · 25% · 75% · 50%
  recognition · 75%
mutation-testing · 92% · 58%
  application-anchor · 75% · 75%
  application-paraphrase · 25%
  recognition · 75%
nelson-rules · 92% · 92% · 83%
  application-anchor · 75% · 50%
  application-paraphrase
  recognition · 75%
owasp-top-10 · 92% · 67%
  application-anchor · 75% · 75%
  application-paraphrase · 50%
  recognition · 75%
plain-english-strunk-white · 58%
  application-anchor · 50%
  application-paraphrase · 75%
  recognition · 50%
prd · 83% · 83% · 92% · 67% · 83% · 58%
  application-anchor · 50%
  application-paraphrase
  recognition · 50% · 50% · 75% · 0% · 50% · 25%
problem-space-nvc · 83% · 50%
  application-anchor · 75% · 25%
  application-paraphrase · 75% · 50%
  recognition · 75%
property-based-testing · 92% · 83% · 92% · 67%
  application-anchor · 75% · 75% · 25%
  application-paraphrase · 75%
  recognition · 75% · 75%
pyramid-principle · 92% · 83%
  application-anchor · 75% · 75%
  application-paraphrase · 75%
  recognition
semantic-versioning · 92% · 75% · 83% · 92%
  application-anchor · 75% · 50% · 75%
  application-paraphrase · 75% · 75% · 75%
  recognition
socratic-method · 92% · 67%
  application-anchor · 75%
  application-paraphrase · 75%
  recognition · 25%
sota · 83% · 75%
  application-anchor · 75% · 50%
  application-paraphrase · 75% · 75%
  recognition
spc · 92% · 50%
  application-anchor · 75% · 25%
  application-paraphrase · 75%
  recognition · 50%
stride · 92% · 92% · 92% · 67%
  application-anchor · 50%
  application-paraphrase
  recognition · 75% · 75% · 75% · 50%
swot · 83% · 67%
  application-anchor · 25%
  application-paraphrase · 75% · 75%
  recognition · 75%
tdd-chicago-school · 92% · 92% · 83% · 67%
  application-anchor
  application-paraphrase · 50%
  recognition · 75% · 75% · 50% · 50%
tdd-london-school · 93% · 89% · 89% · 75% · 71% · 54%
  application-anchor · 50%
  application-paraphrase · 75% · 75%
  consistency-language · 50% · 75% · 50%
  consistency-variant-1 · 75% · 50%
  consistency-variant-2 · 75% · 75%
  consistency-variant-3 · 50% · 25% · 50% · 0% · 0% · 25%
  recognition · 75% · 75% · 50%
testing-pyramid · 83% · 92%
  application-anchor
  application-paraphrase · 75% · 75%
  recognition · 75%
timtowtdi · 83% · 83%
  application-anchor · 75%
  application-paraphrase · 75%
  recognition · 75% · 75%
todotxt-flavoured-markdown · 83% · 92% · 83% · 92% · 75%
  application-anchor · 75%
  application-paraphrase · 75%
  recognition · 50% · 75% · 50% · 75% · 75%
user-story-mapping · 92% · 75%
  application-anchor
  application-paraphrase · 50%
  recognition · 75% · 75%
wardley-mapping · 42%
  application-anchor · 50%
  application-paraphrase · 25%
  recognition · 50%
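
The anchor-level percentages in the heatmap are consistent with the unweighted mean of that anchor's question-level scores for the same model. The sketch below assumes that relationship and also maps a score onto the three legend bands. The report does not spell out either rule, so the function names `anchor_score` and `band` are hypothetical, and banding on the unrounded value is an assumption.

```python
def anchor_score(question_scores):
    """Anchor-level score in percent: mean of the question-level scores."""
    return sum(question_scores) / len(question_scores)


def band(score):
    """Map a percentage onto the report's three legend bands."""
    if score >= 80:
        return "≥80%"
    if score >= 50:
        return "50–79%"
    return "<50%"


# ears-requirements for claude-haiku-4-5-20251001, from Failures Detail:
# application-anchor 75, application-paraphrase 50, recognition 50.
ears_haiku = anchor_score([75, 50, 50])
print(round(ears_haiku), band(ears_haiku))  # 58 50–79%

# tdd-london-school has seven questions; claude-haiku-4-5-20251001 missed
# only consistency-variant-3 (50), so the other six count as 100.
london_haiku = anchor_score([100] * 6 + [50])
print(round(london_haiku), band(london_haiku))  # 93 ≥80%
```

Both values match the corresponding heatmap cells (58% and 93%).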

Control Questions

Model · negative-control · sanity-check
claude-haiku-4-5-20251001 · 100% · 0%
claude-opus-4-6 · 100% · 0%
claude-sonnet-4-20250514 · 100% · 0%
devstral-2512 · 75% · 0%
gpt-4o · 100% · 0%
gpt-5.4-2026-03-05 · 100% · 0%
gpt-5.4-mini-2026-03-17 · 100% · 0%
mistral-large-2512 · 75% · 0%
mistral-medium-2508 · 100% · 0%
mistral-small-2603 · 50% · 0%

Failures Detail

claude-haiku-4-5-20251001: 9 failures

ears-requirements/application-anchor · 75%
ears-requirements/application-paraphrase · 50%
ears-requirements/recognition · 50%
feynman-technique/application-paraphrase · 75%
iec-61508-sil-levels/application-anchor · 50%
lasr/recognition · 50%
prd/recognition · 50%
stride/recognition · 75%
tdd-london-school/consistency-variant-3 · 50%

claude-opus-4-6: no failures

claude-sonnet-4-20250514: 2 failures

feynman-technique/application-paraphrase · 0%
github-flow/application-paraphrase · 75%

devstral-2512: 17 failures

atam/application-paraphrase · 50%
bdd-given-when-then/application-anchor · 50%
bdd-given-when-then/recognition · 0%
ears-requirements/application-anchor · 75%
ears-requirements/application-paraphrase · 75%
feynman-technique/application-anchor · 75%
feynman-technique/application-paraphrase · 75%
github-flow/application-paraphrase · 75%
iec-61508-sil-levels/application-anchor · 75%
iec-61508-sil-levels/application-paraphrase · 75%
lasr/recognition · 50%
madr/recognition · 50%
prd/recognition · 50%
property-based-testing/application-anchor · 75%
tdd-chicago-school/recognition · 75%
tdd-london-school/consistency-variant-3 · 25%
todotxt-flavoured-markdown/recognition · 50%

gpt-4o: 13 failures

control-chart-shewhart/application-paraphrase · 75%
ears-requirements/application-paraphrase · 75%
feynman-technique/application-paraphrase · 0%
github-flow/application-paraphrase · 75%
iec-61508-sil-levels/application-paraphrase · 75%
lasr/recognition · 75%
moscow/application-paraphrase · 75%
prd/recognition · 75%
property-based-testing/application-anchor · 75%
property-based-testing/application-paraphrase · 75%
tdd-chicago-school/recognition · 75%
tdd-london-school/consistency-variant-3 · 50%
tdd-london-school/recognition · 75%

gpt-5.4-2026-03-05: no failures

gpt-5.4-mini-2026-03-17: 14 failures

cynefin-framework/recognition · 50%
ears-requirements/application-anchor · 50%
ears-requirements/application-paraphrase · 75%
feynman-technique/application-paraphrase · 75%
iec-61508-sil-levels/application-anchor · 75%
madr/recognition · 75%
nelson-rules/recognition · 75%
semantic-versioning/application-anchor · 75%
stride/recognition · 75%
tdd-chicago-school/recognition · 50%
tdd-london-school/consistency-language · 50%
tdd-london-school/consistency-variant-2 · 75%
tdd-london-school/consistency-variant-3 · 0%
todotxt-flavoured-markdown/recognition · 75%

mistral-large-2512: 17 failures

adr-according-to-nygard/recognition · 75%
bdd-given-when-then/application-paraphrase · 50%
ears-requirements/application-paraphrase · 75%
ears-requirements/recognition · 75%
feynman-technique/application-paraphrase · 75%
github-flow/application-paraphrase · 75%
iec-61508-sil-levels/application-anchor · 50%
iso-25010/application-anchor · 75%
iso-25010/application-paraphrase · 75%
lasr/recognition · 25%
moscow/application-paraphrase · 25%
prd/recognition · 0%
problem-space-nvc/application-anchor · 75%
problem-space-nvc/application-paraphrase · 75%
semantic-versioning/application-anchor · 50%
semantic-versioning/application-paraphrase · 75%
todotxt-flavoured-markdown/recognition · 50%

mistral-medium-2508: 77 failures

adr-according-to-nygard/application-anchor · 50%
adr-according-to-nygard/application-paraphrase · 25%
adr-according-to-nygard/recognition · 75%
arc42/application-anchor · 75%
arc42/application-paraphrase · 50%
arc42/consistency-language · 0%
arc42/consistency-variant-1 · 25%
arc42/consistency-variant-2 · 50%
arc42/consistency-variant-3 · 50%
arc42/recognition · 75%
atam/application-anchor · 50%
atam/application-paraphrase · 50%
atam/recognition · 75%
bdd-given-when-then/application-anchor · 50%
bdd-given-when-then/application-paraphrase · 25%
bdd-given-when-then/recognition · 25%
bem-methodology/application-anchor · 50%
bem-methodology/application-paraphrase · 25%
bem-methodology/recognition · 50%
bluf/application-anchor · 75%
bluf/application-paraphrase · 75%
bluf/recognition · 75%
c4-diagrams/application-anchor · 50%
c4-diagrams/application-paraphrase · 0%
c4-diagrams/recognition · 25%
chain-of-thought/application-anchor · 50%
chain-of-thought/application-paraphrase · 75%
chain-of-thought/recognition · 50%
control-chart-shewhart/application-anchor · 75%
cqrs/application-anchor · 75%
cynefin-framework/application-anchor · 75%
definition-of-done/application-paraphrase · 75%
definition-of-done/recognition · 75%
diataxis-framework/recognition · 75%
docs-as-code/application-paraphrase · 75%
event-driven-architecture/application-paraphrase · 75%
event-driven-architecture/recognition · 75%
fagan-inspection/application-anchor · 75%
fagan-inspection/application-paraphrase · 75%
feynman-technique/application-paraphrase · 75%
fowler-patterns/application-anchor · 75%
gutes-deutsch-wolf-schneider/application-anchor · 75%
gutes-deutsch-wolf-schneider/recognition · 75%
hexagonal-architecture/application-paraphrase · 75%
iec-61508-sil-levels/recognition · 75%
jobs-to-be-done/recognition · 75%
linddun/application-paraphrase · 75%
madr/recognition · 25%
mece/application-anchor · 75%
morphological-box/application-anchor · 75%
moscow/application-paraphrase · 75%
mutation-testing/application-anchor · 75%
nelson-rules/application-anchor · 75%
owasp-top-10/application-anchor · 75%
prd/recognition · 50%
property-based-testing/recognition · 75%
pyramid-principle/application-anchor · 75%
semantic-versioning/application-anchor · 75%
semantic-versioning/application-paraphrase · 75%
socratic-method/application-anchor · 75%
sota/application-anchor · 75%
sota/application-paraphrase · 75%
spc/application-anchor · 75%
stride/recognition · 75%
swot/application-paraphrase · 75%
swot/recognition · 75%
tdd-london-school/application-paraphrase · 75%
tdd-london-school/consistency-language · 75%
tdd-london-school/consistency-variant-1 · 75%
tdd-london-school/consistency-variant-3 · 0%
tdd-london-school/recognition · 75%
testing-pyramid/application-paraphrase · 75%
testing-pyramid/recognition · 75%
timtowtdi/application-paraphrase · 75%
timtowtdi/recognition · 75%
todotxt-flavoured-markdown/recognition · 75%
user-story-mapping/recognition · 75%

mistral-small-2603: 115 failures

adr-according-to-nygard/application-anchor · 75%
adr-according-to-nygard/recognition · 75%
atam/application-anchor · 75%
bdd-given-when-then/application-paraphrase · 75%
conventional-commits/application-anchor · 50%
conventional-commits/application-paraphrase · 75%
cqrs/application-anchor · 75%
cqrs/application-paraphrase · 50%
definition-of-done/application-anchor · 75%
definition-of-done/application-paraphrase · 50%
definition-of-done/recognition · 75%
devils-advocate/application-anchor · 75%
devils-advocate/application-paraphrase · 50%
devils-advocate/recognition · 50%
diataxis-framework/application-anchor · 75%
diataxis-framework/application-paraphrase · 75%
diataxis-framework/recognition · 25%
docs-as-code/application-paraphrase · 50%
domain-driven-design/application-paraphrase · 50%
domain-driven-design/recognition · 50%
ears-requirements/application-anchor · 0%
ears-requirements/application-paraphrase · 50%
ears-requirements/recognition · 0%
event-driven-architecture/application-anchor · 75%
event-driven-architecture/application-paraphrase · 75%
event-driven-architecture/recognition · 75%
fagan-inspection/application-anchor · 25%
fagan-inspection/application-paraphrase · 75%
fagan-inspection/recognition · 75%
feynman-technique/application-anchor · 50%
feynman-technique/application-paraphrase · 75%
five-whys/application-anchor · 75%
fowler-patterns/application-anchor · 75%
fowler-patterns/recognition · 50%
gherkin/application-anchor · 75%
gherkin/recognition · 75%
github-flow/application-anchor · 50%
github-flow/application-paraphrase · 0%
github-flow/recognition · 75%
gutes-deutsch-wolf-schneider/recognition · 75%
hexagonal-architecture/recognition · 75%
iec-61508-sil-levels/application-anchor · 0%
iec-61508-sil-levels/application-paraphrase · 0%
impact-mapping/application-anchor · 50%
impact-mapping/application-paraphrase · 25%
impact-mapping/recognition · 75%
invest/recognition · 25%
iso-25010/application-anchor · 50%
iso-25010/application-paraphrase · 75%
jobs-to-be-done/application-anchor · 50%
lasr/application-anchor · 50%
lasr/application-paraphrase · 75%
lasr/recognition · 50%
linddun/application-paraphrase · 50%
llm-evaluations/application-anchor · 75%
llm-evaluations/recognition · 75%
madr/application-anchor · 50%
madr/application-paraphrase · 75%
madr/recognition · 75%
mece/application-anchor · 50%
morphological-box/application-anchor · 25%
morphological-box/application-paraphrase · 75%
moscow/application-paraphrase · 50%
moscow/recognition · 75%
mutation-testing/application-anchor · 75%
mutation-testing/application-paraphrase · 25%
mutation-testing/recognition · 75%
nelson-rules/application-anchor · 50%
owasp-top-10/application-anchor · 75%
owasp-top-10/application-paraphrase · 50%
owasp-top-10/recognition · 75%
plain-english-strunk-white/application-anchor · 50%
plain-english-strunk-white/application-paraphrase · 75%
plain-english-strunk-white/recognition · 50%
prd/application-anchor · 50%
prd/recognition · 25%
problem-space-nvc/application-anchor · 25%
problem-space-nvc/application-paraphrase · 50%
problem-space-nvc/recognition · 75%
property-based-testing/application-anchor · 25%
property-based-testing/recognition · 75%
pyramid-principle/application-anchor · 75%
pyramid-principle/application-paraphrase · 75%
semantic-versioning/application-paraphrase · 75%
socratic-method/application-paraphrase · 75%
socratic-method/recognition · 25%
sota/application-anchor · 50%
sota/application-paraphrase · 75%
spc/application-anchor · 25%
spc/application-paraphrase · 75%
spc/recognition · 50%
stride/application-anchor · 50%
stride/recognition · 50%
swot/application-anchor · 25%
swot/application-paraphrase · 75%
tdd-chicago-school/application-paraphrase · 50%
tdd-chicago-school/recognition · 50%
tdd-london-school/application-anchor · 50%
tdd-london-school/application-paraphrase · 75%
tdd-london-school/consistency-language · 50%
tdd-london-school/consistency-variant-1 · 50%
tdd-london-school/consistency-variant-2 · 75%
tdd-london-school/consistency-variant-3 · 25%
tdd-london-school/recognition · 50%
testing-pyramid/application-paraphrase · 75%
timtowtdi/application-anchor · 75%
timtowtdi/recognition · 75%
todotxt-flavoured-markdown/application-anchor · 75%
todotxt-flavoured-markdown/application-paraphrase · 75%
todotxt-flavoured-markdown/recognition · 75%
user-story-mapping/application-paraphrase · 50%
user-story-mapping/recognition · 75%
wardley-mapping/application-anchor · 50%
wardley-mapping/application-paraphrase · 25%
wardley-mapping/recognition · 50%

Run Metadata

claude-haiku-4-5-20251001:
pilot-20260326-104417_claude-haiku-4-5-20251001.json · 16m 44s · 2026-03-26T10:44:17

claude-opus-4-6:
pilot-20260326-100007_claude-opus-4-6.json · 44m 9s · 2026-03-26T10:00:07

claude-sonnet-4-20250514:
pilot-20260324-174404.json · 81m 2s · 2026-03-24T17:44:04

devstral-2512:
pilot-20260326-073241_devstral-2512.json · 15m 19s · 2026-03-26T07:32:41

gpt-4o:
pilot-20260324-192413.json · 15m 38s · 2026-03-24T19:24:13

gpt-5.4-2026-03-05:
pilot-20260326-110102_gpt-5.4-2026-03-05.json · 92m 42s · 2026-03-26T11:01:02

gpt-5.4-mini-2026-03-17:
pilot-20260326-123346_gpt-5.4-mini-2026-03-17.json · 16m 20s · 2026-03-26T12:33:46

mistral-large-2512:
pilot-20260324-190600.json · 16m 58s · 2026-03-24T19:06:00

mistral-medium-2508:
pilot-20260326-070127_mistral-medium-2508.json · 31m 12s · 2026-03-26T07:01:27

mistral-small-2603:
pilot-20260326-074132_mistral-small-2603.json · 18m 9s · 2026-03-26T07:41:32

Generated by evaluations/generate-report.py · Position bias mitigation: 4 permutations per question · Scoring: deterministic MC (no LLM judge)
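
The footer's scoring scheme (4 permutations per question, deterministic multiple-choice scoring) explains why every question-level value is a multiple of 25%. The sketch below illustrates the idea and is not the code in generate-report.py. It assumes four answer options and uses cyclic rotations as the four orderings, which is one possible choice the report does not confirm. `rotations`, `score_question`, and the `ask` callable (a stand-in for the model call that returns an option letter) are hypothetical names.

```python
def rotations(options):
    """Yield every cyclic rotation of the option list (assumed permutation scheme)."""
    for shift in range(len(options)):
        yield options[shift:] + options[:shift]


def score_question(options, correct, ask):
    """Percent of orderings for which `ask` returns the letter of the correct option."""
    letters = "ABCD"
    orderings = list(rotations(options))
    hits = 0
    for ordering in orderings:
        labelled = dict(zip(letters, ordering))
        # Deterministic scoring: compare the chosen option text with the key.
        if labelled.get(ask(labelled)) == correct:
            hits += 1
    return 100 * hits / len(orderings)


options = ["Arrange-Act-Assert", "Given-When-Then",
           "Red-Green-Refactor", "Plan-Do-Check-Act"]
correct = "Given-When-Then"


def always_a(labelled):
    """Stub respondent with pure position bias."""
    return "A"


def knows_answer(labelled):
    """Stub respondent that finds the correct option wherever it is placed."""
    return next(letter for letter, text in labelled.items() if text == correct)


print(score_question(options, correct, always_a))      # 25.0
print(score_question(options, correct, knows_answer))  # 100.0
```

With rotations, the correct option occupies each position exactly once, so a respondent that always picks the same letter scores 25% regardless of what it knows.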