GPT-4 vs Claude-3 Code Generation

Status: Completed
Test Period: 2025-12-01 to 2026-01-25

Total Users: --
Total Interactions: --
Variants: 2
Primary Metric: response_quality_score
A/B Test Results
Metric                  Control  Treatment  Lift    CI (95%)          P-value  Significant  Effect Size
converted               0.9185   0.9403     +2.37%  [0.0018, 0.0418]  0.0      Yes          0.0852
response_quality_score  8.99     9.17       +1.98%  [-0.12, 0.48]     0.15     No           0.0888
latency_ms              1139.0   1346.0     +18.2%  [157.0, 257.0]    0.01     Yes          0.4147
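For a conversion-style metric, the lift, confidence interval, and p-value in the table above can be reproduced with a standard two-proportion z-test. The sketch below is illustrative, not the report's actual pipeline: the report does not state per-arm sample sizes (Total Users is blank), so the 44,500-per-arm figure is a placeholder and the resulting CI and p-value will differ from the reported ones.

```python
from statistics import NormalDist

def ab_summary(p_c, p_t, n_c, n_t, alpha=0.05):
    """Relative lift, CI on the rate difference, and two-sided p-value
    for a two-proportion z-test."""
    diff = p_t - p_c
    lift = diff / p_c                       # relative lift vs control
    # Unpooled standard error for the confidence interval on the difference
    se = (p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z * se, diff + z * se)
    # Pooled standard error for the null-hypothesis test (p_c == p_t)
    p_pool = (p_c * n_c + p_t * n_t) / (n_c + n_t)
    se_pool = (p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t)) ** 0.5
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se_pool))
    return lift, ci, p_value

# converted: 0.9185 vs 0.9403; per-arm n is a placeholder (not in the report)
lift, ci, p = ab_summary(0.9185, 0.9403, 44_500, 44_500)
print(f"lift={lift:+.2%}  CI=[{ci[0]:.4f}, {ci[1]:.4f}]  p={p:.4f}")
```

The +2.37% lift matches the table because it depends only on the two rates; the CI width and p-value depend on the unknown sample sizes.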
[Chart: Conversion Rate Over Time]
[Chart: Metric Comparison]
Data Quality Checks
  • 3 of 3 checks failed (check names not captured in this export)
Recommendations
  • Treatment (Claude-3-Opus) significantly outperforms control (GPT-4-Turbo) on conversion (+2.37%, p < 0.05), but also increases latency significantly (+18.2%); the primary metric, response_quality_score, shows no significant lift (p = 0.15).
Subgroup Analysis
User_Tier
Segment     Control Rate  Treatment Rate  Lift    Sample Size  Significant
free        0.9096        0.9307          +2.32%  51522        No
enterprise  0.9646        0.9701          +0.57%  3041         No
pro         0.9284        0.9516          +2.50%  34502        No
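As a quick consistency check, the headline conversion rates can be approximately recovered as sample-size-weighted averages of the tier rates above. This sketch assumes each row's Sample Size counts control and treatment users combined and weights both arms equally (the export does not state this):

```python
# (segment, control_rate, treatment_rate, sample_size) from the user_tier table
tiers = [
    ("free",       0.9096, 0.9307, 51_522),
    ("enterprise", 0.9646, 0.9701, 3_041),
    ("pro",        0.9284, 0.9516, 34_502),
]

total = sum(n for _, _, _, n in tiers)
ctrl = sum(c * n for _, c, _, n in tiers) / total  # weighted control rate
trt  = sum(t * n for _, _, t, n in tiers) / total  # weighted treatment rate
print(f"weighted control={ctrl:.4f}  treatment={trt:.4f}")
# ~0.9188 and ~0.9401, within rounding of the headline 0.9185 / 0.9403
```

The small residual gap is consistent with rounding of the per-segment rates to four decimal places.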
Use_Case
Segment             Control Rate  Treatment Rate  Lift    Sample Size  Significant
creative_writing    0.9103        0.9345          +2.66%  22285        No
qa                  0.9469        0.9633          +1.73%  22184        No
code_generation     0.9091        0.9348          +2.82%  22255        No
text_summarization  0.9078        0.9288          +2.31%  22341        No
Region_Code
Segment  Control Rate  Treatment Rate  Lift    Sample Size  Significant
LATAM    0.9181        0.9380          +2.16%  24118        No
EU       0.9180        0.9434          +2.77%  23365        No
NA       0.9213        0.9407          +2.11%  20183        No
APAC     0.9167        0.9390          +2.43%  21399        No