GPT-3.5 vs GPT-4 Cost-Quality Tradeoff

Status: Running
2026-01-10

Total Users: --
Total Interactions: --
Variants: 2
Primary Metric: response_quality_score

A/B Test Results
Metric                  Control  Treatment  Lift      CI (95%)          P-value  Significant  Effect Size
converted               0.8727   0.9221     +5.66%    [0.0294, 0.0694]  0.0      Yes          0.164
response_quality_score  8.29     9.01       +8.62%    [0.41, 1.01]      0.05     Yes          0.3574
latency_ms              463.0    1183.0     +155.53%  [670.0, 770.0]    0.01     Yes          1.4399
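
As a sanity check on the table above, the Lift column can be recomputed as the relative change from control to treatment, and the Effect Size reported for converted is consistent with Cohen's h for two proportions. The sketch below is an assumption about how these figures were derived, since the dashboard does not state its formulas. The function names are hypothetical, and small differences from the reported lifts are expected because the displayed means are rounded.

```python
import math


def relative_lift(control: float, treatment: float) -> float:
    """Relative change from control to treatment, in percent."""
    return (treatment - control) / control * 100.0


def cohens_h(p_control: float, p_treatment: float) -> float:
    """Cohen's h effect size for the difference between two proportions."""
    return 2.0 * math.asin(math.sqrt(p_treatment)) - 2.0 * math.asin(math.sqrt(p_control))


# (control, treatment) means copied from the results table.
rows = {
    "converted": (0.8727, 0.9221),
    "response_quality_score": (8.29, 9.01),
    "latency_ms": (463.0, 1183.0),
}

for name, (control, treatment) in rows.items():
    print(f"{name}: {relative_lift(control, treatment):+.2f}%")

# Assumption: the dashboard's effect size for a rate metric is Cohen's h.
print(f"Cohen's h for converted: {cohens_h(0.8727, 0.9221):.3f}")
```

The midpoint of each 95% interval matches the absolute difference between treatment and control (for example, 0.0494 for converted and 720.0 for latency_ms), which suggests the CI column is on the absolute difference rather than on the relative lift.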
[Chart: Conversion Rate Over Time]
[Chart: Metric Comparison]
Data Quality Checks
  • Failed
  • Failed
  • Failed
Recommendations
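  • Treatment (GPT-4-Turbo) significantly outperforms control (GPT-3.5-Turbo) on converted (+5.66%) and response_quality_score (+8.62%), but latency_ms is significantly higher (+155.53%)
  • The conversion lift is not significant for the enterprise segment, and all three data quality checks failed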
Subgroup Analysis
User_Tier
Segment     Control Rate  Treatment Rate  Lift    Sample Size  Significant
free        0.8592        0.9122          +6.17%  52047        Yes
enterprise  0.9306        0.9667          +3.89%  3736         No
pro         0.8878        0.9324          +5.02%  32811        Yes
Use_Case
Segment             Control Rate  Treatment Rate  Lift    Sample Size  Significant
creative_writing    0.8632        0.9145          +5.94%  22358        Yes
qa                  0.9019        0.9485          +5.17%  22019        Yes
code_generation     0.8586        0.9145          +6.52%  22041        Yes
text_summarization  0.8673        0.911           +5.03%  22176        Yes
Region_Code
Segment  Control Rate  Treatment Rate  Lift    Sample Size  Significant
LATAM    0.8753        0.9197          +5.07%  22366        Yes
EU       0.8728        0.9232          +5.78%  22689        Yes
NA       0.8763        0.9248          +5.53%  23206        Yes
APAC     0.8659        0.9202          +6.27%  20333        Yes
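
One check that can be run directly on the subgroup tables is whether each dimension partitions the same population: the segment sample sizes should add up to the same total under User_Tier, Use_Case, and Region_Code. The sketch below does this with the sample sizes copied from the tables. The dictionary layout and the function name are illustrative and not part of the dashboard, and the source does not say whether this corresponds to any of the failed data quality checks.

```python
# Sample sizes copied from the subgroup tables, keyed by dimension and segment.
sample_sizes = {
    "User_Tier": {"free": 52047, "enterprise": 3736, "pro": 32811},
    "Use_Case": {
        "creative_writing": 22358,
        "qa": 22019,
        "code_generation": 22041,
        "text_summarization": 22176,
    },
    "Region_Code": {"LATAM": 22366, "EU": 22689, "NA": 23206, "APAC": 20333},
}


def totals_by_dimension(sizes: dict[str, dict[str, int]]) -> dict[str, int]:
    """Sum the segment sample sizes within each subgroup dimension."""
    return {dim: sum(segments.values()) for dim, segments in sizes.items()}


totals = totals_by_dimension(sample_sizes)

# The dimensions are consistent if every one of them sums to the same total.
consistent = len(set(totals.values())) == 1

for dim, total in totals.items():
    print(f"{dim}: {total}")
print("consistent:", consistent)
```

With the figures above, all three dimensions sum to 88594, so the segments appear to cover the same population even though Total Users and Total Interactions are shown as "--" in the header.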