GPT-3.5 vs GPT-4 Cost-Quality Tradeoff - LLM A/B Test

Status: Running
Date: 2026-01-10
Total Users: --
Total Interactions: --
Variants: 2
Primary Metric: response_quality_score
A/B Test Results

Metric	Control	Treatment	Relative Lift	95% CI (absolute difference)	P-value	Significant	Effect Size
`converted`	0.8727	0.9221	+5.66%	[0.0294, 0.0694]	0.0	Yes	0.164
`response_quality_score`	8.29	9.01	+8.62%	[0.41, 1.01]	0.05	Yes	0.3574
`latency_ms`	463.0	1183.0	+155.53%	[670.0, 770.0]	0.01	Yes	1.4399
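
The figures in the table above can be cross-checked with a short sketch. It assumes that the lift is the relative change (treatment - control) / control, that the 95% CI is stated on the absolute difference, and that the effect size for `converted` is Cohen's h. None of these definitions is stated on the page, and the function names are illustrative only.

```python
import math

# Values copied from the results table:
# metric -> (control, treatment, reported lift in %, reported 95% CI)
results = {
    "converted": (0.8727, 0.9221, 5.66, (0.0294, 0.0694)),
    "response_quality_score": (8.29, 9.01, 8.62, (0.41, 1.01)),
    "latency_ms": (463.0, 1183.0, 155.53, (670.0, 770.0)),
}


def relative_lift_pct(control, treatment):
    """Relative change of treatment over control, in percent (assumed definition)."""
    return (treatment - control) / control * 100.0


def cohens_h(p_control, p_treatment):
    """Cohen's h for two proportions (assumed to be the dashboard's effect size)."""
    return 2 * math.asin(math.sqrt(p_treatment)) - 2 * math.asin(math.sqrt(p_control))


for name, (control, treatment, reported, (lo, hi)) in results.items():
    lift = relative_lift_pct(control, treatment)
    diff = treatment - control
    inside = lo <= diff <= hi
    print(f"{name}: lift={lift:.2f}% (reported {reported}%), "
          f"diff={diff:.4f}, inside CI: {inside}")

print(f"Cohen's h for converted: {cohens_h(0.8727, 0.9221):.4f}")
```

Under these assumptions the recomputed lift for `converted` agrees with the table and Cohen's h rounds to the reported 0.164. The recomputed lifts for `response_quality_score` and `latency_ms` differ from the table in the second decimal place, presumably because the displayed means are rounded.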

[Chart: Conversion Rate Over Time]

[Chart: Metric Comparison]

Data Quality Checks

Failed
Failed
Failed

Recommendations

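Treatment (GPT-4-Turbo) significantly outperforms control (GPT-3.5-Turbo) on `converted` (+5.66%) and on the primary metric `response_quality_score` (+8.62%), at the cost of higher `latency_ms` (463.0 to 1183.0, +155.53%).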

Subgroup Analysis

User_Tier

Segment	Control Rate	Treatment Rate	Lift	Sample Size	Significant
`free`	0.8592	0.9122	+6.17%	52047	Yes
`enterprise`	0.9306	0.9667	+3.89%	3736	No
`pro`	0.8878	0.9324	+5.02%	32811	Yes

Use_Case

Segment	Control Rate	Treatment Rate	Lift	Sample Size	Significant
`creative_writing`	0.8632	0.9145	+5.94%	22358	Yes
`qa`	0.9019	0.9485	+5.17%	22019	Yes
`code_generation`	0.8586	0.9145	+6.52%	22041	Yes
`text_summarization`	0.8673	0.9110	+5.03%	22176	Yes

Region_Code

Segment	Control Rate	Treatment Rate	Lift	Sample Size	Significant
`LATAM`	0.8753	0.9197	+5.07%	22366	Yes
`EU`	0.8728	0.9232	+5.78%	22689	Yes
`NA`	0.8763	0.9248	+5.53%	23206	Yes
`APAC`	0.8659	0.9202	+6.27%	20333	Yes
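
As a sanity check on the subgroup tables, the segment sample sizes within each dimension should add up to the same total if every unit is assigned to exactly one segment per dimension. The sketch below assumes that the Sample Size column counts both variants combined. The page does not say whether it counts users or interactions.

```python
# Sample sizes copied from the three subgroup tables.
subgroups = {
    "User_Tier": {"free": 52047, "enterprise": 3736, "pro": 32811},
    "Use_Case": {
        "creative_writing": 22358,
        "qa": 22019,
        "code_generation": 22041,
        "text_summarization": 22176,
    },
    "Region_Code": {"LATAM": 22366, "EU": 22689, "NA": 23206, "APAC": 20333},
}

# Sum the segments of each dimension and compare the totals.
totals = {dim: sum(sizes.values()) for dim, sizes in subgroups.items()}
consistent = len(set(totals.values())) == 1

for dim, total in totals.items():
    print(f"{dim}: {total}")
print(f"All dimensions agree: {consistent}")
```

Each dimension sums to 88594, so the three breakdowns partition the same population.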