Gemini-1.5 vs GPT-4 Summarization

Status: Running
Date: 2026-01-15

Total Users: --
Total Interactions: --
Variants: 2
Primary Metric: converted
A/B Test Results
Metric                  Control  Treatment  Lift     CI (95%)           P-value  Significant  Effect Size
converted               0.9178   0.916      -0.2%    [-0.0218, 0.0182]  0.3405   No           -0.0065
response_quality_score  8.98     8.86       -1.3%    [-0.42, 0.18]      0.15     No           -0.0584
latency_ms              1128.0   948.0      -15.96%  [-230.0, -130.0]   0.01     Yes          -0.36
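
The Lift column appears to be the relative change of treatment against control, while the CI column is on the absolute scale of each metric. The report does not state how Effect Size is computed. The sketch below recomputes the lift from the tabulated values and, as an assumption, tries Cohen's h for the conversion rate, which lands close to the reported -0.0065. The function names are illustrative and not part of the dashboard. The effect sizes for response_quality_score and latency_ms cannot be reproduced this way because the table carries no standard deviations.

```python
import math


def relative_lift_pct(control: float, treatment: float) -> float:
    """Relative change of treatment versus control, in percent."""
    return (treatment - control) / control * 100.0


def arcsine_transform(p: float) -> float:
    """Variance-stabilizing transform used by Cohen's h."""
    return 2.0 * math.asin(math.sqrt(p))


def cohens_h(p_control: float, p_treatment: float) -> float:
    """Cohen's h for two proportions (assumed here, not confirmed by the report)."""
    return arcsine_transform(p_treatment) - arcsine_transform(p_control)


# (control, treatment) pairs copied from the A/B Test Results table.
rows = {
    "converted": (0.9178, 0.916),
    "response_quality_score": (8.98, 8.86),
    "latency_ms": (1128.0, 948.0),
}

for name, (control, treatment) in rows.items():
    print(f"{name}: {relative_lift_pct(control, treatment):+.2f}%")

print(f"Cohen's h (converted): {cohens_h(0.9178, 0.916):.4f}")
```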
[Chart: Conversion Rate Over Time]
[Chart: Metric Comparison]
Data Quality Checks
  • Failed
  • Failed
  • Failed
Recommendations
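  • No statistically significant difference detected in the primary metric (converted); among the other metrics, only latency_ms differs significantly. Consider extending the experiment.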
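
How long an extension would need to run depends on the sample size required to detect the observed gap. As a rough sketch, the standard normal-approximation formula for comparing two proportions can be applied to the tabulated conversion rates. The significance level (0.05, two-sided), the power (0.80), and the function name are assumptions, not values from the report, and treating the observed rates as the true effect is itself optimistic.

```python
import math
from statistics import NormalDist


def required_n_per_arm(p_control: float, p_treatment: float,
                       alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-arm sample size for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)               # quantile for the target power
    variance_sum = (p_control * (1.0 - p_control)
                    + p_treatment * (1.0 - p_treatment))
    delta = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / delta ** 2)


# Observed conversion rates for control and treatment (from the results table).
n = required_n_per_arm(0.9178, 0.916)
print(f"Approximate users needed per arm: {n}")
```

A gap this small relative to the baseline rate calls for a per-arm sample in the hundreds of thousands under these assumptions, so a modest extension is unlikely to change the conclusion on converted.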
Subgroup Analysis
User_Tier
Segment     Control Rate  Treatment Rate  Lift    Sample Size  Significant
free        0.9072        0.9053          -0.2%   50079        No
enterprise  0.9634        0.954           -0.98%  3645         No
pro         0.9281        0.9299          +0.2%   31898        No
Use_Case
Segment             Control Rate  Treatment Rate  Lift    Sample Size  Significant
creative_writing    0.9082        0.9109          +0.29%  21315        No
qa                  0.9462        0.9457          -0.06%  21619        No
code_generation     0.9081        0.9002          -0.87%  21322        No
text_summarization  0.9085        0.9069          -0.18%  21366        No
Region_Code
Segment  Control Rate  Treatment Rate  Lift    Sample Size  Significant
LATAM    0.9147        0.9167          +0.22%  21497        No
EU       0.9155        0.9181          +0.29%  21687        No
NA       0.9197        0.9148          -0.53%  21269        No
APAC     0.9212        0.9143          -0.75%  21169        No
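
Since the data quality checks are reported as failed without further detail, one simple sanity check a reader can run from this report alone is whether the segment sample sizes add up to the same total in every dimension. The sketch below does that with the counts copied from the three subgroup tables. The report does not say whether these counts are users or interactions, so the total is left unlabeled.

```python
# Sample sizes copied from the Subgroup Analysis tables.
segments = {
    "User_Tier": {"free": 50079, "enterprise": 3645, "pro": 31898},
    "Use_Case": {
        "creative_writing": 21315,
        "qa": 21619,
        "code_generation": 21322,
        "text_summarization": 21366,
    },
    "Region_Code": {"LATAM": 21497, "EU": 21687, "NA": 21269, "APAC": 21169},
}

# Sum each dimension and check that every dimension covers the same population.
totals = {dim: sum(counts.values()) for dim, counts in segments.items()}
consistent = len(set(totals.values())) == 1

for dim, total in totals.items():
    print(f"{dim}: {total}")
print("Totals agree across dimensions:", consistent)
```

All three dimensions sum to 85,622, so the segment breakdowns at least partition the same population.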