Two-Sample Z-Test Calculator

Based on the standard normal distribution (μ = 0, σ = 1). Tests H₀: μ₁ = μ₂ for two independent samples with known σ.

Quick Answer

The two-sample z-test calculator computes the test statistic z = (x̄₁ − x̄₂) / √(σ₁²/n₁ + σ₂²/n₂) given each independent group's sample mean, known population standard deviation, and sample size. It returns the p-value for left-tailed, right-tailed, or two-tailed alternatives, the critical z at the chosen α, and a reject / fail-to-reject decision against H₀: μ₁ = μ₂. Use it when both σ's are known and you want to compare two independent group means against each other.

Example: Comparing two production lines with x̄₁ = 105 (σ₁ = 15, n₁ = 36) vs. x̄₂ = 100 (σ₂ = 12, n₂ = 49), two-tailed at α = 0.05: standard error = √(225/36 + 144/49) ≈ 3.0313, so z = (105 − 100) / 3.0313 ≈ 1.649; the two-tailed p-value is 2 × (1 − Φ(1.649)) ≈ 0.0991, and since |1.649| < 1.96, fail to reject H₀.

Worked Examples

Two-Tailed

Bolt strength A vs. B: x̄₁ = 105 (n₁=36, σ₁=15) vs. x̄₂ = 100 (n₂=49, σ₂=12)

A factory tests whether two production lines produce bolts with different mean tensile strength at α = 0.05.

State H₀: μ₁ = μ₂; H₁: μ₁ ≠ μ₂ (two-tailed).
Standard error of the difference: √(15²/36 + 12²/49) ≈ 3.0313.
z = (105 − 100) / 3.0313 ≈ 1.6494.
Two-tailed p-value: 2 × (1 − Φ(1.6494)) ≈ 0.0991.
Critical values at α = 0.05: ±1.96.
Since |1.6494| < 1.96 and p > 0.05, fail to reject H₀.

A non-significant result means the data don't rule out equal means — but they also don't confirm it. Report the mean difference (5 N) and a confidence interval alongside the p-value to give context.
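As a rough illustration of the confidence interval mentioned above, the sketch below computes a 95% interval for μ₁ − μ₂ from the same inputs. It relies on Python's standard-library `statistics.NormalDist` for the normal quantile; the variable names are illustrative and are not part of the calculator.

```python
from math import sqrt
from statistics import NormalDist

# Inputs from the bolt-strength example
mean1, sd1, n1 = 105, 15, 36
mean2, sd2, n2 = 100, 12, 49
alpha = 0.05

diff = mean1 - mean2
se = sqrt(sd1**2 / n1 + sd2**2 / n2)          # standard error of the difference
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for a 95% interval

ci_low = diff - z_crit * se
ci_high = diff + z_crit * se
print(f"difference = {diff}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The interval runs from roughly −0.94 to 10.94 and therefore contains 0, which agrees with the fail-to-reject decision at α = 0.05.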

Right-Tailed

Treatment vs. control: x̄₁ = 78 (n₁=50, σ₁=10) vs. x̄₂ = 75 (n₂=50, σ₂=10)

A clinical study tests whether a new treatment raises a mean outcome above a control's at α = 0.05.

State H₀: μ₁ = μ₂; H₁: μ₁ > μ₂ (right-tailed).
Standard error: √(10²/50 + 10²/50) = √4 = 2.
z = (78 − 75) / 2 = 1.5.
Right-tailed p-value: 1 − Φ(1.5) ≈ 0.0668.
Critical value at α = 0.05: 1.6449.
Since 1.5 < 1.6449 and p > 0.05, fail to reject H₀.

Borderline result — at a more lenient α = 0.10 the test would reject. With equal n's and σ's the standard error simplifies to σ × √(2/n).
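The equal-n, equal-σ shortcut can be checked numerically. A minimal sketch using the treatment-vs-control numbers above and the standard-library normal CDF (variable names are illustrative):

```python
from math import sqrt, isclose
from statistics import NormalDist

sigma, n = 10, 50          # shared known σ and per-group sample size
mean1, mean2 = 78, 75

se_general = sqrt(sigma**2 / n + sigma**2 / n)  # general formula
se_simple = sigma * sqrt(2 / n)                 # shortcut for equal n and σ
assert isclose(se_general, se_simple)

z = (mean1 - mean2) / se_simple
p_right = 1 - NormalDist().cdf(z)               # right-tailed p-value

# Compare the same p-value against two significance levels
for alpha in (0.05, 0.10):
    decision = "reject H0" if p_right < alpha else "fail to reject H0"
    print(f"alpha = {alpha:.2f}: p = {p_right:.4f} -> {decision}")
```

Both formulas give a standard error of 2, and the same p-value leads to different decisions at the two α levels.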

Left-Tailed

Battery brand A vs. B: x̄₁ = 9.2 (n₁=40, σ₁=1.5) vs. x̄₂ = 10 (n₂=40, σ₂=1.2)

A consumer-watchdog tests whether brand A's mean battery life is below brand B's at α = 0.05.

State H₀: μ₁ = μ₂; H₁: μ₁ < μ₂ (left-tailed).
Standard error: √(1.5²/40 + 1.2²/40) = √(0.05625 + 0.036) = √0.09225 ≈ 0.30373.
z = (9.2 − 10) / 0.30373 ≈ −2.6339.
Left-tailed p-value: Φ(−2.6339) ≈ 0.0042.
Critical value at α = 0.05: −1.6449.
Since −2.6339 < −1.6449 and p < 0.05, reject H₀.

Strong evidence that brand A's mean battery life is below brand B's. The result would still reject at the stricter α = 0.01 level (p ≈ 0.0042 < 0.01).
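The same decision can be reached by comparing z with the critical value instead of the p-value. The sketch below does this for the battery example at both α levels, taking the critical value from the standard library's inverse normal CDF; names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mean1, sd1, n1 = 9.2, 1.5, 40   # brand A
mean2, sd2, n2 = 10.0, 1.2, 40  # brand B

se = sqrt(sd1**2 / n1 + sd2**2 / n2)
z = (mean1 - mean2) / se

# Left-tailed test: reject H0 when z falls below the alpha-quantile of N(0, 1)
decisions = {}
for alpha in (0.05, 0.01):
    z_crit = NormalDist().inv_cdf(alpha)
    decisions[alpha] = z < z_crit
    print(f"alpha = {alpha}: critical z = {z_crit:.4f}, z = {z:.2f}, reject = {decisions[alpha]}")
```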

Two-Sample Z-Test Statistic

The z-statistic measures how many standard errors apart the two sample means are. The numerator is the observed difference in sample means; the denominator is the standard error of that difference, built from each group's known population variance and sample size. Larger samples shrink the standard error, letting smaller mean differences reach significance.

z = (x̄₁ − x̄₂) / √(σ₁²/n₁ + σ₂²/n₂)
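For readers who want to see where the denominator comes from, here is a short derivation under the stated assumptions (independent samples, known σ's). Independence is what allows the two variances to be added.

```latex
\begin{aligned}
\operatorname{Var}(\bar{x}_1 - \bar{x}_2)
  &= \operatorname{Var}(\bar{x}_1) + \operatorname{Var}(\bar{x}_2)
   = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}, \\
\operatorname{SE}(\bar{x}_1 - \bar{x}_2)
  &= \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}, \\
z &= \frac{\bar{x}_1 - \bar{x}_2}{\operatorname{SE}(\bar{x}_1 - \bar{x}_2)}
\end{aligned}
```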

How It Works

A two-sample z-test asks whether two independent groups' population means differ by more than chance would predict. You need six ingredients: each group's sample mean (x̄₁, x̄₂), its known population standard deviation (σ₁, σ₂), and its sample size (n₁, n₂). The calculator computes the standard error of the mean difference √(σ₁²/n₁ + σ₂²/n₂), divides the observed difference x̄₁ − x̄₂ by it to get the z-statistic, and converts that z to a p-value via the standard normal distribution. Compare the p-value to your significance level α to decide whether to reject H₀: μ₁ = μ₂. Use a left-tailed test when H₁ says μ₁ < μ₂, a right-tailed test when H₁ says μ₁ > μ₂, and a two-tailed test when the alternative is just μ₁ ≠ μ₂.
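The procedure described above can be sketched as a single function. This is an illustrative implementation, not the calculator's own code: the function name `two_sample_z_test` and the `tail` labels are assumptions, and the normal CDF comes from Python's standard library.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_test(mean1, sd1, n1, mean2, sd2, n2, tail="two"):
    """Return (z, p_value) for H0: mu1 = mu2 with known population SDs.

    tail is "left" (H1: mu1 < mu2), "right" (H1: mu1 > mu2), or "two".
    """
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)   # standard error of the difference
    z = (mean1 - mean2) / se
    cdf = NormalDist().cdf
    if tail == "left":
        p = cdf(z)
    elif tail == "right":
        p = 1 - cdf(z)
    elif tail == "two":
        p = 2 * (1 - cdf(abs(z)))
    else:
        raise ValueError("tail must be 'left', 'right', or 'two'")
    return z, p


z, p = two_sample_z_test(105, 15, 36, 100, 12, 49, tail="two")
print(f"z = {z:.3f}, p = {p:.4f}, reject at 0.05: {p < 0.05}")
```

Passing `tail="left"` or `tail="right"` with the same six inputs gives the corresponding one-tailed p-value.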

Example Problem

A factory tests two manufacturing lines. Line A produces 36 bolts with mean tensile strength x̄₁ = 105 N (known σ₁ = 15 N). Line B produces 49 bolts with mean tensile strength x̄₂ = 100 N (known σ₂ = 12 N). Test whether the lines produce bolts with different mean strength at α = 0.05 (two-tailed).

State H₀: μ₁ = μ₂ and H₁: μ₁ ≠ μ₂. The two-tailed alternative makes no directional claim.
Compute the standard error of the difference: √(15²/36 + 12²/49) = √(6.25 + 2.939) ≈ √9.189 ≈ 3.0313.
Compute the z-statistic: z = (105 − 100) / 3.0313 ≈ 1.6494.
Find the two-tailed p-value: p = 2 × (1 − Φ(|1.6494|)) ≈ 0.0991.
Find the critical values at α = 0.05: ±z_{0.975} = ±1.96.
Compare: |1.6494| < 1.96 and p ≈ 0.0991 > 0.05, so we fail to reject H₀.
Conclusion: at α = 0.05 the data do not provide enough evidence to conclude the two lines produce bolts with different mean tensile strength.

A non-significant result is not proof that the two lines are identical — it just says the difference observed (5 N) is within what sampling noise can produce when σ₁ = 15, σ₂ = 12, and the samples are this size. A larger study or a tighter σ would have more power to detect the same effect.
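To put a rough number on the power remark, the sketch below computes the power of the two-tailed test under the assumption that the true difference really is 5 N. That assumption, the doubled sample sizes, and the smaller σ's are hypothetical choices for illustration and do not come from the example.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_power(true_diff, sd1, n1, sd2, n2, alpha=0.05):
    """Probability that a two-tailed z-test rejects H0 when mu1 - mu2 = true_diff."""
    norm = NormalDist()
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = true_diff / se  # mean of z under the alternative
    # Reject when z > z_crit or z < -z_crit
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)


power_now = two_sided_power(5, 15, 36, 12, 49)      # design from the example
power_doubled = two_sided_power(5, 15, 72, 12, 98)  # hypothetical: both samples doubled
power_tighter = two_sided_power(5, 10, 36, 8, 49)   # hypothetical: smaller σ's
print(f"{power_now:.2f} {power_doubled:.2f} {power_tighter:.2f}")
```

Under these assumptions the design in the example detects a true 5 N difference only about 38% of the time, and both the larger samples and the smaller σ's raise that figure.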

Key Concepts

The two-sample z-test rests on three quantities. First is the observed mean difference (x̄₁ − x̄₂), the raw effect. Second is the standard error of the difference √(σ₁²/n₁ + σ₂²/n₂) — this is the typical sampling fluctuation in that difference and shrinks as either sample grows or either σ shrinks. Third is the significance level α you choose in advance. Independence matters: the two samples must be drawn from separate populations, not paired or matched. When samples are paired (before/after on the same subjects), use a paired test on the differences instead. The test also assumes both σ's are known. When σ's are estimated from the samples and n is small, switch to Welch's t-test. With large samples, the z-test and Welch's t-test give nearly identical answers because the t-distribution converges to the normal.

Applications

Manufacturing — comparing mean output, dimensions, or strength between two production lines, machines, or shifts
A/B testing with known variance — comparing the mean of a metric between control and treatment when historical σ is reliable
Clinical research — comparing a new treatment group's mean outcome to a control group's, using historical σ from prior trials
Educational comparison — comparing mean test scores between two schools, programs, or curriculum versions when σ is established
Marketing — comparing mean purchase amounts, session times, or response rates between two customer segments
Quality assurance — comparing mean defect rates or measurement values between two suppliers or vendors

Common Mistakes

Using a two-sample z-test when σ₁ or σ₂ is unknown and the samples are small — switch to Welch's t-test
Treating paired data (before/after on the same subjects) as independent — use a paired test on the differences instead
Choosing the tail direction after seeing the data — pre-specify it from the alternative hypothesis
Ignoring sample independence — the two groups must come from separate populations, not overlapping or matched units
Using pooled σ when group variances are clearly different — the formula √(σ₁²/n₁ + σ₂²/n₂) does not assume equal variances; pooling is only appropriate when σ₁ = σ₂
Reporting only the p-value without the mean difference, standard error, or confidence interval for the difference

Frequently Asked Questions

What is a two-sample z-test?

A two-sample z-test compares the means of two independent groups when both population standard deviations are known. It computes z = (x̄₁ − x̄₂) / √(σ₁²/n₁ + σ₂²/n₂) and reports a p-value indicating how surprising the observed difference would be if the two population means were truly equal.

When should I use a z-test instead of a t-test?

Use a two-sample z-test when both population standard deviations σ₁ and σ₂ are known, or when both sample sizes are large enough (commonly n > 30 each) that the sample standard deviations are reliable proxies. Use Welch's t-test when σ's are unknown and samples are small. With large samples both methods converge.

Does this test assume the two groups have equal variances?

No. The standard error √(σ₁²/n₁ + σ₂²/n₂) lets each group contribute its own variance — it does not assume σ₁ = σ₂. This matches the structure of Welch's t-test and is appropriate even when the variances clearly differ. If σ₁ = σ₂ the calculator still works correctly.

When should I use a paired test instead?

Use a paired test when each observation in group 1 is naturally matched to one in group 2: before/after measurements on the same subjects, twin pairs, or matched-pair experimental designs. The two-sample z-test assumes the groups are independent, so applying it to paired data ignores the correlation within pairs. When that correlation is positive, as it usually is, the test overstates the standard error, wastes power, and can be misleading.

What does the p-value tell me?

The p-value is the probability of observing a mean difference at least as extreme as yours, assuming the two population means are equal. Smaller p-values indicate stronger evidence against H₀: μ₁ = μ₂. A p-value below your chosen α leads to rejecting H₀, but the p-value does not measure the size or practical importance of the difference.
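One way to make this definition concrete is to simulate it. The sketch below draws many pairs of sample means under H₀, using the σ's and n's from the bolt example, and counts how often the simulated |z| is at least as large as the observed one. It assumes the sample means are normally distributed; the common mean of 100 and the seed are arbitrary choices.

```python
import random
from math import sqrt

random.seed(0)  # fixed seed so the run is repeatable

sd1, n1 = 15, 36
sd2, n2 = 12, 49
se = sqrt(sd1**2 / n1 + sd2**2 / n2)
observed_z = 5 / se          # observed difference of 5 from the bolt example
common_mean = 100            # arbitrary: under H0 both lines share one mean

trials = 100_000
extreme = 0
for _ in range(trials):
    # Each sample mean is normal with standard deviation sd / sqrt(n)
    xbar1 = random.gauss(common_mean, sd1 / sqrt(n1))
    xbar2 = random.gauss(common_mean, sd2 / sqrt(n2))
    if abs((xbar1 - xbar2) / se) >= observed_z:
        extreme += 1

simulated_p = extreme / trials
print(f"simulated two-tailed p = {simulated_p:.3f}")
```

The simulated fraction should land close to the analytic two-tailed p-value of about 0.099, up to Monte Carlo noise.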

How do I choose the tail direction?

Choose left-tailed when your alternative hypothesis is μ₁ < μ₂, right-tailed when it is μ₁ > μ₂, and two-tailed when it is μ₁ ≠ μ₂. Decide before you look at the data — picking the tail post hoc inflates the false-positive rate.

Can I run the test if my groups have very different sample sizes?

Yes. The standard error formula √(σ₁²/n₁ + σ₂²/n₂) handles unequal sample sizes naturally. The smaller group's term dominates the standard error and sets a floor of σ/√n for that group, which no amount of extra data in the larger group can lower. With equal σ's, a 20-vs-200 design has about the precision of a balanced 36-vs-36 design, and growing the larger group further brings diminishing returns.
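To see how much the larger group contributes, the sketch below holds the smaller group at n = 20 and grows the other one. It assumes equal σ's (set to 1, so results are in units of σ); the helper name `se_diff` is illustrative.

```python
from math import sqrt

sigma = 1.0      # assumed equal in both groups, so results are in units of σ
n_small = 20


def se_diff(n1, n2, sd=sigma):
    """Standard error of the mean difference for equal known SDs."""
    return sqrt(sd**2 / n1 + sd**2 / n2)


floor = sigma / sqrt(n_small)   # limit as the larger group grows without bound
results = {n_large: se_diff(n_small, n_large) for n_large in (20, 200, 2000)}

for n_large, se in results.items():
    print(f"{n_small} vs {n_large}: SE = {se:.4f}")
print(f"floor: {floor:.4f}")
```

The standard error falls from about 0.316σ (20 vs 20) to 0.235σ (20 vs 200) and then levels off near the floor of σ/√20 ≈ 0.224σ.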

What does it mean to swap the order of the two groups?

Swapping group 1 and group 2 flips the sign of z and of the mean difference, but leaves the two-tailed p-value unchanged. For a one-tailed test, the tail direction must be reversed along with the groups (left becomes right); the p-value and the conclusion are then the same either way.
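A quick numerical check of the swap, using the battery example's inputs. The helper `z_stat` and the tuple layout are illustrative choices, not part of the calculator.

```python
from math import sqrt
from statistics import NormalDist


def z_stat(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample z-statistic with known population SDs."""
    return (mean1 - mean2) / sqrt(sd1**2 / n1 + sd2**2 / n2)


cdf = NormalDist().cdf
a = (9.2, 1.5, 40)   # brand A: mean, σ, n
b = (10.0, 1.2, 40)  # brand B: mean, σ, n

z_ab = z_stat(*a, *b)
z_ba = z_stat(*b, *a)        # same data, groups swapped

p_two_ab = 2 * (1 - cdf(abs(z_ab)))
p_two_ba = 2 * (1 - cdf(abs(z_ba)))
p_left_ab = cdf(z_ab)        # H1: mu_A < mu_B, group order A, B
p_right_ba = 1 - cdf(z_ba)   # same claim, group order B, A

print(f"z_ab = {z_ab:.3f}, z_ba = {z_ba:.3f}")
print(p_two_ab == p_two_ba, abs(p_left_ab - p_right_ba) < 1e-12)
```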

Reference: The two-sample z-test computes z = (x̄₁ − x̄₂) / √(σ₁²/n₁ + σ₂²/n₂) and converts to a p-value using the standard normal cumulative distribution function via the Abramowitz and Stegun rational approximation. Critical values are produced from the inverse normal CDF (Acklam's rational approximation) at the chosen significance level α. The formula assumes the two samples are independent and both population standard deviations are known.
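For context, one widely used Abramowitz and Stegun approximation of the normal CDF is formula 26.2.17, sketched below. The reference does not say which formula the calculator uses, so treating it as 26.2.17 is an assumption; the comparison against `math.erf` is only a sanity check.

```python
from math import erf, exp, pi, sqrt

# Coefficients of Abramowitz & Stegun formula 26.2.17 (an assumption: the
# reference above does not say which A&S formula the calculator uses).
P = 0.2316419
B = (0.319381530, -0.356563782, 1.781477937, -1.821255978, 1.330274429)


def phi_as(x):
    """Approximate the standard normal CDF Φ(x)."""
    t = 1.0 / (1.0 + P * abs(x))
    poly = sum(b * t ** (i + 1) for i, b in enumerate(B))
    density = exp(-0.5 * x * x) / sqrt(2.0 * pi)
    upper_tail = density * poly          # approximates 1 - Φ(|x|)
    return 1.0 - upper_tail if x >= 0 else upper_tail


def phi_exact(x):
    """Reference value computed from the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


max_err = max(abs(phi_as(k / 100) - phi_exact(k / 100)) for k in range(-500, 501))
print(f"max abs error on [-5, 5]: {max_err:.1e}")
```

The published error bound for this formula is 7.5 × 10⁻⁸, which is far smaller than the four decimal places shown in the examples.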

Related Calculators

One-Sample Z-Test — Compare a single sample mean to a hypothesized μ₀.
P-Value Calculator — Convert any z-score to a p-value directly.
Z-Score Calculator — Tail and middle-area probabilities for any z.
Inverse Z-Score — Critical z from probability or alpha level.
Z-Table — Standard normal lookup table.
