Equivalence test

Equivalence tests are a variation of hypothesis tests used to draw statistical inferences from observed data. In equivalence tests, the null hypothesis is defined as an effect large enough to be deemed interesting, specified by an equivalence bound. The alternative hypothesis is any effect that is less extreme than said equivalence bound. The observed data are statistically compared against the equivalence bounds. If the statistical test indicates that the observed data are surprising, assuming that true effects are at least as extreme as the equivalence bounds, a Neyman-Pearson approach to statistical inferences can be used to reject effect sizes larger than the equivalence bounds with a pre-specified Type 1 error rate.

Equivalence testing originates from the field of pharmacokinetics.[1] One application is to show that a new drug that is cheaper than available alternatives works just as well as an existing drug. In essence, equivalence tests consist of calculating a confidence interval around an observed effect size, and rejecting effects more extreme than the equivalence bound when the confidence interval does not overlap with the equivalence bound. In two-sided tests an upper and lower equivalence bound is specified. In non-inferiority trials, where the goal is to test the hypothesis that a new treatment is not worse than existing treatments, only a lower equivalence bound is pre-specified.
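
The confidence-interval reading described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `ci_equivalence`, its parameters, and the example numbers are hypothetical, and the interval uses a large-sample normal approximation (an assumption made here so that only the Python standard library is needed) rather than a t-based interval. The interval has coverage 1 − 2α, so α = 0.05 gives the 90% interval commonly paired with equivalence tests.

```python
from statistics import NormalDist


def ci_equivalence(diff, se, lower, upper=None, alpha=0.05):
    """Compare a (1 - 2*alpha) confidence interval with the equivalence bounds.

    diff  : observed effect (e.g. a mean difference)
    se    : standard error of that effect
    lower : lower equivalence bound
    upper : upper equivalence bound, or None for a non-inferiority check
            in which only the lower bound is specified
    Assumption: normal approximation, reasonable only for large samples.
    """
    z = NormalDist().inv_cdf(1 - alpha)
    ci_low, ci_high = diff - z * se, diff + z * se
    # Effects beyond a bound are rejected when the interval stays clear of it.
    within = ci_low > lower and (upper is None or ci_high < upper)
    return (ci_low, ci_high), within


# Two-sided equivalence with bounds -0.5 and 0.5 (illustrative numbers).
ci_two_sided, equivalent = ci_equivalence(0.1, 0.1, -0.5, 0.5)

# Non-inferiority: only the lower bound is checked.
ci_one_sided, non_inferior = ci_equivalence(-0.2, 0.1, -0.5)

# An effect close to the upper bound: the interval crosses 0.5.
_, too_close = ci_equivalence(0.4, 0.1, -0.5, 0.5)
```

With small samples a t-based interval would be wider than this approximation, so the sketch is somewhat more permissive than a t-based version of the same check.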

Mean differences (black squares) and 90% confidence intervals (horizontal lines) with equivalence bounds ΔL = -0.5 and ΔU = 0.5 for four combinations of test results that are statistically equivalent or not and statistically different from zero or not. Pattern A is statistically equivalent, pattern B is statistically different from 0, pattern C is practically insignificant, and pattern D is inconclusive (neither statistically different from 0 nor equivalent).

Equivalence tests can be performed in addition to null-hypothesis significance tests.[2][3][4][5] This might prevent common misinterpretations of p-values larger than the alpha level as support for the absence of a true effect. Furthermore, equivalence tests can identify effects that are statistically significant but practically insignificant, whenever effects are statistically different from zero, but also statistically smaller than any effect size deemed worthwhile (see first Figure).[6]
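
The four possible outcomes of combining a null-hypothesis significance test with an equivalence test can be written down as a small decision rule. The sketch below assumes both tests are run at α = 0.05, so that "different from zero" corresponds to a 95% confidence interval excluding zero and "equivalent" corresponds to a 90% confidence interval lying inside the bounds. The function name `classify` and the example intervals are hypothetical and chosen only for illustration.

```python
def classify(ci90, ci95, lower, upper):
    """Combine an equivalence test and a significance test into one verdict.

    ci90 : (low, high) 90% confidence interval, used for equivalence
    ci95 : (low, high) 95% confidence interval, used for the test against zero
    lower, upper : equivalence bounds
    """
    equivalent = lower < ci90[0] and ci90[1] < upper
    different_from_zero = ci95[0] > 0 or ci95[1] < 0
    if equivalent and different_from_zero:
        return "different from zero but practically insignificant"
    if equivalent:
        return "statistically equivalent"
    if different_from_zero:
        return "statistically different from zero"
    return "inconclusive"


# Illustrative intervals with equivalence bounds -0.5 and 0.5.
verdicts = [
    classify((-0.20, 0.30), (-0.25, 0.35), -0.5, 0.5),
    classify((0.30, 0.90), (0.24, 0.96), -0.5, 0.5),
    classify((0.10, 0.40), (0.07, 0.43), -0.5, 0.5),
    classify((-0.10, 0.70), (-0.18, 0.78), -0.5, 0.5),
]
```

The third case is the one a significance test alone cannot reveal: the effect is reliably non-zero, yet reliably smaller than anything deemed worthwhile.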

TOST procedure

"A very simple equivalence testing approach is the ‘two-one-sided t-tests’ (TOST) procedure.[7] In the TOST procedure an upper (ΔU) and lower (–ΔL) equivalence bound is specified based on the smallest effect size of interest (e.g., a positive or negative difference of d = 0.3). Two composite null hypotheses are tested: H01: Δ ≤ –ΔL and H02: Δ ≥ ΔU. When both these one-sided tests can be statistically rejected, we can conclude that –ΔL < Δ < ΔU, or that the observed effect falls within the equivalence bounds and is statistically smaller than any effect deemed worthwhile, and considered practically equivalent.[8]" [Lakens 2017] Alternatives to the TOST procedure have been developed as well.[9] A recent modification to TOST makes the approach feasible in cases of repeated measures and assessing multiple variables. [10]

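The two one-sided tests described above can be sketched from summary statistics. This is an illustration under stated assumptions, not the procedure of any particular package: the names `t_cdf`, `tost` and `TostResult` are hypothetical, the inputs (effect estimate, standard error, degrees of freedom) are assumed to have been computed beforehand, and the Student t distribution function is obtained by numerically integrating its density so that only the standard library is required.

```python
import math
from typing import NamedTuple


class TostResult(NamedTuple):
    p_lower: float    # p-value for H01: effect <= lower bound
    p_upper: float    # p-value for H02: effect >= upper bound
    equivalent: bool  # True when both one-sided nulls are rejected


def t_cdf(x, df, steps=2000):
    """Student t CDF via Simpson's rule on the density (steps must be even)."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(t):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))

    a = abs(x)
    h = a / steps
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3  # integral of the density from 0 to |x|
    return 0.5 + area if x >= 0 else 0.5 - area


def tost(diff, se, df, lower, upper, alpha=0.05):
    """Two one-sided t-tests against the bounds lower < 0 < upper."""
    p_lower = 1.0 - t_cdf((diff - lower) / se, df)  # reject H01 if small
    p_upper = t_cdf((diff - upper) / se, df)        # reject H02 if small
    return TostResult(p_lower, p_upper, max(p_lower, p_upper) < alpha)


# Illustrative numbers: bounds -0.5 and 0.5, standard error 0.1, 98 df.
inside = tost(0.1, 0.1, 98, -0.5, 0.5)
near_bound = tost(0.4, 0.1, 98, -0.5, 0.5)
print(inside.equivalent, near_bound.equivalent)  # prints: True False
```

Equivalence is declared only when the larger of the two p-values falls below α, which matches the requirement that both one-sided tests be rejected.
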
Comparison between t-test and equivalence test

The equivalence test can, for comparison purposes, be induced from the t-test.[11] Consider a t-test at significance level α_t-test that achieves a power of 1 − β_t-test for a relevant effect size d_r. Both tests lead to the same inference whenever Δ = d_r, α_equiv.-test = β_t-test and β_equiv.-test = α_t-test, i.e. the error types (type I and type II) are interchanged between the t-test and the equivalence test. To achieve this with the t-test, either the sample size calculation must be carried out correctly, or the t-test significance level α_t-test must be adjusted, which yields the so-called revised t-test.[11] Both approaches have difficulties in practice, since sample size planning relies on unverifiable assumptions about the standard deviation σ, and the revised t-test yields numerical problems.[11] These limitations can be removed, while preserving the test behaviour, by using an equivalence test.

The second Figure allows a visual comparison of the equivalence test and the t-test when the sample size calculation is affected by differences between the a priori standard deviation σ and the sample's standard deviation σ̂, which is a common problem. Using an equivalence test instead of a t-test additionally ensures that α_equiv.-test is bounded, which the t-test does not do when σ̂ > σ, where the type II error can grow arbitrarily large. On the other hand, σ̂ < σ results in the t-test being stricter than the d_r specified in the planning, which may randomly penalize the sample source (e.g. a device manufacturer). This makes the equivalence test safer to use.

Chances to pass (a) the t-test and (b) the equivalence test, depending on the actual error 𝜇. For more details, see [11].

Further reading

  • Walker, Esteban; Nowacki, Amy S. (February 2011). "Understanding Equivalence and Noninferiority Testing". Journal of General Internal Medicine. 26 (2): 192–6. doi:10.1007/s11606-010-1513-8. PMC 3019319. PMID 20857339.


  1. Hauck, Walter W.; Anderson, Sharon (1984-02-01). "A new statistical procedure for testing equivalence in two-group comparative bioavailability trials". Journal of Pharmacokinetics and Biopharmaceutics. 12 (1): 83–91. doi:10.1007/BF01063612. ISSN 0090-466X. PMID 6747820. S2CID 29838725.
  2. Rogers, James L.; Howard, Kenneth I.; Vessey, John T. (1993). "Using significance tests to evaluate equivalence between two experimental groups". Psychological Bulletin. 113 (3): 553–565. doi:10.1037/0033-2909.113.3.553. PMID 8316613.
  3. Statistics applied to clinical trials (4th ed.). Springer. ISBN 978-1402095221.
  4. Piaggio, Gilda; Elbourne, Diana R.; Altman, Douglas G.; Pocock, Stuart J.; Evans, Stephen J. W.; CONSORT Group, for the (8 March 2006). "Reporting of Noninferiority and Equivalence Randomized Trials" (PDF). JAMA. 295 (10): 1152–60. doi:10.1001/jama.295.10.1152. PMID 16522836.
  5. Piantadosi, Steven (28 August 2017). Clinical trials : a methodologic perspective (Third ed.). p. 8.6.2. ISBN 978-1-118-95920-6.
  6. Lakens, Daniël (2017-05-05). "Equivalence Tests". Social Psychological and Personality Science. 8 (4): 355–362. doi:10.1177/1948550617697177. PMC 5502906. PMID 28736600.
  7. Schuirmann, Donald J. (1987-12-01). "A comparison of the Two One-Sided Tests Procedure and the Power Approach for assessing the equivalence of average bioavailability". Journal of Pharmacokinetics and Biopharmaceutics. 15 (6): 657–680. doi:10.1007/BF01068419. ISSN 0090-466X. PMID 3450848. S2CID 206788664.
  8. Seaman, Michael A.; Serlin, Ronald C. (1998). "Equivalence confidence intervals for two-group comparisons of means". Psychological Methods. 3 (4): 403–411. doi:10.1037/1082-989x.3.4.403.
  9. Wellek, Stefan (2010). Testing statistical hypotheses of equivalence and noninferiority. Chapman and Hall/CRC. ISBN 978-1439808184.
  10. Rose, Evangeline M.; Mathew, Thomas; Coss, Derek A.; Lohr, Bernard; Omland, Kevin E. (2018). "A new statistical method to test equivalence: an application in male and female eastern bluebird song". Animal Behaviour. 145: 77–85. doi:10.1016/j.anbehav.2018.09.004. ISSN 0003-3472. S2CID 53152801.
  11. Siebert, Michael; Ellenberger, David (2019-04-10). "Validation of automatic passenger counting: introducing the t-test-induced equivalence test". Transportation. doi:10.1007/s11116-019-09991-9. ISSN 0049-4488.