2.1.6 Sample size and power analysis

UNDER CONSTRUCTION



A. Background & Definitions

Statistical power is defined as the probability of detecting as statistically significant an effect of a pre-specified size, if such an effect truly exists. Formally, power is equal to 1 minus the Type II error rate (beta, β). The Type II error rate is the probability of obtaining a non-significant result when the null hypothesis is false — in other words, failing to find a difference or relationship when one exists.
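
In symbols (a compact restatement of the definition above, with the significance criterion α left implicit in the rejection rule):

    \text{power} \;=\; 1 - \beta \;=\; \Pr\big(\text{reject } H_0 \,\big|\, \text{an effect of the pre-specified size exists}\big)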

Balancing sample size, effect size and power is critical to good study design. When the power is low, only large effects can be detected, and negative results cannot be reliably interpreted. The consequences of low power are particularly dire in the search for high-impact results, when the researcher may be willing to pursue low-likelihood hypotheses for a groundbreaking discovery (see Fig. 1 in Krzywinski & Altman 2013, https://www.nature.com/articles/nmeth.2738). Ensuring that sample sizes are large enough to detect the effects of interest is an essential part of study design.

Studies with inadequate power are a waste of research resources and arguably unethical when subjects are exposed to potentially harmful or inferior experimental conditions.

Statistical power analysis exploits the relationships among the four variables involved in statistical inference: sample size (N), significance criterion (α), effect size (ES), and statistical power. For any statistical model, these relationships are such that each is a function of the other three: fixing any three determines the fourth.
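
As a minimal illustration of this four-way relationship, the sketch below assumes a two-sided two-sample t-test and uses the statsmodels library, which solves for whichever quantity is left unspecified; all numbers are placeholders chosen for illustration only:

    # A sketch of the N / alpha / ES / power relationship for a two-sample t-test.
    # Leaving exactly one argument unset (None) makes statsmodels solve for it.
    from statsmodels.stats.power import tt_ind_solve_power

    # Solve for sample size per group, given effect size, alpha and power.
    n = tt_ind_solve_power(effect_size=0.5, alpha=0.05, power=0.80)
    print(f"n per group: {n:.1f}")            # ~63.8, i.e. 64 per group

    # Solve for power at a fixed sample size.
    p = tt_ind_solve_power(effect_size=0.5, nobs1=30, alpha=0.05)
    print(f"power at n=30 per group: {p:.2f}")

    # Solve for the smallest detectable effect size at fixed n and power.
    es = tt_ind_solve_power(nobs1=30, alpha=0.05, power=0.80)
    print(f"detectable effect size at n=30: {es:.2f}")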

B. Guidance & Expectations

General advice - DO:

  • Whenever possible, seek professional biostatistician support to estimate sample size.
  • Use power prospectively for planning future studies.
  • Put science before statistics. It is easy to get caught up in statistical significance and such; but studies should be designed to meet scientific goals, and you need to keep those in sight at all times (in planning and analysis). The appropriate inputs to power/sample-size calculations are effect sizes that are deemed scientifically important, based on careful considerations of the underlying scientific (not statistical) goals of the study. Statistical considerations are used to identify a plan that is effective in meeting scientific goals – not the other way around.
  • Do pilot studies. Investigators tend to try to answer all the world’s questions with one study. However, you usually cannot do a definitive study in one step. It is far better to work incrementally. A pilot study helps you establish procedures, understand and protect against things that can go wrong, and obtain variance estimates needed in determining sample size. A pilot study with 20-30 degrees of freedom for error is generally adequate for obtaining reasonably reliable sample-size estimates.
  • Generate sample size estimates for a range of power and effect size values to explore the gains and losses in power or detectable effect size due to increasing or decreasing n (illustrated in the sketch after this list). This is why the term ‘sample size estimation’ is often preferred over ‘sample size calculation’. Although the arrival at a number for the required sample size is invariably based on (often complex) formulae, the term ‘calculation’ implies an unwarranted degree of precision. The purpose of sample size estimation is not to give an exact number but rather to subject the study design to scrutiny, including an assessment of the validity and reliability of data collection (Batterham & Atkinson 2005, https://www.sciencedirect.com/science/article/pii/S1466853X05000714).
  • Remember to allow for attrition (i.e. the possibility that some subjects or samples are lost during the conduct of the study or follow-up for technical or other reasons unrelated to the data analysis) by inflating the recruited sample size accordingly.
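
A minimal sketch of such an exploration, again assuming a two-sided two-sample t-test and a hypothetical 15% attrition rate (both chosen only for illustration):

    import math
    from statsmodels.stats.power import tt_ind_solve_power

    ATTRITION = 0.15   # hypothetical anticipated drop-out rate

    # Tabulate the required n per group over a grid of effect sizes and power levels.
    for es in (0.3, 0.5, 0.8):
        for power in (0.80, 0.90):
            n = tt_ind_solve_power(effect_size=es, alpha=0.05, power=power)
            # Recruit n / (1 - attrition) so the completing sample stays adequate.
            n_recruit = math.ceil(n / (1 - ATTRITION))
            print(f"ES={es:.1f}, power={power:.0%}: "
                  f"n={math.ceil(n)} per group, recruit {n_recruit}")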


General advice - DO NOT:

  • Avoid using the definition of “small,” “medium,” or “large” effect size based on Cohen's d of .20, .50, or .80, respectively. Cohen's assessments are based on an extensive survey of statistics reported in the literature in the social sciences and may not apply to other fields of science. Further, this method uses a standardized effect size as the goal. Think about it: for a “medium” effect size, you’ll choose the same n regardless of the accuracy or reliability of your instrument, or the narrowness or diversity of your subjects. Clearly, important considerations are being ignored here. “Medium” is definitely not the message! Specify the effect size on the actual scale of measurement instead, and standardize only as a final computational step (see the sketch after this list).
  • Retrospective power calculations should be avoided, because they add no new information to an analysis (i.e. avoid using observed power to interpret the results of the statistical test). You’ve got the data, did the analysis, and did not achieve “significance.” So you compute power retrospectively to see if the test was powerful enough or not. This is an empty question. Of course it wasn’t powerful enough – that’s why the result isn’t significant. Power calculations are useful for design, not analysis.
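
A minimal sketch of working from the actual measurement scale rather than from a canned ‘medium’ effect; the raw difference, the units, and the variability estimate below are hypothetical placeholders, not recommendations:

    from statsmodels.stats.power import tt_ind_solve_power

    # Start on the actual scale of measurement: suppose a 10-unit mean difference
    # is the smallest scientifically important effect, and a pilot study gave an
    # SD of about 18 units (both numbers are hypothetical).
    raw_difference = 10.0
    pilot_sd = 18.0

    d = raw_difference / pilot_sd   # standardize only as the last computational step
    n = tt_ind_solve_power(effect_size=d, alpha=0.05, power=0.80)
    print(f"d = {d:.2f}; required n per group ≈ {n:.0f}")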


Guidance on sample size estimation:

--- to be added / revised (please do not edit - placeholder) ---

  • getting a solid grip on the existing literature in one's topic, drilling down to what effects were identified, and obtaining the corresponding ES values either directly from the publication or from appropriate calculations based on the printed documentation;
  • being sure that the estimate you obtain is the one that fits the study design correctly; one cannot necessarily generalize across disparate research designs;
  • citing the algorithm or software used to generate the estimates. A power calculation result given without this detail can be viewed with suspicion.

--- to be added / revised (please do not edit - placeholder) ---


What to do if you have no choice about sample size:

A limited budget, a limited supply of research materials, or difficult-to-overcome guidance from a collaborator, a funder or a senior colleague may leave no choice but to run a study with a given, potentially small, sample size. What can be done in such situations?

  • consider study designs involving correlated data (e.g. repeated measures, crossover or matched-pairs designs) that are usually associated with greater statistical power than those involving separate samples allocated to different treatment groups (see section 2.1 in Batterham & Atkinson 2005, https://www.sciencedirect.com/science/article/pii/S1466853X05000714)
  • consider intervening variables or pre-intervention measurements for stratification; if stratification is not possible, statistical power can still be improved by entering these variables as covariates in the analysis (this approach has its limitations and should therefore be discussed with a statistician)
  • make sure that the best-suited randomization schedule is used to control for random influences
  • explore and engage all other means to minimize variation (including using properly maintained and calibrated research instruments, adequate and well-controlled environmental conditions, and making sure that experiments are performed by competent and adequately trained scientists)
  • if a study has low power because of the given sample size, reflect this limitation in the study protocol and indicate to all stakeholders that the study cannot be run as knowledge-claiming (decision-enabling, confirmatory).
  • evaluate power not only for the given sample size but also for values around it, and discuss the impact of the sample size on power with the stakeholders (a sketch follows this list) - in some cases, this may help to lift or revise the original sample size restrictions. These discussions make sense and are justifiable only if they take place prior to the conduct of the study (i.e. not post hoc).
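
A minimal sketch of the first and last points above: the fixed n of 12 per group, the effect size, and the within-pair correlations are hypothetical placeholders. It contrasts an independent-groups t-test with a matched-pairs design and shows how power moves as n varies around the imposed value:

    import math
    from statsmodels.stats.power import tt_ind_solve_power, tt_solve_power

    alpha, d = 0.05, 0.8          # hypothetical significance level and effect size

    # Power of an independent-groups t-test at the imposed n and at values around it.
    for nobs in (8, 10, 12, 16, 20):
        p = tt_ind_solve_power(effect_size=d, nobs1=nobs, alpha=alpha)
        print(f"independent groups, n={nobs}/group: power={p:.2f}")

    # Matched-pairs design with 12 pairs: the standardized effect on the
    # difference scores, d / sqrt(2 * (1 - rho)), grows with the correlation rho,
    # which is where the power advantage of correlated designs comes from.
    for rho in (0.3, 0.5, 0.7):
        dz = d / math.sqrt(2 * (1 - rho))
        p = tt_solve_power(effect_size=dz, nobs=12, alpha=alpha)
        print(f"matched pairs (12 pairs), rho={rho}: power={p:.2f}")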

C. Resources

Tools for sample size estimation:

  • WISE power tutorial: https://wise1.cgu.edu/power/index.asp
  • Java applets for power and sample size: http://davidmlane.com/hyperstat/power.html
  • Computation of effect sizes @Psychometrica: https://www.psychometrica.de/effect_size.html
  • Overview of sample size and power calculators: http://powerandsamplesize.com/Calculators/

Educational instruments and resources:

  • Mayo Clinic online simulator - Size matters
  • Scientists talking to biostatisticians


Useful literature (for non-statisticians):

Useful literature:

  • Power in various ANOVA designs by Joel Levin: https://pdfs.semanticscholar.org/1325/24bdfe70504fcd67016b17305ccddb4bcd14.pdf
  • https://www.ncbi.nlm.nih.gov/books/NBK43321/
  • http://davidmlane.com/hyperstat/power.html
  • http://powerandsamplesize.com/Calculators/Test-1-Mean/1-Sample-Equality

Guidelines on reporting of sample size (in vivo research):



back to Toolbox

Next item: 2.1.7 Blinding