2.1.6 Sample size and power analysis

UNDER CONSTRUCTION



A. Background & Definitions

Balancing sample size, effect size and power is critical to good study design. When the power is low, only large effects can be detected, and negative results cannot be reliably interpreted. The consequences of low power are particularly dire in the search for high-impact results, when the researcher may be willing to pursue low-likelihood hypotheses for a groundbreaking discovery (see Fig. 1 in Krzywinski & Altman 2013). Ensuring that sample sizes are large enough to detect the effects of interest is an essential part of study design.

Studies with inadequate power are a waste of research resources and arguably unethical when subjects are exposed to potentially harmful or inferior experimental conditions. Addressing this shortcoming is a priority: the Nature Publishing Group checklist for statistics and methods (http://www.nature.com/authors/policies/checklist.pdf) includes as the first question: “How was the sample size chosen to ensure adequate power to detect a pre-specified effect size?”

Statistical power analysis exploits the relationships among the four variables involved in statistical inference: sample size (N), significance criterion (α), effect size (ES), and statistical power. For any statistical model, these relationships are such that each is a function of the other three.
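Because each quantity is a function of the other three, fixing any three lets software solve for the fourth. A minimal sketch of this (not part of the EQIPD text) is shown below; it assumes a two-sample t-test design and the Python package statsmodels, but any power-analysis tool (e.g. G*Power, listed under Resources) works the same way.

 # Hedged sketch: assumes a two-sample t-test design and the statsmodels package.
 from statsmodels.stats.power import TTestIndPower
 analysis = TTestIndPower()
 # Fix effect size, alpha and power -> solve for the sample size per group.
 n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
 # Fix effect size, sample size per group and alpha -> solve for the achieved power.
 achieved_power = analysis.solve_power(effect_size=0.5, nobs1=30, alpha=0.05)
 print(f"n per group: {n_per_group:.1f}, power at n=30: {achieved_power:.2f}")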

B. Guidance & Expectations

General advice - DO:

1. Whenever possible, seek professional biostatistician support to estimate sample size.

2. Use power prospectively for planning future studies.

3. Put science before statistics. It is easy to get caught up in statistical significance, but studies should be designed to meet scientific goals, and you need to keep those goals in sight at all times (in planning and in analysis). The appropriate inputs to power/sample-size calculations are effect sizes that are deemed scientifically important, based on careful consideration of the underlying scientific (not statistical) goals of the study. Statistical considerations are used to identify a plan that is effective in meeting scientific goals – not the other way around.

4. Do pilot studies. Investigators tend to try to answer all the world’s questions with one study. However, you usually cannot do a definitive study in one step. It is far better to work incrementally. A pilot study helps you establish procedures, understand and protect against things that can go wrong, and obtain variance estimates needed in determining sample size. A pilot study with 20-30 degrees of freedom for error is generally quite adequate for obtaining reasonably reliable sample-size estimates.

5. Effect size should be specified on the actual scale of measurement, not on a standardized scale (a conversion sketch follows this list).
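A hedged sketch of this conversion (illustrative numbers only, again assuming statsmodels and a two-sample t-test): the smallest difference worth detecting stays on the measurement scale, the standard deviation comes from a pilot study (see item 4), and a standardized effect size is computed only because the software expects it as an input.

 from statsmodels.stats.power import TTestIndPower
 raw_difference = 15.0  # smallest difference worth detecting, in measurement units (assumed value)
 pilot_sd = 40.0        # standard deviation estimated from a pilot study (assumed value)
 # Standardize only to feed the software; the scientific goal stays on the measurement scale.
 cohens_d = raw_difference / pilot_sd
 n_per_group = TTestIndPower().solve_power(effect_size=cohens_d, alpha=0.05, power=0.80)
 print(f"Required sample size per group: {n_per_group:.0f}")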


General advice - DO NOT:

1. Avoid defining “small,” “medium,” or “large” effect sizes by Cohen's d values of 0.20, 0.50, or 0.80, respectively. Cohen's benchmarks are based on an extensive survey of statistics reported in the social-science literature and may not apply to other fields of science. Further, this approach takes a standardized effect size as the goal. Think about it: for a “medium” effect size, you’ll choose the same n regardless of the accuracy or reliability of your instrument, or the narrowness or diversity of your subjects. Clearly, important considerations are being ignored here. “Medium” is definitely not the message!

2. Avoid retrospective power calculations (also called observed or post hoc power); they add no new information to an analysis, so do not use observed power to interpret the results of a statistical test. You’ve got the data, did the analysis, and did not achieve “significance.” So you compute power retrospectively to see whether the test was powerful enough. This is an empty question: of course it wasn’t powerful enough – that’s why the result isn’t significant. Power calculations are useful for design, not analysis.


We begin by setting the type I error rate (α) and power (1 − β) to statistically adequate levels: traditionally 0.05 and 0.80, respectively. We then determine n on the basis of the smallest effect we wish to measure. If the required sample size is too large, we may need to reassess our objectives or control the experimental conditions more tightly to reduce the variance, as sketched below.
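A rough sketch of this planning step (assumed numbers, same statsmodels/t-test setup as above): if the sample size needed for the smallest effect of interest is impractical, tighter experimental control that lowers the standard deviation reduces the required n.

 from statsmodels.stats.power import TTestIndPower
 alpha, power = 0.05, 0.80
 smallest_effect = 5.0    # smallest difference we wish to measure (assumed, in measurement units)
 for sd in (20.0, 10.0):  # current variability vs. a more tightly controlled experiment
     n = TTestIndPower().solve_power(effect_size=smallest_effect / sd, alpha=alpha, power=power)
     print(f"sd = {sd:4.1f} -> n per group = {n:.0f}")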

C. Resources

  • G*Power [1]
  • WISE power tutorial [2]
  • JAVA applets for power and sample size [3]
  • Computation of sample sizes @Psychometrica [4]

  • Mayo clinical online simulator - Size matters [5]
  • Scientists talking to biostatisticians [6]

Guidelines on reporting of sample size (in vivo research):



back to Toolbox

Next item: 2.1.7 Blinding