ARRIVE Essential - Sample size


DISCLAIMER: Information on this and related pages is based on or copied directly from the ARRIVE guidelines 2019 (please see the original guidelines for more information, references and examples that are not included on these pages):

ARRIVE Essential 10 - Item 2 - Sample size

2a. Specify the exact number of experimental units allocated to each group, and the total number in each experiment. Also indicate the total number of animals used.

Sample size relates to the number of experimental units in each group at the start of the study, and is usually represented by N (see item 1 – Study design for further guidance on identifying and reporting experimental units). This information is crucial to assess the validity of the statistical model and the robustness of the experimental results.

Report the exact value of N per group and the total number in each experiment. If the experimental unit is not the animal, also report the total number of animals to help readers understand the study design. For example, in a study investigating diet using cages of animals housed in pairs, the number of animals is double the number of experimental units. Reporting the total number of animals is also useful to identify if any were re-used between experiments.
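As a minimal sketch of the bookkeeping behind item 2a, the Python below tallies N per group, the total N and the total number of animals for the caged-pairs diet example. The group names, the six cages per group and the helper `summarise` are hypothetical, chosen only for illustration.

```python
# Hypothetical diet study: each cage (the experimental unit) houses two animals.
# All numbers are illustrative, not taken from a real study.
ANIMALS_PER_UNIT = 2

# Number of experimental units (cages) allocated to each group.
units_per_group = {"control diet": 6, "test diet": 6}

def summarise(units_per_group, animals_per_unit):
    """Return N per group, total N and total animals for one experiment."""
    total_units = sum(units_per_group.values())
    total_animals = total_units * animals_per_unit
    return {
        "N per group": dict(units_per_group),
        "total N (experimental units)": total_units,
        "total animals": total_animals,
    }

report = summarise(units_per_group, ANIMALS_PER_UNIT)
for key, value in report.items():
    print(f"{key}: {value}")
```

Here 12 cages give N = 12 experimental units but 24 animals, which is why both totals are reported when the unit is not the animal.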

2b. Explain how the sample size was decided. Provide details of any a priori sample size calculation, if done.

For any type of experiment, it is crucial to explain how the sample size was determined. For hypothesis-testing experiments, where inferential statistics are used to estimate the size of the effect and to determine the weight of evidence against the null hypothesis, the sample size needs to be justified to ensure experiments are of an optimal size to test the research question.

Power is the probability that a test of significance will detect an effect (i.e. a deviation from the null hypothesis), when the effect being investigated genuinely exists (i.e. true positive result). Sample sizes that are too small (i.e. underpowered studies) produce inconclusive results, whereas sample sizes that are too large (i.e. overpowered studies) raise ethical issues over unnecessary use of animals and may produce trivial findings that are statistically significant but not biologically relevant. Low power has three effects: first, within the experiment, real effects are more likely to be missed; second, where an effect is detected, this will often be an over-estimation of the true effect size; and finally, when low power is combined with publication bias, there is an increase in the false positive rate in the published literature. Consequently, low-powered studies contribute to the poor internal validity of research and risk wasting animals used in inconclusive research.
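The first two effects of low power can be illustrated with a small simulation. This is a simplified sketch, not a recommended analysis: it assumes normally distributed outcomes with a known standard deviation, so a two-sided z-test stands in for the usual t-test, and the function `simulate` and all parameter values are hypothetical. Power is estimated as the fraction of simulated experiments that detect a genuine difference of 0.5 SD, and the average estimate among significant results shows how small studies over-estimate the effect.

```python
import random
from statistics import NormalDist, fmean

def simulate(n_per_group, true_diff, sd=1.0, alpha=0.05, n_sims=5_000, seed=1):
    """Estimate power and the mean significant effect estimate by simulation.

    Simplifying assumption: normal outcomes with a known SD, so a two-sided
    z-test on the difference in group means stands in for a t-test.
    """
    rng = random.Random(seed)
    se = sd * (2 / n_per_group) ** 0.5            # standard error of the difference
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    significant = []
    for _ in range(n_sims):
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        treated = [rng.gauss(true_diff, sd) for _ in range(n_per_group)]
        diff = fmean(treated) - fmean(control)
        if abs(diff) / se > z_crit:               # null hypothesis rejected
            significant.append(diff)
    power = len(significant) / n_sims
    # Mean estimate among significant results in the true (positive) direction.
    positive = [d for d in significant if d > 0]
    mean_significant = fmean(positive) if positive else float("nan")
    return power, mean_significant

for n in (5, 20, 64):
    power, estimate = simulate(n, true_diff=0.5)
    print(f"N={n:>2} per group: power ~ {power:.2f}, "
          f"mean significant estimate ~ {estimate:.2f} (true difference 0.5)")
```

With 5 animals per group, power is low and the significant results overstate the true difference by a factor of about three, because only unusually large differences cross the significance threshold; with 64 per group, power is roughly 0.8 and the significant estimates sit much closer to the true value.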

Study design can influence the statistical power of an experiment. Split-plot designs, factorial designs, or group-sequential designs can increase the power of a study for a given number of animals. Statistical programs to help perform a priori sample size calculations exist for a variety of experimental designs and statistical analyses, for example G*Power. Choosing the appropriate calculator or algorithm to use depends on the type of outcome measures and independent variables, and the number of groups. Consultation with a statistician is recommended, especially when the experimental design is complex or unusual.
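Dedicated programs such as G*Power implement exact methods for many designs. As a rough illustration of what such a calculation does in the simplest case (two independent groups compared on the mean of a continuous outcome with a two-sided test), the sketch below computes approximate power using a normal approximation and searches for the smallest N per group that reaches the target power. The function names are hypothetical, and the normal approximation is an assumption: exact t-test calculations typically return slightly larger values (for example 64 rather than 63 per group for a standardised effect size of 0.5), so a dedicated tool or a statistician remains the better choice for a real study.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def approx_power(n_per_group, effect_size, alpha=0.05):
    """Approximate power of a two-sided comparison of two group means.

    effect_size is standardised: (difference in means) / (common SD).
    Normal approximation; exact t-test methods give slightly lower power.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability of rejecting in either tail when the effect is real.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)

def smallest_n(effect_size, target_power=0.80, alpha=0.05, n_max=100_000):
    """Smallest N per group whose approximate power reaches target_power."""
    for n in range(2, n_max + 1):
        if approx_power(n, effect_size, alpha) >= target_power:
            return n
    raise ValueError("target power not reached within n_max")

print(smallest_n(1.0))   # 16
print(smallest_n(0.5))   # 63
```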

Where the experiment tests the effect of an intervention on the mean of a continuous outcome measure, the sample size can be calculated a priori, based on a mathematical relationship between the desired effect size, variability estimated from prior data, chosen significance level, power and sample size. For an a priori sample size determination, report the analysis method (e.g. two-tailed Student’s t-test with a 0.05 significance threshold), the effect size of interest and a justification explaining why this effect size is relevant, the estimate of variability used (e.g. standard deviation) and how it was estimated, and the power selected.
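For two independent groups, the relationship described above is commonly approximated by the formula below. It is a normal approximation offered as an illustration; exact t-based methods give slightly larger N. Each symbol corresponds to an item to report: α and the two-sided test come from the analysis method, Δ is the effect size of interest, σ is the estimate of variability (standard deviation), 1 − β is the selected power, z_q is the q-quantile of the standard normal distribution, and N is the number of experimental units per group.

```latex
N = \frac{2\,\sigma^{2}\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}}{\Delta^{2}}

% Worked example: \alpha = 0.05 (two-sided), 1-\beta = 0.80, \Delta = \sigma
N = 2\,(1.960 + 0.842)^{2} \approx 15.7 \;\Rightarrow\; 16 \text{ per group (rounded up)}
```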

There are several types of studies where a priori sample size calculations are not appropriate. For example, the number of animals needed for antibody or tissue production is determined by the amount required and the production ability of an individual animal. For studies where the outcome is a successful generation of a sample or a condition (e.g. the production of transgenic animals), the number of animals is determined by the probability of success of the experimental procedure.
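For these cases the arithmetic is simple. The hypothetical helpers below sketch two common calculations under stated assumptions: dividing the amount required by an assumed yield per animal, and, for a procedure assumed to succeed independently in each animal with probability p, finding the smallest number of animals n for which the chance of at least one success, 1 − (1 − p)^n, reaches a chosen level. The amounts, the 20% success rate and the 95% level are illustrative values, not recommendations.

```python
from math import ceil, log

def animals_for_amount(amount_required, yield_per_animal):
    """Animals needed when each animal yields a fixed amount (same units)."""
    return ceil(amount_required / yield_per_animal)

def animals_for_success(p_success, confidence=0.95):
    """Smallest n with P(at least one success) = 1 - (1 - p)**n >= confidence.

    Assumes each animal's procedure succeeds independently with probability p_success.
    """
    if not 0 < p_success < 1:
        raise ValueError("p_success must be strictly between 0 and 1")
    n = ceil(log(1 - confidence) / log(1 - p_success))
    return max(n, 1)

# Hypothetical values: 50 mg of antibody needed at about 12 mg per animal;
# a procedure assumed to succeed in 20% of animals.
print(animals_for_amount(50, 12))   # 5
print(animals_for_success(0.2))     # 14
```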

In early feasibility or pilot studies, the number of animals required depends on the purpose of the study. Where the objective of the preliminary study is to improve procedures and equipment, the number of animals needed is generally small. In such cases power calculations are not appropriate and sample sizes can be estimated based on operational capacity and constraints. Pilot studies alone are unlikely to provide adequate data on variability for a power calculation for future experiments. Systematic reviews and previous studies are more appropriate sources of information on variability.
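How unreliable a small pilot's variability estimate can be is easy to see by simulation. The sketch below rests on assumptions made only for illustration: normally distributed outcomes, a true SD of 1, a pilot of 4 animals, and a hypothetical helper `pilot_sd_spread`. Because the required N scales with the square of the SD, the spread in the SD estimate becomes a much larger spread in the implied sample size.

```python
import random
from statistics import quantiles, stdev

def pilot_sd_spread(pilot_n, true_sd=1.0, n_pilots=5_000, seed=2):
    """Return the 10th and 90th percentiles of the SD estimated by many simulated pilots."""
    rng = random.Random(seed)
    sds = [stdev([rng.gauss(0.0, true_sd) for _ in range(pilot_n)])
           for _ in range(n_pilots)]
    deciles = quantiles(sds, n=10)   # nine cut points; first = 10th, last = 90th percentile
    return deciles[0], deciles[-1]

low, high = pilot_sd_spread(pilot_n=4)
print(f"SD estimate from a 4-animal pilot (true SD 1.0): "
      f"10th-90th percentile {low:.2f} to {high:.2f}")
# Required N is proportional to SD squared, so the implied sample sizes differ by:
print(f"factor of about {(high / low) ** 2:.1f}")
```

With 4 animals, the 10th to 90th percentile range of the SD estimate is roughly 0.4 to 1.4, so sample sizes calculated from it could differ roughly tenfold.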

Regardless of whether a power calculation was used or not, when explaining how the sample size was determined take into consideration any anticipated loss of animals or data, for example due to exclusion criteria established upfront or expected attrition (see item 3 – inclusion and exclusion criteria).
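One simple way to take anticipated losses into account, assumed here rather than prescribed by the guidelines, is to divide the number of experimental units needed for analysis by the expected proportion retained and round up. The helper name and the 10% loss rate below are hypothetical.

```python
from math import ceil

def inflate_for_attrition(n_required, expected_loss):
    """Units to allocate so about n_required remain after losing a fraction expected_loss."""
    if not 0 <= expected_loss < 1:
        raise ValueError("expected_loss must be in [0, 1)")
    # Round away floating-point noise (e.g. 17.999999999) before taking the ceiling.
    return ceil(round(n_required / (1 - expected_loss), 9))

print(inflate_for_attrition(16, 0.10))   # 18
```

For example, needing 16 analysable units per group with an expected 10% loss leads to allocating 18.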
