The online Shiny app for rpact is available at https://shiny.rpact.com. The default setting when the Shiny app is loaded is a fixed sample design, which means that there is only one look at the data (kMax = 1). In other words, the default setting is not a sequential design, but a traditional design where the data are analyzed once. Moving the slider for the “Maximum number of stages” increases the number of looks in the design (you can select up to 10 looks).

The rpact package focuses on Confirmatory Adaptive Clinical Trial Design and Analysis. In clinical trials, researchers mostly test directional predictions, and thus, the default setting is to perform a one-sided test. Outside of clinical trials, it might be less common to design studies testing a directional prediction, but it is often a good idea. In clinical trials, it is common to use a 0.025 significance level (or Type I error rate) for one-sided tests, as it is deemed preferable in regulatory settings to set the Type I error rate for one-sided tests at half the conventional Type I error used in two-sided tests. In other fields, such as psychology, researchers typically use a 0.05 significance level, regardless of whether they perform a one-sided or two-sided test. A default 0.2 Type II error rate (or power of 0.8) is common in many fields, and is thus the default setting for the Type II error rate in the Shiny app.

Remember that you always need to justify your error rates – the defaults are most often not optimal choices in any real-life design (and it might be especially useful to choose a higher power, if possible).

We can explore a group sequential design by moving the slider for the maximum number of stages to, say, kMax = 2. The option to choose a design appears above the slider in the form of three “Design” radio buttons (Group Sequential, Inverse Normal, and Fisher), which by default is set to a group sequential design – this is the type of design we will focus on in this step-by-step tutorial. The other options are relevant for adaptive designs, which we will not discuss here.

A new drop-down menu has appeared below the box to choose a Type II error rate that asks you to specify the “Type of design”. This allows you to choose how you want to control the α level across looks. By default the choice is an O’Brien-Fleming design. Set the “Type of Design” option to “Pocock (P)”. Note there is also a Pocock type α-spending (asP) option – we will use that later.

Because most people in the social sciences will probably have more experience with two-sided tests at an α of 0.05, choose a two-sided test and an α level of 0.05. The input window should now look like the example below:

Click on the “Plot” tab. The first plot in the drop-down menu shows the boundaries at each look. The critical z score at each look is presented, as are reference lines at 1.96 and -1.96. These reference lines are the critical values for a two-sided test with a single look (i.e., a fixed design) with an α of 5%. We see that the boundaries on the z scale have increased. This means we need to observe a more extreme z score at an analysis to reject H0. Furthermore, we see that the critical bounds are constant across both looks. This is exactly the goal of the Pocock correction: the α level is lowered so that it is the same at each look, and the overall α level across all looks at the data is controlled at 5%. It is conceptually very similar to the Bonferroni correction. We can reproduce the design and the plot in R using the following code:

```
design <- getDesignGroupSequential(
  kMax = 2,
  typeOfDesign = "P",
  alpha = 0.05,
  sided = 2
)
plot(design, type = 1)
```
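To make the Bonferroni analogy concrete, we can compare the per-look critical value a Bonferroni correction would give for two looks with the Pocock bound (a base R sketch; 2.178 is the standard two-look Pocock critical value, which the plot above displays):

```r
# Bonferroni: split the two-sided alpha = .05 over two looks (alpha = .025 per look),
# giving a two-sided critical value of qnorm(1 - .025 / 2)
qnorm(1 - 0.025 / 2) # ~2.24
# The exact Pocock bound for two looks is slightly less strict (~2.18), because it
# accounts for the dependency between the sequential test statistics
```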

In the drop-down menu, we can easily change the type of design from “Pocock (P)” to “O’Brien-Fleming (OF)” to see the effect of using different corrections for the critical values across looks in the plot. We see that the O’Brien-Fleming correction has a different goal. The critical value at the first look is very high (which also means the α level for this look is very low), but the critical value at the final look is extremely close to the unadjusted critical value of 1.96 (or the α level of 0.05).

```
design <- getDesignGroupSequential(
  kMax = 2,
  typeOfDesign = "OF",
  alpha = 0.05,
  sided = 2
)
plot(design, type = 1)
```

We can plot the corrections for different types of designs for a design with 3 looks (2 interim looks and one final look) in the same plot in R. The plot below shows the Pocock, O’Brien-Fleming, Haybittle-Peto, and Wang-Tsiatis corrections, the latter with Δ = 0.25. We see that researchers can choose different approaches to spend their α level across looks. Researchers can choose to spend their α conservatively (keeping most of the α for the last look), or more liberally (spending more α at earlier looks, which increases the probability of stopping early for many true effect sizes).

```
# Comparison corrections
d1 <- getDesignGroupSequential(typeOfDesign = "OF", sided = 2, alpha = 0.05)
d2 <- getDesignGroupSequential(typeOfDesign = "P", sided = 2, alpha = 0.05)
d3 <- getDesignGroupSequential(
  typeOfDesign = "WT", deltaWT = 0.25,
  sided = 2, alpha = 0.05
)
d4 <- getDesignGroupSequential(typeOfDesign = "HP", sided = 2, alpha = 0.05)
designSet <- getDesignSet(designs = c(d1, d2, d3, d4), variedParameters = "typeOfDesign")
plot(designSet, type = 1, legendPosition = 5)
```

Because the statistical power of a test depends on the α level (as well as the effect size and the sample size), at the final look the statistical power of an O’Brien-Fleming or Haybittle-Peto design is very similar to the statistical power of a fixed design with only one look. If the α level is lowered, the sample size of a study needs to be increased to maintain the same statistical power at the last look. Therefore, the Pocock correction requires a considerably larger increase in the maximum sample size than the O’Brien-Fleming or Haybittle-Peto correction. We will discuss these issues in more detail when we consider sample size planning below.
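To get a feel for this relationship, base R’s `power.t.test()` can compare the per-group sample size at a conventional α with the sample size at a lower final-look α. This is only a sketch: the value 0.0294 is the approximate two-sided Pocock-corrected α for two looks, and the calculation ignores the sequential structure itself.

```r
# Per-group n for 80% power at d = 0.5, conventional two-sided alpha = .05
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)$n
# Per-group n at the (approximate) Pocock-corrected final-look alpha = .0294:
# a lower alpha requires a noticeably larger sample size for the same power
power.t.test(delta = 0.5, sd = 1, sig.level = 0.0294, power = 0.8)$n
```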

If you head to the “Report” tab, you can download an easily readable summary of the main results. Here, you can also see the α level you would use for each look at the data (e.g., p < 0.0052 and p < 0.0480 for an O’Brien-Fleming type design with 2 looks).

Corrected α levels can be computed to many digits, but this quickly reaches a level of precision that is meaningless in real life. The observed Type I error rate for all tests you will do in your lifetime is not noticeably different if you set the α level at 0.0194, 0.019, or 0.02 (see the concept of ‘significant digits’). Even though we calculate and use thresholds to many digits in sequential tests, the messiness of most research gives these levels false precision. Keep this in mind when interpreting your data.
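As a quick base R illustration of this point, the one-sided critical z values corresponding to these three α levels barely differ:

```r
# One-sided critical z values for three nearly identical alpha levels;
# all three lie between 2.05 and 2.08, a difference far smaller than the
# sampling error in any real study
qnorm(1 - c(0.0194, 0.019, 0.02))
```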

Note that the rpact Shiny app usefully shows the R code required to reproduce the output.

```
design <- getDesignGroupSequential(
  typeOfDesign = "OF",
  informationRates = c(0.5, 1),
  alpha = 0.05,
  beta = 0.2,
  sided = 2
)
kable(summary(design))
```

**Sequential analysis with a maximum of 2 looks (group sequential design)**

O’Brien & Fleming design, two-sided overall significance level 5%, power 80%, undefined endpoint, inflation factor 1.0078, ASN H1 0.9022, ASN H01 0.9897, ASN H0 1.0052.

Stage | 1 | 2 |
---|---|---|
Information rate | 50% | 100% |
Efficacy boundary (z-value scale) | 2.797 | 1.977 |
Stage levels (one-sided) | 0.0026 | 0.0240 |
Cumulative alpha spent | 0.0052 | 0.0500 |
Overall power | 0.2096 | 0.8000 |

An important contribution to the sequential testing literature was made by Lan and DeMets (1983), who proposed the α-spending function approach. In the figure below, the O’Brien-Fleming-like α-spending function is plotted against the discrete O’Brien-Fleming bounds. We can see that the two approaches are not identical, but very comparable. The main benefit of these spending functions is that the error rate of the study can be controlled while neither the number nor the timing of the looks needs to be specified in advance. This makes α-spending approaches much more flexible. When using an α-spending function, it is important that the decision to perform an interim analysis is not based on collected data, as this can still increase the Type I error rate.
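For reference, these two spending functions have standard closed forms (the textbook Lan-DeMets forms, sketched here in base R, with t the information fraction):

```r
alpha <- 0.05
t <- seq(0.2, 1, 0.2) # information fractions
# O'Brien-Fleming-like spending: 2 - 2 * pnorm(qnorm(1 - alpha / 2) / sqrt(t))
as_of <- 2 - 2 * pnorm(qnorm(1 - alpha / 2) / sqrt(t))
# Pocock-like spending: alpha * log(1 + (e - 1) * t)
as_p <- alpha * log(1 + (exp(1) - 1) * t)
# Both spend the full alpha at t = 1, but the O'Brien-Fleming-like
# function spends far less alpha at early looks
```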

```
d1 <- getDesignGroupSequential(typeOfDesign = "P", kMax = 5)
d2 <- getDesignGroupSequential(typeOfDesign = "asP", kMax = 5)
d3 <- getDesignGroupSequential(typeOfDesign = "OF", kMax = 5)
d4 <- getDesignGroupSequential(typeOfDesign = "asOF", kMax = 5)
designSet <- getDesignSet(
  designs = c(d1, d2, d3, d4),
  variedParameters = "typeOfDesign"
)
plot(designSet, type = 1)
```

Although α-spending functions control the Type I error rate even when there are deviations from the pre-planned number of looks or their timing, this does require recalculating the boundaries used in the statistical test based on the amount of information that has been observed. Let us assume a researcher designs a study with three equally spaced looks at the data (two interim looks, one final look), using a Pocock-type α-spending function, where results will be analyzed in a two-sided t-test with an overall desired Type I error rate of 0.05 and a desired power of 0.9 for a Cohen’s d of 0.5. An a-priori power analysis (which we will explain later in this tutorial) shows that we achieve the desired power in our sequential design if we plan to look after 65.4, 130.9, and 196.3 observations in total. Since we cannot collect partial participants, we should round these numbers up, and because we have 2 independent groups, we will collect 66 observations for look 1 (33 in each condition), 132 at the second look (66 in each condition), and 198 at the third look (99 in each condition).
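The rounding step can be sketched in R (the planned totals come from the power analysis described above):

```r
n_planned <- c(65.4, 130.9, 196.3) # total planned observations at each look
n_per_condition <- ceiling(n_planned / 2) # round up within each condition
n_total <- 2 * n_per_condition
n_per_condition # 33 66 99
n_total # 66 132 198
```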

```
design <- getDesignGroupSequential(
  kMax = 3,
  typeOfDesign = "asP",
  sided = 2,
  alpha = 0.05,
  beta = 0.1
)
kable(summary(design))
```

**Sequential analysis with a maximum of 3 looks (group sequential design)**

Pocock type alpha spending design, two-sided overall significance level 5%, power 90%, undefined endpoint, inflation factor 1.1542, ASN H1 0.7212, ASN H01 1.0288, ASN H0 1.1308.

Stage | 1 | 2 | 3 |
---|---|---|---|
Information rate | 33.3% | 66.7% | 100% |
Efficacy boundary (z-value scale) | 2.279 | 2.295 | 2.296 |
Stage levels (one-sided) | 0.0113 | 0.0109 | 0.0108 |
Cumulative alpha spent | 0.0226 | 0.0382 | 0.0500 |
Overall power | 0.3940 | 0.7316 | 0.9000 |

```
sampleSizeResult <- getSampleSizeMeans(
  design = design,
  groups = 2,
  alternative = 0.5,
  stDev = 1
)
kable(summary(sampleSizeResult))
```

**Sample size calculation for a continuous endpoint**

Sequential analysis with a maximum of 3 looks (group sequential design), overall significance level 5% (two-sided). The results were calculated for a two-sample t-test, H0: mu(1) - mu(2) = 0, H1: effect = 0.5, standard deviation = 1, power 90%.

Stage | 1 | 2 | 3 |
---|---|---|---|
Information rate | 33.3% | 66.7% | 100% |
Efficacy boundary (z-value scale) | 2.279 | 2.295 | 2.296 |
Overall power | 0.3940 | 0.7316 | 0.9000 |
Number of subjects | 65.4 | 130.9 | 196.3 |
Expected number of subjects under H1 | 122.6 | | |
Cumulative alpha spent | 0.0226 | 0.0382 | 0.0500 |
Two-sided local significance level | 0.0226 | 0.0217 | 0.0217 |
Lower efficacy boundary (t) | -0.578 | -0.406 | -0.330 |
Upper efficacy boundary (t) | 0.578 | 0.406 | 0.330 |
Exit probability for efficacy (under H0) | 0.0226 | 0.0155 | |
Exit probability for efficacy (under H1) | 0.3940 | 0.3375 | |

Legend:

*(t)*: treatment effect scale

Now imagine that due to logistical issues, we do not manage to analyze the data until we have collected data from 76 observations (38 in each condition) instead of the planned 66 observations. So our first look at the data does not occur at 33.3% of the planned sample, but at 76/198 = 38.4% of the planned sample. We can recalculate the α level we should use for each look at the data, based on the current look and planned future looks. Instead of using the α levels 0.0226, 0.0217, and 0.0217 at the three respective looks (as indicated above in the summary of the originally planned design), we can adjust the information rates in the Shiny app (double click on a cell to edit it; hit Ctrl+Enter to finish editing, or Esc to cancel):

The updated α levels are 0.0253 for the current look, 0.0204 for the second look, and 0.0216 for the final look. To compute updated bounds in R directly, we can use the code:

```
design <- getDesignGroupSequential(
  typeOfDesign = "asP",
  informationRates = c(76 / 198, 2 / 3, 1),
  alpha = 0.05,
  sided = 2
)
kable(summary(design))
```

**Sequential analysis with a maximum of 3 looks (group sequential design)**

Pocock type alpha spending design, two-sided overall significance level 5%, power 80%, undefined endpoint, inflation factor 1.1697, ASN H1 0.8167, ASN H01 1.0686, ASN H0 1.1464.

Stage | 1 | 2 | 3 |
---|---|---|---|
Information rate | 38.4% | 66.7% | 100% |
Efficacy boundary (z-value scale) | 2.236 | 2.318 | 2.296 |
Stage levels (one-sided) | 0.0127 | 0.0102 | 0.0108 |
Cumulative alpha spent | 0.0253 | 0.0382 | 0.0500 |
Overall power | 0.3597 | 0.5999 | 0.8000 |

It is also possible to correct the α level if the final look at the data changes, for example because you are not able to collect the intended sample size, or because due to unforeseen circumstances you collect more data than planned. If this happens, we can no longer use the α-spending function we chose, and instead have to provide a user-defined α-spending function by updating the timing and α-spending function to reflect the data collection as it actually occurred up to the final look.

Assuming the second look in our earlier example occurred as originally planned, but the last look occurred at 206 participants instead of 198, we can compute an updated α level for the last look. Given the current total sample size, we need to recompute the information rates for the earlier looks, which now occurred at 76/206 = 0.369 and 132/206 = 0.641, and for the last look at 206/206 = 1.

Because the first and second look occurred with the adjusted α levels we computed after the first adjustment (α levels of 0.0253 and 0.0204), we can look at the “Cumulative alpha spent” row and see how much of our Type I error rate we spent so far (0.0253 and 0.0382). We also know we want to spend the remainder of our Type I error rate at the last look, for a total of 0.05.
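The remaining Type I error rate available for the final look is simply the difference (a small sketch, using the cumulative alpha spent reported in the design summary above):

```r
alpha_total <- 0.05
alpha_spent_look2 <- 0.0382 # cumulative alpha spent after the first two looks
alpha_total - alpha_spent_look2 # 0.0118 remains for the final look
```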

After collecting more data than planned, our actual α-spending function is no longer captured by the Pocock spending function; instead, we have a user-defined spending function. We can enter both the updated information rates and the final α-spending function directly in the Shiny app by selecting the “User defined alpha spending (asUser)” option as “Type of design”:

The output shows that the computed α level for this final look is 0.0210 instead of 0.0216. The difference is very small in this specific case, but it might be larger depending on the situation. This example shows the flexibility of group sequential designs when α-spending functions are used. We can also perform these calculations in R directly:

```
design <- getDesignGroupSequential(
  typeOfDesign = "asUser",
  informationRates = c(72 / 206, 132 / 206, 1),
  alpha = 0.05,
  sided = 2,
  userAlphaSpending = c(0.0253, 0.0382, 0.05)
)
kable(summary(design))
```

**Sequential analysis with a maximum of 3 looks (group sequential design)**

User defined alpha spending design (0.025, 0.038, 0.05), two-sided overall significance level 5%, power 80%, undefined endpoint, inflation factor 1.1833, ASN H1 0.8213, ASN H01 1.0795, ASN H0 1.1583.

Stage | 1 | 2 | 3 |
---|---|---|---|
Information rate | 35% | 64.1% | 100% |
Efficacy boundary (z-value scale) | 2.237 | 2.329 | 2.312 |
Stage levels (one-sided) | 0.0127 | 0.0099 | 0.0104 |
Cumulative alpha spent | 0.0253 | 0.0382 | 0.0500 |
Overall power | 0.3317 | 0.5826 | 0.8000 |

We will once again start with the default settings of the Shiny app, which is a fixed design with one look. Click on the “Endpoint” tab to choose how you want to specify the desired endpoint in this study. We will assume we plan to perform a t-test, and therefore, that our endpoint is based on the means we observe.

Then click the “Trial Settings” tab. Here, you can specify if you want to calculate the required sample size (to achieve a desired power) or compute the expected power (based on a chosen sample size). By default, the calculation will be for a two-group (independent) t-test.

The same number of individuals is collected in each group (allocation ratio = 1). It is possible to use a normal approximation (which some software packages use), but the default setting, where the calculations are based on the t distribution, will be (ever so slightly) more accurate.

The effect under the null hypothesis is 0 by default, the default effect under the alternative is 0.2, and the default standard deviation is 1. This means that by default the power analysis is for a standardized effect size of Cohen’s d = 0.2/1 = 0.2. That is a small effect. In this example we will assume a researcher is interested in detecting a somewhat more substantial effect size, a mean difference of 0.5. This can be specified by changing the effect under the alternative to 0.5. Note that it is possible to compute the power for multiple values by selecting a value larger than 1 in the “# values” drop-down menu (but we will calculate power for a single alternative for now).

We can also directly perform these calculations in R:

```
design <- getDesignGroupSequential(
  kMax = 1,
  alpha = 0.05,
  sided = 2
)
kable(summary(getSampleSizeMeans(design, alternative = 0.5)))
```

**Sample size calculation for a continuous endpoint**

Fixed sample analysis, significance level 5% (two-sided). The results were calculated for a two-sample t-test, H0: mu(1) - mu(2) = 0, H1: effect = 0.5, standard deviation = 1, power 80%.

Stage | Fixed |
---|---|
Efficacy boundary (z-value scale) | 1.960 |
Number of subjects | 127.5 |
Two-sided local significance level | 0.0500 |
Lower efficacy boundary (t) | -0.350 |
Upper efficacy boundary (t) | 0.350 |

Legend:

*(t)*: treatment effect scale

These calculations show that for a fixed design we should collect 128 participants (64 in each condition) to achieve 80% power for a Cohen’s d of 0.5 (or a mean difference of 0.5 with an expected population standard deviation of 1).

This result is similar to what can be computed in power analysis software for non-sequential designs, such as G*Power.

We will now look at power in a sequential design. Change the slider for the number of looks (kMax) to 3. Furthermore, change the Type II error rate to 0.1 (a default of 0.2 is, regardless of what Cohen thought, really a bit large). By default rpact assumes we will look at the data at equally spaced moments – after 33%, 67%, and 100% of the data is collected. The default design is an O’Brien-Fleming design, with a one-sided test. Set the alternative hypothesis in the “Trial Settings” tab to 0.5. We can compute the sample size we would need for a sequential group design to achieve the desired error rates for a specified alternative using the `getSampleSizeMeans()` function in R.

```
seq_design_of <- getDesignGroupSequential(
  kMax = 3,
  typeOfDesign = "OF",
  sided = 1,
  alpha = 0.05,
  beta = 0.1
)
# Compute the sample size we need
power_res_of <- getSampleSizeMeans(
  design = seq_design_of,
  groups = 2,
  alternative = 0.5,
  stDev = 1,
  allocationRatioPlanned = 1,
  normalApproximation = FALSE
)
kable(summary(power_res_of))
```

**Sample size calculation for a continuous endpoint**

Sequential analysis with a maximum of 3 looks (group sequential design), overall significance level 5% (one-sided). The results were calculated for a two-sample t-test, H0: mu(1) - mu(2) = 0, H1: effect = 0.5, standard deviation = 1, power 90%.

Stage | 1 | 2 | 3 |
---|---|---|---|
Information rate | 33.3% | 66.7% | 100% |
Efficacy boundary (z-value scale) | 2.961 | 2.094 | 1.710 |
Overall power | 0.1055 | 0.6295 | 0.9000 |
Number of subjects | 47.3 | 94.6 | 141.8 |
Expected number of subjects under H1 | 107.1 | | |
Cumulative alpha spent | 0.0015 | 0.0187 | 0.0500 |
One-sided local significance level | 0.0015 | 0.0181 | 0.0437 |
Efficacy boundary (t) | 0.910 | 0.437 | 0.289 |
Exit probability for efficacy (under H0) | 0.0015 | 0.0172 | |
Exit probability for efficacy (under H1) | 0.1055 | 0.5240 | |

Legend:

*(t)*: treatment effect scale

The same output is available in the Shiny app under the “Sample Size” tab.

This output shows that at the first look, with a very strict α level of 0.0015, we will have almost no power. Even if there is a true effect of d = 0.5, in only 10.55% of the studies we run will we be able to stop after collecting 33% of the data (as we see in the row “Overall power” or “Cumulative Power”). One might wonder whether it would even be worth looking at the data at this time point (the answer might very well be ‘no’, and it is not necessary to design equally spaced looks). At the second look the overall power is 62.95%, which gives us a reasonable chance to stop if there is an effect, and at the final look it should be 90%, as this is what we designed the study to achieve. We can also print the full results (instead of just a summary), or select “Details” in the Shiny app:

`kable(power_res_of)`

**Design plan parameters and output for means**

**Design parameters**

*Information rates*: 0.333, 0.667, 1.000
*Critical values*: 2.961, 2.094, 1.710
*Futility bounds (binding)*: -Inf, -Inf
*Cumulative alpha spending*: 0.001533, 0.018739, 0.050000
*Local one-sided significance levels*: 0.001533, 0.018138, 0.043669
*Significance level*: 0.0500
*Type II error rate*: 0.1000
*Test*: one-sided

**User defined parameters**

*Alternatives*: 0.5

**Default parameters**

*Mean ratio*: FALSE
*Theta H0*: 0
*Normal approximation*: FALSE
*Standard deviation*: 1
*Treatment groups*: 2
*Planned allocation ratio*: 1

**Sample size and output**

*Maximum number of subjects*: 141.8
*Maximum number of subjects (1)*: 70.9
*Maximum number of subjects (2)*: 70.9
*Number of subjects [1]*: 47.3
*Number of subjects [2]*: 94.6
*Number of subjects [3]*: 141.8
*Reject per stage [1]*: 0.1055
*Reject per stage [2]*: 0.5240
*Reject per stage [3]*: 0.2705
*Early stop*: 0.6295
*Expected number of subjects under H0*: 140.9
*Expected number of subjects under H0/H1*: 132
*Expected number of subjects under H1*: 107.1
*Critical values (treatment effect scale) [1]*: 0.910
*Critical values (treatment effect scale) [2]*: 0.437
*Critical values (treatment effect scale) [3]*: 0.289

**Legend**

*(i)*: values of treatment arm i
*[k]*: values at stage k

We see that the maximum number of subjects we would need to collect is 141.8, or rounded up, 142. The expected number of subjects under H0 (when there is no true effect) is 140.9 - we will almost always collect data up to the third look, unless we make a Type I error and stop at one of the first two looks.

The expected number of subjects under H1 (i.e., d = 0.5) is 107.1. If there is a true effect of d = 0.5, we will stop early in some studies, and therefore the average expected sample size is lower than the maximum.
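These expected sample sizes can be reproduced by hand from the “Reject per stage” probabilities in the output above (a sketch; studies that do not stop early continue to the final look, including the 10% of studies under H1 that end without rejecting H0):

```r
n <- c(47.3, 94.6, 141.8) # number of subjects at each look
# Under H1: stop after look 1 or 2 only when we reject; otherwise run to look 3
p_stop_h1 <- c(0.1055, 0.5240)
sum(p_stop_h1 * n[1:2]) + (1 - sum(p_stop_h1)) * n[3] # ~107.1
# Under H0: early stops are Type I errors, which are rare
p_stop_h0 <- c(0.0015, 0.0172)
sum(p_stop_h0 * n[1:2]) + (1 - sum(p_stop_h0)) * n[3] # ~140.9 (up to rounding)
```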

We can plot the results across a range of possible effect sizes:

```
sample_res_plot <- getPowerMeans(
  design = seq_design_of,
  groups = 2,
  alternative = seq(0, 1, 0.01),
  stDev = 1,
  allocationRatioPlanned = 1,
  maxNumberOfSubjects = 142, # rounded up
  normalApproximation = FALSE
)
# code for plot (not run, we show an annotated version of this plot)
# plot(sample_res_plot, type = 6, legendPosition = 6)
```

To create this plot in the Shiny app, you need to specify the design, in the endpoint tab select “Means”, and in the trial settings select “Power” as the calculation target, two groups, and for the number of values, select 50 from the drop-down menu. Specify the lower (i.e., 0) and upper (i.e., 1) value of the mean difference (given the standard deviation of 1, these values will also be Cohen’s d effect sizes). The maximum number of subjects is set to 142 (based on the power analysis we performed above). Go to the “Plot” tab and select the “Sample Size [6]” plot.

If you click on the “Plot” tab, select the Sample Size graph [6], and set the max sample size (nMax) to 50, you see that depending on the true effect size, there is a decent probability of stopping early (blue line) rather than at the final look (green line). Furthermore, the larger the effect size, the lower the average sample size will be (red line).

Without sequential analyses we would collect 50 participants (the maximum sample size specified). But when the true effect size is large, we have a high probability to stop early, and the sample size that one needs to collect will on average (in the long run of doing many sequential designs) be lower.

After this general introduction to the benefits of group sequential designs to efficiently design well powered studies, we will look at more concrete examples of how to perform an a-priori power analysis for sequential designs.

When designing a study where the goal is to test whether a specific effect can be statistically rejected, researchers often want to make sure their sample size is large enough to have sufficient power for an effect size of interest. This is done by performing an a-priori power analysis. Given a specified effect size, α level, and desired power, an a-priori power analysis will indicate the number of observations that should be collected.

An informative study has a high probability of correctly concluding an effect is present when it is present, and absent when it is absent. An a-priori power analysis is used to choose a sample size to achieve desired Type I and Type II error rates, in the long run, given assumptions about the null and alternative model.

We will assume that we want to design a study that can detect a difference of 0.5, with an assumed standard deviation in the population of 1, which means the expected effect is a Cohen’s d of 0.5. We plan to analyze our hypothesis with a one-sided test (given our directional prediction), set the overall α level to 0.05, and want to achieve a Type II error probability of 0.1 (or a power of 0.9). Finally, we believe it is feasible to perform 2 interim analyses and one final analysis (e.g., collect the data across three weeks, and we are willing to stop the data collection after any Friday). How many observations would we need?

The decision depends on the final factor we need to decide on in a sequential design: the α-spending function. We can choose an α-spending function as we design our experiment, and compare different choices of spending function. We will start by examining the sample size we need to collect if we choose an O’Brien-Fleming α-spending function.

On the “Endpoint” tab we specify means. Then we move to the “Trial Settings” tab. It is easy in rpact to plot power across a range of effect sizes, by selecting multiple values from the drop-down menu (i.e., 5). We set 0.3 and 0.7 as the lower and upper values, and keep the standard deviation at 1, so that we get the sample sizes for Cohen’s d ranging from 0.3 to 0.7.

Sometimes you might have a clearly defined effect size to test against – such as a theoretically predicted effect size, or a smallest practically relevant effect size. Other times, you might primarily know the sample size you are able to collect, and want to perform a sensitivity analysis, where you examine which effect size you can detect with a desired power, given a certain sample size. Plotting power across a range of effect sizes is typically useful: even if you know which effect size you expect, you might want to look at the consequences of the true effect size being slightly different than expected.

Open the “Plot” tab and from the drop-down menu select “Sample size [6]”. You will see a plot like the one below, created with the rpact package. From the results (in the row “Maximum number of subjects”), we see that if the true effect size is indeed d = 0.5, we would need to collect at most 141 participants (the result differs very slightly from the power analysis reported above, as we use the O’Brien-Fleming alpha spending function, and not the O’Brien-Fleming correction). In the two rows below, we see that this is based on 71 (rounded up) participants in each condition, so in practice we would actually collect a total of 142 participants due to upward rounding within each condition.

```
design <- getDesignGroupSequential(
  typeOfDesign = "asOF",
  alpha = 0.05, beta = 0.1
)
sample_res <- getSampleSizeMeans(design,
  alternative = c(0.3, 0.4, 0.5, 0.6, 0.7)
)
plot(sample_res, type = 5, legendPosition = 4)
```

This maximum is only slightly higher than for a fixed design. For a fixed design (which you can examine by moving the slider for the maximum number of stages back to 1), we would need to collect 69.2 participants per condition, or 138.4 in total, while for a sequential design, the maximum sample size per condition is 70.5.

The difference between a fixed design and a sequential design can be quantified by the “Inflation factor”. We can find the inflation factor for the sequential design under “Characteristics” in the “Design” tab (select “Details + characteristics” or “Summary + details + characteristics” for the R output), which is 1.0187. In other words, the maximum sample size increased to 69.2 x 1.0187 = 70.5 per condition. The inflation is essentially caused by the reduction in the α level at the final look, and differs between designs (e.g., for a Pocock type alpha spending function, the inflation factor for the current design is larger, namely 1.1595).
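The multiplication can be checked directly (the fixed-design per-condition sample size and the two inflation factors are the values reported above):

```r
n_fixed <- 69.2 # per condition, fixed design
n_fixed * 1.0187 # O'Brien-Fleming-like spending: ~70.5 per condition
n_fixed * 1.1595 # Pocock-like spending: ~80.2 per condition
```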

However, the maximum sample size is not the expected sample size for this design, because of the possibility that we can stop data collection at an earlier look in the sequential design. In the long run, if d = 0.5, and we use an O’Brien-Fleming α-spending function, and ignoring upward rounding (because we can only collect a whole number of observations), we will sometimes collect 47 participants and stop after the first look (see the row “Number of subjects [1]”), sometimes 94 and stop after the second look (see the row “Number of subjects [2]”), and sometimes 141 and stop after the last look (see the row “Number of subjects [3]”).

As we see in the row “Exit probability for efficacy (under H1)”, we can stop early 6.75% of the time after look 1 and 54.02% after look 2, and in the remaining 1 - (0.0675 + 0.5402) = 39.23% of cases we will stop at the last look.

This means that, assuming there is a true effect of d = 0.5, the *expected* sample size on average is the probability of stopping at each look multiplied by the number of observations collected at each look: 0.0675 * 47.0 + 0.5402 * 94.0 + ((1 - (0.0675 + 0.5402)) * 141.0) = 109.3, which matches the row “Expected number of subjects under H1” (again, assuming the alternative hypothesis of d = 0.5 is correct). So, in any single study we might need to collect slightly more data than in a fixed design, but on average we will need to collect fewer observations in a sequential design, namely 109.3 instead of 138.4 in a fixed design (assuming the alternative hypothesis is true).

```
design <- getDesignGroupSequential(typeOfDesign = "asOF", alpha = 0.05, beta = 0.1)
# getDesignCharacteristics(design)$inflationFactor
sample_res <- getSampleSizeMeans(design, alternative = 0.5)
kable(sample_res)
```

**Design plan parameters and output for means**

**Design parameters**

*Information rates*: 0.333, 0.667, 1.000
*Critical values*: 3.200, 2.141, 1.695
*Futility bounds (binding)*: -Inf, -Inf
*Cumulative alpha spending*: 0.0006869, 0.0163747, 0.0500000
*Local one-sided significance levels*: 0.0006869, 0.0161445, 0.0450555
*Significance level*: 0.0500
*Type II error rate*: 0.1000
*Test*: one-sided

**User defined parameters**

*Alternatives*: 0.5

**Default parameters**

*Mean ratio*: FALSE
*Theta H0*: 0
*Normal approximation*: FALSE
*Standard deviation*: 1
*Treatment groups*: 2
*Planned allocation ratio*: 1

**Sample size and output**

*Maximum number of subjects*: 141
*Maximum number of subjects (1)*: 70.5
*Maximum number of subjects (2)*: 70.5
*Number of subjects [1]*: 47
*Number of subjects [2]*: 94
*Number of subjects [3]*: 141
*Reject per stage [1]*: 0.06749
*Reject per stage [2]*: 0.54022
*Reject per stage [3]*: 0.29230
*Early stop*: 0.6077
*Expected number of subjects under H0*: 140.2
*Expected number of subjects under H0/H1*: 132.3
*Expected number of subjects under H1*: 109.3
*Critical values (treatment effect scale) [1]*: 0.995
*Critical values (treatment effect scale) [2]*: 0.448
*Critical values (treatment effect scale) [3]*: 0.287

**Legend**

*(i)*: values of treatment arm i
*[k]*: values at stage k

For a Pocock α-spending function the maximum sample size is larger (you can check by changing the spending function). The reason is that the α-level at the final look is lower for a Pocock spending function than for the O’Brien-Fleming spending function, and the sample size required to achieve the desired power is thus higher. However, because the α-level at the first look is higher, there is a higher probability of stopping early, and therefore the expected sample size is lower for a Pocock spending function (97.7 compared to 109.3). It is up to the researcher to choose a spending function, weighing how desirable it is to stop early against the risk that any single study requires a larger sample size at the final look. For these specific design parameters, the Pocock α-spending function might be more efficient on average, but also more risky in any single study.
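
This comparison can be checked directly; a sketch, assuming rpact and knitr are loaded as in the rest of this tutorial:

```r
# Same design parameters, but with a Pocock alpha-spending function ("asP")
design_pocock <- getDesignGroupSequential(typeOfDesign = "asP", alpha = 0.05, beta = 0.1)
sample_res_pocock <- getSampleSizeMeans(design_pocock, alternative = 0.5)
# The maximum number of subjects should be larger than the 141 found for
# "asOF", while the expected number of subjects under H1 should be smaller
kable(sample_res_pocock)
```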

So far, the sequential design would only stop at an interim analysis if we can reject H0. It is also possible to stop for futility, for example based on a β-spending function. We can directly compare the previous design with a design where we stop for futility. Just as we are willing to distribute our Type I error rate across interim analyses, we can distribute our Type II error rate across looks, and decide to stop for futility when we can reject the presence of an effect at least as large as d = 0.5, even at the risk of making a Type II error.

If there actually is no effect, such designs are more efficient. One can choose in advance to stop data collection whenever the presence of the effect the study was designed to detect can be rejected (i.e., binding β-spending), but it is typically recommended to allow the possibility to continue data collection (i.e., non-binding β-spending). Adding futility bounds based on β-spending functions reduces power and increases the sample size required to reach a desired power, but on average this is compensated by the fact that studies stop earlier due to futility, which can make designs more efficient.

When an α-spending function is chosen in the rpact Shiny app, a new drop-down menu appears that allows users to choose a β-spending function. In the R package, we simply add `typeBetaSpending = "bsOF"` to the specification of the design. You do not need to choose the same spending approach for α and β as is done in this example.

```
design <- getDesignGroupSequential(
  typeOfDesign = "asOF",
  alpha = 0.05, beta = 0.1, typeBetaSpending = "bsOF"
)
sample_res <- getSampleSizeMeans(design, alternative = 0.5)
kable(sample_res)
```

**Design plan parameters and output for means**

**Design parameters**

*Information rates*: 0.333, 0.667, 1.000
*Critical values*: 3.200, 2.141, 1.695
*Futility bounds (binding)*: -0.873, 0.751
*Cumulative alpha spending*: 0.0006869, 0.0163747, 0.0500000
*Local one-sided significance levels*: 0.0006869, 0.0161445, 0.0450555
*Significance level*: 0.0500
*Type II error rate*: 0.1000
*Test*: one-sided

**User defined parameters**

*Alternatives*: 0.5

**Default parameters**

*Mean ratio*: FALSE
*Theta H0*: 0
*Normal approximation*: FALSE
*Standard deviation*: 1
*Treatment groups*: 2
*Planned allocation ratio*: 1

**Sample size and output**

*Maximum number of subjects*: 148.2
*Maximum number of subjects (1)*: 74.1
*Maximum number of subjects (2)*: 74.1
*Number of subjects [1]*: 49.4
*Number of subjects [2]*: 98.8
*Number of subjects [3]*: 148.2
*Reject per stage [1]*: 0.07327
*Reject per stage [2]*: 0.55751
*Reject per stage [3]*: 0.26921
*Overall futility stop*: 0.04395
*Futility stop per stage [1]*: 0.004386
*Futility stop per stage [2]*: 0.039568
*Early stop*: 0.6747
*Expected number of subjects under H0*: 99.6
*Expected number of subjects under H0/H1*: 121
*Expected number of subjects under H1*: 111
*Critical values (treatment effect scale) [1]*: 0.968
*Critical values (treatment effect scale) [2]*: 0.437
*Critical values (treatment effect scale) [3]*: 0.280
*Futility bounds (treatment effect scale) [1]*: -0.251
*Futility bounds (treatment effect scale) [2]*: 0.152
*Futility bounds (one-sided p-value scale) [1]*: 0.8085
*Futility bounds (one-sided p-value scale) [2]*: 0.2264

**Legend**

*(i)*: values of treatment arm i
*[k]*: values at stage k

We see that with a β-spending function the expected number of subjects under H1 has increased from 109.3 to 111.0. The maximum number of subjects has increased from 141 to 148.2. So, if the alternative hypothesis is true, stopping for futility comes at a cost. However, it is possible that H0 is true.

At the last look in our sequential design, which we designed to have 90% power, we are willing to act as if H0 is true with a 10% error rate. We can reverse the null and alternative hypotheses, and view the same decision process as an equivalence test. In this view, we test whether we can reject the presence of a meaningful effect. For example, if our smallest effect size of interest is a mean difference of 0.5, and we observe a mean difference that is surprisingly far away from 0.5, we can reject the presence of an effect that is large enough to care about. In essence, in such an equivalence test the Type II error of the original null hypothesis significance test has become the Type I error rate. Because we have designed our null hypothesis significance test to have 90% power for a mean difference of 0.5, 10% of the time we would incorrectly decide to act as if an effect of at least 0.5 is absent. This is statistically comparable to performing an equivalence test with an α-level of 10%, and deciding to act as if we can reject the presence of an effect at least as large as 0.5, which should also happen 10% of the time in the long run.

If we can reject the presence of a meaningful effect at an earlier look, we would save resources whenever H0 is true. We see that the expected number of subjects under H0 was 140.2. In other words, when H0 is true, we would continue to the last look most of the time (unless we made a Type I error at look 1 or 2). With a β-spending function, the expected number of subjects under H0 has decreased substantially, to 99.6. The choice of whether to use a β-spending function depends on the goals of your study. If you believe there is a decent probability that H0 is true, and you would like to efficiently conclude this from the data, the use of a β-spending approach might be worth considering.
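
The efficiency gain under H0 can be seen by computing both designs side by side; a sketch, assuming rpact and knitr are loaded, with the expected values taken from the two outputs shown earlier:

```r
# O'Brien-Fleming alpha-spending, without and with a beta-spending function
design_no_futility <- getDesignGroupSequential(
  typeOfDesign = "asOF", alpha = 0.05, beta = 0.1
)
design_futility <- getDesignGroupSequential(
  typeOfDesign = "asOF", alpha = 0.05, beta = 0.1, typeBetaSpending = "bsOF"
)
# Expected number of subjects under H0: 140.2 without futility stopping,
# 99.6 with futility stopping
kable(getSampleSizeMeans(design_no_futility, alternative = 0.5))
kable(getSampleSizeMeans(design_futility, alternative = 0.5))
```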

A challenge when interpreting the observed effect size is that whenever a study is stopped early after rejecting H0, there is a risk that we stopped because, due to random variation, we happened to observe a large effect size at the time of the interim analysis. This means that the observed effect size at these interim analyses over-estimates the true effect.

A similar issue is at play when reporting p values and confidence intervals. When a sequential design is used, the distribution of a p value that does not account for the sequential nature of the design is no longer uniform when H0 is true. A p value is the probability of observing a result at least as extreme as the result that was observed, given that H0 is true. It is no longer straightforward to determine what ‘at least as extreme’ means in a sequential design (Cook, 2002). It is possible to compute adjusted effect size estimates, confidence intervals, and p values in rpact. This currently cannot be done in the Shiny app.

Imagine we have performed a study planned to have at most 3 equally spaced looks at the data, where we perform a two-sided test with an α of 0.05, and we use a Pocock type α-spending function. We observe cumulative mean differences between the two conditions of 0.587 at stage 1, 0.392 at stage 2, and 0.469 at the last look. Based on a Pocock-like α-spending function with three equally spaced looks, the α-level at each look for a two-sided test is 0.02264, 0.02174, and 0.02168. We can thus reject H0 after look 3. But we would also like to report an effect size, and adjusted p values and confidence intervals.

The first step is to create a dataset with the results at each look, consisting of the sample sizes, means, and standard deviations. Note that these are the sample sizes, means, and standard deviations based only on the data at each stage. In other words, we compute the means and standard deviations of later looks by excluding the data from earlier looks, so every mean and standard deviation in this example is based on 33 observations per condition.

```
data_means <- getDataset(
  n1 = c(33, 33, 33),
  n2 = c(33, 33, 33),
  means1 = c(0.6067868, 0.2795294, 0.7132186),
  means2 = c(0.01976029, 0.08212538, 0.08982903),
  stDevs1 = c(1.135266, 1.35426, 1.013671),
  stDevs2 = c(1.068052, 0.9610714, 1.225192)
)
kable(summary(data_means))
```

**Dataset of means**

The dataset contains the sample sizes, means, and standard deviations of one treatment and one control group. The total number of looks is three; stage-wise and cumulative data are included.

Stage | 1 | 1 | 2 | 2 | 3 | 3 |
---|---|---|---|---|---|---|

Group | 1 | 2 | 1 | 2 | 1 | 2 |

Stage-wise sample size | 33 | 33 | 33 | 33 | 33 | 33 |

Cumulative sample size | 33 | 33 | 66 | 66 | 99 | 99 |

Stage-wise mean | 0.607 | 0.020 | 0.280 | 0.082 | 0.713 | 0.090 |

Cumulative mean | 0.607 | 0.020 | 0.443 | 0.051 | 0.533 | 0.064 |

Stage-wise standard deviation | 1.135 | 1.068 | 1.354 | 0.961 | 1.014 | 1.225 |

Cumulative standard deviation | 1.135 | 1.068 | 1.251 | 1.009 | 1.179 | 1.079 |

We then take our design:

```
seq_design <- getDesignGroupSequential(
  kMax = 3,
  typeOfDesign = "asP",
  sided = 2,
  alpha = 0.05,
  beta = 0.1
)
```

and compute the results based on the data we entered:

```
res <- getAnalysisResults(
  seq_design,
  equalVariances = FALSE,
  dataInput = data_means,
  thetaH1 = 0.5,
  assumedStDev = 1
)
```

`Warning: 'thetaH1' (0.5) will be ignored because 'nPlanned' is not defined`

`Warning: 'assumedStDev' (1) will be ignored because 'nPlanned' is not defined`

We can then print a summary of the results:

`kable(summary(res))`

**Analysis results for a continuous endpoint**

Sequential analysis with 3 looks (group sequential design). The results were calculated using a two-sample t-test (two-sided, alpha = 0.05), unequal variances option. H0: mu(1) - mu(2) = 0 against H1: mu(1) - mu(2) != 0.

Stage | 1 | 2 | 3 |
---|---|---|---|

Fixed weight | 0.333 | 0.667 | 1 |

Efficacy boundary (z-value scale) | 2.279 | 2.295 | 2.296 |

Cumulative alpha spent | 0.0226 | 0.0382 | 0.0500 |

Stage level | 0.0113 | 0.0109 | 0.0108 |

Cumulative effect size | 0.587 | 0.392 | 0.469 |

Cumulative (pooled) standard deviation | 1.102 | 1.136 | 1.130 |

Overall test statistic | 2.163 | 1.983 | 2.921 |

Overall p-value | 0.0171 | 0.0248 | 0.0019 |

Test action | continue | continue | reject |

Conditional rejection probability | 0.3411 | 0.2303 | |

95% repeated confidence interval | [-0.047; 1.221] | [-0.067; 0.852] | [0.097 ; 0.841] |

Repeated p-value | 0.0757 | 0.1067 | 0.0105 |

Final p-value | 0.0393 | ||

Final confidence interval | [0.022; 0.743] | ||

Median unbiased estimate | 0.403 |

The results show that the action after looks 1 and 2 was to continue data collection, and that we could reject H0 at the third look. The unadjusted mean difference is provided in the row “Cumulative effect size”, and at the final look this was 0.469. The adjusted mean difference is provided in the row “Median unbiased estimate” and is lower; together with the adjusted confidence interval in the row “Final confidence interval”, this gives the result 0.403, 95% CI [0.022, 0.743].

The unadjusted p values for a one-sided test are reported in the row “Overall p-value”. The actual p values for our two-sided test would be twice as large: 0.0342596, 0.0495679, and 0.0038994. The adjusted p value at the final look is provided in the row “Final p-value” and is 0.03928.

The probability of finding a significant result, given the data that have been observed up to an interim analysis, is called *conditional power*. This approach can be useful in adaptive designs, that is, designs where the final sample size is updated based on an early look at the data. In a *blinded* sample size recalculation no effect size is calculated at an earlier look, but other aspects of the design, such as the standard deviation, are updated. In an *unblinded* sample size recalculation, the effect size estimate at an early look is used to determine the final sample size.

Let us imagine that we perform a sequential design using Pocock α- and β-spending functions:

```
seq_design <- getDesignGroupSequential(
  sided = 1,
  alpha = 0.05,
  beta = 0.1,
  typeOfDesign = "asP",
  typeBetaSpending = "bsP",
  bindingFutility = FALSE
)
```

We perform an a priori power analysis based on a smallest effect size of interest of d = 0.38, which yields a maximum number of subjects of 330.

```
power_res <- getSampleSizeMeans(
  design = seq_design,
  groups = 2,
  alternative = 0.38,
  stDev = 1,
  allocationRatioPlanned = 1,
  normalApproximation = FALSE
)
kable(summary(power_res))
```

**Sample size calculation for a continuous endpoint**

Sequential analysis with a maximum of 3 looks (group sequential design), overall significance level 5% (one-sided). The results were calculated for a two-sample t-test, H0: mu(1) - mu(2) = 0, H1: effect = 0.38, standard deviation = 1, power 90%.

Stage | 1 | 2 | 3 |
---|---|---|---|

Information rate | 33.3% | 66.7% | 100% |

Efficacy boundary (z-value scale) | 2.002 | 1.994 | 1.980 |

Futility boundary (z-value scale) | 0.293 | 1.175 | |

Overall power | 0.4933 | 0.8045 | 0.9000 |

Number of subjects | 109.8 | 219.6 | 329.4 |

Expected number of subjects under H1 | 173.5 | ||

Cumulative alpha spent | 0.0226 | 0.0382 | 0.0500 |

Cumulative beta spent | 0.0453 | 0.0763 | 0.1000 |

One-sided local significance level | 0.0226 | 0.0231 | 0.0238 |

Efficacy boundary (t) | 0.387 | 0.271 | 0.219 |

Futility boundary (t) | 0.056 | 0.159 | |

Overall exit probability (under H0) | 0.6378 | 0.2888 | |

Overall exit probability (under H1) | 0.5386 | 0.3422 | |

Exit probability for efficacy (under H0) | 0.0226 | 0.0148 | |

Exit probability for efficacy (under H1) | 0.4933 | 0.3112 | |

Exit probability for futility (under H0) | 0.6152 | 0.2740 | |

Exit probability for futility (under H1) | 0.0453 | 0.0311 |

Legend:

*(t)*: treatment effect scale

We first looked at the data after we collected 110 observations. At this time, we observed a mean difference of 0.1. Let us say we assume the population standard deviation is 1, and that we are willing to collect 330 observations in total, as this gave us 90% power for the effect we wanted to detect, a mean difference of 0.38. Given the effect size we observed, which is smaller than our smallest effect size of interest, what is the probability we will find a significant effect if we continue? We create a dataset:

```
data_means <- getDataset(
  n1 = c(55),
  n2 = c(55),
  means1 = c(0.1), # for a directional test, means1 > means2
  means2 = c(0),
  stDevs1 = c(1),
  stDevs2 = c(1)
)
```

and analyze the results:

```
stage_res <- getStageResults(seq_design,
  equalVariances = TRUE,
  dataInput = data_means
)
kable(stage_res)
```

**Stage results of means**

**Default parameters**

*Stages*: 1, 2, 3
*Theta H0*: 0
*Direction*: upper
*Normal approximation*: FALSE
*Equal variances*: TRUE

**Output**

*Overall test statistics*: 0.524, NA, NA
*Overall p-values*: 0.3005, NA, NA
*Cumulative means (1)*: 0.1, NA, NA
*Cumulative means (2)*: 0, NA, NA
*Cumulative standard deviations (1)*: 1, NA, NA
*Cumulative standard deviations (2)*: 1, NA, NA
*Cumulative sample sizes (1)*: 55, NA, NA
*Cumulative sample sizes (2)*: 55, NA, NA
*Stage-wise test statistics*: 0.524, NA, NA
*Stage-wise p-values*: 0.3005, NA, NA
*Cumulative effect sizes*: 0.1, NA, NA

**Legend**

*(i)*: values of treatment arm i

We can now perform a conditional power analysis based on the data we have observed so far. An important question is which effect size should be entered. Irrespective of the effect size we expected when designing the study, we have observed an effect of d = 0.1, and the smallest effect size of interest was a d = 0.38. We can compute the power under the assumption that the true effect size is d = 0.1 and d = 0.38:

```
# Compute conditional power after the first look
con_power_1 <- getConditionalPower(
  design = seq_design,
  stageResults = stage_res,
  nPlanned = c(110, 110), # sample sizes planned for the subsequent stages
  thetaH1 = 0.1, # alternative effect
  assumedStDev = 1 # standard deviation
)
kable(con_power_1)
```

**Conditional power results means**

**User defined parameters**

*Planned sample size*: NA, 110, 110
*Assumed effect under alternative*: 0.1
*Assumed standard deviation*: 1

**Default parameters**

*Planned allocation ratio*: 1

**Output**

*Conditional power*: NA, 0.0382, 0.0904

If the true effect size is 0.1, the conditional power at the final look is only 0.09. Under this assumption, there is little use in continuing the data collection. We can repeat the calculation under the assumption that the smallest effect size of interest is the true effect size:

```
# Compute conditional power after the first look
con_power_2 <- getConditionalPower(
  design = seq_design,
  stageResults = stage_res,
  nPlanned = c(110, 110), # sample sizes planned for the subsequent stages
  thetaH1 = 0.38, # alternative effect
  assumedStDev = 1 # standard deviation
)
kable(con_power_2)
```

**Conditional power results means**

**User defined parameters**

*Planned sample size*: NA, 110, 110
*Assumed effect under alternative*: 0.38
*Assumed standard deviation*: 1

**Default parameters**

*Planned allocation ratio*: 1

**Output**

*Conditional power*: NA, 0.3805, 0.7126

Under the assumption that the smallest effect size of interest exists, there is a reasonable probability of still observing a significant result at the last look (71.26%).

Because of the flexibility in choosing the number of looks and the α-spending function, it is important to preregister your statistical analysis plan. Preregistration allows other researchers to evaluate the severity of a test: how likely were you to find an effect if it is there, and how likely were you to not find an effect if there is no effect? Flexibility in the data analysis increases the Type I error rate, or the probability of finding an effect when there actually is no effect (i.e., a false positive), and preregistering your sequential analysis plan can reveal to future readers that you severely tested your prediction.

The use of sequential analyses gives researchers more flexibility. To make sure this flexibility is not abused, the planned experimental design should be preregistered. The easiest way to do this is by either adding the rpact R code, or when the Shiny app is used, to use the export function and store the planned design as a PDF, R Markdown, or R file.

The **sample size** for a trial with binary endpoints can be calculated using the function `getSampleSizeRates()`. This function is fully documented in the help page (`?getSampleSizeRates`). Hence, we only provide some examples below.

First, load the rpact package.

```
library(rpact)
packageVersion("rpact")
```

`[1] '3.5.1'`

To get the **direction** of the effects correctly, note that in rpact the **index “2” in an argument name always refers to the control group, “1” to the intervention group, and treatment effects compare treatment versus control**. Specifically, for binary endpoints, the probabilities of an event in the control group and intervention group, respectively, are given by the arguments `pi2` and `pi1`. The default treatment effect is the absolute risk difference `pi1 - pi2`, but the relative risk scale `pi1/pi2` is also supported if the argument `riskRatio` is set to `TRUE`.

```
# Example of a standard trial:
# - probability 25% in control (pi2 = 0.25) vs. 40% (pi1 = 0.4) in intervention
# - one-sided test (sided = 1)
# - Type I error 0.025 (alpha = 0.025) and power 80% (beta = 0.2)
sampleSizeResult <- getSampleSizeRates(
  pi2 = 0.25, pi1 = 0.4,
  sided = 1, alpha = 0.025, beta = 0.2
)
kable(sampleSizeResult)
```

**Design plan parameters and output for rates**

**Design parameters**

*Critical values*: 1.960
*Significance level*: 0.0250
*Type II error rate*: 0.2000
*Test*: one-sided

**User defined parameters**

*Assumed treatment rate*: 0.400
*Assumed control rate*: 0.250

**Default parameters**

*Risk ratio*: FALSE
*Theta H0*: 0
*Normal approximation*: TRUE
*Treatment groups*: 2
*Planned allocation ratio*: 1

**Sample size and output**

*Direction upper*: TRUE
*Number of subjects fixed*: 303.7
*Number of subjects fixed (1)*: 151.9
*Number of subjects fixed (2)*: 151.9
*Critical values (treatment effect scale)*: 0.103

**Legend**

*(i)*: values of treatment arm i

As per the output above, the required **total sample size** is 304, and the critical value corresponds to a minimal detectable difference (on the absolute risk difference scale, the default) of approximately 0.103. This calculation assumes that the rate observed in the control group (group 2) is pi2 = 0.25.

A useful summary is provided with the generic `summary()` function:

`kable(summary(sampleSizeResult))`

**Sample size calculation for a binary endpoint**

Fixed sample analysis, significance level 2.5% (one-sided). The results were calculated for a two-sample test for rates (normal approximation), H0: pi(1) - pi(2) = 0, H1: treatment rate pi(1) = 0.4, control rate pi(2) = 0.25, power 80%.

Stage | Fixed |
---|---|

Efficacy boundary (z-value scale) | 1.960 |

Number of subjects | 303.7 |

One-sided local significance level | 0.0250 |

Efficacy boundary (t) | 0.103 |

Legend:

*(t)*: treatment effect scale

You can change the randomization allocation between the treatment groups using `allocationRatioPlanned`:

```
# Example: Extension of standard trial
# - 2(intervention):1(control) randomization (allocationRatioPlanned = 2)
kable(summary(getSampleSizeRates(
  pi2 = 0.25, pi1 = 0.4,
  sided = 1, alpha = 0.025, beta = 0.2,
  allocationRatioPlanned = 2
)))
```

**Sample size calculation for a binary endpoint**

Fixed sample analysis, significance level 2.5% (one-sided). The results were calculated for a two-sample test for rates (normal approximation), H0: pi(1) - pi(2) = 0, H1: treatment rate pi(1) = 0.4, control rate pi(2) = 0.25, planned allocation ratio = 2, power 80%.

Stage | Fixed |
---|---|

Efficacy boundary (z-value scale) | 1.960 |

Number of subjects | 346.3 |

One-sided local significance level | 0.0250 |

Efficacy boundary (t) | 0.104 |

Legend:

*(t)*: treatment effect scale

`allocationRatioPlanned = 0` can be defined in order to obtain the optimum allocation ratio, i.e., the one minimizing the overall sample size (the optimum sample size is only slightly smaller than the sample size with equal allocation; practically, this has no effect):

```
# Example: Extension of standard trial
# optimum randomization ratio
kable(summary(getSampleSizeRates(
  pi2 = 0.25, pi1 = 0.4,
  sided = 1, alpha = 0.025, beta = 0.2,
  allocationRatioPlanned = 0
)))
```

**Sample size calculation for a binary endpoint**

Fixed sample analysis, significance level 2.5% (one-sided). The results were calculated for a two-sample test for rates (normal approximation), H0: pi(1) - pi(2) = 0, H1: treatment rate pi(1) = 0.4, control rate pi(2) = 0.25, optimum planned allocation ratio = 0.953, power 80%.

Stage | Fixed |
---|---|

Efficacy boundary (z-value scale) | 1.960 |

Number of subjects | 303.6 |

One-sided local significance level | 0.0250 |

Efficacy boundary (t) | 0.103 |

Legend:

*(t)*: treatment effect scale

**Power** at a given sample size can be calculated using the function `getPowerRates()`. This function has the same arguments as `getSampleSizeRates()`, except that the maximum total sample size needs to be defined (`maxNumberOfSubjects`) and the Type II error `beta` is no longer needed. For one-sided tests, the direction of the test is also required. The default `directionUpper = TRUE` indicates that under the alternative the probability in the intervention group `pi1` is larger than the probability in the control group `pi2` (`directionUpper = FALSE` is the other direction):

```
# Example: Calculate power for a simple trial with total sample size 304
# as in the example above in case of pi2 = 0.25 (control) and
# pi1 = 0.37 (intervention)
powerResult <- getPowerRates(
  pi2 = 0.25, pi1 = 0.37,
  maxNumberOfSubjects = 304, sided = 1, alpha = 0.025
)
kable(powerResult)
```

**Design plan parameters and output for rates**

**Design parameters**

*Critical values*: 1.960
*Significance level*: 0.0250
*Test*: one-sided

**User defined parameters**

*Assumed treatment rate*: 0.370
*Assumed control rate*: 0.250
*Maximum number of subjects*: 304

**Default parameters**

*Risk ratio*: FALSE
*Theta H0*: 0
*Normal approximation*: TRUE
*Treatment groups*: 2
*Planned allocation ratio*: 1
*Direction upper*: TRUE

**Power and output**

*Effect*: 0.12
*Overall reject*: 0.6196
*Number of subjects fixed*: 304
*Number of subjects fixed (1)*: 152
*Number of subjects fixed (2)*: 152
*Critical values (treatment effect scale)*: 0.103

**Legend**

*(i)*: values of treatment arm i

The calculated **power** is provided in the output as **“Overall reject”** and is 0.620 for the example.

The `summary()` command produces the output

`kable(summary(powerResult))`

**Power calculation for a binary endpoint**

Fixed sample analysis, significance level 2.5% (one-sided). The results were calculated for a two-sample test for rates (normal approximation), H0: pi(1) - pi(2) = 0, power directed towards larger values, H1: treatment rate pi(1) = 0.37, control rate pi(2) = 0.25, number of subjects = 304.

Stage | Fixed |
---|---|

Efficacy boundary (z-value scale) | 1.960 |

Power | 0.6196 |

Number of subjects | 304.0 |

One-sided local significance level | 0.0250 |

Efficacy boundary (t) | 0.103 |

Legend:

*(t)*: treatment effect scale

The `getPowerRates()` function (as well as `getSampleSizeRates()`) can also be called with a vector argument for the probability `pi1` in the intervention group. This is illustrated below via a plot of power depending on this probability. For examples of all available plots, see the R Markdown document How to create admirable plots with rpact.

```
# Example: Calculate power for simple design (with sample size 304 as above)
# for probabilities in intervention ranging from 0.3 to 0.5
powerResult <- getPowerRates(
  pi2 = 0.25, pi1 = seq(0.3, 0.5, by = 0.01),
  maxNumberOfSubjects = 304, sided = 1, alpha = 0.025
)
# one of several possible plots, this one plotting true effect size vs power
plot(powerResult, type = 7)
```

Sample size calculation proceeds in the same fashion as for superiority trials, except that the roles of the null and the alternative hypothesis are reversed. That is, the non-inferiority margin corresponds to the treatment effect under the null hypothesis (`thetaH0`) which one aims to reject. Testing in non-inferiority trials is always one-sided.

```
# Example: Sample size for a non-inferiority trial
# Assume pi(control) = pi(intervention) = 0.2
# Test H0: pi1 - pi2 = 0.1 (risk increase in intervention >= Delta = 0.1)
# vs. H1: pi1 - pi2 < 0.1
sampleSizeNoninf <- getSampleSizeRates(
  pi2 = 0.2, pi1 = 0.2,
  thetaH0 = 0.1, sided = 1, alpha = 0.025, beta = 0.2
)
kable(sampleSizeNoninf)
```

**Design plan parameters and output for rates**

**Design parameters**

*Critical values*: 1.960
*Significance level*: 0.0250
*Type II error rate*: 0.2000
*Test*: one-sided

**User defined parameters**

*Theta H0*: 0.1
*Assumed treatment rate*: 0.200

**Default parameters**

*Risk ratio*: FALSE
*Normal approximation*: TRUE
*Assumed control rate*: 0.200
*Treatment groups*: 2
*Planned allocation ratio*: 1

**Sample size and output**

*Direction upper*: FALSE
*Number of subjects fixed*: 508.4
*Number of subjects fixed (1)*: 254.2
*Number of subjects fixed (2)*: 254.2
*Critical values (treatment effect scale)*: 0.0285

**Legend**

*(i)*: values of treatment arm i

`kable(summary(sampleSizeNoninf))`

**Sample size calculation for a binary endpoint**

Fixed sample analysis, significance level 2.5% (one-sided). The results were calculated for a two-sample test for rates (normal approximation), H0: pi(1) - pi(2) = 0.1, H1: treatment rate pi(1) = 0.2, control rate pi(2) = 0.2, power 80%.

Stage | Fixed |
---|---|

Efficacy boundary (z-value scale) | 1.960 |

Number of subjects | 508.4 |

One-sided local significance level | 0.0250 |

Efficacy boundary (t) | 0.028 |

Legend:

*(t)*: treatment effect scale

The function `getSampleSizeRates()` allows setting the number of `groups` (which is 2 by default) to 1 for the design of single-arm trials. The probability under the null hypothesis can be specified with the argument `thetaH0`, and the specific alternative hypothesis which is used for the sample size calculation with the argument `pi1`. The sample size calculation can be based either on a normal approximation (`normalApproximation = TRUE`, the default) or on exact binomial probabilities (`normalApproximation = FALSE`).

```
# Example: Sample size for a single arm trial which tests
# H0: pi = 0.1 vs. H1: pi = 0.25
# (use conservative exact binomial calculation)
samplesSizeResults <- getSampleSizeRates(
  groups = 1, thetaH0 = 0.1, pi1 = 0.25,
  normalApproximation = FALSE, sided = 1, alpha = 0.025, beta = 0.2
)
kable(summary(samplesSizeResults))
```

**Sample size calculation for a binary endpoint**

Fixed sample analysis, significance level 2.5% (one-sided). The results were calculated for a one-sample test for rates (exact test), H0: pi = 0.1, H1: treatment rate pi = 0.25, power 80%.

Stage | Fixed |
---|---|

Efficacy boundary (z-value scale) | 1.960 |

Number of subjects | 53.0 |

One-sided local significance level | 0.0250 |

Efficacy boundary (t) | 0.181 |

Legend:

*(t)*: treatment effect scale

Sample size calculation for a group sequential trial is performed in **two steps**:

1. **Define the (abstract) group sequential design** using the function `getDesignGroupSequential()`. For details regarding this step, see the vignette Defining group sequential boundaries with rpact.
2. **Calculate the sample size** for the binary endpoint by feeding the abstract design into the function `getSampleSizeRates()`. Note that the power 1 - beta needs to be defined in the design function, not in `getSampleSizeRates()`.

In general, rpact supports both one-sided and two-sided group sequential designs. However, if futility boundaries are specified, only one-sided tests are permitted.

R code for a simple example is provided below:

```
# Example: Group-sequential design with O'Brien & Fleming type alpha-spending and
# one interim at 60% information
design <- getDesignGroupSequential(
  sided = 1, alpha = 0.025, beta = 0.2,
  informationRates = c(0.6, 1), typeOfDesign = "asOF"
)
# Sample size calculation assuming event probabilities are 25% in control
# (pi2 = 0.25) vs 40% (pi1 = 0.4) in intervention
sampleSizeResultGS <- getSampleSizeRates(design, pi2 = 0.25, pi1 = 0.4)
# Standard rpact output (sample size object only, not design object)
kable(sampleSizeResultGS)
```

**Design plan parameters and output for rates**

**Design parameters**

*Information rates*: 0.600, 1.000
*Critical values*: 2.669, 1.981
*Futility bounds (binding)*: -Inf
*Cumulative alpha spending*: 0.003808, 0.025000
*Local one-sided significance levels*: 0.003808, 0.023798
*Significance level*: 0.0250
*Type II error rate*: 0.2000
*Test*: one-sided

**User defined parameters**

*Assumed treatment rate*: 0.400
*Assumed control rate*: 0.250

**Default parameters**

*Risk ratio*: FALSE
*Theta H0*: 0
*Normal approximation*: TRUE
*Treatment groups*: 2
*Planned allocation ratio*: 1

**Sample size and output**

*Direction upper*: TRUE
*Maximum number of subjects*: 306.3
*Maximum number of subjects (1)*: 153.2
*Maximum number of subjects (2)*: 153.2
*Number of subjects [1]*: 183.8
*Number of subjects [2]*: 306.3
*Reject per stage [1]*: 0.3123
*Reject per stage [2]*: 0.4877
*Early stop*: 0.3123
*Expected number of subjects under H0*: 305.9
*Expected number of subjects under H0/H1*: 299.3
*Expected number of subjects under H1*: 268.1
*Critical values (treatment effect scale) [1]*: 0.187
*Critical values (treatment effect scale) [2]*: 0.104

**Legend**

*(i)*: values of treatment arm i
*[k]*: values at stage k

The `summary()` command produces the output:

`kable(summary(sampleSizeResultGS))`

**Sample size calculation for a binary endpoint**

Sequential analysis with a maximum of 2 looks (group sequential design), overall significance level 2.5% (one-sided). The results were calculated for a two-sample test for rates (normal approximation), H0: pi(1) - pi(2) = 0, H1: treatment rate pi(1) = 0.4, control rate pi(2) = 0.25, power 80%.

Stage | 1 | 2
---|---|---
Information rate | 60% | 100%
Efficacy boundary (z-value scale) | 2.669 | 1.981
Overall power | 0.3123 | 0.8000
Number of subjects | 183.8 | 306.3
Expected number of subjects under H1 | 268.1 |
Cumulative alpha spent | 0.0038 | 0.0250
One-sided local significance level | 0.0038 | 0.0238
Efficacy boundary (t) | 0.187 | 0.104
Exit probability for efficacy (under H0) | 0.0038 |
Exit probability for efficacy (under H1) | 0.3123 |

Legend:

*(t)*: treatment effect scale

System: rpact 3.5.1, R version 4.3.2 (2023-10-31 ucrt), platform: x86_64-w64-mingw32

To cite R in publications use:

R Core Team (2023). *R: A Language and Environment for Statistical Computing*. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/. To cite package ‘rpact’ in publications use:

Wassmer G, Pahlke F (2024). *rpact: Confirmatory Adaptive Clinical Trial Design and Analysis*. R package version 3.5.1, https://www.rpact.com, https://github.com/rpact-com/rpact, https://rpact-com.github.io/rpact/, https://www.rpact.org.

**First, load the rpact package**

```
library(rpact)
packageVersion("rpact")
```

`[1] '3.5.1'`

Suppose a trial should be conducted in 3 stages where 50% of the information should be observed at the first stage, 75% at the second stage, and 100% at the final stage. O’Brien & Fleming boundaries should be used together with one-sided, non-binding futility bounds of 0 and 0.5 for the first and second stage, respectively, on the z-value scale.

The endpoints are binary (failure rates) and should be compared in a parallel group design, i.e., the null hypothesis to be tested is H0: pi1 - pi2 = 0, which is tested against the one-sided alternative H1: pi1 - pi2 < 0.

The necessary sample size to achieve 90% power if the failure rates are assumed to be pi1 = 0.4 and pi2 = 0.6 can be obtained as follows:

```
dGS <- getDesignGroupSequential(
  informationRates = c(0.5, 0.75, 1), alpha = 0.025, beta = 0.1,
  futilityBounds = c(0, 0.5)
)
r <- getSampleSizeRates(dGS, pi1 = 0.4, pi2 = 0.6)
```

The `summary()` command creates a nice table for the study design parameters:

`kable(summary(r))`

**Sample size calculation for a binary endpoint**

Sequential analysis with a maximum of 3 looks (group sequential design), overall significance level 2.5% (one-sided). The results were calculated for a two-sample test for rates (normal approximation), H0: pi(1) - pi(2) = 0, H1: treatment rate pi(1) = 0.4, control rate pi(2) = 0.6, power 90%.

Stage | 1 | 2 | 3
---|---|---|---
Information rate | 50% | 75% | 100%
Efficacy boundary (z-value scale) | 2.863 | 2.337 | 2.024
Futility boundary (z-value scale) | 0 | 0.500 |
Overall power | 0.2958 | 0.6998 | 0.9000
Number of subjects | 133.1 | 199.7 | 266.3
Expected number of subjects under H1 | 198.3 | |
Cumulative alpha spent | 0.0021 | 0.0105 | 0.0250
One-sided local significance level | 0.0021 | 0.0097 | 0.0215
Efficacy boundary (t) | -0.248 | -0.165 | -0.124
Futility boundary (t) | 0.000 | -0.035 |
Overall exit probability (under H0) | 0.5021 | 0.2275 |
Overall exit probability (under H1) | 0.3058 | 0.4095 |
Exit probability for efficacy (under H0) | 0.0021 | 0.0083 |
Exit probability for efficacy (under H1) | 0.2958 | 0.4040 |
Exit probability for futility (under H0) | 0.5000 | 0.2191 |
Exit probability for futility (under H1) | 0.0100 | 0.0056 |

Legend:

*(t)*: treatment effect scale

Note that the calculation of the efficacy boundaries on the treatment effect scale is performed under the assumption that pi2 = 0.6 is the observed failure rate in the control group; the boundary then states the *treatment difference to be observed* in order to reach significance (or stop the trial due to futility).

The optimum allocation ratio yields the smallest overall sample size and depends on the choice of pi1 and pi2. It can be obtained by specifying `allocationRatioPlanned = 0`. In our case, because pi1 = 1 - pi2 (so the variances in the two groups are equal), the optimum allocation ratio is 1; it is calculated numerically and therefore comes out slightly unequal to 1:

```
r <- getSampleSizeRates(dGS, pi1 = 0.4, pi2 = 0.6, allocationRatioPlanned = 0)
r$allocationRatioPlanned
```

`[1] 0.9999976`

`round(r$allocationRatioPlanned, 5)`

`[1] 1`
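The numerical result can be related to the standard (Neyman type) allocation formula for comparing two rates under the normal approximation. As a rough sketch (an illustration under this approximation, not the exact rpact optimization), the total sample size is minimized at n1/n2 = sqrt(pi1(1 - pi1) / (pi2(1 - pi2))):

```r
# Sketch: approximate optimal allocation ratio n1/n2 for two rates.
# Equal variances pi * (1 - pi) in both groups give a ratio of exactly 1.
optimalAllocationRatio <- function(pi1, pi2) {
  sqrt(pi1 * (1 - pi1) / (pi2 * (1 - pi2)))
}
optimalAllocationRatio(0.4, 0.6)  # 0.4 * 0.6 equals 0.6 * 0.4, hence 1
```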

The decision boundaries can be illustrated on different scales.

On the z-value scale:

`plot(r, type = 1)`

On the effect size scale:

`plot(r, type = 2)`

On the p-value scale:

`plot(r, type = 3)`

Suppose that 280 subjects were planned for the study. The power if the failure rate in the active treatment group is pi1 = 0.4 or pi1 = 0.5 can be obtained as follows:

```
power <- getPowerRates(dGS,
  maxNumberOfSubjects = 280,
  pi1 = c(0.4, 0.5), pi2 = 0.6, directionUpper = FALSE
)
power$overallReject
```

`[1] 0.914045 0.377853`

Note that `directionUpper = FALSE` is used because the study is powered for alternatives where the treatment difference pi1 - pi2 is smaller than 0.

The power for pi1 = 0.5 (37.8%) is much lower than in the case pi1 = 0.4 (where it exceeds 90%).

We can also graphically illustrate the power, the expected sample size, and the early stopping and futility stopping probabilities for a range of alternative values. This can be done by specifying the lower and the upper bound for pi1 in `getPowerRates()` and using the generic `plot()` command with `type = 6`:

```
power <- getPowerRates(dGS,
  maxNumberOfSubjects = 280,
  pi1 = c(0.3, 0.6), pi2 = 0.6, directionUpper = FALSE
)
plot(power, type = 6)
```

Suppose that, using an adaptive design, the sample size from the above example can be increased *in the last interim* up to a 4-fold of the originally planned sample size for the last stage. Conditional power 90% *based on the observed effect sizes (failure rates)* should be used to increase the sample size. We want to use the inverse normal method to allow for the sample size increase and compare the test characteristics with the group sequential design from the above example.

To assess the test characteristics of this adaptive design we first define the inverse normal design and then perform two simulations, one without and one with SSR:

```
dIN <- getDesignInverseNormal(
  informationRates = c(0.5, 0.75, 1),
  alpha = 0.025, beta = 0.1, futilityBounds = c(0, 0.5)
)
sim1 <- getSimulationRates(dIN,
  plannedSubjects = c(140, 210, 280),
  pi1 = seq(0.4, 0.5, 0.01), pi2 = 0.6, directionUpper = FALSE,
  maxNumberOfIterations = 1000, conditionalPower = 0.9,
  minNumberOfSubjectsPerStage = c(140, 70, 70),
  maxNumberOfSubjectsPerStage = c(140, 70, 70), seed = 1234
)
sim2 <- getSimulationRates(dIN,
  plannedSubjects = c(140, 210, 280),
  pi1 = seq(0.4, 0.5, 0.01), pi2 = 0.6, directionUpper = FALSE,
  maxNumberOfIterations = 1000, conditionalPower = 0.9,
  minNumberOfSubjectsPerStage = c(NA, 70, 70),
  maxNumberOfSubjectsPerStage = c(NA, 70, 4 * 70), seed = 1234
)
```

Note that the sample sizes will be calculated under the assumption that the *conditional power for the subsequent stage* is 90%. If the resulting sample size is larger, the upper bound (4*70 = 280) is used.
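The clipping applied to the recalculated stage size can be sketched in plain R (a hypothetical helper for illustration; rpact performs this internally via `minNumberOfSubjectsPerStage` and `maxNumberOfSubjectsPerStage`):

```r
# Hypothetical sketch: whatever stage-3 size the 90% conditional power
# calculation suggests is clipped to the prespecified range [70, 4 * 70].
clipStageSize <- function(nRecalculated, nMin = 70, nMax = 4 * 70) {
  min(max(nRecalculated, nMin), nMax)
}
clipStageSize(350)  # capped at the upper bound 280
clipStageSize(50)   # raised to the lower bound 70
```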

Note also that `sim1` can alternatively be *calculated* with `getPowerRates()`, or *simulated more easily* without specifying `conditionalPower`, `minNumberOfSubjectsPerStage`, and `maxNumberOfSubjectsPerStage` (which are obviously redundant for `sim1`). Specifying them anyway ensures that the calculated objects `sim1` and `sim2` *contain exactly the same parameters* and can therefore be combined more easily (see below).

We can look at the power and the expected sample size of the two procedures and assess the power gain of using the adaptive design which comes along with an increased expected sample size:

`sim1$pi1`

` [1] 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50`

`round(sim1$overallReject, 3)`

` [1] 0.921 0.890 0.853 0.810 0.752 0.721 0.675 0.582 0.526 0.469 0.405`

`round(sim2$overallReject, 3)`

` [1] 0.976 0.971 0.940 0.912 0.882 0.869 0.800 0.718 0.695 0.601 0.475`

`round(sim1$expectedNumberOfSubjects, 1)`

` [1] 202.1 209.3 216.4 219.5 222.0 230.9 231.1 238.3 234.2 237.9 236.5`

`round(sim2$expectedNumberOfSubjects, 1)`

` [1] 240.7 251.6 270.6 278.8 286.6 305.8 323.8 330.9 336.4 336.8 349.3`

We now want to graphically illustrate the gain in power when using the adaptive sample size recalculation. We use ggplot2 (see ggplot2.tidyverse.org) for this. First, a dataset `df` combining `sim1` and `sim2` is defined with the additional variable `SSR`. Defining `myTheme` and using the following ggplot2 commands, the difference in power and ASN of the two strategies is illustrated. It shows that, at least for an (absolute) effect difference > 0.15, an overall power of around 85% or more can be achieved with the proposed sample size recalculation strategy.

```
library(ggplot2)
dataSim1 <- as.data.frame(sim1, niceColumnNamesEnabled = FALSE)
dataSim2 <- as.data.frame(sim2, niceColumnNamesEnabled = FALSE)
dataSim1$SSR <- rep("no SSR", nrow(dataSim1))
dataSim2$SSR <- rep("SSR", nrow(dataSim2))
df <- rbind(dataSim1, dataSim2)
myTheme <- theme(
  axis.title.x = element_text(size = 12), axis.text.x = element_text(size = 12),
  axis.title.y = element_text(size = 12), axis.text.y = element_text(size = 12),
  plot.title = element_text(size = 14, hjust = 0.5),
  plot.subtitle = element_text(size = 12, hjust = 0.5)
)
p <- ggplot(
  data = df,
  aes(x = effect, y = overallReject, group = SSR, color = SSR)
) +
  geom_line(size = 1.1) +
  geom_line(aes(
    x = effect, y = expectedNumberOfSubjects / 400,
    group = SSR, color = SSR
  ), size = 1.1, linetype = "dashed") +
  scale_y_continuous("Power",
    sec.axis = sec_axis(~ . * 400, name = "ASN"),
    limits = c(0.2, 1)
  ) +
  xlab("effect") +
  ggtitle("Power and ASN", "Power solid, ASN dashed") +
  geom_hline(size = 0.5, yintercept = 0.8, linetype = "dotted") +
  geom_hline(size = 0.5, yintercept = 0.9, linetype = "dotted") +
  geom_vline(size = 0.5, xintercept = c(-0.2, -0.15), linetype = "dashed") +
  theme_classic() +
  myTheme
plot(p)
```

For saving the graph, use:

```
ggplot2::ggsave(
  filename = "c:/yourdirectory/comparison.png",
  plot = ggplot2::last_plot(), device = NULL, path = NULL,
  scale = 1.2, width = 20, height = 12, units = "cm", dpi = 600,
  limitsize = TRUE
)
```

For another example of using ggplot2 in rpact see also the vignette Supplementing and enhancing rpact’s graphical capabilities with ggplot2.

Finally, we create a histogram for the attained sample size of the study *when using the adaptive sample size recalculation*.

With the `getData()` command the simulation results are obtained, and `str(simData)` provides information about the structure of these data:

```
simData <- getData(sim2)
str(simData)
```

```
'data.frame': 24579 obs. of 19 variables:
$ iterationNumber : num 1 2 2 2 3 3 4 4 4 5 ...
$ stageNumber : num 1 1 2 3 1 2 1 2 3 1 ...
$ pi1 : num 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 ...
$ pi2 : num 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 ...
$ numberOfSubjects : num 140 140 70 147 140 70 140 70 91 140 ...
$ numberOfCumulatedSubjects: num 140 140 210 357 140 210 140 210 301 140 ...
$ rejectPerStage : num 1 0 0 1 0 1 0 0 1 1 ...
$ futilityPerStage : num 0 0 0 0 0 0 0 0 0 0 ...
$ testStatistic : num 3.05 2.03 2.07 4.07 2.03 ...
$ testStatisticsPerStage : num 3.054 2.028 0.718 4.547 2.029 ...
$ overallRate1 : num 0.329 0.414 0.438 0.369 0.429 ...
$ overallRate2 : num 0.586 0.586 0.581 0.607 0.6 ...
$ stagewiseRates1 : num 0.329 0.414 0.486 0.27 0.429 ...
$ stagewiseRates2 : num 0.586 0.586 0.571 0.644 0.6 ...
$ sampleSizesPerStage1 : num 70 70 35 74 70 35 70 35 46 70 ...
$ sampleSizesPerStage2 : num 70 70 35 73 70 35 70 35 45 70 ...
$ trialStop : logi TRUE FALSE FALSE TRUE FALSE TRUE ...
$ conditionalPowerAchieved : num NA NA 0.602 0.9 NA ...
$ pValue : num 0.00112984 0.02126124 0.23628281 0.00000272 0.02121903 ...
```

Depending on pi1 (in this example, for pi1 = 0.5), you can create the histogram of the simulated total sample size as follows:

```
simDataPart <- simData[simData$pi1 == 0.5, ]
overallSampleSizes <- sapply(1:1000, function(i) {
  sum(simDataPart[simDataPart$iterationNumber == i, ]$numberOfSubjects)
})
hist(overallSampleSizes, main = "Histogram", xlab = "Achieved sample size")
```

How often the maximum and other sample sizes are reached over the stages can be obtained as follows:

```
subjectsRange <- cut(simDataPart$numberOfSubjects,
  c(69, 70, 139, 140, 210, 279, 280),
  labels = c(
    "(69,70]", "(70,139]", "(139,140]",
    "(140,210]", "(210,279]", "(279,280]"
  )
)
kable(round(prop.table(table(simDataPart$stageNumber, subjectsRange), margin = 1) * 100, 1))
```

Stage | (69,70] | (70,139] | (139,140] | (140,210] | (210,279] | (279,280]
---|---|---|---|---|---|---
1 | 0.0 | 0.0 | 100.0 | 0.0 | 0.0 | 0.0
2 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
3 | 0.0 | 9.1 | 0.3 | 7.9 | 7.1 | 75.5

For this simulation, the originally planned sample size (70) was never selected for the third stage, and in most cases the maximum sample size (280) was used.

Gernot Wassmer and Werner Brannath, *Group Sequential and Confirmatory Adaptive Designs in Clinical Trials*, Springer 2016, ISBN 978-3319325606

RStudio, *Data Visualization with ggplot2 - Cheat Sheet*, version 2.1, 2016, https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf


In rpact, **sample size calculation for a group sequential trial proceeds by following the same two steps regardless of whether the endpoint is a continuous, binary, or a time-to-event endpoint**:

1. **Define the (abstract) group sequential boundaries** of the design using the function `getDesignGroupSequential()`.
2. **Calculate the sample size for the endpoint of interest** by feeding the abstract boundaries from step 1 into the specific function for that endpoint: `getSampleSizeMeans()` (for continuous endpoints), `getSampleSizeRates()` (for binary endpoints), or `getSampleSizeSurvival()` (for survival endpoints).

The mathematical rationale for this two-step approach is that all group sequential trials, regardless of the chosen endpoint type, rely on the fact that the z-scores at different interim stages follow the same “canonical joint multivariate distribution” (at least asymptotically).
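Concretely, the canonical joint distribution implies that the correlation between the z-scores at information fractions t_k <= t_l equals sqrt(t_k / t_l), whatever the endpoint. A minimal base-R sketch (illustration only, not rpact code):

```r
# Sketch: correlation of stage-wise z-scores under the canonical joint
# distribution, e.g., for an interim at 50% information vs. the final analysis.
zScoreCorrelation <- function(tk, tl) sqrt(tk / tl)
zScoreCorrelation(0.5, 1)  # sqrt(0.5), approximately 0.707
```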

This document covers the more abstract first step. **Step 2 is not covered in this document; it is covered in the separate endpoint-specific R Markdown files for continuous, binary, and time-to-event endpoints.** Of note, step 1 can be omitted for trials without interim analyses.

These examples are not intended to replace the official rpact documentation and help pages but rather to supplement them.

In general, rpact supports both one-sided and two-sided group sequential designs. If futility boundaries are specified, however, only one-sided tests are permitted. **For simplicity, it is often preferred to use one-sided tests for group sequential designs** (typically with `alpha = 0.025`).

**First, load the rpact package**

```
library(rpact)
packageVersion("rpact")
```

`[1] '3.5.1'`

**Example:**

- Interim analyses at information fractions 33%, 67%, and 100% (`informationRates = c(0.33, 0.67, 1)`). [Note: For equally spaced interim analyses, one can also specify the maximum number of stages (`kMax`, including the final analysis) instead of the `informationRates`.]
- Lan & DeMets alpha-spending approximation to the O’Brien & Fleming boundaries (`typeOfDesign = "asOF"`). Alpha-spending approaches allow for flexible timing of interim analyses and a corresponding adjustment of the boundaries.

```
design <- getDesignGroupSequential(
  sided = 1, alpha = 0.025,
  informationRates = c(0.33, 0.67, 1), typeOfDesign = "asOF"
)
```
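For intuition, the O’Brien & Fleming type alpha-spending function used here can be evaluated directly in base R. This sketch implements the standard Lan & DeMets formula alpha(t) = 2 * (1 - Phi(Phi^(-1)(1 - alpha/2) / sqrt(t))) for a one-sided level alpha (an illustration, not rpact internals):

```r
# Sketch: cumulative alpha spent at information fraction t under the
# O'Brien & Fleming type alpha-spending function (one-sided alpha).
asOFSpending <- function(t, alpha = 0.025) {
  2 * (1 - pnorm(qnorm(1 - alpha / 2) / sqrt(t)))
}
round(asOFSpending(c(0.33, 0.67, 1)), 8)  # spends the full alpha at t = 1
```

The values at t = 0.33 and t = 0.67 match the cumulative alpha spending reported by rpact for this design later in this document.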

The originally published O’Brien & Fleming boundaries are obtained via `typeOfDesign = "OF"`, which is also the default (therefore, if you do not specify `typeOfDesign`, this type is selected). Note that strict Type I error control is only guaranteed for standard boundaries without alpha-spending if the pre-defined interim schedule (i.e., the information fractions at which interim analyses are conducted) is exactly adhered to.

```
design <- getDesignGroupSequential(
  sided = 1, alpha = 0.025,
  informationRates = c(0.33, 0.67, 1), typeOfDesign = "OF"
)
```

Pocock boundaries (`typeOfDesign = "P"` for constant boundaries over the stages, `typeOfDesign = "asP"` for the corresponding alpha-spending version) or Haybittle & Peto boundaries (`typeOfDesign = "HP"`; reject at an interim only if the z-value exceeds 3) are obtained with, for example,

```
design <- getDesignGroupSequential(
  sided = 1, alpha = 0.025,
  informationRates = c(0.33, 0.67, 1), typeOfDesign = "P"
)
```

- Kim & DeMets alpha-spending (`typeOfDesign = "asKD"`) with parameter `gammaA` (power function: `gammaA = 1` is linear spending, `gammaA = 2` quadratic)
- Hwang, Shih & DeCani alpha-spending (`typeOfDesign = "asHSD"`) with parameter `gammaA` (for details, see Wassmer & Brannath 2016, p. 76)
- Wang & Tsiatis Delta class boundaries (`typeOfDesign = "WT"`, with parameter `deltaWT`) and the optimum Wang & Tsiatis design (`typeOfDesign = "WToptimum"`)

```
# Quadratic Kim & DeMets alpha-spending
design <- getDesignGroupSequential(
  sided = 1, alpha = 0.025,
  informationRates = c(0.33, 0.67, 1), typeOfDesign = "asKD", gammaA = 2
)
```

User-defined alpha-spending functions (`typeOfDesign = "asUser"`) can be obtained via the argument `userAlphaSpending`, which must contain a numeric vector whose elements define the values of the cumulative alpha-spending function at each interim analysis.

```
# Example: User-defined alpha-spending function which is very conservative at
# first interim (spend alpha = 0.001), conservative at second (spend an additional
# alpha = 0.01, i.e., total cumulative alpha spent is 0.011 up to second interim),
# and spends the remaining alpha at the final analysis (i.e., cumulative
# alpha = 0.025)
design <- getDesignGroupSequential(
  sided = 1, alpha = 0.025,
  informationRates = c(0.33, 0.67, 1),
  typeOfDesign = "asUser",
  userAlphaSpending = c(0.001, 0.01 + 0.001, 0.025)
)
# $stageLevels below extracts local significance levels across interim analyses.
# Note that the local significance level is exactly 0.001 at the first
# interim, but slightly >0.01 at the second interim because the design
# exploits correlations between interim analyses.
design$stageLevels
```

`[1] 0.00100000 0.01052883 0.02004781`

- The argument `futilityBounds` contains a vector of futility bounds (on the z-value scale) for each interim (but not the final analysis).
- A futility bound of 0 corresponds to an estimated treatment effect of zero or "null", i.e., in this case futility stopping is recommended if the treatment effect estimate at the interim analysis is zero or "goes in the wrong direction". Futility bounds of -Inf correspond to no futility stopping at an interim.
- Due to the design of rpact, it is not possible to directly define futility boundaries on the treatment effect scale. If this is desired, one would need to manually convert the treatment effect scale to the z-scale or, alternatively, experiment by varying the boundaries on the z-scale until this implies the targeted critical values on the treatment effect scale. (Critical values on the treatment effect scale are routinely provided by the sample size functions for the different endpoint types, i.e., `getSampleSizeMeans()` for continuous endpoints, `getSampleSizeRates()` for binary endpoints, and `getSampleSizeSurvival()` for survival endpoints. Please see the R Markdown files for these endpoint types for further details.)
- By default, all futility boundaries are non-binding (`bindingFutility = FALSE`). Binding futility boundaries (`bindingFutility = TRUE`) are not recommended, although they are provided for the sake of completeness.

```
# Example: non-binding futility boundary at each interim in case
# estimated treatment effect is null or goes in "the wrong direction"
design <- getDesignGroupSequential(
  sided = 1, alpha = 0.025,
  informationRates = c(0.33, 0.67, 1), typeOfDesign = "asOF",
  futilityBounds = c(0, 0), bindingFutility = FALSE
)
)
```

Formal beta-spending functions are defined in the same way as alpha-spending functions; e.g., a Pocock type beta-spending function can be specified as `typeBetaSpending = "bsP"`. In addition, `beta` needs to be specified; the default is `beta = 0.20`.

```
# Example: beta-spending function approach with O'Brien & Fleming alpha-spending
# function and Pocock beta-spending function
design <- getDesignGroupSequential(
  sided = 1, alpha = 0.025, beta = 0.1,
  typeOfDesign = "asOF",
  typeBetaSpending = "bsP"
)
```

Another way to formally derive futility bounds is the Pampallona & Tsiatis approach. It is selected by defining `typeOfDesign = "PT"` and specifying two parameters, `deltaPT1` (shape of the decision regions for rejecting the null) and `deltaPT0` (shape of the shifted decision regions for rejecting the alternative), for example:

```
# Example: beta-spending function approach with O'Brien & Fleming boundaries for
# rejecting the null and Pocock boundaries for rejecting H1
design <- getDesignGroupSequential(
  sided = 1, alpha = 0.025, beta = 0.1,
  typeOfDesign = "PT",
  deltaPT1 = 0, deltaPT0 = 0.5
)
```

Note that both the beta-spending and the Pampallona & Tsiatis approach can be selected to be one-sided or two-sided, and the bounds for rejecting the alternative can be chosen to be binding (`bindingFutility = TRUE`) or non-binding (`bindingFutility = FALSE`).

Such designs can be implemented by using a user-defined alpha-spending function which spends all of the Type I error at the final analysis. Note that such designs do not allow stopping for efficacy, regardless of how persuasive the effect is.

```
# Example: non-binding futility boundary using an O'Brien & Fleming type
# beta spending function. No early stopping for efficacy (i.e., all alpha
# is spent at the final analysis).
design <- getDesignGroupSequential(
  sided = 1, alpha = 0.025, beta = 0.2,
  informationRates = c(0.33, 0.67, 1), typeOfDesign = "asUser",
  userAlphaSpending = c(0, 0, 0.025), typeBetaSpending = "bsOF",
  bindingFutility = FALSE
)
```

`Changed type of design to 'noEarlyEfficacy'`

As indicated, you can specify `typeOfDesign = "noEarlyEfficacy"`, which is a shortcut for `typeOfDesign = "asUser"` with `userAlphaSpending = c(0, 0, 0.025)`.

We use the design with an O’Brien & Fleming alpha-spending function and prespecified futility bounds:

```
design <- getDesignGroupSequential(
  sided = 1, alpha = 0.025, beta = 0.2,
  informationRates = c(0.33, 0.67, 1), typeOfDesign = "asOF",
  futilityBounds = c(0, 0), bindingFutility = FALSE
)
)
```

The `design` object can be displayed with `kable(design)`:

**Design parameters and output of group sequential design**

**User defined parameters**

*Type of design*: O’Brien & Fleming type alpha spending
*Information rates*: 0.330, 0.670, 1.000
*futilityBoundsNonBinding*: 0.000, 0.000

**Derived from user defined parameters**

*Maximum number of stages*: 3

**Default parameters**

*Stages*: 1, 2, 3
*Significance level*: 0.0250
*Type II error rate*: 0.2000
*Two-sided power*: FALSE
*Binding futility*: FALSE
*Test*: one-sided
*Tolerance*: 0.00000001
*Type of beta spending*: none

**Output**

*Cumulative alpha spending*: 0.00009549, 0.00617560, 0.02500000
*Critical values*: 3.731, 2.504, 1.994
*Stage levels (one-sided)*: 0.00009549, 0.00614213, 0.02309189

The key information is contained in the object, including the **critical values on the z-scale** (“Critical values” in the rpact output, `design$criticalValues`) and the **one-sided local significance levels** (“Stage levels” in the rpact output, `design$stageLevels`). Note that the local significance levels are always given as one-sided levels in rpact, even if a two-sided design is specified.
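The two quantities are linked by the one-sided normal tail probability: each stage level is 1 - pnorm of the corresponding critical value. A quick base-R check using the (rounded) critical values printed above:

```r
# Sketch: recover the one-sided stage levels from the critical values
# reported in the design output above.
criticalValues <- c(3.731, 2.504, 1.994)
round(1 - pnorm(criticalValues), 5)  # compare with the stage levels above
```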

`names(design)` provides the names of all objects included in the `design` object, and `as.data.frame(design)` collects all design information into one data frame. `summary(design)` gives a slightly more detailed output. For more details about applying R generics to rpact objects, please refer to the separate R Markdown file How to use R generics with rpact.

`names(design)`

```
[1] "kMax" "alpha" "stages"
[4] "informationRates" "userAlphaSpending" "criticalValues"
[7] "stageLevels" "alphaSpent" "bindingFutility"
[10] "tolerance" "typeOfDesign" "beta"
[13] "deltaWT" "deltaPT1" "deltaPT0"
[16] "futilityBounds" "gammaA" "gammaB"
[19] "optimizationCriterion" "sided" "betaSpent"
[22] "typeBetaSpending" "userBetaSpending" "power"
[25] "twoSidedPower" "constantBoundsHP" "betaAdjustment"
[28] "delayedInformation" "decisionCriticalValues" "reversalProbabilities"
```

`summary()` creates a nice presentation of the design that also contains information about the sample size of the design (see below):

`kable(summary(design))`

**Sequential analysis with a maximum of 3 looks (group sequential design)**

O’Brien & Fleming type alpha spending design, non-binding futility, one-sided overall significance level 2.5%, power 80%, undefined endpoint, inflation factor 1.0605, ASN H1 0.8628, ASN H01 0.8689, ASN H0 0.6589.

Stage | 1 | 2 | 3
---|---|---|---
Information rate | 33% | 67% | 100%
Efficacy boundary (z-value scale) | 3.731 | 2.504 | 1.994
Stage levels (one-sided) | <0.0001 | 0.0061 | 0.0231
Futility boundary (z-value scale) | 0 | 0 |
Cumulative alpha spent | <0.0001 | 0.0062 | 0.0250
Overall power | 0.0191 | 0.4430 | 0.8000
Futility probabilities under H1 | 0.049 | 0.003 |

`getDesignCharacteristics(design)` provides more detailed information about the design:

```
designChar <- getDesignCharacteristics(design)
kable(designChar)
```

**Group sequential design characteristics**

*Number of subjects fixed*: 7.8489
*Shift*: 8.3241
*Inflation factor*: 1.0605
*Informations*: 2.747, 5.577, 8.324
*Power*: 0.01907, 0.44296, 0.80000
*Rejection probabilities under H1*: 0.01907, 0.42389, 0.35704
*Futility probabilities under H1*: 0.048720, 0.003437
*Ratio expected vs fixed sample size under H1*: 0.8628
*Ratio expected vs fixed sample size under a value between H0 and H1*: 0.8689
*Ratio expected vs fixed sample size under H0*: 0.6589

`names(designChar)`

```
[1] "nFixed" "shift" "inflationFactor"
[4] "stages" "information" "power"
[7] "rejectionProbabilities" "futilityProbabilities" "averageSampleNumber1"
[10] "averageSampleNumber01" "averageSampleNumber0"
```

**Note that the design characteristics depend on beta that needs to be specified in getDesignGroupSequential(). By default, beta = 0.20.**

Explanations regarding the output:

- **Maximum sample size inflation factor** (`$inflationFactor`): the maximal sample size a group sequential trial requires relative to the sample size of a fixed design without interim analyses.
- Probabilities of stopping due to a significant result at each interim or the final analysis (`$rejectionProbabilities`), cumulative power (`$power`), and probability of stopping for futility at each interim (`$futilityProbabilities`). All of these are calculated under the alternative H1.
- **Expected sample size** of the group sequential design (relative to the fixed design) under the alternative hypothesis H1 (`$averageSampleNumber1`), under the null hypothesis H0 (`$averageSampleNumber0`), and under the parameter in the middle between H0 and H1 (`$averageSampleNumber01`).
- In addition, `getDesignCharacteristics(design)` provides the required sample size for an abstract group sequential single arm trial with a normal outcome, effect size 1, and standard deviation 1 (i.e., the simplest group sequential setting from a mathematical point of view). The sample size for such a trial without interim analyses is given as `$nFixed` and the maximum sample size of the corresponding group sequential design as `$shift`.
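These abstract quantities fit together arithmetically; for instance, the inflation factor is the ratio of the maximum group sequential sample size (`$shift`) to the fixed sample size (`$nFixed`). A quick check using the values reported above:

```r
# Sketch: reproduce the reported inflation factor from the abstract
# fixed sample size and the shift (maximum sample size).
nFixed <- 7.8489
shift <- 8.3241
round(shift / nFixed, 4)  # matches the reported inflation factor 1.0605
```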

The practical relevance of this abstract design is that the **properties of the design** (critical values, sample size inflation factor, rejection probabilities, etc.) **carry over to group sequential designs regardless of the endpoint (e.g., continuous, binary, or survival)**, as they all share the same underlying canonical multivariate normal distribution of the z-scores.

**Overall stopping probabilities, rejection probabilities, and futility probabilities under the null (H0) and the alternative (H1)** (overall and at each stage) can be calculated using the function `getPowerAndAverageSampleNumber()`. To get these numbers, one needs to provide the maximum sample size and the effect size (0 under H0, 1 under H1) of the corresponding type of design.

```
# theta = 0 for calculations under H0
kable(getPowerAndAverageSampleNumber(design,
theta = c(0), nMax = designChar$shift
))
```

**Power and average sample size (ASN)**

**User defined parameters**

*N_max*: 8.3241
*Effect*: 0

**Output**

*Average sample sizes (ASN)*: 5.172
*Power*: 0.02377
*Early stop*: 0.6323
*Early stop [1]*: 0.5001
*Early stop [2]*: 0.1322
*Early stop [3]*: NA
*Overall reject*: 0.02377
*Reject per stage [1]*: 0.00009549
*Reject per stage [2]*: 0.00605889
*Reject per stage [3]*: 0.01761940
*Overall futility*: 0.6262
*Futility stop per stage [1]*: 0.5000
*Futility stop per stage [2]*: 0.1262

**Legend**

*[k]*: values at stage k

```
# theta = 1 for calculations under alternative H1
kable(getPowerAndAverageSampleNumber(design,
theta = 1, nMax = designChar$shift
))
```

**Power and average sample size (ASN)**

**User defined parameters**

*N_max*: 8.3241
*Effect*: 1

**Output**

*Average sample sizes (ASN)*: 6.772
*Power*: 0.8000
*Early stop*: 0.4951
*Early stop [1]*: 0.06779
*Early stop [2]*: 0.42733
*Early stop [3]*: NA
*Overall reject*: 0.8000
*Reject per stage [1]*: 0.01907
*Reject per stage [2]*: 0.42389
*Reject per stage [3]*: 0.35704
*Overall futility*: 0.05216
*Futility stop per stage [1]*: 0.048720
*Futility stop per stage [2]*: 0.003437

**Legend**

*[k]*: values at stage k

Note that the power under H0, i.e., the significance level, is slightly below 0.025 in this example as it is calculated under the assumption that the non-binding futility boundaries are adhered to.

Both (and even more) values can be obtained with one command: `getPowerAndAverageSampleNumber(design, theta = c(0, 1), nMax = designChar$shift)`.

We again use the design with an O’Brien & Fleming α-spending function and prespecified futility bounds:

```
design <- getDesignGroupSequential(
  sided = 1, alpha = 0.025, beta = 0.2,
  informationRates = c(0.33, 0.67, 1), typeOfDesign = "asOF",
  futilityBounds = c(0, 0), bindingFutility = FALSE
)
```

Boundaries can be plotted using the `plot` (or `plot.TrialDesign`) function, which produces a ggplot2 object.

The most relevant plots for (abstract) boundaries without an easily interpretable treatment effect are boundary plots on the z-value scale (`type = 1`) or the p-value scale (`type = 3`), as well as plots of the α-spending function (`type = 4`). Conveniently, the argument `showSource = TRUE` also provides the source data for the plot. For examples of all available plots, see the R Markdown document “How to create admirable plots with rpact”.

`plot(design, type = 1, showSource = TRUE)`

```
Source data of the plot (type 1):
x-axis: design$informationRates
y-axes:
y1: c(design$futilityBounds, design$criticalValues[length(design$criticalValues)])
y2: design$criticalValues
Simple plot command examples:
plot(design$informationRates, c(design$futilityBounds, design$criticalValues[length(design$criticalValues)]), type = "l")
plot(design$informationRates, design$criticalValues, type = "l")
```

`plot(design, type = 3, showSource = TRUE)`

```
Source data of the plot (type 3):
x-axis: design$informationRates
y-axis: design$stageLevels
Simple plot command example:
plot(design$informationRates, design$stageLevels, type = "l")
```

`plot(design, type = 4, showSource = TRUE)`

```
Source data of the plot (type 4):
x-axis: design$informationRates
y-axis: design$alphaSpent
Simple plot command example:
plot(design$informationRates, design$alphaSpent, type = "l")
```

Decision regions for two-sided tests with futility bounds are displayed accordingly:

```
design <- getDesignGroupSequential(
  sided = 2, alpha = 0.05, beta = 0.2,
  informationRates = c(0.33, 0.67, 1),
  typeOfDesign = "asOF",
  typeBetaSpending = "bsP",
  bindingFutility = FALSE
)
plot(design, type = 1)
```

Multiple designs can be combined into a design set (`getDesignSet()`) and their properties plotted jointly:

```
# O'Brien & Fleming, 3 equally spaced stages
d1 <- getDesignGroupSequential(typeOfDesign = "OF", kMax = 3)
# Pocock
d2 <- getDesignGroupSequential(typeOfDesign = "P", kMax = 3)
designSet <- getDesignSet(designs = c(d1, d2), variedParameters = "typeOfDesign")
plot(designSet, type = 1)
```

Even simpler, in rpact 3.0 and later, you can also use `plot(d1, d2)`.

System: rpact 3.5.1, R version 4.3.2 (2023-10-31 ucrt), platform: x86_64-w64-mingw32

To cite R in publications use:

R Core Team (2023). *R: A Language and Environment for Statistical Computing*. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

To cite package ‘rpact’ in publications use:

Wassmer G, Pahlke F (2024). *rpact: Confirmatory Adaptive Clinical Trial Design and Analysis*. R package version 3.5.1, https://www.rpact.com, https://github.com/rpact-com/rpact, https://rpact-com.github.io/rpact/, https://www.rpact.org.

Group-sequential designs based on α-spending functions protect the Type I error exactly even if the pre-planned interim schedule is not adhered to exactly. However, this requires re-calculation of the group sequential boundaries at each interim analysis based on the actually observed information fractions. Unless deviations from the planned information fractions are substantial, the re-calculated boundaries are quite similar to the pre-planned boundaries, and the re-calculation will affect the actual test decision only on rare occasions.

Importantly, the timing of future interim analyses must not be “motivated” by results from earlier interim analyses, as this could inflate the Type I error rate. Deviations from the planned information fractions should thus only occur for operational reasons (it is difficult to hit the planned number of events exactly in a real trial) or due to external evidence.

The general principles for these boundary re-calculations are as follows (see also Wassmer & Brannath, 2016, p. 78f):

- Updates at interim analyses prior to the final analysis:
  - Information fractions are updated according to the actually observed information fraction at the interim analysis relative to the **planned** maximum information.
  - The planned α-spending function is then applied to these updated information fractions.
- Updates at the final analysis in case the observed information at the final analysis is larger (“over-running”) or smaller (“under-running”) than the planned maximum information:
  - Information fractions are updated according to the actually observed information fraction at all interim analyses relative to the **observed** maximum information. The information fraction at the final analysis is re-set to 1, but the information fractions for earlier interim analyses also change.
  - The originally planned α-spending function cannot be applied to these updated information fractions because this would modify the critical boundaries of earlier interim analyses, which is clearly not allowed. Instead, one uses the α that has actually been spent at earlier interim analyses and spends all remaining α at the final analysis.
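The two update rules can be sketched in a few lines of base R (the event numbers are the ones used in the example trial below; this only illustrates how the information fractions are recomputed, not the boundary calculation itself):

```r
plannedMaxInfo <- 387          # planned maximum number of events
interimEvents <- c(205, 285)   # events actually observed at the two interims

# Rule 1 (updates before the final analysis): observed information at the
# interims relative to the *planned* maximum information
fractionsInterim <- c(interimEvents / plannedMaxInfo, 1)

# Rule 2 (over- or under-running at the final analysis): all fractions are
# recomputed relative to the *observed* maximum information, so the final
# fraction is re-set to 1 and the earlier fractions change as well
observedMaxInfo <- 393
fractionsFinal <- c(interimEvents, observedMaxInfo) / observedMaxInfo
```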

This general principle can be implemented via a user-defined α-spending function and is illustrated for an example trial with a survival endpoint below. We provide two solutions to the problem: the first shows how existing tools in rpact can be used directly to solve the problem; the second is an automatic recalculation of the boundaries using a new parameter set (`maxInformation` and `informationEpsilon`) which has been available in the `getAnalysisResults()` function since rpact version 3.1. This solution is described in Section @ref(sec:automatic) at the end of this document.

**First, load the rpact package**

```
library(rpact)
packageVersion("rpact") # version should be version 3.1 or later
```

`[1] '3.5.1'`

The original trial design for this example is based on a standard O’Brien & Fleming type α-spending function with planned efficacy interim analyses after 50% and 75% of the information, as specified below.

```
# Initial design
design <- getDesignGroupSequential(
  sided = 1, alpha = 0.025, beta = 0.2,
  informationRates = c(0.5, 0.75, 1), typeOfDesign = "asOF"
)
# Initial sample size calculation
sampleSizeResult <- getSampleSizeSurvival(
  design = design,
  lambda2 = log(2) / 60, hazardRatio = 0.75,
  dropoutRate1 = 0.025, dropoutRate2 = 0.025, dropoutTime = 12,
  accrualTime = 0, accrualIntensity = 30,
  maxNumberOfSubjects = 1000
)
# Summarize design
kable(summary(sampleSizeResult))
```

**Sample size calculation for a survival endpoint**

Sequential analysis with a maximum of 3 looks (group sequential design), overall significance level 2.5% (one-sided). The results were calculated for a two-sample logrank test, H0: hazard ratio = 1, H1: hazard ratio = 0.75, control lambda(2) = 0.012, maximum number of subjects = 1000, accrual time = 33.333, accrual intensity = 30, dropout rate(1) = 0.025, dropout rate(2) = 0.025, dropout time = 12, power 80%.

Stage | 1 | 2 | 3 |
---|---|---|---|
Information rate | 50% | 75% | 100% |
Efficacy boundary (z-value scale) | 2.963 | 2.359 | 2.014 |
Overall power | 0.1680 | 0.5400 | 0.8000 |
Number of subjects | 1000.0 | 1000.0 | 1000.0 |
Expected number of subjects under H1 | 1000.0 | | |
Cumulative number of events | 193.4 | 290.1 | 386.8 |
Analysis time | 39.082 | 52.710 | 69.107 |
Expected study duration | 58.0 | | |
Cumulative alpha spent | 0.0015 | 0.0096 | 0.0250 |
One-sided local significance level | 0.0015 | 0.0092 | 0.0220 |
Efficacy boundary (t) | 0.653 | 0.758 | 0.815 |
Exit probability for efficacy (under H0) | 0.0015 | 0.0081 | |
Exit probability for efficacy (under H1) | 0.1680 | 0.3720 | |

Legend:

*(t)*: treatment effect scale

Assume that the first interim was conducted after 205 rather than the planned 194 events.

The updated design is calculated as per the code below. Note that for the calculation of boundary values on the treatment effect scale, we use the function `getPowerSurvival()` with the updated design rather than the function `getSampleSizeSurvival()`, as we are only updating the boundary, not the sample size or the maximum number of events.

```
# Update design using observed information fraction at first interim.
# Information fraction of later interim analyses is not changed.
designUpdate1 <- getDesignGroupSequential(
  sided = 1, alpha = 0.025, beta = 0.2,
  informationRates = c(205 / 387, 0.75, 1), typeOfDesign = "asOF"
)
# Recalculate the power to get boundary values on the effect scale
# (Use original maxNumberOfEvents and sample size)
powerUpdate1 <- getPowerSurvival(
  design = designUpdate1,
  lambda2 = log(2) / 60, hazardRatio = 0.75,
  dropoutRate1 = 0.025, dropoutRate2 = 0.025, dropoutTime = 12,
  accrualTime = 0, accrualIntensity = 30,
  maxNumberOfSubjects = 1000, maxNumberOfEvents = 387, directionUpper = FALSE
)
```

The updated information rates and corresponding boundaries as per the output above are summarized as follows:

**Power calculation for a survival endpoint**

Sequential analysis with a maximum of 3 looks (group sequential design), overall significance level 2.5% (one-sided). The results were calculated for a two-sample logrank test, H0: hazard ratio = 1, power directed towards smaller values, H1: hazard ratio = 0.75, control lambda(2) = 0.012, maximum number of subjects = 1000, maximum number of events = 387, accrual time = 33.333, accrual intensity = 30, dropout rate(1) = 0.025, dropout rate(2) = 0.025, dropout time = 12.

Stage | 1 | 2 | 3 |
---|---|---|---|
Information rate | 53% | 75% | 100% |
Efficacy boundary (z-value scale) | 2.867 | 2.366 | 2.015 |
Overall power | 0.2097 | 0.5391 | 0.8001 |
Number of subjects | 1000.0 | 1000.0 | 1000.0 |
Expected number of subjects under H1 | 1000.0 | | |
Expected number of events | 317.0 | | |
Cumulative number of events | 205.0 | 290.2 | 387.0 |
Analysis time | 40.600 | 52.733 | 69.144 |
Expected study duration | 57.8 | | |
Cumulative alpha spent | 0.0021 | 0.0096 | 0.0250 |
One-sided local significance level | 0.0021 | 0.0090 | 0.0220 |
Efficacy boundary (t) | 0.670 | 0.758 | 0.815 |
Exit probability for efficacy (under H0) | 0.0021 | 0.0076 | |
Exit probability for efficacy (under H1) | 0.2097 | 0.3294 | |

Legend:

*(t)*: treatment effect scale

Assume that the efficacy boundary was not crossed at the first interim analysis and the trial continued to the second interim analysis, which was conducted after 285 rather than the planned 291 events. The updated design is calculated in the same way as for the first interim analysis, as per the code below. The idea is to use the cumulative α spent from the first stage and an updated cumulative α that is spent for the second stage. For the second stage, this can be obtained with the original O’Brien & Fleming α-spending function:

```
# Update design using observed information fraction at first and second interim.
designUpdate2 <- getDesignGroupSequential(
  sided = 1, alpha = 0.025, beta = 0.2,
  informationRates = c(205 / 387, 285 / 387, 1), typeOfDesign = "asOF"
)
# Recalculate power to get boundary values on effect scale
# (Use original maxNumberOfEvents and sample size)
powerUpdate2 <- getPowerSurvival(
  design = designUpdate2,
  lambda2 = log(2) / 60, hazardRatio = 0.75,
  dropoutRate1 = 0.025, dropoutRate2 = 0.025, dropoutTime = 12,
  accrualTime = 0, accrualIntensity = 30,
  maxNumberOfSubjects = 1000, maxNumberOfEvents = 387, directionUpper = FALSE
)
kable(summary(powerUpdate2))
```

**Power calculation for a survival endpoint**

Sequential analysis with a maximum of 3 looks (group sequential design), overall significance level 2.5% (one-sided). The results were calculated for a two-sample logrank test, H0: hazard ratio = 1, power directed towards smaller values, H1: hazard ratio = 0.75, control lambda(2) = 0.012, maximum number of subjects = 1000, maximum number of events = 387, accrual time = 33.333, accrual intensity = 30, dropout rate(1) = 0.025, dropout rate(2) = 0.025, dropout time = 12.

Stage | 1 | 2 | 3 |
---|---|---|---|
Information rate | 53% | 73.6% | 100% |
Efficacy boundary (z-value scale) | 2.867 | 2.393 | 2.011 |
Overall power | 0.2097 | 0.5198 | 0.8004 |
Number of subjects | 1000.0 | 1000.0 | 1000.0 |
Expected number of subjects under H1 | 1000.0 | | |
Expected number of events | 317.2 | | |
Cumulative number of events | 205.0 | 285.0 | 387.0 |
Analysis time | 40.600 | 51.931 | 69.144 |
Expected study duration | 57.8 | | |
Cumulative alpha spent | 0.0021 | 0.0090 | 0.0250 |
One-sided local significance level | 0.0021 | 0.0084 | 0.0222 |
Efficacy boundary (t) | 0.670 | 0.753 | 0.815 |
Exit probability for efficacy (under H0) | 0.0021 | 0.0069 | |
Exit probability for efficacy (under H1) | 0.2097 | 0.3101 | |

Legend:

*(t)*: treatment effect scale

Assume that the efficacy boundary was also not crossed at the second interim analysis and the trial continued to the final analysis, which was conducted after 393 rather than the planned 387 events. The updated design is calculated as per the code below. The idea here is to use the cumulative α spent up to the first *and* the second stage and the final α that is spent at the last stage. An updated correlation has to be used, and the original O’Brien & Fleming α-spending function cannot be used anymore. Instead, the α-spending function needs to be user-defined as follows:

```
# Update boundary with information fractions as per actually observed event numbers
# !! use user-defined alpha-spending and spend alpha according to actual alpha spent
# according to the second interim analysis
designUpdate3 <- getDesignGroupSequential(
  sided = 1, alpha = 0.025, beta = 0.2,
  informationRates = c(205, 285, 393) / 393,
  typeOfDesign = "asUser",
  userAlphaSpending = designUpdate2$alphaSpent
)
# Recalculate power to get boundary values on effect scale
# (Use planned sample size and **observed** maxNumberOfEvents)
powerUpdate3 <- getPowerSurvival(
  design = designUpdate3,
  lambda2 = log(2) / 60, hazardRatio = 0.75,
  dropoutRate1 = 0.025, dropoutRate2 = 0.025, dropoutTime = 12,
  accrualTime = 0, accrualIntensity = 30,
  maxNumberOfSubjects = 1000, maxNumberOfEvents = 393, directionUpper = FALSE
)
kable(summary(powerUpdate3))
```

**Power calculation for a survival endpoint**

Sequential analysis with a maximum of 3 looks (group sequential design), overall significance level 2.5% (one-sided). The results were calculated for a two-sample logrank test, H0: hazard ratio = 1, power directed towards smaller values, H1: hazard ratio = 0.75, control lambda(2) = 0.012, maximum number of subjects = 1000, maximum number of events = 393, accrual time = 33.333, accrual intensity = 30, dropout rate(1) = 0.025, dropout rate(2) = 0.025, dropout time = 12.

Stage | 1 | 2 | 3 |
---|---|---|---|
Information rate | 52.2% | 72.5% | 100% |
Efficacy boundary (z-value scale) | 2.867 | 2.393 | 2.014 |
Overall power | 0.2097 | 0.5198 | 0.8060 |
Number of subjects | 1000.0 | 1000.0 | 1000.0 |
Expected number of subjects under H1 | 1000.0 | | |
Expected number of events | 320.1 | | |
Cumulative number of events | 205.0 | 285.0 | 393.0 |
Analysis time | 40.600 | 51.931 | 70.280 |
Expected study duration | 58.4 | | |
Cumulative alpha spent | 0.0021 | 0.0090 | 0.0250 |
One-sided local significance level | 0.0021 | 0.0084 | 0.0220 |
Efficacy boundary (t) | 0.670 | 0.753 | 0.816 |
Exit probability for efficacy (under H0) | 0.0021 | 0.0069 | |
Exit probability for efficacy (under H1) | 0.2097 | 0.3101 | |

Legend:

*(t)*: treatment effect scale

For easier comparison, all discussed boundary updates and power calculations are summarized below. Note that each update only affects boundaries for the current or later analyses, i.e., earlier boundaries are never retrospectively modified.

**Sample size calculation for a survival endpoint**

Sequential analysis with a maximum of 3 looks (group sequential design), overall significance level 2.5% (one-sided). The results were calculated for a two-sample logrank test, H0: hazard ratio = 1, H1: hazard ratio = 0.75, control lambda(2) = 0.012, maximum number of subjects = 1000, accrual time = 33.333, accrual intensity = 30, dropout rate(1) = 0.025, dropout rate(2) = 0.025, dropout time = 12, power 80%.

Stage | 1 | 2 | 3 |
---|---|---|---|
Information rate | 50% | 75% | 100% |
Efficacy boundary (z-value scale) | 2.963 | 2.359 | 2.014 |
Overall power | 0.1680 | 0.5400 | 0.8000 |
Number of subjects | 1000.0 | 1000.0 | 1000.0 |
Expected number of subjects under H1 | 1000.0 | | |
Cumulative number of events | 193.4 | 290.1 | 386.8 |
Analysis time | 39.082 | 52.710 | 69.107 |
Expected study duration | 58.0 | | |
Cumulative alpha spent | 0.0015 | 0.0096 | 0.0250 |
One-sided local significance level | 0.0015 | 0.0092 | 0.0220 |
Efficacy boundary (t) | 0.653 | 0.758 | 0.815 |
Exit probability for efficacy (under H0) | 0.0015 | 0.0081 | |
Exit probability for efficacy (under H1) | 0.1680 | 0.3720 | |

Legend:

*(t)*: treatment effect scale

**Power calculation for a survival endpoint**

Sequential analysis with a maximum of 3 looks (group sequential design), overall significance level 2.5% (one-sided). The results were calculated for a two-sample logrank test, H0: hazard ratio = 1, power directed towards smaller values, H1: hazard ratio = 0.75, control lambda(2) = 0.012, maximum number of subjects = 1000, maximum number of events = 393, accrual time = 33.333, accrual intensity = 30, dropout rate(1) = 0.025, dropout rate(2) = 0.025, dropout time = 12.

Stage | 1 | 2 | 3 |
---|---|---|---|
Information rate | 52.2% | 72.5% | 100% |
Efficacy boundary (z-value scale) | 2.867 | 2.393 | 2.014 |
Overall power | 0.2097 | 0.5198 | 0.8060 |
Number of subjects | 1000.0 | 1000.0 | 1000.0 |
Expected number of subjects under H1 | 1000.0 | | |
Expected number of events | 320.1 | | |
Cumulative number of events | 205.0 | 285.0 | 393.0 |
Analysis time | 40.600 | 51.931 | 70.280 |
Expected study duration | 58.4 | | |
Cumulative alpha spent | 0.0021 | 0.0090 | 0.0250 |
One-sided local significance level | 0.0021 | 0.0084 | 0.0220 |
Efficacy boundary (t) | 0.670 | 0.753 | 0.816 |
Exit probability for efficacy (under H0) | 0.0021 | 0.0069 | |
Exit probability for efficacy (under H1) | 0.2097 | 0.3101 | |

Legend:

*(t)*: treatment effect scale

We now show how a concrete data analysis with an α-spending function design can be performed by specifying the parameter `maxInformation` in the `getAnalysisResults()` function. As above, we start with an initial design, which in this situation is arbitrary and can be considered a dummy design. Note that neither the number of stages nor the information rates need to be fixed.

```
# Dummy design
dummy <- getDesignGroupSequential(sided = 1, alpha = 0.025, typeOfDesign = "asOF")
```

The survival design was planned with a maximum of 387 events; the first interim took place after the observation of 205 events, the second after 285 events. Specifying the parameter `maxInformation` now makes it extremely easy to perform the analysis for the first and the second stage. Assume that we have observed log-rank statistics of 1.87 and 2.19 at the first and the second interim, respectively. These observations, together with the event numbers, are defined in the `getDataset()` function through

```
dataSurvival <- getDataset(
  cumulativeEvents = c(205, 285),
  cumulativeLogRanks = c(1.87, 2.19)
)
```

Note that it is important to define **cumulative**Events and **cumulative**LogRanks, because otherwise the stage-wise event numbers and log-rank statistics would have to be entered (in the given case, these will be calculated).
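For the event numbers, the relationship between the cumulative and the stage-wise scale is a simple difference (base R sketch; note that stage-wise log-rank statistics are *not* simple differences of the cumulative ones, which is why entering the cumulative values is the safer choice here):

```r
cumulativeEvents <- c(205, 285)

# Stage-wise event numbers: 205 events in stage 1, 80 additional in stage 2
stageWiseEvents <- diff(c(0, cumulativeEvents))
stageWiseEvents
```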

We can now enter the planned maximum number of events in the `getAnalysisResults()` function as follows:

```
testResults <- getAnalysisResults(
  design = dummy,
  dataInput = dataSurvival,
  maxInformation = 387
)
```

This provides the summary:

**Analysis results for a survival endpoint**

Sequential analysis with 3 looks (group sequential design). The results were calculated using a two-sample logrank test (one-sided, alpha = 0.025). H0: hazard ratio = 1 against H1: hazard ratio > 1.

Stage | 1 | 2 | 3 |
---|---|---|---|
Fixed weight | 0.53 | 0.736 | 1 |
Efficacy boundary (z-value scale) | 2.867 | 2.393 | 2.011 |
Cumulative alpha spent | 0.0021 | 0.0090 | 0.0250 |
Stage level | 0.0021 | 0.0084 | 0.0222 |
Cumulative effect size | 1.299 | 1.296 | |
Overall test statistic | 1.870 | 2.190 | |
Overall p-value | 0.0307 | 0.0143 | |
Test action | continue | continue | |
Conditional rejection probability | 0.1927 | 0.3987 | |
95% repeated confidence interval | [0.870; 1.938] | [0.976; 1.721] | |
Repeated p-value | 0.1159 | 0.0380 | |

We see that the boundaries are correctly calculated according to the observed information rates. If there is over-running, i.e., if the final analysis was conducted after 393 rather than the planned 387 events, first define the observed dataset

```
dataSurvival <- getDataset(
  cumulativeEvents = c(205, 285, 393),
  cumulativeLogRanks = c(1.87, 2.19, 2.33)
)
```

and then use the `getAnalysisResults()` function as before:

```
testResults <- getAnalysisResults(
  design = dummy,
  dataInput = dataSurvival,
  maxInformation = 387
)
```

The messages describe how the critical value for the last stage was calculated using the recalculated information rates (leaving the critical values for the first two stages unchanged). This approach was described in Section @ref(sec:update). The last warning indicates that, since there is no “natural” family of decision boundaries in this case, repeated p-values are not calculated for the final stage of the trial.

The summary shows that the recalculated boundary for the last stage, together with the boundaries already used for the first two stages, is indeed used for decision making:

**Analysis results for a survival endpoint**

Sequential analysis with 3 looks (group sequential design). The results were calculated using a two-sample logrank test (one-sided, alpha = 0.025). H0: hazard ratio = 1 against H1: hazard ratio > 1.

Stage | 1 | 2 | 3 |
---|---|---|---|
Fixed weight | 0.522 | 0.725 | 1 |
Efficacy boundary (z-value scale) | 2.867 | 2.393 | 2.014 |
Cumulative alpha spent | 0.0021 | 0.0090 | 0.0250 |
Stage level | 0.0021 | 0.0084 | 0.0220 |
Cumulative effect size | 1.299 | 1.296 | 1.265 |
Overall test statistic | 1.870 | 2.190 | 2.330 |
Overall p-value | 0.0307 | 0.0143 | 0.0099 |
Test action | continue | continue | reject |
Conditional rejection probability | 0.1910 | 0.3883 | |
95% repeated confidence interval | [0.870; 1.938] | [0.976; 1.721] | [1.032; 1.550] |
Repeated p-value | 0.1159 | 0.0380 | |
Final p-value | 0.0148 | | |
Final confidence interval | [1.023; 1.534] | | |
Median unbiased estimate | 1.255 | | |

We can also consider the case of under-running, which occurs if, for example, it was decided **before conducting the analysis** that the final stage is reached even if up to 3 fewer events than the planned maximum number are observed (i.e., the final stage is reached if 384 or more events were observed). The parameter `informationEpsilon` in the `getAnalysisResults()` function can be used for this. There are two ways of defining this parameter:

- In an absolute sense: the parameter `informationEpsilon` specifies the number of events that are allowed to deviate from the maximum number of events. This is achieved by specifying a positive integer for `informationEpsilon`.
- In a relative sense: if a number x < 1 is specified for `informationEpsilon`, the stage is considered the final stage if the fraction x of `maxInformation` is observed.
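A minimal base-R sketch of the two tolerance rules (the numbers mirror the example below; the comparison is an illustration of the idea, not rpact’s exact internal check):

```r
maxInformation <- 387
observedEvents <- 385

# Absolute: at most 3 events may be missing at the final analysis
isFinalAbsolute <- (maxInformation - observedEvents) <= 3      # TRUE: 2 <= 3

# Relative: at least 99% of maxInformation must be observed
isFinalRelative <- observedEvents >= 0.99 * maxInformation     # TRUE: 385 >= 383.13
```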

Both ways yield a correct calculation of the critical value to be used for the final stage. Suppose, for example, that 385 events were observed and `informationEpsilon` was set equal to 3. Then, since `387 - 385 = 2 <= 3`, this is an under-running case and the critical value at the final stage is provided in the summary:

```
dataSurvival <- getDataset(
  cumulativeEvents = c(205, 285, 385),
  cumulativeLogRanks = c(1.87, 2.19, 2.21)
)
testResults <- getAnalysisResults(
  design = dummy,
  dataInput = dataSurvival,
  maxInformation = 387,
  informationEpsilon = 3
)
```

**Analysis results for a survival endpoint**

Sequential analysis with 3 looks (group sequential design). The results were calculated using a two-sample logrank test (one-sided, alpha = 0.025). H0: hazard ratio = 1 against H1: hazard ratio > 1.

Stage | 1 | 2 | 3 |
---|---|---|---|
Fixed weight | 0.532 | 0.74 | 1 |
Efficacy boundary (z-value scale) | 2.867 | 2.393 | 2.010 |
Cumulative alpha spent | 0.0021 | 0.0090 | 0.0250 |
Stage level | 0.0021 | 0.0084 | 0.0222 |
Cumulative effect size | 1.299 | 1.296 | 1.253 |
Overall test statistic | 1.870 | 2.190 | 2.210 |
Overall p-value | 0.0307 | 0.0143 | 0.0136 |
Test action | continue | continue | reject |
Conditional rejection probability | 0.1932 | 0.4023 | |
95% repeated confidence interval | [0.870; 1.938] | [0.976; 1.721] | [1.021; 1.538] |
Repeated p-value | 0.1159 | 0.0380 | |
Final p-value | 0.0175 | | |
Final confidence interval | [1.016; 1.524] | | |
Median unbiased estimate | 1.246 | | |

We see that, again, the recalculated boundary for the last stage and the boundaries already used for the first two stages are used for decision making.

In summary, the parameter `maxInformation` in the `getAnalysisResults()` function can be used to perform an α-spending function approach in practice. If over-running or (pre-defined) under-running occurs at the analysis stage, the parameters `maxInformation` and `informationEpsilon` provide an easy way to perform a correct analysis with the specified design.


These examples are not intended to replace the official rpact documentation and help pages but rather to supplement them. They also only cover a selection of all rpact features.

General convention: In rpact, arguments containing the **index “2”** always refer to the **control group**, **“1”** refer to the **intervention group**, and **treatment effects compare treatment versus control**.

**First, load the rpact package**

```
library(rpact)
packageVersion("rpact") # version should be version 3.0 or later
```

`[1] '3.5.1'`

The **sample size** for a trial with continuous endpoints can be calculated using the function `getSampleSizeMeans()`. This function is fully documented in the relevant help page (`?getSampleSizeMeans`). Some examples are provided below.

`getSampleSizeMeans()` requires that the mean difference between the two arms is larger under the alternative than under the null hypothesis. For superiority trials, this implies that **rpact requires that the targeted mean difference is > 0 under the alternative hypothesis**. If this is not the case, the function produces an error message. To circumvent this and power for a negative mean difference, **one can simply switch the two arms** (leading to a positive mean difference), as the situation is perfectly symmetric.

By default, `getSampleSizeMeans()` tests hypotheses about the mean difference. rpact also supports testing hypotheses about mean ratios if the argument `meanRatio` is set to `TRUE`, but this will not be discussed further in this document.

By default, rpact uses sample size formulas for the t-test, i.e., it assumes that the standard deviation in the two groups is equal but unknown and estimated from the data. If sample size calculations for the z-test are desired, one can set the argument `normalApproximation` to `TRUE`, but this is usually not recommended.

```
# Example of a standard trial:
# - targeted mean difference is 10 (alternative = 10)
# - standard deviation in both arms is assumed to be 24 (stDev = 24)
# - two-sided test (sided = 2), Type I error 0.05 (alpha = 0.05) and
#   power 80% (beta = 0.2)
sampleSizeResult <- getSampleSizeMeans(
  alternative = 10, stDev = 24, sided = 2,
  alpha = 0.05, beta = 0.2
)
kable(sampleSizeResult)
```

**Design plan parameters and output for means**

**Design parameters**

*Critical values*: 1.960
*Two-sided power*: FALSE
*Significance level*: 0.0500
*Type II error rate*: 0.2000
*Test*: two-sided

**User defined parameters**

*Alternatives*: 10
*Standard deviation*: 24

**Default parameters**

*Mean ratio*: FALSE
*Theta H0*: 0
*Normal approximation*: FALSE
*Treatment groups*: 2
*Planned allocation ratio*: 1

**Sample size and output**

*Number of subjects fixed*: 182.8
*Number of subjects fixed (1)*: 91.4
*Number of subjects fixed (2)*: 91.4
*Lower critical values (treatment effect scale)*: -7.006
*Upper critical values (treatment effect scale)*: 7.006
*Local one-sided significance levels*: 0.0500

**Legend**

*(i)*: values of treatment arm i

The generic `summary()` function produces the output

`kable(summary(sampleSizeResult))`

**Sample size calculation for a continuous endpoint**

Fixed sample analysis, significance level 5% (two-sided). The results were calculated for a two-sample t-test, H0: mu(1) - mu(2) = 0, H1: effect = 10, standard deviation = 24, power 80%.

Stage | Fixed |
---|---|
Efficacy boundary (z-value scale) | 1.960 |
Number of subjects | 182.8 |
Two-sided local significance level | 0.0500 |
Lower efficacy boundary (t) | -7.006 |
Upper efficacy boundary (t) | 7.006 |

Legend:

*(t)*: treatment effect scale

As per the output above, the required **total sample size** for the trial is 183 and the critical value corresponds to a minimal detectable mean difference of approximately 7.01.
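The 182.8 total can be cross-checked with base R’s `power.t.test()`, which uses the same two-sample t-test assumptions (it returns the per-group sample size, so the total is twice the returned `n`):

```r
# Per-group sample size for delta = 10, sd = 24, two-sided alpha = 0.05, power 80%
n <- power.t.test(delta = 10, sd = 24, sig.level = 0.05, power = 0.8)$n

# Total sample size, close to the 182.8 reported by rpact
2 * n
```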

Unequal randomization between the treatment groups can be defined with `allocationRatioPlanned`, for example:

```
# Extension of standard trial:
# - 2(intervention):1(control) randomization (allocationRatioPlanned = 2)
kable(summary(getSampleSizeMeans(
  alternative = 10, stDev = 24,
  allocationRatioPlanned = 2, sided = 2, alpha = 0.05, beta = 0.2
)))
```

**Sample size calculation for a continuous endpoint**

Fixed sample analysis, significance level 5% (two-sided). The results were calculated for a two-sample t-test, H0: mu(1) - mu(2) = 0, H1: effect = 10, standard deviation = 24, planned allocation ratio = 2, power 80%.

Stage | Fixed |
---|---|
Efficacy boundary (z-value scale) | 1.960 |
Number of subjects | 205.4 |
Two-sided local significance level | 0.0500 |
Lower efficacy boundary (t) | -7.004 |
Upper efficacy boundary (t) | 7.004 |

Legend:

*(t)*: treatment effect scale
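The increase from 182.8 to 205.4 subjects matches the well-known inflation factor (1 + k)²/(4k) for an allocation ratio k (base R sketch; the small discrepancy is due to rounding and the t-test correction):

```r
k <- 2                            # 2:1 allocation
inflation <- (1 + k)^2 / (4 * k)  # 1.125

# Approximate total sample size under 2:1 allocation,
# close to the 205.4 reported above
182.8 * inflation
```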

**Power** for a given sample size can be calculated using the function `getPowerMeans()`, which has the same arguments as `getSampleSizeMeans()` except that the maximum total sample size is given (`maxNumberOfSubjects`) instead of the Type II error (`beta`).

```
# Calculate power for the 2:1 randomized trial with total sample size 206
# (as above) assuming a larger difference of 12
powerResult <- getPowerMeans(
  alternative = 12, stDev = 24, sided = 2,
  allocationRatioPlanned = 2, maxNumberOfSubjects = 206, alpha = 0.05
)
kable(powerResult)
```

**Design plan parameters and output for means**

**Design parameters**

*Critical values*: 1.960
*Significance level*: 0.0500
*Test*: two-sided

**User defined parameters**

- *Alternatives*: 12
- *Standard deviation*: 24
- *Planned allocation ratio*: 2
- *Direction upper*: NA
- *Maximum number of subjects*: 206

**Default parameters**

- *Mean ratio*: FALSE
- *Theta H0*: 0
- *Normal approximation*: FALSE
- *Treatment groups*: 2

**Power and output**

- *Effect*: 12
- *Overall reject*: 0.9203
- *Number of subjects fixed*: 206
- *Number of subjects fixed (1)*: 137.3
- *Number of subjects fixed (2)*: 68.7
- *Lower critical values (treatment effect scale)*: -6.994
- *Upper critical values (treatment effect scale)*: 6.994
- *Local one-sided significance levels*: 0.0500

**Legend**

*(i)*: values of treatment arm i

The calculated **power** is provided in the output as **“Overall reject”** and is 0.92 for the example `alternative = 12`.
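
As a plausibility check, this power can be approximated in base R via the normal approximation (rpact uses the exact t-distribution, so the result differs slightly from 0.9203):

```r
# Normal-approximation check of the power for effect 12, sd 24,
# 2:1 allocation with 206 subjects in total, two-sided alpha = 0.05
n1 <- 206 * 2 / 3; n2 <- 206 / 3
se <- 24 * sqrt(1 / n1 + 1 / n2)
pnorm(12 / se - qnorm(0.975))  # lower rejection region is negligible here
```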

The `summary()` function produces a more concise overview:

`kable(summary(powerResult))`

**Power calculation for a continuous endpoint**

Fixed sample analysis, significance level 5% (two-sided). The results were calculated for a two-sample t-test, H0: mu(1) - mu(2) = 0, H1: effect = 12, standard deviation = 24, number of subjects = 206, planned allocation ratio = 2.

Stage | Fixed |
---|---|

Efficacy boundary (z-value scale) | 1.960 |

Power | 0.9203 |

Number of subjects | 206.0 |

Two-sided local significance level | 0.0500 |

Lower efficacy boundary (t) | -6.994 |

Upper efficacy boundary (t) | 6.994 |

Legend:

*(t)*: treatment effect scale

`getPowerMeans()` (as well as `getSampleSizeMeans()`) can also be called with a vector argument for the mean difference under the alternative H1 (`alternative`). This is illustrated below via a plot of power depending on these values. For examples of all available plots, see the R Markdown document How to create admirable plots with rpact.

```
# Example: Calculate power for design with sample size 206 as above
# alternative values ranging from 5 to 15
powerResult <- getPowerMeans(
alternative = 5:15, stDev = 24, sided = 2,
allocationRatioPlanned = 2, maxNumberOfSubjects = 206, alpha = 0.05
)
plot(powerResult, type = 7) # one of several possible plots
```

The sample size calculation proceeds in the same fashion as for superiority trials, except that the roles of the null and the alternative hypothesis are reversed and the test is always one-sided. In this case, the non-inferiority margin corresponds to the treatment effect under the null hypothesis (`thetaH0`) which one aims to reject.

```
# Example: Non-inferiority trial with margin delta = 12, standard deviation = 14
# - One-sided alpha = 0.025, 1:1 randomization
# - H0: treatment difference <= -12 (thetaH0 = -12)
# vs. alternative H1: treatment difference = 0 (alternative = 0)
sampleSizeNoninf <- getSampleSizeMeans(
thetaH0 = -12, alternative = 0,
stDev = 14, alpha = 0.025, beta = 0.2, sided = 1
)
kable(sampleSizeNoninf)
```

**Design plan parameters and output for means**

**Design parameters**

- *Critical values*: 1.960
- *Significance level*: 0.0250
- *Type II error rate*: 0.2000
- *Test*: one-sided

**User defined parameters**

- *Theta H0*: -12
- *Alternatives*: 0
- *Standard deviation*: 14

**Default parameters**

- *Mean ratio*: FALSE
- *Normal approximation*: FALSE
- *Treatment groups*: 2
- *Planned allocation ratio*: 1

**Sample size and output**

- *Number of subjects fixed*: 44.7
- *Number of subjects fixed (1)*: 22.4
- *Number of subjects fixed (2)*: 22.4
- *Critical values (treatment effect scale)*: -3.556

**Legend**

*(i)*: values of treatment arm i

Sample size calculation for a group sequential trial is performed in **two steps**:

1. **Define the (abstract) group sequential design** using the function `getDesignGroupSequential()`. For details regarding this step, see the R Markdown file Defining group sequential boundaries with rpact.
2. **Calculate the sample size** for the continuous endpoint by feeding the abstract design into the function `getSampleSizeMeans()`.

In general, rpact supports both one-sided and two-sided group sequential designs. However, if futility boundaries are specified, only one-sided tests are permitted. **For simplicity, it is often preferred to use one-sided tests for group sequential designs** (typically, with α = 0.025).

R code for a simple example is provided below:

```
# Example: Group-sequential design with O'Brien & Fleming type alpha-spending
# and one interim at 60% information
design <- getDesignGroupSequential(
sided = 1, alpha = 0.025, beta = 0.2,
informationRates = c(0.6, 1), typeOfDesign = "asOF"
)
# Trial assumes an effect size of 10 as above, a stDev = 24, and an allocation
# ratio of 2
sampleSizeResultGS <- getSampleSizeMeans(
design,
alternative = 10, stDev = 24, allocationRatioPlanned = 2
)
# Standard rpact output (sample size object only, not design object)
kable(sampleSizeResultGS)
```

**Design plan parameters and output for means**

**Design parameters**

- *Information rates*: 0.600, 1.000
- *Critical values*: 2.669, 1.981
- *Futility bounds (binding)*: -Inf
- *Cumulative alpha spending*: 0.003808, 0.025000
- *Local one-sided significance levels*: 0.003808, 0.023798
- *Significance level*: 0.0250
- *Type II error rate*: 0.2000
- *Test*: one-sided

**User defined parameters**

- *Alternatives*: 10
- *Standard deviation*: 24
- *Planned allocation ratio*: 2

**Default parameters**

- *Mean ratio*: FALSE
- *Theta H0*: 0
- *Normal approximation*: FALSE
- *Treatment groups*: 2

**Sample size and output**

- *Maximum number of subjects*: 207.1
- *Maximum number of subjects (1)*: 138.1
- *Maximum number of subjects (2)*: 69
- *Number of subjects [1]*: 124.3
- *Number of subjects [2]*: 207.1
- *Number of subjects (1) [1]*: 82.9
- *Number of subjects (1) [2]*: 138.1
- *Number of subjects (2) [1]*: 41.4
- *Number of subjects (2) [2]*: 69
- *Reject per stage [1]*: 0.3123
- *Reject per stage [2]*: 0.4877
- *Early stop*: 0.3123
- *Expected number of subjects under H0*: 206.8
- *Expected number of subjects under H0/H1*: 202.4
- *Expected number of subjects under H1*: 181.3
- *Critical values (treatment effect scale) [1]*: 12.393
- *Critical values (treatment effect scale) [2]*: 7.050

**Legend**

*(i)*: values of treatment arm i
*[k]*: values at stage k

```
# Summary rpact output for sample size object
kable(summary(sampleSizeResultGS))
```

**Sample size calculation for a continuous endpoint**

Sequential analysis with a maximum of 2 looks (group sequential design), overall significance level 2.5% (one-sided). The results were calculated for a two-sample t-test, H0: mu(1) - mu(2) = 0, H1: effect = 10, standard deviation = 24, planned allocation ratio = 2, power 80%.

Stage | 1 | 2 |
---|---|---|

Information rate | 60% | 100% |

Efficacy boundary (z-value scale) | 2.669 | 1.981 |

Overall power | 0.3123 | 0.8000 |

Number of subjects | 124.3 | 207.1 |

Expected number of subjects under H1 | 181.3 | |

Cumulative alpha spent | 0.0038 | 0.0250 |

One-sided local significance level | 0.0038 | 0.0238 |

Efficacy boundary (t) | 12.393 | 7.050 |

Exit probability for efficacy (under H0) | 0.0038 | |

Exit probability for efficacy (under H1) | 0.3123 | |

Legend:

*(t)*: treatment effect scale

System: rpact 3.5.1, R version 4.3.2 (2023-10-31 ucrt), platform: x86_64-w64-mingw32

To cite R in publications use:

*R: A Language and Environment for Statistical Computing*. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/. To cite package ‘rpact’ in publications use:

*rpact: Confirmatory Adaptive Clinical Trial Design and Analysis*. R package version 3.5.1, https://www.rpact.com, https://github.com/rpact-com/rpact, https://rpact-com.github.io/rpact/, https://www.rpact.org.

This document describes how sample size and power calculations for count data can be performed using rpact. This is shown for the fixed sample and the group sequential case, thereby illustrating different ways of entering recruitment and observation schemes. It also describes how blinded sample size recalculation procedures can be performed.

Examples of count data described in the literature are

- exacerbations in asthma and chronic obstructive pulmonary disease (COPD)
- counts of brain lesions by MRI in Multiple Sclerosis (MS)
- relapses in pediatric MS
- hospitalizations in heart failure trials
- number of occurrences of adverse events

Typically, the count outcome is assumed to be distributed according to a negative binomial distribution, and the hypothesis to be tested is

H0: λ1 / λ2 ≥ δ0 versus H1: λ1 / λ2 < δ0,

where λ1 and λ2 are the mean rates (in one time unit) of the negative binomially distributed counts Y_ijk with overdispersion (shape) parameter φ, and t_ijk refers to the exposure time of subject j in treatment group i = 1, 2 at interim stage k of the group sequential test procedure (cf., Mütze et al., 2019). The expectation and variance of Y_ijk are given by

E(Y_ijk) = t_ijk λ_i and Var(Y_ijk) = t_ijk λ_i (1 + φ t_ijk λ_i),

respectively, i.e., the case φ = 0 refers to the case where Y_ijk is Poisson distributed. For the fixed sample case, the index k for the interim stage is omitted. In superiority trials, δ0 = 1, whereas, for non-inferiority trials, a margin δ0 > 1 is specified.

In many cases, each subject is observed for a given length of time, e.g., one year. In this case, t_ijk ≡ t, and, as will be shown below, the sample size formulas described in the literature are applicable. If subjects entering the study have different exposure times, typically an accrual period is followed by an additional follow-up time. If subjects enter the study during an accrual period of length a and the study time is a + f, then, at time point s, the time under exposure for subject j in treatment i at stage k of the trial is t_ijk = min(s, a + f) − r_ij, with r_ij (0 ≤ r_ij ≤ a) denoting the recruitment time for subject j in treatment i. This more general approach is specifically necessary if the observation times at interim stages need to be estimated. This will also be illustrated by examples later on.

For group sequential designs, the test statistic is based on the Wald statistic, which is the difference of the rates on the log-scale divided by its standard error. As shown in Mütze et al. (2019), if Maximum Likelihood estimates are used to estimate the true parameters, the sequence of Wald statistics asymptotically has the independent and normally distributed increments property. For designs with interim stages, it is essential that interim analyses take place after specified amounts of information. The information level of the Wald statistic (the Fisher information) at stage k is given by

I_k = 1 / [ (∑_{j=1}^{n1} t_{1jk} λ1 / (1 + φ t_{1jk} λ1))⁻¹ + (∑_{j=1}^{n2} t_{2jk} λ2 / (1 + φ t_{2jk} λ2))⁻¹ ],

which simplifies to

I_k = 1 / [ (1 + φ t λ1) / (n1 t λ1) + (1 + φ t λ2) / (n2 t λ2) ],

if t_ijk ≡ t, i.e., if all subjects have complete observations. From these terms, essentially, the sample size and other calculations for a count data type design are derived.
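
For equal exposure times, the simplified information formula is easy to evaluate directly. The following base-R sketch (the helper `infoCounts()` is ours, not part of rpact) computes the information for illustrative parameter values:

```r
# Fisher information of the Wald statistic when all subjects share the
# same exposure time t (simplified formula above)
infoCounts <- function(n1, n2, lambda1, lambda2, phi, t) {
  v1 <- (1 + phi * t * lambda1) / (n1 * t * lambda1)
  v2 <- (1 + phi * t * lambda2) / (n2 * t * lambda2)
  1 / (v1 + v2)
}
# Values taken from the Zhu and Lakkis example discussed below
infoCounts(n1 = 1316, n2 = 1316, lambda1 = 0.68, lambda2 = 0.8,
  phi = 0.4, t = 0.75)
```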

The sample size calculation to meet power 1 − β for two-sample comparisons is performed for

- for group sequential designs, the type of design (e.g., α-spending),
- an assumed λ1 and λ2 (or, equivalently, λ2 and the rate ratio θ = λ1/λ2),
- assumed exposure times t_ijk for treatment i = 1, 2 and subjects j = 1, …, n_i at interim stage k,
- a planned allocation ratio r = n1/n2,
- and an assumed overdispersion φ.

`getSampleSizeCounts()` performs sample size and power calculations for count data designs. You can specify

- a group sequential or a fixed sample size setting
- either λ1 and λ2, or λ2 and θ, or the pooled rate λ and θ, the latter being essential for blinded sample size reassessment (SSR) procedures (see below); λ1 and θ can be vectors
- different ways of calculation: fixed exposure time, accrual and study time, or accrual and fixed number of subjects
- staggered subject entry

The usage of the function (listing the parameters that can be specified) is as follows:

```
getSampleSizeCounts(
design = NULL,
...,
lambda1 = NA_real_,
lambda2 = NA_real_,
lambda = NA_real_,
theta = NA_real_,
thetaH0 = 1,
overdispersion = 0,
fixedExposureTime = NA_real_,
accrualTime = NA_real_,
accrualIntensity = NA_real_,
followUpTime = NA_real_,
maxNumberOfSubjects = NA_real_,
allocationRatioPlanned = NA_real_
)
```

which will now be illustrated by examples.

`getPowerCounts()` conversely calculates the power at given sample sizes, and essentially the same parameters can be specified.

Consider the clinical trial in COPD subjects from Zhu and Lakkis (2014). Suppose a new therapy is assumed to decrease the exacerbation rate from 0.80 to 0.68 (a 15% decrease relative to control) within an observation period of 0.75 years, i.e., each subject has an equal follow-up of 0.75 years. Subjects are randomly allocated to treatment and control with equal allocation 1:1.

The sample size that yields 80% power for detecting such a difference, if the overdispersion is assumed to be equal to 0.4, is obtained as follows.

First, load the `rpact` package:

```
library(rpact)
packageVersion("rpact") # version should be 3.5.0 or higher
```

`[1] '3.5.1'`

The `example1$nFixed1` element is the number of subjects in the treatment group, and `example1$nFixed2` refers to the number of subjects in the control group:

```
example1 <- getSampleSizeCounts(
alpha = 0.025,
beta = 0.2,
lambda2 = 0.8,
theta = 0.85,
overdispersion = 0.4,
fixedExposureTime = 0.75
)
c(example1$nFixed1, example1$nFixed2)
```

`[1] 1316 1316`

and we conclude that N = 2632 subjects in total are needed to provide 80% power.
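
This number can be reproduced by hand from the fixed-sample formula: the required information (qnorm(0.975) + qnorm(0.8))² / log(θ)² times the per-subject variance contributions of the two arms. A base-R sketch under the same assumptions (variable names are ours):

```r
# Recompute the per-group sample size for the Zhu and Lakkis example
lambda2 <- 0.8; theta <- 0.85; lambda1 <- theta * lambda2
phi <- 0.4; t <- 0.75
infoRequired <- (qnorm(0.975) + qnorm(0.8))^2 / log(theta)^2
nPerGroup <- ceiling(infoRequired *
  ((1 + phi * t * lambda1) / (t * lambda1) +
   (1 + phi * t * lambda2) / (t * lambda2)))
nPerGroup   # 1316, matching example1$nFixed1
```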

Conversely, `getPowerCounts()` performs the power calculation at a given sample size; note that `directionUpper = FALSE` specifies that the power is directed for rate ratios θ < 1:

```
example2 <- getPowerCounts(
alpha = 0.025,
lambda2 = 0.8,
theta = 0.85,
overdispersion = 0.4,
fixedExposureTime = 0.75,
directionUpper = FALSE,
maxNumberOfSubjects = example1$nFixed
)
example2$overallReject
```

`[1] 0.8000924`

The following graph illustrates the sample sizes for stronger effects θ. Note that for this plot only the lower and upper bound of θ need to be specified:

```
getSampleSizeCounts(
alpha = 0.025,
beta = 0.2,
lambda2 = 0.8,
theta = c(0.75, 0.85),
overdispersion = 0.4,
fixedExposureTime = 0.75
) |>
plot()
```

In the fixed sample case this is the only available plot type (`type = 5`).

For `getPowerCounts()` the only available plot type in the fixed sample case is `type = 7`. The following graph also illustrates how elements can be added to the `ggplot2` object:

```
library(ggplot2) # attach ggplot2 for ylab(), ggtitle(), geom_hline() below

getPowerCounts(
alpha = 0.025,
lambda2 = 0.8,
theta = c(0.8, 1),
overdispersion = 0.4,
fixedExposureTime = 0.75,
directionUpper = FALSE,
maxNumberOfSubjects = example1$nFixed
) |>
plot() +
ylab("Power") +
ggtitle("Power for count data design for varying effect") +
geom_hline(linewidth = 0.5, yintercept = 0.025, linetype = "dotted") +
geom_hline(linewidth = 0.5, yintercept = 0.8, linetype = "dotted")
```

The influence of the overdispersion parameter on the total sample size is illustrated in the following graph for increasing effect θ:

```
library(ggplot2) # attach ggplot2 for the ggplot() call below

results <- c()
for (theta in seq(0.75, 0.85, 0.05)) {
for (phi in seq(0, 1, 0.1)) {
results <- rbind(
results,
getSampleSizeCounts(
alpha = 0.025,
beta = 0.2,
lambda2 = 0.8,
theta = theta,
overdispersion = phi,
fixedExposureTime = 0.75
) |>
as.data.frame()
)
}
}
ggplot(
data = results,
aes(x = overdispersion, y = nFixed, group = theta, color = as.factor(theta))
) +
xlab("Overdispersion") +
ylab("Total sample size") +
geom_line(linewidth = 1.1) +
geom_hline(linewidth = 0.5, yintercept = 1000, linetype = "dotted") +
geom_hline(linewidth = 0.5, yintercept = 2000, linetype = "dotted") +
geom_hline(linewidth = 0.5, yintercept = 3000, linetype = "dotted") +
labs(color = "Theta") +
theme_classic()
```

Zhu and Lakkis (2014) proposed three methods for calculating the sample size; the methodology implemented in `rpact` corresponds to the M2 method described in their paper. The M2 method matches the sample size formulas given in, e.g., Friede and Schmidli (2010a, 2010b) and Mütze et al. (2019). It is in fact easy to recalculate the sample sizes in Table 1 of their paper:

```
results <- c()
for (phi in c(0.4, 0.7, 1, 1.5)) {
for (theta in c(0.85, 1.15)) {
for (lambda2 in seq(0.8, 1.4, 0.2)) {
results <- c(results, getSampleSizeCounts(
alpha = 0.025,
beta = 0.2,
lambda2 = lambda2,
theta = theta,
overdispersion = phi,
fixedExposureTime = 0.75
)$nFixed1)
}
}
}
cat(paste0(results, collapse = ", "))
```

1316, 1101, 957, 854, 1574, 1324, 1157, 1037, 1494, 1279, 1135, 1033, 1815, 1565, 1398, 1278, 1673, 1457, 1313, 1211, 2056, 1806, 1639, 1520, 1970, 1754, 1611, 1508, 2458, 2208, 2041, 1921

Similarly, Table 2 results (column M2) with unequal allocation between the treatment arms can be reconstructed by

```
results <- c()
for (phi in c(1, 5)) {
for (theta in c(0.5, 1.5)) {
for (lambda2 in c(2, 5, 10)) {
for (r in c(2 / 3, 1, 3 / 2)) {
results <- c(results, getSampleSizeCounts(
alpha = 0.025,
beta = 0.2,
lambda2 = lambda2,
theta = theta,
overdispersion = phi,
allocationRatioPlanned = r,
fixedExposureTime = 1
)$nFixed)
}
}
}
}
cat(paste0(results, collapse = ", "))
```

124, 116, 117, 90, 86, 88, 80, 76, 79, 280, 272, 287, 232, 224, 235, 215, 208, 217, 395, 376, 389, 363, 348, 360, 352, 338, 350, 1075, 1036, 1082, 1027, 988, 1030, 1012, 972, 1013

Slight deviations result from rounding errors.

With the `getSampleSizeCounts()` function it is easy to determine the allocation ratio that provides the smallest overall sample size at given power 1 − β. This can be done by setting `allocationRatioPlanned = 0`. In the example from above,

```
example3 <- getSampleSizeCounts(
alpha = 0.025,
beta = 0.2,
lambda2 = 0.8,
theta = 0.85,
overdispersion = 0.4,
allocationRatioPlanned = 0,
fixedExposureTime = 0.75
)
```

`example3$allocationRatioPlanned`

`[1] 1.068791`

`example3$nFixed`

`[1] 2629`

this calculates the optimum allocation ratio to be 1.069, thereby reducing the necessary sample size only very slightly from 2632 to 2629. Given this result, it might not be reasonable to deviate from a planned 1:1 allocation.
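
The reported optimum agrees with the closed-form solution: the total sample size is minimized when the allocation ratio equals the square root of the ratio of the per-subject variance contributions of the two arms. A base-R check (variable names are ours):

```r
# Optimal allocation ratio r = n1/n2 = sqrt(v1/v2), where v_i is the
# per-subject variance contribution of arm i
lambda1 <- 0.68; lambda2 <- 0.8; phi <- 0.4; t <- 0.75
v1 <- (1 + phi * t * lambda1) / (t * lambda1)
v2 <- (1 + phi * t * lambda2) / (t * lambda2)
sqrt(v1 / v2)   # 1.068791, matching example3$allocationRatioPlanned
```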

Friede and Schmidli (2010a, 2010b) consider blinded SSR procedures with count data. They show that blinded SSR to reestimate the overdispersion parameter maintains the required power without increasing the Type I error rate. The procedure is simply to calculate the overdispersion at interim in a blinded manner and to recalculate the sample size with a pooled event rate estimate and *under the assumption of the originally assumed effect*.

For example, if in the situation from above the overdispersion was estimated from the pooled sample to be, say, 0.352, and the overall event rate is estimated as λ = 0.921, the recalculated sample size is

```
example4 <- getSampleSizeCounts(
alpha = 0.025,
beta = 0.2,
lambda = 0.921,
theta = 0.85,
overdispersion = 0.352,
fixedExposureTime = 0.75
)
example4$nFixed
```

`[1] 2152`

thus reducing the necessary sample size from 2632 to 2152. Note that, of course, the timing of the interim review matters. If it is done early, the nuisance parameters λ and φ cannot be estimated precisely enough; if it is done very late, the recalculated sample size might be smaller than the number of subjects already observed, so more subjects are observed than needed. This also has an impact on the test characteristics and might be investigated by simulations (Friede and Schmidli, 2010a). Methods for blinded estimation are compared in Schneider et al. (2013).

For checking the results of `rpact`, the sample sizes in Table 1 from Friede and Schmidli (2010b) can be reconstructed by

```
results <- c()
for (theta in c(0.7, 0.8)) {
for (phi in c(0.4, 0.5, 0.6)) {
for (lambda in c(1, 1.5, 2)) {
results <- c(results, getSampleSizeCounts(
alpha = 0.025,
beta = 0.2,
lambda = lambda,
theta = theta,
overdispersion = phi,
fixedExposureTime = 1
)$nFixed2)
}
}
}
cat(paste0(results, collapse = ", "))
```

177, 135, 114, 190, 147, 126, 202, 159, 138, 446, 339, 286, 477, 371, 318, 509, 402, 349

and

```
results <- c()
for (theta in c(0.7, 0.8)) {
for (phi in c(0.4, 0.5, 0.6)) {
for (lambda in c(1, 1.5, 2)) {
results <- c(results, getSampleSizeCounts(
alpha = 0.025,
beta = 0.1,
lambda = lambda,
theta = theta,
overdispersion = phi,
fixedExposureTime = 1
)$nFixed2)
}
}
}
cat(paste0(results, collapse = ", "))
```

237, 180, 152, 254, 197, 168, 270, 213, 185, 597, 454, 383, 639, 496, 425, 681, 539, 467

Slight deviations result from rounding errors.

For the non-inferiority case, a non-inferiority margin needs to be specified and entered as `thetaH0`. Typically, no difference in the event rates is assumed between the treatment groups (i.e., θ = 1). In that case, the control arm sample sizes from Table 2 and Table 3 from Friede and Schmidli (2010b) are obtained with

```
results <- c()
for (delta0 in c(1.15, 1.2)) {
for (phi in c(0.4, 0.5, 0.6)) {
for (lambda in c(1, 1.5, 2)) {
results <- c(results, getSampleSizeCounts(
alpha = 0.025,
beta = 0.2,
lambda = lambda,
theta = 1,
thetaH0 = delta0,
overdispersion = phi,
fixedExposureTime = 1
)$nFixed2)
}
}
}
cat(paste0(results, collapse = ", "))
```

1126, 858, 724, 1206, 938, 804, 1286, 1018, 885, 662, 504, 426, 709, 551, 473, 756, 599, 520

We will now consider count data designs with interim stages. First, you need to specify the design, which here is an O’Brien and Fleming type alpha-spending design with interim analyses planned after 40% and 70% of the information:

```
design <- getDesignGroupSequential(
informationRates = c(0.4, 0.7, 1),
typeOfDesign = "asOF"
)
```

Suppose study subjects are observed with a fixed exposure time of 12 months and have event rates 0.2 and 0.3 in the treatment and the control arm, respectively, with an overdispersion parameter equal to 1.5. Specify these parameters as follows to obtain the summary:

```
getSampleSizeCounts(
design = design,
lambda1 = 0.2,
lambda2 = 0.3,
fixedExposureTime = 12,
overdispersion = 1.5
) |>
summary()
```

*Sample size calculation for a count data endpoint*

Sequential analysis with a maximum of 3 looks (group sequential design), overall significance level 2.5% (one-sided). The results were calculated for a two-sample test for count data, H0: lambda(1) / lambda(2) = 1, H1: effect = 0.667, lambda(1) = 0.2, lambda(2) = 0.3, overdispersion = 1.5, fixed exposure time = 12, power 80%.

Stage | 1 | 2 | 3 |
---|---|---|---|

Information rate | 40% | 70% | 100% |

Efficacy boundary (z-value scale) | 3.357 | 2.445 | 2.001 |

Overall power | 0.0580 | 0.4682 | 0.8000 |

Maximum number of subjects | 360.0 | ||

Information over stages | 19.4 | 33.9 | 48.5 |

Expected information under H0 | 48.4 | ||

Expected information under H0/H1 | 46.9 | ||

Expected information under H1 | 40.8 | ||

Maximum information | 48.5 | ||

Cumulative alpha spent | 0.0004 | 0.0074 | 0.0250 |

One-sided local significance level | 0.0004 | 0.0073 | 0.0227 |

Exit probability for efficacy (under H0) | 0.0004 | 0.0070 | |

Exit probability for efficacy (under H1) | 0.0580 | 0.4102 | |

This summary displays the maximum amount of information (48.47) that needs to be achieved with N = 360 subjects, together with the expected information under H0, midway between H0 and H1, and under H1, and the stopping probabilities under H0 and H1 if the interim analyses are performed at information levels 19.39 and 33.93 and the final analysis at 48.47.

If non-binding futility stops are planned, these might be derived from an O’Brien and Fleming beta-spending function with β = 0.2, i.e., the following design as displayed in the graph below:

```
designFutility <- getDesignGroupSequential(
informationRates = c(0.4, 0.7, 1),
beta = 0.2,
typeOfDesign = "asOF",
typeBetaSpending = "bsOF",
bindingFutility = FALSE
)
designFutility |>
plot()
```

This yields the following test characteristics with additional futility stop probabilities, resulting in a slightly higher number of subjects and higher information levels necessary to achieve 80% power:

```
getSampleSizeCounts(
design = designFutility,
lambda1 = 0.2,
lambda2 = 0.3,
fixedExposureTime = 12,
overdispersion = 1.5
) |>
summary()
```

*Sample size calculation for a count data endpoint*

Sequential analysis with a maximum of 3 looks (group sequential design), overall significance level 2.5% (one-sided). The results were calculated for a two-sample test for count data, H0: lambda(1) / lambda(2) = 1, H1: effect = 0.667, lambda(1) = 0.2, lambda(2) = 0.3, overdispersion = 1.5, fixed exposure time = 12, power 80%.

Stage | 1 | 2 | 3 |
---|---|---|---|

Information rate | 40% | 70% | 100% |

Efficacy boundary (z-value scale) | 3.357 | 2.445 | 2.001 |

Futility boundary (z-value scale) | 0.152 | 1.267 | |

Overall power | 0.0688 | 0.5133 | 0.8000 |

Maximum number of subjects | 394.0 | ||

Information over stages | 21.3 | 37.3 | 53.3 |

Expected information under H0 | 29.8 | ||

Expected information under H0/H1 | 39.4 | ||

Expected information under H1 | 41.3 | ||

Maximum information | 53.3 | ||

Cumulative alpha spent | 0.0004 | 0.0074 | 0.0250 |

Cumulative beta spent | 0.0427 | 0.1256 | 0.2000 |

One-sided local significance level | 0.0004 | 0.0073 | 0.0227 |

Overall exit probability (under H0) | 0.5608 | 0.3491 | |

Overall exit probability (under H1) | 0.1115 | 0.5274 | |

Exit probability for efficacy (under H0) | 0.0004 | 0.0070 | |

Exit probability for efficacy (under H1) | 0.0688 | 0.4446 | |

Exit probability for futility (under H0) | 0.5604 | 0.3421 | |

Exit probability for futility (under H1) | 0.0427 | 0.0829 | |

Similar to survival designs (see, e.g., Planning a Survival Trial with rpact), it is possible with the `getSampleSizeCounts()` function to calculate the calendar times at which the information is estimated to be observed under the given parameters.

For the first case, suppose there is uniform recruitment of subjects over 6 months, and subjects *are followed for a prespecified time period which is identical for all subjects* as above. Specify `accrualTime = 6` as an additional function parameter and obtain the following summary:

```
example7 <- getSampleSizeCounts(
design = designFutility,
lambda1 = 0.2,
lambda2 = 0.3,
overdispersion = 1.5,
fixedExposureTime = 12,
accrualTime = 6
)
example7 |>
summary()
```

*Sample size calculation for a count data endpoint*

Sequential analysis with a maximum of 3 looks (group sequential design), overall significance level 2.5% (one-sided). The results were calculated for a two-sample test for count data, H0: lambda(1) / lambda(2) = 1, H1: effect = 0.667, lambda(1) = 0.2, lambda(2) = 0.3, overdispersion = 1.5, fixed exposure time = 12, accrual time = 6, power 80%.

Stage | 1 | 2 | 3 |
---|---|---|---|

Information rate | 40% | 70% | 100% |

Efficacy boundary (z-value scale) | 3.357 | 2.445 | 2.001 |

Futility boundary (z-value scale) | 0.152 | 1.267 | |

Overall power | 0.0688 | 0.5133 | 0.8000 |

Calendar time | 4.696 | 7.113 | 18.000 |

Expected study duration under H1 | 10.775 | ||

Number of subjects | 308.0 | 394.0 | 394.0 |

Expected number of subjects under H1 | 384.4 | ||

Maximum number of subjects | 394.0 | ||

Information over stages | 21.3 | 37.3 | 53.3 |

Expected information under H0 | 29.8 | ||

Expected information under H0/H1 | 39.4 | ||

Expected information under H1 | 41.3 | ||

Maximum information | 53.3 | ||

Cumulative alpha spent | 0.0004 | 0.0074 | 0.0250 |

Cumulative beta spent | 0.0427 | 0.1256 | 0.2000 |

One-sided local significance level | 0.0004 | 0.0073 | 0.0227 |

Overall exit probability (under H0) | 0.5608 | 0.3491 | |

Overall exit probability (under H1) | 0.1115 | 0.5274 | |

Exit probability for efficacy (under H0) | 0.0004 | 0.0070 | |

Exit probability for efficacy (under H1) | 0.0688 | 0.4446 | |

Exit probability for futility (under H0) | 0.5604 | 0.3421 | |

Exit probability for futility (under H1) | 0.0427 | 0.0829 | |

You might also use the gscounts package in order to obtain very similar results. The relevant functionality for count data, however, is included in `rpact`, and the maintainer of `gscounts` encourages the use of `rpact`.

A different situation is given if subjects *have varying exposure times*. For this setting, assume we again have uniform recruitment of subjects over 6 months, but the study ends 12 months after the last subject entered the study. That is, the study is planned to be conducted over 18 months, with subjects observed (i.e., under exposure) for between 12 and 18 months.

In order to perform the sample size calculation for this case, the parameter `followUpTime` has to be specified instead of `fixedExposureTime`. It is the assumed (additional) follow-up time for the study, so the total study duration is `accrualTime + followUpTime`.

```
example8 <- getSampleSizeCounts(
design = designFutility,
lambda1 = 0.2,
lambda2 = 0.3,
overdispersion = 1.5,
accrualTime = 6,
followUpTime = 12
)
example8 |>
summary()
```

*Sample size calculation for a count data endpoint*

Sequential analysis with a maximum of 3 looks (group sequential design), overall significance level 2.5% (one-sided). The results were calculated for a two-sample test for count data, H0: lambda(1) / lambda(2) = 1, H1: effect = 0.667, lambda(1) = 0.2, lambda(2) = 0.3, overdispersion = 1.5, accrual time = 6, follow-up time = 12, power 80%.

Stage | 1 | 2 | 3 |
---|---|---|---|

Information rate | 40% | 70% | 100% |

Efficacy boundary (z-value scale) | 3.357 | 2.445 | 2.001 |

Futility boundary (z-value scale) | 0.152 | 1.267 | |

Overall power | 0.0688 | 0.5133 | 0.8000 |

Calendar time | 4.810 | 7.419 | 18.000 |

Expected study duration under H1 | 10.949 | ||

Number of subjects | 304.0 | 380.0 | 380.0 |

Expected number of subjects under H1 | 371.5 | ||

Maximum number of subjects | 380.0 | ||

Information over stages | 21.3 | 37.3 | 53.3 |

Expected information under H0 | 29.8 | ||

Expected information under H0/H1 | 39.4 | ||

Expected information under H1 | 41.3 | ||

Maximum information | 53.3 | ||

Cumulative alpha spent | 0.0004 | 0.0074 | 0.0250 |

Cumulative beta spent | 0.0427 | 0.1256 | 0.2000 |

One-sided local significance level | 0.0004 | 0.0073 | 0.0227 |

Overall exit probability (under H0) | 0.5608 | 0.3491 | |

Overall exit probability (under H1) | 0.1115 | 0.5274 | |

Exit probability for efficacy (under H0) | 0.0004 | 0.0070 | |

Exit probability for efficacy (under H1) | 0.0688 | 0.4446 | |

Exit probability for futility (under H0) | 0.5604 | 0.3421 | |

Exit probability for futility (under H1) | 0.0427 | 0.0829 | |

As expected, the maximum number of subjects is a bit lower (380 vs. 394), with correspondingly different calendar time estimates.

In `getSampleSizeCounts()`, you can specify `maxNumberOfSubjects`, or `accrualTime` together with `accrualIntensity`, and find the study time, i.e., the necessary follow-up time in order to achieve the required information levels. For example, one can calculate how long the study duration would be if subject recruitment is performed over 7.5 months instead of 6 months, i.e., if 475 instead of 380 subjects are recruited:

```
example9 <- getSampleSizeCounts(
design = designFutility,
lambda1 = 0.2,
lambda2 = 0.3,
overdispersion = 1.5,
accrualTime = 7.5,
maxNumberOfSubjects = 7.5 / 6 * example8$maxNumberOfSubjects
)
example9$calendarTime
```

```
[,1]
[1,] 4.799704
[2,] 7.031870
[3,] 9.979368
```

You might also specify the parameter `accrualIntensity`, which describes the *number of subjects per time unit*, in order to obtain the same result:

```
example10 <- getSampleSizeCounts(
design = designFutility,
lambda1 = 0.2,
lambda2 = 0.3,
overdispersion = 1.5,
accrualTime = c(0, 7.5),
accrualIntensity = c(1 / 6 * example8$maxNumberOfSubjects)
)
example10$calendarTime
```

```
[,1]
[1,] 4.799704
[2,] 7.031870
[3,] 9.979368
```

Since `accrualTime` and `accrualIntensity` can be defined as vectors, it is also possible to define a non-uniform recruitment scheme and investigate its influence on the estimated parameters.
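
As a sketch (the accrual numbers here are hypothetical), a ramp-up recruitment of 20 subjects per month during the first 3 months and 70 per month thereafter until month 7.5 could be specified as follows; the exact interpretation of piecewise accrual should be checked against the rpact documentation:

```r
# Hypothetical non-uniform (piecewise constant) recruitment scheme:
# accrualTime gives the interval boundaries, accrualIntensity the
# number of subjects per time unit within each interval
example11 <- getSampleSizeCounts(
  design = designFutility,
  lambda1 = 0.2,
  lambda2 = 0.3,
  overdispersion = 1.5,
  accrualTime = c(0, 3, 7.5),
  accrualIntensity = c(20, 70)
)
example11$calendarTime
```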

As an important note, the Fisher information used for the calendar time calculation is bounded for φ > 0 and varying time point. Therefore, it might happen that the numerical search algorithm fails, there is no derivable observation time, and an error message is displayed. For φ = 0, this problem does not occur.

Friede, T., Schmidli, H. (2010a). Blinded sample size reestimation with count data: methods and applications in multiple sclerosis. *Statistics in Medicine*, 29, 1145-1156. https://doi.org/10.1002/sim.3861

Friede, T., Schmidli, H. (2010b). Blinded sample size reestimation with negative binomial counts in superiority and non-inferiority trials. *Methods of Information in Medicine*, 49, 618-624. https://doi.org/10.3414/ME09-02-0060

Mütze, T., Glimm, E., Schmidli, H., Friede, T. (2019). Group sequential designs for negative binomial outcomes. *Statistical Methods in Medical Research*, 28(8), 2326-2347. https://doi.org/10.1177/0962280218773115

Schneider, S., Schmidli, H., Friede, T. (2013). Robustness of methods for blinded sample size re‐estimation with overdispersed count data. *Statistics in Medicine*, 32(21), 3623-3653. https://doi.org/10.1002/sim.5800

Wassmer, G., Brannath, W. (2016). *Group Sequential and Confirmatory Adaptive Designs in Clinical Trials*. Springer. ISBN 978-3319325606, https://doi.org/10.1007/978-3-319-32562-0

Zhu, H., Lakkis, H. (2014). Sample size calculation for comparing two negative binomial rates. *Statistics in Medicine*, 33, 376-387. https://doi.org/10.1002/sim.5947

*System* rpact 3.5.1, R version 4.3.2 (2023-10-31 ucrt), *platform* x86_64-w64-mingw32

To cite R in publications use:

*R: A Language and Environment for Statistical Computing*. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/. To cite package ‘rpact’ in publications use:

*rpact: Confirmatory Adaptive Clinical Trial Design and Analysis*. R package version 3.5.1, https://www.rpact.com, https://github.com/rpact-com/rpact, https://rpact-com.github.io/rpact/, https://www.rpact.org.

In rpact version 3.3, the group sequential methodology from Hampson and Jennison (2013) is implemented. As traditional group sequential designs are characterized specifically by the underlying boundary sets, one main task was to write a function returning the decision critical values according to the calculation rules in Hampson and Jennison (2013). The function returning the respective critical values has been validated, particularly via simulation of Type I error rate control in various settings (for an example, see below). Subsequently, functions characterizing a delayed response group sequential test in terms of power, maximum sample size, and expected sample size were written. These functions are integrated in the rpact functions `getDesignGroupSequential()`, `getDesignCharacteristics()`, and the corresponding `getSampleSize...()` and `getPower...()` functions.

The classical group sequential methodology works on the assumption of having no treatment response delay, i.e., it is assumed that enrolled subjects are observed upon recruitment or at least shortly after. In many practical situations, this assumption does not hold. Instead, there might be a latency between the time of recruitment and the actual measurement of the primary endpoint. That is, at an interim analysis, there is some information in the pipeline.

One method to handle this pipeline information was proposed by Hampson & Jennison (2013) and is called the *delayed response group sequential design*. Assume that, in a $K$-stage trial, given we proceed to the trial end, we will observe an information sequence $I_1 < \dots < I_K$ and the corresponding $z$-statistics $Z_1, \dots, Z_K$. As we now have information in the pipeline, define $\tilde{I}_k$ as the information available after awaiting the delay after having observed $I_k$. Let $\tilde{Z}_1, \dots, \tilde{Z}_{K-1}$ be the vector of $z$-statistics calculated based upon the information levels $\tilde{I}_1, \dots, \tilde{I}_{K-1}$. Given boundary sets $(l_k)$, $(u_k)$, and decision critical values $(c_k)$, a $K$-stage delayed response group sequential design has the following structure: at interim $k < K$, recruitment is stopped if $Z_k \geq u_k$ or $Z_k \leq l_k$; after awaiting the pipeline data, $H_0$ is rejected if $\tilde{Z}_k \geq c_k$ and accepted otherwise. If $l_k < Z_k < u_k$, the trial continues to stage $k + 1$. At the final stage $K$, $H_0$ is rejected if $Z_K \geq c_K$.

That is, at each of the interim analyses, there is information outstanding. If $Z_k$ does not fall within $(l_k, u_k)$, the conclusion is not to stop the trial for efficacy or futility (as it would be in a traditional group sequential design), but to *irreversibly* stop the recruitment. Afterwards, the outstanding data is awaited such that, after the delay, the information $\tilde{I}_k$ is available and a new $z$-statistic $\tilde{Z}_k$ can be calculated. This statistic is then used to test the actual hypothesis of interest using a *decision critical value* $c_k$. The heuristic idea is that recruitment is stopped only if one is “confident about the subsequent decision”. This means that if the recruitment has been stopped for $Z_k \geq u_k$, it should be likely to obtain a subsequent rejection. In contrast, if the recruitment is stopped for $Z_k \leq l_k$, obtaining a subsequent rejection should be rather unlikely (though still possible).

The main difference to a group sequential design is that, due to the delayed information, each interim potentially consists of two analyses: a recruitment stop analysis and, if indicated, a subsequent decision analysis. Hampson & Jennison (2013) propose to define the boundary sets $(l_k)$ and $(u_k)$ as error-spending boundaries determined using $\alpha$-spending and $\beta$-spending functions in the one-sided testing setting with *binding* futility boundaries. In rpact, this methodology is extended to (one-sided) testing situations where binding or non-binding futility boundaries are available. According to Hampson & Jennison (2013), the boundaries $c_k$ with $k < K$ are chosen such that “reversal probabilities” are balanced. More precisely, $c_1$ is chosen as the (unique) solution of

$$P_{H_0}\big(Z_1 \geq u_1,\ \tilde{Z}_1 < c_1\big) = P_{H_0}\big(Z_1 \leq l_1,\ \tilde{Z}_1 \geq c_1\big),$$

and for $k = 2, \dots, K-1$, $c_k$ is the (unique) solution of

$$P_{H_0}\big(l_1 < Z_1 < u_1, \dots, l_{k-1} < Z_{k-1} < u_{k-1},\ Z_k \geq u_k,\ \tilde{Z}_k < c_k\big) = P_{H_0}\big(l_1 < Z_1 < u_1, \dots, l_{k-1} < Z_{k-1} < u_{k-1},\ Z_k \leq l_k,\ \tilde{Z}_k \geq c_k\big).$$

It can easily be shown that this constraint yields critical values that ensure Type I error rate control. We call this approach the *reversal probabilities approach*.

The values $c_k$, $k = 1, \dots, K-1$, are determined via a root search. Having determined all applicable boundaries, the rejection probability of the procedure given a treatment effect $\vartheta$ is

$$\sum_{k=1}^{K-1} P_\vartheta\big(l_1 < Z_1 < u_1, \dots, Z_k \notin (l_k, u_k),\ \tilde{Z}_k \geq c_k\big) + P_\vartheta\big(l_1 < Z_1 < u_1, \dots, l_{K-1} < Z_{K-1} < u_{K-1},\ Z_K \geq c_K\big),$$

i.e., the probability to first stop the recruitment, followed by a rejection, at any of the stages. Setting $\vartheta = 0$, this expression gives the Type I error rate. These values are also calculated under the alternative for a specified maximum sample size (in the prototype case of testing $H_0: \vartheta = 0$ against $H_1: \vartheta = \vartheta_1 > 0$ with $\sigma = 1$).

As for group sequential designs, the *inflation factor* of a delayed response design is the maximum sample size, $n_{\max}$, required to achieve power $1 - \beta$ for testing $H_0$ against $H_1$ (in the prototype case) relative to the fixed sample size, $n_{\text{fixed}}$:

$$IF = \frac{n_{\max}}{n_{\text{fixed}}}.$$

Let $n_k$ denote the number of subjects observed at the $k$-th recruitment stop analysis and $\tilde{n}_k$ the number of subjects available at the subsequent $k$-th decision analysis. Given the information rates $t_k = I_k / I_K$ and the delayed information fractions $d_k$, it holds that $n_k = t_k \, n_{\max}$ and $\tilde{n}_k = (t_k + d_k) \, n_{\max}$.

The expected sample size, $ASN$, of a delayed response design is

$$ASN = \sum_{k=1}^{K-1} \tilde{n}_k \, P_\vartheta(\text{recruitment stop at stage } k) + n_K \, P_\vartheta(\text{no recruitment stop before stage } K),$$

with $n_K = n_{\max}$. As for the maximum sample size, this is provided relative to the sample size in a fixed sample design, i.e., as the expected reduction in sample size.

We illustrate the calculation of decision critical values and design characteristics using the described approach for a three-stage group sequential design with Kim & DeMets $\alpha$- and $\beta$-spending functions with $\gamma = 2$.

**First, load the rpact package**

```
library(rpact)
packageVersion("rpact") # version should be version 3.3 or later
```

`[1] '3.5.1'`

The delayed response utility simply adds the parameter `delayedInformation` to the `getDesignGroupSequential()` (or `getDesignInverseNormal()`) function. `delayedInformation` is either a positive constant or a vector of length `kMax - 1` with positive elements describing the amount of pipeline information at each interim:

```
gsdWithDelay <- getDesignGroupSequential(
kMax = 3,
alpha = 0.025,
beta = 0.2,
typeOfDesign = "asKD",
gammaA = 2,
gammaB = 2,
typeBetaSpending = "bsKD",
informationRates = c(0.3, 0.7, 1),
delayedInformation = c(0.16, 0.2),
bindingFutility = TRUE
)
```

```
Warning: The delayed information design feature is experimental and hence not
fully validated (see www.rpact.com/experimental)
```

The output contains the continuation region for each interim analysis, defined through the upper and lower boundaries of the continuation region. Additionally, the interim analyses are characterized through the decision critical values (1.387, 1.820, 2.030):

`kable(gsdWithDelay)`

**Design parameters and output of group sequential design**

**User defined parameters**

*Type of design*: Kim & DeMets alpha spending
*Information rates*: 0.300, 0.700, 1.000
*Binding futility*: TRUE
*Parameter for alpha spending function*: 2
*Parameter for beta spending function*: 2
*Delayed information*: 0.160, 0.200
*Type of beta spending*: Kim & DeMets beta spending

**Derived from user defined parameters**

*Maximum number of stages*: 3

**Default parameters**

*Stages*: 1, 2, 3
*Significance level*: 0.0250
*Type II error rate*: 0.2000
*Two-sided power*: FALSE
*Test*: one-sided
*Tolerance*: 0.00000001

**Output**

*Power*: 0.1053, 0.5579, 0.8000
*futilityBoundsDelayedInformation*: -0.508, 1.096
*Cumulative alpha spending*: 0.00225, 0.01225, 0.02500
*Cumulative beta spending*: 0.0180, 0.0980, 0.2000
*criticalValuesDelayedInformation*: 2.841, 2.295, 2.030
*Stage levels (one-sided)*: 0.00225, 0.01087, 0.02116
*Decision critical values*: 1.387, 1.820, 2.030
*Reversal probabilities*: 0.00007335, 0.00179791

Note that the last decision critical value (2.030) is equal to the last critical value of the corresponding group sequential design without delayed response. To obtain the design characteristics, the function `getDesignCharacteristics()` calculates the maximum sample size for the design (`shift`), the inflation factor, and the average sample sizes under the null hypothesis, under the alternative hypothesis, and under a value in between $H_0$ and $H_1$:

`kable(getDesignCharacteristics(gsdWithDelay))`

**Delayed response group sequential design characteristics**

*Number of subjects fixed*: 7.8489
*Shift*: 8.2521
*Inflation factor*: 1.0514
*Informations*: 2.476, 5.777, 8.252
*Power*: 0.1026, 0.5563, 0.8000
*Rejection probabilities under H1*: 0.1026, 0.4537, 0.2437
*Futility probabilities under H1*: 0.01869, 0.08335
*Ratio expected vs fixed sample size under H1*: 0.9269
*Ratio expected vs fixed sample size under a value between H0 and H1*: 0.9329
*Ratio expected vs fixed sample size under H0*: 0.8165
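These numbers are internally consistent: in the prototype case, the fixed-sample information is $(z_{1-\alpha} + z_{1-\beta})^2$, and the inflation factor is the ratio of the shift to this value. A quick check:

```r
# Fixed-sample information for one-sided alpha = 0.025 and power 80%,
# and the resulting inflation factor for the reported shift 8.2521.
alpha <- 0.025; beta <- 0.2
nFixed <- (qnorm(1 - alpha) + qnorm(1 - beta))^2
round(nFixed, 4)           # 7.8489
round(8.2521 / nFixed, 4)  # 1.0514
```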

Using the `summary()` function, these numbers can be displayed directly, without calling `getDesignCharacteristics()`:

`kable(summary(gsdWithDelay))`

**Sequential analysis with a maximum of 3 looks (delayed response group sequential design)**

Kim & DeMets alpha spending design with delayed response (gammaA = 2) and Kim & DeMets beta spending (gammaB = 2), one-sided overall significance level 2.5%, power 80%, undefined endpoint, inflation factor 1.0514, ASN H1 0.9269, ASN H01 0.9329, ASN H0 0.8165.

Stage | 1 | 2 | 3 |
---|---|---|---|
Information rate | 30% | 70% | 100% |
Delayed information | 16% | 20% | |
Upper bounds of continuation | 2.841 | 2.295 | 2.030 |
Stage levels (one-sided) | 0.0022 | 0.0109 | 0.0212 |
Lower bounds of continuation (binding) | -0.508 | 1.096 | |
Cumulative alpha spent | 0.0022 | 0.0122 | 0.0250 |
Cumulative beta spent | 0.0180 | 0.0980 | 0.2000 |
Overall power | 0.1026 | 0.5563 | 0.8000 |
Futility probabilities under H1 | 0.019 | 0.083 | |
Decision critical values | 1.387 | 1.820 | 2.030 |
Reversal probabilities | <0.0001 | 0.0018 | |


It might be of interest to check whether this in fact yields Type I error rate control. This can be done with the internal function `getSimulatedRejectionsDelayedResponse()` as follows:

`rpact:::getSimulatedRejectionsDelayedResponse(gsdWithDelay, iterations = 10^6)`

$simulatedAlpha [1] 0.024989

$delta [1] 0

$iterations [1] 1000000

$seed [1] 435360189

$confidenceIntervall [1] 0.02468306 0.02529494

$alphaWithin95ConfidenceIntervall [1] TRUE

$time Time difference of 6.691327 secs

It also checks whether the simulated Type I error rate is within the 95% confidence interval.
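The logic of such a simulation can be sketched in base R without rpact. The boundaries below are hypothetical placeholders (not the design's actual values); the sketch only illustrates how correlated $z$-statistics are generated via independent increments and how the stopping and decision rules are applied:

```r
# Monte Carlo sketch of a two-stage delayed response design under H0.
set.seed(123)
n <- 1e5
info <- c(0.5, 0.7, 1.0)  # I1, I1 + delay, I2 (hypothetical information levels)
u1 <- 2.5; l1 <- 0        # hypothetical continuation bounds at the interim
c1 <- 2.1; c2 <- 2.0      # hypothetical decision critical values
# Score statistics under H0 have independent increments with variance = information
s1  <- rnorm(n, 0, sqrt(info[1]))
s1d <- s1 + rnorm(n, 0, sqrt(info[2] - info[1]))
s2  <- s1d + rnorm(n, 0, sqrt(info[3] - info[2]))
z1  <- s1 / sqrt(info[1]); z1d <- s1d / sqrt(info[2]); z2 <- s2 / sqrt(info[3])
stopRecruitment <- (z1 >= u1) | (z1 <= l1)  # recruitment stop at the interim
reject <- ifelse(stopRecruitment, z1d >= c1, z2 >= c2)
mean(reject)  # simulated Type I error rate
```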

Compared to the design with no delayed information, the inflation factor turns out to be about the same, though the average sample sizes differ. This is because, compared to the design with no delayed response, the actual number of patients used for the analysis is larger:

```
gsdWithoutDelay <- getDesignGroupSequential(
kMax = 3,
alpha = 0.025,
beta = 0.2,
typeOfDesign = "asKD",
gammaA = 2,
gammaB = 2,
typeBetaSpending = "bsKD",
informationRates = c(0.3, 0.7, 1),
bindingFutility = TRUE
)
kable(summary(gsdWithoutDelay))
```

**Sequential analysis with a maximum of 3 looks (group sequential design)**

Kim & DeMets alpha spending design (gammaA = 2) and Kim & DeMets beta spending (gammaB = 2), binding futility, one-sided overall significance level 2.5%, power 80%, undefined endpoint, inflation factor 1.072, ASN H1 0.8082, ASN H01 0.8268, ASN H0 0.6573.

Stage | 1 | 2 | 3 |
---|---|---|---|
Information rate | 30% | 70% | 100% |
Efficacy boundary (z-value scale) | 2.841 | 2.295 | 2.030 |
Stage levels (one-sided) | 0.0022 | 0.0109 | 0.0212 |
Futility boundary (z-value scale) | -0.508 | 1.096 | |
Cumulative alpha spent | 0.0022 | 0.0122 | 0.0250 |
Cumulative beta spent | 0.0180 | 0.0980 | 0.2000 |
Overall power | 0.1053 | 0.5579 | 0.8000 |
Futility probabilities under H1 | 0.018 | 0.080 | |

It might also be of interest to evaluate the expected sample size under a range of parameter values, e.g., to obtain an optimum design under some criterion based on all parameter values within a specified range. Keeping in mind that the *prototype case* is for testing $H_0: \mu = 0$ against $H_1: \mu = \mu_1 > 0$ with (known) $\sigma = 1$, this is obtained with the following commands:

```
nMax <- getDesignCharacteristics(gsdWithDelay)$shift # use calculated sample size for the prototype case
deltaRange <- seq(-0.2, 1.5, 0.05)
ASN <- getPowerMeans(gsdWithDelay,
groups = 1, normalApproximation = TRUE, alternative = deltaRange,
maxNumberOfSubjects = nMax
)$expectedNumberOfSubjects
dat <- data.frame(delta = deltaRange, ASN = ASN, delay = "delay")
ASN <- getPowerMeans(gsdWithoutDelay,
groups = 1, normalApproximation = TRUE, alternative = deltaRange,
maxNumberOfSubjects = nMax
)$expectedNumberOfSubjects
dat <- rbind(dat, data.frame(delta = deltaRange, ASN = ASN, delay = "no delay"))
library(ggplot2)
myTheme <- theme(
axis.title.x = element_text(size = 14),
axis.text.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.y = element_text(size = 14)
)
ggplot(data = dat, aes(x = delta, y = ASN, group = delay, linetype = factor(delay))) +
geom_line(size = 0.8) +
ylim(0, ceiling(nMax)) +
myTheme +
theme_classic() +
xlab("alternative") +
ylab("Expected number of subjects") +
geom_hline(size = 1, yintercept = nMax, linetype = "dotted") +
geom_vline(size = 0.6, xintercept = 0, linetype = "dotted") +
geom_vline(size = 0.6, xintercept = 0.5, linetype = "dotted") +
geom_vline(size = 0.6, xintercept = 1, linetype = "dotted") +
labs(linetype = "") +
annotate(geom = "text", x = 0, y = nMax - 0.3, label = "fixed sample size", size = 4)
```

```
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
```

Note that, in contrast, the rejection probabilities are nearly the same for the two designs:

```
reject <- c(
getPowerMeans(gsdWithDelay,
groups = 1, normalApproximation = TRUE, alternative = deltaRange,
maxNumberOfSubjects = nMax
)$overallReject,
getPowerMeans(gsdWithoutDelay,
groups = 1, normalApproximation = TRUE, alternative = deltaRange,
maxNumberOfSubjects = nMax
)$overallReject
)
dat$reject <- reject
ggplot(data = dat, aes(x = delta, y = reject, group = delay, linetype = factor(delay))) +
geom_line(size = 0.8) +
ylim(0, 1) +
myTheme +
theme_classic() +
xlab("alternative") +
ylab("rejection probability") +
geom_hline(size = 1, yintercept = 1 - gsdWithDelay$beta, linetype = "dotted") +
geom_vline(size = 0.6, xintercept = 0, linetype = "dotted") +
geom_vline(size = 0.6, xintercept = 0.5, linetype = "dotted") +
geom_vline(size = 0.6, xintercept = 1, linetype = "dotted") +
labs(linetype = "")
```

Since we used the `nMax` from the design with delayed responses, the power is 80% for this design (for the design without delayed response, it is slightly below 80%).

We illustrate the calculation of power and average sample size with an example provided by Schüürhuis (2022), p. 68: Suppose it is planned to conduct a parallel group trial with 175 subjects per arm to be linearly recruited within 24 months in the presence of a delay of 5 months. The significance level is $\alpha = 0.025$, and the nominal Type II error is $\beta = 0.2$ at a treatment effect of $\delta = 0.3$. The boundaries are calculated using O'Brien-Fleming-like $\alpha$- and $\beta$-spending functions, and the interim is planned after 30% of the information has been collected, i.e., $0.3 \cdot 24 = 7.2$ months into the trial. Note, however, that $7.2 + 5 = 12.2$ months is the time point of analysis for the first interim, since only at this time point is the full information of 30% of the subjects available.

The numbers provided in Table 5.2 of Schüürhuis (2022) for the *Hampson and Jennison* approach can be obtained with the following commands:

```
gsdTwoStagesWithDelay <- getDesignGroupSequential(
kMax = 2,
alpha = 0.025,
beta = 0.2,
typeOfDesign = "asOF",
typeBetaSpending = "bsOF",
informationRates = c(0.3, 1),
delayedInformation = 5 / 24,
bindingFutility = TRUE
)
```

```
Warning: The delayed information design feature is experimental and hence not
fully validated (see www.rpact.com/experimental)
```

```
results <- getPowerMeans(
design = gsdTwoStagesWithDelay,
groups = 2,
normalApproximation = TRUE,
alternative = 0.3,
stDev = 1,
maxNumberOfSubjects = 350
)
# expected number of subjects table 5.2
round(results$expectedNumberOfSubjects / 2, 3)
```

`[1] 172.6`

```
# expected trial duration table 5.2
round(results$earlyStop * 17.2 + (1 - results$earlyStop) * 29, 3)
```

`[1] 28.671`

```
# power table 5.2
round(results$overallReject, 3)
```

`[1] 0.798`

Note that the time points of analysis are derived under the assumption of *linear recruitment* of patients; for a non-linear recruitment, these calculations need to be adapted accordingly (which is easy to do).
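Under this linear recruitment assumption, the calendar times 17.2 and 29 used in the expected trial duration calculation above can be reproduced as follows:

```r
# Calendar-time arithmetic for the Hampson & Jennison design:
# 350 subjects recruited linearly over 24 months, 5 months response delay.
recruitmentMonths <- 24; delayMonths <- 5
tRecruited30 <- 0.3 * recruitmentMonths       # 7.2: 30% of subjects recruited
tInterim  <- tRecruited30 + delayMonths       # 12.2: first interim analysis
tDecision <- tInterim + delayMonths           # 17.2: decision analysis after awaiting the pipeline
tFinal    <- recruitmentMonths + delayMonths  # 29: final analysis
c(tInterim, tDecision, tFinal)
```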

The *recruitment pause* approach can be obtained with the following commands (without using the `delayedInformation` parameter). As above, the interim analysis information is fully observed after 17.2 months, whereas the final information is here available only after 34 months:

```
gsdTwoStagesWithoutDelay <- getDesignGroupSequential(
kMax = 2,
alpha = 0.025,
beta = 0.2,
typeOfDesign = "asOF",
typeBetaSpending = "bsOF",
informationRates = c(0.3 + 5 / 24, 1),
bindingFutility = FALSE
)
results <- getPowerMeans(
design = gsdTwoStagesWithoutDelay,
groups = 2,
normalApproximation = TRUE,
alternative = 0.3,
stDev = 1,
maxNumberOfSubjects = 350
)
# expected number of subjects table 5.2
round(results$expectedNumberOfSubjects / 2, 3)
```

`[1] 153.053`

```
# expected trial duration table 5.2
round(results$earlyStop * 17.2 + (1 - results$earlyStop) * 34, 3)
```

`[1] 29.715`

```
# power table 5.2
round(results$overallReject, 3)
```

`[1] 0.779`
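Analogously, the interim information rate 0.3 + 5/24 and the calendar times 17.2 and 34 used above follow from the recruitment pause logic: recruitment is paused at the interim and resumes only after the pipeline data are in, shifting the end of the trial by the length of the pause.

```r
# Calendar-time arithmetic for the recruitment pause approach
# (same trial: 350 subjects, 24 months linear recruitment, 5 months delay).
recruitmentMonths <- 24; delayMonths <- 5
infoInterim <- 0.3 + delayMonths / recruitmentMonths  # 0.5083: interim information rate
tPause   <- infoInterim * recruitmentMonths           # 12.2: recruitment is paused here
tInterim <- tPause + delayMonths                      # 17.2: interim (all recruited subjects observed)
tFinal   <- recruitmentMonths + 2 * delayMonths       # 34: 24 months recruiting + 5 pause + 5 delay
round(c(infoInterim, tInterim, tFinal), 4)
```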

The decision boundaries of the delayed response group sequential design can be illustrated with a `type = 1` plot for the design. This adds the decision boundaries as crosses together with the continuation region. Note that the other plot types directly account for the delayed response situation, as the required numbers are calculated for this case.

`plot(gsdWithDelay)`

The approach described so far uses the identity of reversal probabilities to derive the decision boundaries. An alternative approach is to demand *two* conditions for rejecting the null hypothesis at a given stage of the trial and to spend a specified amount of significance at this stage. This definition is independent of the specification of futility boundaries but can be used together with them as well. We show how the (new) rpact function `getGroupSequentialProbabilities()` can handle this situation.

At stage $k$, in order to reject $H_0$, the test statistic $Z_k$ needs to exceed the upper continuation bound $u_k$ *and* $\tilde{Z}_k$ needs to exceed the critical value $c_k$. Hence, the set of upper continuation boundaries and critical values is defined through the conditions

$$P_{H_0}\big(Z_1 < u_1, \dots, Z_{k-1} < u_{k-1},\ Z_k \geq u_k,\ \tilde{Z}_k \geq c_k\big) = \alpha(t_k) - \alpha(t_{k-1}), \quad k = 1, \dots, K-1,$$

with $\alpha(t_0) = 0$, together with the analogous condition without the delayed statistic at the final stage. Since this cannot be solved without additional constraints, the critical values are fixed as $c_k = \Phi^{-1}(1 - \alpha)$. This makes sense since it often turns out that the optimum boundaries using the *reversal probabilities approach* are smaller than $\Phi^{-1}(1 - \alpha)$, and the unadjusted boundary is a reasonable choice as a minimum requirement for rejecting $H_0$.

Starting with $u_1$, the values $u_k$, with fixed $c_k$, are successively determined via a root search.

The conditions for the Type II error rate are defined analogously through the $\beta$-spending function.

The algorithm to derive the acceptance boundaries can be briefly described as follows: given the rejection boundaries (calculated as above), the algorithm successively calculates the acceptance boundaries using the specified $\beta$-spending function and an arbitrary sample size or “shift” value. If the last-stage acceptance critical value is smaller than the last-stage critical value, the shift value is increased, otherwise it is decreased. This is repeated until the last-stage critical values coincide. The resulting shift value can be interpreted as the maximum sample size necessary to achieve power $1 - \beta$. For the algorithm, we additionally have to specify upper and lower boundaries for the `shift` (which can be interpreted as the maximum sample size for the group sequential design in the prototype case, cf. Wassmer & Brannath, 2016). This is set to be within 0 and 100, which covers practically relevant situations.

We use the function `getGroupSequentialProbabilities()` to calculate the critical values by a numerical root search. For this, we define an $\alpha$-spending function `spend()` which can be arbitrarily chosen. Here, we define a function according to the power family of Kim & DeMets with `gammaA = 1.345`. This value was shown by Hampson & Jennison (2013) to be optimal in a specific context, but this does not matter here. The (upper) continuation boundaries together with the decision boundaries and the last-stage critical boundary are computed using the `uniroot()` function as follows.

For $k = 1$, `decisionMatrix` consists of the candidate bound $x$ and the fixed decision value $c_1$; for $1 < k < k_{\max}$, the previously determined bounds $u_1, \dots, u_{k-1}$ additionally enter as upper continuation bounds; and for $k = k_{\max}$, only the final critical value $x$ is searched. In each case, the stagewise rejection probabilities that yield Type I error rate control are calculated with appropriately defined information rates.

```
### Derive decision boundaries for delayed response alpha spending approach
alpha <- 0.025
gammaA <- 1.345
tolerance <- 1E-6
# Specify use function
spend <- function(x, size, gamma) {
return(size * x^gamma)
}
infRates <- c(28, 54, 96) / 96
kMax <- length(infRates)
delay <- rep(16, kMax - 1) / 96
u <- rep(NA, kMax)
c <- rep(qnorm(1 - alpha), kMax - 1)
for (k in (1:kMax)) {
if (k < kMax) {
infRatesPlusDelay <- c(infRates[1:k], infRates[k] + delay[k])
} else {
infRatesPlusDelay <- infRates
}
u[k] <- uniroot(
function(x) {
if (k == 1) {
d <- matrix(c(
x, c[k],
Inf, Inf
), nrow = 2, byrow = TRUE)
} else if (k < kMax) {
d <- matrix(c(
rep(-Inf, k - 1), x, c[k],
u[1:(k - 1)], Inf, Inf
), nrow = 2, byrow = TRUE)
} else {
d <- matrix(c(
rep(-Inf, k - 1), x,
u[1:(k - 1)], Inf
), nrow = 2, byrow = TRUE)
}
probs <- getGroupSequentialProbabilities(d, infRatesPlusDelay)
if (k == 1) {
probs[2, k + 1] - probs[1, k + 1] - spend(infRates[k], alpha, gammaA)
} else if (k < kMax) {
probs[2, k + 1] - probs[1, k + 1] - (spend(infRates[k], alpha, gammaA) -
spend(infRates[k - 1], alpha, gammaA))
} else {
probs[2, k] - probs[1, k] - (spend(infRates[k], alpha, gammaA) -
spend(infRates[k - 1], alpha, gammaA))
}
},
lower = -8, upper = 8
)$root
}
round(u, 5)
```

`[1] 2.43743 2.24413 2.06854`

We note that any other spending function can be used to define the design. That is, you can also use the spending probabilities of, say, an O'Brien & Fleming design approach that is defined through the shape of the boundaries. Furthermore, it is also possible to use the boundaries together with the unadjusted critical values in an inverse normal $p$-value combination test where the weights are fixed through the planned information rates and the delay.

The calculation of the test characteristics is straightforward and can be derived for designs with or without futility boundaries. In the following example, we show how to derive lower continuation (or futility) boundaries that are based on a $\beta$-spending function approach. As above, we use the same Kim & DeMets spending function with `gammaB = 1.345`. For numerical reasons, we do not use the `uniroot()` function here but a bisection method to numerically search for the boundaries.

```
beta <- 0.1
gammaB <- 1.345
u0 <- rep(NA, kMax)
cLower1 <- 0
cUpper1 <- 100
prec1 <- 1
iteration <- 1E5
while (prec1 > tolerance) {
shift <- (cLower1 + cUpper1) / 2
for (k in (1:kMax)) {
if (k < kMax) {
infRatesPlusDelay <- c(infRates[1:k], infRates[k] + delay[k])
} else {
infRatesPlusDelay <- infRates
}
nz <- matrix(rep(sqrt(infRatesPlusDelay), 2), nrow = 2, byrow = TRUE) * sqrt(shift)
prec2 <- 1
cLower2 <- -8
cUpper2 <- 8
while (prec2 > tolerance) {
x <- (cLower2 + cUpper2) / 2
if (k == 1) {
d2 <- matrix(c(
u[k], c[k],
Inf, Inf
), nrow = 2, byrow = TRUE) - nz
probs <- getGroupSequentialProbabilities(d2, infRatesPlusDelay)
ifelse(pnorm(x - nz[1]) + probs[1, k + 1] < spend(infRates[k], beta, gammaB),
cLower2 <- x, cUpper2 <- x
)
} else if (k < kMax) {
d1 <- matrix(c(
pmin(u0[1:(k - 1)], u[1:(k - 1)]), x,
u[1:(k - 1)], Inf
), nrow = 2, byrow = TRUE) - nz[, 1:k]
probs1 <- getGroupSequentialProbabilities(d1, infRatesPlusDelay[1:k])
d2 <- matrix(c(
pmin(u0[1:(k - 1)], u[1:(k - 1)]), u[k], c[k],
u[1:(k - 1)], Inf, Inf
), nrow = 2, byrow = TRUE) - nz
probs2 <- getGroupSequentialProbabilities(d2, infRatesPlusDelay)
ifelse(probs1[1, k] + probs2[1, k + 1] < spend(infRates[k], beta, gammaB) -
spend(
infRates[k - 1],
beta, gammaB
),
cLower2 <- x, cUpper2 <- x
)
} else {
d1 <- matrix(c(
pmin(u0[1:(k - 1)], u[1:(k - 1)]), x,
u[1:(k - 1)], Inf
), nrow = 2, byrow = TRUE) - nz
probs <- getGroupSequentialProbabilities(d1, infRates)
ifelse(probs[1, k] < spend(infRates[k], beta, gammaB) -
spend(infRates[k - 1], beta, gammaB),
cLower2 <- x, cUpper2 <- x
)
}
iteration <- iteration - 1
ifelse(iteration > 0, prec2 <- cUpper2 - cLower2, prec2 <- 0)
}
u0[k] <- x
}
ifelse(u0[kMax] < u[kMax], cLower1 <- shift, cUpper1 <- shift)
ifelse(iteration > 0, prec1 <- cUpper1 - cLower1, prec1 <- 0)
}
round(u0, 5)
```

`[1] -0.40891 0.66367 2.06854`

`round(shift, 2)`

`[1] 12`

`round(shift / (qnorm(1 - alpha) + qnorm(1 - beta))^2, 3)`

`[1] 1.142`

We can compare these values with the “original” $\alpha$- and $\beta$-spending approach with non-binding futility boundaries using the function `getDesignGroupSequential()`:

```
x <- getDesignGroupSequential(
informationRates = infRates,
typeOfDesign = "asKD", typeBetaSpending = "bsKD",
gammaA = gammaA, gammaB = gammaB,
alpha = alpha, beta = beta, bindingFutility = FALSE
)
round(x$futilityBounds, 5)
```

`[1] -0.19958 0.80463`

`round(x$criticalValues, 5)`

`[1] 2.59231 2.39219 2.10214`

`round(getDesignCharacteristics(x)$inflationFactor, 3)`

`[1] 1.146`

We have shown how to handle a group sequential design with delayed responses in two different ways. First, we have implemented the approach proposed by Hampson & Jennison (2013) that is based on reversal probabilities. The direct usage of the delayed information within the design definition makes it easy for the user to apply these designs to commonly used trials with continuous, binary, and time-to-event endpoints. We have also shown how to use the `getGroupSequentialProbabilities()` function to derive the critical values and the test characteristics for the alternative approach that determines the critical values more directly through a spending function approach.


This document provides an exemplary implementation of a multi-arm multi-stage (MAMS) design with a binary endpoint using rpact. Concretely, the vignette covers the design implementation with respect to futility bounds on the treatment effect scale, sample size calculations, and simulations given different treatment selection approaches. Further, analyses using closed testing will be performed on generic binary and survival data.

Note that rpact itself does not support landmark analysis, i.e., the comparison of survival probabilities at a fixed point in time using Greenwood's standard error (SE) formula. Thus, the first two analyses are based on the empirical event rates only. The R packages gestate and survival are then utilized to briefly show how one could combine the packages to perform the actually intended analysis, using boundaries obtained by rpact and test statistics based on survival probabilities and standard errors estimated using gestate and survival.

For methodological and theoretical background, refer to “Group Sequential and Confirmatory Adaptive Designs in Clinical Trials” by Gernot Wassmer and Werner Brannath.

Before starting, load the rpact package and make sure the version of the package is at least 3.1.0:

```
library(rpact)
packageVersion("rpact")
```

`[1] '3.5.1'`

Consider we are interested in implementing a group sequential design with three stages, three treatment arms (two active, one control), and two treatment arm comparisons (each active arm vs. the placebo arm), with a binary endpoint, a global one-sided $\alpha = 0.025$, and the power for each treatment arm comparison set to 80%, hence $\beta = 0.2$. Additionally, let the critical boundaries be calculated using the alpha-spending O'Brien & Fleming approach and, for planning purposes, suppose equally distributed information rates. Further, non-binding futility bounds are to be set in the following way:

Considering the active treatment arms, the goal of both arms is to confirm a significant reduction in event (e.g., disease, death) rates as compared to a control arm. Thus, futility should be declared exactly when the data show only a low reduction, no reduction, or even an increase. Assume one designs a study where a treatment arm is futile in the first stage when there has only been a relative reduction of 5% (i.e., a rate ratio of 0.95) and in the second stage when the data indicate a relative reduction of only 10% (i.e., a rate ratio of 0.90). For simplification, futility bounds are assumed to be independent. As rpact primarily uses futility bounds on the $z$-scale, one needs to determine the $z$-scale values that correspond to the values listed above. Conveniently, the futility bounds on the treatment effect scale are part of the sample size calculation output. Therefore, rpact allows one to determine the right $z$-values by examining various $z$-scale futility bounds as input and deciding which values result in the desired treatment effect futility bounds. This can be done by trying different $z$-scale futility bounds as input until the sample size calculation output (given certain treatment effect assumptions) indicates that the input corresponds to the desired treatment effect futility bound.

As an example on how to get the corresponding futility bound values on the different scales, see the example below:

First, one needs to initialize the design and basically take arbitrary futility bounds on -scale as input:

```
# first and second stage futility on z-scale
fut1 <- 0.16
fut2 <- 0.39
d_fut <- getDesignGroupSequential(
kMax = 3,
alpha = 0.025,
beta = 0.2,
sided = 1,
typeOfDesign = "asOF",
informationRates = c(1 / 3, 2 / 3, 1),
futilityBounds = c(fut1, fut2),
bindingFutility = FALSE
)
```

Now, as mentioned, the sample size calculation output provides information about the futility boundaries on the treatment effect scale. Therefore, after assuming certain treatment effects, one needs to perform sample size calculation and extract the treatment effect futility bound from the respective output:

```
c_assum <- 0.1 # assumed rate in control
effect_assum <- 0.5 # relative reduction that is to be detected with probability of 0.8
# rates indicate binary endpoint
ssc_fut <- getSampleSizeRates(
design = d_fut,
riskRatio = TRUE,
pi1 = c_assum * (1 - effect_assum),
pi2 = c_assum
)
ssc_fut$futilityBoundsEffectScale
```

```
[,1]
[1,] 0.9464954
[2,] 0.9085874
```

The values printed above are the futility bounds on the treatment effect scale that correspond to the inputs *fut1* and *fut2*. Since 0.9465 < 0.95 and 0.9086 > 0.90, the input value *fut1* = 0.16 is slightly too large while *fut2* = 0.39 is slightly too low; this indicates how to adjust the input values to come closer to the true corresponding values.

Now, using a search algorithm or simply by trying different values, one arrives at the futility bounds on the z-scale summarized in the following table (created using knitr):

 | First stage | Second stage |
---|---|---|

Orig. treatment effect scale | 0.950 | 0.900 |

Approx. z-Value for given t-Value | 0.149 | 0.414 |

Corresponding t-scale Value | 0.950 | 0.900 |

Diff. between Approx. and Input | 0.000 | 0.000 |

In the table above, *Orig. treatment effect scale* is the intended futility bound on the treatment effect scale, *Approx. z-Value for given t-Value* denotes the corresponding z-scale approximation, and *Corresponding t-scale Value* the actual treatment-effect-scale value obtained when using the calculated approximate z-values as input. One can see that the first stage futility bound on the z-scale should be chosen as approximately 0.149 and the second stage z-scale futility bound as approximately 0.414, resulting in the desired futility bounds on the treatment effect scale of approximately 0.95 in the first and 0.90 in the second stage, respectively. In the next chapter, containing the sample size calculations, the respective outputs validate that these z-scale futility bounds are a reasonable choice.

Now that (approximate) futility bounds on the z-scale are known, the design specified above can be fully initialized; *kMax = 3* indicates a study design with three stages.

```
# GSD with futility bounds according to above calculations
d <- getDesignGroupSequential(
  kMax = 3,
  alpha = 0.025,
  beta = 0.2,
  sided = 1,
  typeOfDesign = "asOF",
  informationRates = c(1 / 3, 2 / 3, 1),
  futilityBounds = c(0.149145, 0.41381),
  bindingFutility = FALSE
)
kable(summary(d))
```

**Sequential analysis with a maximum of 3 looks (group sequential design)**

O’Brien & Fleming type alpha spending design, non-binding futility, one-sided overall significance level 2.5%, power 80%, undefined endpoint, inflation factor 1.0833, ASN H1 0.8652, ASN H01 0.843, ASN H0 0.6133.

Stage | 1 | 2 | 3 |
---|---|---|---|

Information rate | 33.3% | 66.7% | 100% |

Efficacy boundary (z-value scale) | 3.710 | 2.511 | 1.993 |

Stage levels (one-sided) | 0.0001 | 0.0060 | 0.0231 |

Futility boundary (z-value scale) | 0.149 | 0.414 | |

Cumulative alpha spent | 0.0001 | 0.0060 | 0.0250 |

Overall power | 0.0213 | 0.4471 | 0.8000 |

Futility probabilities under H1 | 0.062 | 0.011 |

Simply printing the defined object gives a nice overview of all relevant design parameters. Note that the adjusted significance level of the last stage (0.0231) is slightly lower than the predefined global alpha of 0.025, corresponding to a critical value of 1.993 that is slightly larger than the fixed-design value of 1.96; this is due to the alpha-spending O'Brien & Fleming adjustment. Further, the output recapitulates the input parameters.

When it comes to sample size calculations for designs with binary endpoints, rpact provides the command `getSampleSizeRates()`. It should be noted upfront that a sample size calculation in rpact always refers to a single treatment arm comparison.

The sample size calculation code applicable here has already been used indirectly to determine the z-scale futility boundaries when futility boundaries are only given on the treatment effect scale. This section's purpose, however, is to provide more detail on sample size calculations in MAMS designs with a binary endpoint.

Suppose that under H0 there is no reduction, or even a rate increase, in the active treatment group, meaning the difference pi1 - pi2 >= 0, with pi1 representing the assumed event rate in the treatment group and pi2 the assumed event rate in the reference/control group. Again, let the parameter setting be given as above and assume an expected relative reduction in event occurrence of 50% given pi2 = 0.1 (hence pi1 = 0.05). Since we are interested in directly comparing risks, we set `riskRatio` to `TRUE`, which results in testing H0: pi1 / pi2 = 1 against H1: pi1 / pi2 < 1 (one-sided). The sample size per stage for one treatment arm comparison can then be calculated using the commands:

```
c_rate <- 0.1 # assumed rate in control
effect <- 0.5 # relative reduction that is to be detected with probability of 0.8
# rates indicate binary endpoint
d_sample <- getSampleSizeRates(
  design = d,
  riskRatio = TRUE,
  pi1 = c_rate * (1 - effect),
  pi2 = c_rate
)
kable(summary(d_sample))
```

**Sample size calculation for a binary endpoint**

Sequential analysis with a maximum of 3 looks (group sequential design), overall significance level 2.5% (one-sided). The results were calculated for a two-sample test for rates (normal approximation), H0: pi(1) / pi(2) = 1, H1: treatment rate pi(1) = 0.05, control rate pi(2) = 0.1, power 80%.

Stage | 1 | 2 | 3 |
---|---|---|---|

Information rate | 33.3% | 66.7% | 100% |

Efficacy boundary (z-value scale) | 3.710 | 2.511 | 1.993 |

Futility boundary (z-value scale) | 0.149 | 0.414 | |

Overall power | 0.0213 | 0.4471 | 0.8000 |

Number of subjects | 313.8 | 627.5 | 941.3 |

Expected number of subjects under H1 | 751.7 | ||

Cumulative alpha spent | 0.0001 | 0.0060 | 0.0250 |

One-sided local significance level | 0.0001 | 0.0060 | 0.0231 |

Efficacy boundary (t) | 0.061 | 0.476 | 0.643 |

Futility boundary (t) | 0.950 | 0.903 | |

Overall exit probability (under H0) | 0.5594 | 0.1828 | |

Overall exit probability (under H1) | 0.0838 | 0.4366 | |

Exit probability for efficacy (under H0) | 0.0001 | 0.0059 | |

Exit probability for efficacy (under H1) | 0.0213 | 0.4258 | |

Exit probability for futility (under H0) | 0.5593 | 0.1769 | |

Exit probability for futility (under H1) | 0.0625 | 0.0108 |

Legend:

*(t)*: treatment effect scale

The variable *Futility boundary (t)* represents the futility bounds transformed to the treatment effect scale. One can see that the predefined bounds calculated above correspond to a rate ratio of 0.950 at the first stage (i.e. a relative reduction of 5%) and a rate ratio of 0.903 at the second (i.e. a relative reduction of approx. 10%), which is quite close to the intended values. The slight differences to the results above are, again, due to the assumed independence (i.e. futility bound calculations for the different stages are done on an independent basis). However, even with this simplification, the results agree to a precision of two decimal places or more.

Another important piece of information provided by the output is that, to achieve the desired study characteristics (keeping the type I error controlled at 2.5% while obtaining a power of 80%), approx. 157 study subjects are needed per arm per stage given equally spread information rates. Thus, one arrives at a maximum of 157 x 3 x 3 = 1413 subjects for a design with 3 treatment arms, 2 of them active and 1 a control; the multiplications by 3 account for the 3 stages and for the 2 active treatment arms plus 1 control arm. It should be noted that this is just an approximation of the sample size needed to achieve a power of 80%, where power is defined as the probability of successfully detecting the assumed effect. Simulations in the following chapters will indicate that this rather rough approximate sample size is actually conservative, since the studies appear overpowered in simulation; note, however, that power has a different meaning there: when considering simulations of studies with multiple active treatment arms, power refers to the probability of claiming success for at least one active treatment arm in the study.
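As a quick sanity check of this arithmetic (a sketch; the per-arm figure is simply the rounded calculation output from above):

```
n_per_arm_per_stage <- 157  # approx. 941.3 / 3 stages / 2 arms per comparison
n_stages <- 3
n_arms <- 3                 # 2 active treatment arms + 1 control arm
n_per_arm_per_stage * n_stages * n_arms  # maximum overall sample size: 1413
```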

The row *Exit probability for futility* indicates the rather low probability of early stopping due to futility under H1 (first stage: 0.0625, second stage: 0.0108), though this depends on the assumed treatment effects and the defined boundaries. *Efficacy boundary (t)*, just as *Efficacy boundary (z-value scale)*, shows that an enormous decrease in events needs to be observed in the treatment group in order to reject H0 at the first stage (rate ratio of 0.061, i.e. a relative reduction of approx. 94%), which again reflects the conservatism of the alpha-spending O'Brien & Fleming approach in early stages, resulting in rather liberal and monotonically decreasing boundaries along the stages.

```
# boundary plots
par(mar = c(4, 4, .1, .1))
plot(d_sample, main = paste0(
  "Boundaries against stage-wise ",
  "cumul. sample sizes - 1"
), type = 1)
plot(d_sample, main = paste0(
  "Boundaries against stage-wise ",
  "cumul. sample sizes - 2"
), type = 2)
```

Plotting the boundaries against the cumulative sample sizes nicely visualizes these important study characteristics. The left plot shows the boundaries on the z-scale on the y-axis. The dashed line represents the critical value of a fixed study design with one-sided testing and alpha = 0.025, i.e. the 97.5% quantile of the standard normal distribution (approx. 1.96). The right plot contains the same information, but with the y-axis on the treatment effect scale. The red line represents the efficacy bound, which needs to be crossed to obtain statistical significance at the applicable stage; the blue line represents the futility bounds. Note that the plot again indicates that low risk ratio values are desirable.
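The dashed reference line in the left plot can be reproduced directly from the fixed-design critical value:

```
# one-sided fixed design critical value at alpha = 0.025
qnorm(1 - 0.025)  # approx. 1.96
```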

Performing simulations prior to conducting the study is often reasonable, since they allow for evaluating characteristics such as power (see the definition above) under different scenarios or constellations of treatment effects. Simulations thus give a more holistic view of how and where the planned study could go if the assumptions turn out to be (approximately) true.

Simulation of the study design is done using the function `getSimulationMultiArmRates()`, which needs an inverse normal design as input. It is defined similarly to the group sequential design, with the addition of information about how the analysis in the simulations should be performed:

```
# design as above, just as inverse normal
d_IN <- getDesignInverseNormal(
  kMax = 3,
  alpha = 0.025,
  beta = 0.2,
  sided = 1,
  typeOfDesign = "asOF",
  informationRates = c(1 / 3, 2 / 3, 1),
  futilityBounds = c(0.149145, 0.41380800),
  bindingFutility = FALSE
)
```

To perform study simulations, for instance for power or probability-of-success evaluation, one needs to assume (different) event rates in the active treatment arms, which are defined in a matrix object. For binary data, the `effectMatrix` refers to the actual event rate in each arm, not to the difference in event rates between control and active treatment arms. Further, the number of iterations needs to be defined in advance.

```
# set number of iterations to be used in simulation
maxNumberOfIterations <- 100 # 10000
# specify the scenarios, nrow: number of scenarios, ncol: number of treatment arms
effectMatrix <- matrix(c(0.100, 0.100, 0.05, 0.05, 0.055, 0.045),
  byrow = TRUE, nrow = 3, ncol = 2
)
# first column: first treatment arm, second column: second treatment arm
show(effectMatrix)
```

```
      [,1]  [,2]
[1,] 0.100 0.100
[2,] 0.050 0.050
[3,] 0.055 0.045
```

Considering a design with 2 active treatment arms, both to be compared against a control arm, the effect matrix contains the assumed event rate of the first active arm in the first column and that of the second active arm in the second column; each row represents one scenario.

The next step is to actually perform the simulation. Several variables need to be initialized in order to tailor the simulation to the specific needs. Choosing `typeOfShape` as `userDefined` refers to the effect matrix defined above; alternatively, one could assume a *linear* relationship and initialize a vector of maximal assumed effects in the treatment groups (`piMaxVector`). When some functional form of dose-response curve is used, note that `piMaxVector` may refer to the maximum treatment effect on the whole dose-response curve instead of the observed dose range, which could be misleading; it is therefore suggested to use the `userDefined` shape. `directionUpper` is set to `FALSE` since low event rates correspond to a better clinical outcome and reducing the rate is considered beneficial to the subjects. The intersection test performed here is *Simes*; other options include *Bonferroni* and *Dunnett*, with *Dunnett* being the default. `typeOfSelection` is initially set to `rBest` with `rValue` equal to 2, which means that at each stage the 2 best treatment arms are carried forward. `successCriterion = "all"` means that, to stop the study early for efficacy, both active treatment arms need to be significant at the interim analysis; alternatively, `successCriterion = "atLeastOne"` declares significance of one treatment arm sufficient for study success at interim. The vector `plannedSubjects` contains the cumulative per arm per stage sample sizes calculated previously.

```
# first simulation
simulation <- getSimulationMultiArmRates(
  design = d_IN,
  activeArms = 2,
  effectMatrix = effectMatrix,
  typeOfShape = "userDefined",
  piControl = 0.1,
  intersectionTest = "Simes",
  directionUpper = FALSE,
  typeOfSelection = "rBest",
  rValue = 2,
  effectMeasure = "testStatistic",
  successCriterion = "all",
  plannedSubjects = c(157, 314, 471),
  allocationRatioPlanned = 1,
  maxNumberOfIterations = maxNumberOfIterations,
  seed = 145873,
  showStatistics = TRUE
)
kable(summary(simulation))
```

**Simulation of a binary endpoint (multi-arm design)**

Sequential analysis with a maximum of 3 looks (inverse normal combination test design), overall significance level 2.5% (one-sided). The results were simulated for a multi-arm comparisons for rates (2 treatments vs. control), H0: pi(i) - pi(control) = 0, power directed towards smaller values, H1: treatment rate pi_max as specified, control rate pi(control) = 0.1, planned cumulative sample size = c(157, 314, 471), effect shape = user defined, intersection test = Simes, selection = r best, r = 2, effect measure based on test statistic, success criterion: all, simulation runs = 100, seed = 145873.

Stage | 1 | 2 | 3 |
---|---|---|---|

Fixed weight | 0.577 | 0.577 | 0.577 |

Efficacy boundary (z-value scale) | 3.710 | 2.511 | 1.993 |

Stage levels (one-sided) | 0.0001 | 0.0060 | 0.0231 |

Futility boundary (z-value scale) | 0.149 | 0.414 | |

Reject at least one [1] | 0.0200 | ||

Reject at least one [2] | 0.7900 | ||

Reject at least one [3] | 0.9200 | ||

Rejected arms per stage [1] | |||

Treatment arm 1 | 0 | 0 | 0.0100 |

Treatment arm 2 | 0 | 0 | 0.0100 |

Rejected arms per stage [2] | |||

Treatment arm 1 | 0 | 0.4000 | 0.3200 |

Treatment arm 2 | 0 | 0.3200 | 0.3900 |

Rejected arms per stage [3] | |||

Treatment arm 1 | 0 | 0.3100 | 0.3900 |

Treatment arm 2 | 0.0100 | 0.4800 | 0.3800 |

Success per stage [1] | 0 | 0 | 0 |

Success per stage [2] | 0 | 0.2600 | 0.3800 |

Success per stage [3] | 0 | 0.2800 | 0.3700 |

Exit probability for futility [1] | 0.6300 | 0.2300 | |

Exit probability for futility [2] | 0.0600 | 0.0100 | |

Exit probability for futility [3] | 0.0400 | 0 | |

Expected number of subjects under H1 [1] | 711.2 | ||

Expected number of subjects under H1 [2] | 1229.3 | ||

Expected number of subjects under H1 [3] | 1243.4 | ||

Overall exit probability [1] | 0.6300 | 0.2300 | |

Overall exit probability [2] | 0.0600 | 0.2700 | |

Overall exit probability [3] | 0.0400 | 0.2800 | |

Stagewise number of subjects [1] | |||

Treatment arm 1 | 157.0 | 157.0 | 157.0 |

Treatment arm 2 | 157.0 | 157.0 | 157.0 |

Control arm | 157.0 | 157.0 | 157.0 |

Stagewise number of subjects [2] | |||

Treatment arm 1 | 157.0 | 157.0 | 157.0 |

Treatment arm 2 | 157.0 | 157.0 | 157.0 |

Control arm | 157.0 | 157.0 | 157.0 |

Stagewise number of subjects [3] | |||

Treatment arm 1 | 157.0 | 157.0 | 157.0 |

Treatment arm 2 | 157.0 | 157.0 | 157.0 |

Control arm | 157.0 | 157.0 | 157.0 |

Selected arms [1] | |||

Treatment arm 1 | 1.0000 | 0.3700 | 0.1400 |

Treatment arm 2 | 1.0000 | 0.3700 | 0.1400 |

Selected arms [2] | |||

Treatment arm 1 | 1.0000 | 0.9400 | 0.6700 |

Treatment arm 2 | 1.0000 | 0.9400 | 0.6700 |

Selected arms [3] | |||

Treatment arm 1 | 1.0000 | 0.9600 | 0.6800 |

Treatment arm 2 | 1.0000 | 0.9600 | 0.6800 |

Number of active arms [1] | 2.000 | 2.000 | 2.000 |

Number of active arms [2] | 2.000 | 2.000 | 2.000 |

Number of active arms [3] | 2.000 | 2.000 | 2.000 |

Conditional power (achieved) [1] | 0.0614 | 0.1061 | |

Conditional power (achieved) [2] | 0.3845 | 0.5980 | |

Conditional power (achieved) [3] | 0.4098 | 0.7031 |

Legend:

*(i)*: treatment arm i; *[j]*: effect matrix row j (situation to consider)

In this output, the different input scenarios are indicated through *[j]*, where j refers to the j-th row of the effect matrix.

Under H0, with the assumed treatment rates equal to the assumed rate in the control group (scenario 1), the probability of rejecting at least one hypothesis, i.e. committing a type I error, is low (*Reject at least one [1]* = 0.02), while the rejection probability under the alternative (power) is high, especially when assuming high treatment effects in both treatment arms (*Reject at least one [2]* = 0.79 for scenario 2). Note also that, in any case, both treatment arms are selected, as `rValue = 2` means that the two best treatment arms (i.e. all arms in this case) are selected, regardless of whether one or both treatment arms meet the applicable futility bounds. Further, the simulation output contains a range of additional information: stage-wise numbers of subjects are calculated, and probabilities for arms to be selected at each stage are provided for each assumed effect. Additionally, the expected sample size, which is commonly used as an optimality criterion for designs, the probabilities of stopping due to futility, and the conditional power, defined as the probability of obtaining a statistically significant result given the data observed thus far, are listed. Note that, since the probability of a futility stop is rather high in scenario 1 (first stage: 0.63, second stage: 0.23), the corresponding *expected number of subjects* is approximately 711, lying far below the expected sample sizes in the other scenarios.

Now suppose the underlying treatment arm selection scheme is different and, in particular, not straightforwardly covered by the other available pre-defined options in rpact (i.e. `best`, `rBest`, `epsilon`, `all`). rpact accounts for this by allowing users to implement a user-defined treatment arm selection function, and is thus capable of covering various selection approaches. The user needs to set `typeOfSelection` to `userDefined` and define a function used as input for the `selectArmsFunction` argument.

Say, for example, a treatment arm should be selected if and only if it does not cross the futility bound; if it does, the function should be specified such that this arm is deselected at the applicable stage. The different treatment arm selection approaches at the first stage, depending on the potential outcomes, are illustrated in the figures below: the first diagram illustrates the procedure using `rBest` and the second the `userDefined` scheme, where the numbers represent the following potential first stage analysis results:

- Both active Treatment Arms significant
- 1 active Treatment Arm significant, 1 active Treatment Arm non-significant
- 1 active Treatment Arm futile, 1 active Treatment Arm significant
- Both active Treatment Arms non-significant
- 1 active Treatment Arm futile, 1 active Treatment Arm non-significant
- Both active Treatment Arm futile

rBest, rValue=2

This first diagram graphically represents how treatment selection proceeds when choosing *typeOfSelection=rBest* with *rValue=2* in the study design previously defined. Early study success is obtained whenever both treatment arms are significant at interim; the study continues with only the non-significant arms (meaning stopped neither for efficacy nor for futility) whenever one of the active arms is discontinued due to efficacy; and if none of the arms is significant at interim, the study continues with both treatment arms, since even arms crossing the futility bounds are carried forward under *rBest, rValue=2*.

userDefined

This second diagram illustrates the active treatment arm selection that is implemented through the user-defined selection function. An efficacy stop occurs when both active treatment arms are significant at interim; continuation with only the non-significant arm happens when one active arm is significant while the other is neither significant nor futile. The study also terminates early with success of one arm when one active arm is futile while the other can be deemed superior to control; continuation with only the non-futile arms occurs when both arms are non-significant, or when one of them is non-significant while the other is futile. Lastly, a futility stop can also happen at interim.

Since the futility bounds are already defined in the design specification, one might believe that the selection rule is only an add-on to the futility bounds. However, this is not true: specifying a selection rule overwrites the futility bound as the continuation criterion for treatment arms. Thus, even if futility bounds are pre-specified, to implement stopping due to futility in `getSimulationMultiArmRates()` (but also `getSimulationMultiArmMeans()` and `getSimulationMultiArmSurvival()`), one can use these bounds as input in the customized selection function. In this case, with the previously defined futility bounds of 0.149 at the first interim and 0.414 at the second, the selection scheme needs to be individualized along the stages. It is important to note that `effectMeasure = "testStatistic"` needs to be set, as the futility bounds have intentionally been transformed to and calculated on the z-scale. rpact enables the implementation of this selection scheme by allowing *stage* as an argument of the selection function, in addition to `effectVector`:

```
# first row: first stage futility bounds, second row: second stage futility bounds
futility_bounds <- matrix(c(d_IN$futilityBounds, d_IN$futilityBounds), nrow = 2)
# selection function
selection <- function(effectVector, stage) {
  # if stage == 1, compare to first stage futility bounds,
  # if stage == 2, compare to second stage futility bounds
  selectedArms <- switch(stage,
    (effectVector >= futility_bounds[1, ]),
    (effectVector >= futility_bounds[2, ])
  )
  return(selectedArms)
}
simulation <- getSimulationMultiArmRates(
  design = d_IN,
  activeArms = 2,
  effectMatrix = effectMatrix,
  typeOfShape = "userDefined",
  piControl = 0.1,
  intersectionTest = "Simes",
  directionUpper = FALSE,
  typeOfSelection = "userDefined",
  selectArmsFunction = selection,
  effectMeasure = "testStatistic",
  successCriterion = "all",
  plannedSubjects = c(157, 314, 471),
  allocationRatioPlanned = 1,
  maxNumberOfIterations = maxNumberOfIterations,
  seed = 145873,
  showStatistics = TRUE
)
kable(summary(simulation))
```

**Simulation of a binary endpoint (multi-arm design)**

Sequential analysis with a maximum of 3 looks (inverse normal combination test design), overall significance level 2.5% (one-sided). The results were simulated for a multi-arm comparisons for rates (2 treatments vs. control), H0: pi(i) - pi(control) = 0, power directed towards smaller values, H1: treatment rate pi_max as specified, control rate pi(control) = 0.1, planned cumulative sample size = c(157, 314, 471), effect shape = user defined, intersection test = Simes, selection = user defined, effect measure based on test statistic, success criterion: all, simulation runs = 100, seed = 145873.

Stage | 1 | 2 | 3 |
---|---|---|---|

Fixed weight | 0.577 | 0.577 | 0.577 |

Efficacy boundary (z-value scale) | 3.710 | 2.511 | 1.993 |

Stage levels (one-sided) | 0.0001 | 0.0060 | 0.0231 |

Futility boundary (z-value scale) | 0.149 | 0.414 | |

Reject at least one [1] | 0.0200 | ||

Reject at least one [2] | 0.8200 | ||

Reject at least one [3] | 0.8100 | ||

Rejected arms per stage [1] | |||

Treatment arm 1 | 0 | 0 | 0.0200 |

Treatment arm 2 | 0 | 0 | 0 |

Rejected arms per stage [2] | |||

Treatment arm 1 | 0 | 0.4200 | 0.3300 |

Treatment arm 2 | 0.0100 | 0.3700 | 0.3000 |

Rejected arms per stage [3] | |||

Treatment arm 1 | 0.0100 | 0.2300 | 0.2800 |

Treatment arm 2 | 0.0100 | 0.4800 | 0.2900 |

Success per stage [1] | 0 | 0 | 0.0200 |

Success per stage [2] | 0 | 0.3400 | 0.4500 |

Success per stage [3] | 0.0100 | 0.2800 | 0.4200 |

Exit probability for futility [1] | 0.6200 | 0.2700 | |

Exit probability for futility [2] | 0.0500 | 0.0700 | |

Exit probability for futility [3] | 0.0700 | 0.0800 | |

Expected number of subjects under H1 [1] | 657.8 | ||

Expected number of subjects under H1 [2] | 1121.0 | ||

Expected number of subjects under H1 [3] | 1106.8 | ||

Overall exit probability [1] | 0.6200 | 0.2700 | |

Overall exit probability [2] | 0.0500 | 0.4100 | |

Overall exit probability [3] | 0.0800 | 0.3600 | |

Stagewise number of subjects [1] | |||

Treatment arm 1 | 157.0 | 140.5 | 85.6 |

Treatment arm 2 | 157.0 | 103.3 | 71.4 |

Control arm | 157.0 | 157.0 | 157.0 |

Stagewise number of subjects [2] | |||

Treatment arm 1 | 157.0 | 145.4 | 133.7 |

Treatment arm 2 | 157.0 | 147.1 | 122.1 |

Control arm | 157.0 | 157.0 | 157.0 |

Stagewise number of subjects [3] | |||

Treatment arm 1 | 157.0 | 129.7 | 112.1 |

Treatment arm 2 | 157.0 | 148.5 | 151.4 |

Control arm | 157.0 | 157.0 | 157.0 |

Selected arms [1] | |||

Treatment arm 1 | 1.0000 | 0.3400 | 0.0600 |

Treatment arm 2 | 1.0000 | 0.2500 | 0.0500 |

Selected arms [2] | |||

Treatment arm 1 | 1.0000 | 0.8800 | 0.4600 |

Treatment arm 2 | 1.0000 | 0.8900 | 0.4200 |

Selected arms [3] | |||

Treatment arm 1 | 1.0000 | 0.7600 | 0.4000 |

Treatment arm 2 | 1.0000 | 0.8700 | 0.5400 |

Number of active arms [1] | 2.000 | 1.553 | 1.000 |

Number of active arms [2] | 2.000 | 1.863 | 1.630 |

Number of active arms [3] | 2.000 | 1.772 | 1.679 |

Conditional power (achieved) [1] | 0.0627 | 0.5144 | |

Conditional power (achieved) [2] | 0.4522 | 0.7123 | |

Conditional power (achieved) [3] | 0.4165 | 0.7539 |

Legend:

*(i)*: treatment arm i; *[j]*: effect matrix row j (situation to consider)

Depending on the stage the simulation is currently iterating through, the `switch()` in the treatment selection function recognizes the stage change and adapts the applicable futility bound by switching rows in the predefined futility bound matrix.
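To see the selection function in isolation, one can call it with hypothetical stage-wise z-values (a sketch; the z-values below are invented for illustration):

```
# first row: first stage bounds, second row: second stage bounds (both arms)
futility_bounds <- matrix(c(0.149145, 0.413808, 0.149145, 0.413808), nrow = 2)
selection <- function(effectVector, stage) {
  switch(stage,
    effectVector >= futility_bounds[1, ],
    effectVector >= futility_bounds[2, ]
  )
}
# hypothetical first stage z-values: arm 1 above, arm 2 below the bound of 0.149
selection(effectVector = c(0.50, 0.10), stage = 1)  # TRUE FALSE
# hypothetical second stage z-values: both arms below the bound of 0.414
selection(effectVector = c(0.40, 0.10), stage = 2)  # FALSE FALSE
```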

Again, assuming H0 to be true (scenario 1), the probability of falsely rejecting it is 0.02; thus this simulation again indicates type I error rate control at 2.5%. The power assuming a 50% reduction is 0.82 (*Reject at least one [2]*). Since stopping of treatment arms is now determined by whether the futility bounds are crossed, the probability of a futility stop under H0 now amounts to 0.62 at the first and 0.27 at the second stage, compared to 0.63 and 0.23 in the first simulation. The different treatment selection scheme also results in a lower expected number of subjects and, in contrast to the first simulation, the number of active arms per scenario lies below 2 from stage 2 onwards, since this treatment selection allows for early discontinuation of study arms, whereas in the first simulation arms are carried forward regardless of potential futility.

In both simulations, with the initially calculated 157 subjects per arm per stage and a relative event rate reduction of 50% as the alternative, one can see that the study tends to be overpowered (*Reject at least one [3]* = 0.92 in the first simulation; *Reject at least one [2]* = 0.82 and *[3]* = 0.81 in the second), which is due to the considered alternative having approximately the same effect in the two active treatment arms. For the second simulation, running simulations over different, potentially optimal sample sizes (optimal in the sense of the smallest sample size needed to achieve 80% power) indicates that one could save some subjects and still achieve the desired power. The following plot shows the minimum sample size one could choose to meet the power requirements; keep in mind that this number might deviate slightly from an analytically optimal solution due to simulation error:
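One way to approximate that minimum is a small grid search over the per-arm per-stage sample size, re-running the simulation for the 50% reduction scenario each time. The sketch below reuses `d_IN` and the `selection` function from above and assumes that the simulation result object exposes the empirical rejection probability as `rejectAtLeastOne`; the grid itself is an illustrative choice.

```
# empirical power for a given per-arm per-stage sample size n (sketch)
power_for_n <- function(n) {
  sim <- getSimulationMultiArmRates(
    design = d_IN, activeArms = 2,
    effectMatrix = matrix(c(0.05, 0.05), nrow = 1),  # 50% reduction scenario
    typeOfShape = "userDefined", piControl = 0.1,
    intersectionTest = "Simes", directionUpper = FALSE,
    typeOfSelection = "userDefined", selectArmsFunction = selection,
    effectMeasure = "testStatistic", successCriterion = "all",
    plannedSubjects = c(n, 2 * n, 3 * n),
    allocationRatioPlanned = 1,
    maxNumberOfIterations = 10000, seed = 145873
  )
  sim$rejectAtLeastOne
}
sapply(seq(120, 160, by = 10), power_for_n)
```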

Further, it should be noted that even though the control rate is explicitly assumed to be 0.1, the null hypothesis H0: pi(i) - pi(control) = 0 is true whenever the event rates are equal, regardless of their common value. Consequently, one should check whether the type I error rate is controlled not only for the assumed control rate of 0.1, but also for other scenarios with equal rates. For the second simulation, considering only the cases where rate equality holds, the following plot obtained by simulation indicates that, given the study configuration here, the type I error rate is controlled under various control event rate assumptions:
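Such a check can be sketched by simulating with equal event rates in all three arms for several assumed common rates (again reusing `d_IN` and `selection` from above, with `rejectAtLeastOne` assumed to hold the empirical rejection probability, which under rate equality is the empirical type I error rate):

```
# empirical type I error rate for a common event rate p in all arms (sketch)
t1e_for_rate <- function(p) {
  sim <- getSimulationMultiArmRates(
    design = d_IN, activeArms = 2,
    effectMatrix = matrix(c(p, p), nrow = 1),  # null: both active arms equal control
    typeOfShape = "userDefined", piControl = p,
    intersectionTest = "Simes", directionUpper = FALSE,
    typeOfSelection = "userDefined", selectArmsFunction = selection,
    effectMeasure = "testStatistic", successCriterion = "all",
    plannedSubjects = c(157, 314, 471),
    allocationRatioPlanned = 1,
    maxNumberOfIterations = 10000, seed = 145873
  )
  sim$rejectAtLeastOne
}
sapply(c(0.05, 0.10, 0.15, 0.20), t1e_for_rate)
```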

In the next chapters, different hypothetical binary endpoint datasets are generated and analyzed using the `getAnalysisResults()` command. As previously mentioned, rpact itself doesn't support landmark analysis using the Greenwood standard error (SE) formula. The first two analyses are therefore based on the empirical event rates only. Afterwards, gestate and survival are used to show how one could combine the packages to perform the intended analysis, using boundaries obtained by rpact and test statistics computed from survival probabilities and standard errors estimated using gestate and survival.

In the first stage analysis, one first has to manually enter a dataset for data observed in the trial:

```
genData_1 <- getDataset(
  events1 = 4,
  events2 = 8,
  events3 = 16,
  sampleSizes1 = 153,
  sampleSizes2 = 157,
  sampleSizes3 = 156
)
kable(summary(genData_1))
```

**Dataset of multi-arm rates**

The dataset contains the sample sizes and events of two treatment groups and one control group.

Stage | 1 | 1 | 1 |
---|---|---|---|

Group | 1 | 2 | 3 |

Sample size | 153 | 157 | 156 |

Number of events | 4 | 8 | 16 |

This dataset is a generic realization of first stage data in a design with a binary endpoint under the input assumption of a rate reduction in the treatment groups given a control rate of 0.1. The highest index (3) corresponds to the control group, i.e. *events3* contains the events that occurred in the control group and *sampleSizes3* the underlying sample size; the other indices represent the active treatment groups. The data here are chosen such that one can see lower event rates in the active treatment groups. Note the slight imbalances in sample sizes, which might occur due to dropouts or recruitment issues.
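The empirical first stage event rates implied by this dataset can be computed directly:

```
c(4 / 153, 8 / 157, 16 / 156)  # approx. 0.026 (arm 1), 0.051 (arm 2), 0.103 (control)
```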

The actual analysis of the first stage, using *Simes* as the intersection test, goes as follows:

```
results_1 <- getAnalysisResults(
  design = d_IN,
  dataInput = genData_1,
  directionUpper = FALSE,
  intersectionTest = "Simes"
)
kable(summary(results_1))
```

**Multi-arm analysis results for a binary endpoint (2 active arms vs. control)**

Sequential analysis with 3 looks (inverse normal combination test design). The results were calculated using a multi-arm test for rates (one-sided, alpha = 0.025), Simes intersection test, normal approximation test. H0: pi(i) - pi(control) = 0 against H1: pi(i) - pi(control) < 0.

Stage | 1 | 2 | 3 |
---|---|---|---|
Fixed weight | 0.577 | 0.577 | 0.577 |
Efficacy boundary (z-value scale) | 3.710 | 2.511 | 1.993 |
Futility boundary (z-value scale) | 0.149 | 0.414 | |
Cumulative alpha spent | 0.0001 | 0.0060 | 0.0250 |
Stage level | 0.0001 | 0.0060 | 0.0231 |
Cumulative effect size (1) | -0.076 | | |
Cumulative effect size (2) | -0.052 | | |
Cumulative treatment rate (1) | 0.026 | | |
Cumulative treatment rate (2) | 0.051 | | |
Cumulative control rate | 0.103 | | |
Stage-wise test statistic (1) | -2.730 | | |
Stage-wise test statistic (2) | -1.716 | | |
Stage-wise p-value (1) | 0.0032 | | |
Stage-wise p-value (2) | 0.0431 | | |
Adjusted stage-wise p-value (1, 2) | 0.0063 | | |
Adjusted stage-wise p-value (1) | 0.0032 | | |
Adjusted stage-wise p-value (2) | 0.0431 | | |
Overall adjusted test statistic (1, 2) | 2.493 | | |
Overall adjusted test statistic (1) | 2.730 | | |
Overall adjusted test statistic (2) | 1.716 | | |
Test action: reject (1) | FALSE | | |
Test action: reject (2) | FALSE | | |
Conditional rejection probability (1) | 0.2907 | | |
Conditional rejection probability (2) | 0.1204 | | |
95% repeated confidence interval (1) | [-0.212; 0.043] | | |
95% repeated confidence interval (2) | [-0.191; 0.079] | | |
Repeated p-value (1) | 0.1150 | | |
Repeated p-value (2) | 0.2429 | | |

Legend:

- *(i)*: results of treatment arm i vs. control arm
- *(i, j, …)*: comparison of treatment arms ‘i, j, …’ vs. control arm

Although lower event rates are obviously present already in the first stage, none of the hypotheses can be rejected. This can be verified manually by comparing, e.g., the overall adjusted test statistic for the global intersection hypothesis (no effect in either of the active treatment arms) with the efficacy boundary: 2.493 < 3.710, i.e., the global intersection is not rejected. Nor does a stop due to futility occur, since no futility boundary is crossed at stage 1, which means that both treatment arms are carried forward to the second-stage analysis. Non-significance at stage 1 can also be established by comparing the repeated p-values to the full alpha = 0.025.
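The *adjusted stage-wise p-value (1, 2)* reported above can be reproduced by hand with the two-hypothesis Simes rule applied to the stage-wise p-values. This is a minimal sketch using the rounded values from the output, so the result differs from the reported 0.0063 only by rounding:

```r
# Simes adjusted p-value for an intersection of m hypotheses:
# sort the p-values and take the minimum of p_(k) * m / k
simes <- function(p) {
  p_sorted <- sort(p)
  m <- length(p)
  min(p_sorted * m / seq_len(m))
}

# first-stage stage-wise p-values of arms 1 and 2 (from the output above)
simes(c(0.0032, 0.0431))  # ~0.0064, reported as 0.0063 (rounding)
```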

Proceeding to the second stage, a generic dataset might look as follows, where the second vector entries represent the second-stage data:

```
# assuming there was no futility or efficacy stop, the study proceeds to randomize subjects
genData_2 <- getDataset(
  events1 = c(4, 7),
  events2 = c(8, 7),
  events3 = c(16, 15),
  sampleSizes1 = c(153, 155),
  sampleSizes2 = c(157, 155),
  sampleSizes3 = c(156, 155)
)
kable(summary(genData_2))
```

**Dataset of multi-arm rates**

The dataset contains the sample sizes and events of two treatment groups and one control group. The total number of looks is two; stage-wise and cumulative data are included.

Stage | 1 | 1 | 1 | 2 | 2 | 2 |
---|---|---|---|---|---|---|
Group | 1 | 2 | 3 | 1 | 2 | 3 |
Stage-wise sample size | 153 | 157 | 156 | 155 | 155 | 155 |
Cumulative sample size | 153 | 157 | 156 | 308 | 312 | 311 |
Stage-wise number of events | 4 | 8 | 16 | 7 | 7 | 15 |
Cumulative number of events | 4 | 8 | 16 | 11 | 15 | 31 |

Here, again, the event rates are lower in the treatment groups.
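As a quick plausibility check, the cumulative rates that `getAnalysisResults()` will report at stage 2 can be reproduced directly from the cumulative counts in the dataset above (any small differences are due to rounding):

```r
# cumulative rate at stage 2 = cumulative events / cumulative sample size
11 / 308  # ~0.036, cumulative treatment rate (1)
15 / 312  # ~0.048, cumulative treatment rate (2)
31 / 311  # ~0.100, cumulative control rate
```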

```
results_2 <- getAnalysisResults(
  design = d_IN,
  dataInput = genData_2,
  directionUpper = FALSE,
  intersectionTest = "Simes"
)
```

`kable(summary(results_2))`

**Multi-arm analysis results for a binary endpoint (2 active arms vs. control)**

Sequential analysis with 3 looks (inverse normal combination test design). The results were calculated using a multi-arm test for rates (one-sided, alpha = 0.025), Simes intersection test, normal approximation test. H0: pi(i) - pi(control) = 0 against H1: pi(i) - pi(control) < 0.

Stage | 1 | 2 | 3 |
---|---|---|---|
Fixed weight | 0.577 | 0.577 | 0.577 |
Efficacy boundary (z-value scale) | 3.710 | 2.511 | 1.993 |
Futility boundary (z-value scale) | 0.149 | 0.414 | |
Cumulative alpha spent | 0.0001 | 0.0060 | 0.0250 |
Stage level | 0.0001 | 0.0060 | 0.0231 |
Cumulative effect size (1) | -0.076 | -0.064 | |
Cumulative effect size (2) | -0.052 | -0.052 | |
Cumulative treatment rate (1) | 0.026 | 0.036 | |
Cumulative treatment rate (2) | 0.051 | 0.048 | |
Cumulative control rate | 0.103 | 0.100 | |
Stage-wise test statistic (1) | -2.730 | -1.770 | |
Stage-wise test statistic (2) | -1.716 | -1.770 | |
Stage-wise p-value (1) | 0.0032 | 0.0384 | |
Stage-wise p-value (2) | 0.0431 | 0.0384 | |
Adjusted stage-wise p-value (1, 2) | 0.0063 | 0.0384 | |
Adjusted stage-wise p-value (1) | 0.0032 | 0.0384 | |
Adjusted stage-wise p-value (2) | 0.0431 | 0.0384 | |
Overall adjusted test statistic (1, 2) | 2.493 | 3.014 | |
Overall adjusted test statistic (1) | 2.730 | 3.182 | |
Overall adjusted test statistic (2) | 1.716 | 2.464 | |
Test action: reject (1) | FALSE | TRUE | |
Test action: reject (2) | FALSE | FALSE | |
Conditional rejection probability (1) | 0.2907 | 0.7911 | |
Conditional rejection probability (2) | 0.1204 | 0.5133 | |
95% repeated confidence interval (1) | [-0.212; 0.043] | [-0.130; -0.005] | |
95% repeated confidence interval (2) | [-0.191; 0.079] | [-0.119; 0.011] | |
Repeated p-value (1) | 0.1150 | 0.0086 | |
Repeated p-value (2) | 0.2429 | 0.0274 | |

Legend:

- *(i)*: results of treatment arm i vs. control arm
- *(i, j, …)*: comparison of treatment arms ‘i, j, …’ vs. control arm

Performing the same comparisons as in stage 1, one can see that the global null is rejected (3.014 > 2.511). Subsequently, the hypothesis for the first active treatment is rejected, since 3.182 > 2.511, indicating that the first active arm performs better than control. The same result is obtained by recognizing that the *repeated p-value (1)* of 0.0086 falls below alpha = 0.025. The consequence is that this arm can be discontinued early due to efficacy. However, since the second treatment arm is not tested significant (2.464 < 2.511) and early stopping for efficacy occurs only when all active treatments appear significant, the study continues up to the final stage with the second treatment arm only. Again, no futility stop occurs. It should be noted that the *adjusted stage-wise p-values* cannot be used directly for testing, since these values are based on the second-stage data only. Because an inverse normal combination test with equal weights (0.577 = 1/sqrt(3)) is performed here, the first- and second-stage p-values can be used to calculate the overall adjusted test statistics.
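With equal fixed weights, the two-stage inverse normal combination reduces to the unweighted average of the stage-wise z-scores scaled by sqrt(2). The sketch below reproduces the overall adjusted test statistic for arm 1 at stage 2 from the rounded stage-wise p-values in the output, so it matches the reported 3.182 only up to rounding:

```r
# inverse normal combination of two stage-wise p-values with equal weights
p1 <- 0.0032  # stage 1, arm 1 vs. control (from the output above)
p2 <- 0.0384  # stage 2, arm 1 vs. control
z_comb <- (qnorm(1 - p1) + qnorm(1 - p2)) / sqrt(2)
z_comb          # ~3.18, the reported overall adjusted test statistic is 3.182
z_comb > 2.511  # comparison with the stage 2 efficacy boundary
```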

Third-stage dataset:

```
genData_3 <- getDataset(
  events1 = c(4, 7, NA),
  events2 = c(8, 7, 6),
  events3 = c(16, 15, 16),
  sampleSizes1 = c(153, 155, NA),
  sampleSizes2 = c(157, 155, 156),
  sampleSizes3 = c(156, 155, 160)
)
```

**Dataset of multi-arm rates**

The dataset contains the sample sizes and events of two treatment groups and one control group. The total number of looks is three; stage-wise and cumulative data are included.

Stage | 1 | 1 | 1 | 2 | 2 | 2 | 3 | 3 | 3 |
---|---|---|---|---|---|---|---|---|---|
Group | 1 | 2 | 3 | 1 | 2 | 3 | 1 | 2 | 3 |
Stage-wise sample size | 153 | 157 | 156 | 155 | 155 | 155 | NA | 156 | 160 |
Cumulative sample size | 153 | 157 | 156 | 308 | 312 | 311 | NA | 468 | 471 |
Stage-wise number of events | 4 | 8 | 16 | 7 | 7 | 15 | NA | 6 | 16 |
Cumulative number of events | 4 | 8 | 16 | 11 | 15 | 31 | NA | 21 | 47 |

Final analysis:

```
results_3 <- getAnalysisResults(
design = d_IN,
dataInput = genData_3,
directionUpper = FALSE,
intersectionTest = "Simes"
)
```

`kable(summary(results_3))`

**Multi-arm analysis results for a binary endpoint (2 active arms vs. control)**

Sequential analysis with 3 looks (inverse normal combination test design). The results were calculated using a multi-arm test for rates (one-sided, alpha = 0.025), Simes intersection test, normal approximation test. H0: pi(i) - pi(control) = 0 against H1: pi(i) - pi(control) < 0.

Stage | 1 | 2 | 3 |
---|---|---|---|
Fixed weight | 0.577 | 0.577 | 0.577 |
Efficacy boundary (z-value scale) | 3.710 | 2.511 | 1.993 |
Futility boundary (z-value scale) | 0.149 | 0.414 | |
Cumulative alpha spent | 0.0001 | 0.0060 | 0.0250 |
Stage level | 0.0001 | 0.0060 | 0.0231 |
Cumulative effect size (1) | -0.076 | -0.064 | |
Cumulative effect size (2) | -0.052 | -0.052 | -0.055 |
Cumulative treatment rate (1) | 0.026 | 0.036 | |
Cumulative treatment rate (2) | 0.051 | 0.048 | 0.045 |
Cumulative control rate | 0.103 | 0.100 | 0.100 |
Stage-wise test statistic (1) | -2.730 | -1.770 | |
Stage-wise test statistic (2) | -1.716 | -1.770 | -2.149 |
Stage-wise p-value (1) | 0.0032 | 0.0384 | |
Stage-wise p-value (2) | 0.0431 | 0.0384 | 0.0158 |
Adjusted stage-wise p-value (1, 2) | 0.0063 | 0.0384 | 0.0158 |
Adjusted stage-wise p-value (1) | 0.0032 | 0.0384 | |
Adjusted stage-wise p-value (2) | 0.0431 | 0.0384 | 0.0158 |
Overall adjusted test statistic (1, 2) | 2.493 | 3.014 | 3.702 |
Overall adjusted test statistic (1) | 2.730 | 3.182 | |
Overall adjusted test statistic (2) | 1.716 | 2.464 | 3.253 |
Test action: reject (1) | FALSE | TRUE | TRUE |
Test action: reject (2) | FALSE | FALSE | TRUE |
Conditional rejection probability (1) | 0.2907 | 0.7911 | |
Conditional rejection probability (2) | 0.1204 | 0.5133 | |
95% repeated confidence interval (1) | [-0.212; 0.043] | [-0.130; -0.005] | |
95% repeated confidence interval (2) | [-0.191; 0.079] | [-0.119; 0.011] | [-0.099; -0.013] |
Repeated p-value (1) | 0.1150 | 0.0086 | |
Repeated p-value (2) | 0.2429 | 0.0274 | 0.0006 |

Legend:

- *(i)*: results of treatment arm i vs. control arm
- *(i, j, …)*: comparison of treatment arms ‘i, j, …’ vs. control arm

Considering, e.g., the overall adjusted test statistics for the intersection and single hypotheses, the repeated p-values, or directly referring to *Test action: reject (2)*, one can conclude that the study claimed success: the first active treatment arm was already tested significant at the second stage, while the second active treatment arm appears superior to control at the final stage.

The analyses can be visualized by plotting the repeated confidence intervals; one can see how the intervals keep narrowing along the stages because of the increasing cumulative sample sizes and the consistent trend observed across stages:

`plot(results_3, type = 2)`

Assume that the first-stage dataset does not differ from the first generic example (and thus neither do the analysis results); then a second-stage dataset could be:

```
genData_4 <- getDataset(
  events1 = c(4, 9),
  events2 = c(8, 23),
  events3 = c(16, 15),
  sampleSizes1 = c(153, 155),
  sampleSizes2 = c(157, 155),
  sampleSizes3 = c(156, 155)
)
```

Note the high event number in the second active treatment arm.

Analysis results:

```
results_4 <- getAnalysisResults(
  design = d_IN,
  dataInput = genData_4,
  directionUpper = FALSE,
  intersectionTest = "Simes"
)
```

`kable(summary(results_4))`

**Multi-arm analysis results for a binary endpoint (2 active arms vs. control)**

Stage | 1 | 2 | 3 |
---|---|---|---|
Fixed weight | 0.577 | 0.577 | 0.577 |
Efficacy boundary (z-value scale) | 3.710 | 2.511 | 1.993 |
Futility boundary (z-value scale) | 0.149 | 0.414 | |
Cumulative alpha spent | 0.0001 | 0.0060 | 0.0250 |
Stage level | 0.0001 | 0.0060 | 0.0231 |
Cumulative effect size (1) | -0.076 | -0.057 | |
Cumulative effect size (2) | -0.052 | 0.000 | |
Cumulative treatment rate (1) | 0.026 | 0.042 | |
Cumulative treatment rate (2) | 0.051 | 0.099 | |
Cumulative control rate | 0.103 | 0.100 | |
Stage-wise test statistic (1) | -2.730 | -1.275 | |
Stage-wise test statistic (2) | -1.716 | 1.385 | |
Stage-wise p-value (1) | 0.0032 | 0.1011 | |
Stage-wise p-value (2) | 0.0431 | 0.9170 | |
Adjusted stage-wise p-value (1, 2) | 0.0063 | 0.2023 | |
Adjusted stage-wise p-value (1) | 0.0032 | 0.1011 | |
Adjusted stage-wise p-value (2) | 0.0431 | 0.9170 | |
Overall adjusted test statistic (1, 2) | 2.493 | 2.352 | |
Overall adjusted test statistic (1) | 2.730 | 2.832 | |
Overall adjusted test statistic (2) | 1.716 | 0.234 | |
Test action: reject (1) | FALSE | FALSE | |
Test action: reject (2) | FALSE | FALSE | |
Conditional rejection probability (1) | 0.2907 | 0.4500 | |
Conditional rejection probability (2) | 0.1204 | 0.0009 | |
95% repeated confidence interval (1) | [-0.212; 0.043] | [-0.125; 0.003] | |
95% repeated confidence interval (2) | [-0.191; 0.079] | [-0.080; 0.075] | |
Repeated p-value (1) | 0.1150 | 0.0340 | |
Repeated p-value (2) | 0.2429 | 0.2429 | |

Legend:

- *(i)*: results of treatment arm i vs. control arm
- *(i, j, …)*: comparison of treatment arms ‘i, j, …’ vs. control arm

As one can see, the global null hypothesis cannot be rejected, since 2.352 < 2.511. However, the second treatment arm leads to a stop due to futility: after the global null is not rejected, its overall adjusted test statistic of 0.234 falls below the futility boundary of 0.414. This study course would therefore lead to continuation with active treatment arm 1 only.
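The futility decision for arm 2 can be traced with a one-line comparison on the z-value scale, using the values from the output above:

```r
# futility check for arm 2 at stage 2 on the z-value scale
z2_overall <- 0.234  # overall adjusted test statistic (2) at stage 2
futility_2 <- 0.414  # futility boundary at stage 2
z2_overall < futility_2  # TRUE -> arm 2 is stopped for futility
```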

Subsequent final stage dataset:

```
genData_5 <- getDataset(
  events1 = c(4, 9, 7),
  events2 = c(8, 23, NA),
  events3 = c(16, 15, 16),
  sampleSizes1 = c(153, 155, 165),
  sampleSizes2 = c(157, 155, NA),
  sampleSizes3 = c(156, 155, 160)
)
kable(summary(genData_5))
```

**Dataset of multi-arm rates**

The dataset contains the sample sizes and events of two treatment groups and one control group. The total number of looks is three; stage-wise and cumulative data are included.

Stage | 1 | 1 | 1 | 2 | 2 | 2 | 3 | 3 | 3 |
---|---|---|---|---|---|---|---|---|---|
Group | 1 | 2 | 3 | 1 | 2 | 3 | 1 | 2 | 3 |
Stage-wise sample size | 153 | 157 | 156 | 155 | 155 | 155 | 165 | NA | 160 |
Cumulative sample size | 153 | 157 | 156 | 308 | 312 | 311 | 473 | NA | 471 |
Stage-wise number of events | 4 | 8 | 16 | 9 | 23 | 15 | 7 | NA | 16 |
Cumulative number of events | 4 | 8 | 16 | 13 | 31 | 31 | 20 | NA | 47 |

Note that recruitment for treatment arm 2 was terminated after stage 2.

Final stage analysis:

```
results_5 <- getAnalysisResults(
  design = d_IN,
  dataInput = genData_5,
  directionUpper = FALSE,
  intersectionTest = "Simes"
)
```

`kable(summary(results_5))`

**Multi-arm analysis results for a binary endpoint (2 active arms vs. control)**

Stage | 1 | 2 | 3 |
---|---|---|---|
Fixed weight | 0.577 | 0.577 | 0.577 |
Efficacy boundary (z-value scale) | 3.710 | 2.511 | 1.993 |
Futility boundary (z-value scale) | 0.149 | 0.414 | |
Cumulative alpha spent | 0.0001 | 0.0060 | 0.0250 |
Stage level | 0.0001 | 0.0060 | 0.0231 |
Cumulative effect size (1) | -0.076 | -0.057 | -0.058 |
Cumulative effect size (2) | -0.052 | 0.000 | |
Cumulative treatment rate (1) | 0.026 | 0.042 | 0.042 |
Cumulative treatment rate (2) | 0.051 | 0.099 | |
Cumulative control rate | 0.103 | 0.100 | 0.100 |
Stage-wise test statistic (1) | -2.730 | -1.275 | -2.024 |
Stage-wise test statistic (2) | -1.716 | 1.385 | |
Stage-wise p-value (1) | 0.0032 | 0.1011 | 0.0215 |
Stage-wise p-value (2) | 0.0431 | 0.9170 | |
Adjusted stage-wise p-value (1, 2) | 0.0063 | 0.2023 | 0.0215 |
Adjusted stage-wise p-value (1) | 0.0032 | 0.1011 | 0.0215 |
Adjusted stage-wise p-value (2) | 0.0431 | 0.9170 | |
Overall adjusted test statistic (1, 2) | 2.493 | 2.352 | 3.089 |
Overall adjusted test statistic (1) | 2.730 | 2.832 | 3.481 |
Overall adjusted test statistic (2) | 1.716 | 0.234 | |
Test action: reject (1) | FALSE | FALSE | TRUE |
Test action: reject (2) | FALSE | FALSE | FALSE |
Conditional rejection probability (1) | 0.2907 | 0.4500 | |
Conditional rejection probability (2) | 0.1204 | 0.0009 | |
95% repeated confidence interval (1) | [-0.212; 0.043] | [-0.125; 0.003] | [-0.102; -0.017] |
95% repeated confidence interval (2) | [-0.191; 0.079] | [-0.080; 0.075] | |
Repeated p-value (1) | 0.1150 | 0.0340 | 0.0010 |
Repeated p-value (2) | 0.2429 | 0.2429 | |

Legend:

- *(i)*: results of treatment arm i vs. control arm
- *(i, j, …)*: comparison of treatment arms ‘i, j, …’ vs. control arm

Since, by definition, success of one treatment arm suffices to declare study success, this is the conclusion reached here due to the significance of treatment arm 1 at the final stage. Stopping treatment arm 2 earlier therefore did not influence the positive result.

A landmark analysis is an analysis in which, at a predefined and fixed point in time, the survival probabilities at that specific time point are compared between different groups, such as an active treatment and a control group.

As previously mentioned, rpact itself does not support comparing survival probability estimates at a specific point in time with the standard error calculated by Greenwood's formula, but is based on simple rate comparisons only. However, the R package gestate supports various survival analysis methods, including landmark analysis. First, the package needs to be loaded:

`library(gestate)`

Then, since gestate's functionality is based on Curve objects and the analysis approach is based on survival data, the distributions of the data in the treatment and control groups need to be specified. Assume that the data in the control group follow an Exp(0.1) distribution and that both treatment groups follow an Exp(0.05) distribution, i.e., a hazard ratio of 0.5, heuristically representing the assumed treatment effect of a 50% rate reduction; other options would be to assume a Weibull or a piecewise exponential distribution. This is initialized using:

```
effect <- 0.5
# initializing assumed distributions of different treatment arms as well as control
dist_c <- Exponential(lambda = 0.1)
dist_t1 <- Exponential(lambda = 0.1 * (1 - effect))
dist_t2 <- Exponential(lambda = 0.1 * (1 - effect))
```
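As a plausibility check of these assumptions, the survival probabilities implied by the exponential model can be computed directly, since S(t) = exp(-lambda * t). At, e.g., t = 7 months (the stage duration used below), the assumed curves imply:

```r
# implied survival probabilities under the exponential assumptions
lambda_c <- 0.1              # control hazard
lambda_t <- 0.1 * (1 - 0.5)  # treatment hazard (50% rate reduction)
exp(-lambda_c * 7)  # ~0.497 implied 7-month survival, control
exp(-lambda_t * 7)  # ~0.705 implied 7-month survival, treatment arms
```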

Then another curve needs to be initialized: the recruitment curve. gestate covers assumptions such as instantaneous, linear, or piecewise linear recruitment. In many applications, assuming linear (i.e., constant-rate) recruitment is suitable, so that is what is used here as well. Note that the subject numbers equal the numbers calculated in the earlier sample size chapter and, since measuring study duration in months is common, a recruitment length of 6 months per stage is assumed. The data will later be used to estimate survival probabilities and standard errors, which subsequently serve as input for the rpact continuous endpoint module. Since the means, standard deviations, and sample sizes in the dataset input all need to be stage-wise because of the use of the inverse normal combination test, the data simulation is also done separately per stage and per arm, resulting in a maximum of 9 simulated datasets in this application. Further, since the information levels are spread equally across the three stages, each stage-wise dataset is based on a recruitment of 157 subjects per arm and a recruitment length of 6 months. Assuming an individual observation time of 1 month, this results in a total duration of 7 months per stage:

```
# equally sized arms and stages
n <- 157
# maximum duration of each stage
maxtime <- 7
# initialize recruitment for control group
recruitment_c <- LinearR(Nactive = 0, Ncontrol = n, rlength = 6)
# initialize recruitment for treatment groups, here: equal
recruitment_t1 <- LinearR(Nactive = n, Ncontrol = 0, rlength = 6)
recruitment_t2 <- LinearR(Nactive = n, Ncontrol = 0, rlength = 6)
```

As already mentioned in the chapter on the initialization of the design, the interim analyses are performed at equally spaced information fractions of 1/3 and 2/3. Since information here correlates with sample size, one can conclude that the interims take place after one third and after two thirds of the total planned recruitment. Censoring could be specified through the variables `active_dcurve` and `control_dcurve`; however, in trials where no or only little censoring/dropout is a reasonable assumption, these variables can be left at their default (`Blank()`). The only remaining type of censoring is then administrative censoring due to the end of observation, i.e., subjects without an event before the study ends are considered censored.

Having made assumptions about the data distribution, recruitment, and censoring, patient-level survival data can be simulated as follows. Note that the following simulations and data manipulations are demonstrated for the first-stage data; the second- and third-stage data are simulated and manipulated analogously.

```
# simulating data for all arms and stages independently (needed for the dataset input)
# first stage, all arms; the last index represents the stage
example_data_long_t1_1 <- simulate_trials(
  active_ecurve = dist_t1,
  control_ecurve = Blank(),
  rcurve = recruitment_t1,
  assess = maxtime,
  iterations = 1,
  seed = 1,
  detailed_output = TRUE
)
example_data_long_t2_1 <- simulate_trials(
  active_ecurve = dist_t2,
  control_ecurve = Blank(),
  rcurve = recruitment_t2,
  assess = maxtime,
  iterations = 1,
  seed = 2,
  detailed_output = TRUE
)
example_data_long_c_1 <- simulate_trials(
  active_ecurve = Blank(),
  control_ecurve = dist_c,
  rcurve = recruitment_c,
  assess = maxtime,
  iterations = 1,
  seed = 3,
  detailed_output = TRUE
)
kable(head(example_data_long_t1_1, 10))
```

Time | Censored | Trt | Iter | ETime | CTime | Rec_Time | Assess | Max_F | RCTime |
---|---|---|---|---|---|---|---|---|---|
4.136118 | 1 | 2 | 1 | 15.103637 | Inf | 2.8638823 | 7 | Inf | 4.136118 |
2.375578 | 1 | 2 | 1 | 23.632856 | Inf | 4.6244223 | 7 | Inf | 2.375578 |
2.914134 | 0 | 2 | 1 | 2.914134 | Inf | 0.1667227 | 7 | Inf | 6.833277 |
2.795905 | 0 | 2 | 1 | 2.795905 | Inf | 3.1638647 | 7 | Inf | 3.836135 |
1.718086 | 1 | 2 | 1 | 8.721372 | Inf | 5.2819144 | 7 | Inf | 1.718086 |
4.761620 | 1 | 2 | 1 | 57.899371 | Inf | 2.2383802 | 7 | Inf | 4.761620 |
6.712245 | 1 | 2 | 1 | 24.591241 | Inf | 0.2877548 | 7 | Inf | 6.712245 |
6.168230 | 1 | 2 | 1 | 10.793657 | Inf | 0.8317695 | 7 | Inf | 6.168230 |
5.071047 | 1 | 2 | 1 | 19.131350 | Inf | 1.9289527 | 7 | Inf | 5.071047 |
2.940920 | 0 | 2 | 1 | 2.940920 | Inf | 0.9289897 | 7 | Inf | 6.071010 |

The table above provides information on the patient level, such as the time (to event), treatment group, event time (if observed), assessment timing, and recruitment time. To ensure that every subject has an individual observation time of only 1 month, one can manually set the censoring indicator to 1 whenever no event has happened or the simulation indicates that an event happened only after the fixed observation time.

```
# if the time to event (first column) exceeds the observation time of 1 month,
# the subject is censored
example_data_long_t1_1[which(example_data_long_t1_1[, 1] > 1), 2] <- 1
example_data_long_t2_1[which(example_data_long_t2_1[, 1] > 1), 2] <- 1
example_data_long_c_1[which(example_data_long_c_1[, 1] > 1), 2] <- 1
```

Since the administrative censoring due to the end of observation has been manipulated manually, the censoring time in the last column must then be corrected according to these changes:

```
# if the censoring indicator == 1, the censoring time is set to the event time
# according to the previous changes
# stage 1
example_data_long_t1_1[which(example_data_long_t1_1[, 2] == 1), 9] <-
  example_data_long_t1_1[which(example_data_long_t1_1[, 2] == 1), 1]
example_data_long_t2_1[which(example_data_long_t2_1[, 2] == 1), 9] <-
  example_data_long_t2_1[which(example_data_long_t2_1[, 2] == 1), 1]
example_data_long_c_1[which(example_data_long_c_1[, 2] == 1), 9] <-
  example_data_long_c_1[which(example_data_long_c_1[, 2] == 1), 1]
```

Though working on subject-level data (if available) is commonly recommended, one could be interested in tables containing the time, the number of subjects at risk, the survival probabilities, and the respective standard errors, henceforth referred to as life tables. After setting the assessment time to the defined maximum of 7 months, the life tables are created using the commands `Surv()` and `survfit()` from the R package survival:

```
# load the survival package (install it first if necessary)
library(survival)
# setting the desired assessment time
example_data_short_t1_1 <- set_assess_time(example_data_long_t1_1, maxtime)
example_data_short_t2_1 <- set_assess_time(example_data_long_t2_1, maxtime)
example_data_short_c_1 <- set_assess_time(example_data_long_c_1, maxtime)
# creating life tables for each group depending on the stage
# at the first landmark time point
lt_t1_1 <- summary(survfit(Surv(
  example_data_short_t1_1[, "Time"],
  1 - example_data_short_t1_1[, "Censored"]
) ~ 1, error = "greenwood"))
lt_t2_1 <- summary(survfit(Surv(
  example_data_short_t2_1[, "Time"],
  1 - example_data_short_t2_1[, "Censored"]
) ~ 1, error = "greenwood"))
lt_c_1 <- summary(survfit(Surv(
  example_data_short_c_1[, "Time"],
  1 - example_data_short_c_1[, "Censored"]
) ~ 1, error = "greenwood"))
kable(head(cbind(
  "Time" = lt_t1_1$time, "n.risk" = lt_t1_1$n.risk,
  "n.event" = lt_t1_1$n.event, "surv.prob." = lt_t1_1$surv,
  "Greenwoods SE" = lt_t1_1$std.err
), 10))
```

Time | n.risk | n.event | surv.prob. | Greenwoods SE |
---|---|---|---|---|
0.4050083 | 157 | 1 | 0.9936306 | 0.0063491 |
0.7430455 | 156 | 1 | 0.9872611 | 0.0089502 |
0.7453705 | 155 | 1 | 0.9808917 | 0.0109263 |
1.0411090 | 152 | 1 | 0.9744385 | 0.0126170 |
1.1852241 | 143 | 1 | 0.9676242 | 0.0142506 |
1.1887832 | 142 | 1 | 0.9608100 | 0.0156951 |
1.3882469 | 136 | 1 | 0.9537452 | 0.0170959 |
1.7934816 | 127 | 1 | 0.9462354 | 0.0185375 |
1.8221775 | 125 | 1 | 0.9386655 | 0.0198748 |
1.9769123 | 119 | 1 | 0.9307776 | 0.0212154 |

The table created above contains the time, the number of subjects at risk, the number of events, the estimated survival probability, and the estimated standard error, thus allowing for a Kaplan-Meier plot to visualize the development of the survival probability over time. Note that choosing `error = "greenwood"` leads to the desired SE estimation. One could also discretize the time into intervals to obtain life tables with consistent time steps, if desired. Exemplary Kaplan-Meier plots for the different treatment groups at the different stages are created below:
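Greenwood's formula, SE(t) = S(t) * sqrt(sum over events of d_i / (n_i * (n_i - d_i))), can be verified by hand for the first row of the life table above (one event among 157 subjects at risk):

```r
# Kaplan-Meier estimate and Greenwood SE after the first event
surv_1 <- 156 / 157                    # S(t) after 1 event among 157 at risk
se_1 <- surv_1 * sqrt(1 / (157 * 156)) # Greenwood standard error
surv_1  # ~0.9936, as in the first row of the life table
se_1    # ~0.00635, matching the "Greenwoods SE" column
```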

While the survival probabilities of the treatment arms develop rather similarly across the independently simulated stages, the figures show that, apart from a few early exceptions, the survival probabilities of the control group fall below those of the treatment groups almost constantly, indicating that the groups might actually differ (i.e., potentially statistically significantly).

As previously mentioned, the continuous module of rpact is now used to perform the landmark analysis of the gestate-simulated data. Note that this is a valid approximation whenever it is reasonable to assume that the survival probability estimates are normally distributed. To initialize the continuous datasets, the estimated survival probabilities from the life tables above are used as means, the Greenwood SE estimates, transformed to the standard deviation scale by multiplying with sqrt(n), are used as standard deviations, and the sample size from the earlier sample size calculation is used, under the simplifying assumption of no dropouts. The first-stage continuous dataset therefore is:

```
# landmark time point in months (note: `lm` shadows stats::lm here)
lm <- 7
# first stage dataset
dataset_1 <- getDataset(
  # accessing the survival probability values at the time point closest to the desired one
  means1 = c(lt_t1_1$surv[which(abs(lt_t1_1$time - lm) == min(abs(lt_t1_1$time - lm)))]),
  means2 = c(lt_t2_1$surv[which(abs(lt_t2_1$time - lm) == min(abs(lt_t2_1$time - lm)))]),
  means3 = c(lt_c_1$surv[which(abs(lt_c_1$time - lm) == min(abs(lt_c_1$time - lm)))]),
  stDevs1 = c(lt_t1_1$std.err[which(abs(lt_t1_1$time - lm) ==
    min(abs(lt_t1_1$time - lm)))] * sqrt(n)),
  stDevs2 = c(lt_t2_1$std.err[which(abs(lt_t2_1$time - lm) ==
    min(abs(lt_t2_1$time - lm)))] * sqrt(n)),
  stDevs3 = c(lt_c_1$std.err[which(abs(lt_c_1$time - lm) ==
    min(abs(lt_c_1$time - lm)))] * sqrt(n)),
  n1 = c(n),
  n2 = c(n),
  n3 = c(n)
)
kable(summary(dataset_1))
```

**Dataset of multi-arm means**

The dataset contains the sample sizes, means, and standard deviations of two treatment groups and one control group.

Stage | 1 | 1 | 1 |
---|---|---|---|
Group | 1 | 2 | 3 |
Sample size | 157 | 157 | 157 |
Mean | 0.734 | 0.761 | 0.543 |
Standard deviation | 0.657 | 0.661 | 0.811 |

The explicit estimates can be seen using the `summary()` command.

Note that the highest index (3) represents the control group. Since the time in the life-table data has not been structured into intervals of equal length, the datasets mostly do not have survival probability and standard error entries exactly at the given landmark time point, which is why the closest point in time is chosen here. The analysis of the data according to the group sequential design in this vignette can then be done by:
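The repeated `which(abs(...) == min(abs(...)))` lookups can be wrapped in a small helper. `at_landmark()` below is a hypothetical convenience function, not part of rpact or gestate, that extracts a life-table column at the time point closest to the landmark:

```r
# hypothetical helper: value of a life-table column at the time closest to `lm`
at_landmark <- function(lifetable, column, lm) {
  idx <- which.min(abs(lifetable$time - lm))
  lifetable[[column]][idx]
}

# e.g. (assuming lt_t1_1 from above is available):
# means1  <- at_landmark(lt_t1_1, "surv", 7)
# stDevs1 <- at_landmark(lt_t1_1, "std.err", 7) * sqrt(n)
```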

```
result_1 <- getAnalysisResults(
  dataInput = dataset_1,
  design = d_IN,
  intersectionTest = "Simes",
  normalApproximation = TRUE,
  varianceOption = "pairwisePooled"
)
```

`kable(summary(result_1))`

**Multi-arm analysis results for a continuous endpoint (2 active arms vs. control)**

Sequential analysis with 3 looks (inverse normal combination test design). The results were calculated using a multi-arm t-test (one-sided, alpha = 0.025), Simes intersection test, normal approximation test, pairwise pooled variances option. H0: mu(i) - mu(control) = 0 against H1: mu(i) - mu(control) > 0.

Stage | 1 | 2 | 3 |
---|---|---|---|
Fixed weight | 0.577 | 0.577 | 0.577 |
Efficacy boundary (z-value scale) | 3.710 | 2.511 | 1.993 |
Futility boundary (z-value scale) | 0.149 | 0.414 | |
Cumulative alpha spent | 0.0001 | 0.0060 | 0.0250 |
Stage level | 0.0001 | 0.0060 | 0.0231 |
Cumulative effect size (1) | 0.191 | | |
Cumulative effect size (2) | 0.217 | | |
Cumulative (pooled) standard deviation | 0.714 | | |
Stage-wise test statistic (1) | 2.287 | | |
Stage-wise test statistic (2) | 2.601 | | |
Stage-wise p-value (1) | 0.0111 | | |
Stage-wise p-value (2) | 0.0046 | | |
Adjusted stage-wise p-value (1, 2) | 0.0093 | | |
Adjusted stage-wise p-value (1) | 0.0111 | | |
Adjusted stage-wise p-value (2) | 0.0046 | | |
Overall adjusted test statistic (1, 2) | 2.354 | | |
Overall adjusted test statistic (1) | 2.287 | | |
Overall adjusted test statistic (2) | 2.601 | | |
Test action: reject (1) | FALSE | | |
Test action: reject (2) | FALSE | | |
Conditional rejection probability (1) | 0.2359 | | |
Conditional rejection probability (2) | 0.2530 | | |
95% repeated confidence interval (1) | [-0.133; 0.514] | | |
95% repeated confidence interval (2) | [-0.107; 0.542] | | |
Repeated p-value (1) | 0.1425 | | |
Repeated p-value (2) | 0.1331 | | |

Legend:

- *(i)*: results of treatment arm i vs. control arm
- *(i, j, …)*: comparison of treatment arms ‘i, j, …’ vs. control arm

Since the overall adjusted test statistic for the intersection hypothesis (2.354) does not exceed the first-stage critical value of 3.710, the intersection hypothesis of no effect in either of the treatment groups cannot be rejected; the repeated p-values lead to the same result. Furthermore, since none of the applicable test statistics fall below the futility bound of 0.149, the study ordinarily proceeds to the second stage:

```
dataset_2 <- getDataset(
  means1 = c(
    lt_t1_1$surv[which(abs(lt_t1_1$time - lm) == min(abs(lt_t1_1$time - lm)))],
    lt_t1_2$surv[which(abs(lt_t1_2$time - lm) == min(abs(lt_t1_2$time - lm)))]
  ),
  means2 = c(
    lt_t2_1$surv[which(abs(lt_t2_1$time - lm) == min(abs(lt_t2_1$time - lm)))],
    lt_t2_2$surv[which(abs(lt_t2_2$time - lm) == min(abs(lt_t2_2$time - lm)))]
  ),
  means3 = c(
    lt_c_1$surv[which(abs(lt_c_1$time - lm) == min(abs(lt_c_1$time - lm)))],
    lt_c_2$surv[which(abs(lt_c_2$time - lm) == min(abs(lt_c_2$time - lm)))]
  ),
  stDevs1 = c(
    lt_t1_1$std.err[which(abs(lt_t1_1$time - lm) ==
      min(abs(lt_t1_1$time - lm)))] * sqrt(n),
    lt_t1_2$std.err[which(abs(lt_t1_2$time - lm) ==
      min(abs(lt_t1_2$time - lm)))] * sqrt(n)
  ),
  stDevs2 = c(
    lt_t2_1$std.err[which(abs(lt_t2_1$time - lm) ==
      min(abs(lt_t2_1$time - lm)))] * sqrt(n),
    lt_t2_2$std.err[which(abs(lt_t2_2$time - lm) ==
      min(abs(lt_t2_2$time - lm)))] * sqrt(n)
  ),
  stDevs3 = c(
    lt_c_1$std.err[which(abs(lt_c_1$time - lm) ==
      min(abs(lt_c_1$time - lm)))] * sqrt(n),
    lt_c_2$std.err[which(abs(lt_c_2$time - lm) ==
      min(abs(lt_c_2$time - lm)))] * sqrt(n)
  ),
  n1 = c(n, n),
  n2 = c(n, n),
  n3 = c(n, n)
)
result_2 <- getAnalysisResults(
  dataInput = dataset_2,
  design = d_IN,
  intersectionTest = "Simes",
  normalApproximation = TRUE,
  varianceOption = "pairwisePooled"
)
```

`kable(summary(result_2))`

**Multi-arm analysis results for a continuous endpoint (2 active arms vs. control)**

Sequential analysis with 3 looks (inverse normal combination test design). The results were calculated using a multi-arm t-test (one-sided, alpha = 0.025), Simes intersection test, normal approximation test, pairwise pooled variances option. H0: mu(i) - mu(control) = 0 against H1: mu(i) - mu(control) > 0.

Stage | 1 | 2 | 3 |
---|---|---|---|
Fixed weight | 0.577 | 0.577 | 0.577 |
Efficacy boundary (z-value scale) | 3.710 | 2.511 | 1.993 |
Futility boundary (z-value scale) | 0.149 | 0.414 | |
Cumulative alpha spent | 0.0001 | 0.0060 | 0.0250 |
Stage level | 0.0001 | 0.0060 | 0.0231 |
Cumulative effect size (1) | 0.191 | 0.226 | |
Cumulative effect size (2) | 0.217 | 0.329 | |
Cumulative (pooled) standard deviation | 0.714 | 0.931 | |
Stage-wise test statistic (1) | 2.287 | 1.769 | |
Stage-wise test statistic (2) | 2.601 | 3.512 | |
Stage-wise p-value (1) | 0.0111 | 0.0384 | |
Stage-wise p-value (2) | 0.0046 | 0.0002 | |
Adjusted stage-wise p-value (1, 2) | 0.0093 | 0.0004 | |
Adjusted stage-wise p-value (1) | 0.0111 | 0.0384 | |
Adjusted stage-wise p-value (2) | 0.0046 | 0.0002 | |
Overall adjusted test statistic (1, 2) | 2.354 | 4.015 | |
Overall adjusted test statistic (1) | 2.287 | 2.868 | |
Overall adjusted test statistic (2) | 2.601 | 4.323 | |
Test action: reject (1) | FALSE | TRUE | |
Test action: reject (2) | FALSE | TRUE | |
Conditional rejection probability (1) | 0.2359 | 0.7272 | |
Conditional rejection probability (2) | 0.2530 | 0.9870 | |
95% repeated confidence interval (1) | [-0.133; 0.514] | [-0.005; 0.441] | |
95% repeated confidence interval (2) | [-0.107; 0.542] | [0.096; 0.527] | |
Repeated p-value (1) | 0.1425 | 0.0119 | |
Repeated p-value (2) | 0.1331 | 0.0007 | |

Legend:

*(i)*: results of treatment arm i vs. control arm

*(i, j, …)*: comparison of treatment arms ‘i, j, …’ vs. control arm

Again referring to the overall adjusted test statistics: since 4.015 > 2.511, the global intersection hypothesis can be rejected in the second stage. Subsequently, the hypothesis of no effect in the first treatment group can be rejected because 2.868 > 2.511, and the hypothesis of no effect in the second treatment group can be rejected because 4.323 also exceeds the applicable critical value. Another testing approach is to compare the repeated p-values listed at the end of the output (0.0119 and 0.0007) to the full significance level alpha = 0.025.
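These decisions can be checked by hand in base R; the values below are taken from the summary table above, where the stage-2 efficacy boundary is 2.511 on the z-value scale:

```r
# Overall adjusted test statistics at stage 2 vs. the stage-2 boundary
z2 <- c(global = 4.015, arm1 = 2.868, arm2 = 4.323)
z2 > 2.511 # all TRUE: the intersection and both elementary hypotheses are rejected

# Alternatively, compare the repeated p-values to the full significance level
c(0.0119, 0.0007) < 0.025 # TRUE TRUE
```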

As mentioned in the context of the simulations in chapter 5, study success can only be concluded once both treatment arms have been shown to be significantly better than control. As this is the case here, the study would have finished successfully at the second stage.

This vignette demonstrated how to implement a MAMS study design for the case in which the futility bounds are known only on the treatment effect scale, how to perform simulations with special regard to treatment selection, and how to analyze generic data using the binary module or perform landmark analyses with the support of gestate. Especially regarding the final chapter on landmark analysis, subsequent research could further investigate how the results and characteristics such as alpha-level control and power behave.

System: rpact 3.5.1, R version 4.3.2 (2023-10-31 ucrt), platform: x86_64-w64-mingw32

To cite R in publications use:

*R: A Language and Environment for Statistical Computing*. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/. To cite package ‘rpact’ in publications use:

*rpact: Confirmatory Adaptive Clinical Trial Design and Analysis*. R package version 3.5.1, https://www.rpact.com, https://github.com/rpact-com/rpact, https://rpact-com.github.io/rpact/, https://www.rpact.org.

This document provides examples for simulating multi-arm multi-stage (MAMS) designs for testing means in many-to-one comparisons. For designs with multiple arms, rpact enables the simulation of designs that use the **closed combination testing principle**. For a description of the methodology please refer to Part III of the book “Group Sequential and Confirmatory Adaptive Designs in Clinical Trials” by Gernot Wassmer & Werner Brannath. Essentially, we show in this vignette how to reproduce part of the simulation results provided in the paper “On Sample Size Determination in Multi-Arm Confirmatory Adaptive Designs” by Gernot Wassmer (Journal of Biopharmaceutical Statistics, 2011).

**First, load the rpact package**

```
library(rpact)
packageVersion("rpact") # version 3.0.1 or later is required
```

`[1] '3.5.1'`

rpact enables the assessment of sample sizes in multi-arm trials, including the selection of treatment arms. We first consider the simple case of a two-stage design with O’Brien & Fleming boundaries and three active treatment arms that are tested against control. Suppose the three treatment arms refer to three different, increasing doses: “low”, “medium”, and “high”. We assume that the highest dose has a response difference of 10 as compared to control and that there is a linear dose-response relationship. The standard deviation is assumed to be 15. At interim, the treatment arm with the highest observed response as compared to placebo is selected for testing at the second stage.

One way to adjust for the multiple comparison situation is to use the Bonferroni correction for testing the intersection hypotheses in the closed system of hypotheses. It will turn out that using alpha/3 = 0.025/3 instead of alpha = 0.025 for the sample size calculation for the highest dose in the two-arm fixed sample size case serves as a reasonable first guess for the sample size in the multi-arm case. That is, for alpha = 0.025/3 and 90% power, we calculate the sample size using the commands

```
nsFixed <- getSampleSizeMeans(alpha = 0.025 / 3, beta = 0.1, alternative = 10, stDev = 15)
kable(summary(nsFixed))
```

**Sample size calculation for a continuous endpoint**

Fixed sample analysis, significance level 0.83% (one-sided). The results were calculated for a two-sample t-test, H0: mu(1) - mu(2) = 0, H1: effect = 10, standard deviation = 15, power 90%.

Stage | Fixed |
---|---|
Efficacy boundary (z-value scale) | 2.394 |
Number of subjects | 124.5 |
One-sided local significance level | 0.0083 |
Efficacy boundary (t) | 6.526 |

Legend:

*(t)*: treatment effect scale

yielding 125 as the total number of subjects and hence n = 63 subjects per treatment arm in order to achieve the desired power. As a first guess for the multi-arm two-stage case, we choose 30 subjects per stage and treatment arm and use the following commands for evaluating the MAMS design. Note that `plannedSubjects` refers to the **cumulative sample sizes over the stages per selected active arm**:

```
designIN <- getDesignInverseNormal(kMax = 2, alpha = 0.025, typeOfDesign = "OF")
maxNumberOfIterations <- 1000
simBonfMAMS <- getSimulationMultiArmMeans(
design = designIN,
activeArms = 3,
muMaxVector = c(10),
stDev = 15,
plannedSubjects = c(30, 60),
intersectionTest = "Bonferroni",
typeOfShape = "linear",
typeOfSelection = "best",
successCriterion = "all",
maxNumberOfIterations = maxNumberOfIterations,
seed = 1234
)
kable(summary(simBonfMAMS))
```

**Simulation of a continuous endpoint (multi-arm design)**

Sequential analysis with a maximum of 2 looks (inverse normal combination test design), overall significance level 2.5% (one-sided). The results were simulated for a multi-arm comparisons for means (3 treatments vs. control), H0: mu(i) - mu(control) = 0, power directed towards larger values, H1: mu_max = 10, standard deviation = 15, planned cumulative sample size = c(30, 60), effect shape = linear, intersection test = Bonferroni, selection = best, effect measure based on effect estimate, success criterion: all, simulation runs = 1000, seed = 1234.

Stage | 1 | 2 |
---|---|---|
Fixed weight | 0.707 | 0.707 |
Efficacy boundary (z-value scale) | 2.797 | 1.977 |
Stage levels (one-sided) | 0.0026 | 0.0240 |
Reject at least one | 0.8850 | |
Rejected arms per stage | | |
Treatment arm 1 | 0.0130 | 0.0080 |
Treatment arm 2 | 0.1120 | 0.0890 |
Treatment arm 3 | 0.3030 | 0.4510 |
Success per stage | 0.0060 | 0.8790 |
Exit probability for futility | 0.0040 | |
Expected number of subjects under H1 | 179.4 | |
Overall exit probability | 0.0100 | |
Stagewise number of subjects | | |
Treatment arm 1 | 30.0 | 0.7 |
Treatment arm 2 | 30.0 | 5.6 |
Treatment arm 3 | 30.0 | 23.7 |
Control arm | 30.0 | 30.0 |
Selected arms | | |
Treatment arm 1 | 1.0000 | 0.0220 |
Treatment arm 2 | 1.0000 | 0.1850 |
Treatment arm 3 | 1.0000 | 0.7830 |
Number of active arms | 3.000 | 1.000 |
Conditional power (achieved) | | 0.5785 |

Legend:

*(i)*: treatment arm i

We see that the power, which is the probability to reject at least one of the three corresponding hypotheses, is about 88% if a linear dose-response relationship is assumed. Note that there is a small probability to stop the trial for futility, which is due to the use of the Bonferroni correction yielding adjusted p-values equal to 1 at interim (making a rejection at stage 2 impossible).
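The mechanism behind these futility stops can be illustrated with a minimal sketch of the Bonferroni-adjusted p-value of an intersection hypothesis (the helper function and the p-values are ours for illustration, not part of rpact):

```r
# Bonferroni-adjusted p-value of an intersection of k hypotheses
bonferroniAdjusted <- function(p) min(1, length(p) * min(p))

bonferroniAdjusted(c(0.40, 0.55, 0.62)) # 1: no rejection at stage 2 possible -> futility
bonferroniAdjusted(c(0.01, 0.30, 0.45)) # 0.03
```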

Using the Dunnett test for testing the intersection hypotheses increases the power to about 90%, which is obtained by selecting `intersectionTest = "Dunnett"`:

```
simDunnettMAMS <- getSimulationMultiArmMeans(
design = designIN,
activeArms = 3,
typeOfShape = "linear",
muMaxVector = c(10),
stDev = 15,
plannedSubjects = c(30, 60),
intersectionTest = "Dunnett",
typeOfSelection = "best",
successCriterion = "all",
maxNumberOfIterations = maxNumberOfIterations,
seed = 1234
)
kable(summary(simDunnettMAMS))
```

**Simulation of a continuous endpoint (multi-arm design)**

Sequential analysis with a maximum of 2 looks (inverse normal combination test design), overall significance level 2.5% (one-sided). The results were simulated for a multi-arm comparisons for means (3 treatments vs. control), H0: mu(i) - mu(control) = 0, power directed towards larger values, H1: mu_max = 10, standard deviation = 15, planned cumulative sample size = c(30, 60), effect shape = linear, intersection test = Dunnett, selection = best, effect measure based on effect estimate, success criterion: all, simulation runs = 1000, seed = 1234.

Stage | 1 | 2 |
---|---|---|
Fixed weight | 0.707 | 0.707 |
Efficacy boundary (z-value scale) | 2.797 | 1.977 |
Stage levels (one-sided) | 0.0026 | 0.0240 |
Reject at least one | 0.8990 | |
Rejected arms per stage | | |
Treatment arm 1 | 0.0130 | 0.0080 |
Treatment arm 2 | 0.1140 | 0.0910 |
Treatment arm 3 | 0.3080 | 0.4580 |
Success per stage | 0.0060 | 0.8930 |
Expected number of subjects under H1 | 179.6 | |
Overall exit probability | 0.0060 | |
Stagewise number of subjects | | |
Treatment arm 1 | 30.0 | 0.7 |
Treatment arm 2 | 30.0 | 5.6 |
Treatment arm 3 | 30.0 | 23.7 |
Control arm | 30.0 | 30.0 |
Selected arms | | |
Treatment arm 1 | 1.0000 | 0.0220 |
Treatment arm 2 | 1.0000 | 0.1870 |
Treatment arm 3 | 1.0000 | 0.7850 |
Number of active arms | 3.000 | 1.000 |
Conditional power (achieved) | | 0.5761 |

Legend:

*(i)*: treatment arm i

Changing `successCriterion = "all"` to `successCriterion = "atLeastOne"` reduces the expected number of subjects considerably because the trial is stopped at interim in many more cases:

```
simDunnettMAMSatLeastOne <- getSimulationMultiArmMeans(
design = designIN,
activeArms = 3,
typeOfShape = "linear",
muMaxVector = c(10),
stDev = 15,
plannedSubjects = c(30, 60),
intersectionTest = "Dunnett",
typeOfSelection = "best",
successCriterion = "atLeastOne",
maxNumberOfIterations = maxNumberOfIterations,
seed = 1234
)
kable(summary(simDunnettMAMSatLeastOne))
```

**Simulation of a continuous endpoint (multi-arm design)**

Sequential analysis with a maximum of 2 looks (inverse normal combination test design), overall significance level 2.5% (one-sided). The results were simulated for a multi-arm comparisons for means (3 treatments vs. control), H0: mu(i) - mu(control) = 0, power directed towards larger values, H1: mu_max = 10, standard deviation = 15, planned cumulative sample size = c(30, 60), effect shape = linear, intersection test = Dunnett, selection = best, effect measure based on effect estimate, success criterion: at least one, simulation runs = 1000, seed = 1234.

Stage | 1 | 2 |
---|---|---|
Fixed weight | 0.707 | 0.707 |
Efficacy boundary (z-value scale) | 2.797 | 1.977 |
Stage levels (one-sided) | 0.0026 | 0.0240 |
Reject at least one | 0.8990 | |
Rejected arms per stage | | |
Treatment arm 1 | 0.0130 | 0.0080 |
Treatment arm 2 | 0.1140 | 0.0910 |
Treatment arm 3 | 0.3080 | 0.4580 |
Success per stage | 0.3420 | 0.5570 |
Expected number of subjects under H1 | 159.5 | |
Overall exit probability | 0.3420 | |
Stagewise number of subjects | | |
Treatment arm 1 | 30.0 | 0.8 |
Treatment arm 2 | 30.0 | 6.3 |
Treatment arm 3 | 30.0 | 22.9 |
Control arm | 30.0 | 30.0 |
Selected arms | | |
Treatment arm 1 | 1.0000 | 0.0180 |
Treatment arm 2 | 1.0000 | 0.1380 |
Treatment arm 3 | 1.0000 | 0.5020 |
Number of active arms | 3.000 | 1.000 |
Conditional power (achieved) | | 0.3989 |

Legend:

*(i)*: treatment arm i

For this example, we might conclude that choosing 30 subjects per treatment arm and stage is a reasonable choice. If, however, the effect sizes are smaller for the low and medium dose, the power might decrease and the sample size should therefore be increased. For example, assuming effect sizes of only 1 and 2 in the low and medium dose groups, respectively, the test characteristics can be obtained by using the `typeOfShape = "userDefined"` option. The effect sizes of interest are specified through `effectMatrix` (which needs to be a matrix because, in general, more than one parameter configuration per simulation run can be considered):

```
simDunnettMAMS <- getSimulationMultiArmMeans(
design = designIN,
activeArms = 3,
typeOfShape = "userDefined",
effectMatrix = matrix(c(1, 2, 10), ncol = 3),
stDev = 15,
plannedSubjects = c(30, 60),
intersectionTest = "Dunnett",
typeOfSelection = "best",
successCriterion = "atLeastOne",
maxNumberOfIterations = maxNumberOfIterations,
seed = 1234
)
kable(summary(simDunnettMAMS))
```

**Simulation of a continuous endpoint (multi-arm design)**

Sequential analysis with a maximum of 2 looks (inverse normal combination test design), overall significance level 2.5% (one-sided). The results were simulated for a multi-arm comparisons for means (3 treatments vs. control), H0: mu(i) - mu(control) = 0, power directed towards larger values, H1: mu_max = 10, standard deviation = 15, planned cumulative sample size = c(30, 60), effect shape = user defined, intersection test = Dunnett, selection = best, effect measure based on effect estimate, success criterion: at least one, simulation runs = 1000, seed = 1234.

Stage | 1 | 2 |
---|---|---|
Fixed weight | 0.707 | 0.707 |
Efficacy boundary (z-value scale) | 2.797 | 1.977 |
Stage levels (one-sided) | 0.0026 | 0.0240 |
Reject at least one | 0.8970 | |
Rejected arms per stage | | |
Treatment arm 1 | 0 | 0.0010 |
Treatment arm 2 | 0.0050 | 0.0040 |
Treatment arm 3 | 0.3060 | 0.5860 |
Success per stage | 0.3060 | 0.5910 |
Expected number of subjects under H1 | 161.6 | |
Overall exit probability | 0.3060 | |
Stagewise number of subjects | | |
Treatment arm 1 | 30.0 | 0.3 |
Treatment arm 2 | 30.0 | 0.8 |
Treatment arm 3 | 30.0 | 29.0 |
Control arm | 30.0 | 30.0 |
Selected arms | | |
Treatment arm 1 | 1.0000 | 0.0060 |
Treatment arm 2 | 1.0000 | 0.0180 |
Treatment arm 3 | 1.0000 | 0.6700 |
Number of active arms | 3.000 | 1.000 |
Conditional power (achieved) | | 0.2592 |

Legend:

*(i)*: treatment arm i

It is interesting (though actually to be expected) that the power and other test characteristics are quite similar, so the validity of the chosen sample size can be considered robust against possible deviations from the originally assumed linear dose-response relationship. Note that `Stagewise number of subjects` denotes the **conditional expected sample size in treatment arm i** and thus accounts for the fact that a treatment arm is selected given that the second stage was reached.

Since treatment arms are discontinued over the two stages, the sample size, and hence the information, is not the same across the stages. Despite this, we used the **unweighted inverse normal method** (with equal weights 1/sqrt(2)) for combining the two stages. The pre-fixed weight, however, does not have a substantial impact on the power of the procedure, as the following plot shows. The simulated power values show that over a medium range of weights the power does not change substantially, and hence it is reasonable to choose equal weights for the two stages:

```
powerValues <- c()
weights <- seq(0.05, 0.95, 0.05)
for (w in weights) {
designIN <- getDesignInverseNormal(
kMax = 2, alpha = 0.025,
informationRates = c(w, 1), typeOfDesign = "OF"
)
powerValues <- c(
powerValues,
getSimulationMultiArmMeans(
design = designIN,
activeArms = 3,
typeOfShape = "linear",
muMaxVector = c(10),
stDev = 15,
plannedSubjects = c(30, 60),
intersectionTest = "Dunnett",
typeOfSelection = "best",
successCriterion = "atLeastOne",
maxNumberOfIterations = maxNumberOfIterations,
seed = 12345
)$rejectAtLeastOne
)
}
plot(weights, powerValues,
type = "l", lwd = 3, ylim = c(0.7, 1),
xlab = "weight", ylab = "Power"
)
lines(weights, rep(0.9, length(weights)), lty = 2)
```
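For reference, the inverse normal combination statistic underlying these designs can be sketched in a few lines of base R; the function name is ours, and equal weights correspond to w1 = w2 = sqrt(0.5):

```r
# Weighted inverse normal combination of two stage-wise one-sided p-values
inverseNormal <- function(p1, p2, w1 = sqrt(0.5), w2 = sqrt(0.5)) {
  (w1 * qnorm(1 - p1) + w2 * qnorm(1 - p2)) / sqrt(w1^2 + w2^2)
}

# With equal weights this reduces to (z1 + z2) / sqrt(2)
inverseNormal(0.05, 0.05)
```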

You might assess different selection rules using the parameter `typeOfSelection`. Five options are available: `best`, `rBest`, `epsilon`, `all`, and `userDefined`. For `rBest` (select the r best treatment arms), the parameter `rValue` has to be specified; for `epsilon` (select all treatment arms that are not worse than epsilon compared to the best), the parameter `epsilonValue` has to be specified.
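The logic of the epsilon rule can be sketched in base R (the observed effects below are hypothetical):

```r
# Epsilon rule: keep every arm whose observed effect is within
# epsilonValue of the best observed effect
effectVector <- c(3, 9, 10) # hypothetical observed effects vs. control
epsilonValue <- 2
effectVector >= max(effectVector) - epsilonValue # FALSE TRUE TRUE
```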

If `userDefined` is selected, a `selectArmsFunction` depending on `effectVector` needs to be specified. Note that `effectVector` is either the test statistic or the effect difference (in absolute terms), which can be selected through the parameter `effectMeasure`. For example, using the function

```
designIN <- getDesignInverseNormal(kMax = 2, alpha = 0.025, typeOfDesign = "OF")
mySelectionFunction <- function(effectVector) {
selectedArms <- (effectVector >= c(5, 5, 5))
return(selectedArms)
}
```

defines a selection rule where all treatment arms with effect sizes exceeding 5 (with the default `effectMeasure = "effectEstimate"`) are selected. Running

```
simSelectionMAMS <- getSimulationMultiArmMeans(
design = designIN,
activeArms = 3,
typeOfShape = "linear",
muMaxVector = c(10),
stDev = 15,
plannedSubjects = c(30, 60),
intersectionTest = "Dunnett",
typeOfSelection = "userDefined",
selectArmsFunction = mySelectionFunction,
successCriterion = "atLeastOne",
maxNumberOfIterations = maxNumberOfIterations,
seed = 1234
)
kable(summary(simSelectionMAMS))
```

shows that for the second stage the expected number of selected treatment arms is 1.803, indicating that there are cases where more than one arm is selected for the second stage:

**Simulation of multi-arm means (inverse normal combination test design)**

**Design parameters**

- *Information rates*: 0.500, 1.000
- *Critical values*: 2.797, 1.977
- *Futility bounds (binding)*: -Inf
- *Cumulative alpha spending*: 0.002583, 0.025000
- *Local one-sided significance levels*: 0.002583, 0.023996
- *Significance level*: 0.0250
- *Test*: one-sided

**User defined parameters**

- *Seed*: 1234
- *Standard deviation*: 15
- *Planned cumulative subjects*: 30, 60
- *mu_max*: 10
- *Type of selection*: userDefined
- *Success criterion*: atLeastOne

**Default parameters**

- *Maximum number of iterations*: 1000
- *Planned allocation ratio*: 1
- *Calculate subjects function*: default
- *Active arms*: 3
- *Effect matrix (1)*: 3.333
- *Effect matrix (2)*: 6.667
- *Effect matrix (3)*: 10.000
- *Type of shape*: linear
- *Slope*: 1
- *Intersection test*: Dunnett
- *Adaptations*: TRUE
- *Effect measure*: effectEstimate

**Results**

- *Iterations [1]*: 1000
- *Iterations [2]*: 604
- *Reject at least one*: 0.8840
- *Rejected arms per stage (1) [1]*: 0.0140
- *Rejected arms per stage (1) [2]*: 0.0470
- *Rejected arms per stage (2) [1]*: 0.0870
- *Rejected arms per stage (2) [2]*: 0.2640
- *Rejected arms per stage (3) [1]*: 0.3050
- *Rejected arms per stage (3) [2]*: 0.5330
- *Futility stop per stage*: 0.0700
- *Early stop*: 0.3960
- *Success per stage [1]*: 0.3260
- *Success per stage [2]*: 0.5580
- *Selected arms (1) [1]*: 1.0000
- *Selected arms (1) [2]*: 0.1360
- *Selected arms (2) [1]*: 1.0000
- *Selected arms (2) [2]*: 0.3840
- *Selected arms (3) [1]*: 1.0000
- *Selected arms (3) [2]*: 0.5690
- *Selected arms (4) [1]*: 1.0000
- *Selected arms (4) [2]*: 0.6040
- *Number of active arms [1]*: 3.000
- *Number of active arms [2]*: 1.803
- *Expected number of subjects*: 170.8
- *Sample sizes (1) [1]*: 30
- *Sample sizes (1) [2]*: 6.8
- *Sample sizes (2) [1]*: 30
- *Sample sizes (2) [2]*: 19.1
- *Sample sizes (3) [1]*: 30
- *Sample sizes (3) [2]*: 28.3
- *Sample sizes (4) [1]*: 30
- *Sample sizes (4) [2]*: 30
- *Conditional power (achieved) [1]*: NA
- *Conditional power (achieved) [2]*: 0.4021

**Legend**

*(i)*: values of treatment arm i

*[k]*: values at stage k

Using `getData()` makes it possible to show how often this is the case. The following code lines calculate how often 1, 2, and 3 treatment arms were selected for the second stage:

```
dat <- getData(simSelectionMAMS)
tab <- as.matrix(table(dat[dat$stageNumber == 2, ]$iterationNumber))
round(table(tab[, 1]) / nrow(tab), 5)
```

```
      1       2       3 
0.36258 0.47185 0.16556 
```

Note that these probabilities are **conditional probabilities** (conditional on performing the second stage) and sum to one whereas the probabilities for selecting arm 1, 2, or 3 provided in the summary are **unconditional**, i.e., not conditioned on reaching the second stage. In particular, they may become small if the study often stops at interim.
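As a consistency check, these conditional selection-count probabilities multiply out to the expected number of active arms at the second stage reported in the summary above (*Number of active arms [2]*: 1.803):

```r
# P(1, 2, or 3 arms selected | second stage reached), from the getData() output above
condProbs <- c(0.36258, 0.47185, 0.16556)
sum(condProbs * 1:3) # approximately 1.803
```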

We now consider a three-stage inverse normal combination test design in which no early stops for efficacy are foreseen. At the final analysis, the full significance level of alpha = 0.025 should be used. This is achieved by defining the design through

```
designIN3Stages <- getDesignInverseNormal(
typeOfDesign = "asUser",
userAlphaSpending = c(0, 0, 0.025)
)
```

`Changed type of design to 'noEarlyEfficacy'`

As above, we plan a design with three active treatment arms to be tested against control and assume a linear dose-response relationship. We want to consider a range of maximum effect values and therefore specify `muMaxVector = seq(0, 12, 2)`, i.e., including the null hypothesis case. For the selection of treatment arms, we use the epsilon selection rule with `epsilonValue = 2`, i.e., for the subsequent stages, the treatment arm with the highest response and all treatment arms that differ by less than 2 from the highest response are selected. To exclude treatment arms with non-positive response, we additionally specify `threshold = 0`. In order to simulate a situation with a maximum of 60 subjects per treatment arm, we set `plannedSubjects = c(20, 40, 60)`:

```
simSelectionEpsilonMAMS <- getSimulationMultiArmMeans(
design = designIN3Stages,
activeArms = 3,
typeOfShape = "linear",
muMaxVector = seq(0, 12, 2),
stDev = 15,
plannedSubjects = c(20, 40, 60),
intersectionTest = "Dunnett",
typeOfSelection = "epsilon",
epsilonValue = 2,
threshold = 0,
successCriterion = "atLeastOne",
maxNumberOfIterations = maxNumberOfIterations,
seed = 1234
)
options("rpact.summary.output.size" = "medium")
# kable(summary(simSelectionEpsilonMAMS))
kable(simSelectionEpsilonMAMS)
```

**Simulation of multi-arm means (inverse normal combination test design)**

**Design parameters**

- *Information rates*: 0.333, 0.667, 1.000
- *Critical values*: Inf, Inf, 1.960
- *Futility bounds (binding)*: -Inf, -Inf
- *Cumulative alpha spending*: 0.0000, 0.0000, 0.0250
- *Local one-sided significance levels*: 0.0000, 0.0000, 0.0250
- *Significance level*: 0.0250
- *Test*: one-sided

**User defined parameters**

- *Seed*: 1234
- *Standard deviation*: 15
- *Planned cumulative subjects*: 20, 40, 60
- *mu_max*: 0, 2, 4, 6, 8, 10, 12
- *Type of selection*: epsilon
- *Success criterion*: atLeastOne
- *Epsilon value*: 2
- *Threshold*: 0

**Default parameters**

- *Maximum number of iterations*: 1000
- *Planned allocation ratio*: 1
- *Calculate subjects function*: default
- *Active arms*: 3
- *Effect matrix (1)*: 0.0000, 0.6667, 1.3333, 2.0000, 2.6667, 3.3333, 4.0000
- *Effect matrix (2)*: 0.0000, 1.3333, 2.6667, 4.0000, 5.3333, 6.6667, 8.0000
- *Effect matrix (3)*: 0.0000, 2.0000, 4.0000, 6.0000, 8.0000, 10.0000, 12.0000
- *Type of shape*: linear
- *Slope*: 1
- *Intersection test*: Dunnett
- *Adaptations*: TRUE, TRUE
- *Effect measure*: effectEstimate
- *r value*: NA

**Results**

- *Iterations [1]*: 1000, 1000, 1000, 1000, 1000, 1000, 1000
- *Iterations [2]*: 759, 835, 925, 947, 979, 994, 996
- *Iterations [3]*: 645, 765, 892, 934, 976, 992, 996
- *Reject at least one*: 0.0230, 0.0790, 0.2160, 0.4750, 0.7200, 0.8990, 0.9680
- *Rejected arms per stage (1) [1]*: 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000
- *Rejected arms per stage (1) [2]*: 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000
- *Rejected arms per stage (1) [3]*: 0.0070, 0.0120, 0.0230, 0.0280, 0.0260, 0.0160, 0.0070
- *Rejected arms per stage (2) [1]*: 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000
- *Rejected arms per stage (2) [2]*: 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000
- *Rejected arms per stage (2) [3]*: 0.0060, 0.0170, 0.0680, 0.1020, 0.1620, 0.1730, 0.1920
- *Rejected arms per stage (3) [1]*: 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000
- *Rejected arms per stage (3) [2]*: 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000
- *Rejected arms per stage (3) [3]*: 0.0110, 0.0510, 0.1450, 0.3980, 0.6210, 0.7920, 0.8560
- *Overall futility stop*: 0.3550, 0.2350, 0.1080, 0.0660, 0.0240, 0.0080, 0.0040
- *Futility stop per stage [1]*: 0.2410, 0.1650, 0.0750, 0.0530, 0.0210, 0.0060, 0.0040
- *Futility stop per stage [2]*: 0.1140, 0.0700, 0.0330, 0.0130, 0.0030, 0.0020, 0.0000
- *Early stop [1]*: 0.2410, 0.1650, 0.0750, 0.0530, 0.0210, 0.0060, 0.0040
- *Early stop [2]*: 0.1140, 0.0700, 0.0330, 0.0130, 0.0030, 0.0020, 0.0000
- *Success per stage [1]*: 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000
- *Success per stage [2]*: 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000
- *Success per stage [3]*: 0.0230, 0.0790, 0.2160, 0.4750, 0.7200, 0.8990, 0.9680
- *Selected arms (1) [1]*: 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000
- *Selected arms (1) [2]*: 0.3500, 0.3280, 0.2980, 0.2300, 0.1590, 0.1000, 0.0630
- *Selected arms (1) [3]*: 0.2450, 0.2410, 0.2060, 0.1380, 0.0730, 0.0410, 0.0160
- *Selected arms (2) [1]*: 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000
- *Selected arms (2) [2]*: 0.3480, 0.3930, 0.4280, 0.4060, 0.4020, 0.3480, 0.3370
- *Selected arms (2) [3]*: 0.2600, 0.2920, 0.3440, 0.3160, 0.3080, 0.2260, 0.2160
- *Selected arms (3) [1]*: 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000
- *Selected arms (3) [2]*: 0.3630, 0.4770, 0.6060, 0.7030, 0.7840, 0.8530, 0.8830
- *Selected arms (3) [3]*: 0.2720, 0.4050, 0.5570, 0.6810, 0.7670, 0.8430, 0.8690
- *Selected arms (4) [1]*: 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000
- *Selected arms (4) [2]*: 0.7590, 0.8350, 0.9250, 0.9470, 0.9790, 0.9940, 0.9960
- *Selected arms (4) [3]*: 0.6450, 0.7650, 0.8920, 0.9340, 0.9760, 0.9920, 0.9960
- *Number of active arms [1]*: 3.000, 3.000, 3.000, 3.000, 3.000, 3.000, 3.000
- *Number of active arms [2]*: 1.398, 1.435, 1.440, 1.414, 1.374, 1.309, 1.288
- *Number of active arms [3]*: 1.205, 1.226, 1.241, 1.215, 1.176, 1.119, 1.105
- *Expected number of subjects*: 144.8, 154.7, 165.1, 167.1, 169, 167.9, 167.5
- *Sample sizes (1) [1]*: 20, 20, 20, 20, 20, 20, 20
- *Sample sizes (1) [2]*: 9.2, 7.9, 6.4, 4.9, 3.2, 2, 1.3
- *Sample sizes (1) [3]*: 7.6, 6.3, 4.6, 3, 1.5, 0.8, 0.3
- *Sample sizes (2) [1]*: 20, 20, 20, 20, 20, 20, 20
- *Sample sizes (2) [2]*: 9.2, 9.4, 9.3, 8.6, 8.2, 7, 6.8
- *Sample sizes (2) [3]*: 8.1, 7.6, 7.7, 6.8, 6.3, 4.6, 4.3
- *Sample sizes (3) [1]*: 20, 20, 20, 20, 20, 20, 20
- *Sample sizes (3) [2]*: 9.6, 11.4, 13.1, 14.8, 16, 17.2, 17.7
- *Sample sizes (3) [3]*: 8.4, 10.6, 12.5, 14.6, 15.7, 17, 17.4
- *Sample sizes (4) [1]*: 20, 20, 20, 20, 20, 20, 20
- *Sample sizes (4) [2]*: 20, 20, 20, 20, 20, 20, 20
- *Sample sizes (4) [3]*: 20, 20, 20, 20, 20, 20, 20
- *Conditional power (achieved) [1]*: NA, NA, NA, NA, NA, NA, NA
- *Conditional power (achieved) [2]*: 0, 0, 0, 0, 0, 0, 0
- *Conditional power (achieved) [3]*: 0.1018, 0.1637, 0.2969, 0.4578, 0.6416, 0.8353, 0.9184

**Legend**

*(i)*: values of treatment arm i

*[k]*: values at stage k

Note that we explicitly use the `options("rpact.summary.output.size" = "medium")` command because otherwise the output becomes too long. You can also illustrate the results through the generic `plot` command. For example, the Overall Power/Early Stopping and Selected Arms per Stage plots are generated by

`plot(simSelectionEpsilonMAMS, type = c(5, 3), grid = 0)`

This vignette can only give a brief introduction to possible configurations that can be considered within the simulation tool for multi-arm designs. Beyond what is described here, real-trial applications typically require taking much more into account to adequately address the situation at hand. For example, it might be of interest to additionally assess a sample size reassessment strategy. This can be performed in the same way as for simulating a single-hypothesis situation (see, for example, the vignette “Simulation of a trial with a binary endpoint and unblinded sample size re-calculation”).

For testing rates, the function `getSimulationMultiArmRates()` is available, and for survival designs the function `getSimulationMultiArmSurvival()`, with options very similar to those of the case considered here. For survival designs, note that, unlike in the single-hypothesis case, the function does not generate survival times on the subject level but normally distributed log-rank test statistics. As a consequence, no estimates of analysis times, study duration, or expected number of subjects can be obtained in this case.


rpact provides the functions `getSimulationMeans()` (continuous endpoints), `getSimulationRates()` (binary endpoints), and `getSimulationSurvival()` (time-to-event endpoints) for the simulation of group sequential trials with adaptive sample size re-calculation (SSR).

For trials with adaptive SSR, the design can be created with the functions `getDesignInverseNormal()` or `getDesignFisher()`. The sample size is re-calculated based on the target conditional power (argument `conditionalPower`). Conditional power is by default evaluated at the observed parameter estimates. If the evaluation of conditional power at other parameter values is desired, these can be provided as arguments `thetaH1` (for `getSimulationMeans()` and `getSimulationSurvival()`), or `pi1H1` and `pi2H1` (for `getSimulationRates()`). For continuous endpoints, by default, conditional power is evaluated at the standard deviation `stDev` under which the trial is simulated. As of rpact 3.0, an argument `stDevH1` can be entered to specify the standard deviation that is used for the sample size recalculation.

For the functions `getSimulationMeans()` and `getSimulationRates()` (but not `getSimulationSurvival()`), the SSR rule can optionally be modified using the argument `calcSubjectsFunction` (see the respective help pages and the code below for details and examples).

In this vignette, we present an example of the use of these functions for a trial with a binary endpoint. For this, we use the constrained promising zone approach as described in Hsiao et al. (2019).

For this vignette, in addition to rpact itself and ggplot2, we use the packages ggpubr and dplyr.

- 1:1 randomized superiority trial with overall response rate (ORR) as the primary endpoint (binary)
- ORR in the control arm is known to be ~20%
- The novel treatment may increase ORR by 10%-13%
- 2.5% one-sided significance level

We first calculate the sample sizes (per treatment group) for the corresponding fixed designs with 90% power:

```
# fixed design powered for delta of 13%
ssMin <- getSampleSizeRates(pi1 = 0.33, pi2 = 0.2, alpha = 0.025, beta = 0.1)
(Nmin <- ceiling(ssMin$numberOfSubjects1))
```

```
[,1]
[1,] 241
```

```
# fixed design powered for delta of 10%
ssMax <- getSampleSizeRates(pi1 = 0.30, pi2 = 0.2, alpha = 0.025, beta = 0.1)
(Nmax <- ceiling(ssMax$numberOfSubjects1))
```

```
[,1]
[1,] 392
```

Assume that the sponsor is unwilling to make an up-front commitment for a trial with Nmax = 392 subjects per treatment group but that they are willing to provide an up-front commitment for a trial with Nmin = 241 subjects per treatment group. If the results at an interim analysis with an unblinded SSR look “promising”, the sponsor would then be willing to commit funding for up to 392 subjects per treatment group in total.

To help the sponsor, we investigate two designs with an interim analysis for SSR after 120 subjects per treatment group:

- SSR based on conditional power: adjust sample size to achieve a conditional power of 90% assuming that the true response rates are 20% and 30% (if this is feasible within the given sample size range)
- A constrained promising zone design (Hsiao et al. 2019) with minimum conditional power cp_min = 0.8 and target conditional power cp_max = 0.9 (the values used in the code below)

To combine the two stages for both designs, we use an inverse normal combination test with optimal weights for the minimal final group size (practically, this is a design with equal weights) and no provision for early stopping at the interim, neither for efficacy nor for futility:

```
# group size at interim
(n1 <- floor(Nmin / 2))
```

```
[,1]
[1,] 120
```

```
# inverse normal design with possible rejection of H0 only at the final analysis
designIN <- getDesignInverseNormal(
  typeOfDesign = "noEarlyEfficacy",
  kMax = 2,
  alpha = 0.025
)
kable(summary(designIN))
```

**Sequential analysis with a maximum of 2 looks (inverse normal combination test design)**

No early efficacy stop design, one-sided overall significance level 2.5%, power 80%, undefined endpoint, inflation factor 1, ASN H1 1, ASN H01 1, ASN H0 1.

Stage | 1 | 2 |
---|---|---|
Information rate | 50% | 100% |
Efficacy boundary (z-value scale) | Inf | 1.960 |
Stage levels (one-sided) | 0 | 0.0250 |
Cumulative alpha spent | 0 | 0.0250 |
Overall power | 0 | 0.8000 |
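
For reference, the inverse normal combination test with equal weights combines the independent stage-wise z-statistics as follows (a minimal sketch; with the design above, H0 is rejected only at the final analysis when the combined statistic exceeds 1.960):

```r
# Equal-weight inverse normal combination of independent stage-wise z-scores
inverseNormalZ <- function(z1, z2, w1 = sqrt(0.5), w2 = sqrt(0.5)) {
  (w1 * z1 + w2 * z2) / sqrt(w1^2 + w2^2)
}

inverseNormalZ(1.2, 1.8)  # about 2.12 -- combined z-score across both stages
```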

It is straightforward to simulate the test characteristics of this design using the function `getSimulationRates()`. `plannedSubjects` refers to the cumulative sample sizes over the two stages **in both treatment groups combined**. If `conditionalPower` is specified, `minNumberOfSubjectsPerStage` and `maxNumberOfSubjectsPerStage` must also be specified. They refer to the minimum and maximum overall sample sizes **per stage** (the first element is the first-stage sample size), respectively. If `pi1H1` and/or `pi2H1` are not specified, the observed (simulated) rates at interim are used for the SSR.
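
Spelled out for this example, the cumulative versus per-stage bookkeeping amounts to the following (a small sanity check of the numbers passed to the simulation call):

```r
# Cumulative vs. per-stage sample sizes, both treatment groups combined
n1 <- 120; Nmin <- 241; Nmax <- 392
plannedSubjects <- 2 * c(n1, Nmin)    # cumulative: 240, 482
minPerStage <- 2 * c(n1, Nmin - n1)   # per stage:  240, 242
maxPerStage <- 2 * c(n1, Nmax - n1)   # per stage:  240, 544
```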

```
# Design with sample size re-estimated to get conditional power of 0.9 at
# pi1H1 = 0.3, pi2H1 = 0.2 [minimum effect size]
# (evaluate at most interesting values for pi1)
simCpower <- getSimulationRates(designIN,
  pi1 = c(0.2, 0.3, 0.33), pi2 = 0.2,
  # cumulative overall sample size
  plannedSubjects = 2 * c(n1, Nmin),
  conditionalPower = 0.9,
  # stage-wise minimal overall sample size
  minNumberOfSubjectsPerStage = 2 * c(n1, (Nmin - n1)),
  # stage-wise maximal overall sample size
  maxNumberOfSubjectsPerStage = 2 * c(n1, (Nmax - n1)),
  pi1H1 = 0.3, pi2H1 = 0.2,
  maxNumberOfIterations = 1000,
  seed = 12345
)
kable(simCpower, showStatistics = FALSE)
```

**Simulation of rates (inverse normal combination test design)**

**Design parameters**

- *Information rates*: 0.500, 1.000
- *Critical values*: Inf, 1.960
- *Futility bounds (binding)*: -Inf
- *Cumulative alpha spending*: 0.0000, 0.0250
- *Local one-sided significance levels*: 0.0000, 0.0250
- *Significance level*: 0.0250
- *Test*: one-sided

**User defined parameters**

- *Seed*: 12345
- *Conditional power*: 0.9
- *Planned cumulative subjects*: 240, 482
- *Minimum number of subjects per stage*: 240, 242
- *Maximum number of subjects per stage*: 240, 544
- *Assumed treatment rate*: 0.200, 0.300, 0.330
- *Assumed control rate*: 0.200
- *pi(1) under H1*: 0.300

**Default parameters**

- *Maximum number of iterations*: 1000
- *Planned allocation ratio*: 1
- *Direction upper*: TRUE
- *Risk ratio*: FALSE
- *Theta H0*: 0
- *Normal approximation*: TRUE
- *Treatment groups*: 2
- *pi(2) under H1*: 0.200

**Results**

- *Effect*: 0.00, 0.10, 0.13
- *Iterations [1]*: 1000, 1000, 1000
- *Iterations [2]*: 1000, 1000, 1000
- *Overall reject*: 0.0220, 0.8590, 0.9730
- *Reject per stage [1]*: 0.0000, 0.0000, 0.0000
- *Reject per stage [2]*: 0.0220, 0.8590, 0.9730
- *Futility stop per stage*: 0.0000, 0.0000, 0.0000
- *Early stop*: 0.0000, 0.0000, 0.0000
- *Expected number of subjects*: 771.7, 631.5, 577.2
- *Sample sizes [1]*: 240, 240, 240
- *Sample sizes [2]*: 531.7, 391.5, 337.2
- *Conditional power (achieved) [1]*: NA, NA, NA
- *Conditional power (achieved) [2]*: 0.4821, 0.8579, 0.9092

**Legend**

- *(i)*: values of treatment arm i
- *[k]*: values at stage k

As described in Hsiao et al. (2019), this method chooses the second-stage sample size according to the following rules:

Choose the second-stage sample size n2 between n2,min and n2,max such that the conditional power reaches cp_max for the minimal effect size we want to detect (i.e., ORR of 20% vs. 30%). We chose cp_max = 0.9 here, as in the original publication.

If such a sample size does not exist, then proceed as follows:

- If the conditional power cannot be boosted to at least cp_min by increasing the sample size to n2,max, i.e., if the interim result is not considered "promising", then do not increase the sample size and set n2 = n2,min. We chose cp_min = 0.8 here, as in the original publication.
- Otherwise, set n2 = n2,max.
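
The decision rule above can be sketched as a standalone base R function. This is purely illustrative (the names are ours, not rpact's API), and `n2ForCP(cp)` stands for any function returning the stage-2 size needed to reach conditional power `cp`:

```r
# Standalone sketch of the CPZ rule: aim for cpMax, cap at n2max, and fall
# back to n2min when even n2max cannot deliver cpMin ("not promising").
cpzStage2Size <- function(n2ForCP, n2min, n2max, cpMin = 0.8, cpMax = 0.9) {
  if (n2ForCP(cpMin) > n2max) {
    return(n2min)  # not promising: keep the minimum commitment
  }
  ceiling(min(max(n2min, n2ForCP(cpMax)), n2max))
}

# Toy sample-size functions, for illustration only
cpzStage2Size(function(cp) 600 * cp, n2min = 242, n2max = 544)  # returns 540
cpzStage2Size(function(cp) 700 * cp, n2min = 242, n2max = 544)  # returns 242
```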

To simulate the CPZ design in rpact, we can again use the function `getSimulationRates()`. However, the situation is more complicated because we need to re-define the sample size recalculation rule using the argument `calcSubjectsFunction` (see the help page `?getSimulationRates` for more information regarding `calcSubjectsFunction`):

```
# CPZ design (evaluate at the most interesting values for pi1)
# home-made SSR function
myCPZSampleSizeCalculationFunction <- function(..., stage,
    plannedSubjects,
    conditionalPower,
    minNumberOfSubjectsPerStage,
    maxNumberOfSubjectsPerStage,
    conditionalCriticalValue,
    overallRate) {
  rateUnderH0 <- (overallRate[1] + overallRate[2]) / 2
  # function adapted from example in ?getSimulationRates
  calculateStageSubjects <- function(cp) {
    2 * (max(0, conditionalCriticalValue *
      sqrt(2 * rateUnderH0 * (1 - rateUnderH0)) +
      stats::qnorm(cp) * sqrt(overallRate[1] * (1 - overallRate[1]) +
        overallRate[2] * (1 - overallRate[2]))))^2 /
      (max(1e-12, (overallRate[1] - overallRate[2])))^2
  }
  # Calculate sample size required to reach maximum desired conditional power
  # cp_max (provided as argument conditionalPower)
  stageSubjectsCPmax <- calculateStageSubjects(cp = conditionalPower)
  # Calculate sample size required to reach minimum desired conditional power
  # cp_min (**manually set for this example to 0.8**)
  stageSubjectsCPmin <- calculateStageSubjects(cp = 0.8)
  # Define stageSubjects
  stageSubjects <- ceiling(min(max(
    minNumberOfSubjectsPerStage[stage],
    stageSubjectsCPmax
  ), maxNumberOfSubjectsPerStage[stage]))
  # Set stageSubjects to minimal sample size in case minimum conditional power
  # cannot be reached with available sample size
  if (stageSubjectsCPmin > maxNumberOfSubjectsPerStage[stage]) {
    stageSubjects <- minNumberOfSubjectsPerStage[stage]
  }
  # return sample size
  return(stageSubjects)
}

# Now simulate the CPZ design
simCPZ <- getSimulationRates(designIN,
  pi1 = c(0.2, 0.3, 0.33), pi2 = 0.2,
  plannedSubjects = 2 * c(n1, Nmin), # cumulative overall sample size
  conditionalPower = 0.9,
  # stage-wise minimal overall sample size
  minNumberOfSubjectsPerStage = 2 * c(n1, (Nmin - n1)),
  # stage-wise maximal overall sample size
  maxNumberOfSubjectsPerStage = 2 * c(n1, (Nmax - n1)),
  pi1H1 = 0.3, pi2H1 = 0.2,
  calcSubjectsFunction = myCPZSampleSizeCalculationFunction,
  maxNumberOfIterations = 1000,
  seed = 12345
)
kable(simCPZ, showStatistics = FALSE)
```

**Simulation of rates (inverse normal combination test design)**

**Design parameters**

- *Information rates*: 0.500, 1.000
- *Critical values*: Inf, 1.960
- *Futility bounds (binding)*: -Inf
- *Cumulative alpha spending*: 0.0000, 0.0250
- *Local one-sided significance levels*: 0.0000, 0.0250
- *Significance level*: 0.0250
- *Test*: one-sided

**User defined parameters**

- *Seed*: 12345
- *Conditional power*: 0.9
- *Planned cumulative subjects*: 240, 482
- *Minimum number of subjects per stage*: 240, 242
- *Maximum number of subjects per stage*: 240, 544
- *Calculate subjects function*: user defined
- *Assumed treatment rate*: 0.200, 0.300, 0.330
- *Assumed control rate*: 0.200
- *pi(1) under H1*: 0.300

**Default parameters**

- *Maximum number of iterations*: 1000
- *Planned allocation ratio*: 1
- *Direction upper*: TRUE
- *Risk ratio*: FALSE
- *Theta H0*: 0
- *Normal approximation*: TRUE
- *Treatment groups*: 2
- *pi(2) under H1*: 0.200

**Results**

- *Effect*: 0.00, 0.10, 0.13
- *Iterations [1]*: 1000, 1000, 1000
- *Iterations [2]*: 1000, 1000, 1000
- *Overall reject*: 0.0190, 0.8070, 0.9460
- *Reject per stage [1]*: 0.0000, 0.0000, 0.0000
- *Reject per stage [2]*: 0.0190, 0.8070, 0.9460
- *Futility stop per stage*: 0.0000, 0.0000, 0.0000
- *Early stop*: 0.0000, 0.0000, 0.0000
- *Expected number of subjects*: 525.7, 574.5, 550.1
- *Sample sizes [1]*: 240, 240, 240
- *Sample sizes [2]*: 285.7, 334.5, 310.1
- *Conditional power (achieved) [1]*: NA, NA, NA
- *Conditional power (achieved) [2]*: 0.2827, 0.7883, 0.8898

**Legend**

- *(i)*: values of treatment arm i
- *[k]*: values at stage k

We first use the aggregated data from the two simulations to compare, between the two designs, how the re-calculated sample size and the corresponding conditional power depend on the interim Z-score. For this, we use the function `getData()` and the `summarise()` command of the dplyr package, and plot the result with ggplot2. Note that for this illustration we summarise over all values of `pi1`. This makes sense because we used a fixed `pi1H1` and `pi2H1` for both sample size recalculation methods.

```
# aggregate data across simulation runs for both simulations and extract Z-score,
# conditionalPower, and totalSampleSize1 (per group)
aggSimCpower <- getData(simCpower)
sumCpower <- aggSimCpower %>%
  group_by(iterationNumber) %>%
  summarise(
    design = "SS re-calculation for cp = 90%",
    Z1 = testStatistic[1],
    conditionalPower = conditionalPowerAchieved[2],
    totalSampleSize1 = (numberOfSubjects[1] + numberOfSubjects[2]) / 2
  ) %>%
  arrange(Z1) %>%
  filter(Z1 > 0, Z1 < 5)
aggSimCPZ <- getData(simCPZ)
sumCPZ <- aggSimCPZ %>%
  group_by(iterationNumber) %>%
  summarise(
    design = "Constrained promising zone (CPZ)",
    Z1 = testStatistic[1],
    conditionalPower = conditionalPowerAchieved[2],
    totalSampleSize1 = (numberOfSubjects[1] + numberOfSubjects[2]) / 2
  ) %>%
  arrange(Z1) %>%
  filter(Z1 > 0, Z1 < 5)
sumBoth <- rbind(sumCpower, sumCPZ)

# Plot it
plot1 <- ggplot(aes(Z1, conditionalPower, col = design, group = design),
  data = sumBoth
) +
  geom_line(aes(linetype = design), lwd = 1.2) +
  scale_x_continuous(name = "Z-score at interim analysis") +
  scale_y_continuous(
    breaks = seq(0, 1, by = 0.1),
    name = "Conditional power at re-calculated sample size"
  ) +
  scale_color_manual(values = c("#d7191c", "#fdae61"))
plot2 <- ggplot(aes(Z1, totalSampleSize1, col = design, group = design),
  data = sumBoth
) +
  geom_line(aes(linetype = design), lwd = 1.2) +
  scale_x_continuous(name = "Z-score at interim analysis") +
  scale_y_continuous(name = "Re-calculated final sample size (per group)") +
  scale_color_manual(values = c("#d7191c", "#fdae61"))
ggarrange(plot1, plot2, common.legend = TRUE, legend = "bottom")
```

To compare the two designs across a wider range of parameters, we re-simulate the designs for a finer grid of assumed ORR in the intervention group and then visualize the results.

```
# Simulate designs again over a range of parameters and plot them
pi1Seq <- seq(0.2, 0.4, by = 0.01)
simCpowerLong <- getSimulationRates(designIN,
  pi1 = pi1Seq, pi2 = 0.2,
  plannedSubjects = c(2 * n1, 2 * Nmin),
  conditionalPower = 0.9,
  minNumberOfSubjectsPerStage = c(2 * n1, 2 * (Nmin - n1)),
  maxNumberOfSubjectsPerStage = c(2 * n1, 2 * (Nmax - n1)),
  pi1H1 = 0.3, pi2H1 = 0.2,
  maxNumberOfIterations = 10000,
  seed = 12345
)
plot(simCpowerLong, type = 6)
```

```
simCPZLong <- getSimulationRates(designIN,
  pi1 = pi1Seq, pi2 = 0.2,
  # cumulative overall sample size
  plannedSubjects = 2 * c(n1, Nmin),
  conditionalPower = 0.9,
  # stage-wise minimal overall sample size
  minNumberOfSubjectsPerStage = 2 * c(n1, (Nmin - n1)),
  # stage-wise maximal overall sample size
  maxNumberOfSubjectsPerStage = 2 * c(n1, (Nmax - n1)),
  pi1H1 = 0.3, pi2H1 = 0.2,
  calcSubjectsFunction = myCPZSampleSizeCalculationFunction,
  maxNumberOfIterations = 10000,
  seed = 12345
)
plot(simCPZLong, type = 6)
```

```
# Pool datasets from simulations (and fixed designs)
simCpowerData <- with(
  simCpowerLong,
  data.frame(
    design = "SS re-calculation for cp = 90%",
    pi1 = pi1, pi2 = pi2, effect = effect, power = overallReject,
    expectedNumberOfSubjects1 = expectedNumberOfSubjects / 2,
    stringsAsFactors = FALSE
  )
)
simCPZData <- with(
  simCPZLong,
  data.frame(
    design = "Constrained promising zone (CPZ)",
    pi1 = pi1, pi2 = pi2, effect = effect, power = overallReject,
    expectedNumberOfSubjects1 = expectedNumberOfSubjects / 2,
    stringsAsFactors = FALSE
  )
)
simFixed241 <- with(
  simCPZData,
  data.frame(
    design = "Fixed (n = 241)",
    pi1 = pi1, pi2 = pi2, effect = effect,
    power = getPowerRates(
      pi1 = pi1, pi2 = 0.2, maxNumberOfSubjects = 2 * 241,
      alpha = 0.025, sided = 1
    )$overallReject,
    expectedNumberOfSubjects1 = 241, stringsAsFactors = FALSE
  )
)
simFixed392 <- with(
  simCPZData,
  data.frame(
    design = "Fixed (n = 392)",
    pi1 = pi1, pi2 = pi2, effect = effect,
    power = getPowerRates(
      pi1 = pi1, pi2 = 0.2, maxNumberOfSubjects = 2 * 392,
      alpha = 0.025, sided = 1
    )$overallReject,
    expectedNumberOfSubjects1 = 392, stringsAsFactors = FALSE
  )
)
simdata <- rbind(simCpowerData, simCPZData, simFixed241, simFixed392)
simdata$design <- factor(simdata$design,
  levels = c(
    "Fixed (n = 241)", "Fixed (n = 392)",
    "SS re-calculation for cp = 90%", "Constrained promising zone (CPZ)"
  )
)
# Plot difference in ORR vs power
ggplot(aes(effect, power, col = design), data = simdata) +
geom_line(lwd = 1.2) +
scale_x_continuous(name =
```