`myDesign <- getDesignInverseNormal(kMax = 2, typeOfDesign = "noEarlyEfficacy") `

We demonstrate how rpact enables users to easily define new functions for calculating the number of subjects or events required, based on given conditional power and critical values for specific testing scenarios. This includes the implementation of advanced strategies like the ‘promising zone approach.’

- Efficacy endpoint PFS
- Assumed hazard ratio = 0.67, and requires 263 events

280 PFS events yields power 91.8 %.

If 350 patients are enrolled over 28 months with a median PFS time of 8.5 in the control group, the final analysis is expected to be after an additional follow-up of about 12 months

500 PFS events are needed to have 90% power at HR = 0.75 with more patients and a different expected follow-up

**“Milestone-based” investment:**Two-stage approach with interim after 140 events

Enough power for detecting HR = 0.67

If conditional power CP for detecting HR = 0.75 falls in a “promizing zone”, an additional investment would be made that allows the trial to remain open until 420 PFS events were obtained

Conditional power based on

**assumed**minimum clinical relevant effect HR = 0.75

Number of events for the second stage between 140 and 280

If conditional power for 280 additional events at HR = 0.75 is smaller than , set number of additional events = 140 (

*non-promising case*)If conditional power for 140 additional events at HR = 0.75 exceeds , set number of additional events = 140, otherwise calculate event number according to (

*promising case*)This defined a promizing zone for HR within the sample size may be modified.

`rpact`

First, define the design

`myDesign <- getDesignInverseNormal(kMax = 2, typeOfDesign = "noEarlyEfficacy") `

Define the event number calculation function `myEventSizeCalculationFunction()`

```
# Define promizing zone event size function
myEventSizeCalculationFunction <- function(..., stage,
plannedEvents,
conditionalPower,
minNumberOfEventsPerStage,
maxNumberOfEventsPerStage,
conditionalCriticalValue,
estimatedTheta) {
calculateStageEvents <- function(cp) {
4 * max(0, conditionalCriticalValue + qnorm(cp))^2 /
log(max(1 + 1e-12, estimatedTheta))^2
}
# Calculate events required to reach maximum desired conditional power
# cp_max (provided as argument conditionalPower)
stageEventsCPmax <- ceiling(calculateStageEvents(cp = conditionalPower))
# Calculate events required to reach minimum desired conditional power
# cp_min (**manually set for this example to 0.8**)
stageEventsCPmin <- ceiling(calculateStageEvents(cp = 0.8))
# Define stageEvents
stageEvents <- min(max(minNumberOfEventsPerStage[stage], stageEventsCPmax),
maxNumberOfEventsPerStage[stage])
# Set stageEvents to minimal sample size in case minimum conditional power
# cannot be reached with available sample size
if (stageEventsCPmin > maxNumberOfEventsPerStage[stage]) {
stageEvents <- minNumberOfEventsPerStage[stage]
}
# return overall events for second stage
return(plannedEvents[1] + stageEvents)
}
```

by specifying `calcEventsFunction = myEventSizeCalculationFunction`

and a range of assumed true hazard ratios

`hazardRatioSeq <- seq(0.65, 0.85, by = 0.025)`

```
simSurvPromZone <- getSimulationSurvival(design = myDesign,
hazardRatio = hazardRatioSeq,
directionUpper = FALSE,
plannedEvents = c(140, 280),
median2 = 9,
minNumberOfEventsPerStage = c(NA, 140),
maxNumberOfEventsPerStage = c(NA, 280),
thetaH1 = 0.75,
conditionalPower = 0.9,
accrualTime = 36,
calcEventsFunction = myEventSizeCalculationFunction,
maxNumberOfIterations = maxNumberOfIterations,
longTimeSimulationAllowed = TRUE,
maxNumberOfSubjects = 500)
```

Specify `calcEventsFunction = NULL`

```
simSurvCondPower <- getSimulationSurvival(design = myDesign,
hazardRatio = hazardRatioSeq,
directionUpper = FALSE,
plannedEvents = c(140, 280),
median2 = 9,
minNumberOfEventsPerStage = c(NA, 140),
maxNumberOfEventsPerStage = c(NA, 280),
thetaH1 = 0.75,
conditionalPower = 0.9,
accrualTime = 36,
calcEventsFunction = NULL,
maxNumberOfIterations = maxNumberOfIterations,
longTimeSimulationAllowed = TRUE,
maxNumberOfSubjects = 500)
```

```
aggSimCondPower <- getData(simSurvCondPower)
sumCpower <- summarize(aggSimCondPower, .by = c(iterationNumber, hazardRatio),
design = "Event re-calculation for cp = 90%",
totalSampleSize1 = sum(eventsPerStage),
Z1 = testStatistic[1],
conditionalPower = conditionalPowerAchieved[2])
aggSimPromZone <- getData(simSurvPromZone)
sumCPZ <- summarize(aggSimPromZone, .by = c(iterationNumber, hazardRatio),
design = "Constrained promising zone (CPZ) with cpmin = 80%",
totalSampleSize1 = sum(eventsPerStage),
Z1 = testStatistic[1],
conditionalPower = conditionalPowerAchieved[2])
sumBoth <- rbind(sumCpower, sumCPZ) %>%
filter(Z1 > -1, Z1 < 4)
# Plot it
plot1 <- ggplot(data = sumBoth, aes(Z1, totalSampleSize1, col = design, group = design)) +
geom_line(aes(linetype = design), lwd = 1.2) +
theme_classic() +
geom_line(aes(Z1, 280 + 150*dnorm(Z1, log(0.75*sqrt(140)/2))), color = "black") +
grids(linetype = "dashed") +
scale_x_continuous(name = "Z-score at interim analysis") +
scale_y_continuous(name = "Re-calculated number of events", limits = c(280, 500)) +
scale_color_manual(values = c("red", "orange"))
plot2 <- ggplot(data = sumBoth, aes(Z1, conditionalPower, col = design, group = design)) +
geom_line(aes(linetype = design), lwd = 1.2) +
theme_classic() +
geom_line(aes(Z1, dnorm(Z1, log(0.75*sqrt(140)/2))), color = "black") +
grids(linetype = "dashed") +
scale_x_continuous(name = "Z-score at interim analysis") +
scale_y_continuous(
breaks = seq(0, 1, by = 0.1),
name = "Conditional power at re-calculated sample size"
) +
scale_color_manual(values = c("red", "orange"))
ggarrange(plot1, plot2, ncol= 2, common.legend = TRUE, legend = "top")
```

```
ggplot(data = sumBoth, aes(1 - pnorm(Z1), conditionalPower, col = design, group = design)) +
geom_line(aes(linetype = design), lwd = 1.2) +
theme_classic() +
grids(linetype = "dashed") +
scale_x_continuous(name = "p-value at interim analysis") +
scale_y_continuous(
breaks = seq(0, 1, by = 0.1),
name = "Conditional power at re-calculated sample size"
) +
scale_color_manual(values = c("#d7191c", "#fdae61"))
```

`plot(simSurvPromZone, type = 6) `

`plot(simSurvCondPower, type = 6)`

```
# Pool datasets from simulations (and fixed designs)
simCondPowerData <- with(as.list(simSurvCondPower),
data.frame(
design = "Events re-calculation with cp = 90%",
hazardRatio = hazardRatio, power = overallReject,
expectedNumberOfEvents = expectedNumberOfEvents
))
simPromZoneData <- with(as.list(simSurvPromZone),
data.frame(
design = "Constrained promising zone (CPZ)",
hazardRatio = hazardRatio, power = overallReject,
expectedNumberOfEvents = expectedNumberOfEvents
))
simFixed280 <- data.frame(
design = "Fixed events = 280",
hazardRatio = hazardRatioSeq,
power = getPowerSurvival(alpha = 0.025,
directionUpper = FALSE,
maxNumberOfEvents = 280,
median2 = 9,
accrualTime = 28,
maxNumberOfSubjects = 500,
hazardRatio = hazardRatioSeq
)$overallReject,
expectedNumberOfEvents = 280
)
simFixed420 <- data.frame(
design = "Fixed events = 420",
hazardRatio = hazardRatioSeq,
power = getPowerSurvival(alpha = 0.025,
directionUpper = FALSE,
maxNumberOfEvents = 420,
median2 = 9,
accrualTime = 28,
maxNumberOfSubjects = 500,
hazardRatio = hazardRatioSeq
)$overallReject,
expectedNumberOfEvents = 420
)
simdata <- rbind(simCondPowerData, simPromZoneData, simFixed280, simFixed420)
simdata$design <- factor(simdata$design,
levels = c(
"Fixed events = 280",
"Fixed events = 420",
"Events re-calculation with cp = 90%",
"Constrained promising zone (CPZ)"
))
```

```
# Plot difference in power
ggplot(aes(hazardRatio, power, col = design), data = simdata) +
theme_classic() +
grids(linetype = "dashed") +
geom_line(lwd = 1.2) +
scale_x_continuous(name = "Hazard Ratio") +
scale_y_continuous(breaks = seq(0, 1, by = 0.1), name = "Power") +
geom_vline(xintercept = c(0.67, 0.75), color = "black", lwd = 0.9) +
scale_color_manual(values = c("#2c7bb6", "#abd9e9", "#fdae61", "#d7191c"))
```

```
# Plot difference in expected sample size
ggplot(aes(hazardRatio, expectedNumberOfEvents, col = design), data = simdata) +
theme_classic() +
grids(linetype = "dashed") +
geom_line(lwd = 1.2) +
scale_x_continuous(name = "Hazard Ratio") +
scale_y_continuous(name = "Expected Events") +
scale_color_manual(values = c("#2c7bb6", "#abd9e9", "#fdae61", "#d7191c"))
```

- Easy implementation in
`rpact`

- Simulation very fast
- Consideration of efficacy or futility stops straightforward
- Trade-off between overall expected sample size and power
- Usage of combination test (or equivalent) theoretically mandatory
- Adaptations based on test statistic only

Wassmer, G and Brannath, W. *Group Sequential and Confirmatory Adaptive Designs in Clinical Trials* (2016), ISBN 978-3319325606 https://doi.org/10.1007/978-3-319-32562-0

*System* rpact 4.0.0, R version 4.3.3 (2024-02-29 ucrt), *platform* x86_64-w64-mingw32

To cite R in publications use:

R Core Team (2024). *R: A Language and Environment for Statistical Computing*. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/. To cite package ‘rpact’ in publications use:

Wassmer G, Pahlke F (2024). *rpact: Confirmatory Adaptive Clinical Trial Design and Analysis*. R package version 4.0.0, https://www.rpact.com, https://github.com/rpact-com/rpact, https://rpact-com.github.io/rpact/, https://www.rpact.org.

The online shiny app for rpact is available at https://shiny.rpact.com. The default settings when the Shiny app is loaded is for a fixed sample design, which means that there is only one look at the data (kMax = 1). In other words, the default setting is not for a sequential design, but a traditional design where the data is analyzed once. Moving the slider for the “Maximum number of stages” would increase the number of looks in the design (you can select up to up to 10 looks).

The rpact package focuses on Confirmatory Adaptive Clinical Trial Design and Analysis. In clinical trials, researchers mostly test directional predictions, and thus, the default setting is to perform a one-sided test. Outside of clinical trials, it might be less common to design studies testing a directional prediction, but it is often a good idea. In clinical trials, it is common to use a 0.025 significance level (or Type I error rate) for one-sided tests, as it is deemed preferable in regulatory settings to set the Type I error rate for one-sided tests at half the conventional Type I error used in two-sided tests. In other fields, such as psychology, researchers typically use a 0.05 significance level, regardless of whether they perform a one-sided or two-sided test. A default 0.2 Type II error rate (or power of 0.8) is common in many fields, and is thus the default setting for the Type II error rate in the Shiny app.

Remember that you always need to justify your error rates – the defaults are most often not optimal choices in any real-life design (and it might be especially useful to choose a higher power, if possible).

We can explore a group sequential design by moving the slider for the maximum number of stages to, say, kMax = 2. The option to choose a design appears above the slider in the form of three “Design” radio buttons (Group Sequential, Inverse Normal, and Fisher), which by default is set to a group sequential design – this is the type of designs we will focus on in this step by step tutorial. The other options are relevant for adaptive designs which we will not discuss here.

A new drop down menu has appeared below the box to choose a Type II error rate that asks you to specify the “Type of design”. This allows you to choose how you want to control the level across looks. By default the choice is an O’Brien-Fleming design. Set the “Type of Design” option to “Pocock (P)”. Note there is also a Pocock type -spending (asP) option – we will use that later.

Because most people in social sciences will probably have more experience with two-sided tests at an of 0.05, choose a two-sided test and an level of 0.05 choose those settings. The input window should now look like the example below:

Click on the “Plot” tab. The first plot in the drop-down menu shows the boundaries at each look. The critical score at each look it presented, as is a reference line at and . These reference lines are the critical value for a two-sided test with a single look (i.e., a fixed design) with an of 5%. We see that the boundaries on the scale have increased. This means we need to observe a more extreme score at an analysis to reject . Furthermore, we see that the critical bounds are constant across both looks. This is exactly the goal of the Pocock correction: The level is lowered so that the level is the same at each look, and the overall level across all looks at the data is controlled at 5%. It is conceptually very similar to the Bonferroni correction. We can reproduce the design and the plot in R using the following code:

```
design <- getDesignGroupSequential(
kMax = 2,
typeOfDesign = "P",
alpha = 0.05,
sided = 2
)
plot(design, type = 1)
```

In the drop-down menu, we can easily change the type of design from “Pocock (P)” to “O’Brien-Fleming (OF)” to see the effect of using different corrections for the critical values across looks in the plot. We see that the O’Brien-Fleming correction has a different goal. The critical value at the first look is very high (which also means the level for this look is very low), but the critical value at the final look is extremely close to the unadjusted critical value of 1.96 (or the level of 0.05).

```
design <- getDesignGroupSequential(
kMax = 2,
typeOfDesign = "OF",
alpha = 0.05,
sided = 2
)
plot(design, type = 1)
```

We can plot the corrections for different types of designs for each of 3 looks (2 interim looks and one final look) in the same plot in R. The plot below shows the Pocock, O’Brien-Fleming, Haybittle-Peto, and Wang-Tsiatis correction with = 0.25. We see that researchers can choose different approaches to spend their -level across looks. Researchers can choose to spend their conservatively (keeping most of the for the last look), or more liberally (spending more at the earlier looks, which increases the probability of stopping early for many true effect sizes).

```
# Comparison corrections
d1 <- getDesignGroupSequential(typeOfDesign = "OF", sided = 2, alpha = 0.05)
d2 <- getDesignGroupSequential(typeOfDesign = "P", sided = 2, alpha = 0.05)
d3 <- getDesignGroupSequential(
typeOfDesign = "WT", deltaWT = 0.25,
sided = 2, alpha = 0.05
)
d4 <- getDesignGroupSequential(typeOfDesign = "HP", sided = 2, alpha = 0.05)
designSet <- getDesignSet(designs = c(d1, d2, d3, d4), variedParameters = "typeOfDesign")
plot(designSet, type = 1, legendPosition = 5)
```

Because the statistical power of a test depends on the level (and the effect size and the sample size), this means that at the final look the statistical power of an O’Brien-Fleming or Haybittle-Peto design is very similar to the statistical power for a fixed design with only one look. If the is lowered, the sample size of a study needs to be increased to maintain the same statistical power at the last look. Therefore, the Pocock correction requires a remarkably larger increase in the maximum sample size than the O’Brien-Fleming or Haybittle-Peto correction. We will discuss these issues in more detail when we consider sample size planning below.

If you head to the “Report” tab, you can download an easily readable summary of the main results. Here, you can also see the level you would use for each look at the data (e.g., p < 0.0052, and p < 0.0480 for a O’Brien-Fleming type design with 2 looks).

Corrected levels can be computed to many digits, but this quickly reaches a level of precision that is meaningless in real life. The observed type I error rate for all tests you will do in your lifetime is not noticeably different if you set the level at 0.0194, 0.019, or 0.02 (see the concept of ‘significant digits’. Even as we calculate and use thresholds up to many digits in sequential tests, the messiness of most research makes these levels have false precision. Keep this in mind when interpreting your data.

Note that the rpact Shiny app usefully shows the R code required to reproduce the output.

```
design <- getDesignGroupSequential(
typeOfDesign = "OF",
informationRates = c(0.5, 1),
alpha = 0.05,
beta = 0.2,
sided = 2
)
kable(summary(design))
```

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|

OF | 2 | 1 | 0.5 | 0.05 | 0.2 | FALSE | 2 | 0 | 0.0051658 | 2.796510 | 0.0025829 |

OF | 2 | 2 | 1.0 | 0.05 | 0.2 | FALSE | 2 | 0 | 0.0500000 | 1.977431 | 0.0239965 |

An important contribution to the sequential testing literature was made by Lan and DeMets (1983) who proposed the -spending function approach. In the figure below, the O’Brien-Fleming-like -spending function is plotted against the discrete O’Brien-Fleming bounds. We can see that the two approaches are not identical, but very comparable. The main benefit of these spending functions is that the error rate of the study can be controlled, while neither the number nor the timing of the looks needs to be specified in advance. This makes -spending approaches much more flexible. When using an -spending function it is important that the decision to perform an interim analysis is not based on collected data, as this can still can increase the Type I error rate.

```
d1 <- getDesignGroupSequential(typeOfDesign = "P", kMax = 5)
d2 <- getDesignGroupSequential(typeOfDesign = "asP", kMax = 5)
d3 <- getDesignGroupSequential(typeOfDesign = "OF", kMax = 5)
d4 <- getDesignGroupSequential(typeOfDesign = "asOF", kMax = 5)
designSet <- getDesignSet(
designs = c(d1, d2, d3, d4),
variedParameters = "typeOfDesign"
)
plot(designSet, type = 1)
```

Although -spending functions control the Type I error rate even when there are deviations from the pre-planned number of looks, or their timing, this does require recalculating the boundaries used in the statistical test based on the amount of information that has been observed. Let us assume a researcher designs a study with three equally spaced looks at the data (two interim looks, one final look), using a Pocock-type spending function, where results will be analyzed in a two-sided t-test with an overall desired Type I error rate of 0.05, and a desired power of 0.9 for a Cohen’s d of 0.5. An a-priori power analysis (which we will explain later in this tutorial) shows that we achieve the desired power in our sequential design if we plan to look after 65.4, 130.9, and 196.3 observations in each condition. Since we cannot collect partial participants, we should round these numbers up, and because we have 2 independent groups, we will collect 66 observations for look 1 (33 in each condition), 132 at the second look (66 in each condition) and 198 at the third look (99 in each condition).

```
design <- getDesignGroupSequential(
kMax = 3,
typeOfDesign = "asP",
sided = 2,
alpha = 0.05,
beta = 0.1
)
kable(summary(design))
```

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|

asP | 3 | 1 | 0.3333333 | 0.05 | 0.1 | FALSE | 2 | 0 | 0.0226416 | none | 2.279428 | 0.0113208 |

asP | 3 | 2 | 0.6666667 | 0.05 | 0.1 | FALSE | 2 | 0 | 0.0381691 | none | 2.294911 | 0.0108691 |

asP | 3 | 3 | 1.0000000 | 0.05 | 0.1 | FALSE | 2 | 0 | 0.0500000 | none | 2.295938 | 0.0108397 |

```
sampleSizeResult <- getSampleSizeMeans(
design = design,
groups = 2,
alternative = 0.5,
stDev = 1
)
kable(summary(sampleSizeResult))
```

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.5 | FALSE | 0 | FALSE | 1 | 2 | 1 | 196.2899 | 98.14493 | 98.14493 | 65.42995 | 0.3940421 | 0.7315581 | 192.311 | 174.9527 | 122.6407 | -0.5776870 | 0.5776870 | 0.0226416 |

2 | 0.5 | FALSE | 0 | FALSE | 1 | 2 | 1 | 196.2899 | 98.14493 | 98.14493 | 130.85991 | 0.3375160 | 0.7315581 | 192.311 | 174.9527 | 122.6407 | -0.4061647 | 0.4061647 | 0.0217382 |

3 | 0.5 | FALSE | 0 | FALSE | 1 | 2 | 1 | 196.2899 | 98.14493 | 98.14493 | 196.28986 | 0.1684419 | 0.7315581 | 192.311 | 174.9527 | 122.6407 | -0.3304143 | 0.3304143 | 0.0216794 |

Now imagine that due to logistical issues, we do not manage to analyze the data until we have collected data from 76 observations (38 in each condition) instead of the planned 66 observations. So our first look at the data does not occur at 33.3% of planned sample, but at 76/198 = 38.4% of the planned sample. We can recalculate the level we should use for each look at the data, based on the current look, and planned future looks. Instead of using the -levels 0.0226, 0.0217, and 0.0217 at the three respective looks (as indicated above in the summary of the originally planned design), we can adjust the information rates in the Shiny app (Double click on a cell to edit it; hit Ctrl+Enter to finish editing, or Esc to cancel):

The updated -levels are 0.0253 for the current look, 0.0204 for the second look, and 0.0216 for the final look. To compute updated bounds in R directly, we can use the code:

```
design <- getDesignGroupSequential(
typeOfDesign = "asP",
informationRates = c(76 / 198, 2 / 3, 1),
alpha = 0.05,
sided = 2
)
kable(summary(design))
```

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|

asP | 3 | 1 | 0.3838384 | 0.05 | 0.2 | FALSE | 2 | 0 | 0.0253271 | none | 2.236377 | 0.0126635 |

asP | 3 | 2 | 0.6666667 | 0.05 | 0.2 | FALSE | 2 | 0 | 0.0381691 | none | 2.318176 | 0.0102199 |

asP | 3 | 3 | 1.0000000 | 0.05 | 0.2 | FALSE | 2 | 0 | 0.0500000 | none | 2.296496 | 0.0108238 |

It is also possible to correct the -level if the final look at the data changes, for example because you are not able to collect the intended sample size, or because due to unforeseen circumstances you collect more data than planned. If this happens, we can no longer use the -spending function we chose, and instead have to provide a user-defined -spending function by updating the timing and -spending function to reflect the data collection as it actually occurred up to the final look.

Assuming the second look in our earlier example occurred as originally planned, but the last look occurred at 206 participants instead of 198 we can compute an updated -level for the last look. Given the current total sample size, we need to recompute the -levels for the earlier looks, which now occurred at 72/206 = 0.369, 132/206 = 0.641, and for the last look at 206/206 = 1.

Because the first and second look occurred with the adjusted -levels we computed after the first adjustment (-levels of 0.0253 and 0.0204) we can look at the “Cumulative alpha spent” row and see how much of our Type I error rate we spent so far (0.0253 and 0.382). We also know we want to spend the remainder of our Type I error rate at the last look, for a total of 0.05.

Our actual -spending function is no longer captures by the Pocock spending function after collecting more data than planned, but instead, we have a user defined spending function. We can enter both the updated information rates and the final -spending function directly in the Shiny app by selecting the “User defined alpha spending (asUser)” option as “Type of design”:

The output shows the computed -level for this final look is 0.0210 instead of 0.0216. The difference is very small in this specific case, but might be larger depending on the situation. This example shows the flexibility of group designs when -spending functions are used. We can also perform these calculations in R directly:

```
design <- getDesignGroupSequential(
typeOfDesign = "asUser",
informationRates =
c(72 / 206, 132 / 206, 1),
alpha = 0.05,
sided = 2, userAlphaSpending = c(0.0253, 0.0382, 0.05)
)
kable(summary(design))
```

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

asUser | 3 | 1 | 0.3495146 | 0.05 | 0.2 | FALSE | 2 | 0 | 0.0253 | 0.0253 | none | 2.236791 | 0.0126500 |

asUser | 3 | 2 | 0.6407767 | 0.05 | 0.2 | FALSE | 2 | 0 | 0.0382 | 0.0382 | none | 2.328780 | 0.0099354 |

asUser | 3 | 3 | 1.0000000 | 0.05 | 0.2 | FALSE | 2 | 0 | 0.0500 | 0.0500 | none | 2.312358 | 0.0103790 |

We will once again start with the default settings of the Shiny app which is for a fixed design with one look. Click on the “Endpoint” tab to choose how you want to specify the desired endpoint in this study. We will assume we plan to perform a test, and therefore, that our endpoint is based on the means we observe.

Then click the “Trial Settings” tab. Here, you can specify if you want to calculate the required sample size (to achieve a desired power) or compute the expected power (based on a chosen sample size). By default, the calculation will be for a two-group (independent) test.

The same number of individuals are collected in each group (allocation ratio = 1). It is possible to choose to use a normal approximation (which some software programs use) but the default settings where the calculations are based on the distribution, will be (ever so slightly) more accurate.

The effect under the null hypothesis is 0 by default, the default effect under the alternative is 0.2, and the default standard deviation is 1. This means that by default the power analysis is for a standardized effect size of Cohen’s d = 0.2/1 = 0.2. That is a small effect. In this example we will assume a researcher is interested in detecting a somewhat more substantial effect size, a mean difference of 0.5. This can be specified by changing the effect under the alternative to 0.5. Note that it is possible to compute the power for multiple values by selecting a value larger than 1 in the “# values” drop-down menu (but we will calculate power for a single alternative for now).

We can also directly perform these calculations in R:

```
design <- getDesignGroupSequential(
kMax = 1,
alpha = 0.05,
sided = 2
)
kable(summary(getSampleSizeMeans(design, alternative = 0.5)))
```

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.5 | FALSE | 0 | FALSE | 1 | 2 | 1 | 127.5316 | 63.76578 | 63.76578 | -0.3504905 | 0.3504905 | 0.05 |

These calculations show that for a fixed design we should collect 128 participants (64 in each condition) to achieve 80% power for a Cohen’s d of 0.5 (or a mean difference of 0.5 with an expected population standard deviation of 1).

This result is similar to what can be computed in power analysis software for non-sequential designs, such as G*power.

We will now look at power in a sequential design. Change the slider for the number of looks (kMax) to 3. Furthermore, change the Type II error rate to 0.1 (a default of 0.2 is, regardless of what Cohen thought, really a bit large). By default rpact assumes we will look at the data at equal times – after 33%, 67%, and 100% of the data is collected. The default design is an O’Brien-Fleming design, with a one-sided test. Set the alternative hypothesis in the “Trial Settings” tab to 0.5. We can compute the sample size we would need for a sequential group design to achieve the desired error rates for a specified alternative using the `getSampleSizeMeans()`

function in R.

```
seq_design_of <- getDesignGroupSequential(
kMax = 3,
typeOfDesign = "OF",
sided = 1,
alpha = 0.05,
beta = 0.1
)
# Compute the sample size we need
power_res_of <- getSampleSizeMeans(
design = seq_design_of,
groups = 2,
alternative = 0.5,
stDev = 1,
allocationRatioPlanned = 1,
normalApproximation = FALSE
)
kable(summary(power_res_of))
```

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.5 | FALSE | 0 | FALSE | 1 | 2 | 1 | 141.8407 | 70.92034 | 70.92034 | 47.28022 | 0.1055285 | 0.6295404 | 140.8822 | 132.0052 | 107.0864 | 0.9101351 |

2 | 0.5 | FALSE | 0 | FALSE | 1 | 2 | 1 | 141.8407 | 70.92034 | 70.92034 | 94.56045 | 0.5240119 | 0.6295404 | 140.8822 | 132.0052 | 107.0864 | 0.4369946 |

3 | 0.5 | FALSE | 0 | FALSE | 1 | 2 | 1 | 141.8407 | 70.92034 | 70.92034 | 141.84067 | 0.2704596 | 0.6295404 | 140.8822 | 132.0052 | 107.0864 | 0.2891226 |

The same output is available in the Shiny app under the “Sample Size” tab.

This output shows that at the first look, with a very strict -level of 0.0015, we will have almost no power. Even if there is a true effect of d = 0.5, in only 10.55% of the studies we run will we be able to stop after collecting 33% of the data has been collected (as we see in the row “Overall power” or “Cumulative Power”). One might wonder whether it would even be worth looking at the data at this time point (the answer might very well be ‘no’, and it is not necessary to design equally spaced looks). At the second look overall power is 62.95%, which gives us a reasonable chance to stop if there is an effect, at the the final look it should be 90%, as this is what we designed the study to achieve. We can also print the full results (instead of just a summary), or select “Details” in the Shiny app:

`kable(power_res_of)`

stages | alternative | meanRatio | thetaH0 | normalApproximation | stDev | groups | allocationRatioPlanned | maxNumberOfSubjects | maxNumberOfSubjects1 | maxNumberOfSubjects2 | numberOfSubjects | rejectPerStage | earlyStop | expectedNumberOfSubjectsH0 | expectedNumberOfSubjectsH01 | expectedNumberOfSubjectsH1 | criticalValuesEffectScale |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.5 | FALSE | 0 | FALSE | 1 | 2 | 1 | 141.8407 | 70.92034 | 70.92034 | 47.28022 | 0.1055285 | 0.6295404 | 140.8822 | 132.0052 | 107.0864 | 0.9101351 |

2 | 0.5 | FALSE | 0 | FALSE | 1 | 2 | 1 | 141.8407 | 70.92034 | 70.92034 | 94.56045 | 0.5240119 | 0.6295404 | 140.8822 | 132.0052 | 107.0864 | 0.4369946 |

3 | 0.5 | FALSE | 0 | FALSE | 1 | 2 | 1 | 141.8407 | 70.92034 | 70.92034 | 141.84067 | 0.2704596 | 0.6295404 | 140.8822 | 132.0052 | 107.0864 | 0.2891226 |

We see that the maximum number of subjects we would need to collect is 141.8, or rounded up, 142. The expected number of subjects under (when there is no true effect) is 140.9 - we will almost always collect data up to the third look, unless we make a Type I error and stop at the first two looks.

The expected number of subjects under (i.e., d = 0.5) is 107.1. If there is a true effect of d = 0.5, we will stop early in some studies, and therefore the average expected sample size is lower than the maximum.

We can plot the results across a range of possible effect sizes:

```
sample_res_plot <- getPowerMeans(
design = seq_design_of,
groups = 2,
alternative = seq(0, 1, 0.01),
stDev = 1,
allocationRatioPlanned = 1,
maxNumberOfSubjects = 142, # rounded up
normalApproximation = FALSE
)
# code for plot (not run, we show an annotated version of this plot)
# plot(sample_res_plot, type = 6, legendPosition = 6)
```

To create this plot in the Shiny app, you need to specify the design, in the endpoint tab select “Means”, and in the trial settings select “Power” as the calculation target, two groups, and for the number of values, select 50 from the drop-down menu. Specify the lower (i.e., 0) and upper (i.e., 1) value of the mean difference (given the standard deviation of 1, these values will also be Cohen’s d effect sizes). The maximum number of subjects is set to 142 (based on the power analysis we performed above). Go to the “Plot” tab and select the “Sample Size [6]” plot.

If you click on the “Plot” tab and select the Sample Size graph [6], and set the max sample size (nMax) to 50, you see that depending on the true effect size, there is a decent probability of stopping early (blue line) compared to at the final look (green line). Furthermore, the larger the effect size, the lower the Average sample size will be (red line).

Without sequential analyses we would collect 50 participants (the maximum sample size specified). But when the true effect size is large, we have a high probability to stop early, and the sample size that one needs to collect will on average (in the long run of doing many sequential designs) be lower.

After this general introduction to the benefits of group sequential designs to efficiently design well powered studies, we will look at more concrete examples of how to perform an a-priori power analysis for sequential designs.

When designing a study where the goal is to test whether a specific effect can be statistically rejected researchers often want to make sure their sample size is large enough to have sufficient power for an effect size of interest. This is done by performing an a-priori power analysis. Given a specified effect size, -level, and desired power, an a-priori power analysis will indicate the number of observations that should be collected.

An informative study has a high probability of correctly concluding an effect is present when it is present, and absent when it is absent. An a-priori power analysis is used to choose a sample size to achieve desired Type I and Type II error rates, in the long run, given assumptions about the null and alternative model.

We will assume that we want to design a study that can detect a difference of 0.5, with an assumed standard deviation in the population of 1, which means the expected effect is a Cohen’s d of 0.5. If we plan to analyze our hypothesis in a one-sided test (given our directional prediction), set the overall -level to 0.05, and want to achieve a Type II error probability of 0.1 (or a power of 0.9). Finally, we believe it is feasible to perform 2 interim analyses, and one final analysis (e.g., collect the data across three weeks, and we are willing to stop the data collection after any Friday). How many observations would we need?

The decision depends on the final factor we need to decide in a sequential design: the -spending function. We can choose an -spending function as we design our experiment, and compare different choices of a spending function. We will start by examining the sample size we need to collect if we choose an O’Brien-Fleming -spending function.

On the “Endpoint” tab we specify means. Then we move to the “Trial Design” tab. It is easy in rpact to plot power across a range of effect sizes, by selecting multiple values from the drop-down menu (i.e., 5). We set 0.3 and 0.7 as the lower and upper value, and keep the standard deviation at 1, so that we get the sample sizes for the range of Cohen’s d 0.3 to 0.7.

Sometimes you might have a clearly defined effect size to test against – such as a theoretically predicted effect size, or a smallest practically relevant effect size. Other times, you might primarily know the sample size you can achieve to collect, and you want to perform a sensitivity analysis, where you examine which effect size you can detect with a desired power, given a certain sample size. Plotting power across a range of effect sizes is typically useful. Even if you know which effect size you expect, you might want to look at what would be the consequences of the true effect size being slightly different than expected.

Open the “Plot” tab and from the drop-down menu select “Sample size [6]”. You will see a plot like the one below, created with the R package. From the results (in the row “Maximum number of subjects”), we see that if the true effect size is indeed d = 0.5, we would need to collect at most 141 participants (the result differs very slightly from the power analysis reported above, as we use the O’Brien-Fleming alpha spending function, and not the O’Brien-Fleming correction). In the two rows below, we see that this is based on 71 (rounded up) participants in each condition, so in practice we would actually collect a total of 142 participants due to upward rounding within each condition.

```
design <- getDesignGroupSequential(
typeOfDesign = "asOF",
alpha = 0.05, beta = 0.1
)
sample_res <- (getSampleSizeMeans(design,
alternative = c(0.3, 0.4, 0.5, 0.6, 0.7)
))
plot(sample_res, type = 5, legendPosition = 4)
```

This maximum is only slightly higher than if we had used a fixed design. For a fixed design (which you can examine by moving the slider for the maximum number of stages back to 1), we would need to collect 69.2 participants, or 138.4 in total, while for a sequential design, the maximum sample size per condition is 70.5.

The difference between a fixed design and a sequential design can be calculated by looking at the “Inflation factor”. We can find the inflation factor for the sequential design in the “Characteristics” in the “Design” tab (select for the R output “Details + characteristics”, or “Summary + details + characteristics”) which is 1.0187. In other words, the maximum sample size increased to 69.2 x 1.0187 = 70.5 per condition. The inflation is essentially caused by the reduction in the -level at the final look, and differs between designs (e.g., for a Pocock type alpha spending function, the inflation factor for the current design is larger, namely 1.1595)

However, the maximum sample size is not the expected sample size for this design, because of the possibility that we can stop data collection at an earlier look in the sequential design. In the long run, if d = 0.5, and we use an O’Brien-Fleming -spending function, and ignoring upward rounding because we can only collect a complete number of observations, we will sometimes collect 47 participants and stop after the first look see the row “Number of subjects [1]”), sometimes 94 and stop after the second look (see the row “Number of subjects [2]”)), and sometimes 141 and stop after the last look (see the row “Number of subjects [1]”)).

As we see in the row “Exit probability for efficacy (under H1)” we can stop early 6.75% of the time after look 1, 54.02% after look two, and in the remaining cases we will stop 1 - (0.0675 + 0.5402) = 39.23% of the time at the last look.

This means that, assuming there is a true effect of d = 0.5, the *expected* sample size on average is the probability of stopping at each look, multiplied by the number of observations we collect at each look, so 0.0675 * 47.0 + 0.5402 * 94.0 + ((1 - (0.0675 + 0.5402)) * 141.0) = 109.3, which matches the row “Expected number of subjects under H1” (again, assuming the alternative hypothesis of d = 0.5 is correct). So, in any single study we might need to collect slightly more data than in a fixed design, but on average we will need to collect less observations in a sequential design, namely 109.3, instead of 138.4 in a fixed design (assuming the alternative hypothesis is true).

```
design <- getDesignGroupSequential(typeOfDesign = "asOF", alpha = 0.05, beta = 0.1)
# getDesignCharacteristics(design)$inflationFactor
sample_res <- (getSampleSizeMeans(design, alternative = c(0.5)))
kable(sample_res)
```

stages | alternative | meanRatio | thetaH0 | normalApproximation | stDev | groups | allocationRatioPlanned | maxNumberOfSubjects | maxNumberOfSubjects1 | maxNumberOfSubjects2 | numberOfSubjects | rejectPerStage | earlyStop | expectedNumberOfSubjectsH0 | expectedNumberOfSubjectsH01 | expectedNumberOfSubjectsH1 | criticalValuesEffectScale |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.5 | FALSE | 0 | FALSE | 1 | 2 | 1 | 140.9904 | 70.49522 | 70.49522 | 46.99681 | 0.0674865 | 0.6077022 | 140.1886 | 132.2839 | 109.2587 | 0.9953709 |

2 | 0.5 | FALSE | 0 | FALSE | 1 | 2 | 1 | 140.9904 | 70.49522 | 70.49522 | 93.99362 | 0.5402157 | 0.6077022 | 140.1886 | 132.2839 | 109.2587 | 0.4484318 |

3 | 0.5 | FALSE | 0 | FALSE | 1 | 2 | 1 | 140.9904 | 70.49522 | 70.49522 | 140.99043 | 0.2922978 | 0.6077022 | 140.1886 | 132.2839 | 109.2587 | 0.2874698 |

For a Pocock -spending function the maximum sample size is larger (you can check by changing the spending function). The reason is that the -level at the final look is lower for a Pocock spending function than for the O’Brien-Fleming spending function, and the sample size required to achieve a desired power is thus higher. However, because the -level at the first look is higher, there is a higher probability of stopping early, and therefore the expected sample size is lower for a Pocock spending function (97.7 compared to 109.3). It is up to the researcher to choose a spending function, and weigh how desirable it would be to stop early, given some risk in any single study of increasing the sample size at the final look. For these specific design parameters, the Pocock -spending function might be more efficient on average, but also more risky in any single study.

So far, the sequential design would only stop at an interim analysis if we can reject . It is also possible to stop for futility, for example, based on a -spending function. We can directly compare the previous design with a design where we stop for futility. Just as we are willing to distribute our Type I error rate across interim analyses, we can distribute our Type II error rate across looks, and decide to stop for futility when we can reject the presence of an effect at least as large as 0.5, even if we are then making a Type II error.

If there actually is no effect, such designs are more efficient. One can choose in advance to stop data collection when the presence of the effect the study was designed to detect can be rejected (i.e., binding -spending), but it is typically recommended to allow the possibility to continue data collection (i.e., non-binding beta-spending). Adding futility bounds based on -spending functions reduce power, and increase the required sample size to reach a desired power, but this is on average compensated by the fact that studies stop earlier due to futility, which can make designs more efficient.

When an -spending function is chosen in the rpact Shiny app, a new drop-drown menu appears that allows users to choose a beta-spending function. In the R package, we simply add `typeBetaSpending = "bsOF"`

to the specification of the design. You do not need to choose the same spending approach for and as is done in this example.

```
design <- getDesignGroupSequential(
typeOfDesign = "asOF",
alpha = 0.05, beta = 0.1, typeBetaSpending = "bsOF"
)
sample_res <- (getSampleSizeMeans(design, alternative = 0.5))
kable(sample_res)
```

stages | alternative | meanRatio | thetaH0 | normalApproximation | stDev | groups | allocationRatioPlanned | maxNumberOfSubjects | maxNumberOfSubjects1 | maxNumberOfSubjects2 | numberOfSubjects | rejectPerStage | futilityStop | futilityPerStage | earlyStop | expectedNumberOfSubjectsH0 | expectedNumberOfSubjectsH01 | expectedNumberOfSubjectsH1 | criticalValuesEffectScale | futilityBoundsEffectScale | futilityBoundsPValueScale |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.5 | FALSE | 0 | FALSE | 1 | 2 | 1 | 148.1811 | 74.09056 | 74.09056 | 49.39371 | 0.0732745 | 0.0439543 | 0.0043861 | 0.6747398 | 99.5733 | 120.9604 | 111.0173 | 0.9677030 | -0.2506278 | 0.8085415 |

2 | 0.5 | FALSE | 0 | FALSE | 1 | 2 | 1 | 148.1811 | 74.09056 | 74.09056 | 98.78742 | 0.5575110 | 0.0439543 | 0.0395682 | 0.6747398 | 99.5733 | 120.9604 | 111.0173 | 0.4370829 | 0.1516813 | 0.2264013 |

3 | 0.5 | FALSE | 0 | FALSE | 1 | 2 | 1 | 148.1811 | 74.09056 | 74.09056 | 148.18113 | 0.2692145 | 0.0439543 | NA | 0.6747398 | 99.5733 | 120.9604 | 111.0173 | 0.2803114 | NA | NA |

We see that with a -spending function the Expected number of subjects under has increased from 109.3 to 111.0. The maximum number of subjects has increased from 141 to 148.2. So, if the alternative hypothesis is true, stopping for futility comes at a cost. However, it is possible that is true.

At the last look in our sequential design, which we designed to have 90% power, we are willing to act as if is true with a 10% error rate. We can reverse the null and alternative hypothesis, and view the same decision process as an equivalence test. In this view, we test whether we can reject the presence of a meaningful effect. For example, if our smallest effect size of interest is a mean difference of 0.5, and we observe a mean difference that is surprisingly far away from 0.5, we can reject the presence of an effect that is large enough to care about. In essence, in such an equivalence test the Type II error of the original null hypothesis significance test has now become the Type I error rate. Because we have designed our null hypothesis significance test to have 90% power for a mean difference of 0.5, 10% of the time we would incorrectly decide to act as if an effect of at least 0.5 is absent. This is statistically comparable to performing an equivalence test with an -level of 10%, and decide to act as if we can reject the presence of an effect at least as large as 0.5, which should also happen 10% of the time, in the long run.

If we can reject the presence of a meaningful effect, whenever is true, at an earlier look, we would save resources when is true. We see that the expected number of subjects under was 140.2. In other words, when is true, we would continue to the last look most of the time (unless we made a Type 1 error at look 1 or 2). With a -spending function, the expected number of subjects under has decreased substantially, to 99.6. The choice of whether you want to use a -spending function depends on the goals of your study. If you believe there is a decent probability is true, and you would like to efficiently conclude this from the data, the use of a -spending approach might be worth considering.

A challenge when wanting to interpret the observed effect size is that whenever a study is stopped early when rejecting , there is a risk that we stopped because due to random variation we happened to observe a large effect size at the time of the interim analysis. This which means that the observed effect size at these interim analyses over-estimates the true effect.

A similar issue is at play when reporting values and confidence intervals. When a sequential design is used, the distribution of a value that does not account for the sequential nature of the design is no longer uniform when is true. A value is the probability of observing a result at least as extreme as the result that was observed, given that is true. It is no longer straightforward to determine what ‘at least as extreme’ means a sequential design (Cook, 2002). It is possible to compute adjusted effect size estimates, confidence intervals, and values in rpact. This currently cannot be done in the Shiny app.

`Warning: 'thetaH1' (0.5) will be ignored because 'nPlanned' is not defined`

`Warning: 'assumedStDev' (1) will be ignored because 'nPlanned' is not defined`

Imagine we have performed a study planned to have at most 3 equally spaced looks at the data, where we perform a two-sided test with an of 0.05, and we use a Pocock type -spending function, and we observe mean differences between the two conditions of , 95% CI , , at stage 1, , 95% CI , , at stage 2, and , 95% CI , , at the last look. Based on a Pocock-like -spending function with three equally spaced looks the -level at each look for a two-sided test is 0.02264, 0.02174, and 0.02168. We can thus reject after look 3. But we would also like to report an effect size, and adjusted values and confidence intervals.

The first step is to create a dataset with the results at each look, consisting of the sample sizes, means, and standard deviations. Note that these are the sample sizes, means, and standard deviations only based on the data at each stage. In other words, we compute the means and standard deviations of later looks by excluding the data in earlier looks, so every mean and standard deviation in this example is based on 33 observations in each condition.

```
data_means <- getDataset(
n1 = c(33, 33, 33),
n2 = c(33, 33, 33),
means1 = c(0.6067868, 0.2795294, 0.7132186),
means2 = c(0.01976029, 0.08212538, 0.08982903),
stDevs1 = c(1.135266, 1.35426, 1.013671),
stDevs2 = c(1.068052, 0.9610714, 1.225192)
)
kable(summary(data_means))
```

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|

1 | 1 | 33 | 0.6067868 | 1.1352660 | 33 | 0.6067868 | 1.135266 |

1 | 2 | 33 | 0.0197603 | 1.0680520 | 33 | 0.0197603 | 1.068052 |

2 | 1 | 33 | 0.2795294 | 1.3542600 | 66 | 0.4431581 | 1.250835 |

2 | 2 | 33 | 0.0821254 | 0.9610714 | 66 | 0.0509428 | 1.008615 |

3 | 1 | 33 | 0.7132186 | 1.0136710 | 99 | 0.5331783 | 1.178826 |

3 | 2 | 33 | 0.0898290 | 1.2251920 | 99 | 0.0639049 | 1.079461 |

We then take our design:

```
seq_design <- getDesignGroupSequential(
kMax = 3,
typeOfDesign = "asP",
sided = 2,
alpha = 0.05,
beta = 0.1
)
```

and compute the results based on the data we entered:

```
res <- getAnalysisResults(
seq_design,
equalVariances = FALSE,
dataInput = data_means,
thetaH1 = 0.5,
assumedStDev = 1
)
```

`Warning: 'thetaH1' (0.5) will be ignored because 'nPlanned' is not defined`

`Warning: 'assumedStDev' (1) will be ignored because 'nPlanned' is not defined`

We can then print a summary of the results:

`kable(summary(res))`

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | FALSE | FALSE | TRUE | continue | 0.3410959 | -0.0468454 | 1.2208984 | 0.0756568 | NA | NA | NA | NA |

1 | FALSE | FALSE | TRUE | continue | 0.2302843 | -0.0674732 | 0.8519037 | 0.1066797 | NA | NA | NA | NA |

1 | FALSE | FALSE | TRUE | reject | NA | 0.0974468 | 0.8410999 | 0.0105210 | 0.0392758 | 0.0218652 | 0.7425604 | 0.4028593 |

The results show that the action after look 1 and 2 was to continue data collection, and that we could reject at the third look. The unadjusted mean difference is provided in the row “Overall effect size” and at the final look this was 0.469. The adjusted mean difference is provided in the row “Median unbiased estimate” and is lower, and the adjusted confidence interval is in the row “Final confidence interval”, giving the result 0.403, 95% CI [0.022, 0.743].

The unadjusted values for a one-sided tests are reported in the row “Overall p-value”. The actual values for our two-sided test would be twice as large, so 0.0342596, 0.0495679, 0.0038994. The adjusted value at the final look is provided in the row “Final p-value” and it is 0.03928.

The probability of finding a significant result, given the data that have been observed up to an interim analysis, is called *conditional power*. This approach can be useful in adaptive designs - designs where the final sample sizes is updated based on an early look at the data. In *blinded* sample size recalculation no effect size is calculated at an earlier look, but other aspects of the design, such as the standard deviation, are updated. In an *unblinded* sample size recalculation, the effect size estimate at an early look is used to determine the final sample size.

Let us imagine that we perform a sequential design using a Pocock - and -spending function:

```
seq_design <- getDesignGroupSequential(
sided = 1,
alpha = 0.05,
beta = 0.1,
typeOfDesign = "asP",
typeBetaSpending = "bsP",
bindingFutility = FALSE
)
```

We perform an a-priori power analysis based on a smallest effect size of interest of d = 0.38, which yields a maximum number of subjects of 330.

```
power_res <- getSampleSizeMeans(
design = seq_design,
groups = 2,
alternative = 0.38,
stDev = 1,
allocationRatioPlanned = 1,
normalApproximation = FALSE
)
kable(summary(power_res))
```

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.38 | FALSE | 0 | FALSE | 1 | 2 | 1 | 329.3995 | 164.6997 | 164.6997 | 109.7998 | 0.4933003 | 0.0763383 | 0.0452832 | 0.8808324 | 157.631 | 201.9582 | 173.5479 | 0.3866059 | 0.0560270 | 0.3848367 |

2 | 0.38 | FALSE | 0 | FALSE | 1 | 2 | 1 | 329.3995 | 164.6997 | 164.6997 | 219.5996 | 0.3111939 | 0.0763383 | 0.0310550 | 0.8808324 | 157.631 | 201.9582 | 173.5479 | 0.2706352 | 0.1590627 | 0.1199289 |

3 | 0.38 | FALSE | 0 | FALSE | 1 | 2 | 1 | 329.3995 | 164.6997 | 164.6997 | 329.3995 | 0.0955058 | 0.0763383 | NA | 0.8808324 | 157.631 | 201.9582 | 173.5479 | 0.2190461 | NA | NA |

We first looked at the data after we collected 110 observations. At this time, we observed a mean difference of 0.1. Let us say we assume the population standard deviation is 1, and that we are willing to collect 330 observations in total, as this gave us 90% power for the effect we wanted to detect, a mean difference of 0.5. Given the effect sie we observed, which is smaller than our smallest effect size of interest, what is the probability we will find a significant effect if we continue? We create a dataset:

```
data_means <- getDataset(
n1 = c(55),
n2 = c(55),
means1 = c(0.1), # for directional test, means 1 > means 2
means2 = c(0),
stDevs1 = c(1),
stDevs2 = c(1)
)
```

and analyze the results:

```
stage_res <- getStageResults(seq_design,
equalVariances = TRUE,
dataInput = data_means
)
kable(stage_res)
```

stages | overallTestStatistics | overallPValues | overallMeans1 | overallMeans2 | overallStDevs1 | overallStDevs2 | overallSampleSizes1 | overallSampleSizes2 | testStatistics | pValues | effectSizes | thetaH0 | direction | normalApproximation | equalVariances |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.5244044 | 0.300536 | 0.1 | 0 | 1 | 1 | 55 | 55 | 0.5244044 | 0.300536 | 0.1 | 0 | upper | FALSE | TRUE |

2 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0 | upper | FALSE | TRUE |

3 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0 | upper | FALSE | TRUE |

We can now perform a conditional power analysis based on the data we have observed so far. An important question is which effect size should be entered. Irrespective of the effect size we expected when designing the study, we have observed an effect of d = 0.1, and the smallest effect size of interest was a d = 0.38. We can compute the power under the assumption that the true effect size is d = 0.1 and d = 0.38:

```
# Compute conditional power after the first look
con_power_1 <- getConditionalPower(
design = seq_design,
stageResults = stage_res,
nPlanned = c(110, 110), # The sample size planned for the subsequent stages.
thetaH1 = 0.1, # alternative effect
assumedStDev = 1 # standard deviation
)
kable(con_power_1)
```

nPlanned | allocationRatioPlanned | conditionalPower | thetaH1 | assumedStDev |
---|---|---|---|---|

NA | 1 | NA | 0.1 | 1 |

110 | 1 | 0.0381649 | 0.1 | 1 |

110 | 1 | 0.0903539 | 0.1 | 1 |

If the true effect size is 0.1, the power is 0.09 at the final look. Under this assumption, there is little use in continuing the data collection. Under the assumption that the smallest effect size of interest would be true:

```
# Compute conditional power after the first look
con_power_2 <- getConditionalPower(
design = seq_design,
stageResults = stage_res,
nPlanned = c(110, 110), # The sample size planned for the subsequent stages.
thetaH1 = 0.38, # alternative effect
assumedStDev = 1 # standard deviation
)
kable(con_power_2)
```

nPlanned | allocationRatioPlanned | conditionalPower | thetaH1 | assumedStDev |
---|---|---|---|---|

NA | 1 | NA | 0.38 | 1 |

110 | 1 | 0.3805418 | 0.38 | 1 |

110 | 1 | 0.7126475 | 0.38 | 1 |

Under the assumption that the smallest effect size of interest exists, there is a reasonable probability of still observing a significant result at the last look (71.26%).

Because of the flexibility in choosing the number of looks, and the -spending function, it is important to preregister your statistical analysis plan. Preregistration allows other researchers to evaluate the severity of a test – how likely were you to find an effect if it is there, and how likely were you to not find an effect if there was no effect. Flexibility in the data analysis increases the Type 1 error rate, or the probability of finding an effect if there actually isn’t any effect (i.e., a false positive), and preregistering your sequential analysis plan can reveal to future readers that you severely tested your prediction.

The use of sequential analyses gives researchers more flexibility. To make sure this flexibility is not abused, the planned experimental design should be preregistered. The easiest way to do this is by either adding the rpact R code, or when the Shiny app is used, to use the export function and store the planned design as a PDF, R Markdown, or R file.

The **sample size** for a trial with binary endpoints can be calculated using the function `getSampleSizeRates()`

. This function is fully documented in the help page (`?getSampleSizeRates`

). Hence, we only provide some examples below.

First, load the rpact package.

```
library(rpact)
packageVersion("rpact")
```

`[1] '4.0.0'`

To get the **direction** of the effects correctly, note that in rpact the **index “2” in an argument name always refers to the control group, “1” to the intervention group, and treatment effects compare treatment versus control**. Specifically, for binary endpoints, the probabilities of an event in the control group and intervention group, respectively, are given by arguments `pi2`

and `pi1`

. The default treatment effect is the absolute risk difference `pi1 - pi2`

but the relative risk scale `pi1/pi2`

is also supported if the argument `riskRatio`

is set to `TRUE`

.

```
# Example of a standard trial:
# - probability 25% in control (pi2 = 0.25) vs. 40% (pi1 = 0.4) in intervention
# - one-sided test (sided = 1)
# - Type I error 0.025 (alpha = 0.025) and power 80% (beta = 0.2)
sampleSizeResult <- getSampleSizeRates(
pi2 = 0.25, pi1 = 0.4,
sided = 1, alpha = 0.025, beta = 0.2
)
kable(sampleSizeResult)
```

stages | pi1 | riskRatio | thetaH0 | normalApproximation | pi2 | groups | allocationRatioPlanned | directionUpper | nFixed | nFixed1 | nFixed2 | criticalValuesEffectScale |
---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.4 | FALSE | 0 | TRUE | 0.25 | 2 | 1 | TRUE | 303.7377 | 151.8689 | 151.8689 | 0.1032292 |

As per the output above, the required **total sample size** is 304 and the critical value corresponds to a minimal detectable difference (on the absolute risk difference scale, the default) of approximately 0.103. This calculation assumes that pi2 = 0.25 is the observed rate in treatment group 2.

A useful summary is provided with the generic `summary()`

function:

`kable(summary(sampleSizeResult))`

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.4 | FALSE | 0 | TRUE | 0.25 | 2 | 1 | TRUE | 303.7377 | 151.8689 | 151.8689 | 0.1032292 |

You can change the randomization allocation between the treatment groups using `allocationRatioPlanned`

:

```
# Example: Extension of standard trial
# - 2(intervention):1(control) randomization (allocationRatioPlanned = 2)
kable(summary(getSampleSizeRates(
pi2 = 0.25, pi1 = 0.4,
sided = 1, alpha = 0.025, beta = 0.2,
allocationRatioPlanned = 2
)))
```

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.4 | FALSE | 0 | TRUE | 0.25 | 2 | 2 | TRUE | 346.3204 | 230.8803 | 115.4401 | 0.1041707 |

`allocationRatioPlanned = 0`

can be defined in order to obtain the optimum allocation ratio minimizing the overall sample size (the optimum ample size is only slightly smaller than sample size with equal allocation; practically, this has no effect):

```
# Example: Extension of standard trial
# optimum randomization ratio
kable(summary(getSampleSizeRates(
pi2 = 0.25, pi1 = 0.4,
sided = 1, alpha = 0.025, beta = 0.2,
allocationRatioPlanned = 0
)))
```

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.4 | FALSE | 0 | TRUE | 0.25 | 2 | 0.9526172 | TRUE | TRUE | 303.5628 | 148.0982 | 155.4645 | 0.1031639 |

**Power** at given sample size can be calculated using the function `getPowerRates()`

. This function has the same arguments as `getSampleSizeRates()`

except that the maximum total sample size needs to be defined (`maxNumberOfSubjects`

) and the Type II error `beta`

is no longer needed. For one-sided tests, the direction of the test is also required. The default `directionUpper = TRUE`

indicates that for the alternative the probability in the intervention group `pi1`

is larger than the probability in the control group `pi2`

(`directionUpper = FALSE`

is the other direction):

```
# Example: Calculate power for a simple trial with total sample size 304
# as in the example above in case of pi2 = 0.25 (control) and
# pi1 = 0.37 (intervention)
powerResult <- getPowerRates(
pi2 = 0.25, pi1 = 0.37,
maxNumberOfSubjects = 304, sided = 1, alpha = 0.025
)
kable(powerResult)
```

stages | pi1 | riskRatio | thetaH0 | normalApproximation | pi2 | groups | allocationRatioPlanned | directionUpper | effect | maxNumberOfSubjects | overallReject | nFixed | nFixed1 | nFixed2 | criticalValuesEffectScale |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.37 | FALSE | 0 | TRUE | 0.25 | 2 | 1 | TRUE | 0.12 | 304 | 0.6196486 | 304 | 152 | 152 | 0.1031824 |

The calculated **power** is provided in the output as **“Overall reject”** and is 0.620 for the example.

The `summary()`

command produces the output

`kable(summary(powerResult))`

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.37 | FALSE | 0 | TRUE | 0.25 | 2 | 1 | TRUE | 0.12 | 304 | 0.6196486 | 304 | 152 | 152 | 0.1031824 |

The `getPowerRates()`

(as well as `getSampleSizeRates()`

) functions can also be called with a vector argument for the probability `pi1`

in the intervention group. This is illustrated below via a plot of power depending on this probability. For examples of all available plots, see the R Markdown document How to create admirable plots with rpact.

```
# Example: Calculate power for simple design (with sample size 304 as above)
# for probabilities in intervention ranging from 0.3 to 0.5
powerResult <- getPowerRates(
pi2 = 0.25, pi1 = seq(0.3, 0.5, by = 0.01),
maxNumberOfSubjects = 304, sided = 1, alpha = 0.025
)
# one of several possible plots, this one plotting true effect size vs power
plot(powerResult, type = 7)
```

Sample size calculation proceeds in the same fashion as for superiority trials except that the role of the null and the alternative hypothesis are reversed. I.e., in this case, the non-inferiority margin corresponds to the treatment effect under the null hypothesis (`thetaH0`

) which one aims to reject. Testing in non-inferiority trials is always one-sided.

```
# Example: Sample size for a non-inferiority trial
# Assume pi(control) = pi(intervention) = 0.2
# Test H0: pi1 - pi2 = 0.1 (risk increase in intervention >= Delta = 0.1)
# vs. H1: pi1 - pi2 < 0.1
sampleSizeNoninf <- getSampleSizeRates(
pi2 = 0.2, pi1 = 0.2,
thetaH0 = 0.1, sided = 1, alpha = 0.025, beta = 0.2
)
kable(sampleSizeNoninf)
```

stages | pi1 | riskRatio | thetaH0 | normalApproximation | pi2 | groups | allocationRatioPlanned | directionUpper | nFixed | nFixed1 | nFixed2 | criticalValuesEffectScale |
---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.2 | FALSE | 0.1 | TRUE | 0.2 | 2 | 1 | FALSE | 508.4354 | 254.2177 | 254.2177 | 0.0284932 |

`kable(summary(sampleSizeNoninf))`

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.2 | FALSE | 0.1 | TRUE | 0.2 | 2 | 1 | FALSE | 508.4354 | 254.2177 | 254.2177 | 0.0284932 |

The function `getSampleSizeRates()`

allows to set the number of `groups`

(which is 2 by default) to 1 for the design of single arm trials. The probability under the null hypothesis can be specified with the argument `thetaH0`

and the specific alternative hypothesis which is used for the sample size calculation with the argument `pi1`

. The sample size calculation can be based either on a normal approximation (`normalApproximation = TRUE`

, the default) or on exact binomial probabilities (`normalApproximation = FALSE`

).

```
# Example: Sample size for a single arm trial which tests
# H0: pi = 0.1 vs. H1: pi = 0.25
# (use conservative exact binomial calculation)
samplesSizeResults <- getSampleSizeRates(
groups = 1, thetaH0 = 0.1, pi1 = 0.25,
normalApproximation = FALSE, sided = 1, alpha = 0.025, beta = 0.2
)
kable(summary(samplesSizeResults))
```

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|

1 | 0.25 | 0.1 | FALSE | 1 | TRUE | 53 | 0.1807665 |

Sample size calculation for a group sequential trial is performed in **two steps**:

**Define the (abstract) group sequential design**using the function`getDesignGroupSequential()`

. For details regarding this step, see the vignette Defining group sequential boundaries with rpact.**Calculate sample size**for the binary endpoint by feeding the abstract design into the function`getSampleSizeRates()`

. Note that the power 1 - beta needs to be defined in the design function, and not in`getSampleSizeRates()`

.

In general, rpact supports both one-sided and two-sided group sequential designs. However, if futility boundaries are specified, only one-sided tests are permitted.

R code for a simple example is provided below:

```
# Example: Group-sequential design with O'Brien & Fleming type alpha-spending and
# one interim at 60% information
design <- getDesignGroupSequential(
sided = 1, alpha = 0.025, beta = 0.2,
informationRates = c(0.6, 1), typeOfDesign = "asOF"
)
# Sample size calculation assuming event probabilities are 25% in control
# (pi2 = 0.25) vs 40% (pi1 = 0.4) in intervention
sampleSizeResultGS <- getSampleSizeRates(design, pi2 = 0.25, pi1 = 0.4)
# Standard rpact output (sample size object only, not design object)
kable(sampleSizeResultGS)
```

stages | pi1 | riskRatio | thetaH0 | normalApproximation | pi2 | groups | allocationRatioPlanned | directionUpper | maxNumberOfSubjects | maxNumberOfSubjects1 | maxNumberOfSubjects2 | numberOfSubjects | rejectPerStage | earlyStop | expectedNumberOfSubjectsH0 | expectedNumberOfSubjectsH01 | expectedNumberOfSubjectsH1 | criticalValuesEffectScale |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.4 | FALSE | 0 | TRUE | 0.25 | 2 | 1 | TRUE | 306.3311 | 153.1656 | 153.1656 | 183.7987 | 0.3123193 | 0.3123193 | 305.8645 | 299.3256 | 268.0619 | 0.1869477 |

2 | 0.4 | FALSE | 0 | TRUE | 0.25 | 2 | 1 | TRUE | 306.3311 | 153.1656 | 153.1656 | 306.3311 | 0.4876807 | 0.3123193 | 305.8645 | 299.3256 | 268.0619 | 0.1039268 |

The `summary()`

command produces the output

`kable(summary(sampleSizeResultGS))`

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.4 | FALSE | 0 | TRUE | 0.25 | 2 | 1 | TRUE | 306.3311 | 153.1656 | 153.1656 | 183.7987 | 0.3123193 | 0.3123193 | 305.8645 | 299.3256 | 268.0619 | 0.1869477 |

2 | 0.4 | FALSE | 0 | TRUE | 0.25 | 2 | 1 | TRUE | 306.3311 | 153.1656 | 153.1656 | 306.3311 | 0.4876807 | 0.3123193 | 305.8645 | 299.3256 | 268.0619 | 0.1039268 |

System: rpact 4.0.0, R version 4.3.3 (2024-02-29 ucrt), platform: x86_64-w64-mingw32

To cite R in publications use:

R Core Team (2024). *R: A Language and Environment for Statistical Computing*. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/. To cite package ‘rpact’ in publications use:

Wassmer G, Pahlke F (2024). *rpact: Confirmatory Adaptive Clinical Trial Design and Analysis*. R package version 4.0.0, https://www.rpact.com, https://github.com/rpact-com/rpact, https://rpact-com.github.io/rpact/, https://www.rpact.org.

**First, load the rpact package**

```
library(rpact)
packageVersion("rpact")
```

`[1] '4.0.0'`

Suppose a trial should be conducted in 3 stages where at the first stage 50%, at the second stage 75%, and at the final stage 100% of the information should be observed. O’Brien & Fleming boundaries should be used with one-sided and non-binding futility bounds 0 and 0.5 for the first and the second stage, respectively, on the -value scale.

The endpoints are binary (failure rates) and should be compared in a parallel group design, i.e., the null hypothesis to be tested is which is tested against the alternative

The necessary sample size to achieve 90% power if the failure rates are assumed to be and can be obtained as follows:

```
dGS <- getDesignGroupSequential(
informationRates = c(0.5, 0.75, 1), alpha = 0.025, beta = 0.1,
futilityBounds = c(0, 0.5)
)
r <- getSampleSizeRates(dGS, pi1 = 0.4, pi2 = 0.6)
```

The `summary()`

command creates a nice table for the study design parameters:

`kable(summary(r))`

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.4 | FALSE | 0 | TRUE | 0.6 | 2 | 1 | FALSE | 266.2654 | 133.1327 | 133.1327 | 133.1327 | 0.2958429 | 0.0155584 | 0.0100028 | 0.7153582 | 184.2784 | 228.8316 | 198.2876 | -0.2478148 | 0.0000000 | 0.5000000 |

2 | 0.4 | FALSE | 0 | TRUE | 0.6 | 2 | 1 | FALSE | 266.2654 | 133.1327 | 133.1327 | 199.6990 | 0.4039570 | 0.0155584 | 0.0055556 | 0.7153582 | 184.2784 | 228.8316 | 198.2876 | -0.1652994 | -0.0348964 | 0.3085375 |

3 | 0.4 | FALSE | 0 | TRUE | 0.6 | 2 | 1 | FALSE | 266.2654 | 133.1327 | 133.1327 | 266.2654 | 0.2002002 | 0.0155584 | NA | 0.7153582 | 184.2784 | 228.8316 | 198.2876 | -0.1236876 | NA | NA |

Note that the calculation of the efficacy boundaries on the treatment effect scale is performed under the assumption that is the observed failure rate in the control group and states the *treatment difference to be observed* in order to reach significance (or stop the trial due to futility).

The optimum allocation ratio yields the smallest overall sample size and depends on the choice of and . It can be obtained by specifying `allocationRatioPlanned = 0`

. In our case, due to , the optimum allocation ratio is 1 but calculated numerically, therefore slightly unequal 1:

```
r <- getSampleSizeRates(dGS, pi1 = 0.4, pi2 = 0.6, allocationRatioPlanned = 0)
r$allocationRatioPlanned
```

`[1] 0.9999976`

`round(r$allocationRatioPlanned, 5)`

`[1] 1`

The decision boundaries can be illustrated on different scales.

On the -value scale:

`plot(r, type = 1)`

On the effect size scale:

`plot(r, type = 2)`

On the -value scale:

`plot(r, type = 3)`

Suppose that subjects were planned for the study. The power if the failure rate in the active treatment group is or can be achieved as follows:

```
power <- getPowerRates(dGS,
maxNumberOfSubjects = 280,
pi1 = c(0.4, 0.5), pi2 = 0.6, directionUpper = FALSE
)
power$overallReject
```

`[1] 0.914045 0.377853`

Note that `directionUpper = FALSE`

is used because the study is powered for alternatives being smaller than 0.

The power for (37.8%) is much reduced as compared to the case (where it exceeds 90%):

We also can graphically illustrate the power, the expected sample size, and the early stopping and futility stopping probabilities for a range of alternative values. This can be done by specifying the lower and the upper bound for in `getPowerRates()`

and use the generic `plot()`

command with `type = 6`

:

```
power <- getPowerRates(dGS,
maxNumberOfSubjects = 280,
pi1 = c(0.3, 0.6), pi2 = 0.6, directionUpper = FALSE
)
plot(power, type = 6)
```

Suppose that, using an adaptive design, the sample size from the above example can be increased *in the last interim* up to a 4-fold of the originally planned sample size for the last stage. Conditional power 90% *based on the observed effect sizes (failure rates)* should be used to increase the sample size. We want to use the inverse normal method to allow for the sample size increase and compare the test characteristics with the group sequential design from the above example.

To assess the test characteristics of this adaptive design we first define the inverse normal design and then perform two simulations, one without and one with SSR:

```
dIN <- getDesignInverseNormal(
informationRates = c(0.5, 0.75, 1),
alpha = 0.025, beta = 0.1, futilityBounds = c(0, 0.5)
)
sim1 <- getSimulationRates(dIN,
plannedSubjects = c(140, 210, 280),
pi1 = seq(0.4, 0.5, 0.01), pi2 = 0.6, directionUpper = FALSE,
maxNumberOfIterations = 1000, conditionalPower = 0.9,
minNumberOfSubjectsPerStage = c(140, 70, 70),
maxNumberOfSubjectsPerStage = c(140, 70, 70), seed = 1234
)
sim2 <- getSimulationRates(dIN,
plannedSubjects = c(140, 210, 280),
pi1 = seq(0.4, 0.5, 0.01), pi2 = 0.6, directionUpper = FALSE,
maxNumberOfIterations = 1000, conditionalPower = 0.9,
minNumberOfSubjectsPerStage = c(NA, 70, 70),
maxNumberOfSubjectsPerStage = c(NA, 70, 4 * 70), seed = 1234
)
```

Note that the sample sizes will be calculated under the assumption that the *conditional power for the subsequent stage* is 90%. If the resulting sample size is larger, the upper bound (4*70 = 280) is used.

Note also that `sim1`

can also be *calculated* using `getPowerRates()`

or can also *easier be simulated* without specifying `conditionalPower`

, `minNumberOfSubjectsPerStage`

, and `maxNumberOfSubjectsPerStage`

(which obviously is redundant for `sim1`

) but this way ensures that the calculated objects `sim1`

and `sim2`

*contain exactly the same parameters* and therefore can easier be combined (see below).

We can look at the power and the expected sample size of the two procedures and assess the power gain of using the adaptive design which comes along with an increased expected sample size:

`sim1$pi1`

` [1] 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50`

`round(sim1$overallReject, 3)`

` [1] 0.921 0.890 0.853 0.810 0.752 0.721 0.675 0.582 0.526 0.469 0.405`

`round(sim2$overallReject, 3)`

` [1] 0.976 0.971 0.940 0.912 0.882 0.869 0.800 0.718 0.695 0.601 0.475`

`round(sim1$expectedNumberOfSubjects, 1)`

` [1] 202.1 209.3 216.4 219.5 222.0 230.9 231.1 238.3 234.2 237.9 236.5`

`round(sim2$expectedNumberOfSubjects, 1)`

` [1] 240.7 251.6 270.6 278.8 286.6 305.8 323.8 330.9 336.4 336.8 349.3`

We now want to graphically illustrate the gain in power when using the adaptive sample size recalculation. We use ggplot2 (see ggplot2.tidyverse.org) for doing this. First, a dataset `df`

combining `sim1`

and `sim2`

is defined with the additional variable SSR. Defining `mytheme`

and using the following ggplot2 commands, the difference in power and ASN of the two strategies is illustrated. It shows that at least for (absolute) effect difference > 0.15 an overall power of more than around 85% can be achieved with the proposed sample size recalculation strategy.

```
library(ggplot2)
dataSim1 <- as.data.frame(sim1, niceColumnNamesEnabled = FALSE)
dataSim2 <- as.data.frame(sim2, niceColumnNamesEnabled = FALSE)
dataSim1$SSR <- rep("no SSR", nrow(dataSim1))
dataSim2$SSR <- rep("SSR", nrow(dataSim2))
df <- rbind(dataSim1, dataSim2)
myTheme <- theme(
axis.title.x = element_text(size = 12), axis.text.x = element_text(size = 12),
axis.title.y = element_text(size = 12), axis.text.y = element_text(size = 12),
plot.title = element_text(size = 14, hjust = 0.5),
plot.subtitle = element_text(size = 12, hjust = 0.5)
)
p <- ggplot(
data = df,
aes(x = effect, y = overallReject, group = SSR, color = SSR)
) +
geom_line(size = 1.1) +
geom_line(aes(
x = effect, y = expectedNumberOfSubjects / 400,
group = SSR, color = SSR
), size = 1.1, linetype = "dashed") +
scale_y_continuous("Power",
sec.axis = sec_axis(~ . * 400, name = "ASN"),
limits = c(0.2, 1)
) +
xlab("effect") +
ggtitle("Power and ASN", "Power solid, ASN dashed") +
geom_hline(size = 0.5, yintercept = 0.8, linetype = "dotted") +
geom_hline(size = 0.5, yintercept = 0.9, linetype = "dotted") +
geom_vline(size = 0.5, xintercept = c(-0.2, -0.15), linetype = "dashed") +
theme_classic() +
myTheme
plot(p)
```

For saving the graph, use

`ggplot2::ggsave(filename = "c:/yourdirectory/comparison.png",`

`plot = ggplot2::last_plot(), device = NULL, path = NULL,`

`scale = 1.2, width = 20, height = 12, units = "cm", dpi = 600,`

`limitsize = TRUE)`

For another example of using ggplot2 in rpact see also the vignette Supplementing and enhancing rpact’s graphical capabilities with ggplot2.

Finally, we create a histogram for the attained sample size of the study *when using the adaptive sample size recalculation*.

With the `getData()`

command the simulation results are obtained and `str(simdata)`

provides information of the struture of this data:

```
simData <- getData(sim2)
str(simData)
```

```
'data.frame': 24579 obs. of 19 variables:
$ iterationNumber : num 1 2 2 2 3 3 4 4 4 5 ...
$ stageNumber : num 1 1 2 3 1 2 1 2 3 1 ...
$ pi1 : num 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 ...
$ pi2 : num 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 ...
$ numberOfSubjects : num 140 140 70 147 140 70 140 70 91 140 ...
$ numberOfCumulatedSubjects: num 140 140 210 357 140 210 140 210 301 140 ...
$ rejectPerStage : num 1 0 0 1 0 1 0 0 1 1 ...
$ futilityPerStage : num 0 0 0 0 0 0 0 0 0 0 ...
$ testStatistic : num 3.05 2.03 2.07 4.07 2.03 ...
$ testStatisticsPerStage : num 3.054 2.028 0.718 4.547 2.029 ...
$ overallRate1 : num 0.329 0.414 0.438 0.369 0.429 ...
$ overallRate2 : num 0.586 0.586 0.581 0.607 0.6 ...
$ stagewiseRates1 : num 0.329 0.414 0.486 0.27 0.429 ...
$ stagewiseRates2 : num 0.586 0.586 0.571 0.644 0.6 ...
$ sampleSizesPerStage1 : num 70 70 35 74 70 35 70 35 46 70 ...
$ sampleSizesPerStage2 : num 70 70 35 73 70 35 70 35 45 70 ...
$ trialStop : logi TRUE FALSE FALSE TRUE FALSE TRUE ...
$ conditionalPowerAchieved : num NA NA 0.602 0.9 NA ...
$ pValue : num 0.00112984 0.02126124 0.23628281 0.00000272 0.02121903 ...
```

Depending on (in this example, for ), you can create the histogram of the simulated total sample size as follows:

```
simDataPart <- simData[simData$pi1 == 0.5, ]
overallSampleSizes <-
sapply(1:1000, function(i) {
sum(simDataPart[simDataPart$iterationNumber == i, ]$numberOfSubjects)
})
hist(overallSampleSizes, main = "Histogram", xlab = "Achieved sample size")
```

How often the maximum and other sample sizes are reached over the stages can be obtained as follows:

```
subjectsRange <- cut(simDataPart$numberOfSubjects, c(69, 70, 139, 140, 210, 279, 280),
labels = c(
"(69,70]", "(70,139]", "(139,140]",
"(140,210]", "(210,279]", "(279,280]"
)
)
kable(round(prop.table(table(simDataPart$stageNumber, subjectsRange), margin = 1) * 100, 1))
```

(69,70] | (70,139] | (139,140] | (140,210] | (210,279] | (279,280] |
---|---|---|---|---|---|

0 | 0.0 | 100.0 | 0.0 | 0.0 | 0.0 |

100 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

0 | 9.1 | 0.3 | 7.9 | 7.1 | 75.5 |

For this simulation, the originally planned sample size (70) was never selected for the third stage and in most of the cases the maximum of sample size (280) was used.

Gernot Wassmer and Werner Brannath,

*Group Sequential and Confirmatory Adaptive Designs in Clinical Trials*, Springer 2016, ISBN 978-3319325606R-Studio,

*Data Visualization with ggplot2 - Cheat Sheet*, version 2.1, 2016, https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf

System: rpact 4.0.0, R version 4.3.3 (2024-02-29 ucrt), platform: x86_64-w64-mingw32

To cite R in publications use:

R Core Team (2024). *R: A Language and Environment for Statistical Computing*. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/. To cite package ‘rpact’ in publications use:

Wassmer G, Pahlke F (2024). *rpact: Confirmatory Adaptive Clinical Trial Design and Analysis*. R package version 4.0.0, https://www.rpact.com, https://github.com/rpact-com/rpact, https://rpact-com.github.io/rpact/, https://www.rpact.org.

In rpact, **sample size calculation for a group sequential trial proceeds by following the same two steps regardless of whether the endpoint is a continuous, binary, or a time-to-event endpoint**:

**Define the (abstract) group sequential boundaries**of the design using the function`getDesignGroupSequential()`

.**Calculate sample size for the endpoint of interest**by feeding the abstract boundaries from step 1. into specific functions for the endpoint of interest. This step uses functions such as`getSampleSizeMeans()`

(for continuous endpoints),`getSampleSizeRates()`

(for binary endpoints), and`getSampleSizeSurvival()`

(for survival endpoints).

The mathematical rationale for this two-step approach is that all group sequential trials, regardless of the chosen endpoint type, rely on the fact that the -scores at different interim stages follow the same “canonical joint multivariate distribution” (at least asymptotically).

This document covers the more abstract first step, **Step 2 is not covered in this document but it is covered in the separate endpoint-specific R Markdown files for continuous, binary, and time to event endpoints.** Of note, step 1 can be omitted for trials without interim analyses.

These examples are not intended to replace the official rpact documentation and help pages but rather to supplement them.

In general, rpact supports both one-sided and two-sided group sequential designs. If futility boundaries are specified, however, only one-sided tests are permitted. **For simplicity, it is often preferred to use one-sided tests for group sequential designs** (typically, with ).

**First, load the rpact package**

```
library(rpact)
packageVersion("rpact")
```

`[1] '4.0.0'`

**Example:**

- Interim analyses at information fractions 33%, 67%, and 100% (
`informationRates = c(0.33, 0.67, 1)`

). [Note: For equally spaced interim analyses, one can also specify the maximum number of stages (`kMax`

, including the final analysis) instead of the`informationRates`

.] - Lan & DeMets -spending approximation to the O’Brien & Fleming boundaries (
`typeOfDesign = "asOF"`

) - -spending approaches allow for flexible timing of interim analyses and corresponding adjustment of boundaries.

```
design <- getDesignGroupSequential(
sided = 1, alpha = 0.025,
informationRates = c(0.33, 0.67, 1), typeOfDesign = "asOF"
)
```

The originally published O’Brien & Fleming boundaries are obtained via `typeOfDesign = "OF"`

which is also the default (therefore, if you do not specify `typeOfDesign`

, this type is selected). Note that strict Type I error control is only guaranteed for standard boundaries without -spending if the pre-defined interim schedule (i.e., the information fractions at which interim analyses are conducted) is exactly adhered to.

```
design <- getDesignGroupSequential(
sided = 1, alpha = 0.025,
informationRates = c(0.33, 0.67, 1), typeOfDesign = "OF"
)
```

Pocock (`typeOfDesign = "P"`

for constant boundaries over the stages, `typeOfDesign = "asP"`

for corresponding -spending version) or Haybittle & Peto (`typeOfDesign = "HP"`

) boundaries (reject at interim if -value exceeds 3) is obtained with, for example,

```
design <- getDesignGroupSequential(
sided = 1, alpha = 0.025,
informationRates = c(0.33, 0.67, 1), typeOfDesign = "P"
)
```

- Kim & DeMets -spending (
`typeOfDesign = "asKD`

) with parameter`gammaA`

(power function:`gammaA = 1`

is linear spending,`gammaA = 2`

quadratic) - Hwang, Shi & DeCani -spending (
`typeOfDesign = "asHSD"`

) with parameter`gammaA`

(for details, see Wassmer & Brannath 2016, p. 76) - Standard Wang & Tsiatis Delta classes (
`typeOfDesign = "WT"`

) and (`typeOfDesign = "WToptimum"`

)

```
# Quadratic Kim & DeMets alpha-spending
design <- getDesignGroupSequential(
sided = 1, alpha = 0.025,
informationRates = c(0.33, 0.67, 1), typeOfDesign = "asKD", gammaA = 2
)
```

User-defined -spending functions (`typeOfDesign = "asUser"`

) can be obtained via the argument `userAlphaSpending`

which must contain a numeric vector with elements that define the values of the cumulative alpha-spending function at each interim analysis.

```
# Example: User-defined alpha-spending function which is very conservative at
# first interim (spend alpha = 0.001), conservative at second (spend an additional
# alpha = 0.01, i.e., total cumulative alpha spent is 0.011 up to second interim),
# and spends the remaining alpha at the final analysis (i.e., cumulative
# alpha = 0.025)
design <- getDesignGroupSequential(
sided = 1, alpha = 0.025,
informationRates = c(0.33, 0.67, 1),
typeOfDesign = "asUser",
userAlphaSpending = c(0.001, 0.01 + 0.001, 0.025)
)
# $stageLevels below extract local significance levels across interim analyses.
# Note that the local significance level is exactly 0.001 at the first
# interim, but slightly >0.01 at the second interim because the design
# exploits correlations between interim analyses.
design$stageLevels
```

`[1] 0.00100000 0.01052883 0.02004781`

- The argument
`futilityBounds`

contains a vector of futility bounds (on the -value scale) for each interim (but not the final analysis). - A futility bound of corresponds to an estimated treatment effect of zero or “null”, i.e., in this case futility stopping is recommended if the treatment effect estimate at the interim analysis is zero or “goes in the wrong direction”. Futility bounds of (which are numerically equivalent to ) correspond to no futility stopping at an interim.
- Due to the design of rpact, it is not possible to directly define futility boundaries on the treatment effect scale. If this is desired, one would need to manually convert the treatment effect scale to the -scale or, alternatively, experiment by varying the boundaries on the -scale until this implies the targeted critical values on the treatment effect scale. (Critical values on treatment effect scales are routinely provided by sample size functions for different endpoint types such as
`getSampleSizeMeans()`

(for continuous endpoints),`getSampleSizeRates()`

(for binary endpoints), and`getSampleSizeSurvival()`

(for survival endpoints). Please see the R Markdown files for these endpoint types for further details.) - By default, all futility boundaries are non-binding (
`bindingFutility = FALSE`

). Binding futility boundaries (`bindingFutility = TRUE`

) are not recommended although they are provided for the sake of completeness.

```
# Example: non-binding futility boundary at each interim in case
# estimated treatment effect is null or goes in "the wrong direction"
design <- getDesignGroupSequential(
sided = 1, alpha = 0.025,
informationRates = c(0.33, 0.67, 1), typeOfDesign = "asOF",
futilityBounds = c(0, 0), bindingFutility = FALSE
)
```

Formal -spending functions are defined in the same way as for -spending functions, e.g., a Pocock type -spending can be specified as `typeBetaSpending = "bsP"`

and `beta`

needs to be specified, the default is `beta = 0.20`

.

```
# Example: beta-spending function approach with O'Brien & Fleming alpha-spending
# function and Pocock beta-spending function
design <- getDesignGroupSequential(
sided = 1, alpha = 0.025, beta = 0.1,
typeOfDesign = "asOF",
typeBetaSpending = "bsP"
)
```

Another way to formally derive futility bounds is through the Pampallona and Tsiatis approach. This is through defining `typeBetaSpending = "PT"`

, and the specification of two parameters, `deltaPT1`

(shape of decision regions for rejecting the null) and `deltaPT0`

(shape of shifted decision regions for rejecting the alternative), for example

```
# Example: beta-spending function approach with O'Brien & Fleming boundaries for
# rejecting the null and Pocock boundaries for rejecting H1
design <- getDesignGroupSequential(
sided = 1, alpha = 0.025, beta = 0.1,
typeOfDesign = "PT",
deltaPT1 = 0, deltaPT0 = 0.5
)
```

Note that both the -spending as well as the Pampallona & Tsiatis approach can be selected to be one-sided or two-sided, the bounds for rejecting the alternative to be binding (`bindingFutility = TRUE`

) or non-binding (`bindingFutility = FALSE`

).

Such designs can be implemented by using a user-defined -spending function which spends all of the Type I error at the final analysis. Note that such designs do not allow stopping for efficacy regardless how persuasive the effect is.

```
# Example: non-binding futility boundary using an O'Brien & Fleming type
# beta spending function. No early stopping for efficacy (i.e., all alpha
# is spent at the final analysis).
design <- getDesignGroupSequential(
sided = 1, alpha = 0.025, beta = 0.2,
informationRates = c(0.33, 0.67, 1), typeOfDesign = "asUser",
userAlphaSpending = c(0, 0, 0.025), typeBetaSpending = "bsOF",
bindingFutility = FALSE
)
```

`Changed type of design to 'noEarlyEfficacy'`

As indicated, you can specifiy `typeOfDesign = "noEarlyEfficacy"`

which is a shortcut for `typeOfDesign = "asUser"`

and `userAlphaSpending = c(0, 0, 0.025)`

.

We use the design with an O’Brien & Fleming -spending function and prespecified futility bounds:

```
design <- getDesignGroupSequential(
sided = 1, alpha = 0.025, beta = 0.2,
informationRates = c(0.33, 0.67, 1), typeOfDesign = "asOF",
futilityBounds = c(0, 0), bindingFutility = FALSE
)
```

`design`

object`kable(design)`

typeOfDesign | kMax | stages | informationRates | alpha | beta | twoSidedPower | futilityBounds | bindingFutility | sided | tolerance | alphaSpent | typeBetaSpending | criticalValues | stageLevels |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

asOF | 3 | 1 | 0.33 | 0.025 | 0.2 | FALSE | 0 | FALSE | 1 | 0 | 0.0000955 | none | 3.730665 | 0.0000955 |

asOF | 3 | 2 | 0.67 | 0.025 | 0.2 | FALSE | 0 | FALSE | 1 | 0 | 0.0061756 | none | 2.503871 | 0.0061421 |

asOF | 3 | 3 | 1.00 | 0.025 | 0.2 | FALSE | NA | FALSE | 1 | 0 | 0.0250000 | none | 1.993710 | 0.0230919 |

The key information is contained in the object including **critical values on the -scale** (“Critical values” in rpact output, `design$criticalValues`

) and **one-sided local significance levels** (“Stage levels” in rpact output,`design$stageLevels`

). Note that the local significance levels are always given as one-sided levels in rpact even if a two-sided design is specified.

`names(design)`

provides names of all objects included in the `design`

object and `as.data.frame(design)`

collects all design information into one data frame. `summary(design)`

gives a slightly more detailed output. For more details about applying R generics to rpact objects, please refer to the separte R Markdown file How to use R generics with rpact.

`names(design)`

```
[1] "kMax" "alpha" "stages"
[4] "informationRates" "userAlphaSpending" "criticalValues"
[7] "stageLevels" "alphaSpent" "bindingFutility"
[10] "tolerance" "typeOfDesign" "beta"
[13] "deltaWT" "deltaPT1" "deltaPT0"
[16] "futilityBounds" "gammaA" "gammaB"
[19] "optimizationCriterion" "sided" "betaSpent"
[22] "typeBetaSpending" "userBetaSpending" "power"
[25] "twoSidedPower" "constantBoundsHP" "betaAdjustment"
[28] "delayedInformation" "decisionCriticalValues" "reversalProbabilities"
```

`summary()`

creates a nice presentation of the design that also contains information about the sample size of the design (see below):

`kable(summary(design))`

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

asOF | 3 | 1 | 0.33 | 0.025 | 0.2 | FALSE | 0 | FALSE | 1 | 0 | 0.0000955 | none | 3.730665 | 0.0000955 |

asOF | 3 | 2 | 0.67 | 0.025 | 0.2 | FALSE | 0 | FALSE | 1 | 0 | 0.0061756 | none | 2.503871 | 0.0061421 |

asOF | 3 | 3 | 1.00 | 0.025 | 0.2 | FALSE | NA | FALSE | 1 | 0 | 0.0250000 | none | 1.993710 | 0.0230919 |

`getDesignCharacteristics(design)`

provides more detailed information about the design:

```
designChar <- getDesignCharacteristics(design)
kable(designChar)
```

inflationFactor | stages | information | power | rejectionProbabilities | futilityProbabilities | averageSampleNumber1 | averageSampleNumber01 | averageSampleNumber0 |
---|---|---|---|---|---|---|---|---|

1.060542 | 1 | 2.746941 | 0.0190733 | 0.0190733 | 0.0487203 | 0.8628148 | 0.8689327 | 0.6589089 |

1.060542 | 2 | 5.577123 | 0.4429632 | 0.4238898 | 0.0034369 | 0.8628148 | 0.8689327 | 0.6589089 |

1.060542 | 3 | 8.324065 | 0.8000000 | 0.3570368 | NA | 0.8628148 | 0.8689327 | 0.6589089 |

`names(designChar)`

```
[1] "nFixed" "shift" "inflationFactor"
[4] "stages" "information" "power"
[7] "rejectionProbabilities" "futilityProbabilities" "averageSampleNumber1"
[10] "averageSampleNumber01" "averageSampleNumber0"
```

**Note that the design characteristics depend on beta that needs to be specified in getDesignGroupSequential(). By default, beta = 0.20.**

Explanations regarding the output:

**Maximum sample size inflation factor**(`$inflationFactor`

): This is the maximal sample size a group sequential trial requires relative to the sample size of a fixed design without interim analyses.- Probabilities of stopping due to a significant result at each interim or the final analysis (
`$rejectionProbabilities`

), cumulative power (`$power`

), and probability of stopping for futility at each interim (`$futilityProbabilities`

). All of these are calculated under the alternative H1. **Expected sample size**of group sequential design (relative to fixed design) under the alternative hypothesis H1 (`$averageSampleNumber1`

), under the null hypothesis H0 (`$averageSampleNumber0`

), and under the parameter in the middle between H0 and H1.- In addition,
`getDesignCharacteristics(design)`

provides the required sample size for an abstract group sequential single arm trial with a normal outcome, effect size 1, and standard deviation 1 (i.e., the simplest group sequential setting from a mathematical point of view). The sample size for such a trial without interim analyses is given as`$nFixed`

and the maximum sample size of the corresponding group sequential design as`$shift`

.

The practical relevance of this abstract design is that the **properties of the design** (critical values, sample size inflation factor, rejection probabilies, etc) **carry over to group sequential designs regardless of the endpoint (e.g. continuous, binary, or survial)** as they all share the same underlying canonical multivariate normal distribution of the -scores.

**Overall stopping probabilities, rejection probabilities, and futility probabilities under the null (H0) and the alternative (H1)** (overall and at each stage) can be calculated using the function `getPowerAndAverageSampleNumber()`

. To get these numbers, one needs to provide the maximum sample size and the effect size (0 under H0, 1 under H1) of the corresponding type of design.

```
# theta = 0 for calculations under H0
kable(getPowerAndAverageSampleNumber(design,
theta = c(0), nMax = designChar$shift
))
```

stages | theta | averageSampleNumber | calculatedPower | overallEarlyStop | earlyStop | overallReject | rejectPerStage | overallFutility | futilityPerStage |
---|---|---|---|---|---|---|---|---|---|

1 | 0 | 5.171697 | 0.0237738 | 0.6323421 | 0.5000955 | 0.0237738 | 0.0000955 | 0.6261878 | 0.5000000 |

2 | 0 | 5.171697 | 0.0237738 | 0.6323421 | 0.1322467 | 0.0237738 | 0.0060589 | 0.6261878 | 0.1261878 |

3 | 0 | 5.171697 | 0.0237738 | 0.6323421 | NA | 0.0237738 | 0.0176194 | 0.6261878 | NA |

```
# theta = 1 for calculations under alternative H1
kable(getPowerAndAverageSampleNumber(design,
theta = 1, nMax = designChar$shift
))
```

stages | theta | averageSampleNumber | calculatedPower | overallEarlyStop | earlyStop | overallReject | rejectPerStage | overallFutility | futilityPerStage |
---|---|---|---|---|---|---|---|---|---|

1 | 1 | 6.77213 | 0.8 | 0.4951204 | 0.0677937 | 0.8 | 0.0190733 | 0.0521573 | 0.0487203 |

2 | 1 | 6.77213 | 0.8 | 0.4951204 | 0.4273268 | 0.8 | 0.4238898 | 0.0521573 | 0.0034369 |

3 | 1 | 6.77213 | 0.8 | 0.4951204 | NA | 0.8 | 0.3570368 | 0.0521573 | NA |

Note that the power under H0, i.e., the significance level, is slightly below 0.025 in this example as it is calculated under the assumption that the non-binding futility boundaries are adhered to.

Both (and even more) values can be obtained with one command `getPowerAndAverageSampleNumber(design, theta = c(0, 1), nMax = designChar$shift)`

We use again the design with an O’Brien & Fleming -spending function and prespecified futility bounds:

```
design <- getDesignGroupSequential(
sided = 1, alpha = 0.025, beta = 0.2,
informationRates = c(0.33, 0.67, 1), typeOfDesign = "asOF",
futilityBounds = c(0, 0), bindingFutility = FALSE
)
```

Boundaries can be plotted using the `plot`

(or `plot.TrialDesign`

) function which produces a ggplot2 object.

The most relevant plots for (abstract) boundaries without easily interpretable treatment effect are boundary plots on -scale (`type = 1`

) or -value-scale (`type = 3`

) as well as plots of the -spending function (`type = 4`

). Conveniently, argument `showSource = TRUE`

also provides the source data for the plot. For examples of all available plots, see the R Markdown document How to create admirable plots with rpact.

`plot(design, type = 1, showSource = TRUE)`

```
Source data of the plot (type 1):
x-axis: design$informationRates
y-axes:
y1: c(design$futilityBounds, design$criticalValues[length(design$criticalValues)])
y2: design$criticalValues
Simple plot command examples:
plot(design$informationRates, c(design$futilityBounds, design$criticalValues[length(design$criticalValues)]), type = "l")
plot(design$informationRates, design$criticalValues, type = "l")
```

`plot(design, type = 3, showSource = TRUE)`

```
Source data of the plot (type 3):
x-axis: design$informationRates
y-axis: design$stageLevels
Simple plot command example:
plot(design$informationRates, design$stageLevels, type = "l")
```

`plot(design, type = 4, showSource = TRUE)`

```
Source data of the plot (type 4):
x-axis: design$informationRates
y-axis: design$alphaSpent
Simple plot command example:
plot(design$informationRates, design$alphaSpent, type = "l")
```

Decision regions for two-sided tests with futility bounds are displayed accordingly:

```
design <- getDesignGroupSequential(
sided = 2, alpha = 0.05, beta = 0.2,
informationRates = c(0.33, 0.67, 1),
typeOfDesign = "asOF",
typeBetaSpending = "bsP",
bindingFutility = FALSE
)
plot(design, type = 1)
```

Multiple designs can be combined into a design set (`getDesignSet()`

) and their properties plotted jointly:

```
# O'Brien & Fleming, 3 equally spaced stages
d1 <- getDesignGroupSequential(typeOfDesign = "OF", kMax = 3)
# Pocock
d2 <- getDesignGroupSequential(typeOfDesign = "P", kMax = 3)
designSet <- getDesignSet(designs = c(d1, d2), variedParameters = "typeOfDesign")
plot(designSet, type = 1)
```

Even simpler, in rpact 3.0, you can also use `plot(d1, d2)`

.

System: rpact 4.0.0, R version 4.3.3 (2024-02-29 ucrt), platform: x86_64-w64-mingw32

To cite R in publications use:

*R: A Language and Environment for Statistical Computing*. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/. To cite package ‘rpact’ in publications use:

*rpact: Confirmatory Adaptive Clinical Trial Design and Analysis*. R package version 4.0.0, https://www.rpact.com, https://github.com/rpact-com/rpact, https://rpact-com.github.io/rpact/, https://www.rpact.org.

Group-sequential designs based on -spending functions protect the Type I error exactly even if the pre-planned interim schedule is not exactly adhered to. However, this requires re-calculation of the group sequential boundaries at each interim analysis based on actually observed information fractions. Unless deviations from the planned information fractions are substantial, the re-calculated boundaries are quite similar to the pre-planned boundaries and the re-calculation will affect the actual test decision only on rare occasions.

Importantly, it is not allowed that the timing of future interim analyses is “motivated” by results from earlier interim analyses as this could inflate the Type I error rate. Deviations from the planned information fractions should thus only occur due to operational reasons (as it is difficult to hit the exact number of events exactly in a real trial) or due to external evidence.

The general principles for these boundary re-calculation are as follows (see also, Wassmer & Brannath, 2016, p78f):

- Updates at interim analyses prior to the final analysis:
- Information fractions are updated according to the actually observed information fraction at the interim analysis relative to the
**planned**maximum information. - The planned -spending function is then applied to these updated information fractions.

- Information fractions are updated according to the actually observed information fraction at the interim analysis relative to the
- Updates at the final analysis in case the observed information at the final analysis is larger (“over-running”) or smaller (“under-running”) than the planned maximum information:
- Information fractions are updated according to the actually observed information fraction at all interim analyses relative to the
**observed**maximum information. Information fraction at final analysis is re-set to 1 but information fractions for earlier interim analyses are also changed. - The originally planned -spending function cannot be applied to these updated information fractions because this would modify the critical boundaries of earlier interim analyses which is clearly not allowed. Instead, one uses the that has actually been spent at earlier interim analyses and spends all remaining at the final analysis.

- Information fractions are updated according to the actually observed information fraction at all interim analyses relative to the

This general principle be implemented via a user-defined -spending function and is illustrated for an example trial with a survival endpoint below. We provide two solutions to the problem: the first is a way how existing tools in rpact can directly be used to solve the problem, the second is an automatic recalculation of the boundaries using a new parameter set (`maxInformation`

and `informationEpsilon`

) which is available in the `getAnalysisResults()`

function since rpact version 3.1. This solution is described in Section @ref(sec:automatic) at the end of this document.

**First, load the rpact package **

```
library(rpact)
packageVersion("rpact") # version should be version 3.1 or later
```

`[1] '4.0.0'`

The original trial design for this example is based on a standard O’Brien & Fleming type -spending function with planned efficacy interim analyses after 50% and 75% of information as specified below.

```
# Initial design
design <- getDesignGroupSequential(
sided = 1, alpha = 0.025, beta = 0.2,
informationRates = c(0.5, 0.75, 1), typeOfDesign = "asOF"
)
# Initial sample size calculation
sampleSizeResult <- getSampleSizeSurvival(
design = design,
lambda2 = log(2) / 60, hazardRatio = 0.75,
dropoutRate1 = 0.025, dropoutRate2 = 0.025, dropoutTime = 12,
accrualTime = 0, accrualIntensity = 30,
maxNumberOfSubjects = 1000
)
# Summarize design
kable(summary(sampleSizeResult))
```

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.75 | 1 | Schoenfeld | FALSE | 80 | 60 | 0.0086643 | 0.0115525 | 1000 | 500 | 500 | 1000 | 386.7994 | 1 | 33.33333 | 30 | 1 | 0 | 35.77325 | 0.025 | 0.025 | 12 | 0.1679704 | 0.5399906 | 39.08167 | 57.9635 | 69.10659 | 193.3997 | 385.7188 | 371.7163 | 318.3396 | 1000 | 0.6530755 |

2 | 0.75 | 1 | Schoenfeld | FALSE | 80 | 60 | 0.0086643 | 0.0115525 | 1000 | 500 | 500 | 1000 | 386.7994 | 1 | 33.33333 | 30 | 1 | 0 | 35.77325 | 0.025 | 0.025 | 12 | 0.3720202 | 0.5399906 | 52.71020 | 57.9635 | 69.10659 | 290.0995 | 385.7188 | 371.7163 | 318.3396 | 1000 | 0.7580507 |

3 | 0.75 | 1 | Schoenfeld | FALSE | 80 | 60 | 0.0086643 | 0.0115525 | 1000 | 500 | 500 | 1000 | 386.7994 | 1 | 33.33333 | 30 | 1 | 0 | 35.77325 | 0.025 | 0.025 | 12 | 0.2600094 | 0.5399906 | 69.10659 | 57.9635 | 69.10659 | 386.7994 | 385.7188 | 371.7163 | 318.3396 | 1000 | 0.8147969 |

Assume that the first interim was conducted after 205 rather than the planned 194 events.

The updated design is calculated as per the code below. Note that for the calculation of boundary values on the treatment effect scale, we use the function `getPowerSurvival()`

with the updated design rather than the function `getSampleSizeSurvival()`

as we are only updating the boundary, not the sample size or the maximum number of events.

```
# Update design using observed information fraction at first interim.
# Information fraction of later interim analyses is not changed.
designUpdate1 <- getDesignGroupSequential(
sided = 1, alpha = 0.025, beta = 0.2,
informationRates = c(205 / 387, 0.75, 1), typeOfDesign = "asOF"
)
# Recalculate the power to get boundary values on the effect scale
# (Use original maxNumberOfEvents and sample size)
powerUpdate1 <- getPowerSurvival(
design = designUpdate1,
lambda2 = log(2) / 60, hazardRatio = 0.75,
dropoutRate1 = 0.025, dropoutRate2 = 0.025, dropoutTime = 12,
accrualTime = 0, accrualIntensity = 30,
maxNumberOfSubjects = 1000, maxNumberOfEvents = 387, directionUpper = FALSE
)
```

The updated information rates and corresponding boundaries as per the output above are summarized as follows:

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.75 | 1 | Schoenfeld | FALSE | 80 | 60 | 0.0086643 | 0.0115525 | 1000 | 1000 | 387 | 1 | 33.33333 | 30 | 1 | 0 | 35.81097 | 0.025 | 0.025 | 12 | 316.9625 | 0.8000659 | 0.2097158 | 0.5391135 | 40.60037 | 57.75244 | 69.1443 | 205.00 | 1000 | 0.6700080 |

2 | 0.75 | 1 | Schoenfeld | FALSE | 80 | 60 | 0.0086643 | 0.0115525 | 1000 | 1000 | 387 | 1 | 33.33333 | 30 | 1 | 0 | 35.81097 | 0.025 | 0.025 | 12 | 316.9625 | 0.8000659 | 0.3293977 | 0.5391135 | 52.73331 | 57.75244 | 69.1443 | 290.25 | 1000 | 0.7575116 |

3 | 0.75 | 1 | Schoenfeld | FALSE | 80 | 60 | 0.0086643 | 0.0115525 | 1000 | 1000 | 387 | 1 | 33.33333 | 30 | 1 | 0 | 35.81097 | 0.025 | 0.025 | 12 | 316.9625 | 0.8000659 | 0.2609524 | 0.5391135 | 69.14430 | 57.75244 | 69.1443 | 387.00 | 1000 | 0.8147891 |

Assume that the efficacy boundary was not crossed at the first interim analysis and the trial continued to the second interim analysis which was conducted after 285 rather than the planned 291 events. The updated design is calculated in the same way as for the first interim analysis as per the code below. The idea is to use the cumulative spent from the first stage and an updated cumulative that is spent for the second stage. For the second stage, this can be obtained with the original O’Brien & Fleming -spending function:

```
# Update design using observed information fraction at first and second interim.
designUpdate2 <- getDesignGroupSequential(
sided = 1, alpha = 0.025, beta = 0.2,
informationRates = c(205 / 387, 285 / 387, 1), typeOfDesign = "asOF"
)
# Recalculate power to get boundary values on effect scale
# (Use original maxNumberOfEvents and sample size)
powerUpdate2 <- getPowerSurvival(
design = designUpdate2,
lambda2 = log(2) / 60, hazardRatio = 0.75,
dropoutRate1 = 0.025, dropoutRate2 = 0.025, dropoutTime = 12,
accrualTime = 0, accrualIntensity = 30,
maxNumberOfSubjects = 1000, maxNumberOfEvents = 387, directionUpper = FALSE
)
kable(summary(powerUpdate2))
```

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.75 | 1 | Schoenfeld | FALSE | 80 | 60 | 0.0086643 | 0.0115525 | 1000 | 1000 | 387 | 1 | 33.33333 | 30 | 1 | 0 | 35.81097 | 0.025 | 0.025 | 12 | 317.2007 | 0.8004461 | 0.2097158 | 0.519824 | 40.60037 | 57.82025 | 69.1443 | 205 | 1000 | 0.6700080 |

2 | 0.75 | 1 | Schoenfeld | FALSE | 80 | 60 | 0.0086643 | 0.0115525 | 1000 | 1000 | 387 | 1 | 33.33333 | 30 | 1 | 0 | 35.81097 | 0.025 | 0.025 | 12 | 317.2007 | 0.8004461 | 0.3101082 | 0.519824 | 51.93116 | 57.82025 | 69.1443 | 285 | 1000 | 0.7531456 |

3 | 0.75 | 1 | Schoenfeld | FALSE | 80 | 60 | 0.0086643 | 0.0115525 | 1000 | 1000 | 387 | 1 | 33.33333 | 30 | 1 | 0 | 35.81097 | 0.025 | 0.025 | 12 | 317.2007 | 0.8004461 | 0.2806220 | 0.519824 | 69.14430 | 57.82025 | 69.1443 | 387 | 1000 | 0.8150820 |

Assume that the efficacy boundary was also not crossed at the second interim analysis and the trial continued to the final analysis which was conducted after 393 rather than the planned 387 events. The updated design is calculated as per the code below. The idea here to use the cumulative spent from the first *and* the second stage stage and the final that is spent for the last stage. An updated correlation has to be used and the original O’Brien & Fleming -spending function cannot be used anymore. Instead, the -spending function needs to be user defined as follows:

```
# Update boundary with information fractions as per actually observed event numbers
# !! use user-defined alpha-spending and spend alpha according to actual alpha spent
# according to the second interim analysis
designUpdate3 <- getDesignGroupSequential(
sided = 1, alpha = 0.025, beta = 0.2,
informationRates = c(205, 285, 393) / 393,
typeOfDesign = "asUser",
userAlphaSpending = designUpdate2$alphaSpent
)
# Recalculate power to get boundary values on effect scale
# (Use planned sample size and **observed** maxNumberOfEvents)
powerUpdate3 <- getPowerSurvival(
design = designUpdate3,
lambda2 = log(2) / 60, hazardRatio = 0.75,
dropoutRate1 = 0.025, dropoutRate2 = 0.025, dropoutTime = 12,
accrualTime = 0, accrualIntensity = 30,
maxNumberOfSubjects = 1000, maxNumberOfEvents = 393, directionUpper = FALSE
)
kable(summary(powerUpdate3))
```

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.75 | 1 | Schoenfeld | FALSE | 80 | 60 | 0.0086643 | 0.0115525 | 1000 | 1000 | 393 | 1 | 33.33333 | 30 | 1 | 0 | 36.94689 | 0.025 | 0.025 | 12 | 320.0817 | 0.8059817 | 0.2097158 | 0.519824 | 40.60037 | 58.3657 | 70.28023 | 205 | 1000 | 0.6700080 |

2 | 0.75 | 1 | Schoenfeld | FALSE | 80 | 60 | 0.0086643 | 0.0115525 | 1000 | 1000 | 393 | 1 | 33.33333 | 30 | 1 | 0 | 36.94689 | 0.025 | 0.025 | 12 | 320.0817 | 0.8059817 | 0.3101082 | 0.519824 | 51.93116 | 58.3657 | 70.28023 | 285 | 1000 | 0.7531456 |

3 | 0.75 | 1 | Schoenfeld | FALSE | 80 | 60 | 0.0086643 | 0.0115525 | 1000 | 1000 | 393 | 1 | 33.33333 | 30 | 1 | 0 | 36.94689 | 0.025 | 0.025 | 12 | 320.0817 | 0.8059817 | 0.2861577 | 0.519824 | 70.28023 | 58.3657 | 70.28023 | 393 | 1000 | 0.8161525 |

For easier comparison, all discussed boundary updates and power calculations are summarized more conveniently below. Note that each update only affects boundaries for the current or later analyses, i.e., earlier boundaries are never retrospectively modified.

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.75 | 1 | Schoenfeld | FALSE | 80 | 60 | 0.0086643 | 0.0115525 | 1000 | 500 | 500 | 1000 | 386.7994 | 1 | 33.33333 | 30 | 1 | 0 | 35.77325 | 0.025 | 0.025 | 12 | 0.1679704 | 0.5399906 | 39.08167 | 57.9635 | 69.10659 | 193.3997 | 385.7188 | 371.7163 | 318.3396 | 1000 | 0.6530755 |

2 | 0.75 | 1 | Schoenfeld | FALSE | 80 | 60 | 0.0086643 | 0.0115525 | 1000 | 500 | 500 | 1000 | 386.7994 | 1 | 33.33333 | 30 | 1 | 0 | 35.77325 | 0.025 | 0.025 | 12 | 0.3720202 | 0.5399906 | 52.71020 | 57.9635 | 69.10659 | 290.0995 | 385.7188 | 371.7163 | 318.3396 | 1000 | 0.7580507 |

3 | 0.75 | 1 | Schoenfeld | FALSE | 80 | 60 | 0.0086643 | 0.0115525 | 1000 | 500 | 500 | 1000 | 386.7994 | 1 | 33.33333 | 30 | 1 | 0 | 35.77325 | 0.025 | 0.025 | 12 | 0.2600094 | 0.5399906 | 69.10659 | 57.9635 | 69.10659 | 386.7994 | 385.7188 | 371.7163 | 318.3396 | 1000 | 0.8147969 |

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.75 | 1 | Schoenfeld | FALSE | 80 | 60 | 0.0086643 | 0.0115525 | 1000 | 1000 | 393 | 1 | 33.33333 | 30 | 1 | 0 | 36.94689 | 0.025 | 0.025 | 12 | 320.0817 | 0.8059817 | 0.2097158 | 0.519824 | 40.60037 | 58.3657 | 70.28023 | 205 | 1000 | 0.6700080 |

2 | 0.75 | 1 | Schoenfeld | FALSE | 80 | 60 | 0.0086643 | 0.0115525 | 1000 | 1000 | 393 | 1 | 33.33333 | 30 | 1 | 0 | 36.94689 | 0.025 | 0.025 | 12 | 320.0817 | 0.8059817 | 0.3101082 | 0.519824 | 51.93116 | 58.3657 | 70.28023 | 285 | 1000 | 0.7531456 |

3 | 0.75 | 1 | Schoenfeld | FALSE | 80 | 60 | 0.0086643 | 0.0115525 | 1000 | 1000 | 393 | 1 | 33.33333 | 30 | 1 | 0 | 36.94689 | 0.025 | 0.025 | 12 | 320.0817 | 0.8059817 | 0.2861577 | 0.519824 | 70.28023 | 58.3657 | 70.28023 | 393 | 1000 | 0.8161525 |

We now show how a concrete data analysis with an -spending function design can be performed by specifying the parameter `maxInformation`

in the `getAnalysisResults()`

function. As above, we start with an initial design, which in this situation is arbitrary and can be considered as a dummy design. Note that neither the number of stages nor the information rates need to be fixed.

```
# Dummy design
dummy <- getDesignGroupSequential(sided = 1, alpha = 0.025, typeOfDesign = "asOF")
```

The survival design was planned with a maximum of 387 events, the first interim took place after the observation of 205 events, the second after 285 events. Specifying the parameter `maxInformation`

makes it now extremly easy to perform the analysis for the first and the second stage. Assume that we have observed log-rank statistics 1.87 and 2.19 at the first and the second interim, respectively. This observation together with the event numbers is defined in the `getDataset()`

function through

```
dataSurvival <- getDataset(
cumulativeEvents = c(205, 285),
cumulativeLogRanks = c(1.87, 2.19)
)
```

Note that it is important to define **cumulative**Events and **cumulative**LogRanks because otherwise the stage wise events and logrank statistics should be entered (in the given case, these will be calculated).

We now can enter the planned maximum number of events in the `getAnalysisResults()`

function as follows:

```
testResults <- getAnalysisResults(
design = dummy,
dataInput = dataSurvival,
maxInformation = 387
)
```

This provides the summary:

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|

387 | TRUE | TRUE | continue | 0.1926659 | 0.870008 | 1.938043 | 0.1158636 |

387 | TRUE | TRUE | continue | 0.3986944 | 0.976239 | 1.721069 | 0.0379734 |

387 | TRUE | TRUE | NA | NA | NA | NA | NA |

We see that the boundaries are correctly calculated according to the observed information rates. If there is overrunning, i.e., the final analysis was conducted after 393 rather than the planned 387 events, first define the observed dataset

```
dataSurvival <- getDataset(
cumulativeEvents = c(205, 285, 393),
cumulativeLogRanks = c(1.87, 2.19, 2.33)
)
```

and then use the `getAnalysisResults()`

function as before:

```
testResults <- getAnalysisResults(
design = dummy,
dataInput = dataSurvival,
maxInformation = 387
)
```

The messages describe the way of how the critical value for the last stage using the recalculated information rates (leaving the critical values for the first two stages unchanged) was calculated. This way was described in Section @ref(sec:update). The last warning indicates that for this case, since there is no “natural” family of decision boundaries, repeated p-values for the final stage of the trial are not calculated.

The summary shows that indeed the recalculated boundary for the last stage and the already used boundaries for the first two stages are used for decision making:

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|

387 | TRUE | TRUE | continue | 0.1909837 | 0.870008 | 1.938043 | 0.1158636 | NA | NA | NA | NA |

387 | TRUE | TRUE | continue | 0.3883192 | 0.976239 | 1.721069 | 0.0379734 | NA | NA | NA | NA |

387 | TRUE | TRUE | reject | NA | 1.032426 | 1.549946 | NA | 0.0147568 | 1.023289 | 1.533512 | 1.254979 |

We also can consider the case of underrunning which is the case if, for example, it was decided **before conducting the analysis** that, say, also if up to 3 less events than the considered maximum number will be observed, this should be considered as the final analysis (i.e., the final stage is reached if 384 or more events were observed). Inserting the parameter `informationEpsilon`

in the `getAnalysisResults()`

function can be used for this. There are two ways for defining this parameter. You can do it

- in an absolute sense: the parameter
`informationEpsilon`

specifies the number of events to be allowed to deviate from the maximum number of events. This is achieved by specifying a positive integer number for`informationEpsilon`

. - in a relative sense: if a number x < 1 for
`informationEpsilon`

is specified, the stage is considered as the final stage if x% of`maxInformation`

is observed.

Both ways yield a correct calculation of the critical value to be used for the final stage. Suppose, for example, 385 events were observed and `informationEpsilon`

was set equal to 3. Then, since `387 - 385 < 3`

, this is an underrunning case and the critical value at the final stage is provided in the summary:

```
dataSurvival <- getDataset(
cumulativeEvents = c(205, 285, 385),
cumulativeLogRanks = c(1.87, 2.19, 2.21)
)
testResults <- getAnalysisResults(
design = dummy,
dataInput = dataSurvival,
maxInformation = 387,
informationEpsilon = 3
)
```

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|

387 | 3 | TRUE | TRUE | continue | 0.1932416 | 0.870008 | 1.938043 | 0.1158636 | NA | NA | NA | NA |

387 | 3 | TRUE | TRUE | continue | 0.4023160 | 0.976239 | 1.721069 | 0.0379734 | NA | NA | NA | NA |

387 | 3 | TRUE | TRUE | reject | NA | 1.020563 | 1.537524 | NA | 0.0175296 | 1.015702 | 1.523861 | 1.2456 |

We see that again the recalculated boundary for the last stage and the already used boundaries for the first two stages are used for decision making.

In summary, `maxInformation`

in the `getAnalysisResults()`

function can be used to perform an -spending function approach in practice. Also, if at the analysis stage overrunning or (pre-defined) underrunnig takes place the use of the parameters `maxInformation`

and `informationEpsilon`

in the function provides an easy was to perform a correct analysis with the specified design.

System: rpact 4.0.0, R version 4.3.3 (2024-02-29 ucrt), platform: x86_64-w64-mingw32

To cite R in publications use:

*R: A Language and Environment for Statistical Computing*. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/. To cite package ‘rpact’ in publications use:

*rpact: Confirmatory Adaptive Clinical Trial Design and Analysis*. R package version 4.0.0, https://www.rpact.com, https://github.com/rpact-com/rpact, https://rpact-com.github.io/rpact/, https://www.rpact.org.

These examples are not intended to replace the official rpact documentation and help pages but rather to supplement them. They also only cover a selection of all rpact features.

General convention: In rpact, arguments containing the **index “2”** always refer to the **control group**, **“1”** refer to the **intervention group**, and **treatment effects compare treatment versus control**.

**First, load the rpact package**

```
library(rpact)
packageVersion("rpact") # version should be version 3.0 or later
```

`[1] '4.0.0'`

The **sample size** for a trial with continuous endpoints can be calculated using the function `getSampleSizeMeans()`

. This function is fully documented in the relevant help page (`?getSampleSizeMeans`

). Some examples are provided below.

`getSampleSizeMeans()`

requires that the mean difference between the two arms is larger under the alternative than under the null hypothesis. For superiority trials, this implies that **rpact requires that the targeted mean difference is >0 under the alternative hypothesis**. If this is not the case, the function produces an error message. To circumvent this and power for a negative mean difference, **one can simply switch the two arms** (leading to a positive mean difference) as the situation is perfectly symmetric.

By default, `getSampleSizeMeans()`

tests hypotheses about the mean difference. rpact also supports testing hypotheses about mean ratios if the argument `meanRatio`

is set to `TRUE`

but this will not be discussed further in this document.

By default, rpact uses sample size formulas for the -test, i.e., it assumes that the standard deviation in the two groups is equal but unknown and estimated from the data. If sample size calculations for the -test are desired, one can set the argument `normalApproximation`

to `TRUE`

but this is usually not recommended.

```
# Example of a standard trial:
# - targeted mean difference is 10 (alternative = 10)
# - standard deviation in both arms is assumed to be 24 (stDev = 24)
# - two-sided test (sided = 2), Type I error 0.05 (alpha = 0.05) and power 80%
# - (beta = 0.2)
sampleSizeResult <- getSampleSizeMeans(
alternative = 10, stDev = 24, sided = 2,
alpha = 0.05, beta = 0.2
)
kable(sampleSizeResult)
```

stages | alternative | meanRatio | thetaH0 | normalApproximation | stDev | groups | allocationRatioPlanned | nFixed | nFixed1 | nFixed2 | criticalValuesEffectScaleLower | criticalValuesEffectScaleUpper | criticalValuesPValueScale |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 10 | FALSE | 0 | FALSE | 24 | 2 | 1 | 182.7789 | 91.38944 | 91.38944 | -7.00557 | 7.00557 | 0.05 |

The generic `summary()`

function produces the output

`kable(summary(sampleSizeResult))`

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 10 | FALSE | 0 | FALSE | 24 | 2 | 1 | 182.7789 | 91.38944 | 91.38944 | -7.00557 | 7.00557 | 0.05 |

As per the output above, the required **total sample size** for the trial is 183 and the critical value corresponds to a minimal detectable mean difference of approximately 7.01.

Unequal randomization between the treatment groups can be defind with `allocationRatioPlanned`

, for example,

```
# Extension of standard trial:
# - 2(intervention):1(control) randomization (allocationRatioPlanned = 2)
kable(summary(getSampleSizeMeans(
alternative = 10, stDev = 24,
allocationRatioPlanned = 2, sided = 2, alpha = 0.05, beta = 0.2
)))
```

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 10 | FALSE | 0 | FALSE | 24 | 2 | 2 | 205.3814 | 136.921 | 68.46048 | -7.004498 | 7.004498 | 0.05 |

**Power** for a given sample size can be calculated using the function `getPowerMeans()`

which has the same arguments as `getSampleSizeMeans()`

except that the maximum total sample is given (`maxNumberOfSubjects`

) instead of the Type II error (`beta`

).

```
# Calculate power for the 2:1 rendomized trial with total sample size 206
# (as above) assuming a larger difference of 12
powerResult <- getPowerMeans(
alternative = 12, stDev = 24, sided = 2,
allocationRatioPlanned = 2, maxNumberOfSubjects = 206, alpha = 0.05
)
kable(powerResult)
```

stages | alternative | meanRatio | thetaH0 | normalApproximation | stDev | groups | allocationRatioPlanned | directionUpper | effect | maxNumberOfSubjects | overallReject | nFixed | nFixed1 | nFixed2 | criticalValuesEffectScaleLower | criticalValuesEffectScaleUpper | criticalValuesPValueScale |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 12 | FALSE | 0 | FALSE | 24 | 2 | 2 | NA | 12 | 206 | 0.920291 | 206 | 137.3333 | 68.66667 | -6.993847 | 6.993847 | 0.05 |

The calculated **power** is provided in the output as **“Overall reject”** and is 0.92 for the example `alternative = 12`

.

The `summary()`

function produces

`kable(summary(powerResult))`

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 12 | FALSE | 0 | FALSE | 24 | 2 | 2 | NA | 12 | 206 | 0.920291 | 206 | 137.3333 | 68.66667 | -6.993847 | 6.993847 | 0.05 |

`getPowerMeans()`

(as well as `getSampleSizeMeans()`

) can also be called with a vector argument for the mean difference under the alternative H1 (`alternative`

). This is illustrated below via a plot of power depending on these values. For examples of all available plots, see the R Markdown document How to create admirable plots with rpact.

```
# Example: Calculate power for design with sample size 206 as above
# alternative values ranging from 5 to 15
powerResult <- getPowerMeans(
alternative = 5:15, stDev = 24, sided = 2,
allocationRatioPlanned = 2, maxNumberOfSubjects = 206, alpha = 0.05
)
plot(powerResult, type = 7) # one of several possible plots
```

The sample size calculation proceeds in the same fashion as for superiority trials except that the role of the null and the alternative hypothesis are reversed and the test is always one-sided. In this case, the non-inferiority margin corresponds to the treatment effect under the null hypothesis (`thetaH0`

) which one aims to reject.

```
# Example: Non-inferiority trial with margin delta = 10, standard deviation = 14
# - One-sided alpha = 0.05, 1:1 randomization
# - H0: treatment difference <= -12 (i.e., = -12 for calculations, thetaH0 = -1)
# vs. alternative H1: treatment difference = 0 (alternative = 0)
sampleSizeNoninf <- getSampleSizeMeans(
thetaH0 = -12, alternative = 0,
stDev = 14, alpha = 0.025, beta = 0.2, sided = 1
)
kable(sampleSizeNoninf)
```

stages | alternative | meanRatio | thetaH0 | normalApproximation | stDev | groups | allocationRatioPlanned | nFixed | nFixed1 | nFixed2 | criticalValuesEffectScale |
---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0 | FALSE | -12 | FALSE | 14 | 2 | 1 | 44.73721 | 22.3686 | 22.3686 | -3.556151 |

Sample size calculation for a group sequential trials is performed in **two steps**:

**Define the (abstract) group sequential design**using the function`getDesignGroupSequential()`

. For details regarding this step, see the R markdown file Defining group sequential boundaries with rpact.**Calculate sample size**for the continuous endpoint by feeding the abstract design into the function`getSampleSizeMeans()`

.

In general, rpact supports both one-sided and two-sided group sequential designs. However, if futility boundaries are specified, only one-sided tests are permitted. **For simplicity, it is often preferred to use one-sided tests for group sequential designs** (typically, with ).

R code for a simple example is provided below:

```
# Example: Group-sequential design with O'Brien & Fleming type alpha-spending
# and one interim at 60% information
design <- getDesignGroupSequential(
sided = 1, alpha = 0.025, beta = 0.2,
informationRates = c(0.6, 1), typeOfDesign = "asOF"
)
# Trial assumes an effect size of 10 as above, a stDev = 24, and an allocation
# ratio of 2
sampleSizeResultGS <- getSampleSizeMeans(
design,
alternative = 10, stDev = 24, allocationRatioPlanned = 2
)
# Standard rpact output (sample size object only, not design object)
kable(sampleSizeResultGS)
```

stages | alternative | meanRatio | thetaH0 | normalApproximation | stDev | groups | allocationRatioPlanned | maxNumberOfSubjects | maxNumberOfSubjects1 | maxNumberOfSubjects2 | numberOfSubjects | numberOfSubjects1 | numberOfSubjects2 | rejectPerStage | earlyStop | expectedNumberOfSubjectsH0 | expectedNumberOfSubjectsH01 | expectedNumberOfSubjectsH1 | criticalValuesEffectScale |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 10 | FALSE | 0 | FALSE | 24 | 2 | 2 | 207.1351 | 138.09 | 69.04502 | 124.2810 | 82.85402 | 41.42701 | 0.3123193 | 0.3123193 | 206.8195 | 202.3981 | 181.2581 | 12.392731 |

2 | 10 | FALSE | 0 | FALSE | 24 | 2 | 2 | 207.1351 | 138.09 | 69.04502 | 207.1351 | 138.09003 | 69.04502 | 0.4876807 | 0.3123193 | 206.8195 | 202.3981 | 181.2581 | 7.049874 |

```
# Summary rpact output for sample size object
kable(summary(sampleSizeResultGS))
```

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 10 | FALSE | 0 | FALSE | 24 | 2 | 2 | 207.1351 | 138.09 | 69.04502 | 124.2810 | 82.85402 | 41.42701 | 0.3123193 | 0.3123193 | 206.8195 | 202.3981 | 181.2581 | 12.392731 |

2 | 10 | FALSE | 0 | FALSE | 24 | 2 | 2 | 207.1351 | 138.09 | 69.04502 | 207.1351 | 138.09003 | 69.04502 | 0.4876807 | 0.3123193 | 206.8195 | 202.3981 | 181.2581 | 7.049874 |

System: rpact 4.0.0, R version 4.3.3 (2024-02-29 ucrt), platform: x86_64-w64-mingw32

To cite R in publications use:

*R: A Language and Environment for Statistical Computing*. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/. To cite package ‘rpact’ in publications use:

*rpact: Confirmatory Adaptive Clinical Trial Design and Analysis*. R package version 4.0.0, https://www.rpact.com, https://github.com/rpact-com/rpact, https://rpact-com.github.io/rpact/, https://www.rpact.org.

This document describes how sample size and power calculations for count data can be performed using rpact. This is shown for the fixed sample and the group sequential case thereby illustrating different ways of entering recruitment and observation schemes. It also describes how blinded sample size recalculation procedures can be performed.

Examples for count data described in the literature are

- exacerbations in asthma and chronic obstructive pulmonary disease (COPD)
- counts of brain lesions by MRI in Multiple Sclerosis (MS)
- relapses in pediatric MS
- hospitalizations in heart failure trials
- number of occurrences of adverse events

Typically, the count outcome is assumed to be distributed according to a negative binomial distribution and the hypothesis to be tested is

where are the mean rates (in one time unit) of a negative binomial distributed random variable with overdispersion (shape) parameter ,

where refers to the exposure time of subject in treatment group at interim stage of the group sequential test procedure (cf., Mütze et al., 2019). The expectation and variance of are given by

respectively, i.e., the case refers to the case where is Poisson distributed. For the fixed sample case, the index for the interim stage is omitted. In superiority trials, , whereas, for non-inferiority trials, a positive non-inferiority margin is specified.

In many cases, each subject is observed a given length of time, e.g., one year. In this case, , and, as will be shown below, the sample size formulas described in the literature are applicable. If subjects entering the study have different exposure times, typically an accrual time is followed by an additional follow-up time. If subjects entering the study in an accrual period and the study time is , at time point , the time under exposure for subject in treatment at stage of the trial is with and denoting the recruitment time for subject in treatment . This more general approach is specifically necessary if the observation times at interim stages need to be estimated. This will also be illustrated by examples later on.

For group sequential designs, the test statistic is based on the Wald statistic which is the difference of the rates on the log-scale divided by its standard error. As shown in Mütze et al. (2019), if Maximum Likelihood estimates are used to estimate the true parameters, the sequence of Wald statistics asymptotically has the independent and normally distributed increments property. For designs with interim stages, it is essential that interim analyses take place after specified amounts of information. The information level of the Wald statistic (the Fisher information) at stage is given by

which simplifies to

if , i.e., if all subjects have complete observations. From these terms, essentially, the sample size and other calculations for a count data type design are derived.

The sample size calculation to meet power for two-sample comparisons is performed for

- for group sequential designs, the type of design (e.g., -spending),
- an assumed and ,
- assumed exposure times for treatment , and subjects , at interim stage ,
- a planned allocation ratio ,
- and an assumed overdispersion .

`getSampleSizeCounts()`

performs sample size and power calculations for count data designs. You can specify

- a group sequential or a fixed sample size setting
- either and , or and , or the pooled and , the latter being essential for blinded sample size reassessment (SSR) procedures (see below). and can be vectors
- different ways of calculation: fixed exposure time, accrual and study time, or accrual and fixed number of subjects
- staggered subjects entry

The usage of the function (listing the parameters that can be specified) is as follows:

```
getSampleSizeCounts(
design = NULL,
...,
lambda1 = NA_real_,
lambda2 = NA_real_,
lambda = NA_real_,
theta = NA_real_,
thetaH0 = 1,
overdispersion = 0,
fixedExposureTime = NA_real_,
accrualTime = NA_real_,
accrualIntensity = NA_real_,
followUpTime = NA_real_,
maxNumberOfSubjects = NA_real_,
allocationRatioPlanned = NA_real_
)
```

which will now be illustrated by examples.

`getPowerCounts()`

conversely calculates the power at given sample sizes, and essentially the same parameters can be specified.

Consider the clinical trial in COPD subjects from Zhu and Lakkis (2014). Assume that a new therapy is assumed to decrease the exacerbation rate from 0.80 to 0.68 (15% decrease relative to control) within an observation period of 0.75 years, i.e., each subject has a equal follow-up of 0.75 years. Subjects are randomly allocated to treatment and control with equal allocation 1:1.

The sample size that yields 90% power for detecting such a difference, if the overdispersion is assumed to be equal to 0.4, is obtained as follows.

First, load the `rpact`

package

```
library(rpact)
packageVersion("rpact") # version should be version 3.5.0 or higher
```

`[1] '4.0.0'`

The `example1$nFixed1`

element is the number of subjects in the treatment group, `example1$nFixed2`

refers to the number of subjects in the control group:

```
example1 <- getSampleSizeCounts(
alpha = 0.025,
beta = 0.2,
lambda2 = 0.8,
theta = 0.85,
overdispersion = 0.4,
fixedExposureTime = 0.75
)
c(example1$nFixed1, example1$nFixed2)
```

`[1] 1316 1316`

and we conclude that N = 2632 subjects in total are needed to provide 80% power.

Conversely, `getPowerCounts()`

performs the power calculation at given sample size, and note `directionUpper = FALSE`

to specify that the power is directed for :

```
example2 <- getPowerCounts(
alpha = 0.025,
lambda2 = 0.8,
theta = 0.85,
overdispersion = 0.4,
fixedExposureTime = 0.75,
directionUpper = FALSE,
maxNumberOfSubjects = example1$nFixed
)
example2$overallReject
```

`[1] 0.8000924`

The following graph illustrates the sample sizes for stronger effects . Note that for this plot only the lower and upper bound of need to be specified:

```
getSampleSizeCounts(
alpha = 0.025,
beta = 0.2,
lambda2 = 0.8,
theta = c(0.75, 0.85),
overdispersion = 0.4,
fixedExposureTime = 0.75
) |>
plot()
```

In the fixed sample case this is the only available plot type (`type = 5`

).

For `getPowerCounts()`

the only available plot type in the fixed sample case is `type = 7`

, the following graph also illustrates how elements can be added to the `ggplot2`

object:

```
getPowerCounts(
alpha = 0.025,
lambda2 = 0.8,
theta = c(0.8, 1),
overdispersion = 0.4,
fixedExposureTime = 0.75,
directionUpper = FALSE,
maxNumberOfSubjects = example1$nFixed
) |>
plot() +
ylab("Power") +
ggtitle("Power for count data design for varying effect") +
geom_hline(linewidth = 0.5, yintercept = 0.025, linetype = "dotted") +
geom_hline(linewidth = 0.5, yintercept = 0.8, linetype = "dotted")
```

The influence of the overdispersion parameter on the total sample size is illustrated in the following graph for increasing effect :

```
results <- c()
for (theta in seq(0.75, 0.85, 0.05)) {
for (phi in seq(0, 1, 0.1)) {
results <- rbind(
results,
getSampleSizeCounts(
alpha = 0.025,
beta = 0.2,
lambda2 = 0.8,
theta = theta,
overdispersion = phi,
fixedExposureTime = 0.75
) |>
as.data.frame()
)
}
}
ggplot(
data = results,
aes(x = overdispersion, y = nFixed, group = theta, color = as.factor(theta))
) +
xlab("Overdispersion") +
ylab("Total sample size") +
geom_line(linewidth = 1.1) +
geom_hline(linewidth = 0.5, yintercept = 1000, linetype = "dotted") +
geom_hline(linewidth = 0.5, yintercept = 2000, linetype = "dotted") +
geom_hline(linewidth = 0.5, yintercept = 3000, linetype = "dotted") +
labs(color = "Theta") +
theme_classic()
```

Zhu and Lakkis (2014) proposed three methods for calculating the sample size and the methodology implemented in `rpact`

refers to the M2 method described in their paper. The M2 method corresponds to the sample size formulas given in, e.g., Friede and Schmidli (2010a, 2010b) and Mütze et al (2019). It is in fact easy to recalculate the sample sizes in Table 1 of their paper:

```
results <- c()
for (phi in c(0.4, 0.7, 1, 1.5)) {
for (theta in c(0.85, 1.15)) {
for (lambda2 in seq(0.8, 1.4, 0.2)) {
results <- c(results, getSampleSizeCounts(
alpha = 0.025,
beta = 0.2,
lambda2 = lambda2,
theta = theta,
overdispersion = phi,
fixedExposureTime = 0.75
)$nFixed1)
}
}
}
cat(paste0(results, collapse = ", "))
```

1316, 1101, 957, 854, 1574, 1324, 1157, 1037, 1494, 1279, 1135, 1033, 1815, 1565, 1398, 1278, 1673, 1457, 1313, 1211, 2056, 1806, 1639, 1520, 1970, 1754, 1611, 1508, 2458, 2208, 2041, 1921

Similarly, Table 2 results (column M2) with unequal allocation between the treatment arms can be reconstructed by

```
results <- c()
for (phi in c(1, 5)) {
for (theta in c(0.5, 1.5)) {
for (lambda2 in c(2, 5, 10)) {
for (r in c(2 / 3, 1, 3 / 2)) {
results <- c(results, getSampleSizeCounts(
alpha = 0.025,
beta = 0.2,
lambda2 = lambda2,
theta = theta,
overdispersion = phi,
allocationRatioPlanned = r,
fixedExposureTime = 1
)$nFixed)
}
}
}
}
cat(paste0(results, collapse = ", "))
```

124, 116, 117, 90, 86, 88, 80, 76, 79, 280, 272, 287, 232, 224, 235, 215, 208, 217, 395, 376, 389, 363, 348, 360, 352, 338, 350, 1075, 1036, 1082, 1027, 988, 1030, 1012, 972, 1013

Slight deviations resulting from rounding errors.

With the `getSampleSizeCounts()`

function it is easy to determine the allocation ratio that provides the smallest overall sample size at given power . This can be done by setting `allocationRatioPlanned = 0`

. In the example from above,

```
example3 <- getSampleSizeCounts(
alpha = 0.025,
beta = 0.2,
lambda2 = 0.8,
theta = 0.85,
overdispersion = 0.4,
allocationRatioPlanned = 0,
fixedExposureTime = 0.75
)
```

`example3$allocationRatioPlanned`

`[1] 1.068791`

`example3$nFixed`

`[1] 2629`

calculates the optimum allocation ratio to be equal to 1.069 thereby reducing the necessary sample size only very slightly from 2632 to 2629. With this result it might not be reasonable at all to deviate from a planned 1:1 allocation.

Friede and Schmidli (2010a, 2010b) consider blinded SSR procedures with count data. They show that blinded SSR to reestimate the overdispersion parameter maintains the required power without increasing the Type I error rate. The procedure is simply to calculate the overdispersion at interim in a blinded manner and to recalculate the sample size with a pooled event rate estimate and *under the assumption of the originally assumed effect*.

For example, if in the situation from above the overdispersion was estimated from the pooled sample to be, say, 0.352, and the overall event rate is estimated as , the recalculated sample size is

```
example4 <- getSampleSizeCounts(
alpha = 0.025,
beta = 0.2,
lambda = 0.921,
theta = 0.85,
overdispersion = 0.352,
fixedExposureTime = 0.75
)
example4$nFixed
```

`[1] 2152`

thus reducing the necessary sample size from 2632 to 2152. Note that, of course, it is important when the interim review is performed. If it is done early, the nuisance parameters and cannot be estimated precisely enough, if it is done very late the recalculated sample size might be smaller and therefore the observation is larger than needed. This of course also has an impact on the test characteristics and might be investigated by simulations (Friede and Schmidli, 2010a). Methods for blinded estimation are compared by Schneider et al. (2013).

For checking the results of `rpact`

, the sample sizes in Table 1 from Friede and Schmidli (2010b) can be reconstructed by

```
results <- c()
for (theta in c(0.7, 0.8)) {
for (phi in c(0.4, 0.5, 0.6)) {
for (lambda in c(1, 1.5, 2)) {
results <- c(results, getSampleSizeCounts(
alpha = 0.025,
beta = 0.2,
lambda = lambda,
theta = theta,
overdispersion = phi,
fixedExposureTime = 1
)$nFixed2)
}
}
}
cat(paste0(results, collapse = ", "))
```

177, 135, 114, 190, 147, 126, 202, 159, 138, 446, 339, 286, 477, 371, 318, 509, 402, 349

and

```
results <- c()
for (theta in c(0.7, 0.8)) {
for (phi in c(0.4, 0.5, 0.6)) {
for (lambda in c(1, 1.5, 2)) {
results <- c(results, getSampleSizeCounts(
alpha = 0.025,
beta = 0.1,
lambda = lambda,
theta = theta,
overdispersion = phi,
fixedExposureTime = 1
)$nFixed2)
}
}
}
cat(paste0(results, collapse = ", "))
```

237, 180, 152, 254, 197, 168, 270, 213, 185, 597, 454, 383, 639, 496, 425, 681, 539, 467

Slight deviations resulting from rounding errors.

For the non-inferiority case a non-inferiority margin needs to be specified and entered as `thetaH0`

. Typically, no difference in the event rates is assumed between the treatment groups (i.e., ). In that case, the control arm sample sizes from Table 2 and Table 3 from Friede and Schmidli (2010b) are obtained with

```
results <- c()
for (delta0 in c(1.15, 1.2)) {
for (phi in c(0.4, 0.5, 0.6)) {
for (lambda in c(1, 1.5, 2)) {
results <- c(results, getSampleSizeCounts(
alpha = 0.025,
beta = 0.2,
lambda = lambda,
theta = 1,
thetaH0 = delta0,
overdispersion = phi,
fixedExposureTime = 1
)$nFixed2)
}
}
}
cat(paste0(results, collapse = ", "))
```

1126, 858, 724, 1206, 938, 804, 1286, 1018, 885, 662, 504, 426, 709, 551, 473, 756, 599, 520

We will now consider count data designs with interim stages. First, you need to specify the design which is here defined as an O’Brien and Fleming alpha spending function design with interim analyses planned after 40% and 70% of the information:

```
design <- getDesignGroupSequential(
informationRates = c(0.4, 0.7, 1),
typeOfDesign = "asOF"
)
```

Suppose study subjects are observed with fixed exposure time of 12 months and have event rates 0.2 and 0.3 in the treatment and the control arm, respectively, with an overdispersion parameter equal to 1. Specify these parameters as follows to obtain the summary:

```
getSampleSizeCounts(
design = design,
lambda1 = 0.2,
lambda2 = 0.3,
fixedExposureTime = 12,
overdispersion = 1.5
) |>
summary()
```

*Sample size calculation for a count data endpoint*

Sequential analysis with a maximum of 3 looks (group sequential design), overall significance level 2.5% (one-sided). The results were calculated for a two-sample test for count data, H0: lambda(1) / lambda(2) = 1, H1: effect = 0.667, lambda(1) = 0.2, lambda(2) = 0.3, overdispersion = 1.5, fixed exposure time = 12, power 80%.

Stage | 1 | 2 | 3 |
---|---|---|---|

Planned information rate | 40% | 70% | 100% |

Efficacy boundary (z-value scale) | 3.357 | 2.445 | 2.001 |

Cumulative power | 0.0580 | 0.4682 | 0.8000 |

Maximum number of subjects | 360.0 | ||

Information over stages | 19.4 | 33.9 | 48.5 |

Expected information under H0 | 48.4 | ||

Expected information under H0/H1 | 46.9 | ||

Expected information under H1 | 40.8 | ||

Maximum information | 48.5 | ||

Cumulative alpha spent | 0.0004 | 0.0074 | 0.0250 |

One-sided local significance level | 0.0004 | 0.0073 | 0.0227 |

Exit probability for efficacy (under H0) | 0.0004 | 0.0070 | |

Exit probability for efficacy (under H1) | 0.0580 | 0.4102 |

This summary displays the maximum amount of information (`round(example5$maxInformation, 2) =`

48.47) that needs to be achieved with N = 360 subjects together with stopping probabilities under , midway between and , under , and stopping probabilities under and if interim analyses are performed at 19.39 and 33.93, and the final analysis at 48.47.

If non-binding futility stops are planned, these might be derived from an O’Brien and Fleming beta spending function with , i.e., the following design as displayed in the graph below:

```
designFutility <- getDesignGroupSequential(
informationRates = c(0.4, 0.7, 1),
beta = 0.2,
typeOfDesign = "asOF",
typeBetaSpending = "bsOF",
bindingFutility = FALSE
)
designFutility |>
plot()
```

This yields the following test characteristics with additional futility stop probabilities, resulting in slightly higher number of subjects and information levels necessary to achieve power 80%:

```
getSampleSizeCounts(
design = designFutility,
lambda1 = 0.2,
lambda2 = 0.3,
fixedExposureTime = 12,
overdispersion = 1.5
) |>
summary()
```

*Sample size calculation for a count data endpoint*

Sequential analysis with a maximum of 3 looks (group sequential design), overall significance level 2.5% (one-sided). The results were calculated for a two-sample test for count data, H0: lambda(1) / lambda(2) = 1, H1: effect = 0.667, lambda(1) = 0.2, lambda(2) = 0.3, overdispersion = 1.5, fixed exposure time = 12, power 80%.

Stage | 1 | 2 | 3 |
---|---|---|---|

Planned information rate | 40% | 70% | 100% |

Efficacy boundary (z-value scale) | 3.357 | 2.445 | 2.001 |

Futility boundary (z-value scale) | 0.152 | 1.267 | |

Cumulative power | 0.0688 | 0.5133 | 0.8000 |

Maximum number of subjects | 394.0 | ||

Information over stages | 21.3 | 37.3 | 53.3 |

Expected information under H0 | 29.8 | ||

Expected information under H0/H1 | 39.4 | ||

Expected information under H1 | 41.3 | ||

Maximum information | 53.3 | ||

Cumulative alpha spent | 0.0004 | 0.0074 | 0.0250 |

Cumulative beta spent | 0.0427 | 0.1256 | 0.2000 |

One-sided local significance level | 0.0004 | 0.0073 | 0.0227 |

Overall exit probability (under H0) | 0.5608 | 0.3491 | |

Overall exit probability (under H1) | 0.1115 | 0.5274 | |

Exit probability for efficacy (under H0) | 0.0004 | 0.0070 | |

Exit probability for efficacy (under H1) | 0.0688 | 0.4446 | |

Exit probability for futility (under H0) | 0.5604 | 0.3421 | |

Exit probability for futility (under H1) | 0.0427 | 0.0829 |

Similar to survival designs (see, e.g., Planning a Survival Trial with rpact) it is possible with the `getSampleSizeCounts()`

function to calculate calendar times where the information is estimated to be observed under the given parameters.

For the first case, suppose there is uniform recruitment of subjects over 6 months, and subjects *are followed for a prespecified time period which is identical for all subjects* as above. Specify `accrualTime = 6`

as an additional function parameter and obtain the following summary:

```
example7 <- getSampleSizeCounts(
design = designFutility,
lambda1 = 0.2,
lambda2 = 0.3,
overdispersion = 1.5,
fixedExposureTime = 12,
accrualTime = 6
)
example7 |>
summary()
```

*Sample size calculation for a count data endpoint*

Sequential analysis with a maximum of 3 looks (group sequential design), overall significance level 2.5% (one-sided). The results were calculated for a two-sample test for count data, H0: lambda(1) / lambda(2) = 1, H1: effect = 0.667, lambda(1) = 0.2, lambda(2) = 0.3, overdispersion = 1.5, fixed exposure time = 12, accrual time = 6, power 80%.

Stage | 1 | 2 | 3 |
---|---|---|---|

Planned information rate | 40% | 70% | 100% |

Efficacy boundary (z-value scale) | 3.357 | 2.445 | 2.001 |

Futility boundary (z-value scale) | 0.152 | 1.267 | |

Cumulative power | 0.0688 | 0.5133 | 0.8000 |

Calendar time | 4.70 | 7.11 | 18.00 |

Expected study duration under H1 | 10.77 | ||

Number of subjects | 308.0 | 394.0 | 394.0 |

Expected number of subjects under H1 | 384.4 | ||

Maximum number of subjects | 394.0 | ||

Information over stages | 21.3 | 37.3 | 53.3 |

Expected information under H0 | 29.8 | ||

Expected information under H0/H1 | 39.4 | ||

Expected information under H1 | 41.3 | ||

Maximum information | 53.3 | ||

Cumulative alpha spent | 0.0004 | 0.0074 | 0.0250 |

Cumulative beta spent | 0.0427 | 0.1256 | 0.2000 |

One-sided local significance level | 0.0004 | 0.0073 | 0.0227 |

Overall exit probability (under H0) | 0.5608 | 0.3491 | |

Overall exit probability (under H1) | 0.1115 | 0.5274 | |

Exit probability for efficacy (under H0) | 0.0004 | 0.0070 | |

Exit probability for efficacy (under H1) | 0.0688 | 0.4446 | |

Exit probability for futility (under H0) | 0.5604 | 0.3421 | |

Exit probability for futility (under H1) | 0.0427 | 0.0829 |

You might also use the gscounts package in order to obtain very similar results. The relevant functionality for count data, however, is included in `rpact`

and the maintainer of `gscounts`

encourages the use of `rpact`

.

A different situation is given if subjects *have varying exposure time*. For this setting, assume we again have uniform recruitment of subjects over 6 months, but the study ends 12 months after the last subject entered the study. That is, the study is planned to be conducted in 18 months, having subjects that are observed (i.e., under exposure) between 12 and 18 months.

In order to perform the sample size calculation for this case, instead of `fixedExposureTime`

the parameter `followUpTime`

has to specified. It is the assumed (additional) follow-up time for the study and so the total study duration is `accrualTime + followUpTime`

.

```
example8 <- getSampleSizeCounts(
design = designFutility,
lambda1 = 0.2,
lambda2 = 0.3,
overdispersion = 1.5,
accrualTime = 6,
followUpTime = 12
)
example8 |>
summary()
```

*Sample size calculation for a count data endpoint*

Sequential analysis with a maximum of 3 looks (group sequential design), overall significance level 2.5% (one-sided). The results were calculated for a two-sample test for count data, H0: lambda(1) / lambda(2) = 1, H1: effect = 0.667, lambda(1) = 0.2, lambda(2) = 0.3, overdispersion = 1.5, accrual time = 6, follow-up time = 12, power 80%.

Stage | 1 | 2 | 3 |
---|---|---|---|

Planned information rate | 40% | 70% | 100% |

Efficacy boundary (z-value scale) | 3.357 | 2.445 | 2.001 |

Futility boundary (z-value scale) | 0.152 | 1.267 | |

Cumulative power | 0.0688 | 0.5133 | 0.8000 |

Calendar time | 4.81 | 7.42 | 18.00 |

Expected study duration under H1 | 10.95 | ||

Number of subjects | 304.0 | 380.0 | 380.0 |

Expected number of subjects under H1 | 371.5 | ||

Maximum number of subjects | 380.0 | ||

Information over stages | 21.3 | 37.3 | 53.3 |

Expected information under H0 | 29.8 | ||

Expected information under H0/H1 | 39.4 | ||

Expected information under H1 | 41.3 | ||

Maximum information | 53.3 | ||

Cumulative alpha spent | 0.0004 | 0.0074 | 0.0250 |

Cumulative beta spent | 0.0427 | 0.1256 | 0.2000 |

One-sided local significance level | 0.0004 | 0.0073 | 0.0227 |

Overall exit probability (under H0) | 0.5608 | 0.3491 | |

Overall exit probability (under H1) | 0.1115 | 0.5274 | |

Exit probability for efficacy (under H0) | 0.0004 | 0.0070 | |

Exit probability for efficacy (under H1) | 0.0688 | 0.4446 | |

Exit probability for futility (under H0) | 0.5604 | 0.3421 | |

Exit probability for futility (under H1) | 0.0427 | 0.0829 |

As expected, the maximum number of subjects is a bit lower (380 vs. 394) having also different calendar time estimates.

In `getSampleSizeCounts()`

, you can specify `maxNumberOfSubjects`

or `accrualTime`

and `acrualIntensity`

and find the study time, i.e., the necessary follow up time in order to achieve the required information levels. For example, one can calculate how long the study duration should be if subject recruitment is performed over 7.5 months instead of 6 months, i.e., if 475 instead of 380 subjects will be recruited?

```
example9 <- getSampleSizeCounts(
design = designFutility,
lambda1 = 0.2,
lambda2 = 0.3,
overdispersion = 1.5,
accrualTime = 7.5,
maxNumberOfSubjects = 7.5 / 6 * example8$maxNumberOfSubjects
)
example9$calendarTime
```

```
[,1]
[1,] 4.799704
[2,] 7.031870
[3,] 9.979368
```

You might also specify the parameter `accrualIntensity`

that describes the *number of subjects per time unit* in order to obtain the same result:

```
example10 <- getSampleSizeCounts(
design = designFutility,
lambda1 = 0.2,
lambda2 = 0.3,
overdispersion = 1.5,
accrualTime = c(0, 7.5),
accrualIntensity = c(1 / 6 * example8$maxNumberOfSubjects)
)
example10$calendarTime
```

```
[,1]
[1,] 4.799704
[2,] 7.031870
[3,] 9.979368
```

Since `accrualTime`

and `accrualIntensity`

can be defined as vectors, it is also possible to define a non-uniform recruitment scheme and investigate the influence on the estimated parameters.

As an important note, the Fisher information used for the calendar time calculation is bounded for and varying time point. Therefore it might happen that the numerical search algorithm fails, there is no derivable observation time, and an error message is displayed. For , this problem does not occur.

Friede, T., Schmidli, H. (2010a). Blinded sample size reestimation with count data: methods and applications in multiple sclerosis. *Statistics in Medicine*, 29, 1145-1156. https://doi.org/10.1002/sim.3861

Friede, T., Schmidli, H. (2010b). Blinded sample size reestimation with negative binomial counts in superiority and non-inferiority trials. *Methods of Information in Medicine*, 49, 618-624. https://doi.org/10.3414/ME09-02-0060

Mütze, T., Glimm, E., Schmidli, H., Friede, T. (2019). Group sequential designs for negative binomial outcomes. *Statistical Methods in Medical Research*, 28(8), 2326-2347. https://doi.org/10.1177/0962280218773115

Schneider, S., Schmidli, H., Friede, T. (2013). Robustness of methods for blinded sample size re‐estimation with overdispersed count data. Statistics in Medicine, 32(21), 3623-3653. https://doi.org/10.1002/sim.5800

Wassmer, G and Brannath, W. *Group Sequential and Confirmatory Adaptive Designs in Clinical Trials* (2016), ISBN 978-3319325606 https://doi.org/10.1007/978-3-319-32562-0

Zhu, H., Lakkis, H. (2014). Sample size calculation for comparing two negative binomial rates. *Statistics in Medicine*, 33, 376-387. https://doi.org/10.1002/sim.5947

*System* rpact 4.0.0, R version 4.3.3 (2024-02-29 ucrt), *platform* x86_64-w64-mingw32

To cite R in publications use:

*R: A Language and Environment for Statistical Computing*. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/. To cite package ‘rpact’ in publications use:

*rpact: Confirmatory Adaptive Clinical Trial Design and Analysis*. R package version 4.0.0, https://www.rpact.com, https://github.com/rpact-com/rpact, https://rpact-com.github.io/rpact/, https://www.rpact.org.

In rpact version 3.3, the group sequential methodology from Hampson and Jennison (2013) is implemented. As traditional group sequential designs are characterized specifically by the underlying boundary sets, one main task was to write a function returning the decision critical values according to the calculation rules in Hampson and Jennison (2013). The function returning the respective critical values has been validated, particularly via simulation towards Type I error rate control in various settings (for an example see below). Subsequently, functions characterizing a delayed response group sequential test in terms of power, maximum sample size and expected sample size have been written. These functions were integrated in the rpact functions `getDesignGroupSequential()`

, `getDesignCharacteristics()`

, and the corresponding `getSampleSize...()`

and `getPower...()`

functions.

The classical group sequential methodology works on the assumptions of having no treatment response delay, i.e., it is assumed that enrolled subjects are observed upon recruitment or at least shortly after. In many practical situation, this assumption does not hold true. Instead, it might be that there is a latency between the timing of recruitment and the actual measurement of a primary endpoint. That is, at interim there is some information in pipeline.

One method to handle this pipeline information was proposed by Hampson & Jennison (2013) and is called *delayed response group sequential design*. Assume that, in a -stage trial, given we will proceed to the trial end, we will observe an information sequence , and the corresponding -statistics . As we now have information in pipeline, define as the information available after awaiting the delay after having observed . Let be the vector of -statistics that are calculated based upon the information levels . Given boundary sets , and , a -stage delayed response group sequential design has the following structure:

That is, at each of the interim analyses, there is information outstanding. If does not fall within , the conclusion is to stop the trial neither for efficacy nor for futility (as it would be in a traditional group sequential design), but to *irreversibly* stop the recruitment. Afterwards, the outstanding data is awaited such that after awaiting the delayed information fraction of time, information is available and a new -statistic can be calculated. This statistic is then used to test the actual hypothesis of interest using a *decision critical value* . The heuristic idea is that the recruitment is stopped only if one is “confident about the subsequent decision”. This means that if the recruitment has been stopped for , it should be likely to obtain a subsequent rejection. In contrast, if the recruitment is stopped for , obtaining a subsequent rejection should be rather unlikely (whereas still possible).

The main difference to a group sequential design is that due to the delayed information, each interim potentially consists of two analyses: a recruitment stop analysis and if indicated a subsequent decision analysis. Hampson & Jennison (2013) propose to define the boundary sets and as error-spending boundaries determined using -spending and -spending functions in the one-sided testing setting with *binding* futility boundaries. In rpact, this methodology is extended to (one-sided) testing situations where (binding or non-binding) futility boundaries are available. According to Hampson & Jennison (2013), the boundaries with are chosen such that “reversal probabilities” are balanced. More precisely, is chosen as the (unique) solution of:

and for , is the (unique) solution of:

It can easily be shown that this constraint yields critical values that ensure Type I error rate control. We call this approach the *reverse probabilities approach*.

The values , , are determined via a root-search. Having determined all applicable boundaries, the rejection probability of the procedure given a treatment effect is:

i.e., the probability to firstly stop the recruitment, followed by a rejection at any of the stages. Setting , this expression gives the Type I error rate. These values are calculated also for and specified maximum sample size (in the prototype case of testing against with ).

As for the group sequential designs, the *inflation factor*, , of a delayed response design is the maximum sample size, , to achieve power for testing against (in the prototype case) relative to the fixed sample size, :

Let denote the number of subjects observed at the -th recruitment stop analysis and the number of subjects recruited at the subsequent -th decision analysis. Given , it holds that

The expected sample size, , of a delayed response design is:

with . As for the maximum sample size, this is provided relative to the sample size in a fixed sample design, i.e., as the expected reduction in sample size.

We illustrate the calculation of decision critical values and design characteristics using the described approach for a three-stage group sequential design with Kim & DeMets - and -spending functions with .

**First, load the rpact package **

```
library(rpact)
packageVersion("rpact") # version should be version 3.3 or later
```

`[1] '4.0.0'`

The delayed responses utility is simply to add the parameter `delayedInformation`

to the `getDesignGroupSequential()`

(or `getDesignInverseNormal()`

) function. `delayedInformation`

is either a positive constant or a vector of length with positive elements describing the amount of pipeline information at interim :

```
gsdWithDelay <- getDesignGroupSequential(
kMax = 3,
alpha = 0.025,
beta = 0.2,
typeOfDesign = "asKD",
gammaA = 2,
gammaB = 2,
typeBetaSpending = "bsKD",
informationRates = c(0.3, 0.7, 1),
delayedInformation = c(0.16, 0.2),
bindingFutility = TRUE
)
```

```
Warning: The delayed information design feature is experimental and hence not
fully validated (see www.rpact.com/experimental)
```

The output contains the continuation region for each interim analysis defined through the upper and lower boundary of the continuation region. Additionally, the interim analyses are additionally characterized through decision critical values (1.387, 1.82, 2.03):

`kable(gsdWithDelay)`

typeOfDesign | kMax | stages | informationRates | alpha | beta | power | twoSidedPower | futilityBounds | bindingFutility | gammaA | gammaB | sided | delayedInformation | tolerance | alphaSpent | betaSpent | typeBetaSpending | criticalValues | stageLevels | decisionCriticalValues | reversalProbabilities | delayedInformationRates |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

asKD | 3 | 1 | 0.3 | 0.025 | 0.2 | 0.1052856 | FALSE | -0.5081199 | TRUE | 2 | 2 | 1 | 0.16 | 0 | 0.00225 | 0.018 | bsKD | 2.840804 | 0.0022500 | 1.386587 | 0.0000733 | 0.46 |

asKD | 3 | 2 | 0.7 | 0.025 | 0.2 | 0.5578892 | FALSE | 1.0957436 | TRUE | 2 | 2 | 1 | 0.20 | 0 | 0.01225 | 0.098 | bsKD | 2.294934 | 0.0108684 | 1.820131 | 0.0017979 | 0.90 |

asKD | 3 | 3 | 1.0 | 0.025 | 0.2 | 0.8000000 | FALSE | NA | TRUE | 2 | 2 | 1 | NA | 0 | 0.02500 | 0.200 | bsKD | 2.030383 | 0.0211588 | 2.030383 | NA | NA |

Note that the last decision critical values (2.03) is equal to the last critical value of the corresponding group sequential design without delayed response. To obtain the design characteristics, the function `getDesignCharacteristics()`

calculates the maximum sample size for the design (`shift`

), the inflation factor and the average sample sizes under the null hypothesis, the alternative hypothesis, and under a value in between and :

`kable(getDesignCharacteristics(gsdWithDelay))`

inflationFactor | stages | information | power | rejectionProbabilities | futilityProbabilities | averageSampleNumber1 | averageSampleNumber01 | averageSampleNumber0 |
---|---|---|---|---|---|---|---|---|

1.051379 | 1 | 2.475644 | 0.1026316 | 0.1026316 | 0.0186924 | 0.9268982 | 0.93292 | 0.8165229 |

1.051379 | 2 | 5.776502 | 0.5563326 | 0.4537010 | 0.0833539 | 0.9268982 | 0.93292 | 0.8165229 |

1.051379 | 3 | 8.252146 | 0.8000000 | 0.2436674 | NA | 0.9268982 | 0.93292 | 0.8165229 |

Using the `summary()`

function, these number can be directly displayed without using the `getDesignCharacteristics()`

function as follows:

`kable(summary(gsdWithDelay))`

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

asKD | 3 | 1 | 0.3 | 0.025 | 0.2 | 0.1052856 | FALSE | -0.5081199 | TRUE | 2 | 2 | 1 | 0.16 | 0 | 0.00225 | 0.018 | bsKD | 2.840804 | 0.0022500 | 1.386587 | 0.0000733 | 0.46 |

asKD | 3 | 2 | 0.7 | 0.025 | 0.2 | 0.5578892 | FALSE | 1.0957436 | TRUE | 2 | 2 | 1 | 0.20 | 0 | 0.01225 | 0.098 | bsKD | 2.294934 | 0.0108684 | 1.820131 | 0.0017979 | 0.90 |

asKD | 3 | 3 | 1.0 | 0.025 | 0.2 | 0.8000000 | FALSE | NA | TRUE | 2 | 2 | 1 | NA | 0 | 0.02500 | 0.200 | bsKD | 2.030383 | 0.0211588 | 2.030383 | NA | NA |

summary(gsdWithDelay)

It might be of interest to check whether this in fact yields Type I error rate control. This can be done with the internal function `getSimulatedRejectionsDelayedResponse()`

as follows:

`rpact:::getSimulatedRejectionsDelayedResponse(gsdWithDelay, iterations = 10^6)`

$simulatedAlpha [1] 0.025061

$delta [1] 0

$iterations [1] 1000000

$seed [1] 1775378758

$confidenceIntervall [1] 0.02475463 0.02536737

$alphaWithin95ConfidenceIntervall [1] TRUE

$time Time difference of 7.317216 secs

It also checks whether the simulated Type T error rate is within the 95% error boundaries.

Compared to the design with no delayed information, it turns out that the inflation factor is quite the same though the average sample sizes are different. This is due to the fact that, compared to the design with no delayed response, the actual number of patients to be used for the analysis is larger:

```
gsdWithoutDelay <- getDesignGroupSequential(
kMax = 3,
alpha = 0.025,
beta = 0.2,
typeOfDesign = "asKD",
gammaA = 2,
gammaB = 2,
typeBetaSpending = "bsKD",
informationRates = c(0.3, 0.7, 1),
bindingFutility = TRUE
)
kable(summary(gsdWithoutDelay))
```

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

asKD | 3 | 1 | 0.3 | 0.025 | 0.2 | 0.1052856 | FALSE | -0.5081199 | TRUE | 2 | 2 | 1 | 0 | 0.00225 | 0.018 | bsKD | 2.840804 | 0.0022500 |

asKD | 3 | 2 | 0.7 | 0.025 | 0.2 | 0.5578892 | FALSE | 1.0957436 | TRUE | 2 | 2 | 1 | 0 | 0.01225 | 0.098 | bsKD | 2.294934 | 0.0108684 |

asKD | 3 | 3 | 1.0 | 0.025 | 0.2 | 0.8000000 | FALSE | NA | TRUE | 2 | 2 | 1 | 0 | 0.02500 | 0.200 | bsKD | 2.030383 | 0.0211588 |

It might be also of interest to evaluate the expected sample size under a range of parameter values, e.g., to obtain an optimum design under some criterion that is based on all parameter value within some specified range. Keeping in mind that the `prototype case`

is for testing against with (known) , this is obtained with the following commands:

```
nMax <- getDesignCharacteristics(gsdWithDelay)$shift # use calculated sample size for the protoype case
deltaRange <- seq(-0.2, 1.5, 0.05)
ASN <- getPowerMeans(gsdWithDelay,
groups = 1, normalApproximation = TRUE, alternative = deltaRange,
maxNumberOfSubjects = nMax
)$expectedNumberOfSubjects
dat <- data.frame(delta = deltaRange, ASN = ASN, delay = "delay")
ASN <- getPowerMeans(gsdWithoutDelay,
groups = 1, normalApproximation = TRUE, alternative = deltaRange,
maxNumberOfSubjects = nMax
)$expectedNumberOfSubjects
dat <- rbind(dat, data.frame(delta = deltaRange, ASN = ASN, delay = "no delay"))
library(ggplot2)
myTheme <- theme(
axis.title.x = element_text(size = 14),
axis.text.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.y = element_text(size = 14)
)
ggplot(data = dat, aes(x = delta, y = ASN, group = delay, linetype = factor(delay))) +
geom_line(size = 0.8) +
ylim(0, ceiling(nMax)) +
myTheme +
theme_classic() +
xlab("alternative") +
ylab("Expected number of subjects") +
geom_hline(size = 1, yintercept = nMax, linetype = "dotted") +
geom_vline(size = 0.6, xintercept = 0, linetype = "dotted") +
geom_vline(size = 0.6, xintercept = 0.5, linetype = "dotted") +
geom_vline(size = 0.6, xintercept = 1, linetype = "dotted") +
labs(linetype = "") +
annotate(geom = "text", x = 0, y = nMax - 0.3, label = "fixed sample size", size = 4)
```

```
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
```

Note that, in contrast, the rejection probabilities are quite the same for the different designs:

```
reject <- c(
getPowerMeans(gsdWithDelay,
groups = 1, normalApproximation = TRUE, alternative = deltaRange,
maxNumberOfSubjects = nMax
)$overallReject,
getPowerMeans(gsdWithoutDelay,
groups = 1, normalApproximation = TRUE, alternative = deltaRange,
maxNumberOfSubjects = nMax
)$overallReject
)
dat$reject <- reject
ggplot(data = dat, aes(x = delta, y = reject, group = delay, linetype = factor(delay))) +
geom_line(size = 0.8) +
ylim(0, 1) +
myTheme +
theme_classic() +
xlab("alternative") +
ylab("rejection probability") +
geom_hline(size = 1, yintercept = 1 - gsdWithDelay$beta, linetype = "dotted") +
geom_vline(size = 0.6, xintercept = 0, linetype = "dotted") +
geom_vline(size = 0.6, xintercept = 0.5, linetype = "dotted") +
geom_vline(size = 0.6, xintercept = 1, linetype = "dotted") +
labs(linetype = "")
```

Since we used the `nMax`

from the design with delayed responses, the power is 80% for this design (for the design without delayed response it is slightly below).

We illustrate the calculation of power and average sample size with an example provided by Schüürhuis (2022), p.68: Suppose it is planned to conduct a parallel group trial with subjects per arm to be linearly recruited within 24 months in presence of a delay of . The significance level is , the nominal type II error is at a treatment effect of . The boundaries are chosen to be calculated using O’Brien-Fleming-like - and -spending functions and the interim is planned after information has been collected, i.e., after months into the trial. Therefore, it is important to note that is the time point of analysis for the first interim since at this time point the full information of 30% of the subjects is available.

The numbers rpovided in Table 5.2 of Schüürhuis (2022) for the *Hampson and Jennison* approach can be obtained with the following commands:

```
gsdTwoStagesWithDelay <- getDesignGroupSequential(
kMax = 2,
alpha = 0.025,
beta = 0.2,
typeOfDesign = "asOF",
typeBetaSpending = "bsOF",
informationRates = c(0.3, 1),
delayedInformation = 5 / 24,
bindingFutility = TRUE
)
```

```
Warning: The delayed information design feature is experimental and hence not
fully validated (see www.rpact.com/experimental)
```

```
results <- getPowerMeans(
design = gsdTwoStagesWithDelay,
groups = 2,
normalApproximation = TRUE,
alternative = 0.3,
stDev = 1,
maxNumberOfSubjects = 350
)
# expected number of subjects table 5.2
round(results$expectedNumberOfSubjects / 2, 3)
```

`[1] 172.6`

```
# expected trial duration table 5.2
round(results$earlyStop * 17.2 + (1 - results$earlyStop) * 29, 3)
```

`[1] 28.671`

```
# power table 5.2
round(results$overallReject, 3)
```

`[1] 0.798`

It has to be noted that the time points of analysis are derived under the assumption of a *linear recruitment* of patients. Otherwise, these calculations need to be adapted for a non-linear case (which is easy to be done).

The *Recruitment Pause* approach can be obtained with the following commands (without using the `delayedInformation`

parameter). As above, the interim analysis information is fully observed after months, whereas the final information is here after months:

```
gsdTwoStagesWithoutDelay <- getDesignGroupSequential(
kMax = 2,
alpha = 0.025,
beta = 0.2,
typeOfDesign = "asOF",
typeBetaSpending = "bsOF",
informationRates = c(0.3 + 5 / 24, 1),
bindingFutility = FALSE
)
results <- getPowerMeans(
design = gsdTwoStagesWithoutDelay,
groups = 2,
normalApproximation = TRUE,
alternative = 0.3,
stDev = 1,
maxNumberOfSubjects = 350
)
# expected number of subjects table 5.2
round(results$expectedNumberOfSubjects / 2, 3)
```

`[1] 153.053`

```
# expected trial duration table 5.2
round(results$earlyStop * 17.2 + (1 - results$earlyStop) * 34, 3)
```

`[1] 29.715`

```
# power table 5.2
round(results$overallReject, 3)
```

`[1] 0.779`

The decision boundaries of the delayed response group sequential design can be illustrated as `type = 1`

plot for the design. This adds the decision boundaries as crosses together with the continuation region. Note that other plot types are directly accounting for the delayed response situation as the required numbers are calculated for this case.

`plot(gsdWithDelay)`

The approach described so far uses the identity of reversal probabilities to derive the decision boundaries. An alternative approach is to demand *two* conditions for rejecting the null at a given stage of the trial and spend a specified amount of significance at this stage. This definition is independent of the specification of futility boundaries but can be used with these as well. We show how the (new) rpact function `getGroupSequentialProbabilities()`

can handle this situation.

At stage , in order to reject , the test statistic needs to exceed the upper continuation bound *and* needs to exceed the critical value . Hence, the set of upper continuation boundaries and critical values is defined through the conditions

Since this cannot be solved without additional constraints, the critical values are fixed as . This makes sense since it often turns out that the optimum boundaries using the *reversal probabilities approach* are smaller than and the unadjusted boundary is a reasonable choice for a minimum requirement for rejecting .

Starting by , the values , with fixed , are successively determined via a root-search.

The conditions for the Type II error rate are the following:

The algorithm to derive the acceptance boundaries can be briefly described as follows: At given rejection boundaries (calculated from above) the algorithm successively calculates the acceptance boundaries using the specifed -spending function and an arbitrary sample size or “shift” value. If the last stage acceptance critical value is smaller than the last stage critical value, the shift value is increased, otherwise it is decreased. This is repeated until the last stage critical values coincide. The resulting shift value can be interpreted as the maximum necessary sample size to achieve power . Using the algorithm, we have to additionally specify upper and lower boundaries for the `shift`

(which can be interpreted as the maximum sample size for the group sequential design in the prototype case, cf., Wassmer & Brannath, 2016). This is set to be within 0 and 100 which covers practically relevant situations.

We use the function `getGroupSequentialProbabilities()`

to calculate the critical values by a numerical root search. For this, we define a -spending function `spend()`

which can be arbitrarily chosen. Here, we define a function according to the power family of Kim & DeMets with `gammaA = 1.345`

. This value was shown by Hampson & Jennison (2013) to be optimum in a specific context but this does not matter here. The (upper) continuation boundaries together with the decision boundaries and the last stage critical boundary are computed using the `uniroot()`

function as follows.

For , `decisionMatrix`

is set equal to for , we use whereas, for , we use

for calculating the stagewise rejection probabilities that yield Type I error rate control with appropriately defined information rates.

```
### Derive decision boundaries for delayed response alpha spending approach
alpha <- 0.025
gammaA <- 1.345
tolerance <- 1E-6
# Specify use function
spend <- function(x, size, gamma) {
return(size * x^gamma)
}
infRates <- c(28, 54, 96) / 96
kMax <- length(infRates)
delay <- rep(16, kMax - 1) / 96
u <- rep(NA, kMax)
c <- rep(qnorm(1 - alpha), kMax - 1)
for (k in (1:kMax)) {
if (k < kMax) {
infRatesPlusDelay <- c(infRates[1:k], infRates[k] + delay[k])
} else {
infRatesPlusDelay <- infRates
}
u[k] <- uniroot(
function(x) {
if (k == 1) {
d <- matrix(c(
x, c[k],
Inf, Inf
), nrow = 2, byrow = TRUE)
} else if (k < kMax) {
d <- matrix(c(
rep(-Inf, k - 1), x, c[k],
u[1:(k - 1)], Inf, Inf
), nrow = 2, byrow = TRUE)
} else {
d <- matrix(c(
rep(-Inf, k - 1), x,
u[1:(k - 1)], Inf
), nrow = 2, byrow = TRUE)
}
probs <- getGroupSequentialProbabilities(d, infRatesPlusDelay)
if (k == 1) {
probs[2, k + 1] - probs[1, k + 1] - spend(infRates[k], alpha, gammaA)
} else if (k < kMax) {
probs[2, k + 1] - probs[1, k + 1] - (spend(infRates[k], alpha, gammaA) -
spend(infRates[k - 1], alpha, gammaA))
} else {
probs[2, k] - probs[1, k] - (spend(infRates[k], alpha, gammaA) -
spend(infRates[k - 1], alpha, gammaA))
}
},
lower = -8, upper = 8
)$root
}
round(u, 5)
```

`[1] 2.43743 2.24413 2.06854`

We note that any other spending function can be used to define the design. That is, you can also use the spending probabilities of, say, an O`Brien & Fleming design approach that is defined through the shape of the boundaries. Furthermore, it is also possible to use the boundaries together with the unadjusted critical values in an inverse normal -value combination test where the weights are fixed through the planned infoprmation rates and the delay.

The calculation of the test characteristics is straightforward and can be derived for designs with or without futility boundaries. In the following example, we show how to derive lower continuation (or futility) boundaries that are based on a -spending function approach. As above, we use the same Kim & DeMets spending function with `gammaB = 1.345`

. Due to numerical reasons, we do not use the `uniroot()`

function here but a bisection method to numerically search for the boundaries.

```
beta <- 0.1
gammaB <- 1.345
u0 <- rep(NA, kMax)
cLower1 <- 0
cUpper1 <- 100
prec1 <- 1
iteration <- 1E5
while (prec1 > tolerance) {
shift <- (cLower1 + cUpper1) / 2
for (k in (1:kMax)) {
if (k < kMax) {
infRatesPlusDelay <- c(infRates[1:k], infRates[k] + delay[k])
} else {
infRatesPlusDelay <- infRates
}
nz <- matrix(rep(sqrt(infRatesPlusDelay), 2), nrow = 2, byrow = TRUE) * sqrt(shift)
prec2 <- 1
cLower2 <- -8
cUpper2 <- 8
while (prec2 > tolerance) {
x <- (cLower2 + cUpper2) / 2
if (k == 1) {
d2 <- matrix(c(
u[k], c[k],
Inf, Inf
), nrow = 2, byrow = TRUE) - nz
probs <- getGroupSequentialProbabilities(d2, infRatesPlusDelay)
ifelse(pnorm(x - nz[1]) + probs[1, k + 1] < spend(infRates[k], beta, gammaB),
cLower2 <- x, cUpper2 <- x
)
} else if (k < kMax) {
d1 <- matrix(c(
pmin(u0[1:(k - 1)], u[1:(k - 1)]), x,
u[1:(k - 1)], Inf
), nrow = 2, byrow = TRUE) - nz[, 1:k]
probs1 <- getGroupSequentialProbabilities(d1, infRatesPlusDelay[1:k])
d2 <- matrix(c(
pmin(u0[1:(k - 1)], u[1:(k - 1)]), u[k], c[k],
u[1:(k - 1)], Inf, Inf
), nrow = 2, byrow = TRUE) - nz
probs2 <- getGroupSequentialProbabilities(d2, infRatesPlusDelay)
ifelse(probs1[1, k] + probs2[1, k + 1] < spend(infRates[k], beta, gammaB) -
spend(
infRates[k - 1],
beta, gammaB
),
cLower2 <- x, cUpper2 <- x
)
} else {
d1 <- matrix(c(
pmin(u0[1:(k - 1)], u[1:(k - 1)]), x,
u[1:(k - 1)], Inf
), nrow = 2, byrow = TRUE) - nz
probs <- getGroupSequentialProbabilities(d1, infRates)
ifelse(probs[1, k] < spend(infRates[k], beta, gammaB) -
spend(infRates[k - 1], beta, gammaB),
cLower2 <- x, cUpper2 <- x
)
}
iteration <- iteration - 1
ifelse(iteration > 0, prec2 <- cUpper2 - cLower2, prec2 <- 0)
}
u0[k] <- x
}
ifelse(u0[kMax] < u[kMax], cLower1 <- shift, cUpper1 <- shift)
ifelse(iteration > 0, prec1 <- cUpper1 - cLower1, prec1 <- 0)
}
round(u0, 5)
```

`[1] -0.40891 0.66367 2.06854`

`round(shift, 2)`

`[1] 12`

`round(shift / (qnorm(1 - alpha) + qnorm(1 - beta))^2, 3)`

`[1] 1.142`

We can compare these values with the “original” -spending approach with non-binding futility boundaries using the function `getDesignGroupSequential()`

:

```
x <- getDesignGroupSequential(
informationRates = infRates,
typeOfDesign = "asKD", typeBetaSpending = "bsKD",
gammaA = gammaA, gammaB = gammaB,
alpha = alpha, beta = beta, bindingFutility = FALSE
)
round(x$futilityBounds, 5)
```

`[1] -0.19958 0.80463`

`round(x$criticalValues, 5)`

`[1] 2.59231 2.39219 2.10214`

`round(getDesignCharacteristics(x)$inflationFactor, 3)`

`[1] 1.146`

We have shown how to handle a group sequential design with delayed responses in two diffenrent ways. So far, we have implemented the approach proposed by Hampson & Jennison (2013) that is based on reversal probabilities. The direct usage of the delayed information within the design definition make it easy for the user to apply these designs to commonly used trials with continuous, binary, and time to event endpoints. We have also shown how to use the `getGroupSequentialProbabilities()`

function to derive the critical values and the test characteristics for the alternative approach that “more directly” determines the critical values through a spending function approach.

System: rpact 4.0.0, R version 4.3.3 (2024-02-29 ucrt), platform: x86_64-w64-mingw32

To cite R in publications use:

*R: A Language and Environment for Statistical Computing*. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/. To cite package ‘rpact’ in publications use:

*rpact: Confirmatory Adaptive Clinical Trial Design and Analysis*. R package version 4.0.0, https://www.rpact.com, https://github.com/rpact-com/rpact, https://rpact-com.github.io/rpact/, https://www.rpact.org.

This document provides examples for simulating test characteristics for multi-stage enrichment designs for testing rates with rpact. Specifically, it is shown how to obtain the simulation results for the example provided in Wassmer & Brannath (2016), Sect. 11.2. An example showing how to obtain results from a data analysis is provided too.

Adaptive enrichment designs are applicable where studies of unselected patients might be unable to detect a drug effect and therefore it seems necessary to “enrich” the study with potential responders, defined as a subpopulation of the unselected patient population. If this is done in an adaptive and data-driven way (i.e., it is not clear upfront whether to use the selected population and this is decided based on data observed at an interim stage) we might use adaptive population enrichment designs.

Adaptive population enrichment designs enable the data-driven selection of one or more pre-specified subpopulations in an interim analysis and the confirmatory proof of efficacy in the selected subset(s) at the end of the trial. Sample size reassessment and other adaptive design changes can be performed as well. Strong control of the familywise error rate (FWER) is guaranteed by the use of -value combination tests together with the closed testing principle.

Enrichment factors may be predictive biomarkers, or they may be biomarkers or clinicopathologic or demographic characteristics associated with a predictive biomarker or with the target of a therapeutic agent. The lower the proportion of truly benefiting patients, the more advantageous it is to consider studying an enriched population. However, instead of limiting the enrollment only to the narrow subpopulation of interest, prospectively specified adaptive designs may also be used to consider the effect of the experimental treatment both in the wider entire patient population under investigation and in various subpopulations.

Assume that there is a full population with pre-specified subpopulations of interest denoted as such that . Let denote the full population . Assume a binary endpoint and consider a set of elementary hypotheses

where denotes the unknown rates in treatment group , , and population , . That is, refers to the response rate of the experimental treatment versus control in subpopulation .

At each stage , , consider the test statistic

where and are the observed rates at stage in the two treatment groups and is the observed cumulative rate in population (at stage ). These test statistics are approximately normal and so the stage-wise -values are calculated with the use of the normal cdf. If a subpopulation consists of several subsets, a stratified analysis should be performed where the corresponding Cochran-Mantel-Haenszel (CMH) test is applied.

The closed system of hypotheses consists of all possible intersection hypotheses

The global test decision follows from testing the global hypothesis

with a suitable global intersection test. If the global null hypothesis can be rejected, all other intersection hypotheses are then tested. By performing the closed test procedure an elementary hypothesis can be rejected if the combination test fulfills the rejection criterion for all with .

Given a combination function , at the second stage the hypothesis belonging to a selected subpopulation is rejected if

where denotes the index set of all excluded , , and denotes the critical value for the second stage.

For , using the adaptive closed test procedure, at an interim stage one can decide to continue to stage 2 to test and , only, or only. Note that for a two-stage trial where no subpopulation is selected at interim, the complete set of intersection hypotheses is tested at each stage, yielding -values and for each intersection hypothesis . These -values are combined according to the specified combination test, for example, the inverse normal method or Fisher’s combination test. This combination test might have a power disadvantage as compared to the single-stage non-adaptive test where the -values would be obtained from the pooled data. However, we have the advantage that data driven adaptations including subgroup selection are possible, thereby improving power.

First, load the R packages `dplyr`

, `ggplot2`

, and `gridExtra`

:

```
library(dplyr)
library(ggplot2)
library(gridExtra)
```

Second, as always, load the rpact package:

```
library(rpact)
packageVersion("rpact") # should be version 3.3 or later
```

`[1] '4.0.0'`

The `rpact`

function `getSimulationEnrichmentRates()`

performs the simulation of a population enrichment design for testing the difference of proportions in two treatment groups. The way of how these simulations can be performed is similar to the case of multi-armed designs. Particularly, see the vignettes Planning and Analyzing a Group-Sequential Multi-Arm Multi-Stage Design with Binary Endpoint using rpact and Simulating Multi-Arm Designs with a Continuous Endpoint using rpact. Examples for creating plots for enrichment design simulations are illustrated in the vignette How to Create One- and Multi-Arm Simulation Result Plots with rpact.

An essentially new part is the definition of effect sizes and prevalences for the considered subpopulations. These typically consist of subsets of the considered entire (full) population (see Sect. 11.2 in Wassmer & Brannath, 2016).

For , one subpopulation is considered. Here we have to specify the prevalence of and the assumed effect sizes in and . The effect in is a weighted average of the effect sizes in the disjunct subgroups and , respectively. There is typically a large number of possible configurations because each effect size in is combined with the effect sizes in . For , things are more difficult. Generally, prevalences and effect sizes have to be specified for four subsets of and even more possible configurations have to be considered. Note that also the case of nested of non-overlapping subpopulations can be considered as illustrated in Fig. 11.4 in Wassmer & Brannath (2016). `rpact`

allows up to , however, eight subsets need to be specified here.

In the `rpact`

function `getSimulationEnrichmentRates()`

, the parameter `effectList`

has to be defined which is a list of parameters defining the subsets and yielding the subpopulations and their prevalences and effect sizes. We illustrate this for by the example presented in Wassmer & Dragalin (2015) and Wassmer & Brannath (2016), Sect. 11.2, and show how the results given there can be recalculated with the `rpact`

function `getSimulationEnrichmentRates()`

.

The study under consideration is the **I**nvestigation of **S**erial Studies to **P**redict **Y**our **T**herapeutic **R**esponse with **I**maging **A**nd mo**L**ecular **A**nalysis (I-SPY 2 TRIAL) which is an ongoing clinical trial in patients with high-risk primary breast cancer. It involves a randomized phase II screening process in which a series of experimental drugs are evaluated in combination with standard neoadjuvant chemotherapy which is given prior to surgery. The primary endpoint is pathologic complete response (pCR) at the time of surgery (for details, see Barker et al., 2009).

The screening process includes a Magnetic Resonance Imaging to establish tumor size at baseline and a biopsy to identify the tumor’s hormone-receptor status (HR) and the HER2/neu status (HER2). Triple negative breast cancer (TNBC) refers to breast cancer that does not express the genes for the estrogen receptor, the progesterone receptor, and HER2.

Assume that one of the experimental drugs has been identified from I-SPY 2 TRIAL with the biomarker signature of TNBC but also with some promising effect in the HER2 negative (HER2-) biomarker signature. The sponsor may consider a confirmatory Phase III trial in TNBC patients only. The prevalence of TNBC, however, is only 34%, while the prevalence of HER2- is 63%. Therefore, an alternative option is to run a confirmatory trial with a two-stage enrichment design starting with the HER2- patients as the full population, but with the preplanned option of selecting the TNBC patients after the first stage if the observed effect is not promising in the HER2- patients with positive hormone-receptor status (HR+).

If a pCR rate in the control arm of 0.34 and a treatment effect of 0.2 (measured as the difference in pCR rates between the new drug and control) is assumed, the required total sample size for a conventional two-arm test with power 90% and one-sided significance level 0.0125 (i.e., applying Bonferroni correction) is 294. This can be found with `rpact`

using the command

```
getSampleSizeRates(
alpha = 0.0125,
beta = 0.1,
pi1 = 0.5,
pi2 = 0.3)$nFixed |>
ceiling()
```

`[1] 294`

It will serve as a first guess for the actually needed sample size and we illustrate the enrichment design for this study assuming that a total sample size of 300 subjects will be enrolled in the trial.

The interim analysis is planned after 150 subjects and no early stopping is intended. The design under consideration therefore will be

```
designIN <- getDesignInverseNormal(
kMax = 2,
typeOfDesign = "noEarlyEfficacy"
)
```

A subpopulation selection using the -selection rule with will be considered, i.e., if the difference in pCR is smaller than 0.1, both populations are selected and the decision at the interim analysis will be to either select the TNBC subpopulation or going on with the full population of HER2- patients. If the observed treatment effect difference exceeds 0.1 in favor of the TNBC population, the TNBC subpopulation will be selected, otherwise no selection will be considered. In this case, the test for the full population only will be conducted if the observed treatment effect difference exceeds 0.1 in favor of the F population; otherwise the test for both populations will be conducted. The inverse normal combination testing strategy together with the Simes intersection test will be used. We use the Simes test here to avoid futility stops that are possibly due to the use of the Bonferroni correction.

In I-SPY 1 TRIAL, a prevalence of TNBC patients in the HER2- population of about 54% (%) and a control pCR rate in TNBC patients of 0.34 was observed. The pCR rate in the HER2- patients with HR+ hormone receptor is 0.23. The operating characteristics of the enrichment design are investigated for treatment effect differences ranging from 0 to 0.3 by an increment of 0.05 in the TNBC subpopulation and ranging from 0 to 0.2 by an increment of 0.10 in the HER2- patients with HR+ hormone receptor. This yields 21 different scenarios for the effect sizes that are defined in the parameter list `effectList`

as follows.

```
effectList <- list(
subGroups = c("R", "S"),
prevalences = c(0.46, 0.54),
piControl = c(0.23, 0.34),
piTreatments = expand.grid(
seq(0, 0.2, 0.1) + 0.23,
seq(0, 0.3, 0.05) + 0.34
)
)
effectList
```

```
$subGroups
[1] "R" "S"
$prevalences
[1] 0.46 0.54
$piControl
[1] 0.23 0.34
$piTreatments
Var1 Var2
1 0.23 0.34
2 0.33 0.34
3 0.43 0.34
4 0.23 0.39
5 0.33 0.39
6 0.43 0.39
7 0.23 0.44
8 0.33 0.44
9 0.43 0.44
10 0.23 0.49
11 0.33 0.49
12 0.43 0.49
13 0.23 0.54
14 0.33 0.54
15 0.43 0.54
16 0.23 0.59
17 0.33 0.59
18 0.43 0.59
19 0.23 0.64
20 0.33 0.64
21 0.43 0.64
```

We see that different situations as specified per row of `piTreatments`

can be considered. For the above definition, using the specified prevalences, this yields the following 21 treatment effect differences for the full population (see Table 11.3 in Wassmer & Brannath, 2016):

```
diffEffectsF <- effectList$piTreatments -
matrix(rep(effectList$piControl, 21), ncol = 2, byrow = TRUE)
(diffEffectsF * matrix(rep(effectList$prevalences, 21), ncol = 2, byrow = TRUE)) |>
rowSums()
```

```
[1] 0.000 0.046 0.092 0.027 0.073 0.119 0.054 0.100 0.146 0.081 0.127 0.173
[13] 0.108 0.154 0.200 0.135 0.181 0.227 0.162 0.208 0.254
```

For the moment, we want to consider only one situation, namely

```
effectList <- list(
subGroups = c("R", "S"),
prevalences = c(0.46, 0.54),
piControl = c(0.23, 0.34),
piTreatments = c(0.43, 0.54)
)
```

That is, we consider the same effect difference (= 0.20) in each of the two subset, S and R. The design characteristics of 10,000 simulations are generated and summarized as follows:

```
getSimulationEnrichmentRates(
design = designIN,
plannedSubjects = c(150, 300),
effectList = effectList,
stratifiedAnalysis = TRUE,
intersectionTest = "Simes",
typeOfSelection = "epsilon",
epsilonValue = 0.1,
seed = 12345,
maxNumberOfIterations = 10000
) |>
summary()
```

*Simulation of a binary endpoint (enrichment design)*

Sequential analysis with a maximum of 2 looks (inverse normal combination test design), overall significance level 2.5% (one-sided). The results were simulated for a population enrichment comparisons for rates (treatment vs. control, 2 populations), H0: pi(treatment) - pi(control) = 0, power directed towards larger values, H1: assumed treatment rate pi(treatment) = c(0.54, 0.43), subgroups = c(S, R), prevalences = c(0.54, 0.46), control rates pi(control) = c(0.34, 0.23), planned cumulative sample size = c(150, 300), intersection test = Simes, selection = epsilon rule, eps = 0.1, effect measure based on effect estimate, success criterion: all, simulation runs = 500, seed = 12345.

Stage | 1 | 2 |
---|---|---|

Fixed weight | 0.707 | 0.707 |

Efficacy boundary (z-value scale) | Inf | 1.960 |

Stage levels (one-sided) | 0 | 0.0250 |

Reject at least one | 0.9160 | |

Rejected populations per stage | ||

Subset S | 0 | 0.7360 |

Full population F | 0 | 0.8360 |

Success per stage | 0 | 0.8020 |

Expected number of subjects under H1 | 300.0 | |

Overall exit probability | 0 | |

Stagewise number of subjects | ||

Subset S | 81.0 | 86.5 |

Remaining population R | 69.0 | 63.5 |

Selected populations | ||

Subset S | 1.0000 | 0.9280 |

Full population F | 1.0000 | 0.9200 |

Number of populations | 2.000 | 1.848 |

Conditional power (achieved) | 0.7847 |

We see in `Reject at least one`

that the power requirement is more or less exactly fulfilled and that there is a considerably higher chance to reject F. By default, `successCriterion = "all"`

and therefore rejecting both populations, in this situation, has a probability of around 76%.

The design characteristics for the whole number of situations and 10,000 simulations per scenario are generated as follows:

```
effectList <- list(
subGroups = c("R", "S"),
prevalences = c(0.46, 0.54),
piControl = c(0.23, 0.34),
piTreatments = expand.grid(
seq(0, 0.2, 0.1) + 0.23,
seq(0, 0.3, 0.05) + 0.34
)
)
simResultsPE <- designIN |>
getSimulationEnrichmentRates(
plannedSubjects = c(150, 300),
effectList = effectList,
stratifiedAnalysis = TRUE,
intersectionTest = "Simes",
typeOfSelection = "epsilon",
epsilonValue = 0.1,
seed = 12345,
maxNumberOfIterations = 10000
)
```

The results from Table 11.3 in Wassmer & Brannath (2016) are obtained from `simResultsPE`

as follows:

`simResultsPE$rejectAtLeastOne |> round(3)`

```
[1] 0.018 0.076 0.220 0.082 0.146 0.406 0.228 0.414 0.598 0.450 0.536 0.758
[13] 0.712 0.784 0.872 0.876 0.908 0.968 0.964 0.976 0.994
```

`simResultsPE$rejectedPopulationsPerStage[2, , ] |> round(3)`

```
[,1] [,2]
[1,] 0.014 0.016
[2,] 0.012 0.076
[3,] 0.024 0.218
[4,] 0.064 0.050
[5,] 0.082 0.126
[6,] 0.092 0.402
[7,] 0.202 0.108
[8,] 0.258 0.364
[9,] 0.232 0.584
[10,] 0.428 0.172
[11,] 0.434 0.386
[12,] 0.430 0.708
[13,] 0.698 0.212
[14,] 0.706 0.588
[15,] 0.658 0.766
[16,] 0.870 0.276
[17,] 0.862 0.568
[18,] 0.850 0.834
[19,] 0.962 0.226
[20,] 0.968 0.484
[21,] 0.954 0.768
```

`simResultsPE$selectedPopulations[2, , ] |> round(3)`

```
[,1] [,2]
[1,] 0.934 0.948
[2,] 0.782 0.988
[3,] 0.598 0.992
[4,] 0.974 0.872
[5,] 0.862 0.956
[6,] 0.666 0.994
[7,] 0.994 0.792
[8,] 0.930 0.930
[9,] 0.810 0.980
[10,] 0.992 0.666
[11,] 0.972 0.830
[12,] 0.852 0.946
[13,] 0.996 0.520
[14,] 0.980 0.808
[15,] 0.930 0.894
[16,] 0.998 0.424
[17,] 0.986 0.674
[18,] 0.950 0.868
[19,] 1.000 0.278
[20,] 0.998 0.522
[21,] 0.978 0.774
```

The probability to select one population is obtained from `numberOfPopulations`

through

P(select one population) + 2 * (1 - P(select one population) = numberOfPopulations:

`(2 - simResultsPE$numberOfPopulations[2, ]) |> round(3)`

```
[1] 0.118 0.230 0.410 0.154 0.182 0.340 0.214 0.140 0.210 0.342 0.198 0.202
[13] 0.484 0.212 0.176 0.578 0.340 0.182 0.722 0.480 0.248
```

The power of the design (the probability to reject at least one null hypothesis) is greater than 90% for scenarios 17-21, mainly corresponding to treatment effects 0.25 and 0.3 for . Hence, in these cases a total sample size of 300 patients reaches the desired power, and the rough estimate provided through the use of the Bonferroni correction provides a good estimate for the necessary sample size. Note that the term power is used here also for the cases where the null hypothesis is true (Scenario 1 - 3). This, however, illustrates *strong* control of FWER (see the first three values of `rejectedPopulationsPerStage[2, ,1]`

). Note that in the records the first column refers to S, whereas the second refers to F.

The results also show that the power to reject in the full population (except for effect size 0.2 and in both subsets, i.e., scenarios 15 and 18) is smaller than 80%, for largest effect sizes the power even decreases a bit. The latter is due to the fact that in this case the probability to deselect and to select increases. For most scenarios the probability to reduce the confirmatory proof to one hypothesis, or , is quite small.

The case for enrichment, i.e., the selection of at the interim stage, varies between 1% and 70% over the scenarios and can be derived from P(Select ) = `selectedPopulations[2, ,2]`

: P(Enrichment) = 1 - P(Select ). The question arises if this might reduce power (defined as above) due to wrongly selecting a population. The answer is no, as illustrated in the figure below. Here it is shown that for all effect sizes in there is no decrease, for effect size 0 there is even a clear increase in power showing the advantage of an adaptive enrichment design as compared to the non-adaptive case. In the figure, the dashed lines refer to the cases with selection and the solid lines to the non-adaptive case.

```
simResultsFixed <- getSimulationEnrichmentRates(
design = designIN,
plannedSubjects = c(150, 300),
effectList = effectList,
stratifiedAnalysis = TRUE,
intersectionTest = "Simes",
typeOfSelection = "all",
seed = 12345,
maxNumberOfIterations = 10000
)
dataAll <- rbind(
simResultsFixed |> as.data.frame(),
simResultsPE |> as.data.frame()
)
dataAll$effectS <- effectList$piTreatments$Var2 - effectList$piControl[2]
dataAll$effectNotS <- (effectList$piTreatments$Var1 - effectList$piControl[1]) |>
round(1)
dataStage2 <- dataAll |>
filter(stages == 2)
plotPowerDiff <- function(effectR) {
dataSub <- dataStage2 |>
filter(effectNotS == effectR)
ggplot(
dataSub,
aes(
x = effectS, y = rejectAtLeastOne,
group = typeOfSelection,
linetype = typeOfSelection
)
) +
geom_line(linewidth = 0.8, show.legend = FALSE) +
ylim(0, 1) +
theme_classic() +
xlab("Effect S") +
ylab("Power") +
ggtitle(paste0("Effect R = ", effectR))
}
plot1 <- plotPowerDiff(effectR = 0)
plot2 <- plotPowerDiff(effectR = 0.1)
plot3 <- plotPowerDiff(effectR = 0.2)
gridExtra::grid.arrange(plot1, plot2, plot3, ncol = 3)
```

Note that the definition of `subGroups`

in the list determines how many and which type of subpopulations is considered for the clinical trial situation. In `rpact`

, up to four subpopulations can be considered:

For , `subGroups`

with (fixed) names “R”,“S1”, “S2”, and “S12” have to be specified, i.e.,

`subGroups = c("R", "S1", "S2", "S12")`

and `prevalences`

, `piControl`

and `piTreatments`

consist of four elements per situation.

For , `subGroups`

with (fixed) names “R”,“S1”, “S2”, “S2”, “S12”, “S13” “S23” and “S123” have to be specified, i.e.,

`subGroups = c("R","S1", "S2", "S3", "S12", "S13", "S23", "S123")`

and `prevalences`

, `piControl`

and `piTreatments`

consist of eight elements per situation.

If the endpoint is binary, for the calculation of the test statistics and related computations for the test procedure the number of events and sample sizes have to be given. In `rpact`

, these can be simply specified through *separate* data sets for the distinct subsets, S and R. For example, assume the results for the first and the second stage in the two subsets are as follows:

```
S <- getDataSet(
events1 = c(11, 12),
events2 = c(6, 7),
n1 = c(36, 39),
n2 = c(38, 40)
)
R <- getDataSet(
events1 = c(12, 10),
events2 = c(8, 8),
n1 = c(32, 33),
n2 = c(31, 29)
)
F <- getDataSet(
events1 = c(23, 22),
events2 = c(14, 15),
n1 = c(68, 72),
n2 = c(69, 69)
)
```

The whole data set and the analysis with Simes’ test is then obtained with

```
designIN |>
getDataSet(S1 = S, R = R) |>
getAnalysisResults(intersectionTest = "Simes") |>
summary()
```

*Enrichment analysis results for a binary endpoint (2 populations)*

Sequential analysis with 2 looks (inverse normal combination test design). The results were calculated using a two-sample test for rates (one-sided, alpha = 0.025), Simes intersection test, normal approximation test, stratified analysis. H0: pi(treatment) - pi(control) = 0 against H1: pi(treatment) - pi(control) > 0.

Stage | 1 | 2 |
---|---|---|

Fixed weight | 0.707 | 0.707 |

Efficacy boundary (z-value scale) | Inf | 1.960 |

Cumulative alpha spent | 0 | 0.0250 |

Stage level | 0 | 0.0250 |

Cumulative effect size S1 | 0.148 | 0.140 |

Cumulative effect size F | 0.135 | 0.111 |

Cumulative treatment rate S1 | 0.306 | 0.307 |

Cumulative treatment rate F | 0.338 | 0.321 |

Cumulative control rate | 0.158 | 0.203 |

Stage-wise test statistic S1 | 1.509 | 1.380 |

Stage-wise test statistic F | 1.768 | 1.167 |

Stage-wise p-value S1 | 0.0656 | 0.0838 |

Stage-wise p-value F | 0.0385 | 0.1217 |

Adjusted stage-wise p-value S1, F | 0.0656 | 0.1217 |

Adjusted stage-wise p-value S1 | 0.0656 | 0.0838 |

Adjusted stage-wise p-value F | 0.0385 | 0.1217 |

Overall adjusted test statistic S1, F | 1.509 | 1.892 |

Overall adjusted test statistic S1 | 1.509 | 2.043 |

Overall adjusted test statistic F | 1.768 | 2.075 |

Test action: reject S1 | FALSE | FALSE |

Test action: reject F | FALSE | FALSE |

Conditional rejection probability S1 | 0.1034 | |

Conditional rejection probability F | 0.1034 | |

95% repeated confidence interval S1 | [-0.029; 0.306] | |

95% repeated confidence interval F | [-0.021; 0.238] | |

Repeated p-value S1 | 0.0292 | |

Repeated p-value F | 0.0292 |

Legend:

*F*: full population*S[i]*: population i

From the first stage, it is seen that the effect differences between S and R, and hence between S and F, are quite similar. Following the -selection rule with , and since `11/36 - 6/38`

= 0.148 and `23/68 - 14/69`

= 0.135, no enrichment was performed and from both subsets observations were made. From the observation from both stages, however, none of the hypotheses can be rejected. You can check as an exercise that the use of the Spiessens and Debois test yields an only negligibly smaller overall -value ( = 0.0288) that does not reach significance either. However, the Bonferroni test is considerably worse yielding an overall -value of 0.0456.

As an alternative, one might consider the case of enrichment, i.e., to decide at interim to proceed with the trial with the subset S only. The sample size for the second stage can be chosen in a data-driven manner. Because of

```
designIN |>
getDataSet(S1 = S, R = R) |>
getAnalysisResults(
stage = 1,
nPlanned = 150,
intersectionTest = "Simes"
) |>
fetch(conditionalPower)
```

```
$conditionalPower
[,1] [,2]
[1,] NA 0.8144225
[2,] NA 0.7290993
```

the conditional power when sticking to the originally planned sample size is large enough (> 80%) for the rejection of the hypothesis for subset S and hence it might be decided to recruit patients from S only. Assume that the following observations were made:

```
S <- getDataSet(
events1 = c(11, 46),
events2 = c(6, 27),
n1 = c(36, 151),
n2 = c(38, 148)
)
R <- getDataSet(
events1 = c(12, NA),
events2 = c(8, NA),
n1 = c(32, NA),
n2 = c(31, NA)
)
```

This is the result showing clear significance of an effect in S although the observed effect is virtually the same as before:

```
designIN |>
getDataSet(S1 = S, R = R) |>
getAnalysisResults(intersectionTest = "Simes") |>
summary()
```

*Enrichment analysis results for a binary endpoint (2 populations)*

Sequential analysis with 2 looks (inverse normal combination test design). The results were calculated using a two-sample test for rates (one-sided, alpha = 0.025), Simes intersection test, normal approximation test, stratified analysis. H0: pi(treatment) - pi(control) = 0 against H1: pi(treatment) - pi(control) > 0.

Stage | 1 | 2 |
---|---|---|

Fixed weight | 0.707 | 0.707 |

Efficacy boundary (z-value scale) | Inf | 1.960 |

Cumulative alpha spent | 0 | 0.0250 |

Stage level | 0 | 0.0250 |

Cumulative effect size S1 | 0.148 | 0.127 |

Cumulative effect size F | 0.135 | |

Cumulative treatment rate S1 | 0.306 | 0.305 |

Cumulative treatment rate F | 0.338 | |

Cumulative control rate | 0.158 | 0.203 |

Stage-wise test statistic S1 | 1.509 | 2.459 |

Stage-wise test statistic F | 1.768 | |

Stage-wise p-value S1 | 0.0656 | 0.0070 |

Stage-wise p-value F | 0.0385 | |

Adjusted stage-wise p-value S1, F | 0.0656 | 0.0070 |

Adjusted stage-wise p-value S1 | 0.0656 | 0.0070 |

Adjusted stage-wise p-value F | 0.0385 | |

Overall adjusted test statistic S1, F | 1.509 | 2.806 |

Overall adjusted test statistic S1 | 1.509 | 2.806 |

Overall adjusted test statistic F | 1.768 | |

Test action: reject S1 | FALSE | TRUE |

Test action: reject F | FALSE | FALSE |

Conditional rejection probability S1 | 0.1034 | |

Conditional rejection probability F | 0.1034 | |

95% repeated confidence interval S1 | [0.025; 0.239] | |

95% repeated confidence interval F | ||

Repeated p-value S1 | 0.0025 | |

Repeated p-value F |

Legend:

*F*: full population*S[i]*: population i

Barker, A., Sigman, C., Kelloff, G., Hylton, N., Berry, D., Esserman, L. (2009). I–SPY 2: An adaptive breast cancer trial design in the setting of neoadjuvant chemotherapy. *Clinical Pharmacology and Therapeutics* 86, 97–100. https://doi.org/10.1038/clpt.2009.68

Wassmer, G and Brannath, W. *Group Sequential and Confirmatory Adaptive Designs in Clinical Trials* (2016), ISBN 978-3319325606 https://doi.org/10.1007/978-3-319-32562-0

Wassmer, G., Dragalin, V. (2015). Designing issues in confirmatory adaptive population enrichment trials. *Journal of Biopharmaceutical Statistics* 25, 651–669. https://doi.org/10.1080/10543406.2014.920869

*System* rpact 4.0.0, R version 4.3.3 (2024-02-29 ucrt), *platform* x86_64-w64-mingw32

To cite R in publications use:

*R: A Language and Environment for Statistical Computing*. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/. To cite package ‘rpact’ in publications use:

*rpact: Confirmatory Adaptive Clinical Trial Design and Analysis*. R package version 4.0.0, https://www.rpact.com, https://github.com/rpact-com/rpact, https://rpact-com.github.io/rpact/, https://www.rpact.org.

This document provides an exemplary implementation of multi-arm-multi-stage (MAMS) design with binary endpoint using rpact. Concretely, the vignette consists of considering design implementation with respect to futility bounds on treatment effect scale, sample size calculations and simulations given different treatment selection approaches. Further, analysis using closed testing will be performed on generic data binary and survival data.

Note that since rpact itself doesn’t support landmark analysis, i.e. comparison of survival probabilities at a fixed point in time, using the Greenwoods standard error (SE) formula. Thus the first two analyses are based on the empirical event rates only. The R packages gestate and survival are then utilized to briefly show how one could merge the packages to perform the actually intended analysis using boundaries obtained by rpact and test statistics obtained by survival probabilities and standard errors estimated using gestate and survival.

For methodological and theoretical background, refer to “Group Sequential and Confirmatory Adaptive Designs in Clinical Trials” by Gernot Wassmer and Werner Brannath.

Before starting, load the rpact package and make sure the version of the package is at least 3.1.0:

```
library(rpact)
packageVersion("rpact")
```

`[1] '4.0.0'`

Consider we are interested in implementing a group sequential design with stages, treatment arms ( active, control), treatment arm comparisons (each active arm vs. the placebo arm), a binary endpoint, with a global one-sided , and the power for each treatment arm comparison set to , hence . Additionally, let the critical boundaries be calculated using alpha-spending O’Brien & Fleming approach and for planning purposes, suppose equally distributed information rates. Further, non-binding futility bounds are to be set in the following way:

Considering the active treatment arms, the goal of both arms is to confirm significant reduction in event (e.g. disease, death) rates as compared to a control arm. Thus, futility should appear exactly when there is only low reduction, no reduction or even increase seen in the data. Assume one designs a study where a treatment arm is futile in the first stage when there has only been a relative reduction of % (i.e. rate ratio of ) and in the second stage when data indicates a relative reduction of % only (i.e. rate ratio of ). For simplification, futility bounds are assumed to be independent. As rpact primarily uses futility bounds on -scale, one needs to determined the -scale value that correspond with the values listed above. Advantageously, the futility bounds on treatment effect scale are part of sample size calculation output. Therefore, rpact allows to determine the right -value by basically examining various options for -scale futility as input and decide on which values result in the needed treatment effect futility bounds. This can exemplarily be done by trying different -scale futility bounds as input up until the sample size calculation command output (given certain treatment effect assumptions) indicates that the input corresponds with the desired treatment effect futility bound.

As an example on how to get the corresponding futility bound values on the different scales, see the example below:

First, one needs to initialize the design and basically take arbitrary futility bounds on -scale as input:

```
# first and second stage futility on z-scale
fut1 <- 0.16
fut2 <- 0.39
d_fut <- getDesignGroupSequential(
kMax = 3,
alpha = 0.025,
beta = 0.2,
sided = 1,
typeOfDesign = "asOF",
informationRates = c(1 / 3, 2 / 3, 1),
futilityBounds = c(fut1, fut2),
bindingFutility = FALSE
)
```

Now, as mentioned, the sample size calculation output provides information about the futility boundaries on the treatment effect scale. Therefore, after assuming certain treatment effects, one needs to perform sample size calculation and extract the treatment effect futility bound from the respective output:

```
c_assum <- 0.1 # assumed rate in control
effect_assum <- 0.5 # relative reduction that is to be detected with probability of 0.8
# rates indicate binary endpoint
ssc_fut <- getSampleSizeRates(
design = d_fut,
riskRatio = TRUE,
pi1 = c_assum * (1 - effect_assum),
pi2 = c_assum
)
ssc_fut$futilityBoundsEffectScale
```

```
[,1]
[1,] 0.9464954
[2,] 0.9085874
```

The values printed out above are now the futility bounds on treatment effect scale that do correspond with the values *fut1* and *fut2* from above. Since and , the input value *fut1=* seems to be slightly to large while *fut2=* seems to be slightly to low, while this indicates how to adjust the input values such that one comes closer to the the true corresponding values.

Now, using a search algorithm or by simply trying different values, one results in futility bounds on -scale summarized in the following table created using knitr:

First stage | Second stage | |
---|---|---|

Orig. treatment effect scale | 0.950 | 0.900 |

Approx. z-Value for given t-Value | 0.149 | 0.414 |

Corresponding t-scale Value | 0.950 | 0.900 |

Diff. between Approx. and Input | 0.000 | 0.000 |

In the table above, the *Orig. treatment effect scale* is the intended futility bound on treatment effect scale, *Approx. -value* denotes the corresponding -scale value approximation and *Corresponding -scale Value* the actual -scale value using the calculated approx. -values as input.. One can see that the first stage futility bound on -scale should approximately be chosen as and the second stage -scale futility bound as , resulting in the desired futility bound on treatment effect scale of approximately 0.95 in first and 0.9 in second stage, respectively. In the next chapter containing the sample size calculations, the respective outputs validate that these -scale futility bounds are a reasonable choice.

Now that (approximate) futility bounds on -scale are known, the above specified design can be entirely initialized, *kMax=3* indicates a study design of three stages.

```
# GSD with futility bounds according to above calculations
d <- getDesignGroupSequential(
kMax = 3,
alpha = 0.025,
beta = 0.2,
sided = 1,
typeOfDesign = "asOF",
informationRates = c(1 / 3, 2 / 3, 1),
futilityBounds = c(0.149145, 0.41381),
bindingFutility = FALSE
)
kable(summary(d))
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

asOF | 3 | 1 | 0.3333333 | 0.025 | 0.2 | FALSE | 0.149145 | FALSE | 1 | 0 | 0.0001035 | none | 3.710303 | 0.0001035 |

asOF | 3 | 2 | 0.6666667 | 0.025 | 0.2 | FALSE | 0.413810 | FALSE | 1 | 0 | 0.0060484 | none | 2.511427 | 0.0060122 |

asOF | 3 | 3 | 1.0000000 | 0.025 | 0.2 | FALSE | NA | FALSE | 1 | 0 | 0.0250000 | none | 1.993048 | 0.0231281 |

Simply printing out the defined object gives a nice overview of all relevant design parameters. Note that the adjusted of the last stage () is slightly lower that the predefined global , corresponding to a critical values of being slightly larger than , which is due to alpha-spending O’Brien Fleming adjustment. Further, the output basically provides an overview of the input parameters.

When it comes to sample size calculations for designs with binary endpoints, rpact provides the command `getSampleSizeRates()`

. It should be noted upfront that the sample size calculation in rpact always only refers to only one treatment arm comparison.

The sample size calculations code applicable here has already been indirectly used to properly determine the -scale futility boundaries whenever futility boundaries are only given on treatment effect scale. However, this sections’ purpose now is to provide some more detail on sample size calculations in MAMS-designs with binary endpoint.

Suppose the treatment effect under is or there is even a rate increase in active treatment group, meaning the difference , with representing the assumed event rate in the treatment group and being the assumed event rate in the reference/control group. Again, let the parameter setting be given as above and assume an expected relative reduction of event occurrence of given (hence ). Since we are interested in directly comparing risks, we set `riskRatio`

to `TRUE`

which particularly results in testing against . The sample size per stage for one treatment arm comparison can then be calculated using the commands:

```
c_rate <- 0.1 # assumed rate in control
effect <- 0.5 # relative reduction that is to be detected with probability of 0.8
# rates indicate binary endpoint
d_sample <- getSampleSizeRates(
design = d,
riskRatio = TRUE,
pi1 = c_rate * (1 - effect),
pi2 = c_rate
)
kable(summary(d_sample))
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.05 | TRUE | 1 | TRUE | 0.1 | 2 | 1 | FALSE | 941.2693 | 470.6347 | 470.6347 | 313.7564 | 0.0213435 | 0.0732217 | 0.0624659 | 0.5203644 | 532.8838 | 732.4763 | 751.7059 | 0.0610249 | 0.9500457 | 0.4407196 |

2 | 0.05 | TRUE | 1 | TRUE | 0.1 | 2 | 1 | FALSE | 941.2693 | 470.6347 | 470.6347 | 627.5129 | 0.4257992 | 0.0732217 | 0.0107558 | 0.5203644 | 532.8838 | 732.4763 | 751.7059 | 0.4758044 | 0.9030566 | 0.3395066 |

3 | 0.05 | TRUE | 1 | TRUE | 0.1 | 2 | 1 | FALSE | 941.2693 | 470.6347 | 470.6347 | 941.2693 | 0.3528573 | 0.0732217 | NA | 0.5203644 | 532.8838 | 732.4763 | 751.7059 | 0.6432153 | NA | NA |

The variable *Futility boundary (t)* represents the futility bounds transferred to treatment effect scale. Therefore, one can see that the predefined bounds calculated above correspond with a rate ratio of at the first stage (i.e. relative reduction of ) and a rate ratio of at the second (i.e. relative reduction of approx. ), being actually pretty close to the intended ones. The slight differences to the results above, again, are due to assumed independence (i.e. futility bound calculations for the different stages are done on an independent basis). However, even using this simplification, results are equivalent with a precision of digits or more.

Another important information provided by the output is that to achieve the desired study characteristics, such as keeping the -error controlled at while obtaining power of , approx. study subjects are needed per arm per stage given equally spread information rates. Thus, one results in a maximum of subjects in case one assumes a design with treatment arms, being active and being a control . Multiplications by are then due to having stages and active treatment arms compared to control arm. It should be noted here that this is just an approximation of the sample size needed to achieve a power of , while power here is defined as the probability of successfully detecting the assumed effect. Simulations in following chapters will indicate that this rather rough approximate sample size is actually rather conservative since the studies appear to be overpowered in simulations, while the term power has a different meaning here: when considering simulations of studies with multiple active treatment arms, power is referred to as the probability to claim success of at least one active treatment arm in the study.

The row *Exit probability for futility* indicates the rather low probability of early stopping due to futility (. stage: , . stage: ), while this depends on the assumed treatment effects and the defined boundaries. *Efficacy boundary (t)*, just as *Efficacy boundary (z-value scale)* shows that an enormous decrease in events needs to be detected in the treatment group in order to reject in the first stage (rate ratio of , i.e. relative reduction of ) which, again reflects the “conservatism” of alpha-spending O’Brien Fleming approach in early stages, however resulting in rather liberal and monotonically decreasing boundaries along the stages.

```
# boundary plots
par(mar = c(4, 4, .1, .1))
plot(d_sample, main = paste0(
"Boundaries against stage-wise ",
"cumul. sample sizes - 1"
), type = 1)
plot(d_sample, main = paste0(
"Boundaries against stage-wise ",
"cumul. sample sizes - 2"
), type = 2)
```

Plotting the cumulative sample size against the applicable boundaries is a way of nicely visualizing these important study characteristics. On the left plot, one can see a plot having the boundaries on the y-axis on -scale.. The dashed line represents the critical value of a fixed study design with one-sided testing and (, denoting the cumulative distribution function of the standardnormal distribution, the corresponding quantile function). The right plot basically contains the same information, while the y-axis is now given on treatment effect scale. The red line represents the efficacy bound which needs to be crossed to obtain statistical significance at the applicable stage, the blue line represents the futility bounds. Note that the plot again indicates that low risk ratio values are desirable.

Performing simulations prior to conducting the study is often reasonable since they allow for evaluating characteristics, such as power (please refer to definition above), assuming different scenarios or constellations of treatment effects. Thus, simulations enable to get a more holistic view on how and where the planned study could go in case the assumptions appear to be (approximately) true.

Simulation of the study design will be done using the function `getSimulationMultiArmRates()`

, which needs an Inverse Normal Design as input, similarly defined, with the add-on of containing information about how the analysis in the simulations should be performed:

```
# design as above, just as inverse normal
d_IN <- getDesignInverseNormal(
kMax = 3,
alpha = 0.025,
beta = 0.2,
sided = 1,
typeOfDesign = "asOF",
informationRates = c(1 / 3, 2 / 3, 1),
futilityBounds = c(0.149145, 0.41380800),
bindingFutility = FALSE
)
```

To perform study simulations for instance for power or probability-of-success evaluation, one need to assume (different) effect rates in the active treatment arms, which need to be defined in a matrix-object. For binary data, die `effectMatrix`

refers to the actual event rate in each arm, but not to the the difference in event rates between control and active treatment arms. Further, number of iterations needs to be defined in advance.

```
# set number of iterations to be used in simulation
maxNumberOfIterations <- 100 # 10000
# specify the scenarios, nrow: number of scenarios, ncol: number of treatment arms
effectMatrix <- matrix(c(0.100, 0.100, 0.05, 0.05, 0.055, 0.045),
byrow = TRUE, nrow = 3, ncol = 2
)
# first column: first treatment arm, second column: second treatment arm
show(effectMatrix)
```

```
[,1] [,2]
[1,] 0.100 0.100
[2,] 0.050 0.050
[3,] 0.055 0.045
```

Considering a design with active treatment arms, both to be compared against a control arm, the effect matrix contains treatment effect in the first active arm in the first column and the effect in the second active arm in the second column respectively.

Next step would be to actually perform the simulation. Several variables need to be initialized in order to individualize the simulation according to the specific needs. Choosing `typeOfShape`

as `userDefined`

refers to the above defined effect matrix, however, one could also assume a *linear* relationship and initialize a vector of maximal assumed effects (`piMaxVector`

) in treatment groups. When some functional form of dose response curve are used, it should be noted that the `piMaxVector`

may be used as the maximum treatment effect on the whole dose response curve instead of observed dose range, which could be misleading. Therefore it is suggested to use the `userDefined`

shape. `directionUpper`

is set to `FALSE`

since low obtained rates of events do correspond with a better clinical outcome and reducing the rate is considered beneficial to the subjects. Intersection test to be performed is *Simes* , while other options would be given by e.g. *Bonferroni* or *Dunnett*, with *Dunnett* being the default setting. `typeOfSelection`

is initially set to `rbest`

with `rValue`

being 2, which means that at each stage, the best treatment arms should be carried forward. `successCriterion = "all"`

means that to stop the study early due to efficacy, both active treatment arms need to be tested significantly at interim analysis and another option would be to set `successCriterion = "atLeastOne"`

to declare significant testing of one treatment arm to be sufficient for study success at interim. The vector `plannedSubjects`

contains the cumulative per arm per stage sample sizes previously calculated.

```
# first simulation
simulation <- getSimulationMultiArmRates(
design = d_IN,
activeArms = 2,
effectMatrix = effectMatrix,
typeOfShape = "userDefined",
piControl = 0.1,
intersectionTest = "Simes",
directionUpper = FALSE,
typeOfSelection = "rBest",
rValue = 2,
effectMeasure = "testStatistic",
successCriterion = "all",
plannedSubjects = c(157, 314, 471),
allocationRatioPlanned = 1,
maxNumberOfIterations = maxNumberOfIterations,
seed = 145873,
showStatistics = TRUE
)
kable(summary(simulation))
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 100 | 145873 | 1 | FALSE | 157 | 2 | 1 | userDefined | 0.100 | 0.1 | NA | NA | 1 | Simes | rBest | testStatistic | all | NA | 2 | -Inf | 100 | 0.02 | 0.86 | 0.63 | 0.63 | 0.00 | 2 | 711.21 | NA |

1 | 100 | 145873 | 1 | FALSE | 314 | 2 | 2 | userDefined | 0.050 | 0.1 | NA | NA | 1 | Simes | rBest | testStatistic | all | NA | 2 | -Inf | 100 | 0.79 | 0.07 | 0.06 | 0.06 | 0.00 | 2 | 1229.31 | NA |

1 | 100 | 145873 | 1 | FALSE | 471 | 2 | 3 | userDefined | 0.045 | 0.1 | NA | NA | 1 | Simes | rBest | testStatistic | all | NA | 2 | -Inf | 100 | 0.92 | 0.04 | 0.04 | 0.04 | 0.00 | 2 | 1243.44 | NA |

2 | 100 | 145873 | 1 | FALSE | 157 | 2 | 1 | userDefined | 0.100 | 0.1 | NA | NA | 1 | Simes | rBest | testStatistic | all | NA | 2 | -Inf | 37 | 0.02 | 0.86 | 0.23 | 0.23 | 0.00 | 2 | 711.21 | 0.0614097 |

2 | 100 | 145873 | 1 | FALSE | 314 | 2 | 2 | userDefined | 0.050 | 0.1 | NA | NA | 1 | Simes | rBest | testStatistic | all | NA | 2 | -Inf | 94 | 0.79 | 0.07 | 0.01 | 0.27 | 0.26 | 2 | 1229.31 | 0.3844948 |

2 | 100 | 145873 | 1 | FALSE | 471 | 2 | 3 | userDefined | 0.045 | 0.1 | NA | NA | 1 | Simes | rBest | testStatistic | all | NA | 2 | -Inf | 96 | 0.92 | 0.04 | 0.00 | 0.28 | 0.28 | 2 | 1243.44 | 0.4097861 |

3 | 100 | 145873 | 1 | FALSE | 157 | 2 | 1 | userDefined | 0.100 | 0.1 | NA | NA | 1 | Simes | rBest | testStatistic | all | NA | 2 | -Inf | 14 | 0.02 | 0.86 | NA | NA | 0.00 | 2 | 711.21 | 0.1061101 |

3 | 100 | 145873 | 1 | FALSE | 314 | 2 | 2 | userDefined | 0.050 | 0.1 | NA | NA | 1 | Simes | rBest | testStatistic | all | NA | 2 | -Inf | 67 | 0.79 | 0.07 | NA | NA | 0.38 | 2 | 1229.31 | 0.5979780 |

3 | 100 | 145873 | 1 | FALSE | 471 | 2 | 3 | userDefined | 0.045 | 0.1 | NA | NA | 1 | Simes | rBest | testStatistic | all | NA | 2 | -Inf | 68 | 0.92 | 0.04 | NA | NA | 0.37 | 2 | 1243.44 | 0.7030861 |

In this output, the different input situations are indicated through while .

Under with assumed treatment effect equal to the assumed effect in the control group (scenario 1), probability of rejecting at least one is low (i.e. committing type I error, *Reject at least one [1]* ), while rejection probability (under alternative power) is high especially assuming high treatment effects in both treatment arms (under alternative, *Reject at least one [2]* for scenario ). Note also that, in any case, both treatment arms are selected as `rValue=2`

means that the two best treatment arms (i.e. all arms in this case) are selected, regardless of if the one or both treatment arms meet the applicable futility bounds. Further, the simulation output contains several information. That is, e.g. stage-wise number of subjects are calculated and probabilities for arms to be selected in each stage are provided for each assumed effect. Additionally, expected sample size, which is commonly used as optimality criterion of designs, stopping due to futility-probabilities and conditional power, defined as the probability of having a statistically significant result given the data observed thus far, are listed. Note that, since probability of futility stop is rather high in scenario (first stage: , second stage: ), the corresponding *expected number of subjects* is approximately , lying far below the expected sample sizes in the other scenarios.

Now, suppose the underlying treatment arm selection scheme is different and especially not straightforwardly covered by the other available and pre-defined options in rpact (i.e. `best`

, `rBest`

, `epsilon`

, `all`

). Then, rpact accounts for that by allowing users to implement a user defined treatment arm selection function thus being capable of covering various different selection approaches. The user needs to set `typeOfSelection`

to `userDefined`

and define a function used as input for the `selectArmsFunction`

-argument.

Say e.g. a treatment arm should be selected if and only if it doesn’t meet the futility bound, and if it does, function should be specified such that this arm is deselected at the applicable stage. See the different treatment arm selection approaches illustrated for the first stage depending on the different potential outcomes in the figures below, while the first diagram illustrates the procedure using `rBest`

and the second using the `userDefined`

scheme here, while the numbers represent the following potential first stage analysis results:

- Both active Treament Arms significant
- 1 active Treatment Arm significant, 1 active Treatment Arm non-significant
- 1 active Treatment Arm futile, 1 active Treatment Arm significant
- Both active Treatment Arms non-significant
- 1 active Treatment Arm futile, 1 active Treatment Arm non-significant
- Both active Treatment Arm futile

rBest, rValue=2

This first diagram graphically represent how treatment selection proceeds choosing *typeOfSelection=rBest* with *rValue=2* in case of the study design previously defined. Early study success can be obtained whenever both treatment arms can be tested significantly at interim, continuation of study with only non-significant (meaning neither stopped for efficacy, not for futility) active treatment arms is to be done whenever one of the active arms is discontinued due to efficacy and if none of the arms can be tested significantly at interim, the study continues with both treatment arms as even arms crossing the futility bounds are to be carried forward using *rBest, rValue=2*.

userDefined

This second diagram illustrates the active treatment arm selection that is to be implemented through the user defined selection function. Efficacy stop occurs when both active treatment arms are tested significantly at interim, continuation with non-significant arm only happens only when one active arm is significant while the other arm is not, but neither futile. The study also terminates early with success of one arm when one active arm is futile while the other can be deemed superior to control, continuation with non-futile arms only is done when both arms are non-significant or one of them is non-significant while the other one is futile. Lastly, futility stop can also happen at interim.

Now that the futility bounds are defined in the design specification already one could believe that selection rule is only an add-on to the futility bound. However, this is not true since specifying selection rule overwrites the futility bound as continuation-criterion for treatment arms. Thus, even if futility bounds are pre-specified, to implement stopping due to futility in application of `getSimulationMultiArmRates()`

(but also `getSimulationMultiArmMeans()`

, `getSimulationMultiArmSurvival()`

), one could use these as input in the customized selection function. In this case, having the previously defined futility bounds of at the first interim and at the second respectively, there is a need to individualize the selection scheme along the different stages. It is important to note here that `effectMeasure = "testStatistic"`

needs to be set as the futility bounds have intentionally been transformed to and calculated on -scale. Now, rpact enables implementation of this selection scheme by allowing *stage* as an argument in the selection function, additional to `effectVector`

:

```
# first row: first stage futility bounds, second stage: second stage futility bound
futility_bounds <- matrix(c(d_IN$futilityBounds, d_IN$futilityBounds), nrow = 2)
# selection function
selection <- function(effectVector, stage) {
# if stage==1, compare to first stage fut. bounds,
# if stage==2, compare to second stage fut. bounds
selectedArms <- switch(stage,
(effectVector >= futility_bounds[1, ]),
(effectVector >= futility_bounds[2, ])
)
return(selectedArms)
}
simulation <- getSimulationMultiArmRates(
design = d_IN,
activeArms = 2,
effectMatrix = effectMatrix,
typeOfShape = "userDefined",
piControl = 0.1,
intersectionTest = "Simes",
directionUpper = FALSE,
typeOfSelection = "userDefined",
selectArmsFunction = selection,
effectMeasure = "testStatistic",
successCriterion = "all",
plannedSubjects = c(157, 314, 471),
allocationRatioPlanned = 1,
maxNumberOfIterations = maxNumberOfIterations,
seed = 145873,
showStatistics = TRUE
)
kable(summary(simulation))
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 100 | 145873 | 1 | FALSE | 157 | 2 | 1 | userDefined | 0.100 | 0.1 | NA | NA | 1 | Simes | userDefined | testStatistic | all | 100 | 0.02 | 0.89 | 0.62 | 0.62 | 0.00 | 2.000000 | 657.83 | NA |

1 | 100 | 145873 | 1 | FALSE | 314 | 2 | 2 | userDefined | 0.050 | 0.1 | NA | NA | 1 | Simes | userDefined | testStatistic | all | 100 | 0.82 | 0.12 | 0.05 | 0.05 | 0.00 | 2.000000 | 1120.98 | NA |

1 | 100 | 145873 | 1 | FALSE | 471 | 2 | 3 | userDefined | 0.045 | 0.1 | NA | NA | 1 | Simes | userDefined | testStatistic | all | 100 | 0.81 | 0.15 | 0.07 | 0.08 | 0.01 | 2.000000 | 1106.85 | NA |

2 | 100 | 145873 | 1 | FALSE | 157 | 2 | 1 | userDefined | 0.100 | 0.1 | NA | NA | 1 | Simes | userDefined | testStatistic | all | 38 | 0.02 | 0.89 | 0.27 | 0.27 | 0.00 | 1.552632 | 657.83 | 0.0627247 |

2 | 100 | 145873 | 1 | FALSE | 314 | 2 | 2 | userDefined | 0.050 | 0.1 | NA | NA | 1 | Simes | userDefined | testStatistic | all | 95 | 0.82 | 0.12 | 0.07 | 0.41 | 0.34 | 1.863158 | 1120.98 | 0.4521689 |

2 | 100 | 145873 | 1 | FALSE | 471 | 2 | 3 | userDefined | 0.045 | 0.1 | NA | NA | 1 | Simes | userDefined | testStatistic | all | 92 | 0.81 | 0.15 | 0.08 | 0.36 | 0.28 | 1.771739 | 1106.85 | 0.4165268 |

3 | 100 | 145873 | 1 | FALSE | 157 | 2 | 1 | userDefined | 0.100 | 0.1 | NA | NA | 1 | Simes | userDefined | testStatistic | all | 11 | 0.02 | 0.89 | NA | NA | 0.02 | 1.000000 | 657.83 | 0.5144470 |

3 | 100 | 145873 | 1 | FALSE | 314 | 2 | 2 | userDefined | 0.050 | 0.1 | NA | NA | 1 | Simes | userDefined | testStatistic | all | 54 | 0.82 | 0.12 | NA | NA | 0.45 | 1.629630 | 1120.98 | 0.7123201 |

3 | 100 | 145873 | 1 | FALSE | 471 | 2 | 3 | userDefined | 0.045 | 0.1 | NA | NA | 1 | Simes | userDefined | testStatistic | all | 56 | 0.81 | 0.15 | NA | NA | 0.42 | 1.678571 | 1106.85 | 0.7538598 |

Depending on what stage the simulation is currently iterating through, the `switch()`

in the treatment selection function recognizes the stage changes and adapts the applicable futility bound by switching rows in the predefined futility bound matrix.

Again, assuming to be true, the probability of falsely rejecting it is , thus this simulation again indicates type I error rate control at . The power assuming a reduction is given through . Since now, stopping of treatment arms is determined by crossing the futility bounds or not, the probability of having futility in the first stage increases to in the first and to in the second stage under as compared to the first simulation. Having a different treatment selection scheme also result in lower expected number of subjects and, in opposite to the first simulation, the number of arms per scenario lies starting from stage since this treatment selection allows for early discontinuation of study arms whereas in the first simulation, arms are to be carried forward regardless of potential futility.

In both simulations, with initially calculated 157 subjects per arm per stage and a relative event rate reduction of as alternative, one can see that the study is overpowered (since *Reject at least one [j]*>0.8 in scenario ), which is due to the considered alternative having approximately the same effect in the two active treatment arms. For the second simulation, performing simulations through different potentially optimal sample sizes (in a sense that being the smallest sample size needed to achieve power) indicate that one could save some subjects and still achieve the desired power. The following plot shows the minimum sample size that one could choose to achieve the power requirements, which appears to be about , however, one should keep in mind that this number might slightly be deviating from an analytically optimal solution due to deviations introduced by the simulation:

Further, it should be noted here that even if one explicitly has under the null hypothesis, testing the hypothesis means that whenever with , one has a scenario where the null hypothesis is true. Consequently, one should properly check if the type I error rate is controlled not only assuming , but also considering the other scenarios when . For the second simulation, considering only the cases where rate equality holds, following plot obtained by simulation indicate that, given the study configurations here, the type I error rate is controlled under various control event rate assumptions:

In the next chapters, different hypothetical binary endpoint datasets are generated and analyzed using the `getAnalysisResults()`

command. As previously mentioned, rpact itself doesn’t support landmark analysis using the Greenwoods standard error (SE) formula. Thus the first two analyses base on the empirical event rates only. Afterwards, gestate and survival are then used to show how one could merge the packages to perform the intended analysis using boundaries obtained by rpact and test statistics obtained by survival probabilities and standard errors estimated using gestate and survival.

In the first stage analysis, one first has to manually enter a dataset for data observed in the trial:

```
genData_1 <- getDataset(
events1 = 4,
events2 = 8,
events3 = 16,
sampleSizes1 = 153,
sampleSizes2 = 157,
sampleSizes3 = 156
)
kable(summary(genData_1))
```

object | NA | NA | NA | NA | NA |
---|---|---|---|---|---|

1 | 1 | 153 | 4 | 153 | 4 |

1 | 2 | 157 | 8 | 157 | 8 |

1 | 3 | 156 | 16 | 156 | 16 |

This dataset is a generic realization of data from the first stage in a design with binary endpoint under the input assumption of having a rate reduction in the treatment groups given . The highest index () corresponds with the events occurred in the control group and with the underlying sample sizes, respectively. The other indices represent active treatment groups. Data here is chosen such that one can see lower event rates in the active treatment groups. Note the slight imbalances in sample sizes, which might occur due to dropouts or recruitment issues.

Actual analysis of the first stage using *Simes* as intersection test goes as follows:

```
results_1 <- getAnalysisResults(
design = d_IN,
dataInput = genData_1,
directionUpper = FALSE,
intersectionTest = "Simes"
)
```

`kable(summary(results_1))`

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|

1 | FALSE | Simes | TRUE | 0.0261438 | 0.1025641 | 0.2906945 | NA | -0.2122927 | 0.0429712 | 0.1150139 |

1 | FALSE | Simes | TRUE | 0.0509554 | 0.1025641 | 0.1203974 | NA | -0.1913594 | 0.0787411 | 0.2428861 |

2 | FALSE | Simes | TRUE | 0.0261438 | 0.1025641 | NA | NA | NA | NA | NA |

2 | FALSE | Simes | TRUE | 0.0509554 | 0.1025641 | NA | NA | NA | NA | NA |

3 | FALSE | Simes | TRUE | 0.0261438 | 0.1025641 | NA | NA | NA | NA | NA |

3 | FALSE | Simes | TRUE | 0.0509554 | 0.1025641 | NA | NA | NA | NA | NA |

Although having obviously lower event rates in the first stage already, none of the hypotheses can be rejected, which can manually be tested comparing e.g. the overall adjusted test statistics for the global intersection hypothesis of having no effect in neither one of the active treatment arms to the efficacy bound and see that , i.e. no rejection of global intersection. Neither stop due to futility appears since no futility boundary is crossed at stage , which means that both treatment arms are to be carried forward to a second stage analysis. For obtaining non-significance in stage , one could also compare repeated p-values to full .

Proceeding to second stage, a generic dataset might look as follows, while the second vector entries represent second stage data:

```
# assuming there was no futility or efficacy stop study proceeds to randomize subjects
genData_2 <- getDataset(
events1 = c(4, 7),
events2 = c(8, 7),
events3 = c(16, 15),
sampleSizes1 = c(153, 155),
sampleSizes2 = c(157, 155),
sampleSizes3 = c(156, 155)
)
kable(summary(genData_2))
```

object | NA | NA | NA | NA | NA |
---|---|---|---|---|---|

1 | 1 | 153 | 4 | 153 | 4 |

1 | 2 | 157 | 8 | 157 | 8 |

1 | 3 | 156 | 16 | 156 | 16 |

2 | 1 | 155 | 7 | 308 | 11 |

2 | 2 | 155 | 7 | 312 | 15 |

2 | 3 | 155 | 15 | 311 | 31 |

Here, again, event rates are lower in treatment groups.

```
results_2 <- getAnalysisResults(
design = d_IN,
dataInput = genData_2,
directionUpper = FALSE,
intersectionTest = "Simes"
)
```

`kable(summary(results_2))`

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|

1 | FALSE | Simes | TRUE | 0.0357143 | 0.0996785 | 0.2906945 | NA | -0.2122927 | 0.0429712 | 0.1150139 |

1 | FALSE | Simes | TRUE | 0.0480769 | 0.0996785 | 0.1203974 | NA | -0.1913594 | 0.0787411 | 0.2428861 |

2 | FALSE | Simes | TRUE | 0.0357143 | 0.0996785 | 0.7911121 | NA | -0.1301334 | -0.0051606 | 0.0086050 |

2 | FALSE | Simes | TRUE | 0.0480769 | 0.0996785 | 0.5132570 | NA | -0.1187833 | 0.0106252 | 0.0274124 |

3 | FALSE | Simes | TRUE | 0.0357143 | 0.0996785 | NA | NA | NA | NA | NA |

3 | FALSE | Simes | TRUE | 0.0480769 | 0.0996785 | NA | NA | NA | NA | NA |

Performing the same comparisons as in stage , one can see: The global null is to be rejected (). Subsequently, the hypothesis for the first active treatment is rejected, since indicating that the first active arm performs better than control. The same result can be obtained by recognizing that the *repeated p-value (1)* of falls below . The consequence is that this arm can be discontinued early due to efficacy. However, since the second treatment arm is not tested significant () and early stopping for efficacy is true only when all active treatments appear to be significant, study will continue up to the the final stage with the second treatment arm only. Again, no futility stop occurs. It should be noted here that the *adjusted stage-wise p-values* cannot directly be used for testing, but the values do only base on the second stage data. Since an inverse normal combination test is performed with weights given by here, the first and second stage p-values can be used to calculate the overall adjusted test statistics, exemplarily .

Third-stage-dataset:

```
genData_3 <- getDataset(
events1 = c(4, 7, NA),
events2 = c(8, 7, 6),
events3 = c(16, 15, 16),
sampleSizes1 = c(153, 155, NA),
sampleSizes2 = c(157, 155, 156),
sampleSizes3 = c(156, 155, 160)
)
kable(summary(genData_3))
```

object | NA | NA | NA | NA | NA |
---|---|---|---|---|---|

1 | 1 | 153 | 4 | 153 | 4 |

1 | 2 | 157 | 8 | 157 | 8 |

1 | 3 | 156 | 16 | 156 | 16 |

2 | 1 | 155 | 7 | 308 | 11 |

2 | 2 | 155 | 7 | 312 | 15 |

2 | 3 | 155 | 15 | 311 | 31 |

3 | 1 | NA | NA | NA | NA |

3 | 2 | 156 | 6 | 468 | 21 |

3 | 3 | 160 | 16 | 471 | 47 |

Final analysis:

```
results_3 <- getAnalysisResults(
design = d_IN,
dataInput = genData_3,
directionUpper = FALSE,
intersectionTest = "Simes"
)
```

`kable(summary(results_3))`

```
Warning in is.na(parameterValues): is.na() auf Nicht-(Liste oder Vektor) des
Typs 'environment' angewendet
```

object | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|

1 | FALSE | Simes | TRUE | NA | 0.0997877 |