Modelling in Depth: lm, gam, and wrap
modelling.RmdThis vignette covers the three modelling functions in ModelArray in
detail. If you haven’t already, start with
vignette("walkthrough") for the end-to-end workflow, and
vignette("elements") for background on what an element
is.
library(ModelArray)
# Using the PNC fixel demo data (see walkthrough for download instructions)
h5_path <- "~/Desktop/myProject/demo_FDC_n100.h5"
modelarray <- ModelArray(h5_path, scalar_types = c("FDC"))
phenotypes <- read.csv("~/Desktop/myProject/cohort_FDC_n100.csv")Linear models with ModelArray.lm()
Basic usage
ModelArray.lm() fits lm() independently at
every element. Define a formula with the scalar name as the response
variable:
formula.lm <- FDC ~ Age + sex + dti64MeanRelRMS
result <- ModelArray.lm(formula.lm, modelarray, phenotypes, "FDC",
element.subset = 1:100
)Output columns
By default, for each term (Intercept, Age, sex, etc.) the output includes:
| Column | Description |
|---|---|
<term>.estimate |
Coefficient estimate (slope) |
<term>.statistic |
t-statistic |
<term>.p.value |
Uncorrected p-value |
<term>.p.value.fdr |
FDR-corrected p-value |
And for the overall model:
| Column | Description |
|---|---|
model.adj.r.squared |
Adjusted R-squared |
model.p.value |
Model F-test p-value |
model.p.value.fdr |
FDR-corrected model p-value |
More comprehensive output
Use full.outputs = TRUE to get additional
statistics:
result_full <- ModelArray.lm(formula.lm, modelarray, phenotypes, "FDC",
element.subset = 1:100,
full.outputs = TRUE
)P-value correction
By default, FDR correction is applied. You can request additional correction methods:
result <- ModelArray.lm(formula.lm, modelarray, phenotypes, "FDC",
element.subset = 1:100,
correct.p.value.terms = c("fdr", "bonferroni")
)This adds <term>.p.value.bonferroni columns
alongside the FDR-corrected ones.
Generalized Additive Models with ModelArray.gam()
GAMs are useful when you expect nonlinear relationships - for example, nonlinear changes in brain structure across the lifespan.
Basic usage
Use s() to specify smooth terms. The k
parameter sets an upper limit on the degrees of freedom for the smooth
function:
formula.gam <- FDC ~ s(Age, k = 4) + sex + dti64MeanRelRMS
result <- ModelArray.gam(formula.gam, modelarray, phenotypes, "FDC",
element.subset = 1:100,
method = "REML"
)ModelArray prints a summary of the smooth term settings when you run this:
The formula requested: FDC ~ s(Age, k = 4) + sex + dti64MeanRelRMS
s(Age): k = 4; fx = FALSE (default); bs = tp (default)
method = REML (default: GCV.Cp)
Output columns
GAM output differs from lm for smooth terms vs parametric terms:
Smooth terms (e.g., s(Age)):
| Column | Description |
|---|---|
s_Age.statistic |
F-statistic for the smooth |
s_Age.p.value |
Uncorrected p-value |
s_Age.p.value.fdr |
FDR-corrected p-value |
Parametric terms (e.g., sex,
dti64MeanRelRMS):
| Column | Description |
|---|---|
<term>.estimate |
Coefficient estimate |
<term>.statistic |
t-statistic |
<term>.p.value |
Uncorrected p-value |
Model-level:
| Column | Description |
|---|---|
model.dev.expl |
Proportion of null deviance explained |
P-value correction for GAMs
Smooth terms and parametric terms have separate correction arguments:
result <- ModelArray.gam(formula.gam, modelarray, phenotypes, "FDC",
element.subset = 1:100,
correct.p.value.smoothTerms = c("fdr", "bonferroni"),
correct.p.value.parametricTerms = c("fdr", "bonferroni"),
method = "REML"
)Smooth term options
ModelArray supports the full range of mgcv::gam() smooth
terms:
-
s(): penalized regression splines (default thin-plate,bs = "tp") -
te(),ti(),t2(): tensor product smooths for interactions between continuous variables
The fx parameter controls whether the smooth is
penalized (fx = FALSE, the default) or unpenalized/fixed
(fx = TRUE). The bs parameter selects the
basis type.
GAM formula helpers
ModelArray provides two helper functions for common GAM formula patterns:
Factor-smooth interaction
(gen_gamFormula_fxSmooth): Generates
y ~ orderedFactor + s(x) + s(x, by=orderedFactor):
result <- gen_gamFormula_fxSmooth(
response.var = "FDC",
factor.var = "group",
smooth.var = "Age",
phenotypes = phenotypes,
reference.group = "control",
fx = TRUE, k = 4
)
formula <- result$formula
phenotypes <- result$phenotypes # may contain new ordered factor columnContinuous interaction
(gen_gamFormula_contIx): Generates
y ~ ti(x) + ti(z) + ti(x, z):
formula <- gen_gamFormula_contIx(
response.var = "FDC",
cont1.var = "Age",
cont2.var = "BMI",
fx = TRUE, k = 4
)Custom functions with ModelArray.wrap()
ModelArray.wrap() lets you run any R function across all
elements, using ModelArray’s efficient looping and parallelization
machinery.
Prototyping with exampleElementData()
Before writing your function, use exampleElementData()
to see what a per-element data frame looks like:
# Using the HBN cortical thickness data
h5_path <- "path/to/thickness.h5"
modelarray <- ModelArray(h5_path, scalar_types = c("thickness"))
phenotypes <- read.csv("path/to/phenotypes.csv")
example_df <- exampleElementData(modelarray,
scalar = "thickness", i_element = 12345, phenotypes = phenotypes
)
names(example_df)This returns a data frame with all the phenotype columns plus a new column for the scalar value at that element. You can explore it like any data frame:
Writing your function
Your function must accept a single argument named data
and return a single-row data frame (or tibble):
my_site_analysis <- function(data) {
# Compute mean thickness per site
site_means <- data %>%
dplyr::group_by(study_site) %>%
dplyr::summarize(mean_thickness = mean(thickness, na.rm = TRUE)) %>%
tidyr::pivot_wider(
names_from = study_site,
values_from = mean_thickness,
names_glue = "thickness_{study_site}"
)
# Run ANOVA on site
site_anova <- lm(thickness ~ study_site, data = data) %>%
anova() %>%
broom::tidy() %>%
dplyr::filter(term != "Residuals") %>%
dplyr::select(term, statistic, p.value) %>%
tidyr::pivot_wider(
names_from = term,
values_from = c(statistic, p.value),
names_glue = "{term}.{.value}"
)
dplyr::bind_cols(site_means, site_anova)
}Testing on a single element
Use analyseOneElement.wrap() to verify your function
works in ModelArray’s machinery:
analyseOneElement.wrap(
12345, my_site_analysis, modelarray, phenotypes, "thickness",
num.subj.lthr = 10, flag_initiate = TRUE, on_error = "debug"
)Running across all elements
result <- ModelArray.wrap(
FUN = my_site_analysis,
data = modelarray,
phenotypes = phenotypes,
scalar = "thickness",
n_cores = 8
)
head(result)
writeResults(h5_path, df.output = result, analysis_name = "site_analysis")Case study: harmonize with covfam and stream to a new
scalar
You can also use ModelArray.wrap() as a transformation
engine rather than a test runner. The example below applies a
harmonization step per element, streams the harmonized values directly
into /scalars, and then loads that harmonized scalar in a
new ModelArray.
library(covfam)
# Assume phenotypes has harmonization variables, e.g.:
# source_file, site, age, sex
# and that "thickness" is already present in the h5 as an input scalar.
h5_in <- "thickness_raw.h5"
h5_harmonized <- "thickness_harmonized.h5"
ma_raw <- ModelArray(h5_in, scalar_types = "thickness")
phenotypes <- read.csv("phenotypes.csv")
# Copy metadata/layout into a new file so we can append new scalars there.
file.copy(h5_in, h5_harmonized, overwrite = TRUE)
covfam_harmonize_element <- function(data) {
# Harmonize one element across subjects.
# Adjust arguments here to match your covfam configuration.
out <- covfam::covfam(
y = data$thickness,
batch = data$site,
mod = data.frame(age = data$age, sex = data$sex)
)
# Return named subject-level vector so wrap columns map to source_file order.
vals <- out$y_harmonized
names(vals) <- data$source_file
vals
}
# Stream harmonized subject-level values into scalars/thickness_covfam/values.
# Set return_output = FALSE to avoid keeping the full output table in memory.
ModelArray.wrap(
FUN = covfam_harmonize_element,
data = ma_raw,
phenotypes = phenotypes,
scalar = "thickness",
n_cores = 8,
write_scalar_name = "thickness_covfam",
write_scalar_file = h5_harmonized,
write_scalar_flush_every = 2000L,
return_output = FALSE
)
# Load the harmonized scalar as a new modelling input
ma_harmonized <- ModelArray(h5_harmonized, scalar_types = c("thickness_covfam"))
# Use harmonized data in downstream models
fit_harmonized <- ModelArray.lm(
thickness_covfam ~ age + sex,
data = ma_harmonized,
phenotypes = phenotypes,
scalar = "thickness_covfam",
n_cores = 8
)If you want to stream model statistics too (not only transformed
scalars), use write_results_name and
write_results_file in ModelArray.lm(),
ModelArray.gam(), or ModelArray.wrap().
Quick QA: export one harmonized image to inspect
After writing thickness_covfam into
/scalars, you can export one row back to an image and open
it in your usual viewer as a spot-check. Use --column-index
to select which row from scalars/thickness_covfam/values to
export.
modelarrayio h5-export-nifti-file \
--input-hdf5 thickness_harmonized.h5 \
--scalar-name thickness_covfam \
--column-index 0 \
--group-mask-file group_mask_thickness.nii.gz \
--output-file qa_thickness_covfam_col000.nii.gzIf your workflow uses CIFTI or MIF instead of NIfTI, use the matching
export command: h5-export-cifti-file or
h5-export-mif-file.
Modelling across multiple h5 files with
mergeModelArrays()
When scalars live in separate h5 files — for example, cortical thickness in one file and curvature in another — you can combine them for cross-scalar modelling without rewriting or merging h5 files on disk.
Setup
Each h5 file gets its own ModelArray and phenotypes data frame. The
phenotypes must share a common identifier column (e.g.,
subject_id):
ma_thick <- ModelArray("thickness.h5", scalar_types = "thickness")
ma_curv <- ModelArray("curv.h5", scalar_types = "curv")
phen_thick <- read.csv("phenotypes_thickness.csv")
# Must contain: subject_id, source_file, (other covariates)
phen_curv <- read.csv("phenotypes_curv.csv")
# Must contain: subject_id, source_fileMerging
merged <- mergeModelArrays(
list(ma_thick, ma_curv),
list(phen_thick, phen_curv),
merge_on = "subject_id"
)This performs an inner join on subject_id, keeping only
subjects present in both files. The result is a list with:
-
merged$data— a combined ModelArray with boththicknessandcurvscalars -
merged$phenotypes— the joined phenotypes, withsource_file.thicknessandsource_file.curvcolumns preserving the original filenames, plus a unifiedsource_filecolumn used internally
Cross-scalar modelling
Now you can use one scalar as a predictor for another:
result <- ModelArray.lm(
thickness ~ curv + Age + sex,
data = merged$data,
phenotypes = merged$phenotypes,
scalar = "thickness"
)This works with ModelArray.gam() and
ModelArray.wrap() too.
N-way merges
mergeModelArrays() accepts any number of
ModelArrays:
merged <- mergeModelArrays(
list(ma_thick, ma_curv, ma_sulc),
list(phen_thick, phen_curv, phen_sulc),
merge_on = "subject_id"
)All scalar names must be unique across the inputs.
Converting results back to image format
After writeResults(), use the appropriate ModelArrayIO
tool to convert results back to the original image format for
visualization:
| Data type | Command | Viewer |
|---|---|---|
| Fixel | fixelstats_write |
MRtrix MRView |
| Voxel | voxelstats_write |
FSLeyes, MRView |
| Surface | ciftistats_write |
Connectome Workbench |
See the ModelArrayIO
documentation for command-line usage details, and
vignette("walkthrough") for a worked example with
MRView.