Why HDF5?

A typical fixel-based analysis might involve 600,000 fixels across 1,000 subjects. Stored naively as a dense double-precision matrix, that is 600,000 × 1,000 × 8 bytes ≈ 4.8 GB (about 4.5 GiB) — too large to hold in memory on most systems, let alone alongside the R session overhead and model fitting.
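The arithmetic behind that estimate, as a quick sanity check in R:

```r
# Size of a dense double-precision matrix: elements x subjects x 8 bytes each
n_elements <- 600000
n_subjects <- 1000
bytes <- n_elements * n_subjects * 8   # 4.8e9 bytes
gib   <- bytes / 2^30                  # binary gigabytes (GiB)
round(gib, 2)                          # ~4.47 GiB, i.e. the ~4.5 GB quoted above
```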

HDF5 solves this by keeping data on disk and providing efficient random access to slices of the data. ModelArray leverages this to fit models element-by-element without ever loading the full matrix into RAM.

Chunking

HDF5 files store data in chunks — rectangular blocks of the array that are read and written as a unit. When you access a single row (element), HDF5 only reads the chunk(s) that contain that row, not the entire dataset.

The chunk layout matters for performance. Because ModelArray reads data row by row (one element at a time), chunks that span a modest number of rows and all columns work best. This is the default behavior of the ModelArrayIO conversion commands (confixel, convoxel, concifti).

When creating HDF5 files with the confixel command from ModelArrayIO, you can control the chunk size with the --chunkmb flag:

$ confixel \
    --index-file FDC/index.mif \
    --directions-file FDC/directions.mif \
    --cohort-file cohort.csv \
    --relative-root /path/to/project \
    --output-hdf5 data.h5 \
    --chunkmb 4       # chunk size in MB (default)

Smaller chunks use less memory per read but may require more disk seeks. The default of 4 MB is a reasonable trade-off for most datasets.
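To see why this default is reasonable, consider how many rows a single chunk covers (assuming, as in the conversion commands' default layout, that chunks span all columns):

```r
# With 1,000 subjects of 8-byte doubles, one row of the matrix is 8 KB,
# so a 4 MB chunk spanning all columns covers ~524 rows (elements):
# reading any one element pulls in ~524 neighbors in a single disk read.
n_subjects     <- 1000
bytes_per_row  <- n_subjects * 8
chunk_bytes    <- 4 * 2^20             # --chunkmb 4
rows_per_chunk <- floor(chunk_bytes / bytes_per_row)
rows_per_chunk                         # 524
```

With a sequential row-by-row access pattern, each 4 MB read thus amortizes over hundreds of elements, which is why much smaller chunks mainly add seek overhead.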

Compression

HDF5 supports transparent compression of chunks using gzip. When enabled, data is compressed on disk and decompressed on-the-fly when read. This typically reduces file sizes by 30–60% with minimal impact on read speed.

ModelArrayIO enables gzip compression by default. You can see this reflected in the output filenames:

study-HBN_compression-gzip_chunkmb-4_thickness.h5

The compression is transparent to ModelArray — it reads the data the same way regardless of whether it’s compressed.
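As a toy illustration of what gzip does — using base R's memCompress rather than HDF5 itself — repetitive or smoothly varying data (common in neuroimaging scalar maps) compresses well, while pure noise barely compresses at all:

```r
set.seed(42)
# Highly repetitive bytes vs. random bytes, 10,000 each
smooth <- as.raw(rep(1:100, each = 100))
noisy  <- as.raw(sample(0:255, 10000, replace = TRUE))

length(memCompress(smooth, type = "gzip")) / length(smooth)  # small ratio
length(memCompress(noisy,  type = "gzip")) / length(noisy)   # close to 1
```

The 30–60% figure for real datasets falls between these two extremes.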

Memory efficiency during model fitting

When you call ModelArray.lm(), ModelArray.gam(), or ModelArray.wrap(), here is what happens for each element:

  1. Read one row from the scalar matrix (the values for all subjects at this element)
  2. Combine with the phenotypes data frame to create a per-element data frame
  3. Fit the model (lm, gam, or your custom function)
  4. Extract statistics into a single row of the output data frame
  5. Discard the per-element data and move to the next element

At no point is more than one row of the scalar matrix in memory. This is why ModelArray can handle arbitrarily large datasets — the memory footprint is determined by the number of subjects (columns), not the number of elements (rows).
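The loop above can be sketched in a few lines of base R. Here an ordinary in-memory matrix stands in for the HDF5-backed scalar matrix, and fit_one is a hypothetical helper, not part of the ModelArray API:

```r
set.seed(1)
n_elements <- 5; n_subjects <- 20
scalars    <- matrix(rnorm(n_elements * n_subjects), nrow = n_elements)
phenotypes <- data.frame(Age = rnorm(n_subjects, mean = 12, sd = 3))

fit_one <- function(i) {
  dat <- cbind(phenotypes, value = scalars[i, ])   # steps 1-2: one row + phenotypes
  fit <- lm(value ~ Age, data = dat)               # step 3: fit the model
  data.frame(element  = i,                         # step 4: extract statistics
             estimate = coef(fit)[["Age"]],
             p.value  = summary(fit)$coefficients["Age", "Pr(>|t|)"])
}                                                  # step 5: dat and fit are discarded on return

result <- do.call(rbind, lapply(seq_len(n_elements), fit_one))
```

Only the current row's small data frame and the accumulated statistics (one row per element) ever live in memory; in ModelArray itself, step 1 is a chunked read from the HDF5 file rather than a matrix subset.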

Parallelism within a single process

ModelArray supports parallel processing via the n_cores parameter:

result <- ModelArray.lm(FDC ~ Age + sex, modelarray, phenotypes, "FDC",
  n_cores = 4
)

This uses parallel::mclapply() (fork-based parallelism on Linux/macOS) to process multiple elements simultaneously. Each forked worker inherits the same file handle and reads different rows from the HDF5 file. Because HDF5 supports concurrent reads from the same file, this works safely and scales well.
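A minimal sketch of that pattern with parallel::mclapply — not ModelArray's actual internals — where each independent task stands in for "read row i and fit a model":

```r
library(parallel)

# Forking is only available on unix-alikes; fall back to serial elsewhere.
n_cores <- if (.Platform$OS.type == "unix") 2L else 1L

# Each task is independent of the others, so tasks can be distributed
# across forked workers in any order; mclapply preserves result order.
fit_element <- function(i) i^2
results <- mclapply(1:100, fit_element, mc.cores = n_cores)
length(results)   # 100: one result per element
```

Because the tasks share no writable state — each reads its own rows and returns its own statistics — no locking is needed on the read side.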

Why concurrent writes don’t work

While multiple processes can safely read from the same HDF5 file, they cannot safely write to it simultaneously. The standard HDF5 library (without special MPI-IO compilation) does not support concurrent writes — doing so can corrupt the file.

This means you should never have multiple R processes calling writeResults() on the same HDF5 file at the same time. If you split your analysis across HPC jobs (see vignette("element-splitting")), each job should save its partial results to a separate .rds file, then a single final step combines them and writes to the HDF5 file.

# WRONG: multiple jobs writing to the same H5 simultaneously
# writeResults("data.h5", df.output = my_partial_result, ...)  # Don't do this in parallel!

# RIGHT: save partial results, then combine in one process
saveRDS(my_partial_result, sprintf("result_chunk_%d.rds", job_id))

# ... after all jobs finish, in a single process:
chunks <- lapply(Sys.glob("result_chunk_*.rds"), readRDS)
full_result <- do.call(rbind, chunks)
writeResults("data.h5", df.output = full_result, analysis_name = "results_lm")

Backing up your HDF5 file

Because writeResults() modifies the HDF5 file in place, it’s good practice to back up your file before writing results:

$ cp data.h5 data_backup.h5

If something goes wrong during a write (e.g., the process is killed mid-write), the HDF5 file may be left in an inconsistent state. Having a backup lets you recover without re-running the conversion step.