Understanding the HDF5 File
hdf5-format.RmdWhat is HDF5?
HDF5 (Hierarchical Data Format version 5) is a file format designed for storing large, structured datasets. Think of it as a filesystem within a file:
- Groups act like directories, organizing data hierarchically
- Datasets act like files, holding arrays of data
- Attributes act like metadata, attached to groups or datasets
HDF5 files use the .h5 extension and are widely used in
scientific computing because they support on-disk access (reading data
without loading the entire file into memory), compression, and efficient
chunked storage.
The ModelArray HDF5 layout
ModelArray expects a specific structure inside the HDF5 file. This structure is created by companion tools — ModelArrayIO for fixel data, voxel data, and CIFTI data.
The layout follows this pattern:
/
├── scalars/
│ └── <scalar_name>/ # e.g., "FDC", "FA", "thickness"
│ └── values # Dataset: elements × source files matrix
│ └── column_names # Dataset (text): source filenames
└── results/ # Created by writeResults()
└── <analysis_name>/ # e.g., "results_lm", "results_gam"
├── results_matrix # Dataset: elements × statistics matrix
└── column_names # Dataset (text): statistic column names
The scalars group
/scalars/<scalar_name>/values is the main data
matrix:
- Rows = elements (fixels, voxels, or greyordinates)
- Columns = source files
/scalars/<scalar_name>/column_names stores the
source filenames that correspond to each column, so you can trace every
column back to its subject.
In modern ModelArrayIO, column names are written to a
dedicated text dataset because long name vectors can exceed practical
HDF5 attribute limits.
The results group
/results/<analysis_name>/results_matrix is created
when you call writeResults() to save statistical outputs.
Each analysis gets its own subgroup, so you can store multiple analyses
(e.g., results_lm and results_gam) in the same
file.
The corresponding statistic names are stored in
/results/<analysis_name>/column_names as a text
dataset.
Modern vs legacy column-name storage
-
Preferred (current): text dataset
column_names -
Legacy (older files): attributes on
values/results_matrix(for examplecolumn_namesorcolnames)
If both are present, prefer the column_names dataset as
the canonical source and treat attributes as backward-compatibility
metadata.
How ModelArray reads HDF5
When you create a ModelArray object, the scalar data is not loaded into memory. Instead, ModelArray uses the DelayedArray framework to create a lazy reference to the on-disk data:
library(ModelArray)
modelarray <- ModelArray("data.h5", scalar_types = c("FDC"))
scalars(modelarray)[["FDC"]]<602229 x 100> matrix of class DelayedMatrix and type "double":
The DelayedMatrix holds a pointer to the HDF5 file. Data
is only read from disk when you actually access specific rows or
columns. This is why ModelArray can handle datasets with hundreds of
thousands of elements and thousands of subjects without running out of
memory.
During model fitting, ModelArray reads one element (row) at a time — pulling a single row of ~N subject values, fitting the model, and moving on. At no point is the full matrix loaded into RAM.
Inspecting an HDF5 file
You can explore the structure of any HDF5 file using
rhdf5::h5ls():
rhdf5::h5ls("data.h5") group name otype dclass dim
0 / analysis_configs H5I_GROUP
1 /analysis_configs results_lm H5I_GROUP
2 / results H5I_GROUP
3 /results results_lm H5I_GROUP
4 /results/results_lm results_matrix H5I_DATASET FLOAT 602229 x 17
5 / scalars H5I_GROUP
6 /scalars FDC H5I_GROUP
7 /scalars/FDC values H5I_DATASET FLOAT 602229 x 100
You can also read specific pieces of data directly:
# Preferred: read names from the text dataset
rhdf5::h5read("data.h5", "scalars/FDC/column_names")
# Legacy fallback for older files
rhdf5::h5readAttributes("data.h5", "scalars/FDC/values")$column_names
# Read a small slice of the data matrix
rhdf5::h5read("data.h5", "scalars/FDC/values", index = list(1:5, 1:3))Creating HDF5 files
ModelArray does not create HDF5 files from raw imaging data — that’s the job of the companion conversion tools:
| Data type | Command |
|---|---|
Fixel (.mif) |
confixel |
Voxel (.nii.gz) |
convoxel |
Surface (.dscalar.nii) |
concifti |
Each command reads the source imaging files listed in a cohort CSV and writes them into the HDF5 layout described above. See the ModelArrayIO documentation for detailed usage instructions.