Developer documentation • ModelArray

Target audience: developers, and anyone from the broader community who hopes to contribute to ModelArray.

In this documentation, we walk through important methods used in ModelArray package, and the structure of the package and its functions.

We welcome community contributions to ModelArray. If you hope to improve functions or fix a bug in our package, please follow the following steps:

fork the GitHub repository
modify the scripts
update the tests and/or add necessary new tests (see folder tests)
run tests + update docs and website:
- to make sure the unit tests and package building are without error;
- to make sure the documentations and website can be built, and to update any documentations;
- to do so, run the following commands: devtools::build_readme(), devtools::check(), devtools::build(), devtools::build_vignettes(), and finally pkgdown::build_site(). Please check whether there is no error along the process, and the built website looks good.
if everything looks good, please submit a pull request to our github repository.

Overview

ModelArray R package supports data in Hierarchical Data Format 5 (HDF5) file format. It utilizes DelayedArray and HDF5Array R packages, so that statistical analysis can be performed without loading all the original data in HDF5 file into memory. This makes ModelArray very efficient in memory usage. For details, please see section “ModelArray Construction” below.

Methods for showing and accessing ModelArray object are basic. Please see section “Show and accessors methods of ModelArray class” for more.

The key function of ModelArray package is to perform statistical analysis. For details, please see section “Model fitting” below.

Then we describe how to save statistical results to HDF5 file on disk.

Furthermore, we describe what unit tests are needed for quality assurance.

Finally we describe building the Docker image.

`ModelArray` Construction

We first define a class called ModelArray using setClass(). An ModelArray object (i.e. an instance of ModelArray class) has several slots:

sources: source filenames (e.g., a list of .mif files of fixel-wise data)
scalars: scalar matrix (or matrices)
results: statistical result matrices (if any)
path: the path to the h5 file on disk

The key feature of an ModelArray object is memory efficient. This is because the entire dataset in HDF5 (.h5) file was not loaded into the memory; only minimal data was loaded. To achieve this, we first need ModelArraySeed(), which utilizes HDF5Array::HDF5ArraySeed(), acting as a pointer to the .hdf5 file on disk. To make the arrays in the ModelArray object look more like “real” arrays, e.g. common array operations such as indexing and transposing can be applied, we utilize package DelayedArray to wraps the data in on-disk HDF5 file into a DelayedArray object. Finally a ModelArray class is defined by integrating above slots together. This is done in the ModelArray() function.

Above functions and setups can be found in script R/ModelArray_Constructor.R

Show and accessors methods of `ModelArray` class

To easily show and access a ModelArray object, we adopt S4 Object Oriented Programming (OOP) model.

End users of R actually are using such S4 methods frequently. For example, when showing the (summarized) content of an object, users simply enter its name in R console and press “enter”. Under the hood, it uses the generic show() method. As for showing a ModelArray object, we adopt the same way. As show() has already been a generic function, we only need to setMethod() for its setup.

With S4 OOP model, users can easily access the data they need. Below are current accessor functions for accessing different slots in ModelArray class’s:

sources(): source filenames (e.g., a list of .mif files of fixel-wise data)
scalars(): scalar matrix (or matrices)
results(): statistical result matrix or matrices (if any)

We use setMethod() to define above methods. Unlike show(), these accessors are not generic yet, so we then do additional step of setGeneric().

All of these show and accessors setups can be found in script R/ModelArray_S4Methods.R.

Model fitting

Functions such as ModelArray.lm() and ModelArray.gam() are for model fitting. Under the hood, these functions iteratively call analyseOneElement.lm() and analyseOneElement.gam() for fitting for one element, respectively. This also facilitates the parallel computing across all elements requested.

Besides the existing functions for linear models (ModelArray.lm()) and GAMs (ModelArray.gam()), new functions for diverse statistical models can be added by following the structure below.

The general structure for model fitting functions ModelArray.<model_name>() in ModelArray is as below:

Some sanity checks of arguments
Print out some important methods for model fitting, such as method in mgcv::gam() for GAMs
Fit statistical models:
- First, initialize: run analyseOneElement.<model_name>() once to get the column names of the statistics in the output statistics; usually using an element in the middle (instead of the first or the last element) to avoid too many NaNs in the data;
- Then, iteratively call analyseOneElement.<model_name>() across all elements requested (can be parallelized). The function analyseOneElement.<model_name>() does the following jobs:
  - Fit the model for this element;
  - Get necessary statistics via broom::tidy(), broom::glance(), and/or summary().
  - Flatten the results into one row of data.frame for this element; add element ids; remove the column names (it’s a list of numeric now) to save a bit more memory
  - Return this result of this element
- Concatenate results of all requested elements into a matrix of elements by statistics, then add the column names; now we get the final result data.frame
If p-value correction (for terms and/or model) is requested:
- For each p-value correction method, apply it and add the values to the result data.frame
Return the result data.frame

For ModelArray.gam(), there is an additional step after one round of iteration:

If changed.rsq.term.index is not NULL, i.e. there is request of quantifying one (or more terms)’s importance in the model (i.e. metrics delta adjusted R-squared and partial R-squared), we need to compute the reduced model (formula without the term of interest):
- For each requested term of interest:
  - Iteratively call analyseOneElement.<model_name>() but providing reduced formula
  - Get necessary statistics
  - Compute delta adjusted R-squared and partial R-squared; add them to the previously got data.frame

Therefore, there might be two or more rounds of iterations on model fitting in ModelArray.gam(), depending on number of terms the user requested in changed.rsq.term.index.

Functions mentioned in this section can be found in scripts R/analyse.R and R/ModelArray_Constructor.R.

Saving statistical results to .hdf5 file on disk

After getting the data.frame of statistical results, users will save it into the original HDF5 file on disk. The function is writeResults() located in script R/ModelArray_Constructor.R

Writing tests for CircleCI

We use Continuous Integration (CI) testing to ensure stability and quality assurance of ModelArray. Therefore, when you modify the source code of ModelArray, it is very important to make sure you add appropriate tests of the new or modified functionality.

All the test files are located in folder tests/testthat. Once a commit is pushed to GitHub, CircleCI (a CI testing platform) will be triggered to build the software and run the unit tests.

The tests serve the goals of:

Run functions in different scenarios (with different requests) and expect no error. This is to ensure when new features added, it will run without error; ** For some scenarios, some special data is needed. For example, when testing NAs in input data (for volume data)
Expect the output data.frame will contain column with a specific name, or check the dimensions of the output data.frame. This is to ensure expected columns show up.
Most importantly, check the values are as expected: statistical results calculated in ModelArray fitting loop = those calculated in standard R.

Developers can run the tests locally to see if unit tests in a test file (in tests/testthat folder) can run without errors:

rm(list = ls())
devtools::load_all()

Then click “Run Test” button in RStudio to run the test file. Check if there is anything failed.

For more details on how to write unit tests + test out, please check out Testing chapters in the book “R Packages” written by Hadley Wickham, Jennifer Bryan

Building the Docker image for ModelArray + ConFixel

The Dockerfile for the Docker image can be found in the root folder of the GitHub repository of ModelArray.

CircleCI will automatically build the Docker image and push to Docker Hub when there is a GitHub commit or a merge to main branch, or when a version is tagged. See the .circleci/config.yml file for more details on these setups.