Developer documentation
doc_for_developer.Rmd
Target audience: developers, and anyone from the broader community
who hopes to contribute to ModelArray
.
In this documentation, we walk through important methods used in
ModelArray
package, and the structure of the package and
its functions.
We welcome community contributions to ModelArray
. If you
hope to improve functions or fix a bug in our package, please follow the
following steps:
- fork the GitHub repository
- modify the scripts
- update the tests and/or add necessary new tests (see folder
tests
) - run tests + update docs and website:
- to make sure the unit tests and package building are without error;
- to make sure the documentations and website can be built, and to update any documentations;
- to do so, run the following commands:
devtools::build_readme()
,devtools::check()
,devtools::build()
,devtools::build_vignettes()
, and finallypkgdown::build_site()
. Please check whether there is no error along the process, and the built website looks good.
- if everything looks good, please submit a pull request to our github repository.
Overview
ModelArray
R package supports data in Hierarchical Data
Format 5 (HDF5) file format. It utilizes DelayedArray and HDF5Array R
packages, so that statistical analysis can be performed without loading
all the original data in HDF5 file into memory. This makes
ModelArray
very efficient in memory usage. For details,
please see section “ModelArray Construction” below.
Methods for showing and accessing ModelArray
object are
basic. Please see section “Show and accessors methods of ModelArray
class” for more.
The key function of ModelArray
package is to perform
statistical analysis. For details, please see section “Model fitting”
below.
Then we describe how to save statistical results to HDF5 file on disk.
Furthermore, we describe what unit tests are needed for quality assurance.
Finally we describe building the Docker image.
ModelArray
Construction
We first define a class called ModelArray
using
setClass()
. An ModelArray
object (i.e. an
instance of ModelArray
class) has several slots:
- sources: source filenames (e.g., a list of .mif files of fixel-wise data)
- scalars: scalar matrix (or matrices)
- results: statistical result matrices (if any)
- path: the path to the h5 file on disk
The key feature of an ModelArray
object is memory
efficient. This is because the entire dataset in HDF5 (.h5) file was not
loaded into the memory; only minimal data was loaded. To achieve this,
we first need ModelArraySeed()
, which utilizes
HDF5Array::HDF5ArraySeed()
, acting as a pointer to the
.hdf5 file on disk. To make the arrays in the ModelArray
object look more like “real” arrays, e.g. common array operations such
as indexing and transposing can be applied, we utilize package
DelayedArray
to wraps the data in on-disk HDF5 file into a
DelayedArray
object. Finally a ModelArray
class is defined by integrating above slots together. This is done in
the ModelArray()
function.
Above functions and setups can be found in script
R/ModelArray_Constructor.R
Show and accessors methods of ModelArray
class
To easily show and access a ModelArray
object, we adopt
S4 Object Oriented Programming (OOP) model.
End users of R actually are using such S4 methods frequently. For
example, when showing the (summarized) content of an object, users
simply enter its name in R console and press “enter”. Under the hood, it
uses the generic show()
method. As for showing a
ModelArray
object, we adopt the same way. As
show()
has already been a generic function, we only need to
setMethod()
for its setup.
With S4 OOP model, users can easily access the data they need. Below
are current accessor functions for accessing different slots in
ModelArray
class’s:
-
sources()
: source filenames (e.g., a list of .mif files of fixel-wise data) -
scalars()
: scalar matrix (or matrices) -
results()
: statistical result matrix or matrices (if any)
We use setMethod()
to define above methods. Unlike
show()
, these accessors are not generic yet, so we then do
additional step of setGeneric()
.
All of these show and accessors setups can be found in script
R/ModelArray_S4Methods.R
.
Model fitting
Functions such as ModelArray.lm()
and
ModelArray.gam()
are for model fitting. Under the hood,
these functions iteratively call analyseOneElement.lm()
and
analyseOneElement.gam()
for fitting for one element,
respectively. This also facilitates the parallel computing across all
elements requested.
Besides the existing functions for linear models
(ModelArray.lm()
) and GAMs (ModelArray.gam()
),
new functions for diverse statistical models can be added by following
the structure below.
The general structure for model fitting functions
ModelArray.<model_name>()
in ModelArray
is as below:
- Some sanity checks of arguments
- Print out some important methods for model fitting, such as
method
inmgcv::gam()
for GAMs - Fit statistical models:
- First, initialize: run
analyseOneElement.<model_name>()
once to get the column names of the statistics in the output statistics; usually using an element in the middle (instead of the first or the last element) to avoid too many NaNs in the data; - Then, iteratively call
analyseOneElement.<model_name>()
across all elements requested (can be parallelized). The functionanalyseOneElement.<model_name>()
does the following jobs:- Fit the model for this element;
- Get necessary statistics via
broom::tidy()
,broom::glance()
, and/orsummary()
. - Flatten the results into one row of data.frame for this element; add element ids; remove the column names (it’s a list of numeric now) to save a bit more memory
- Return this result of this element
- Concatenate results of all requested elements into a matrix of elements by statistics, then add the column names; now we get the final result data.frame
- First, initialize: run
- If p-value correction (for terms and/or model) is
requested:
- For each p-value correction method, apply it and add the values to the result data.frame
- Return the result data.frame
For ModelArray.gam()
, there is an additional step after
one round of iteration:
- If
changed.rsq.term.index
is notNULL
, i.e. there is request of quantifying one (or more terms)’s importance in the model (i.e. metrics delta adjusted R-squared and partial R-squared), we need to compute the reduced model (formula without the term of interest):- For each requested term of interest:
- Iteratively call
analyseOneElement.<model_name>()
but providing reduced formula - Get necessary statistics
- Compute delta adjusted R-squared and partial R-squared; add them to the previously got data.frame
- Iteratively call
- For each requested term of interest:
Therefore, there might be two or more rounds of iterations on model
fitting in ModelArray.gam()
, depending on number of terms
the user requested in changed.rsq.term.index
.
Functions mentioned in this section can be found in scripts
R/analyse.R
and
R/ModelArray_Constructor.R
.
Saving statistical results to .hdf5 file on disk
After getting the data.frame of statistical results, users will save
it into the original HDF5 file on disk. The function is
writeResults()
located in script
R/ModelArray_Constructor.R
Writing tests for CircleCI
We use Continuous Integration (CI) testing to ensure stability and
quality assurance of ModelArray
. Therefore, when you modify
the source code of ModelArray
, it is very important to make
sure you add appropriate tests of the new or modified functionality.
All the test files are located in folder tests/testthat
.
Once a commit is pushed to GitHub, CircleCI (a CI testing platform) will
be triggered to build the software and run the unit tests.
The tests serve the goals of:
- Run functions in different scenarios (with different requests) and expect no error. This is to ensure when new features added, it will run without error; ** For some scenarios, some special data is needed. For example, when testing NAs in input data (for volume data)
- Expect the output data.frame will contain column with a specific name, or check the dimensions of the output data.frame. This is to ensure expected columns show up.
- Most importantly, check the values are as expected: statistical results calculated in ModelArray fitting loop = those calculated in standard R.
Developers can run the tests locally to see if unit tests in a test
file (in tests/testthat
folder) can run without errors:
Then click “Run Test” button in RStudio to run the test file. Check if there is anything failed.
For more details on how to write unit tests + test out, please check
out Testing
chapters in the book “R Packages” written by Hadley
Wickham, Jennifer Bryan
Building the Docker image for ModelArray + ConFixel
The Dockerfile
for the Docker image can be found in the
root folder of the GitHub repository of ModelArray
.
CircleCI will automatically build the Docker image and push to Docker
Hub when there is a GitHub commit or a merge to main branch, or when a
version is tagged. See the .circleci/config.yml
file for
more details on these setups.