Developing a function for ModelArray.wrap
wrap_function.Rmd
Introduction
R offers a ton of wonderful packages for statistical modeling - more than we could ever hope to officially support with functions like ModelArray.lm and ModelArray.gam in ModelArray. Fortunately, ModelArray.wrap provides a way to run your own function, which can be arbitrarily complex, while using the efficient looping and parallelization machinery of ModelArray.
We’ll skip over the making of the .h5 file and phenotypes CSV file, and directly download cortical thickness data from the Healthy Brain Network, computed as part of the Reproducible Brain Charts (RBC) project. We’ll develop a function that computes the mean cortical thickness within each site, and then use ModelArray.wrap to run it across elements.
Step 1. Prepare your data
We first create a folder called “myCustomFunction” on the Desktop. In a terminal console:
$ cd ~/Desktop
$ mkdir myCustomFunction
$ cd myCustomFunction
$ pwd # print the full path of this folder
$ wget -cO - https://osf.io/uxc65/download > download.tar.gz
$ tar -xzf download.tar.gz
$ rm download.tar.gz
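After extraction, it’s worth checking that the expected files are in place (filenames as used later in this walkthrough):
$ ls ConCifti
# expect to see (possibly among other files):
#   study-HBN_compression-gzip_chunkmb-4_thickness.h5
#   study-HBN_desc-participants_thickness.csv
#   example_fsLR_den-164k.dscalar.nii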
Step 2. Use ModelArray to perform statistical analysis
The next step is to use the HDF5 file and the CSV file from Step 1 to perform statistical analysis in R. If you installed RStudio (which we recommend), you can simply launch RStudio. All the commands in this Step 2 section will be run in R.
Step 2.1. Load ModelArray package in R
After you’ve installed ModelArray, here’s how we load the library:
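library(ModelArray)
# the pipeline code later in this walkthrough also uses dplyr and tidyr
# (for %>%, group_by(), summarize(), pivot_wider(), bind_cols()),
# so we load them here as well:
library(dplyr)
library(tidyr)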
Step 2.2. Create a ModelArray-class object
To create a ModelArray-class object that represents the HDF5 file of greyordinate-wise data, we need the HDF5 filename and the scalar’s name:
h5_path <- "~/Desktop/myCustomFunction/ConCifti/study-HBN_compression-gzip_chunkmb-4_thickness.h5"
# create a ModelArray-class object:
modelarray <- ModelArray(h5_path, scalar_types = c("thickness"))
# let's check what's in it:
modelarray
You’ll see:
ModelArray located at ~/Desktop/myCustomFunction/ConCifti/study-HBN_compression-gzip_chunkmb-4_thickness.h5
Source files: 2336
Scalars: thickness
Analyses:
This shows that there are 2336 source thickness files in the modelarray object you created.
You may take a look at what the scalar matrix looks like by using scalars():
# scalar thickness data:
scalars(modelarray)[["thickness"]]
You’ll see:
<327684 x 2336> DelayedMatrix object of type "double":
ciftis/sub-NDARAA075AMK_thickness.fsLR_den-164k.dscalar.nii ... ciftis/sub-NDARZZ810LVF_thickness.fsLR_den-164k.dscalar.nii
[1,] 2.323956 . 3.401824
[2,] 2.927209 . 2.428174
[3,] 3.211094 . 2.479505
[4,] 2.463409 . 3.760944
[5,] 2.460162 . 2.912316
... . . .
[327680,] 2.832996 . 3.066516
[327681,] 2.669299 . 2.912496
[327682,] 2.555752 . 2.510855
[327683,] 2.336970 . 2.472471
[327684,] 2.256365 . 2.293425
Rows are greyordinates (n = 327684), and column names are the source filenames (n = 2336). Each value is a specific greyordinate’s thickness from a dscalar.nii file.
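If you just want the list of source filenames, they are the column names of this matrix:
# first few source filenames:
head(colnames(scalars(modelarray)[["thickness"]]))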
Step 2.3. Load cohort phenotypes CSV file
We then load the accompanying CSV file:
csv_path <- "~/Desktop/myCustomFunction/ConCifti/study-HBN_desc-participants_thickness.csv"
# load the CSV file:
phenotypes <- read.csv(csv_path)
nrow(phenotypes)
names(phenotypes)
This CSV file comes from RBC but is reduced to include only the subjects with thickness data. It also includes rich phenotype information.
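A quick peek at two of the phenotype columns we’ll use below (column names as they appear in the plotting and modeling code later in this walkthrough):
# columns used later for plotting and modeling:
head(phenotypes[, c("study_site", "age")])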
Step 2.4. Perform statistical analysis
With both modelarray and the data frame phenotypes set up, we can now perform the statistical analysis.
We need to write a function that we can use within ModelArray.wrap to compute what we want. It can be tricky to figure out exactly what you want to compute and how to make sure your function returns the correct output. We can use a convenience function from ModelArray to see what a dataframe looks like for a single element.
example_df <- exampleElementData(modelarray, scalar = "thickness", i_element = 12345, phenotypes = phenotypes)
names(example_df)
This is a rich dataframe! It contains the cortical thickness for all subjects in the study at vertex 12345. It also contains the subject ID, sex, age, and other covariates.
We can explore this single element like we would any dataframe:
library(ggplot2)
ggplot(example_df, aes(x = age, y = thickness, color = study_site)) +
geom_point() +
stat_smooth(method = "lm") +
theme_minimal()
This looks cool. There is apparently a site effect, so let’s write a function that computes the mean cortical thickness within each site. We’ll also compute an ANOVA on site and get the statistics/p-values. For now we’ll get the values we want into a dataframe; then we’ll write the function.
site_means <- example_df %>%
group_by(study_site) %>%
summarize(mean_thickness = mean(thickness, na.rm = TRUE)) %>%
# write them into "thickness_{study_site}" columns
pivot_wider(names_from = study_site, values_from = mean_thickness, names_glue = "thickness_{study_site}")
site_means
This looks good. Now let’s do the ANOVA and get the statistics/p-values.
site_anova <- example_df %>%
{
lm(thickness ~ study_site, data = .)
} %>%
anova() %>%
broom::tidy() %>%
dplyr::mutate(
partial_eta_sq = dplyr::if_else(
term != "Residuals",
sumsq / (sumsq + sumsq[term == "Residuals"]),
NA_real_
)
) %>%
dplyr::filter(term != "Residuals") %>%
dplyr::select(term, df, sumsq, meansq, statistic, p.value, partial_eta_sq) %>%
tidyr::pivot_wider(
names_from = term,
values_from = c(df, sumsq, meansq, statistic, p.value, partial_eta_sq),
names_glue = "{term}.{.value}"
) %>%
tibble::as_tibble()
site_anova
Combining the two, we get the full function we want. Absolutely critical here is that your function needs a single argument named data:
my_mean_thickness_fun <- function(data) {
site_means <- data %>%
group_by(study_site) %>%
summarize(mean_thickness = mean(thickness, na.rm = TRUE)) %>%
pivot_wider(names_from = study_site, values_from = mean_thickness, names_glue = "thickness_{study_site}")
site_anova <- data %>%
{
lm(thickness ~ study_site, data = .)
} %>%
anova() %>%
broom::tidy() %>%
dplyr::mutate(
partial_eta_sq = dplyr::if_else(
term != "Residuals",
sumsq / (sumsq + sumsq[term == "Residuals"]),
NA_real_
)
) %>%
dplyr::filter(term != "Residuals") %>%
dplyr::select(term, df, sumsq, meansq, statistic, p.value, partial_eta_sq) %>%
tidyr::pivot_wider(
names_from = term,
values_from = c(df, sumsq, meansq, statistic, p.value, partial_eta_sq),
names_glue = "{term}.{.value}"
) %>%
tibble::as_tibble()
bind_cols(site_means, site_anova)
}
my_mean_thickness_fun(example_df)
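The output should be a single row containing both the site means and the ANOVA statistics. As a quick sanity check (the pivot_wider steps above are designed to collapse everything into one row):
# sanity check: the function should return a one-row data frame
res_one <- my_mean_thickness_fun(example_df)
stopifnot(is.data.frame(res_one), nrow(res_one) == 1)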
This looks great! We can test that the function will work in the ModelArray looping machinery by running it on a single element.
analyseOneElement.wrap(
12345,
my_mean_thickness_fun,
modelarray,
phenotypes,
"thickness",
num.subj.lthr = 10,
flag_initiate = TRUE,
on_error = "debug"
)
This will print the column names that will be retrieved from the function when it’s run in the ModelArray looping machinery. It looks good, so we can run it on all the greyordinates:
res_wrap <- ModelArray.wrap(
FUN = my_mean_thickness_fun,
data = modelarray,
phenotypes = phenotypes,
scalar = "thickness",
n_cores = 8
)
On my laptop this took about 15 minutes.
head(res_wrap)
element_id thickness_HBNsiteCBIC thickness_HBNsiteCUNY thickness_HBNsiteRU thickness_HBNsiteSI study_site.df study_site.sumsq study_site.meansq study_site.statistic study_site.p.value study_site.partial_eta_sq
1 0 3.108483 3.009349 3.063119 2.692335 3 45.52142 15.173806 59.35110 5.688104e-37 0.07093606
2 1 2.618739 2.673533 2.494077 2.324791 3 24.51961 8.173205 28.88297 2.474739e-18 0.03582533
3 2 3.111822 3.056225 2.983709 2.812225 3 23.59899 7.866330 19.95448 9.059350e-13 0.02502795
4 3 3.669784 3.621715 3.054638 3.329743 3 186.55051 62.183503 83.32945 3.261715e-51 0.09682010
5 4 2.790135 2.729325 2.494522 2.444265 3 53.34644 17.782147 55.42033 1.336311e-34 0.06655069
6 5 2.996483 3.043458 2.785565 2.650955 3 40.25648 13.418828 41.60603 3.458560e-26 0.05080478
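Before saving, it’s worth a quick sanity check of the results; for example, a histogram of the ANOVA p-values across greyordinates (column name as in the output above; ggplot2 was loaded earlier):
# distribution of site-effect p-values across all greyordinates:
ggplot(res_wrap, aes(x = study_site.p.value)) +
  geom_histogram(bins = 50) +
  theme_minimal()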
We now save the results data frame into the original h5 file:
writeResults(h5_path, df.output = res_wrap, analysis_name = "anova_and_means")
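To confirm the analysis was saved, you can re-create the ModelArray object and load the saved analysis; its printed Analyses: field should now list it. This sketch assumes the ModelArray() constructor accepts an analysis_names argument, as in the package walkthrough:
# re-open the HDF5 file, also loading the saved analysis:
modelarray_new <- ModelArray(h5_path,
                             scalar_types = c("thickness"),
                             analysis_names = c("anova_and_means"))
modelarray_new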
Notice that the analysis name specified in the argument analysis_name will be used by ConCifti in the next step when converting results back to dscalar.nii file format. It’ll also be used as the prefix of the dscalar.nii files to be saved.
Step 3. Convert back to dscalar.nii file format
Step 3.1. Convert the statistical results into dscalar.nii file format using ConCifti
We now use ConCifti to convert the results into dscalar.nii files:
$ ciftistats_write \
--cohort-file ${HOME}/Desktop/myCustomFunction/ConCifti/study-HBN_desc-participants_thickness.csv \
--relative-root ${HOME}/Desktop/myCustomFunction \
--analysis-name anova_and_means \
--input-hdf5 ${HOME}/Desktop/myCustomFunction/ConCifti/study-HBN_compression-gzip_chunkmb-4_thickness.h5 \
--output-dir ${HOME}/Desktop/myCustomFunction/anova_and_means \
--example-cifti ${HOME}/Desktop/myCustomFunction/ConCifti/example_fsLR_den-164k.dscalar.nii
# remember to use your specific path in --relative-root;
# we recommend it's a full path too!
After it’s done, in the main folder myCustomFunction, there will be a new folder called anova_and_means, and the converted result dscalar.nii files will be in this new folder:
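For example, listing that folder in the terminal should show one dscalar.nii per statistics column in res_wrap, each prefixed with the analysis name:
$ ls ${HOME}/Desktop/myCustomFunction/anova_and_means
# one dscalar.nii file per results column, prefixed with "anova_and_means"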