Collaboration & How to Share Data

Collaboration is an important part of our work. Sharing the input data and outputs/derivatives of a project is a critical step in the scientific process, and so doing it both accurately and efficiently is a high priority. Overall, remember that this step can be a strain, especially when dealing with external collaborators or systems that are foreign to you. Be patient, and remember that we have to balance accuracy/reproducibility with efficiency/speed.

Below are a few example scenarios and what best recommendations we have for sharing data.

BIDS Data

Sharing BIDS data typically happens at the beginning of a project. We recommend looking into the options on this page about fetching your data.

Outputs from TheWay

Datalad to Datalad

Generally, if you have outputs from TheWay in a datalad dataset, you should try to have collaborators ingest that as datalad datasets too! This is best accomplished if the user can clone the dataset:

#cloning a dataset from a git repo

datalad clone https://github.com/datalad-datasets/longnow-podcasts.git

For most of our use, you want to clone the output of a specific pipeline – these outputs are stored in the output RIA store, so you have to clone them like so:

#cloning from an output ria of an fmriprep run
datalad clone ria+file:///PATH_TO_FMRIPREP_DATASET/output_ria#~data fmriprep_outputs

If the person is happy with this format and working with datalad, they can use this command to get the cloned data.

Datalad to Non-Datalad

If they want regular files with no datalad tracking involved, they can then use rsync to copy the physical data by following the symbolic links. That looks like this:

# YOU clone from an output ria of an fmriprep run
datalad clone ria+file:///PATH_TO_FMRIPREP_DATASET/output_ria#~data fmriprep_outputs
datalad get .

# THEY extract the data from this output RIA as regular files
rsync -avzhL --progress fmriprep_outputs FINAL_DESTINATION

NOTE FOR EXPERTS: This is part of the datalad workflow on aliasing; visit http://handbook.datalad.org/en/latest/beyond_basics/101-147-riastores.html to learn more about how you can use aliasing to share data flexibly.


Zipped or Unzipped?

A lot of outputs from our pipelines can be very large. We typically keep these zipped after running, but for collaboration you can choose to unzip all or part of the outputs into an unzipped outputs datalad dataset. We have an example script here that unzips fMRIPrep outputs – you could, for example, modify the datalad runscript section in lines 150-160 to include/exclude what you need:

# continued from the unzip script
# Line 150

# unzip outputs
unzip -n $ZIP_FILE 'fmriprep/*' -d .

# remove files we don't need at your discretion
rm fmriprep/func/*from-scanner_to-T1w_mode-image_xfm.txt
rm fmriprep/func/*from-T1w_to-scanner_mode-image_xfm.txt
rm fmriprep/func/*space-MNI152NLin6Asym_res-2_boldref.nii.gz
rm fmriprep/func/*space-MNI152NLin6Asym_res-2_desc-aparcaseg_dseg.nii.gz
rm fmriprep/func/*space-MNI152NLin6Asym_res-2_desc-aseg_dseg.nii.gz
# copy outputs out of fmriprep
cp -r fmriprep/func/* .

This can help save space on disk or make it easy for collaborators to transfer smaller outputs.

PennLINC-Kit

PennLINC-Kit is a toolkit for common analysis functions used by the PennLINC scientists. It comes with prepackaged datasets and accessible functions for fetching some of our most used data on-the-fly in your Python session. For many data sharing applications, this alone may suffice, so always check with PennLINC-Kit before you engineer your own solution.

Documentation (in progress) is available here.

Permissions, VPNs, and Picking a Medium

There are always barriers to sharing data. Someone needs access to something and that often entails a lot of bureaucracy; maybe there is data that can’t be shared with PHI; maybe there is not enough disk space to move data. Here are some thoughts to help you guide what decisions to make:

  • Is this a one-time transaction, or will there be data moving back-and-forth repeatedly? VPNs + permissions are an investment that can take a week or sometimes more to be approved
  • How big is the data? Does this have to be shared on a cluster? Maybe it’s more appropriate to download it locally and upload it to Box or SecureShare
  • What clusters are involved? CUBIC has a complicated permissions system; PMACS is more lenient but how will you move data from CUBIC to PMACS?
  • Do you need datalad tracking? How much do collaborators care about that (weighed against how much you care about that)? datalad can be fun, but it is definitely a commitment that you can’t easily back out of