How to Get Data on CUBIC

The following instructions are intended specifically for getting data on CUBIC.

1. Request access to a CUBIC project
2. Dataset processed without BABS
3. Dataset processed with BABS
- 3.1. Get data without DataLad
- 3.2. Get data with DataLad
  - Prerequisites
  - Example walkthrough

1. Request access to a CUBIC project

🌟 Check the CUBIC Project section on the dataset-specific page to identify which CUBIC project you need to request access to (Dataset-specific links listed here).

🌟 Check the DUA section on the dataset-specific page and obtain the appropriate DUA (Dataset-specific links listed here).

If DUA is None, all CUBIC users have read access to that CUBIC project.

If DUA is NOT None, follow these steps:

Email Tien Tong a data request that includes the following:
- The level of access. In this case, you need read-only access.
- The name of the user(s) to be given access (full name or login name). If you plan to store and analyze data in a CUBIC project, use your project username, not your personal CUBIC username.
- The full path to the project you need access to, /cbica/projects/<project_name>.
- Any DUA requirements.
Tien will forward your request to the CUBIC admins and notify you when access is granted.

🌟 Once you have access to the project, check the BABS section on the dataset-specific page to determine whether the dataset was processed with BABS (Dataset-specific links listed here).

IMPORTANT: Minimizing Data Duplication

AI2D shared data resources are designed for centralized access and sharing. To avoid unnecessary duplication, we strongly discourage keeping local copies of shared datasets within your own CUBIC project.

Best Practice: Once you have read access to the dataset’s CUBIC project, you can read and work with the data directly. For example, from your own CUBIC project you can run:

import pandas
df = pandas.read_csv("/path/to/AI2D_dataset/data.tsv", sep="\t")

If you need to copy a small subset of the data (for example, to review the full output path for a single subject), place it in /cbica/comp_space/<your_project>. This location is intended for temporary data and is scheduled for automatic deletion, but you should still check and remove unnecessary files regularly.
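To make the regular cleanup check easier, the sketch below lists files in a scratch folder that have not been modified for a given number of days. It is only an illustration: the function name `stale_files` and the 7-day default are assumptions, and it only reports files, leaving deletion to you. The demo runs in a throwaway temporary directory.

```python
import os
import tempfile
import time
from pathlib import Path


def stale_files(root, max_age_days=7, now=None):
    """Return files under `root` last modified more than `max_age_days` ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.stat().st_mtime < cutoff
    )


# Demo in a throwaway directory; on CUBIC, point `root` at your comp_space folder.
scratch = Path(tempfile.mkdtemp())
old_file, new_file = scratch / "old.tsv", scratch / "new.tsv"
old_file.write_text("x\n")
new_file.write_text("y\n")

# Backdate one file by ten days so it counts as stale.
ten_days_ago = time.time() - 10 * 86400
os.utime(old_file, (ten_days_ago, ten_days_ago))

stale = stale_files(scratch)
print([p.name for p in stale])
```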

If you get data using DataLad, run datalad drop to remove your local copy once you no longer need it.

2. Dataset processed without BABS

If the dataset was NOT processed with BABS, you can read and work with the data directly.

3. Dataset processed with BABS

If the dataset WAS processed with BABS, you can get the data either WITHOUT DataLad or USING DataLad.

3.1. Get data without DataLad

Outputs from our processing pipelines were usually zipped to save inodes.

This section provides instructions in case you prefer NOT to use DataLad and instead want to unzip the output files directly into your CUBIC project directory.

Pros:

- Much faster than getting data using DataLad
- Simple and requires no additional software
- Lightweight: you extract only the files you need

Cons:

- Does not support data provenance tracking

3.1.1. Prerequisites

🌟 Check the dataset-specific page for the paths of the ephemeral clones (Dataset-specific links listed here).

🌟 Download the extraction script: unzip_files.sh. The script supports:

- Extracting all files from zip archives
- Filtering by subject list (text file with subject IDs)
- Extracting specific file patterns using regex
- Combining subject lists with file patterns

For usage instructions, run the script with the -h or --help flag.

3.1.2. Explore the data structure

Before extracting data, it helps to know which files each zip archive contains. You can either:

Option A: Extract a sample participant’s data

# Create a single subject list file (e.g., one exemplar subject)
echo "sub-100307" > single_subject.txt

input_dir=/path/to/ephemeral/clone
output_dir=/path/to/your/cubic/project

bash unzip_files.sh "${input_dir}" "${output_dir}" single_subject.txt

Option B: List contents of a zip file

# XCP-D example; replace <subject_id> with a real ID such as sub-100307
input_dir=/path/to/ephemeral/clone
7z l "${input_dir}"/<subject_id>*.zip
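If `7z` is not available in your environment, a similar listing can be sketched with Python's standard `zipfile` module. This is a minimal sketch, not part of the provided tooling: the function name `list_zip_members` is hypothetical, and it matches names with shell-style wildcards (via `fnmatch`), which may differ from the pattern syntax `unzip_files.sh` expects. The demo builds a tiny throwaway archive instead of touching real data.

```python
import fnmatch
import tempfile
import zipfile
from pathlib import Path


def list_zip_members(zip_path, pattern="*"):
    """Return the names of archive members matching a shell-style pattern."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(
            name for name in archive.namelist()
            if fnmatch.fnmatchcase(name, pattern)
        )


# Demo on a throwaway archive; on CUBIC, pass a zip from the ephemeral clone.
demo_dir = Path(tempfile.mkdtemp())
demo_zip = demo_dir / "sub-01_demo.zip"
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("xcpd/sub-01/func/sub-01_task-rest_relmat.tsv", "a\tb\n")
    archive.writestr("xcpd/sub-01/anat/sub-01_desc-preproc_T1w.json", "{}")

demo_listing = list_zip_members(demo_zip, "*/func/*.tsv")
print(demo_listing)
```

Listing reads only the archive's index, so nothing is written to your project directory.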

3.1.3. Extract specific data

Once you understand the data structure, you can extract exactly what you need:

Extract specific files for selected subjects:

# XCP-D example
input_dir=/path/to/ephemeral/clone
output_dir=/path/to/your/cubic/project
file_pattern="xcpd*/sub-*/func/sub-*_task-rest*space-fsLR_seg-Glasser_stat-pearsoncorrelation_relmat.tsv"

bash unzip_files.sh "${input_dir}" "${output_dir}" subject_list.txt "${file_pattern}"

Extract specific files for all available subjects:

# Omit the subject list to process all subjects (reuses the variables defined above)
bash unzip_files.sh "${input_dir}" "${output_dir}" "${file_pattern}"
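For readers who want to see the idea behind selective extraction, here is a rough Python sketch of the same workflow: open only the zips for chosen subjects and extract only the members that match a pattern. It is not how `unzip_files.sh` is implemented. The function name `extract_matching` is hypothetical, it assumes each zip file name starts with the subject ID followed by an underscore, and it uses shell-style wildcards rather than regex. The demo uses throwaway archives in a temporary directory.

```python
import fnmatch
import tempfile
import zipfile
from pathlib import Path


def extract_matching(input_dir, output_dir, pattern, subjects=None):
    """Extract members matching `pattern` from each zip in `input_dir`.

    If `subjects` is given, only zips whose leading ID (the text before the
    first underscore) is in that collection are opened.
    Returns the list of extracted member names.
    """
    extracted = []
    for zip_path in sorted(Path(input_dir).glob("*.zip")):
        subject_id = zip_path.stem.split("_")[0]  # assumed naming convention
        if subjects is not None and subject_id not in subjects:
            continue
        with zipfile.ZipFile(zip_path) as archive:
            wanted = [
                name for name in archive.namelist()
                if fnmatch.fnmatchcase(name, pattern)
            ]
            archive.extractall(output_dir, members=wanted)
            extracted.extend(wanted)
    return extracted


# Demo with two throwaway archives standing in for an ephemeral clone.
work = Path(tempfile.mkdtemp())
clone_dir, out_dir = work / "clone", work / "out"
clone_dir.mkdir()
for sub in ("sub-01", "sub-02"):
    with zipfile.ZipFile(clone_dir / f"{sub}_demo.zip", "w") as archive:
        archive.writestr(f"xcpd/{sub}/func/{sub}_task-rest_relmat.tsv", "a\tb\n")
        archive.writestr(f"xcpd/{sub}/anat/{sub}_T1w.json", "{}")

demo_extracted = extract_matching(
    clone_dir, out_dir, "xcpd/*/func/*_relmat.tsv", subjects={"sub-01"}
)
print(demo_extracted)
```

Passing `subjects=None` processes every zip, mirroring the "omit the subject list" case.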

3.2. Get data with DataLad

This section provides instructions in case you prefer to use DataLad.

Pros:

- Full data provenance tracking

Cons:

- Requires some knowledge of DataLad, Git, and git-annex

3.2.1. Prerequisites

🌟 Check the dataset-specific page for the paths of the DataLad datasets (Dataset-specific links listed here).

🌟 Install DataLad

Follow the instructions here to install DataLad.

Accessing AI2D data via DataLad takes two steps. First, you clone an AI2D data repository. This creates a copy of the AI2D file layout, but none of the actual data will be present. Next, you get your data, which tells DataLad to download the content of specific files into your copy. Once the file content is present in your copy, you can use AI2D data just like any other set of files.

3.2.2. Example walkthrough

$ datalad clone \
    ria+file:///cbica/projects/pennlinc_rbc/datasets/LINC_CCNP/derivatives/xcpd-0-10-6-babs/output_ria#~data \
    ccnp_xcpd

[INFO   ] Configure additional publication dependency on "output-storage"
configure-sibling(ok): . (sibling)
install(ok): /cbica/projects/pennlinc_rbc/ccnp_xcpd (dataset)
action summary:
  configure-sibling (ok: 1)
  install (ok: 1)

$ cd ccnp_xcpd
$ datalad get sub-colornest001_ses-1_xcpd-0-10-6.zip
get(ok): sub-colornest001_ses-1_xcpd-0-10-6.zip (file) [from output-storage...]

$ datalad drop sub-colornest001_ses-1_xcpd-0-10-6.zip
drop(ok): sub-colornest001_ses-1_xcpd-0-10-6.zip (file)
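When scripting this get, use, drop cycle, it is easy to forget the final `datalad drop`. One possible pattern, sketched below, wraps the two commands in a Python context manager so the drop runs even if your analysis raises an error. The helper name `fetched` and its `run` parameter are hypothetical and only for illustration; the sketch shells out to the same `datalad get` and `datalad drop` commands shown in the walkthrough. The demo swaps in a recorder function so it runs without DataLad installed.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def fetched(path, dataset_dir=".", run=subprocess.run):
    """Run `datalad get` on entry and `datalad drop` on exit, even on errors."""
    run(["datalad", "get", path], cwd=dataset_dir, check=True)
    try:
        yield path
    finally:
        run(["datalad", "drop", path], cwd=dataset_dir, check=True)


# Dry run: record the commands instead of executing them.
recorded_calls = []


def record(cmd, **kwargs):
    recorded_calls.append(cmd)


zip_name = "sub-colornest001_ses-1_xcpd-0-10-6.zip"
with fetched(zip_name, dataset_dir="ccnp_xcpd", run=record):
    pass  # read what you need from the zip here

print(recorded_calls)
```

In real use, drop the `run=record` argument so the commands are actually executed inside your clone.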

WARNING: These studies take up a lot of disk space. If you attempt to `get` all the file content, you will likely run out of space unless you have a very large storage allocation.
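A quick pre-check before a large `get` can help avoid filling the disk. The sketch below uses Python's standard `shutil.disk_usage` to compare free space against an estimate you supply. The function name `has_room` and the `safety_factor` margin are assumptions for illustration. Note that this reports free space on the underlying filesystem, which may not reflect any quota applied to your CUBIC project.

```python
import shutil


def has_room(path, needed_bytes, safety_factor=1.2):
    """Return True if the filesystem holding `path` can fit `needed_bytes`.

    `safety_factor` pads the estimate so the disk is not filled to the brim.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= needed_bytes * safety_factor


# Example: check the current directory against a 50 GB estimate (hypothetical size).
estimated_bytes = 50 * 1024**3
print(has_room(".", estimated_bytes))
```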
