5.1. Available Data

This section summarizes the data available through the Alliance. All data is stored on the Beluga cluster under the rpp-aevans-ab allocation, in the following directory:

/project/rpp-aevans-ab/neurohub/ukbb/new

Tabular

The tabular files contain data that can be summarized by a few entries (e.g. age, blood pressure). The available data fields are summarized in the Data Dictionary (right-click -> save -> open .html file in your browser), and the most recent version of the data is stored in tabular/. The Unique Data Identifier (UDI) for each piece of data consists of three parts: [Datafield]-[Instance Index].[Array Index]:

Datafield refers to the type of data: 2207 is the datafield for whether the subject wears glasses or contact lenses.
Instance Index refers to the instance of data acquisition. 0-1 are the initial and repeat assessments. 2-3 are the imaging and repeat imaging visits.
Array Index refers to the index within an array. Some datafields have multiple values (e.g. 3060-0.0, 3060-0.1, 3060-0.2), and these are stored separately.
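
As an illustration of the UDI layout described above, the following minimal sketch splits a UDI into its three parts using shell parameter expansion. The function name parse_udi is hypothetical and is not part of any UK Biobank or NeuroHub tooling.

```shell
# parse_udi UDI: print the datafield, instance index, and array index
# of a UDI written as [Datafield]-[Instance Index].[Array Index].
# parse_udi is a hypothetical helper name used only for this example.
parse_udi() {
    local udi="$1"
    local field="${udi%%-*}"       # text before the first "-"
    local rest="${udi#*-}"         # text after the first "-"
    local instance="${rest%%.*}"   # text before the first "."
    local array="${rest#*.}"       # text after the first "."
    printf '%s %s %s\n' "$field" "$instance" "$array"
}

parse_udi "3060-0.1"   # prints: 3060 0 1
```

The sketch does no validation; it assumes its input already follows the three-part pattern.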

A number of formats are available: csv, sas, stata, and r.

.csv:
Comma-separated values. General-purpose format. Each row describes a subject; each column describes a datafield.
.sas / .sd2:
Format for SAS statistical analysis package.
.stata:
Format for the Stata statistical analysis package.
.r / .tab:
Format for use with R.
.txt:
Tab-delimited values. Similar to .csv.
.bulk:
List of bulk fields per participant.
.html:
Documentation about the field data dictionary and encodings.

Versions

The data is periodically updated. Old versions are stored in tabular/archive/, with the directories identified by the code for the basket. In the case of subject withdrawals, old versions are also purged and researchers are expected to remove withdrawn subjects from any local subsets.
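
Removing withdrawn subjects from a local CSV subset could be sketched as follows. This is one possible approach, not an official procedure. It assumes that the withdrawal list is a plain text file with one subject ID per line, that the subset has a header row with the subject ID in its first column, and that no field contains an embedded comma. The function name drop_withdrawn and the sample values are made up for illustration.

```shell
# drop_withdrawn IDS FILE: print FILE without the rows whose first column
# appears in IDS. Both file layouts are assumptions made for this sketch.
drop_withdrawn() {
    awk -F',' '
        FILENAME == ARGV[1] {                 # first file: one withdrawn ID per line
            id = $1; gsub(/"/, "", id); gone[id] = 1; next
        }
        FNR == 1 { print; next }              # second file: always keep the header
        {
            id = $1; gsub(/"/, "", id)        # tolerate quoted IDs
            if (!(id in gone)) print
        }
    ' "$1" "$2"
}

# Made-up example data, for illustration only.
ids="$(mktemp)"; data="$(mktemp)"
printf '1000002\n' > "$ids"
printf 'eid,2207-0.0\n1000001,1\n1000002,0\n1000003,1\n' > "$data"
drop_withdrawn "$ids" "$data"
rm -f "$ids" "$data"
```

Run on the sample data, this prints the header followed by the rows for 1000001 and 1000003. Redirect the output to a new file and check it before replacing the original subset.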

Working with CSV Files

awk is a good general-purpose tool for slicing and dicing CSV files, but XSV is simpler and faster. To install XSV on Alliance resources:

module load rust
cargo install xsv
export PATH=~/.cargo/bin:$PATH
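
For comparison, here is a minimal awk sketch that keeps the first column plus every column belonging to one datafield. It assumes a header row, a subject ID in the first column, and no commas embedded inside quoted fields (plain awk splitting does not handle those). The function name select_field and the sample values are made up for illustration.

```shell
# select_field FIELD FILE: print the first column (assumed to hold the
# subject ID) and every column whose header starts with "FIELD-".
select_field() {
    awk -F',' -v field="$1" '
        NR == 1 {
            n = 1; cols[1] = 1                   # always keep the first column
            for (i = 2; i <= NF; i++) {
                name = $i
                gsub(/"/, "", name)              # tolerate quoted headers
                if (index(name, field "-") == 1) cols[++n] = i
            }
        }
        {
            line = $(cols[1])
            for (j = 2; j <= n; j++) line = line "," $(cols[j])
            print line
        }
    ' "$2"
}

# Made-up example data, for illustration only.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
eid,2207-0.0,3060-0.0,3060-0.1
1000001,1,10,11
1000002,0,20,21
EOF
select_field 3060 "$sample"
rm -f "$sample"
```

On the sample file this prints the eid column followed by 3060-0.0 and 3060-0.1. With XSV, `xsv headers` lists the column names and `xsv select` picks columns. Because `-` is part of the selector syntax, names such as 3060-0.0 most likely need to be quoted inside the selector; check `xsv select --help` for the exact form.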

Imaging

Multiple imaging modalities are available from the UKB. The data can be found on Beluga in the directory:

/lustre03/project/6008063/neurohub/ukbb/new/Bulk

The following table summarizes the current status on Compute Canada:

Datafield	Instance	Description	Status
20217	2-3	MR - Task functional brain MRI - DICOM	Available (raw)
20218	2-3	MR - Diffusion brain MRI - DICOM	Available (raw)
20219	2-3	MR - Susceptibility weighted brain images - DICOM	Available (raw)
20224	2-3	MR - Phoenix - DICOM	Available (raw)
20225	2-3	MR - Functional brain images - resting - DICOM	Available (raw)
20227	2-3	MR - Resting-state fMRI	Available (raw)
20249	2-3	MR - Task fMRI	Available (raw)
20250	2-3	MR - Diffusion	Available (raw)
20251	2-3	MR - SWI	Available (raw)
20252	2-3	MR - T1-weighted	Available (raw)
20253	2-3	MR - FLAIR	Available (raw)
20266	2-3	MR - Arterial spin labelling brain images - DICOM	Available (raw)
25750	2-3	MR - Resting functional MRI full correlation matrix, dimension 25	Available (raw)
25751	2-3	MR - Resting functional MRI full correlation matrix, dimension 100	Available (raw)
25752	2-3	MR - Resting partial correlation matrix, dimension 25	Available (raw)

Physical measures

This category contains information from physical measurements taken at the Assessment Centre. Currently, the following data field is available on Beluga:

/lustre03/project/6008063/neurohub/ukbb/new/Bulk/20205

Datafield	Instance	Description	Status
20205	2-3	ECG at rest	Available (raw)

Physical activity

This category provides measurements recorded via a wrist-worn accelerometer. The main data collection (for 100,000 participants) took place between June 2013 and January 2016. In 2018, a subset of participants was asked to repeat the exercise up to four times each, on a quarterly basis, to examine the influence of seasonal effects on the measurements. These seasonal repeats are currently ongoing. The following data fields are currently available on Beluga:

/lustre03/project/6008063/neurohub/ukbb/new/Bulk/9000x

Datafield	Instance	Description	Status
90001	0-1-2-3-4	Acceleration data (cwa, raw format)	Available (raw)
90004	NA	Acceleration intensity time-series (Epoch)	Available (raw)

Genetics

Multiple types of genetics data were acquired from UKB subjects.

You can find the data in Genotype_Results/ and Imputation/ directories in:

/lustre03/project/6008063/neurohub/ukbb/new/Bulk

The following sections summarize the current status on Compute Canada:

Genotype

Datafield	Description	Status
22002	Genotyping process and sample QC - CEL Files	Available
22418	Calls	Available
22419	Genotype confidences	Available
22437	Copy number variants B-allele frequencies	Available
22431	Copy number variants, log2ratios	Available
22430	Intensities	Available
22438	Haplotypes	Available
22828	Imputation	Available

Exome

You can find the exome data under the genetics/ directory at:

/lustre03/project/6008063/neurohub/ukbb/genetics/exome

Datafield	Description	Status
23151	Variant call files	Available
23152	Variant call files indices	Available
23153	CRAM files	Available
23154	CRAM indices	Available
23155	Population-level variants (PLINK)	Available
23156	Population-level variants (pVCF)	Available

Preprocessed data

NeuroHub users currently have access to the following types of preprocessed data:

1. Diffusion-weighted imaging

The data were processed with Tractoflow and are available on Beluga at the following path:

/lustre03/project/rpp-aevans-ab/neurohub/ukb/new/Derived/tractoflow_out

2. Imaging data with fMRIPrep

To be available soon. Availability will be announced in our Newsletter, so please make sure to subscribe!

Private note: Complete fMRI and diffusion MRI processing of currently released UK Biobank data using Compute Canada resources, and storage of the derived dataset output within NeuroHub.

In 2021, large-scale diffusion MRI processing was conducted on all 40000 subjects of the UK Biobank with MRI data, using the NeuroHub / CBRAIN allocation on Compute Canada. The processing took approximately 4 months to complete, and the generated output was organized and packaged so as to be available to all NeuroHub UK Biobank users. The fMRI data was also intended to be processed at large scale with the fMRIPrep software tool. However, at that time there were known issues with the fMRIPrep code that affected the generated results. It was therefore decided to proceed first with the diffusion MRI processing and to return to running fMRIPrep across all 40000 subjects once fMRIPrep had been suitably debugged and corrected.

At present, a sample dataset corresponding to 1000 participants of the UK Biobank has been processed through fMRIPrep and is currently undergoing analysis and quality checks. Once verified, fMRIPrep processing will begin on all 40000 subjects of the UK Biobank with imaging data. This is also anticipated to take approximately 4 months, which moves the estimated completion date to Year 2 Q2.

3. CIVET output

Information on how to access the CIVET output of the UK Biobank preprocessing, available through CBRAIN and the LORIS DQT, can be found here.

Status Codes

Status	Meaning
Available	Data is accessible to NeuroHub users
Available (raw)	The raw data is available, but derivatives may not be.
Deploying	Data will soon be available.
Fetching	Data is currently being transferred to Beluga
Processing	Data is undergoing processing (e.g. format, structure).
Not on Beluga	Data has not been downloaded.

Requests

Are there new datafields or return datasets that you'd like to be made available? Send us an email at [email protected] and attach a text file that lists the following in a single column:

Datafields, prepended with 'F' (e.g. "F20252")
Return datasets, prepended with 'R' (e.g. "R123")
SNP IDs, prepended with 'S' (e.g. "S456")
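
A request file in the single-column layout listed above could be generated along these lines. This is only a sketch: the function name make_request is hypothetical, and the identifiers are the ones used as examples in the list.

```shell
# make_request PREFIX ID...: print one prefixed identifier per line.
# make_request is a hypothetical helper name used only for this example.
make_request() {
    local prefix="$1"
    shift
    local id
    for id in "$@"; do
        printf '%s%s\n' "$prefix" "$id"
    done
}

make_request F 20252   # datafields
make_request R 123     # return datasets
make_request S 456     # SNP IDs
```

The three calls print F20252, R123, and S456 on separate lines. Redirect the combined output to a text file to produce the attachment.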

Is there bulk data that is already authorized but unavailable? Let us know at [email protected]; we prioritize data with community interest.
