
5.1.Available Data

Xuan Mai PHAM edited this page Jun 2, 2023 · 58 revisions

This section summarizes the data available through the Alliance. All data is stored on the Beluga cluster under the rpp-aevans-ab allocation, in the following directory:

/project/rpp-aevans-ab/neurohub/ukbb/new

Tabular

The tabular files contain data that can be summarized by a few entries (e.g. age, blood pressure). The available data fields are summarized in the Data Dictionary (right-click -> save -> open .html file in your browser), and the most recent version of the data is stored in tabular/. The Unique Data Identifier (UDI) for each piece of data consists of three parts: [Datafield]-[Instance Index].[Array Index]:

  • Datafield refers to the type of data: 2207 is the datafield for whether the subject wears glasses or contact lenses.
  • Instance Index refers to the instance of data acquisition. 0-1 are the initial and repeat assessments. 2-3 are the imaging and repeat imaging visits.
  • Array Index refers to the index within an array. Some datafields have multiple values (e.g. 3060-0.0, 3060-0.1, 3060-0.2), and these are stored separately.
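The UDI layout described above can be split into its three parts programmatically. The following is a minimal sketch, not an official tool: the names `UDI`, `parse_udi`, and `is_imaging_visit` are hypothetical, and the pattern assumes that all three parts are plain non-negative integers.

```python
import re
from typing import NamedTuple


class UDI(NamedTuple):
    """The three parts of a Unique Data Identifier (hypothetical helper type)."""
    datafield: int
    instance: int
    array_index: int


# Assumed layout: [Datafield]-[Instance Index].[Array Index], all integers.
_UDI_PATTERN = re.compile(r"(\d+)-(\d+)\.(\d+)")


def parse_udi(udi: str) -> UDI:
    """Split a UDI string such as '3060-0.1' into its three integer parts."""
    match = _UDI_PATTERN.fullmatch(udi.strip())
    if match is None:
        raise ValueError(f"not a valid UDI: {udi!r}")
    datafield, instance, array_index = (int(part) for part in match.groups())
    return UDI(datafield, instance, array_index)


def is_imaging_visit(udi: UDI) -> bool:
    """Instances 2-3 are the imaging and repeat imaging visits."""
    return udi.instance in (2, 3)
```

For example, `parse_udi("3060-0.1")` yields datafield 3060, instance 0, and array index 1. A column name that does not follow the layout raises `ValueError`, which makes malformed headers easy to spot.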

The data is available in several formats, alongside supporting files:

  • .csv:
    Comma-separated values. General-purpose format. Each row describes a subject; each column describes a datafield.
  • .sas / .sd2:
    Format for SAS statistical analysis package.
  • .stata:
    Format for the Stata statistical analysis package.
  • .r / .tab:
    Format for use with R.
  • .txt:
    Tab-delimited values. Similar to .csv.
  • .bulk:
    List of bulk fields per participant.
  • .html:
    Documentation about the field data dictionary and encodings.

Versions

The data is periodically updated. Old versions are stored in tabular/archive/, with the directories identified by the code for the basket. When subjects withdraw, old versions are also purged, and researchers are expected to remove the withdrawn subjects from any local subsets.
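Removing withdrawn subjects from a local CSV subset could look like the sketch below. It rests on assumptions that are not stated on this page: that the subject ID column is named `eid`, and that the withdrawal list is a plain text file with one ID per line. The function names are hypothetical.

```python
import csv


def load_withdrawn_ids(lines):
    """Collect non-empty IDs from an iterable of lines (assumed: one ID per line)."""
    return {line.strip() for line in lines if line.strip()}


def drop_withdrawn(csv_in, csv_out, withdrawn, id_column="eid"):
    """Copy csv_in to csv_out, skipping rows whose ID is in `withdrawn`.

    `id_column="eid"` is an assumption; adjust it to match your file.
    Returns the number of rows removed.
    """
    reader = csv.DictReader(csv_in)
    writer = csv.DictWriter(csv_out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    removed = 0
    for row in reader:
        if row[id_column] in withdrawn:
            removed += 1
            continue
        writer.writerow(row)
    return removed
```

Both file arguments are open text handles (for real files, open them with `newline=""` as the `csv` module recommends). Comparing the returned count against the length of the withdrawal list is a quick check on how many withdrawn subjects were actually present in the subset.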

Working with CSV Files

awk is a good general-purpose tool for slicing and dicing CSV files, but xsv is simpler and faster. To install xsv on Alliance resources:

module load rust
cargo install xsv
export PATH=~/.cargo/bin:$PATH
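If you prefer to stay in Python, the standard `csv` module can also pull out every column belonging to one datafield. This is a minimal sketch under assumptions: the header row is assumed to contain UDIs as column names plus a subject ID column named `eid`, and the sample rows are made-up placeholders, not real data.

```python
import csv
import io


def columns_for_datafield(header, datafield):
    """Return the header names whose UDI starts with '<datafield>-'."""
    prefix = f"{datafield}-"
    return [name for name in header if name.startswith(prefix)]


def extract_datafield(csv_file, datafield, id_column="eid"):
    """Return one dict per subject with the ID column and the matching columns.

    `id_column="eid"` is an assumption about the file layout.
    """
    reader = csv.DictReader(csv_file)
    wanted = columns_for_datafield(reader.fieldnames or [], datafield)
    keep = [id_column] + wanted
    return [{name: row[name] for name in keep} for row in reader]


# Synthetic example: placeholder IDs and values only.
sample = io.StringIO(
    "eid,2207-0.0,3060-0.0,3060-0.1\n"
    "1000001,x,a,b\n"
    "1000002,y,,\n"
)
rows = extract_datafield(sample, 3060)
```

Matching on the prefix `3060-` rather than `3060` keeps a longer datafield such as `30600` from being selected by accident. For the full tabular file, xsv or awk will generally be faster than this in-memory approach.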

Imaging

Multiple imaging modalities are available from the UK Biobank (UKB). The data can be found on Beluga in the directory:

/lustre03/project/6008063/neurohub/ukbb/new/Bulk

The following table summarizes the current status on Compute Canada:

Datafield | Instance | Description | Status
20217 | 2-3 | MR - Task functional brain MRI - DICOM | Available (raw)
20218 | 2-3 | MR - Diffusion brain MRI - DICOM | Available (raw)
20219 | 2-3 | MR - Susceptibility weighted brain images - DICOM | Available (raw)
20224 | 2-3 | MR - Phoenix - DICOM | Available (raw)
20225 | 2-3 | MR - Functional brain images - resting - DICOM | Available (raw)
20227 | 2-3 | MR - Resting-state fMRI | Available (raw)
20249 | 2-3 | MR - Task fMRI | Available (raw)
20250 | 2-3 | MR - Diffusion | Available (raw)
20251 | 2-3 | MR - SWI | Available (raw)
20252 | 2-3 | MR - T1-weighted | Available (raw)
20253 | 2-3 | MR - FLAIR | Available (raw)
20266 | 2-3 | MR - Arterial spin labelling brain images - DICOM | Available (raw)
25750 | 2-3 | MR - Resting functional MRI full correlation matrix, dimension 25 | Available (raw)
25751 | 2-3 | MR - Resting functional MRI full correlation matrix, dimension 100 | Available (raw)
25752 | 2-3 | MR - Resting partial correlation matrix, dimension 25 | Available (raw)

Physical measures

This category contains information from physical measurements taken at the Assessment Centre. Currently, the following data field is available on Beluga:

/lustre03/project/6008063/neurohub/ukbb/new/Bulk/20205

Datafield | Instance | Description | Status
20205 | 2-3 | ECG at rest | Available (raw)

Physical activity

This category provides measurements recorded via a wrist-worn accelerometer. The main data collection (for 100,000 participants) took place between June 2013 and January 2016. In 2018, a subset of participants was asked to repeat the exercise up to four times each, on a quarterly basis, to examine the influence of seasonal effects on the measurements. These seasonal repeats are currently ongoing. The following data fields are currently available on Beluga:

/lustre03/project/6008063/neurohub/ukbb/new/Bulk/9000x

Datafield | Instance | Description | Status
90001 | 0-1-2-3-4 | Acceleration data (cwa, raw format) | Available (raw)
90004 | NA | Acceleration intensity time-series (Epoch) | Available (raw)

Genetics

Multiple types of genetics data were acquired from UKB subjects.

You can find the data in the Genotype_Results/ and Imputation/ directories in:

/lustre03/project/6008063/neurohub/ukbb/new/Bulk

The following sections summarize the current status on Compute Canada:

Genotype

Datafield | Description | Status
22002 | Genotyping process and sample QC - CEL Files | Available
22418 | Calls | Available
22419 | Genotype confidences | Available
22437 | Copy number variants B-allele frequencies | Available
22431 | Copy number variants, log2ratios | Available
22430 | Intensities | Available
22438 | Haplotypes | Available
22828 | Imputation | Available

Exome

You can find the exome data under the genetics/ directory:

/lustre03/project/6008063/neurohub/ukbb/genetics/exome

Datafield | Description | Status
23151 | Variant call files | Available
23152 | Variant call files indices | Available
23153 | CRAM files | Available
23154 | CRAM indices | Available
23155 | Population-level variants (PLINK) | Available
23156 | Population-level variants (pVCF) | Available

Preprocessed data

NeuroHub users currently have access to the following types of preprocessed data:

1. Diffusion-weighted imaging

The data were processed with Tractoflow and are available on Beluga at the following path:

/lustre03/project/rpp-aevans-ab/neurohub/ukb/new/Derived/tractoflow_out

2. Imaging data with fMRIPrep

This data will be available soon. Its release will be announced in our Newsletter, so please make sure to subscribe!

Private note: Complete fMRI and diffusion MRI processing of currently released UK Biobank data using Compute Canada resources, and storage of the derived dataset output within NeuroHub. In 2021, large-scale diffusion MRI processing was conducted on all 40000 UK Biobank subjects with MRI data, using the NeuroHub / CBRAIN allocation on Compute Canada. The processing took approximately 4 months to complete, and the generated output was organized and packaged so as to be available to all NeuroHub UK Biobank users. The plan was also to perform large-scale processing of the fMRI data using the fMRIPrep software tool. However, at that time there were known issues with the fMRIPrep code that affected the generated results. It was therefore decided to proceed with the diffusion MRI processing first, and to return to running fMRIPrep across all 40000 subjects once fMRIPrep had been suitably debugged and corrected.

At present, a sample dataset corresponding to 1000 UK Biobank participants has been processed through fMRIPrep and is undergoing analysis and quality checks. Once it is verified, fMRIPrep processing will begin on all 40000 UK Biobank subjects with imaging data. This is also expected to take approximately 4 months to complete, which moves the estimated completion date to Year 2 Q2.

3. Civet output

Information about how to access the CIVET output of the UK Biobank preprocessing, available through CBRAIN and the LORIS DQT, can be found here.

Status Codes

Status | Meaning
Available | Data is accessible to NeuroHub users.
Available (raw) | The raw data is available, but derivatives may not be.
Deploying | Data will soon be available.
Fetching | Data is currently being transferred to Beluga.
Processing | Data is undergoing processing (e.g. format, structure).
Not on Beluga | Data has not been downloaded.

Requests

Are there new datafields or return datasets that you'd like to be made available? Send us an email at [email protected] with a text file with the following information in a single column:

  • Datafields, prepended with 'F' (e.g. "F20252")
  • Return datasets, prepended with 'R' (e.g. "R123")
  • SNP IDs, prepended with 'S' (e.g. "S456")
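A request file in this single-column layout can be generated rather than typed by hand. The sketch below is illustrative only: the function names and the output path are hypothetical, and the IDs are the placeholder examples from the list above.

```python
def build_request_lines(datafields=(), return_datasets=(), snp_ids=()):
    """Return one request entry per line, using the F / R / S prefixes."""
    lines = [f"F{field}" for field in datafields]
    lines += [f"R{dataset}" for dataset in return_datasets]
    lines += [f"S{snp}" for snp in snp_ids]
    return lines


def write_request_file(path, lines):
    """Write the entries to a text file as a single column."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")


# Placeholder IDs taken from the examples above.
request = build_request_lines(datafields=[20252], return_datasets=[123], snp_ids=[456])
```

Calling `write_request_file("request.txt", request)` then produces the text file to attach to the email, where `request.txt` is a file name chosen here for illustration.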

Is there bulk data that is already authorized but unavailable? Let us know at [email protected]; we prioritize data with community interest.
