5.1. Available Data
This section summarizes the data available through the Alliance. All data is stored on the Beluga cluster under the rpp-aevans-ab allocation, in the following directory:
/project/rpp-aevans-ab/neurohub/ukbb/new
The tabular files contain data that can be summarized by a few entries (e.g. age, blood pressure). The available data fields are summarized in the Data Dictionary (right-click -> save -> open .html file in your browser), and the most recent version of the data is stored in tabular/. The Unique Data Identifier (UDI) for each piece of data consists of three parts, [Datafield]-[Instance Index].[Array Index]:
- Datafield refers to the type of data: 2207 is the datafield for whether the subject wears glasses or contact lenses.
- Instance Index refers to the instance of data acquisition: 0-1 are the initial and repeat assessments; 2-3 are the imaging and repeat imaging visits.
- Array Index refers to the index within an array. Some datafields have multiple values (e.g. 3060-0.0, 3060-0.1, 3060-0.2), and these are stored separately.
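To illustrate the UDI layout described above, here is a minimal sketch that splits a UDI into its three parts using POSIX shell parameter expansion. The helper name split_udi is hypothetical and is not part of any UK Biobank or NeuroHub tooling.

```shell
# Split a UDI of the form [Datafield]-[Instance Index].[Array Index]
# into its three parts, printed space-separated on one line.
# split_udi is a hypothetical helper name used only for illustration.
split_udi() {
    udi=$1
    datafield=${udi%%-*}   # everything before the first "-"
    rest=${udi#*-}         # everything after the first "-"
    instance=${rest%%.*}   # everything before the first "."
    array=${rest#*.}       # everything after the first "."
    printf '%s %s %s\n' "$datafield" "$instance" "$array"
}

split_udi 3060-0.1   # prints: 3060 0 1
```

Here 3060 is the datafield, 0 is the instance (initial assessment), and 1 is the array index.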
A number of formats are available: csv, sas, stata, and r.
- .csv: Comma-separated values. General-purpose format. Each row describes a subject; each column describes a datafield.
- .sas/.sd2: Format for the SAS statistical analysis package.
- .stata: Format for the Stata statistical analysis package.
- .r/.tab: Format for use with R.
- .txt: Tab-delimited values. Similar to .csv.
- .bulk: List of bulk fields per participant.
- .html: Documentation about the field data dictionary and encodings.
The data is periodically updated. Old versions are stored in tabular/archive/, with the directories identified by the code for the basket. In the case of subject withdrawals, old versions are also purged and researchers are expected to remove withdrawn subjects from any local subsets.
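Removing withdrawn subjects from a local subset can be sketched with awk as follows. This assumes the withdrawal list is a plain text file with one participant ID per line and that the subset is a CSV whose first column holds the participant ID. Both assumptions, the helper name drop_withdrawn, and the sample IDs and values are hypothetical, so check them against the files you actually have.

```shell
# drop_withdrawn WITHDRAWN_IDS SUBSET_CSV
# Prints SUBSET_CSV without the rows whose first column matches an ID
# listed (one per line) in WITHDRAWN_IDS. The header row is always kept.
# Hypothetical helper; the file layouts are assumptions.
drop_withdrawn() {
    awk -F',' -v ids="$1" '
        BEGIN {
            # Load the withdrawn IDs, ignoring optional double quotes.
            while ((getline line < ids) > 0) {
                gsub(/"/, "", line)
                if (line != "") withdrawn[line] = 1
            }
        }
        NR == 1 { print; next }            # keep the header row
        { id = $1; gsub(/"/, "", id) }     # strip optional quotes from the ID
        !(id in withdrawn)                 # print rows that are not withdrawn
    ' "$2"
}

# Demo on made-up data.
tmpdir=$(mktemp -d)
printf '%s\n' 1000002 > "$tmpdir/withdrawn.txt"
printf '%s\n' '"eid","2207-0.0"' '"1000001","1"' '"1000002","2"' '"1000003","1"' > "$tmpdir/subset.csv"
drop_withdrawn "$tmpdir/withdrawn.txt" "$tmpdir/subset.csv"
rm -r "$tmpdir"
```

The demo prints the header plus the rows for 1000001 and 1000003. Splitting on commas is only safe when no field contains an embedded comma; a CSV-aware tool is the better choice otherwise.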
awk is a good, general-purpose tool for slicing and dicing csv files; XSV is simpler and faster. To install XSV on Alliance resources:
module load rust
cargo install xsv
export PATH=~/.cargo/bin:$PATH
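When only awk is at hand, selecting a few columns by their UDI can be sketched as below. The helper name select_columns and the sample values are hypothetical, and the header is assumed to hold one UDI per column. Splitting on commas does not handle fields that contain embedded commas, so treat this as a starting point rather than a full CSV parser.

```shell
# select_columns CSV NAME [NAME...]
# Prints the named columns (matched against the header row) in the order given.
# Hypothetical helper for illustration only.
select_columns() {
    file=$1
    shift
    wanted=$(IFS=,; printf '%s' "$*")   # join the requested names with commas
    awk -F',' -v wanted="$wanted" '
        NR == 1 {
            n = split(wanted, names, ",")
            # Map each header name (without optional quotes) to its position.
            for (i = 1; i <= NF; i++) { h = $i; gsub(/"/, "", h); pos[h] = i }
            for (j = 1; j <= n; j++) {
                if (!(names[j] in pos)) {
                    print "column not found: " names[j] > "/dev/stderr"
                    exit 1
                }
                idx[j] = pos[names[j]]
            }
        }
        {
            # Emit the selected fields for every row, header included.
            line = $(idx[1])
            for (j = 2; j <= n; j++) line = line "," $(idx[j])
            print line
        }
    ' "$file"
}

# Demo on a made-up table.
tmp=$(mktemp)
printf '%s\n' 'eid,2207-0.0,3060-0.0,3060-0.1' '1000001,1,10,11' '1000002,2,20,21' > "$tmp"
select_columns "$tmp" eid 3060-0.1
rm "$tmp"
```

The demo prints the eid and 3060-0.1 columns only, with the header as the first line.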
Multiple modalities are available from the UKB; the data can be found on Beluga in the directory:
/lustre03/project/6008063/neurohub/ukbb/new/Bulk
The following table summarizes the current status on Compute Canada:
Datafield | Instance | Description | Status |
---|---|---|---|
20217 | 2-3 | MR - Task functional brain MRI - DICOM | Available (raw) |
20218 | 2-3 | MR - Diffusion brain MRI - DICOM | Available (raw) |
20219 | 2-3 | MR - Susceptibility weighted brain images - DICOM | Available (raw) |
20224 | 2-3 | MR - Phoenix - DICOM | Available (raw) |
20225 | 2-3 | MR - Functional brain images - resting - DICOM | Available (raw) |
20227 | 2-3 | MR - Resting-state fMRI | Available (raw) |
20249 | 2-3 | MR - Task fMRI | Available (raw) |
20250 | 2-3 | MR - Diffusion | Available (raw) |
20251 | 2-3 | MR - SWI | Available (raw) |
20252 | 2-3 | MR - T1-weighted | Available (raw) |
20253 | 2-3 | MR - FLAIR | Available (raw) |
20266 | 2-3 | MR - Arterial spin labelling brain images - DICOM | Available (raw) |
25750 | 2-3 | MR - Resting functional MRI full correlation matrix, dimension 25 | Available (raw) |
25751 | 2-3 | MR - Resting functional MRI full correlation matrix, dimension 100 | Available (raw) |
25752 | 2-3 | MR - Resting partial correlation matrix, dimension 25 | Available (raw) |
This category contains information from physical measurements done at the Assessment Centre. Currently, the following data field is available on Beluga:
/lustre03/project/6008063/neurohub/ukbb/new/Bulk/20205
Datafield | Instance | Description | Status |
---|---|---|---|
20205 | 2-3 | ECG at rest | Available (raw) |
This category provides measurements recorded via a wrist-worn accelerometer. Main data collection (for 100,000 participants) took place between June 2013 and January 2016. In 2018, a subset of participants was asked to repeat the exercise up to four times each on a quarterly basis to examine the influence of seasonal effects on the measurements. These seasonal repeats are currently ongoing. The following data fields are currently available on Beluga:
/lustre03/project/6008063/neurohub/ukbb/new/Bulk/9000x
Datafield | Instance | Description | Status |
---|---|---|---|
90001 | 0-1-2-3-4 | Acceleration data (cwa, raw format) | Available (raw) |
90004 | NA | Acceleration intensity time-series (Epoch) | Available (raw) |
Multiple types of genetics data were acquired from UKB subjects.
You can find the data in the Genotype_Results/ and Imputation/ directories in:
/lustre03/project/6008063/neurohub/ukbb/new/Bulk
The following sections summarize the current status on Compute Canada:
Datafield | Description | Status |
---|---|---|
22002 | Genotyping process and sample QC - CEL Files | Available |
22418 | Calls | Available |
22419 | Genotype confidences | Available |
22437 | Copy number variants B-allele frequencies | Available |
22431 | Copy number variants, log2ratios | Available |
22430 | Intensities | Available |
22438 | Haplotypes | Available |
22828 | Imputation | Available |
You can find the exome data in the genetics/ directory in:
/lustre03/project/6008063/neurohub/ukbb/genetics/exome
Datafield | Description | Status |
---|---|---|
23151 | Variant call files | Available |
23152 | Variant call files indices | Available |
23153 | CRAM files | Available |
23154 | CRAM indices | Available |
23155 | Population-level variants (PLINK) | Available |
23156 | Population-level variants (pVCF) | Available |
NeuroHub users currently have access to the following types of preprocessed data.
The data were processed with Tractoflow and are available on Beluga at the following path:
/lustre03/project/rpp-aevans-ab/neurohub/ukb/new/Derived/tractoflow_out
To be available soon. Availability will be announced in our newsletter, so please make sure to subscribe!
private note: Complete fMRI and diffusion MRI processing of currently released UK Biobank data using Compute Canada resources, and storage of the derived dataset output within NeuroHub. In 2021, large-scale diffusion MRI processing was conducted on all 40000 subjects of the UK Biobank with MRI data, using the NeuroHub / CBRAIN allocation on Compute Canada. The processing took approximately 4 months to complete, and the generated output was organized and packaged so as to be available to all NeuroHub UK Biobank users. Large-scale processing of the fMRI data with the fMRIprep software tool was also planned. However, at that time there were known issues with the fMRIprep code that affected the generated results. It was therefore decided to proceed first with the diffusion MRI processing and to run fMRIprep across all 40000 subjects once fMRIprep had been suitably debugged and corrected.
At present, a sample dataset corresponding to 1000 participants of the UK Biobank has been processed through fMRIprep and is currently undergoing analysis and quality checks. Once verified, fMRIprep processing will begin on all 40000 subjects of the UK Biobank with imaging data. This is also anticipated to take approximately 4 months to complete, which moves the estimated completion date to Year 2 Q2.
Information about how to access the CIVET output of the UK Biobank preprocessing, available through CBRAIN and the LORIS DQT, can be found here.
Status | Meaning |
---|---|
Available | Data is accessible to NeuroHub users |
Available (raw) | The raw data is available, but derivatives may not be. |
Deploying | Data will soon be available. |
Fetching | Data is currently being transferred to Beluga |
Processing | Data is undergoing processing (e.g. format, structure). |
Not on Beluga | Data has not been downloaded. |
Are there new datafields or return datasets that you'd like to be made available? Send us an email at [email protected] with a text file with the following information in a single column:
- Datafields, prepended with 'F' (e.g.: "F20252")
- Return datasets, prepended with 'R' (e.g.: "R123")
- SNP IDs, prepended with 'S' (e.g.: "S456")
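A single-column request file in the format listed above could be assembled as in the following sketch. The helper name request_entry and the output file name request.txt are hypothetical; the identifiers are the examples given in the list.

```shell
# request_entry KIND ID
# Prints ID with the prefix for its kind: F (datafield), R (return dataset),
# or S (SNP ID). Hypothetical helper for illustration only.
request_entry() {
    case $1 in
        field)  printf 'F%s\n' "$2" ;;
        return) printf 'R%s\n' "$2" ;;
        snp)    printf 'S%s\n' "$2" ;;
        *)      printf 'unknown kind: %s\n' "$1" >&2; return 1 ;;
    esac
}

# Write one entry per line to a text file (file name is an assumption).
{
    request_entry field 20252
    request_entry return 123
    request_entry snp 456
} > request.txt
cat request.txt
```

The resulting file contains F20252, R123, and S456, one per line.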
Is there bulk data that is already authorized but unavailable? Let us know at [email protected]; we prioritize data with community interest.