Challenge Abstract

Deep Learning (DL) techniques have been used in medical imaging to improve image quality and to generate new images from reduced acquisitions. They have brought about a true revolution in the medical field, with myriad new applications appearing every year. We cannot deny the excellent outcomes these applications produce, with high-quality images and compelling results. However, when applied to medical images, most of the validation of these techniques has been done visually and/or qualitatively, and not necessarily assessed adequately in clinical studies. A key question may affect many DL applications in medical studies: “are we losing relevant quantitative clinical information when generating high-quality images with artificial intelligence techniques?”. The question concerns the validity of traditional quality measures such as the Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM) or Normalized Root Mean Squared Error (NRMSE), commonly used in medical image analysis. Strictly speaking, it is not enough that the images look alike; they must also preserve all the relevant clinical information.

In this challenge, we try to answer the question about the validity of reconstructed images in a real clinical study. To that end, we will focus on a real diffusion magnetic resonance imaging (MRI) study on migraine. Data were acquired for a clinical study carried out in a local hospital (Hospital Clinico Universitario, Valladolid, Spain) by a group of neurologists.

Migraine is a primary disabling disorder characterized by recurrent episodes of headache that usually last 4-72 hours. It is more widespread among young and middle-aged women. Despite the high prevalence of migraine, its pathophysiological mechanisms are not well known, and there are currently no biomarkers. Two types of migraine are currently distinguished: episodic migraine (EM) and chronic migraine (CM). This classification criterion is based exclusively on the number of headache days per month (15 or more days with headache per month for chronic migraine patients). The only relevant radiological findings in migraine are white matter hyperintensities observed in T2-weighted images, and their role is unclear. The advantage of migraine in a challenge like the present one is that, according to previous studies, the diffusion MRI differences with respect to healthy controls are subtle. In severe disorders such as Alzheimer’s disease or schizophrenia, it is relatively easy to find statistically significant results with classic methods (e.g. Diffusion Tensor Imaging, T1- and T2-weighted MR imaging), and thus it is difficult to identify the techniques or parameters that better characterize pathophysiological properties. There are some diffusion MRI studies assessing migraine. Diffusion Tensor Imaging (DTI) has been the most widely employed technique to evaluate microstructural properties, with differences found between controls and migraine patients (MP) and between EM and CM patients for DTI-related scalars such as fractional anisotropy (FA), mean diffusivity (MD) and axial diffusivity (AD).

With these features in mind, the principal purpose of our challenge is to determine whether DTI-based parameters generated either directly from low-quality data or from DL-augmented data can replicate the statistical findings that appear when using standard-quality data (i.e., whether part of the relevant quantitative information is missing). In this way, we can validate the usefulness of DL-based reconstruction techniques in real clinical studies.

We have selected the migraine problem due to the following reasons:

We have been collecting a large and unique database of migraine patients with a proper acquisition scheme.
We have already confirmed the brain regions where the statistically significant differences occur with the fully-sampled data, and we keep these regions as the "silver standard" for reduced acquisitions.
We collaborate with neurologists specialized in headache disorders, who assist in interpreting the data.
The findings in migraine are subtle. If we reduce the number of subjects in the study or the number of gradient directions in the acquisition protocol, the differences are greatly reduced. Thus, this database is excellent for testing the value of deep learning methods. Since some of the differences are missing, the question arises: can powerful deep learning methods recover them?

Currently, we are working with 160 volumes, all sharing a unique q-space coverage scheme that enables us to subsample the data easily by selecting an appropriate set of 21 gradient directions out of 61, without the need for interpolation algorithms. With this database, we can control the preprocessing pipeline and compare the quantitative measures obtained from the DTI under a reduced acquisition scheme to those estimated from fully-sampled data.

Challenge keywords

diffusion-weighted MRI, diffusion tensor MRI, data augmentation, deep learning, migraine, neurology, neuroimaging

Publication and future plans

We plan to publish a paper in a top-tier journal (e.g. Medical Image Analysis, Neuroimage, or Neuroimage: Clinical) presenting the results from the Challenge and further recommendations in diffusion MRI data augmentation techniques for clinical studies.

Up to three members per team, indicated by the team leader, qualify as authors of the challenge paper. The top 6 ranked teams qualify for authorship provided that they meet the following criteria:

The results show a minimum of quality.
100% of the data is properly processed.
The method follows a scientific procedure to obtain the results.

Summary:

The angular resolution (i.e. the parameter that is proportional to the reciprocal number of diffusion sensitizing gradient directions) is one of the crucial design parameters used in a diffusion MRI experiment. Depending on the method employed to represent the diffusion MRI signal, a different number of gradient directions are required to fit the basis starting from six gradients in DTI to several dozens or hundreds in High Angular Resolution Diffusion Imaging (HARDI) techniques. In clinical studies, we are generally interested in optimizing the number of gradient directions to limit the acquisition duration and guarantee the patient's comfort during the examination. However, reducing the number of gradient directions may lead to a loss of subtle changes in angular characteristics of diffusion MRI data, which translates then to quantitative measures retrieved from a fitted model.

Although several angular diffusion MRI data augmentation techniques have been proposed, and the MICCAI MUDI 2019 Challenge was even organized for diffusion-relaxation MRI signal prediction, the evaluation is mainly limited to numerical measures such as the aforementioned PSNR or SSIM indices. In other words, no matter how powerful the algorithm is, the authors' main interest remains in quantitative parameters reflecting the method's accuracy.

However, recent studies have suggested that decreasing the number of gradients leads to clinical information loss, and it becomes impossible to detect differences in various types of medical conditions. A reported key factor influencing the values of diffusion/DTI descriptors is the number of diffusion gradient orientations, which impacts the results of their statistical comparison between clinical groups.

The presented proposal is not a conventional challenge with training and validation/testing sets that asks participants for the best solution in terms of a numerical parameter showing the robustness of the methods. Instead, we share two datasets:

A fully-sampled diffusion-weighted MRI dataset acquired with 61 gradient directions at b=1000 s/mm2 coming from 60 healthy controls (HC).
A set of 100 migraine patients (MP), 50 EM and 50 CM, all acquired in a subsampled scenario with 21 gradient directions and b=1000 s/mm2. The fully sampled acquisition with 61 gradient directions coming from HC can easily be subsampled to the same 21 gradients used in the MP acquisitions by selecting the appropriate 21 directions out of 61.

The task aims to estimate three DTI-based parameters, namely FA, MD and AD, from the MP dataset acquired with 21 diffusion gradient directions at b=1000 s/mm2. Before that, however, the participants angularly augment the diffusion MRI data from 21 to 61 gradient directions to provide the most faithful representation of the signal and, consequently, of the quantitative parameters, including FA, MD and AD. The participants can evaluate their algorithm on the HC dataset with any measure they think reflects the algorithm's power, for instance the above-mentioned NRMSE or SSIM.

The participants submit three volumes (FA, MD, AD) for each of the 100 MP subjects. However, the evaluation procedure on the organizers' side is carried out in terms of a statistical test of whether significant differences between CM and EM patients can be detected. In other words, instead of evaluating the methods using the PSNR or SSIM, our goal is to assess the clinical impact of the algorithms. With this challenge, we want to answer the question of whether it is possible to show the same or a comparable level of significant differences between CM and EM diffusion MRI data acquired in a reduced scenario (i.e. 21 gradient directions) as one could find with fully-sampled data obtained with 61 gradient directions.

Ethics approval

The local Ethics Committee of Hospital Clínico Universitario de Valladolid (Valladolid, Spain) approved the study regarding the MRI acquisitions (PI: 14-197). All participants read and signed a written consent form prior to their participation.

Terms of use

All the participants will be asked to sign a document with the terms of use of the data provided for the challenge.

DATASETS

Data source

External data is not allowed to be used in the training process.

Source: All the data is acquired at a Philips Achieva 3 T MRI unit (Philips Healthcare, Best, The Netherlands) equipped with a 32-channel head coil (Laboratorio de Técnicas Instrumentales, Universidad de Valladolid) with patients from Hospital Clínico Universitario (Valladolid, Spain).

Acquisition protocol: Diffusion-weighted MRI data were acquired under a unified protocol with the following parameters: b=1000 s/mm2, 61 non-collinear diffusion-sensitizing gradient directions, one baseline acquisition at b=0, voxel size of 2x2x2 mm3, matrix size 128x128, 66 axial slices that cover the whole brain, flip angle 90°, repetition time (TR) 9000 ms, and echo time (TE) 86 ms.

Two data sets will be released:

Training dataset: A fully-sampled diffusion-weighted MRI dataset acquired with 61 gradient directions at b=1000 s/mm2 coming from 60 healthy controls. The sampling scheme allows the 61 gradient directions to be easily subsampled to 21 gradients.
Migraine dataset: A set of 50 Chronic migraine (CM) and 50 episodic migraine (EM) patients, all acquired in a subsampled scenario with 21 gradient directions and b=1000 s/mm2.

All datasets were anonymized.

How to subsample the 61 gradient directions:

The gradient directions are properly ordered so that the first 21 gradients are also evenly distributed in the sphere.
To subsample the 61 gradient directions, users simply take the first 21 gradient directions. (NOTE: since the first volume of the NIfTI file is the baseline, volumes 2 to 22 must be taken accordingly.)
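The selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `subsample_fsl_gradients` is hypothetical, the inputs are the text contents of FSL-style dwi.bval (one row) and dwi.bvec (three rows) files, and the baseline is assumed to be the first entry, as stated in the note.

```python
def subsample_fsl_gradients(bval_text, bvec_text, n_dirs=21):
    """Keep the baseline plus the first n_dirs diffusion-weighted entries.

    Assumes FSL layout: bvals on one row, bvecs on three rows (x, y, z),
    with the b=0 baseline stored first.
    """
    bvals = bval_text.split()
    bvec_rows = [row.split() for row in bvec_text.strip().splitlines()]
    keep = 1 + n_dirs  # baseline + directions, i.e. volumes 1..22 (1-based)
    if len(bvals) < keep or any(len(row) < keep for row in bvec_rows):
        raise ValueError("not enough volumes to subsample")
    if float(bvals[0]) != 0.0:
        raise ValueError("expected the first volume to be the b=0 baseline")
    sub_bvals = " ".join(bvals[:keep])
    sub_bvecs = "\n".join(" ".join(row[:keep]) for row in bvec_rows)
    return sub_bvals, sub_bvecs
```

The same index range (the first 22 volumes) would then be applied along the fourth axis of dwi.nii.gz, so that the image, b-values and gradient directions stay aligned.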

For each subject, a folder with the following files is provided:

dwi.nii.gz. This file contains the baseline volume (b = 0) and the diffusion-weighted volumes. These volumes have been already preprocessed as detailed later.
dwi.bval. File that shows the b-values for each volume included in dwi.nii.gz . The possible b-values are 0 and 1000 s/mm2. The file follows the format employed in FSL.
dwi.bvec. File that shows the diffusion gradient directions for each volume included in dwi.nii.gz . These directions have been updated compared to the original values after the application of the diffusion preprocessing pipeline. The file follows the format employed in FSL.
dwi_mask.nii.gz. File that shows the brain mask for each subject. This mask has been obtained with MRtrix (command detailed later) after the application of the preprocessing pipeline.

Each subject will be stored in a different folder with the following names:

Healthy controls: HC00XX (where XX goes from 01 to 60). Example: HC0027/
Migraine patients: MP00XX, (where XX goes from 01 to 100). Example: MP0032/

Selection of subjects

Healthy controls were recruited by convenience sampling and snowball sampling. Controls with a history of migraine, other headache disorders different to infrequent tension-type-headache (less than one attack per month), or a history of other neurological or psychiatric disorders were excluded. Healthy controls were aged between 18 and 65 years. Additionally, a questionnaire was provided to the controls to assess whether they suffered from headaches with migraine features.

Migraine patients were recruited by a neurologist specialized in headache disorders at their first visit. These patients had been referred to the Headache Unit at the Hospital Clínico Universitario de Valladolid (Valladolid, Spain) because of migraine. Patients were included after a definite diagnosis of episodic migraine or chronic migraine according to the third edition of the International Classification of Headache Disorders (ICHD-3) [7, 8], version ICHD-3 beta for the first patients, and version ICHD-3 for the last patients (the ICHD-3 was updated during the recruitment period). No methods or tests other than the anamnesis were applied to diagnose migraine, considering that there are no actual migraine biomarkers and that the current diagnosis of migraine is based exclusively on clinical symptoms.

Preprocessing

In order to avoid any influence of the preprocessing steps in the results of the challenge, all the data volumes underwent the same preprocessing pipeline, i.e. denoising using the MP-PCA approach [1], corrections of eddy currents and motion artifacts [2], and B1 field inhomogeneity [3], all done with the MRtrix3 software [4]. Specifically:

MRtrix, release 3.0 (version available in November 2017), was used for diffusion preprocessing and brain mask extraction. It is worth noting that some commands shown in this section have changed for the latest version of MRtrix. The following steps were carried out:

Marchenko-Pastur Principal Component Analysis (MP-PCA) denoising.
Correction for eddy currents and motion.
Correction for B1 field inhomogeneities.

The commands employed for these steps were:

dwidenoise dwi_original.nii.gz dwi_den.nii.gz

dwipreproc -fslgrad dwi_original.bvec dwi.bval -export_grad_fsl dwi.bvec dwi_updated.bval -rpe_none PA dwi_den.nii.gz dwi_corr.nii.gz

dwibiascorrect dwi_corr.nii.gz dwi.nii.gz -fsl -fslgrad dwi.bvec dwi_updated.bval

The brain mask was extracted with the following command:

dwi2mask dwi.nii.gz dwi_mask.nii.gz -fslgrad dwi.bvec dwi_updated.bval

Calculation of metrics

For the analysis carried out in this challenge, only three DTI-derived metrics are considered:

Fractional anisotropy (FA)
Axial Diffusivity (AD)
Mean Diffusivity (MD)

These metrics were selected for being the ones detecting significant differences in the preliminary clinical study with migraine patients.
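For reference, the three metrics follow from the eigenvalues of the diffusion tensor through the standard definitions. The sketch below is illustrative only; the function name `dti_scalars` is hypothetical, and the eigenvalues are assumed to be sorted so that the first one is the largest.

```python
import math


def dti_scalars(l1, l2, l3):
    """Return (FA, MD, AD) from tensor eigenvalues with l1 >= l2 >= l3."""
    md = (l1 + l2 + l3) / 3.0  # mean diffusivity: average eigenvalue
    ad = l1                    # axial diffusivity: largest eigenvalue
    denom = l1 * l1 + l2 * l2 + l3 * l3
    if denom == 0.0:
        return 0.0, md, ad     # no diffusion signal: define FA as 0
    num = (l1 - md) ** 2 + (l2 - md) ** 2 + (l3 - md) ** 2
    fa = math.sqrt(1.5 * num / denom)  # fractional anisotropy in [0, 1]
    return fa, md, ad
```

FA is 0 for a perfectly isotropic tensor and 1 when diffusion occurs along a single axis, whereas MD and AD keep the units of the eigenvalues.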

For training, the participants are asked to obtain the three DTI-based parameters using the following FSL command (inside the folder of each subject):

dtifit -k dwi.nii.gz -o dti_files/dwi -m dwi_mask.nii.gz -r dwi.bvec -b dwi.bval

With this command, a set of compressed NIfTI files is stored in the folder specified by the user. Following the notation employed in this example, the files assessed in this challenge are dwi_FA.nii.gz, dwi_L1.nii.gz (AD or first eigenvalue), and dwi_MD.nii.gz, all of them stored in subject_path/dti_files/.
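When the command has to be run for many subjects, it may help to build the argument list and the expected output paths programmatically. The following is a minimal sketch; the helper names `dtifit_command` and `metric_paths` are hypothetical, and only the flags and file names already shown above are used.

```python
from pathlib import Path

# dtifit output suffix for each challenge metric (AD is the first eigenvalue, L1).
DTIFIT_SUFFIX = {"FA": "FA", "MD": "MD", "AD": "L1"}


def dtifit_command(subject_dir, out_dir="dti_files", basename="dwi"):
    """Build the dtifit argument list for one subject folder."""
    subject_dir = Path(subject_dir)
    return [
        "dtifit",
        "-k", str(subject_dir / "dwi.nii.gz"),
        "-o", str(subject_dir / out_dir / basename),
        "-m", str(subject_dir / "dwi_mask.nii.gz"),
        "-r", str(subject_dir / "dwi.bvec"),
        "-b", str(subject_dir / "dwi.bval"),
    ]


def metric_paths(subject_dir, out_dir="dti_files", basename="dwi"):
    """Map each challenge metric to the file dtifit is expected to write."""
    root = Path(subject_dir) / out_dir
    return {m: root / f"{basename}_{s}.nii.gz" for m, s in DTIFIT_SUFFIX.items()}
```

The returned list could be passed to `subprocess.run(cmd, check=True)` on a machine where FSL is installed.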

Note that we will also use this command to calculate the metrics for the 21- and 61-gradient volumes of the MP dataset. These values will be used in the clinical studies that serve as references, so we recommend using the same command to calculate the final metrics.

Postprocessing

For the evaluation of the results, all datasets will be non-linearly registered to the common template (i.e. FMRIB58_FA template being a high-resolution FA parameter averaged over 58 subjects) using the FSL FNIRT tool [5] in Montreal Neurological Institute (MNI) space.

The FNIRT tool uses a b-spline representation of the registration warp field. After the registration, a mean FA image was generated and thinned to create a mean FA skeleton of white matter tracts using a FA value of 0.2 as a threshold to distinguish white from gray matter. Then, each subject’s aligned FA images were projected onto the mean FA skeleton. Similarly, the same process was repeated for MD and AD using the protocol devoted to non-FA images. The Johns Hopkins University ICBM-DTI-81 White-Matter Labels Atlas provided in the FSL toolbox was used to identify the white matter tracts. We executed group-wise comparisons of CM vs EM.

TASK

DTI-BASED MEASURES ESTIMATION FROM REDUCED ACQUISITIONS

Participants are expected to estimate three DTI-based parameters (FA, MD and AD) from the migraine dataset acquired with 21 diffusion gradient directions at b=1000 s/mm2, but with a quality similar to the parameters estimated from 61 gradient directions. To that end:

They can use the training data set to angularly augment the diffusion MRI data from 21 to 61 gradient directions to provide the most faithful representation of the signal and consequently the quantitative parameters, including FA, MD and AD. Deep Learning methods are recommended here.
Then, they will apply the method to the migraine dataset. The participants will submit three volumes (FA, MD, AD) for the 50 EM and the 50 CM subjects.

Submission

The data submitted by each group must meet the following requirements:

Metrics: only FA, MD and AD volumes for each patient of the upgraded data must be provided.
Format: The data must be stored in NIfTI files, compressed with gzip.
All data must be stored inside a folder with the name of the group.
The files are named as follows: [metric]_[subject_reference].nii.gz, where [metric] stands for FA, MD or AD. For instance:

FA_MP0027.nii.gz

MD_MP0001.nii.gz

AD_MP0054.nii.gz

A method to submit the data will be provided in the following months.
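Before submitting, the folder contents can be checked against the naming convention. The sketch below is not an official validator; the function names are hypothetical, and it assumes that subject references run from MP0001 to MP0100, as in the dataset description.

```python
METRICS = ("FA", "MD", "AD")


def expected_submission_files(n_subjects=100):
    """List every file name a complete submission should contain."""
    return [
        f"{metric}_MP{index:04d}.nii.gz"
        for index in range(1, n_subjects + 1)
        for metric in METRICS
    ]


def check_submission(filenames, n_subjects=100):
    """Return (missing, unexpected) file names for a submission folder."""
    expected = set(expected_submission_files(n_subjects))
    found = set(filenames)
    return sorted(expected - found), sorted(found - expected)
```

The file names could come from `os.listdir` on the group folder; two empty lists indicate that all 300 volumes are present and correctly named.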

ASSESSMENT

Metric

The purpose of the challenge is to validate methods in a real clinical study. So, the results of the challenge will be evaluated using the aggregation of two metrics, based on tools typically used for clinical studies with DTI data:

1.- Skeleton Metric: The quality of the results is measured based on a statistical test carried out with FSL TBSS:

As a gold standard, we consider the statistical test results on original diffusion MRI data (i.e. 61 gradient directions). Specifically, we will consider EM patients vs CM patients.
For each of the datasets provided by the participants, the same procedure will be repeated:
The voxel-wise TBSS differences in FA, MD, and AD values of white matter between the two groups will be tested using a permutation-based inference tool by nonparametric statistics called randomise, implemented in FSL, with the threshold-free cluster enhancement (TFCE) option. Five thousand permutations will be set to allow robust statistical inference. The significance threshold for intergroup differences will be set to p < 0.05 after correcting for family-wise error (FWE) applying the TFCE option.
This evaluation will be carried out in the common space, with shared warpings used to transform the measures from their native spaces. The same FA skeleton will be used here. Then, we will execute a group-wise comparison between CM and EM subjects and obtain a p-value at each skeleton point.
We use the skeleton metric in order to measure the quality of the proposed measure (measure={FA, MD, AD}):

SM_i = 1 - FP/NP - FN/NP

with

FP - false positives of the submission, defined as those points with p < 0.05 in the participant’s study and p > 0.05 in the original study,

FN - false negatives of the submission, defined as those points with p > 0.05 in the participant’s study and p < 0.05 in the original study,

NP - number of points where the evaluation takes place.

Since three metrics will be obtained (one for each measure: FA, MD and AD), the final skeleton metric will be the average of the three.

SM_T = (SM_{FA} + SM_{MD} + SM_{AD}) / 3
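The skeleton metric defined above can be illustrated with a short Python sketch. It assumes the corrected p-values of the submission and of the reference study are available as two equally long sequences, one value per skeleton point. The function names are hypothetical, and a point with p exactly equal to 0.05 is treated as non-significant, which is an assumption because the definitions above leave that case open. Note also that FSL randomise stores corrected p-values as 1-p, so its output would need converting first.

```python
def agreement_metric(p_submission, p_reference, alpha=0.05):
    """Compute 1 - FP/NP - FN/NP for one measure (FA, MD or AD)."""
    if len(p_submission) != len(p_reference):
        raise ValueError("both studies must cover the same points")
    n_points = len(p_reference)
    if n_points == 0:
        raise ValueError("at least one evaluation point is required")
    # False positive: significant in the submission only.
    fp = sum(ps < alpha and pr >= alpha
             for ps, pr in zip(p_submission, p_reference))
    # False negative: significant in the reference only.
    fn = sum(ps >= alpha and pr < alpha
             for ps, pr in zip(p_submission, p_reference))
    return 1.0 - fp / n_points - fn / n_points


def total_metric(per_measure):
    """Average the per-measure scores for FA, MD and AD."""
    return (per_measure["FA"] + per_measure["MD"] + per_measure["AD"]) / 3.0
```

A submission that reproduces the reference significance map exactly scores 1, and every disagreeing point lowers the score by 1/NP.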

2.- Region-of-interest metric: the second metric is based on a region-oriented analysis.

As a gold standard, we consider the statistical test results on original diffusion MRI data (i.e. 61 gradient directions). Specifically, we will consider EM patients vs CM patients.
For each of the measures (measure={FA, MD, AD}) provided by the participants, the same procedure will be repeated:
The metric's value inside each ROI will be calculated for each patient. Then, we will execute a group-wise comparison between CM and EM subjects and obtain a p-value per region.
We use the ROI-based metric in order to measure the quality of the proposed measure (measure={FA, MD, AD}):

RoM_i = 1 - FP/48 - FN/48

where 48 is the number of regions considered. Since three metrics will be obtained (one for each measure: FA, MD and AD), the final ROI metric will be the average of the three.

RoM_T = (RoM_{FA} + RoM_{MD} + RoM_{AD}) / 3
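The region-oriented analysis can be sketched as two small steps: summarizing a measure inside each region for one patient, and counting disagreements over the 48 regions. The text does not state how the per-ROI value is summarized, so the mean used here is an assumption, and the function names are hypothetical. Labels and values are assumed to be flattened, voxel-aligned sequences, with label 0 denoting background.

```python
def roi_means(labels, values):
    """Mean of a measure inside each non-zero atlas label (assumed summary)."""
    sums, counts = {}, {}
    for label, value in zip(labels, values):
        if label == 0:  # background voxels are ignored
            continue
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def roi_metric(p_submission, p_reference, n_regions=48, alpha=0.05):
    """Compute 1 - FP/48 - FN/48 from per-region p-values keyed by label."""
    fp = sum(p_submission[r] < alpha and p_reference[r] >= alpha
             for r in p_reference)
    fn = sum(p_submission[r] >= alpha and p_reference[r] < alpha
             for r in p_reference)
    return 1.0 - fp / n_regions - fn / n_regions
```

The per-patient region means would feed the group-wise CM vs EM test, whose per-region p-values are then compared against those of the reference study.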

Justification of the metrics:

The purpose of the challenge is to measure how well the DL-reconstructed images mimic the original ones in a clinical study. The best method is the one for which the study with the reconstructed data shows statistical differences in exactly the same areas of the brain as the study with the original data.

To that end, the metrics proposed are based on the following considerations:

The clinical study is replicated for the reconstructed data. The p-values are calculated, and statistical differences are obtained.
To measure the proposed method's sensitivity and specificity, we use the ratio of false negatives and false positives.
A false negative is when the method decides that there are no differences, but the original study detects a difference. A reconstruction method that produces lots of false negatives is a method in which the differential quantitative information of the patients is lost. Thus, the ratio FN/NP will be an inverse measure of the sensitivity of the method. The greater FN, the less sensitive the method.

On the other hand, it is vital to measure the false positive ratio. It will give an idea of the reliability of the method.

We decided to use two different metrics, one skeleton oriented and one ROI oriented, in order to replicate the usual studies typically carried out with diffusion MRI data in clinical environments.

Ranking methods

Both metrics will be used to compute the ranking following the formula:

This metric gives a number between 0 and 1. The higher the value of the general metric, the better the solution. If two groups obtain the same general metric, the following procedure will be followed:

The group with the lower total number of FP in both metrics will be ranked higher.
If there is still no resolution, the group with the higher Skeleton metric will be ranked higher.
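The ranking and tie-breaking rules can be expressed as a single sort key. In this sketch the general metric is taken as an already computed input, because its formula is not reproduced in this text; the `TeamResult` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TeamResult:
    name: str
    general: float   # general metric in [0, 1], computed elsewhere
    skeleton: float  # final skeleton metric
    total_fp: int    # false positives summed over both metrics


def rank_teams(results):
    """Order teams: higher general metric, then fewer FP, then higher skeleton."""
    return sorted(
        results,
        key=lambda r: (-r.general, r.total_fp, -r.skeleton),
    )
```

Teams that remain tied on all three criteria keep their input order, since Python's sort is stable.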

QuaD22