The study contains two modules. First, we built and validated an infant deep learning segmentation framework (ID-Seg) to segment the infant hippocampus and amygdala on T2-weighted (T2w) MRI brain scans (Fig. 1a). Second, we conducted a proof-of-concept analysis to explore prospective associations between brain structure in infancy and behavioral problems at age 2 (Fig. 1b).
Multi-rater manual segmentation
Three research assistants (KS, GKCAA, EB) received instruction and training from a board-certified radiologist (AM) to perform infant hippocampus and amygdala segmentation using ITK-SNAP software [28]. Manual segmentation protocols are available in the Additional file 1: Manual Segmentation Protocols using ITK-SNAP. Inter-rater reliability for these manual tracings was assessed with the Dice similarity coefficient (DSC), and we required a minimum DSC of 0.6 for each brain region before proceeding. Based on all three raters' manual segmentations, a bilateral reference manual segmentation for the amygdala and hippocampus was generated with the Simultaneous Truth And Performance Level Estimation (STAPLE) algorithm [29]. STAPLE is an expectation–maximization algorithm that estimates the optimal combination of segmentations based on each rater's performance level. We thus established the "ground truth" segmentation using STAPLE output based on all three study raters rather than any single rater. We visually inspected the STAPLE output and edited it where needed.
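For illustration, a minimal sketch (not the study code) of the pairwise inter-rater DSC check is shown below; the file names are hypothetical, and the subsequent STAPLE fusion step is not shown.

```python
# Minimal sketch (not the study code) of the pairwise inter-rater DSC check; file
# names are hypothetical, and the subsequent STAPLE fusion step is not shown.
import itertools

import nibabel as nib
import numpy as np


def dice(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * intersection / (mask_a.sum() + mask_b.sum())


# Three raters' hippocampus tracings of the same scan (hypothetical file names).
files = ["rater1_hippocampus.nii.gz", "rater2_hippocampus.nii.gz", "rater3_hippocampus.nii.gz"]
masks = [nib.load(f).get_fdata() > 0 for f in files]

# Every pairwise DSC must reach the 0.6 threshold before proceeding.
for (i, a), (j, b) in itertools.combinations(enumerate(masks), 2):
    print(f"rater {i + 1} vs rater {j + 1}: DSC = {dice(a, b):.3f}")
```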
Infant deep learning segmentation (ID-Seg)
We adopted a transfer-learning approach to train and test ID-Seg on multiple datasets. As shown in Fig. 1a, ID-Seg was initially trained (termed “pre-training” in the AI literature) on our training dataset, consisting of 473 T2w infant MRI scans. We then tested this trained model on our internal dataset (ECHO-Dataset 1) and external dataset (M-CRIB). For our internal dataset, we used the multi-rater manual segmentation framework described above to generate manual segmentations; for our external dataset, researchers from an independent group [7] generated manual segmentations and made them publicly available (https://osf.io/4vthr/). All deep learning models described below were written in Python using PyTorch libraries, and all training procedures were performed on an NVIDIA GeForce Titan RTX GPU workstation. Relevant code is openly available in a GitHub repository (https://github.com/wangyuncolumbia/ID-Seg-V2).
Model architecture
We used a multi-view fully convolutional neural network, the most cited MRI brain segmentation model [18]. This model was initially developed for adult whole-brain segmentation and has been shown to segment small subcortical structures well, aided by skip connections and unpooling layers in the decoding path. A flowchart of this model’s architecture can be found in Additional file 1: Fig. S1, and more detailed information can be found in the original work [18]. Specifically, we trained three 2D CNN models separately, one for each of the three principal views (axial, coronal, and sagittal); each 2D CNN has the same architecture. Finally, we merged the predicted probabilities from the multi-view models using formula (1) below to obtain the final predicted label for each voxel:
$$L_{\mathrm{Pred}}(x) = \operatorname{argmax}\left(\lambda_{1}\, p_{\mathrm{Axial}}(x) + \lambda_{2}\, p_{\mathrm{Coronal}}(x) + \lambda_{3}\, p_{\mathrm{Sagittal}}(x)\right). \quad (1)$$
In formula (1), \(p_{\mathrm{Axial}}(x)\), \(p_{\mathrm{Coronal}}(x)\), and \(p_{\mathrm{Sagittal}}(x)\) are the predicted probabilities of a voxel from the axial, coronal, and sagittal deep learning models, respectively. We set the weights \(\lambda_{1}\), \(\lambda_{2}\), and \(\lambda_{3}\) to 0.4, 0.4, and 0.2, respectively.
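A minimal sketch of this fusion step, assuming each view's model outputs a per-voxel probability map with labels along the first axis, is shown below; it is not the released ID-Seg code.

```python
# Minimal sketch of the fusion in formula (1), assuming each view's model outputs a
# per-voxel probability map of shape (n_labels, D, H, W); not the released ID-Seg code.
import numpy as np

LAMBDA_AXIAL, LAMBDA_CORONAL, LAMBDA_SAGITTAL = 0.4, 0.4, 0.2  # weights from the text


def fuse_views(p_axial: np.ndarray, p_coronal: np.ndarray, p_sagittal: np.ndarray) -> np.ndarray:
    """Weighted sum of view-wise probabilities, then argmax over labels per voxel."""
    fused = (LAMBDA_AXIAL * p_axial
             + LAMBDA_CORONAL * p_coronal
             + LAMBDA_SAGITTAL * p_sagittal)
    return np.argmax(fused, axis=0)  # L_Pred(x) for every voxel x
```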
When using this model, there is no fixed dimension requirement for the input MRI images; however, each input dimension should be divisible by 16 because the model consists of 4 down-sampling layers, each of which reduces the image size by a factor of 2. Therefore, for each dataset in this project, we adjusted the input image size to meet this divisibility rule by cropping background borders (equally from both sides) or, if the input size was substantially smaller than that of the training dHCP dataset, by up-sampling the field of view to ensure a fair comparison. Detailed information can be found in Additional file 1: Table S1.
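For illustration, a minimal cropping sketch that enforces this divisibility rule, assuming a 3D NumPy volume with background borders, could look like the following (the up-sampling alternative is not shown).

```python
# Minimal sketch: crop equally from both sides of each axis so that every dimension
# becomes divisible by 16 (the up-sampling alternative mentioned above is not shown).
import numpy as np


def crop_to_multiple_of_16(volume: np.ndarray) -> np.ndarray:
    slices = []
    for dim in volume.shape:
        excess = dim % 16              # voxels to remove along this axis
        lo = excess // 2               # trim half from the front...
        hi = dim - (excess - lo)       # ...and the remainder from the back
        slices.append(slice(lo, hi))
    return volume[tuple(slices)]


# Example: a 150 x 192 x 155 volume becomes 144 x 192 x 144.
print(crop_to_multiple_of_16(np.zeros((150, 192, 155))).shape)
```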
Model learning
Initial network training
The goal of first training ID-Seg on a large training dataset was to provide robust weight initialization. We randomly split this training dataset into two parts: 80% for training and 20% for validating model performance. As noted above, we applied the dHCP structural segmentation pipeline, which bilaterally segments and labels 87 regions within the infant brain, including the hippocampus and amygdala. We started with these automated segmentations because we reasoned that this large sample would provide a strong prior initialization of the network, allowing us to then optimally use the smaller sample of manually segmented scans to achieve high segmentation accuracy. We anticipated that the segmentations from the automated software (i.e., dHCP) would not be as accurate as the manual annotations; however, they would allow our model to recognize a wide range of morphological variations in brain structures. This training procedure provides strong prior weights for the ID-Seg network, with robustness to data heterogeneity enhanced by the diversity of the training dataset (e.g., different scanners and sites). Each 2D model has 3,520,871 trainable parameters, for a total of 10,562,613 trainable parameters in the multi-view 2D model. During training, we selected a set of model hyperparameters, including the number of epochs, dropout rate, convolutional kernel size, optimizer, learning rate, loss function, and batch size, and chose the configuration that achieved the best performance on the validation dataset. The optimal hyperparameter configuration can be found in Additional file 1: Table S2.
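As an aside, trainable-parameter counts of this kind can be reproduced in PyTorch with a short helper; the placeholder network below is purely illustrative and is not the ID-Seg architecture.

```python
# Minimal sketch (PyTorch): counting trainable parameters, as reported above
# (3,520,871 per view; 10,562,613 across the three views). The placeholder network
# stands in for one of the study's 2D CNNs, which is not reproduced here.
import torch.nn as nn


def count_trainable(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Example with a toy network; the real per-view count would be 3,520,871.
view_model = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 9, 1))
print(count_trainable(view_model))
```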
Internal fine-tuning and leave-one-out cross-validation
We next took the initially trained ID-Seg and fine-tuned it on our internal dataset (ECHO-Dataset 1). Specifically, we initialized the network with the weights of the initially trained ID-Seg and trained it for 5 epochs with only the last few layers unfrozen, to avoid propagating errors from randomly initialized weights and to save computation time. Lastly, we unfroze all layers and fine-tuned the whole network for another 15 epochs. We evaluated ID-Seg's performance against the manual segmentations of the internal dataset generated with our multi-rater framework, using the leave-one-out cross-validation (LOOCV) technique. The training loss versus epochs plot for this dataset can be found in Additional file 1: Fig. S2a. The hyperparameters used to fine-tune the network, namely learning rate, batch size, and number of epochs, were \(5\times 10^{-4}\), 8, and 15, respectively.
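A minimal PyTorch sketch of this two-stage schedule is given below; `model`, `head_params`, and `train_one_epoch` are hypothetical stand-ins and do not come from the released ID-Seg code.

```python
# Minimal sketch (PyTorch) of the two-stage fine-tuning schedule described above.
# `model` is assumed to be a pre-trained ID-Seg view network, `head_params` the
# parameters of its "last few layers", and `train_one_epoch` a hypothetical helper.
import torch


def fine_tune(model, head_params, train_one_epoch, lr=5e-4):
    # Stage 1: freeze everything except the last few layers and train for 5 epochs.
    for p in model.parameters():
        p.requires_grad = False
    for p in head_params:
        p.requires_grad = True
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=lr)
    for _ in range(5):
        train_one_epoch(model, opt)

    # Stage 2: unfreeze all layers and fine-tune the whole network for 15 more epochs.
    for p in model.parameters():
        p.requires_grad = True
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(15):
        train_one_epoch(model, opt)
```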
External fine-tuning and leave-one-out cross-validation
To test the reliability of ID-Seg, we applied it to our external dataset (M-CRIB) and evaluated the accuracy of its segmentations against manual segmentations performed by an independent group. The training loss versus epochs plot for this dataset can be found in Additional file 1: Fig. S2b. We used hyperparameters similar to those used for internal fine-tuning.
For comparison, we also segmented the internal and external datasets with (1) ID-Seg without pre-training on the dHCP dataset and (2) the automated dHCP pipeline, which uses an expectation–maximization approach rather than deep learning.
Segmentation evaluations and comparisons
We calculated three commonly used evaluation metrics to compare the segmentation output of ID-Seg against manual segmentations: Dice similarity coefficient (DSC), intra-class correlation (ICC), and average surface distance (ASD). DSC measures the overlap, and thus the similarity, between two segmented images [30]. ICC is a measure of consistency between two raters, or in this case, two segmented images [31]. ASD is a surface-based metric that measures the average Hausdorff distance over all points between the surface of a predicted structure (i.e., ID-Seg's segmentation of the hippocampus and amygdala) and that of the "ground truth" (i.e., the manually segmented hippocampus and amygdala). Relevant code can be found at https://github.com/deepmind/surface-distance. For each structure, we calculated DSC, ICC, and ASD to compare the output of each method with manual segmentation. One-way ANOVA tests were used to compare the accuracy of the three methods (ID-Seg without pre-training, ID-Seg with pre-training, and the dHCP pipeline) on DSC, ICC, and ASD.
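For illustration, DSC and ASD can be computed with the surface-distance package linked above roughly as follows; the synthetic masks and voxel spacing are placeholders, and the ICC step (computed across subjects on regional measures) is not shown.

```python
# Minimal sketch of DSC and ASD using the surface-distance package referenced above;
# the toy masks and voxel spacing are placeholders, and ICC is not shown here.
import numpy as np
import surface_distance

# Toy "ground truth" and prediction masks for one structure.
mask_gt = np.zeros((64, 64, 64), dtype=bool)
mask_gt[20:40, 20:40, 20:40] = True
mask_pred = np.roll(mask_gt, 2, axis=0)  # slightly shifted prediction, for illustration

spacing_mm = (1.0, 1.0, 1.0)  # voxel spacing of the scan

dsc = surface_distance.compute_dice_coefficient(mask_gt, mask_pred)
dists = surface_distance.compute_surface_distances(mask_gt, mask_pred, spacing_mm)
asd = surface_distance.compute_average_surface_distance(dists)  # (gt->pred, pred->gt) in mm
print(dsc, asd)
```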
Brain and behavior relationship—a proof-of-concept analysis
Volumetric and shape analysis for the hippocampus and amygdala
We applied the optimized version of ID-Seg to directly segment infant MRI scans in our proof-of-concept dataset (ECHO-Dataset 2), that is, infant MRI scans that were not used in any of the previous training/testing procedures (Fig. 1b). Using ID-Seg, we calculated volumetric and shape measurements for the bilateral hippocampus and amygdala. The volume (in mm³) of each region was adjusted with respect to total brain volume. We then performed shape analysis for each structure using SlicerSALT software (Kitware, Inc., United States), using an average spherical harmonics description (SPHARM) to represent the shape measurements of each 3D structure [32].
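A minimal sketch of the volumetric step is shown below; it is not the study pipeline, and the file names, label value, and brain-mask source are assumptions for illustration only.

```python
# Minimal sketch of the volumetric step (not the study pipeline): file names, the
# label value, and the brain-mask source are all assumptions for illustration.
import nibabel as nib
import numpy as np

seg = nib.load("idseg_output.nii.gz")                    # hypothetical ID-Seg label map
voxel_mm3 = float(np.prod(seg.header.get_zooms()[:3]))   # single-voxel volume in mm^3
labels = seg.get_fdata().astype(int)

left_hippo_mm3 = (labels == 1).sum() * voxel_mm3         # label 1 = left hippocampus (assumed)

brain_mask = nib.load("brain_mask.nii.gz").get_fdata() > 0  # hypothetical brain mask
total_brain_mm3 = brain_mask.sum() * voxel_mm3

adjusted_volume = left_hippo_mm3 / total_brain_mm3       # volume relative to total brain volume
```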
Brain and behavior relationship
We conducted Spearman rank partial correlation analyses to examine prospective associations between morphometric measures of the hippocampus and amygdala in infancy and behavioral outcomes at age 2. The behavioral outcomes included internalizing, externalizing, and total behavioral problems, assessed using T scores from the parent-report Child Behavior Checklist (CBCL). We adjusted for postmenstrual age of the infant at the MRI scan, maternal education, and maternal post-partum mood symptoms (as indexed by the 10-item Edinburgh Postnatal Depression Scale [33]). Infant sex was not adjusted for because it is already accounted for in CBCL T scores.
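A minimal sketch of this analysis, assuming the pingouin package and illustrative column and file names (not the study's actual variables), might look like the following.

```python
# Minimal sketch of the Spearman rank partial correlation, assuming the pingouin
# package; the file name and column names are illustrative, not the study's.
import pandas as pd
import pingouin as pg

df = pd.read_csv("echo_dataset2_measures.csv")  # one row per infant (hypothetical file)

result = pg.partial_corr(
    data=df,
    x="left_hippocampus_volume",       # infant brain measure
    y="cbcl_internalizing_t",          # CBCL T score at age 2
    covar=["postmenstrual_age_at_scan", "maternal_education", "epds_total"],
    method="spearman",
)
print(result[["r", "p-val"]])
```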