Fig. 3

HDC (a) late fusion and (b) early fusion architectures for a three-modality emotion recognition system. The late fusion architecture fuses after the temporal encoder, resulting in three parallel temporal encoders, one per modality. In comparison, the early fusion architecture fuses before the temporal encoder, resulting in only one temporal encoder.
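
The structural difference between the two panels can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes bipolar hypervectors, fusion by element-wise majority (bundling), and a permute-and-bind n-gram as the temporal encoder. The names `random_hv`, `bundle`, `temporal_encode`, `late_fusion`, and `early_fusion`, the dimension `D`, and the random stand-ins for the per-modality spatial-encoder outputs are all hypothetical.

```python
import numpy as np

D = 1000          # hypervector dimension (assumed for illustration)
N_MODALITIES = 3  # three-modality system, as in the caption
N_STEPS = 4       # time steps per temporal window (assumed)

rng = np.random.default_rng(0)


def random_hv():
    """Random bipolar hypervector, standing in for a spatial-encoder output."""
    return rng.choice([-1, 1], size=D)


def bundle(hvs):
    """Fuse hypervectors by element-wise majority (sign of the sum)."""
    total = np.sum(hvs, axis=0)
    return np.where(total >= 0, 1, -1)  # ties, if any, resolve to +1


def temporal_encode(seq):
    """N-gram temporal encoder (assumed): rotate older steps more, then bind."""
    n = len(seq)
    out = np.ones(D, dtype=int)
    for i, hv in enumerate(seq):
        out = out * np.roll(hv, n - 1 - i)
    return out


def late_fusion(streams):
    """One temporal encoder per modality; fuse the encoder outputs."""
    return bundle([temporal_encode(seq) for seq in streams])


def early_fusion(streams):
    """Fuse the modalities at each time step; one temporal encoder overall."""
    fused_steps = [bundle(list(step)) for step in zip(*streams)]
    return temporal_encode(fused_steps)


# streams[m][t] is the hypervector of modality m at time step t.
streams = [[random_hv() for _ in range(N_STEPS)] for _ in range(N_MODALITIES)]
late = late_fusion(streams)
early = early_fusion(streams)
```

In this sketch `late_fusion` calls `temporal_encode` once per modality (three times), whereas `early_fusion` calls it once on the already-fused sequence, which mirrors the encoder counts stated in the caption.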