HDC early fusion architecture
The HDC physiological architecture includes four main blocks: the mapping into the hyperdimensional space (HDS), the spatial encoder, the temporal encoder, and the associative memory, as shown in Fig. 2. The first block maps incoming data into the HDS using an item memory or a generator. HDC relies on the pseudo-orthogonality of random vectors to distinguish between classes: two independently drawn random vectors are nearly orthogonal in the hyperdimensional space. Random vectors are used as the channel item memory vectors so that the source channel of a feature value can be included as information in the encoding process. These are stored in an item memory (iM).
To encode feature values, this implementation additionally generates random feature projection vectors for each channel and stores them as well. In traditional architectures, a ternary feature projection vector with elements in {−1, 0, 1} is multiplied by the feature value, and the result is binarized by reducing positive values to 1s and zeros and negative values to 0s. This process can be simplified to multiplexers that select between a pre-generated random negative or positive binary feature projection vector depending on the sign of the feature value, eliminating computationally expensive multipliers. The feature projection vectors thereby maintain pseudo-orthogonality but have the same sparsity as the item memory vectors, making them interchangeable. As a result, the feature projection vectors can also be stored in the item memory instead of separately.
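As an illustration, a minimal Python/NumPy sketch of this multiplexer-style selection is given below. The dimension, channel count, and names (`iM`, `PFP`, `NFP`, `feature_projection`) are illustrative rather than taken from the implementation, and later sketches in this section reuse these definitions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000          # hypervector dimension used in the baseline datapath
n_channels = 4      # illustrative; AMIGOS/DEAP use 214/238 channels

# One random binary iM vector and one pre-generated positive/negative
# feature projection pair per channel (all stored in the item memory).
iM  = rng.integers(0, 2, size=(n_channels, D), dtype=np.uint8)
PFP = rng.integers(0, 2, size=(n_channels, D), dtype=np.uint8)
NFP = rng.integers(0, 2, size=(n_channels, D), dtype=np.uint8)

def feature_projection(channel: int, value: float) -> np.ndarray:
    # Mux: pick the positive or negative projection vector by sign,
    # replacing the multiply-then-binarize step of the traditional scheme.
    return PFP[channel] if value >= 0 else NFP[channel]
```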
In the spatial encoder, the binding operation (XOR) is used to generate a spatially encoded hypervector for each channel. If iM\(_i\) represents the item memory vector for channel i and FP\(_{i,j}\) represents the feature projection vector selected for channel i for sample j, then the spatially encoded hypervector SE\(_{i,j}\) for channel i of sample j is computed as
$$\begin{aligned} \text {SE}_{i,j} = \text {iM}_{i} \oplus \text {FP}_{i,j} \end{aligned}$$
(2)
To develop a complete hypervector, the bundling operation (element-wise majority count across vectors) combines the spatially encoded hypervectors within a sensor modality. If sensor modality m has k channels and the bundling operation is represented as \(+\), SE\(_{m,j}\) is computed as
$$\begin{aligned} \text {SE}_{m,j} = (\text {iM}_{1} \oplus \text {FP}_{1,j}) +\cdots + (\text {iM}_{k} \oplus \text {FP}_{k,j}) \end{aligned}$$
(3)
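A minimal sketch of the spatial encoder, continuing the setup above; the majority tie-breaking rule is an implementation choice and is not specified by Eqs. (2) and (3).

```python
def bind(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Binding (Eq. 2): element-wise XOR of binary hypervectors.
    return a ^ b

def bundle(vectors: np.ndarray) -> np.ndarray:
    # Bundling (Eq. 3): element-wise majority count across vectors.
    # Ties (possible for an even number of inputs) are broken toward 1
    # here; the tie-breaking rule is an implementation choice.
    return (2 * vectors.sum(axis=0) >= len(vectors)).astype(np.uint8)

def spatial_encode_modality(values: np.ndarray) -> np.ndarray:
    # Spatially encode one modality for a single sample j: bind each
    # channel's iM with its selected FP, then bundle across channels.
    bound = np.stack([bind(iM[i], feature_projection(i, v))
                      for i, v in enumerate(values)])
    return bundle(bound)
```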
Because emotion recognition involves multiple sensor modalities, it requires fusion. Previous sensor fusion implementations fused after the temporal encoder; in this work, an early fusion approach is taken instead, fusing the modalities directly after the spatial encoding process. Therefore, this architecture requires only a single temporal encoder as opposed to one per modality, as shown in Fig. 3. This reduces the number of parallel encoding paths while still weighting each modality equally rather than by its number of features. If there are m sensor modalities, the fused spatially encoded hypervector for sample j is
$$\begin{aligned} \text {SE}_{j} = \text {SE}_{1,j} + \text {SE}_{2,j} + \cdots + \text {SE}_{m,j} \end{aligned}$$
(4)
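Under the same assumptions, early fusion then reduces to one more bundling step over the per-modality vectors:

```python
def early_fuse(modality_vectors: list[np.ndarray]) -> np.ndarray:
    # Eq. 4: bundle the per-modality spatial vectors so that each
    # modality contributes equally, regardless of its feature count.
    return bundle(np.stack(modality_vectors))
```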
HDC can also encode temporal changes through the use of n-grams over a sequence of N samples. This is invaluable for time-varying physiological signals, as it captures time-dependent emotional fluctuations within the same class or between segments of the same class. The permutation operation (cyclic shift, represented as \(\rho \)) is used to keep track of previous samples. Hypervectors coming from the spatial encoder are permuted and then bound with the next hypervector N times in the temporal encoder. This results in an output that observes changes over time, TE\(_j\), computed as
$$\begin{aligned} \text {TE}_{j} = \text {SE}_{j} \oplus \rho ^{+1} (\text {SE}_{j-1}) \oplus \cdots \oplus \rho ^{+(N-1)}(\text {SE}_{j-(N-1)}) \end{aligned}$$
(5)
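A sketch of the temporal encoder follows; here \(\rho^{+1}\) is assumed to be a cyclic shift by one position (`np.roll`), and `history` is assumed to hold the last N fused vectors, newest first. Both assumptions are illustrative conventions.

```python
def temporal_encode(history: list[np.ndarray]) -> np.ndarray:
    # Eq. 5: bind the last N fused vectors, cyclically shifting (rho)
    # older samples further. history[t] holds SE_{j-t}, so history[0]
    # is the current sample and len(history) == N.
    te = history[0].copy()
    for t in range(1, len(history)):
        te ^= np.roll(history[t], t)
    return te
```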
During the training process, many such encoded hypervectors are generated, bundled to represent a class, and then stored in the final block, the associative memory. During inference, the encoded hypervector is compared against each trained class hypervector using Hamming distance; for binary vectors, this is an XOR followed by a popcount. The class with the smallest distance is the inferred label.
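A sketch of training and inference against the associative memory, under the same binary-vector assumptions as above:

```python
def train_class(encoded: np.ndarray) -> np.ndarray:
    # Bundle all encoded hypervectors of one class into its prototype,
    # which is then stored in the associative memory.
    return bundle(encoded)

def infer(query: np.ndarray, prototypes: np.ndarray) -> int:
    # Hamming distance via XOR + popcount; the nearest prototype's
    # index is the inferred label.
    distances = (query ^ prototypes).sum(axis=1)
    return int(np.argmin(distances))
```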
Implementation
The HDC early fusion architecture is implemented on both the AMIGOS and DEAP data sets with a standard dimension of 10,000 for the full datapath in the baseline implementation. In the AMIGOS study, GSR recorded at 128 Hz (1 channel across the middle and index fingers), ECG recorded at 256 Hz (2 channels on the right and left arm crooks), and continuous EEG recorded at 128 Hz (14 channels: AF3, F7, F3, FC5, T7, P7, O1, O2, P8, T8, FC6, F4, F8, AF4) were measured for 33 subjects as they watched 16 videos [7]. Each video for each subject was classified as having led to either a positive or negative emotion (valence), and the strength of the emotion was classified as either strong or weak (arousal). From the 3 sensor modalities, Wang et al. selected 214 time- and frequency-domain features relevant to accurate emotion classification [11]: GSR has 32 features, ECG has 77 features, and EEG has 105 features. Similar preprocessing and features are used in this work because this feature selection demonstrated excellent performance on the AMIGOS data set in previous work [9, 11]. The features include GSR skin response/conductance and skin conductance slow response; ECG heart rate spectral power, variability, and heart rate time series; and EEG average power spectral density and asymmetry of the theta, alpha, beta, and gamma bands. The data for all 33 subjects was appended, and a moving average of 15 s over 30 s was applied. The signals were scaled to between \(-1\) and \(+1\) to match the HDC encoding process and downsampled by a factor of 8 for more rapid processing by the HDC classification algorithm. Previous work uses the leave-one-subject-out approach to evaluate performance, and the same approach was implemented for the early fusion architecture [7, 11, 12]. The temporal encoder was tuned, and an optimal n-gram of 3 feature windows was selected. For both data sets, transitional n-grams (those containing samples from both classes) were excluded from training and testing.
The DEAP study was collected in a similar format to AMIGOS, with 32 subjects watching 40 one-minute highlight excerpts of music videos selected to trigger distinct emotional reactions; however, it contains more extensive sensor modalities, all recorded at 512 Hz: continuous EEG (32 channels placed according to the international 10–20 system), EMG (2 channels: neck and corner of the mouth), GSR (1 channel across the middle and index fingers), BVP (1 channel on the thumb), EOG (4 channels above and below each eye), temperature (1 channel on the skin), and respiration amplitude (1 channel) [8]. The arousal and valence scores were self-assessed by the participants on a scale from 1 to 9. A binary classification of high and low valence and arousal is maintained by thresholding the scale at 5. Preprocessing and feature selection were done using the TEAP toolbox, which selected time- and frequency-domain features for 5 of the modalities based on previous work in those areas [24]. These features have been shown to enable high performance on the DEAP data set in prior work and hence were selected for this work [25]. EMG has 10 features, including power and statistical moments over two channels. EEG has 192 features across the 32 channels, including power spectral density in the delta, theta, slow alpha, alpha, beta, and gamma bands. GSR has 7 features, including number of peaks, amplitude of peaks, rise time, and statistical moments. BVP's 17 features include interbeat intervals, multiscale entropy at 5 levels, tachogram power, power spectral density in multiple bands, and statistical moments. Respiration has 12 features, including main frequency, power spectral density, and statistical moments. This results in 40 samples with a total of 238 features per video from 5 modalities per subject. The signals were then scaled to between \(-1\) and \(+1\) for the HDC encoding scheme. Previous work on this data set performs training and inference independently per subject, which was adopted in this work as well [13, 1, 8]. Typically, 90% of the data set is used for training per subject, with the remaining 4 videos used for testing. For HDC, due to the inclusion of the temporal encoder, this would yield a limited number of inferences and hence imprecise classification accuracies. As a result, the training set was decreased to 80% of the data set, with 20% used for testing. A temporal n-gram of 3 was selected for this data set as well.
Memory optimization
For both the AMIGOS and DEAP data sets, there are over 200 features that need to be spatially encoded. This requires advance storage of 214/238 iM vectors and 428/476 feature projection (FP) vectors (positive, PFP, and negative, NFP), totalling 642/714 vectors that need to be stored in the item memory. The use of a unique iM and FP vector set per channel is shown in the first column of Fig. 4. Without a significant reduction of these memory requirements, optimizations of the other blocks will provide limited benefits to overall efficiency.
In the spatial encoder, the iM vector and the FP vector are bound together to form a unique representation containing feature information that is specific to a feature channel. However, the iM and FP vectors do not both need to be unique to the feature channel in order to generate a unique combination of the two. The binding operation inherently creates a vector that is different from, and pseudo-orthogonal to, both of its inputs. Therefore, as long as one of these inputs differs for a specific feature channel, the spatially encoded feature channel vector (represented by the SE vectors in Fig. 4) will be unique. Using this idea, a set of optimizations was developed and implemented on the DEAP and AMIGOS data sets:
‘iM vectors constant per modality’: the iM is replicated across the various modalities, as shown in the second column of Fig. 4. If the FP vectors differ between modalities, then orthogonality and input feature channel uniqueness are maintained even though the iM is the same.
‘FP constant per feature channel’: though the iM is now shared between modalities, each feature channel within a modality still has a unique iM vector. Therefore, the same FP vectors can be re-used for every feature channel within a modality, as shown in the third column of Fig. 4. This requires maintaining 2 unique FP vectors (PFP and NFP) per modality, and unique iM vectors within each modality.
‘Combinatorial pairs’: taking this combinatorial binding strategy to its limit, the 2-input binding operation can be used to generate many unique vectors from a smaller set of vectors by following an algorithmic process. Each feature channel requires a distinct set containing an iM vector and two FP vectors (positive and negative): {iM, PFP, NFP}. If the vectors for feature channel 1 are {A, B, C}, then the bound pairs that can result from spatial encoding (iM \(\oplus \) PFP or iM \(\oplus \) NFP) are:
- A \(\oplus \) B
- A \(\oplus \) C
B \(\oplus \) C will not occur, because B and C are both FP vectors. However, it is a unique pairing that could be re-used for another channel. For example, the set for feature channel 2 could be {B, C, D}. The encoding process would then use the following pairings:
- B \(\oplus \) C
- B \(\oplus \) D
This re-use strategy is the key to saving memory; it can be applied across all channels using a bank of the minimal required vectors, as shown in the first part of Fig. 5.
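A sketch of this assignment process follows, assuming a simple greedy ordering over an indexed vector bank; the function name and the particular ordering are illustrative rather than the exact pattern of Fig. 5.

```python
def assign_channel_sets(v: int) -> list[tuple[int, int, int]]:
    # Greedily assign {iM, PFP, NFP} index triples from a bank of v
    # vectors so that no (iM, FP) pair is ever reused: vector n serves
    # as the iM and the vectors after it are consumed in disjoint pairs.
    sets = []
    for n in range(v):
        partners = list(range(n + 1, v))
        for pfp, nfp in zip(partners[0::2], partners[1::2]):
            sets.append((n, pfp, nfp))
    return sets

# assign_channel_sets(5) -> [(0,1,2), (0,3,4), (1,2,3), (2,3,4)]
```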
Each vector can be paired with every other vector only once to maintain orthogonality and pairing uniqueness across all feature channels. For a feature channel, one vector (the iM) must have two other available vectors (PFP and NFP) to pair with. With \(\left\lfloor {x}\right\rfloor \) defined as the floor function of x, the following equation gives the total number of feature channels, TFC, possible given a bank of v vectors:
$$\begin{aligned} \text {TFC} = \sum _{n=1}^{v-2} \left\lfloor {\frac{v - n}{2}}\right\rfloor \end{aligned}$$
(6)
The formula can be derived by looping through each vector in the bank and sequentially grouping it with pairs of the remaining vectors: vector n can serve as the iM for \(\left\lfloor {(v-n)/2}\right\rfloor \) feature channels, since the \(v-n\) vectors after it are grouped into disjoint PFP/NFP pairs, and no (iM, FP) pair is ever repeated. The generation of feature channel sets can be algorithmic, following the pattern shown in the tables in Fig. 5.
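The closed form can be checked against the greedy assignment sketched above; for example, a bank of 32 vectors already covers the 238 DEAP feature channels:

```python
def total_feature_channels(v: int) -> int:
    # Eq. 6: TFC = sum over n = 1 .. v-2 of floor((v - n) / 2).
    return sum((v - n) // 2 for n in range(1, v - 1))

# The closed form agrees with the greedy assignment above, and 32
# bank vectors suffice for the 238 DEAP feature channels.
assert all(total_feature_channels(v) == len(assign_channel_sets(v))
           for v in range(3, 64))
print(total_feature_channels(32))  # -> 240
```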
‘Rule 90 generation’: implementing a cellular automaton with rule 90 allows trading off vector storage against vector generation. If there are m modalities, the first \(2\times m\) generated vectors are used as the PFP and NFP vectors for each modality. These are maintained throughout training and inference, resulting in \(2\times m + 1\) locally stored vectors including an initial seed vector. The remaining iM vectors, however, are generated on the fly for each feature channel during the encoding process, requiring no additional vector storage. This is possible because of the fixed access pattern of the iM. The generation process applies rule 90 across the entire hypervector and locally stores the most recently generated vector to use as the next seed. One vector is requested and then generated for each feature channel.
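A sketch of this generation scheme, reusing the earlier definitions and assuming a periodic boundary for the automaton (the boundary handling is an implementation choice):

```python
def rule90_step(state: np.ndarray) -> np.ndarray:
    # One step of the rule 90 cellular automaton across the whole
    # hypervector (periodic boundary): each bit becomes the XOR of
    # its two neighbours.
    return np.roll(state, 1) ^ np.roll(state, -1)

# On-the-fly iM generation: only the seed (and the 2*m FP vectors)
# are stored; each feature channel requests one new vector, and the
# most recent output is kept as the next seed.
state = rng.integers(0, 2, size=D, dtype=np.uint8)
for channel in range(n_channels):
    state = rule90_step(state)   # use `state` as the iM for this channel
```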
‘Hybrid’: to reduce vector requests, and hence the computation for rule 90, the last two schemes, ‘combinatorial paired binding’ and ‘rule 90 generation’, can be combined. This hybrid strategy burst-generates a small set of vectors that is stored locally. From this set, combinatorial pairs are assigned to feature channels and spatially encoded. The set is gradually re-populated with new vectors as the old vectors are exhausted in the encoding process, providing new possible pairs. This offers a further tradeoff between vector storage and computation. The vector request rate (vector generation requests per feature channel) is minimized when the vector storage is large enough for the combinatorial paired binding scheme alone, at which point no generation is required.
‘Dimensionality reduction’: the final method of memory reduction is hypervector dimension reduction. The algorithm outlined above stays exactly the same, but the length of the HD vectors used throughout is shortened. This changes the size of the entire datapath, impacting both logic complexity and memory storage approximately linearly. However, smaller hypervectors also have reduced pseudo-orthogonality: random lower-dimensional vectors are less likely to be nearly orthogonal than higher-dimensional ones, so the capacity for information that can be stored within a hypervector is reduced. This especially impacts the output of the bundling operation in the spatial encoder, which no longer represents as much information about each input channel, degrading classification accuracy. This optimization is therefore a tradeoff between classification accuracy and overall efficiency. The impact of changing dimensions on emotion recognition accuracy for the various memory optimizations is also explored.