MCGNet+: an improved motor imagery classification based on cosine similarity

Motor imagery classification has long been a challenging problem in brain informatics. Accuracy and efficiency have been the major obstacles in motor imagery analysis over the past decades, because the available computational capability and algorithms could not meet the demands of complex brain signal analysis. In recent years, the rapid development of machine learning (ML) methods has made it possible to tackle the motor imagery classification problem more efficiently. Among various ML methods, graph neural networks (GNNs) have shown their efficiency and accuracy in dealing with inter-related complex networks. The use of GNNs provides new possibilities for feature extraction from brain structure connections. In this paper, we propose a new model called MCGNet+, which improves the performance of our previous model MutualGraphNet. In this latest model, the mutual information of the input columns forms the initial adjacency matrix, and the cosine similarity between columns is then used to generate a new adjacency matrix in each iteration. The dynamic adjacency matrix combined with the spatial-temporal graph convolution network (ST-GCN) performs better than the model with an unchanged matrix. The experimental results indicate that MCGNet+ is robust enough to learn interpretable features and outperforms the current state-of-the-art methods.


Introduction
Brain-computer interface (BCI) technology has drawn much attention globally due to its significance and extensive applications [1]. It enables its users to interact with a machine through brain signals [2], for example by converting the mental imagination of motion into a command [3]. This can be utilized in rehabilitation devices to help people with disabilities [4] and could be considered the only way for people with motor disabilities to communicate [5]. Motor imagery classification is based on features extracted from EEG data recorded while a subject imagines moving body parts without actual movement, but the feature extraction process often relies heavily on prior knowledge to exclude certain features [6]. Consequently, more robust feature extraction techniques will continue to drive the development of BCI technologies.
A typical brain-computer interface system consists of four main processes [7]: brain-electric raw data acquisition, data preprocessing, feature extraction and feature classification. The previous studies show that the feature extraction and classification are two important phases, which determine whether the system is effective or not. The feature extraction process is designed to describe EEG signals by relevant values [8], and features should contain the information embedded in the original EEG signals while filtering out the noise and other irrelevant information. The classification phase is critical because an efficient classifier can take advantage of as many extracted features as possible and greatly improve the accuracy of the classification. The motor imagery classification is an EEG-based task that focuses primarily on the feature extraction and classification, which have been studied extensively in previous work.
Some research highlights the two most common types of features, namely frequency band power features and time point features [9], both of which benefit from being extracted after spatial filtering [10]. Principal component analysis (PCA) and independent component analysis (ICA) are two classic unsupervised spatial filtering methods [11]; supervised spatial filters include common spatial patterns (CSP) and filter bank common spatial patterns (FBCSP) [12]. In terms of classifiers for the motor imagery task, many methods have been proven effective, such as linear discriminant analysis (LDA) and support vector machines (SVM) [13].
Nowadays, deep learning methods have been successfully applied in various areas. Much recent work has explored the application of deep learning to EEG-based analytical tasks [14]. Deep learning methods improve analytical efficiency and accuracy and provide end-to-end learning for EEG-based tasks, such as sleep stage detection, anomaly detection and motor imagery classification [15]. Although typical deep learning methods, such as convolutional networks, can learn from the raw data without manual feature extraction, they still have some major limitations. For instance, typical deep learning methods require large datasets to train the models, which is a disadvantage for EEG-based tasks because collecting EEG data is usually costly. In addition, EEG datasets represent the unique characteristics of an individual, and the data are collected from different areas of the brain. Therefore, the spatial connections within the EEG data cannot be ignored. However, existing methods, including recent deep learning methods, are unable to effectively learn the connections between different brain regions [16].
Graphs are the most appropriate data structure for brain connections, and graph neural networks (GNNs) have been shown to be effective in classifying graph structures [17]. The core idea of GNNs is to update each node's embedding iteratively by aggregating the representations of its neighbors and itself. The EEG channels can be represented as nodes in the graph, and the connections between the channels correspond to the edges of the graph. However, graph convolutional networks need the adjacency matrix, which represents the graph connections, to be given in advance [18], so determining a suitable brain graph structure is still a challenge due to our limited knowledge of brain structure. Several methods can be used to generate the adjacency matrix: we can use the electrode positions to calculate the distance between electrodes as the degree of correlation, or use the features collected from the electrodes to calculate the correlations. Moreover, EEG data are usually collected in chronological order, so in addition to the spatial characteristics, the temporal characteristics also need to be taken into account.
In this paper, we propose a novel model called MCGNet+, built on our previously proposed MutualGraphNet, which combines a spatial-temporal filter and graph convolutional networks to learn the temporal and spatial characteristics and achieves robust performance on motor imagery classification tasks. The contributions of this paper are as follows:
• The model realizes end-to-end learning. Furthermore, the model is specially designed to adapt to the characteristics of EEG data, so it is able to utilize the features to a great extent.
• For the first time, we use mutual information to generate the initial adjacency matrix and use cosine similarity to update the adjacency matrix dynamically, which achieves better performance.
• Experimental results demonstrate that the newly proposed model has better performance than state-of-the-art methods.

Related work
The motor imagery classification task is of great significance for people with disabilities, and numerous works have sought to improve classification performance. In earlier studies, traditional machine learning methods such as support vector machine (SVM), K-Nearest-Neighbor (KNN) and artificial neural network (ANN) were frequently used for the motor imagery classification task [19], but these traditional methods have limited performance on EEG-based classification tasks. Currently, deep learning methods are utilized in EEG-based classification tasks. The Deep Belief Network (DBN) [20] was proposed, for which features are manually extracted from the channels and then fed into the network. Convolutional Neural Networks (CNN) can automatically learn features from EEG data and perform better than DBN due to their regular structure and the degree of ambiguity of the translational structure [21]. Two CNN models, called Shallow ConvNet and Deep ConvNet, were specially designed for motor imagery classification [14], both of which were reported to perform better than the state-of-the-art methods at the time. Then another CNN model called EEGNet [15] was proposed, which replaces the traditional convolutions with depthwise and separable convolutions for the motor imagery task and performs better than the ConvNets. CNN models can effectively extract the local patterns of data, but they can only be applied to standard grid data [22], whereas graph convolutional networks have been proven to have better performance on graph-structured data. Much has been done to improve the performance of graph convolutional networks.
So far, GCNs have been applied in many fields. The spatial-temporal graph convolution network (ST-GCN) [23] was proposed to learn dynamic graphs for human action recognition tasks; the spatiotemporal multi-graph convolution network (ST-MGCN) [24], which encodes the non-Euclidean correlations among regions into multiple graphs, was proposed for ride-hailing demand forecasting; and GraphSleepNet [16], based on ST-GCN, was proposed for automatic sleep stage classification. When using GCNs, the connection relationship between the electrodes needs to be given as prior knowledge; in other words, the adjacency matrix needs to be calculated as input.
Different methods can be used to generate the adjacency matrix. The distance between two electrodes can be used directly to represent the degree of correlation between them, and there are many different vector distance measures, such as the Euclidean distance [25], which only needs the physical positions of the electrodes, the Chebyshev distance [26], which is defined as the maximum difference between two vectors in any coordinate dimension, the Hamming distance, the Manhattan distance and so on. Furthermore, we can use the correlations of vectors to determine the degree of relevance of the different channels, such as cosine similarity [27], which calculates the similarity between the characteristics of different electrode channels, Pearson correlation, which evaluates the linear relationship between two continuous variables, Spearman correlation, which evaluates the monotonic relationship between two continuous variables, Kendall correlation, point-biserial correlation and so on. We can also use measures from machine learning, such as information gain [28], which evaluates the gain of each variable in the context of the target variable; mutual information, which calculates the statistical dependence between two variables, is the name given to information gain when applied to variable selection.
Motivated by the studies mentioned above, and considering the graph structure and the dynamic spatial-temporal characteristics of EEG data, as well as the fact that the graph structure may differ between motor imagery categories, traditional GCN models may not be optimal for the EEG-based motor imagery classification task. Thus, we propose a novel model suited to the characteristics of EEG data, which uses mutual information to generate the initial adjacency matrix and uses cosine similarity to update the adjacency matrix after each iteration.

Preliminaries
In this study, the EEG data are defined as an undirected graph $G = (V, E, A)$, where V is a finite set of $|V| = N$ nodes and N represents the number of EEG data channels; E is a set of edges, indicating the connectivity between different channels; A represents the adjacency matrix of graph G. Figure 1 shows how the graph is generated from the EEG raw data.
The recorded EEG signals are divided into several labeled segments called trials. The dth trial can be denoted as $X_d \in \mathbb{R}^{N \times F}$, where N denotes the number of EEG electrodes and F denotes the dimension of the values of each node within the time steps t. The dataset can be described as $D = \{(X_1, y_1), (X_2, y_2), \ldots, (X_L, y_L)\}$, where L denotes the number of trials and y represents the label corresponding to the trial. There are four motor imagery categories, namely left hand, right hand, feet and tongue, so the labels are denoted by 0-3, respectively. The goal of the task is to learn the mapping between the EEG data and the motor imagery categories represented as labels. The problem can be defined as: given an input trial $X_i \in \mathbb{R}^{N \times F}$, $1 \le i \le L$, identify the corresponding label $y_i$.

Methodology
The overall framework of the proposed model is presented in Fig. 2. It includes three main parts: the feature extraction and adjacency matrix generation part, the spatial-temporal attention part and the spatial-temporal graph convolution part. The spatial-temporal attention part puts more attention on the more valuable spatial-temporal information, and the spatial-temporal graph convolution part then extracts both spatial and temporal features. The complete algorithm is as follows:

Algorithm 1 The process of motor imagery classification
Input: The EEG trial X.
Output: The corresponding classification $\hat{y}$.
1: Calculate the mutual information of the columns of X and get the adjacency matrix $A \in \mathbb{R}^{N \times N}$.
2: repeat
3:   Put A and X into the spatial-temporal attention block and get the attention matrix S.
4:   Put A, X, S into a GCN layer and get the embedding $\hat{X} \in \mathbb{R}^{N \times L}$.
5:   Calculate the cosine similarity of the columns of the embedding and get a new matrix $\hat{A} \in \mathbb{R}^{N \times N}$.
6:   Update the adjacency matrix $A = \hat{A}$ and the input $X = \hat{X}$.
7: until the number of repetitions equals the number of ST layers.
8: Then the output is $\hat{y} = \mathrm{softmax}(\mathrm{linear}(X))$.
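The control flow of Algorithm 1 can be illustrated with a short NumPy skeleton. This is only a sketch under stated assumptions, not the implementation of MCGNet+: the initial adjacency uses absolute Pearson correlation as a placeholder for mutual information, the attention block is replaced by a row-wise softmax of A, the GCN layer is a single propagation step with a random weight matrix, and all function names are hypothetical. Per-channel vectors are stored as rows of an N x F array.

```python
import numpy as np

def cosine_adjacency(E, eps=1e-12):
    """Pairwise cosine similarity between the per-channel rows of E (N x F)."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    unit = E / np.maximum(norms, eps)
    return unit @ unit.T

def softmax(Z, axis=-1):
    Z = Z - Z.max(axis=axis, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=axis, keepdims=True)

def classify(X, weights, W_out, init_adjacency):
    """Skeleton of Algorithm 1 for one trial X of shape (N, F)."""
    A = init_adjacency(X)                      # step 1 (placeholder for mutual information)
    for W in weights:                          # steps 2-7: one pass per ST layer
        S = softmax(A, axis=1)                 # step 3: stand-in for the attention block
        X = np.maximum((A * S) @ X @ W, 0.0)   # step 4: stand-in GCN layer with ReLU
        A = cosine_adjacency(X)                # steps 5-6: dynamic adjacency update
    logits = X.reshape(-1) @ W_out             # step 8: linear layer on the flattened embedding
    return softmax(logits)

rng = np.random.default_rng(0)
N, F, n_classes, n_layers = 22, 22, 4, 2
X = rng.standard_normal((N, F))
weights = [0.1 * rng.standard_normal((F, F)) for _ in range(n_layers)]
W_out = 0.1 * rng.standard_normal((N * F, n_classes))
probs = classify(X, weights, W_out, lambda X: np.abs(np.corrcoef(X)))
print(probs.shape, probs.sum())
```

The point of the sketch is the ordering of the steps: the adjacency matrix is recomputed from the embedding after every layer, so each layer sees a graph that reflects the features learned by the previous one.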

Adjacency matrix generation
Relevance calculation methods
The relevance of different electrodes can be obtained by calculating the correlations or the information gain of the features of the electrodes, and in this paper we calculate the relevance between every pair of electrodes. The correlations of different channels can be represented by the distances between the channels. The Euclidean distance between two electrodes with position vectors x and y can be represented as:

$$ d_{\mathrm{euc}}(x, y) = \sqrt{\sum_{i} (x_i - y_i)^2} $$

where $x_i$ and $y_i$ are the ith coordinates of x and y. The Euclidean distance can be understood as the straight-line distance between two points, but the electrodes are distributed on the surface of the cerebral cortex, so it is not suitable to directly express the relationships between the electrodes. The Chebyshev distance is defined as the maximum difference between two vectors in any coordinate dimension, that is, the maximum distance along an axis, and the Chebyshev distance between two electrodes can be denoted as:

$$ d_{\mathrm{che}}(x, y) = \max_{i} |x_i - y_i| $$

The calculation of these distances only utilizes the positions of the electrodes; we can also use the features of the electrodes to obtain the correlations. The cosine similarity of two vectors can be defined as:

$$ \cos(x, y) = \frac{x \cdot y}{\|x\| \|y\|} $$

However, the cosine similarity does not consider the magnitude of the vectors, only their directions. The Jaccard index, also known as intersection over union or the Jaccard similarity coefficient, can be used to compare the similarity and diversity of two sample sets P and Q:

$$ J(P, Q) = \frac{|P \cap Q|}{|P \cup Q|} $$

Fig. 2 The overall structure of the proposed model, which consists of three parts: the feature extraction and mutual information computation part, the spatial-temporal attention mechanism part and the spatial-temporal graph convolution part
One of the main disadvantages of the Jaccard index is that it is greatly affected by the size of the data. Large datasets have a great impact on the index, because they can significantly increase the union while keeping the intersection similar. Moreover, we can use the information gain between the feature vectors to obtain the degree of relevance; information gain is calculated by comparing the entropy of the dataset before and after a transformation. Mutual information calculates the statistical dependence between two variables and is the name given to information gain when applied to variable selection.
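The distance and similarity measures discussed in this subsection can be written down in a few lines. The sketch below uses their standard textbook definitions with plain Python; the function names and the toy vectors are illustrative only and are not taken from the experiments.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two position vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def chebyshev(x, y):
    """Largest absolute difference along any single coordinate."""
    return max(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between x and y; insensitive to vector magnitude."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def jaccard(p, q):
    """Size of the intersection divided by the size of the union of two sets."""
    p, q = set(p), set(q)
    return len(p & q) / len(p | q)

x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(euclidean(x, y))                     # 5.0
print(chebyshev(x, y))                     # 4.0
print(cosine_similarity(x, (2.0, 4.0, 6.0)))  # close to 1: same direction, different length
print(jaccard({1, 2, 3}, {2, 3, 4}))       # 0.5
```

The third call illustrates the remark above that cosine similarity ignores magnitude: doubling a vector leaves the similarity essentially unchanged.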

Adjacency matrix update
To make full use of the input prior knowledge and adjust it in time according to the embedding learned by the GCNs, we use mutual information to generate the initial adjacency matrix and use cosine similarity to update the adjacency matrix during the training process.
Mutual information (MI) [29] is used to indicate whether there is a relationship between two variables and how strong the relationship is. The mutual information of two variables X and Y can be defined as:

$$ I(X, Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} $$

where $p(x, y)$ is the joint probability distribution of X and Y, and $p(x)$ and $p(y)$ are the marginal distributions. Mutual information is related to entropy, which is the expected (mean) value of the information content over all outcomes of a variable. The entropy of X is defined as:

$$ H(X) = -\sum_{x \in X} p(x) \log p(x) $$

Then the MI of X and Y can be computed by the equations:

$$ I(X, Y) = H(X) + H(Y) - H(X, Y) = H(Y) - H(Y|X) $$

where H(X, Y) is the joint entropy of X and Y, and H(Y|X) is the conditional entropy of Y given X. Since MI is symmetric, I(X, Y) is also the reduction in the uncertainty of the variable X obtained from the knowledge of the other variable Y; equivalently, it represents the amount of information that Y contains about X.
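The relation between mutual information and entropy can be checked numerically on discrete data. The following sketch estimates probabilities from empirical frequencies and computes MI in two ways, from the double-sum definition and from the entropies; the function names and toy sequences are illustrative assumptions, and natural logarithms are used.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in nats) of a sequence of discrete symbols."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """I(X, Y) = H(X) + H(Y) - H(X, Y), using empirical frequencies."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def mutual_information_direct(xs, ys):
    """I(X, Y) = sum over (x, y) of p(x, y) * log(p(x, y) / (p(x) p(y)))."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

xs = [0, 0, 1, 1, 0, 1, 0, 1]
print(mutual_information(xs, xs))                       # equals H(X), close to ln 2
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))   # close to 0: independent
print(mutual_information_direct(xs, xs))                # agrees with the entropy form
```

A variable shares all of its information with itself, so I(X, X) = H(X), while two independent variables share none.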
Considering the features of EEG data $X = \{x_1, x_2, \ldots, x_N\} \in \mathbb{R}^{N \times F}$, we compute the mutual information $m_{ij}$ of $x_i$ and $x_j$ and use it as the weight of the connection between $x_i$ and $x_j$. This gives an $N \times N$ weight matrix, which is used as the input adjacency matrix of the graph convolution networks. In our previous work [30], we kept the initial adjacency matrix unchanged during the training process. However, since the embedding changes after each iteration, we now update the adjacency matrix synchronously after each iteration to improve the performance of the model. Here, we compute the cosine similarity of two columns of the embedding as the weight of the adjacency matrix. The cosine similarity of two vectors x and y is defined as:

$$ \cos(x, y) = \frac{x \cdot y}{\|x\| \|y\|} $$

The updated weight can be defined as:

$$ a^{l+1}_{i,j} = \cos(e^{l}_{i}, e^{l}_{j}) $$

where $a^{l+1}_{i,j}$ denotes the element in the ith row and jth column of the adjacency matrix at the (l+1)th iteration, and $e^{l}_{i}$ and $e^{l}_{j}$ represent the ith and jth columns of the embedding at the lth iteration. The process of generating and updating the adjacency matrix can be seen in Fig. 3.
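As an illustration of how an initial mutual-information adjacency matrix could be built from continuous channel features, the sketch below estimates the MI of each channel pair from a two-dimensional histogram. The histogram estimator, the number of bins and the function names are assumptions made for this example; the text does not specify which MI estimator is used. Channels are stored as rows of an N x F array.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram estimate of I(X, Y) in nats for two 1-D arrays."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)    # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)    # marginal of y, shape (1, bins)
    mask = pxy > 0                         # skip empty cells to avoid log(0)
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

def mi_adjacency(X, bins=8):
    """N x N matrix whose (i, j) entry is the estimated MI of channels i and j."""
    N = X.shape[0]
    A = np.zeros((N, N))
    for i in range(N):
        for j in range(i, N):
            A[i, j] = A[j, i] = mutual_information(X[i], X[j], bins)
    return A

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 500))
X[1] = X[0]                 # channel 1 duplicates channel 0
A = mi_adjacency(X)
print(np.round(A, 2))
```

In this toy example the duplicated channels receive a much larger weight than the independent ones, which is the behaviour one would want from an MI-based prior on channel connectivity. Histogram estimates are biased upward for small samples, so the weights between independent channels are small but not exactly zero.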

Spatial-temporal attention
The spatial-temporal attention mechanism can capture the dynamic spatial and temporal correlations of the motor imagery network. In the spatial dimension, the activity of one brain region influences other brain regions, and different brain activities generally convey different information, so a dynamic spatial-temporal capture mechanism is required. We use a spatial attention mechanism [31], which can be represented as:

$$ S = V_p \cdot \sigma\left( \left(\chi^{(r-1)} W_1\right) W_2 \left(W_3 \chi^{(r-1)}\right)^{T} + b_p \right) $$

$$ S'_{i,j} = \frac{\exp(S_{i,j})}{\sum_{j=1}^{N} \exp(S_{i,j})} $$

where S denotes the spatial attention matrix, which is computed by the current layer, and σ denotes the sigmoid activation function. Here $V_p, b_p \in \mathbb{R}^{N \times N}$, and $\chi^{(r-1)} = (X_1, X_2, \cdots, X_{T_{r-1}}) \in \mathbb{R}^{N \times C_{r-1} \times T_{r-1}}$ is the input of the rth layer, where $C_{r-1}$ is the number of channels of the input data in the rth layer; $W_1 \in \mathbb{R}^{T_{r-1}}$, $W_2 \in \mathbb{R}^{C_{r-1} \times T_{r-1}}$ and $W_3 \in \mathbb{R}^{C_{r-1}}$. $S_{i,j}$ represents the correlation strength between nodes i and j, and a softmax function is then used to normalize the attention weights. Combining the adjacency matrix and the spatial attention matrix, the model can dynamically adjust the impacting weights between nodes. In the temporal dimension, there are correlations within each motor imagery trial: brain waves are transmitted across the cerebral cortex and the active areas of the brain change over time, so the collected EEG data also change over time. Therefore, a temporal attention mechanism is utilized to capture dynamic temporal information. The temporal attention mechanism is defined as:

$$ E = V_e \cdot \sigma\left( \left(\left(\chi^{(l-1)}\right)^{T} M_1\right) M_2 \left(M_3 \chi^{(l-1)}\right) + b_q \right) $$

$$ E'_{m,n} = \frac{\exp(E_{m,n})}{\sum_{n=1}^{T_{l-1}} \exp(E_{m,n})} $$

where $V_e, b_q \in \mathbb{R}^{T_{l-1} \times T_{l-1}}$, $M_1 \in \mathbb{R}^{N}$, $M_2 \in \mathbb{R}^{C_{l-1} \times N}$ and $M_3 \in \mathbb{R}^{C_{l-1}}$. $E_{m,n}$ denotes the strength of the correlation between motor imagery networks m and n, and E is normalized by the softmax function, so the temporal attention matrix can be directly applied to the input.
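The shapes of the spatial attention parameters stated above fit together as in the following NumPy sketch. It assumes the form commonly used in attention-based spatial-temporal GCNs, S = V_p · sigmoid((χ W_1) W_2 (W_3 χ)^T + b_p), followed by a row-wise softmax; the random parameters stand in for learned weights and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(chi, W1, W2, W3, Vp, bp):
    """chi has shape (N, C, T); returns the normalized attention matrix, shape (N, N)."""
    left = (chi @ W1) @ W2                      # (N, C) @ (C, T) -> (N, T)
    right = np.einsum("c,nct->nt", W3, chi)     # contract the channel axis -> (N, T)
    S = Vp @ sigmoid(left @ right.T + bp)       # (N, N)
    S = S - S.max(axis=1, keepdims=True)        # stabilize the softmax
    expS = np.exp(S)
    return expS / expS.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, C, T = 5, 3, 4
chi = rng.standard_normal((N, C, T))
S_norm = spatial_attention(
    chi,
    W1=rng.standard_normal(T),
    W2=rng.standard_normal((C, T)),
    W3=rng.standard_normal(C),
    Vp=rng.standard_normal((N, N)),
    bp=rng.standard_normal((N, N)),
)
print(S_norm.shape, S_norm.sum(axis=1))
```

Each row of the result sums to one, so entry (i, j) can be read as the relative weight node i assigns to node j. The temporal attention follows the same pattern with the node and time axes exchanged.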

Spatial-temporal graph convolution
The spatial-temporal convolution consists of a graph convolution in the spatial dimension and a normal convolution in the temporal dimension, which could extract both the spatial features and the temporal features.
The spatial features are extracted by aggregating information from neighboring nodes; we use graph convolution to extract the spatial features. The graph convolution is based on the Laplacian matrix and the Fourier transform. The graph Laplacian can be defined as:

$$ L = I - D^{-1/2} A D^{-1/2} $$

where $A \in \mathbb{R}^{N \times N}$ is the adjacency matrix associated with the graph, $D \in \mathbb{R}^{N \times N}$ is the diagonal degree matrix and $I \in \mathbb{R}^{N \times N}$ is the identity matrix. L is a real symmetric positive semidefinite matrix, so it can be decomposed as $L = U \Lambda U^{T}$, where U is the matrix of eigenvectors and $\Lambda \in \mathbb{R}^{N \times N}$ is the diagonal matrix of eigenvalues, which represent the frequencies of their associated eigenvectors. Let $x \in \mathbb{R}^{N}$ be a signal defined on the vertices of a graph G; the graph Fourier transform of the signal is defined as $\hat{x} = U^{T} x$. The graph convolution replaces the classical convolution operator with linear operators that diagonalize in the Fourier domain, and can be defined as:

$$ g_\theta *_G x = g_\theta(L)\, x = g_\theta(U \Lambda U^{T})\, x = U g_\theta(\Lambda) U^{T} x $$

where θ is a vector of Fourier coefficients and $g_\theta$ is the filter. To reduce the computational complexity, $g_\theta$ can be approximated by a truncated expansion in terms of Chebyshev polynomials [32]:

$$ g_\theta(\Lambda) \approx \sum_{p=0}^{k-1} \theta_p T_p(\tilde{\Lambda}) $$

where k is the order of the Chebyshev polynomials, $\theta \in \mathbb{R}^{k}$ is the vector of Chebyshev coefficients, $T_p(\tilde{\Lambda}) \in \mathbb{R}^{N \times N}$ is the Chebyshev polynomial of order p, and $\tilde{\Lambda} = 2\Lambda / \lambda_{\max} - I$ ranges in $[-1, 1]$. Then the jth output feature can be calculated as:

$$ y_j = \sum_{i=1}^{F_{in}} g_{\theta_{i,j}}(L)\, x_i $$

where $x_i$ denotes the ith input feature map (the ith column of the input matrix $X \in \mathbb{R}^{N \times F_{in}}$) and $F_{in}$ equals the input dimension. The outputs are collected into a feature matrix $Y = [y_1, y_2, \ldots, y_{F_{out}}] \in \mathbb{R}^{N \times F_{out}}$. In this work, we generalize the above definition to nodes with multiple channels. The lth layer's input is $\chi^{(l-1)} = (X_1, X_2, \ldots, X_{T_{l-1}}) \in \mathbb{R}^{N \times C_{l-1} \times T_{l-1}}$, where $C_{l-1}$ denotes the number of channels and $T_{l-1}$ denotes the lth layer's temporal dimension.
Fig. 3 The process of generating and updating the adjacency matrix
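The Chebyshev approximation avoids the explicit eigendecomposition by using the recurrence T_0 = I, T_1 = L̃, T_p = 2 L̃ T_{p-1} - T_{p-2}. The NumPy sketch below is a minimal illustration under two assumptions that the text does not state: the adjacency matrix is symmetric with non-negative weights, and the Laplacian is the symmetric normalized one. Function names and the ring graph are illustrative.

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for a symmetric, non-negative adjacency matrix."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.maximum(d, 1e-12)), 0.0)
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def cheb_conv(X, A, Theta):
    """X: (N, F_in); Theta: (k, F_in, F_out). Returns Y of shape (N, F_out)."""
    L = normalized_laplacian(A)
    lam_max = np.linalg.eigvalsh(L).max()
    L_tilde = 2.0 * L / lam_max - np.eye(L.shape[0])   # rescale the spectrum to [-1, 1]
    T_prev, T_curr = np.eye(L.shape[0]), L_tilde       # T_0 and T_1
    out = T_prev @ X @ Theta[0]
    for p in range(1, Theta.shape[0]):
        out = out + T_curr @ X @ Theta[p]
        T_prev, T_curr = T_curr, 2.0 * L_tilde @ T_curr - T_prev
    return out

# A ring of 5 nodes with one input and one output feature.
N = 5
A_ring = np.zeros((N, N))
for i in range(N):
    A_ring[i, (i + 1) % N] = A_ring[(i + 1) % N, i] = 1.0
rng = np.random.default_rng(0)
x = rng.standard_normal((N, 1))
theta = np.array([0.5, -0.3, 0.2]).reshape(3, 1, 1)
Y = cheb_conv(x, A_ring, theta)
print(Y.shape)
```

For a small graph the result can be compared with the spectral form U g(Λ̃) U^T x, which the polynomial filter reproduces exactly; the benefit of the recurrence is that larger graphs never need the full eigenvector matrix U. A cosine-similarity adjacency can contain negative entries, so in practice some clipping or rescaling would be needed before this Laplacian is well defined.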
After the graph convolution has captured the neighboring information for each node in the spatial dimension, a standard two-dimensional convolution layer is used in the temporal dimension to extract the temporal information. The rth convolution layer can be defined as:

$$ \chi^{(r)} = \mathrm{ReLU}\left( \Phi * \mathrm{ReLU}\left( g_\theta *_G \chi^{(r-1)} \right) \right) $$

where Φ is the parameter of the temporal dimension convolution kernel, * represents the standard convolution operation, and ReLU is the activation function.

Experiment
To evaluate the effectiveness of our model, we carried out comparative experiments on a public dataset for the motor imagery task, the BCI Competition IV dataset 2a (SMR).

Dataset description
The BCI Competition IV dataset 2a consists of EEG data from nine subjects. Two sessions were recorded, one for training and the other for testing. Each session includes 288 trials, which are recorded with 22 EEG electrodes and 3 electrooculogram channels. We only utilize the 22 EEG channels in this experiment, and the distribution of the EEG electrodes can be seen in Fig. 4. There are four types of labels in this dataset, corresponding to movements of the left hand, right hand, feet and tongue.
The original dataset is sampled at 250 Hz and bandpass-filtered between 0.5 Hz and 100 Hz, and we further band-pass filter the data to 4-40 Hz. In our experiment, we set the length of each trial to 4.5 s, which starts 500 ms before the start cue of each trial and lasts until the end cue. We then extract 11 differential entropy (DE) features for each channel and double the features so that the feature matrix has the same shape as the adjacency matrix, and combine the two as the input of the graph convolutional network. Finally, we standardize the data to make it suitable for the machine learning model. To show the effectiveness of our proposed model in learning from the raw data and to ensure that the model can be used for a wider range of tasks, the raw EEG data do not undergo any further preprocessing.
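For a signal segment that is approximately Gaussian with variance σ², the differential entropy has the closed form 0.5 · ln(2πeσ²), which is how DE features are commonly computed for EEG. The sketch below applies this formula to a synthetic trial. How the 11 features per channel are obtained is not specified in the text; splitting each 4.5 s trial into 11 time segments is purely an assumption made for this example, as are the function names.

```python
import numpy as np

def differential_entropy(x):
    """DE of a segment under a Gaussian assumption: 0.5 * ln(2 * pi * e * var)."""
    return 0.5 * np.log(2.0 * np.pi * np.e * np.var(x))

def de_features(trial, n_segments=11):
    """trial: (N, T). Returns (N, n_segments) DE features, one per time segment."""
    segments = np.array_split(trial, n_segments, axis=1)   # hypothetical segmentation
    return np.stack(
        [[differential_entropy(channel) for channel in seg] for seg in segments],
        axis=1,
    )

rng = np.random.default_rng(0)
trial = rng.standard_normal((22, 1125))    # 22 channels, 4.5 s at 250 Hz (synthetic data)
features = de_features(trial)
print(features.shape)                       # (22, 11)
```

Because the formula depends only on the variance, scaling a signal by a factor c shifts every DE value by ln(c), which is one reason the features are standardized before being fed to the network.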

Experiment settings
We compare our model with some state-of-the-art methods as well as our previously proposed MutualGraphNet. The baseline methods are listed as follows:

1. Filter Bank Common Spatial Patterns (FBCSP) [12]: it extracts the band power features of EEG, then uses the features to train a classifier to predict the labels.
2. Shallow ConvNet [14]: an end-to-end learning method, which uses convolutional networks to do all the computations.
3. Deep ConvNet [14]: it has more convolution-pooling blocks and is much deeper than Shallow ConvNet.
4. EEGNet [15]: it uses depthwise and separable convolutions and has two convolution-pooling blocks.
In addition to the above baseline methods, we also conducted a comparison between the proposed method in this paper and the traditional machine learning methods including support vector machine (SVM) [33] and random forest (RF) [34].
To show that the model can effectively extract features and is able to eliminate the influence of individual differences, we do not conduct experiments on each subject separately. Instead, we mix the experimental data of the nine subjects, giving a total of 2592 training trials and 2592 testing trials, and we use fourfold cross-validation to evaluate the performance. Since the training set is not big enough, we adopt a loss flooding strategy [35] during the training process to reduce the impact of over-fitting, which is defined as:

$$ \tilde{R}(g) = |R(g) - b| + b $$

where $R(g)$ is the loss of the model, $\tilde{R}(g)$ is the flooded loss and b is a constant called the flooding level; here we set b to 0.5. All experiments are performed on a single Nvidia RTX3090 32GB GPU, and the hyper-parameters are shown in Table 1.
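The effect of loss flooding is easy to see on scalar values. In the sketch below (plain Python, hypothetical function names), the flooded loss equals the original loss while it is above the flooding level b, and is mirrored to 2b - R(g) once it drops below, so that a gradient step on the flooded loss pushes the original loss back up toward b instead of letting it approach zero.

```python
def flooded_loss(loss, b=0.5):
    """Flooded loss |R - b| + b for a scalar training loss R."""
    return abs(loss - b) + b

def flooding_gradient_sign(loss, b=0.5):
    """Factor applied to the original gradient: +1 above the flood level, -1 below.
    The kink at loss == b is assigned +1 here by convention."""
    return 1.0 if loss >= b else -1.0

for loss in (0.9, 0.5, 0.2):
    print(loss, flooded_loss(loss), flooding_gradient_sign(loss))
```

With b = 0.5, a training loss of 0.2 is reported as a flooded loss of about 0.8 and its gradient is reversed, which is what keeps the training loss hovering around the flooding level on a small dataset.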
As for the baseline methods, in order to evaluate the performance of the models more fairly, we use 4.5 s of EEG data sampled at 250 Hz for all experiments. Since EEGNet [15] used data resampled to 128 Hz in the original paper, we double the lengths of the temporal kernels and the average pooling sizes of the original model to match the roughly doubled sampling rate, which proved to perform better than the original model. In response to the change in the length of the sampling time, we also adjusted the parameters of each model accordingly, conducted experiments and selected the best-performing model. The training parameters of the other baseline methods are the same as in [15].
Fig. 4 The distribution of the electrodes in 3D space

Results and discussion
We compare our model with the six baseline methods on SMR, using accuracy, F1-score and precision as the evaluation metrics. Table 2 shows the performance of the different models on the SMR dataset. The results show that our model performs better than the baseline methods and our previously proposed MutualGraphNet.
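For reference, the three metrics can be computed for a four-class problem as in the following plain-Python sketch. Macro-averaging over the classes is an assumption made here, since the averaging scheme is not stated; the labels are toy values, not experimental results.

```python
def evaluate(y_true, y_pred, n_classes=4):
    """Return (accuracy, macro precision, macro F1) for integer class labels."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, f1_scores = [], []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        f1_scores.append(f1)
    return accuracy, sum(precisions) / n_classes, sum(f1_scores) / n_classes

y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 2, 3, 0]
accuracy, macro_precision, macro_f1 = evaluate(y_true, y_pred)
print(accuracy, macro_precision, macro_f1)
```

In this toy example six of eight trials are classified correctly, giving an accuracy of 0.75, while macro precision and macro F1 weight each of the four classes equally.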
For the traditional methods, random forest (RF) performs better than the support vector machine (SVM), but neither is good enough. FBCSP cannot extract and utilize complex features in multi-subject tasks [36], though it performs well in single-subject tasks. The results show that the traditional machine learning methods cannot learn the complex features well; the deep learning models EEGNet and ShallowConvNet both outperform the traditional methods, which demonstrates the effectiveness of convolutional neural networks for EEG-based classification tasks. However, the performance of DeepConvNet shows that a deeper convolutional network does not necessarily work better. The values in bold in Table 2 indicate that our model (MCGNet+) outperforms the conventional methods in accuracy, F1-score and precision.
To evaluate the effect of the depth of the network, we study the impact of the number of ST-GCN layers in Fig. 5. The horizontal axis in Fig. 5 represents the number of ST-GCN layers and the vertical axis represents the corresponding performance of the model. The results show that MCGNet+ with more ST-GCN layers does not work better; the best performance is achieved with 4 layers, and the performance gets worse as the number of layers increases further. A likely reason is that increasing the number of layers increases the number of training parameters, while the training dataset is too small to train a model with more parameters.
In this paper, we extract the differential entropy (DE) feature as the input of the model. In EEG-based tasks there are five other features [37]: power spectral density (PSD), differential asymmetry (DASM), rational asymmetry (RASM), asymmetry (ASM) and differential caudality (DCAU). The DASM and RASM can be expressed as:

$$ \mathrm{DASM} = DE(X_{left}) - DE(X_{right}) $$

$$ \mathrm{RASM} = \frac{DE(X_{left})}{DE(X_{right})} $$

where $X_{left}$ and $X_{right}$ denote the signals of a pair of symmetric electrodes on the left and right hemispheres. We also evaluate the performance of our models on these features. All the experiments are performed with fourfold cross-validation and the training settings are the same as above.
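Assuming the commonly used definitions, in which DASM is the difference and RASM is the ratio of the DE values of a symmetric left/right electrode pair, the two asymmetry features can be computed as below. The electrode pairs and DE values are made-up illustrative numbers, and the function name is hypothetical.

```python
def asymmetry_features(de, pairs):
    """de maps electrode name -> DE value; pairs lists (left, right) electrode names."""
    dasm = [de[left] - de[right] for left, right in pairs]
    rasm = [de[left] / de[right] for left, right in pairs]
    return dasm, rasm

# Illustrative DE values for two symmetric pairs (not measured data).
de_values = {"C3": 1.2, "C4": 0.8, "CP3": 0.9, "CP4": 0.9}
dasm, rasm = asymmetry_features(de_values, [("C3", "C4"), ("CP3", "CP4")])
print(dasm, rasm)
```

A perfectly symmetric pair gives a DASM of 0 and a RASM of 1, so both features describe only the imbalance between hemispheres and discard the absolute activity level.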
The results are presented in Table 3. The PSD feature has the worst performance, and the DE feature outperforms the other features. The DCAU feature also achieves comparable performance, but the ASM and DASM features contain less information, which leads to limited performance. All the features perform better with the new model, which indicates the effectiveness of the newly proposed method. Moreover, the results indicate that there exists some kind of asymmetry in the brain that carries discriminative information. Our knowledge of the human brain is still very limited, and a deeper understanding of the brain is required to obtain more effective and valuable information from EEG data. The values in bold in Table 3 indicate that the DE feature outperforms the other features in accuracy, F1-score and precision in both of our models (MCGNet+ and MutualGraphNet).
The new approach is compared with several different adjacency matrices that we designed:
1. KNN: for each channel, select the nearest N channels to establish a connection.
2. The Euclidean distance (ED): according to the actual distance of each electrode on the brain, select adjacent points to establish a connection.
3. Random: randomly select channels and establish connections between channels.
4. Mut_Euclidean: use the Euclidean distance to establish connections and calculate the mutual information.
5. Mut_KNN: use KNN to establish connections and calculate mutual information between connected channels.
6. Mut_ED: use the Euclidean distance to confirm connections and calculate mutual information between connected channels.
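The KNN-based construction and its mutual-information-weighted variant from the list above can be sketched as follows. The electrode coordinates and the weight matrix are made-up placeholders, `knn_adjacency` and `masked_weights` are hypothetical helper names, and symmetrizing the KNN graph is an assumption, since the text does not say how one-sided neighbour relations are handled.

```python
import numpy as np

def knn_adjacency(positions, k):
    """Connect every electrode to its k nearest electrodes (Euclidean distance)."""
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)                  # exclude self-connections
    nearest = np.argsort(dist, axis=1)[:, :k]       # indices of the k closest electrodes
    A = np.zeros_like(dist)
    rows = np.repeat(np.arange(len(positions)), k)
    A[rows, nearest.ravel()] = 1.0
    return np.maximum(A, A.T)                       # make the graph undirected

def masked_weights(A, W):
    """Keep the weights W (for example mutual information) only on existing edges."""
    return A * W

positions = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
A_knn = knn_adjacency(positions, k=1)
W = np.full((4, 4), 0.7)                            # placeholder weight matrix
print(A_knn)
print(masked_weights(A_knn, W))
```

In the Mut_KNN and Mut_ED variants the connectivity pattern comes from the electrode geometry while the edge weights come from the data, which is why they can carry more information than the purely geometric KNN and ED graphs.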
The results of classification with different kinds of adjacency matrix are shown in Fig. 6. It can be seen that the MI_cos adjacency matrix performs better than the MI adjacency matrix, and that Mut_KNN and Mut_ED are better than KNN and ED, which means that mutual information can provide valuable information for ST-GCN. Furthermore, the choice of adjacency matrix clearly affects the classification performance.

Conclusion
In this paper, we improve the original model for the motor imagery classification task based on our previous work [30]. Instead of using a fixed adjacency matrix, we calculate the cosine similarity of the columns of the embedding to generate a dynamic adjacency matrix. The main advantage of the new model is that it can adjust the input matrix during the training process to utilize the features fully. The experimental results demonstrate that the new model outperforms the state-of-the-art methods as well as our previous model. Furthermore, the adjacency matrix has a considerable impact on the performance of GCNs, and more suitable adjacency matrices remain to be explored.
The current understanding of brain mechanisms is still limited, and more influencing factors will be taken into account to further improve the classification accuracy. Moreover, motor imagery EEG data present individual differences. For example, FBCSP performs differently on EEG data from different subjects: it can achieve good results when the same subject's data are used for training and testing, but it does not perform well on mixed data from multiple subjects. Individual differences also affect the development of solutions for the motor imagery classification task. How to eliminate individual differences and extract valuable features is still key for the wider application of EEG-based tasks. Some current transfer learning methods may be deployed to eliminate individual differences and further expand the scope of EEG applications.
Fig. 6 The performance of the proposed model with different kinds of adjacency matrix. RD represents Random, ED represents the Euclidean distance, ME represents Mut_Euclidean, MK denotes Mut_KNN, MI denotes Mutual Information and MC denotes Mutual_Cos