The overall framework of the proposed model is presented in Fig. 2. It includes three main parts: feature extraction and adjacency matrix generation, spatial–temporal attention, and spatial–temporal graph convolution. The spatial–temporal attention part focuses on the more valuable spatial–temporal information, and the spatial–temporal graph convolution part then extracts both spatial and temporal features. The complete algorithm is described in the following subsections.

### Adjacency matrix generation

#### Relevance calculation methods

The relevance between electrodes can be obtained by calculating either the correlation or the information gain of the electrode features. In this paper, we calculate the relevance across all electrodes. The correlation between channels can be represented by the distance between them. The Euclidean distance between two electrodes is:

$$\begin{aligned} \rho = ((x_2 - x_1)^2 +(y_2 -y_1)^2 +(z_2 -z_1)^2)^{1/2}. \end{aligned}$$

(1)

The Euclidean distance is the straight-line distance between two points. Because the electrodes are distributed on the surface of the cerebral cortex, it is not suitable for directly expressing the relationships between the electrodes. The Chebyshev distance is defined as the maximum difference between two vectors along any coordinate dimension, that is, the maximum distance along an axis. The Chebyshev distance between two electrodes is:

$$\begin{aligned} \rho = \mathrm{{max}}(|x_2 -x_1|,|y_2 - y_1|, |z_2 - z_1|). \end{aligned}$$

(2)
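As a minimal sketch of Eqs. (1) and (2), the two distances can be computed from 3-D electrode positions as shown here. The function names and the coordinates are hypothetical and chosen only for illustration; they are not taken from any electrode montage.

```python
import numpy as np

def euclidean_distance(p, q):
    """Straight-line distance between two 3-D positions, as in Eq. (1)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((q - p) ** 2)))

def chebyshev_distance(p, q):
    """Largest absolute coordinate difference, as in Eq. (2)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.max(np.abs(q - p)))

# Hypothetical electrode coordinates (arbitrary units), for illustration only.
electrode_a = (0.0, 0.0, 0.0)
electrode_b = (3.0, 4.0, 0.0)

d_euclid = euclidean_distance(electrode_a, electrode_b)  # 5.0
d_cheb = chebyshev_distance(electrode_a, electrode_b)    # 4.0
```

The Chebyshev distance never exceeds the Euclidean distance, since it keeps only the largest single-axis difference.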

The distances above use only the positions of the electrodes; the features of the electrodes can also be used to obtain the correlations. The cosine similarity of two vectors is defined as:

$$\begin{aligned} \cos (x,y) = \frac{x\cdot y}{\Vert x\Vert \Vert y\Vert }. \end{aligned}$$

(3)

However, the cosine similarity considers only the directions of the vectors, not their magnitudes. The Jaccard index, also known as intersection over union or the Jaccard similarity coefficient, can be used to compare the similarity and diversity of sample sets:

$$\begin{aligned} J(x_1,x_2)=\frac{|x_1\cap x_2|}{|x_1|+|x_2|-|x_1\cap x_2|}. \end{aligned}$$

(4)

One of the main disadvantages of the Jaccard index is that it is strongly affected by the size of the data. Large datasets have a great impact on the index, because they can significantly increase the union while the intersection stays similar. Moreover, the information gain between feature vectors can be used to obtain the degree of relevance. Information gain is calculated by comparing the entropy of the dataset before and after a transformation. Mutual information measures the statistical dependence between two variables, and it is the name given to information gain when applied to variable selection.
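The size sensitivity of the Jaccard index in Eq. (4) can be illustrated with a small sketch. The function name `jaccard_index`, the example sets, and the convention of returning 1.0 for two empty sets are assumptions made for this illustration.

```python
def jaccard_index(a, b):
    """Jaccard index |a ∩ b| / (|a| + |b| - |a ∩ b|), as in Eq. (4)."""
    a, b = set(a), set(b)
    inter = len(a & b)
    union = len(a) + len(b) - inter
    # Assumed convention: two empty sets are treated as identical.
    return inter / union if union else 1.0

# Two small sets that share the elements {2, 3}.
small = jaccard_index({1, 2, 3}, {2, 3, 4})                      # 2 / 4 = 0.5

# Same intersection {2, 3}, but the second set has 96 extra elements.
large = jaccard_index({1, 2, 3}, {2, 3} | set(range(100, 196)))  # 2 / 99
```

Although both pairs share exactly the same two elements, the index drops from 0.5 to about 0.02 once the union grows.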

#### Adjacency matrix update

To make full use of the input prior knowledge and to adjust it in time according to the embedding learned by the GCNs, we use mutual information to generate the initial adjacency matrix and cosine similarity to update the adjacency matrix during training.

Mutual information (MI) [29] is used to indicate whether there is a relationship between two variables and the strength of the relationship. The mutual information of two variables *X* and *Y* can be defined as:

$$\begin{aligned} I(X,Y) = \sum _{x\in X}\sum _{y\in Y}p(x,y){\rm{log}}\frac{p(x,y)}{p(x)p(y)}. \end{aligned}$$

(5)

Mutual information is related to entropy, which is the expected value of the information content over all outcomes of a variable. The entropy of *X* is defined as:

$$\begin{aligned} \begin{aligned} \quad H(X)&= \sum _{x\in X}P(x){\rm{log}}\frac{1}{P(x)} \\&= - \sum _{x\in X}P(x){\rm{log}}P(x) = -E{\rm{log}}P(X). \\ \end{aligned} \end{aligned}$$

(6)

Then MI of *X* and *Y* can be computed by the equations:

$$\begin{aligned} \begin{aligned} \quad I(X,Y)&= H(X) + H(Y) -H(X,Y) \\&= H(X)-H(X|Y) = H(Y)-H(Y|X) \\ H(X,Y)&= \sum _{x \in X}\sum _{y \in Y}p(x,y){\rm{log}}\frac{1}{p(x,y)} = -E{\rm{log}}P(X,Y) \\ \quad H(Y|X)&= \sum _{x \in X}\sum _{y \in Y}p(x)p(y|x){\rm{log}}\frac{1}{p(y|x)} \\&= -E{\rm{log}}P(Y|X), \end{aligned} \end{aligned}$$

(7)

where *H*(*X*, *Y*) is the joint entropy of *X* and *Y*, and *H*(*Y*|*X*) is the conditional entropy of *Y* given *X*. Thus, *I*(*X*, *Y*) is the reduction in the uncertainty of the variable *X* due to knowledge of the variable *Y*; equivalently, it represents the amount of information that *Y* contains about *X*.
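The relationship between Eqs. (5) and (7) can be checked numerically. The sketch below computes the mutual information of a small joint distribution directly from Eq. (5) and again from the entropies in Eq. (7). The function names and the joint probability table are hypothetical, and natural logarithms are assumed.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats; terms with zero probability contribute 0."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mutual_information(joint):
    """I(X,Y) from a joint probability table p(x,y), following Eq. (5)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal p(x), shape (|X|, 1)
    py = joint.sum(axis=0, keepdims=True)  # marginal p(y), shape (1, |Y|)
    mask = joint > 0                       # skip zero-probability cells
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

# Hypothetical joint distribution of two binary variables.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

mi_direct = mutual_information(joint)
# Eq. (7): I(X,Y) = H(X) + H(Y) - H(X,Y)
mi_entropy = (entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0))
              - entropy(joint))
```

Both routes give the same value, and for an independent pair (a joint table equal to the outer product of its marginals) the result is zero.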

Considering the features of EEG data \(X = \{x^1, x^2,\ldots ,x^N\}\in {\mathbb {R}}^{N\times F}\), we compute the mutual information \(m_{ij}\) of \(x^i\) and \(x^j\) and use it as the weight of the connection between \(x^i\) and \(x^j\). This yields an \(N\times N\) weight matrix that serves as the input adjacency matrix of the graph convolution network. In our previous work [30], we kept the initial adjacency matrix unchanged during the training process. However, because the embedding changes after each iteration, we now update the adjacency matrix synchronously after each iteration to improve the performance of the model. Here, we compute the cosine similarity of two columns of the embedding as the weight of the adjacency matrix. The cosine similarity of two vectors *x*, *y* is defined as:

$$\begin{aligned} \cos (x,y) = \frac{x\cdot y}{\Vert x\Vert \Vert y\Vert }. \end{aligned}$$

(8)

The updated weight can be defined as:

$$\begin{aligned} a_{i,j}^{l+1} = \frac{e_{i}^{l}\cdot e_{j}^{l}}{\Vert e_{i}^{l}\Vert \Vert e_{j}^{l}\Vert }, \end{aligned}$$

(9)

where \(a_{i,j}^{l+1}\) denotes the element in the *i*th row and *j*th column of the adjacency matrix at the \((l+1)\)th iteration, and \(e_{i}^{l}, e_{j}^{l}\) represent the *i*th and *j*th columns of the embedding at the *l*th iteration. The process of generating and updating the adjacency matrix can be seen in Fig. 3.
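The generation and update steps described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify how the mutual information of two feature vectors is estimated, so a simple histogram (plug-in) estimator with a hypothetical `bins` parameter is used here, and the features and embedding are random stand-ins rather than EEG data.

```python
import numpy as np

def histogram_mi(u, v, bins=8):
    """Plug-in MI estimate (nats) of two 1-D feature vectors via a 2-D histogram.
    The estimator and the bin count are assumptions, not the paper's choice."""
    counts, _, _ = np.histogram2d(u, v, bins=bins)
    joint = counts / counts.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

def initial_adjacency(features, bins=8):
    """N x N matrix whose entry (i, j) is the MI estimate m_ij of x^i and x^j."""
    n = features.shape[0]
    adj = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            adj[i, j] = adj[j, i] = histogram_mi(features[i], features[j], bins)
    return adj

def cosine_adjacency(embedding, eps=1e-12):
    """Eq. (9): cosine similarity between columns e_i and e_j of the embedding."""
    norms = np.linalg.norm(embedding, axis=0, keepdims=True)
    unit = embedding / np.maximum(norms, eps)  # guard against zero columns
    return unit.T @ unit

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 256))  # stand-in features: N = 4 nodes, F = 256
A0 = initial_adjacency(X)          # initial adjacency from mutual information
E = rng.standard_normal((16, 4))   # stand-in embedding, one column per node
A1 = cosine_adjacency(E)           # adjacency after one update step
```

Both matrices are symmetric. The cosine-based update has ones on its diagonal and entries in \([-1, 1]\), whereas the mutual information weights are non-negative.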

### Spatial–temporal attention

The spatial–temporal attention mechanism captures the dynamic spatial and temporal correlations of the motor imagery network. In the spatial dimension, the activities of one brain region influence other brain regions, and different brain activities generally convey different information, so a mechanism that captures these dynamics is required. We use a spatial attention mechanism [31], which can be represented as:

$$\begin{aligned}S &= V_p*\sigma ((\chi ^{(r-1)}W_1)W_2(W_3\chi ^{(r-1)})^T +b_{p}), \\ \quad S^{'}_{i,j} &= \frac{\mathrm{{exp}}(S_{i,j})}{\sum \limits ^N_{j=1}\mathrm{{exp}}(S_{i,j})}, \end{aligned}$$

(10)

where *S* denotes the spatial attention matrix, which is computed by the current layer; \(V_p, b_p \in {\mathbb {R}}^{N\times N}\); \(\chi ^{(r-1)} = (X_1,X_2, \cdots , X_{T_{r-1}}) \in {\mathbb {R}}^{N\times C_{r-1}\times T_{r-1}}\) is the input of the \(r\)th layer; \(C_{r-1}\) is the number of channels of the input data in the \(r\)th layer; and \(W_1 \in {\mathbb {R}}^{T_{r-1}}, W_2 \in {\mathbb {R}}^{C_{r-1}\times T_{r-1}}, W_3 \in {\mathbb {R}}^{C_{r-1}}\). \(S_{i,j}\) in *S* represents the correlation strength between nodes *i* and *j*, and a softmax function is then used to normalize the attention weights. By combining the adjacency matrix and the spatial attention matrix, the model can dynamically adjust the impact weights between nodes.
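The shapes in Eq. (10) can be traced with a short NumPy sketch. Several points are assumptions rather than statements from the text: \(\sigma\) is taken to be the sigmoid function, the product \(V_p*\sigma (\cdot )\) is taken to be a matrix multiplication, and all parameters are random stand-ins for learned weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax_rows(m):
    """Row-wise softmax, so that each row i sums to 1 over j."""
    e = np.exp(m - m.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def spatial_attention(x, W1, W2, W3, Vp, bp):
    """x: (N, C, T). Returns the normalized attention S' of shape (N, N)."""
    lhs = (x @ W1) @ W2                  # (N, C) then (N, T)
    rhs = np.einsum("c,nct->nt", W3, x)  # W3 contracts the channel axis: (N, T)
    S = Vp @ sigmoid(lhs @ rhs.T + bp)   # (N, T) @ (T, N) -> (N, N)
    return softmax_rows(S)

rng = np.random.default_rng(0)
N, C, T = 5, 3, 7                        # nodes, channels, time steps
x = rng.standard_normal((N, C, T))
S_norm = spatial_attention(
    x,
    W1=rng.standard_normal(T),
    W2=rng.standard_normal((C, T)),
    W3=rng.standard_normal(C),
    Vp=rng.standard_normal((N, N)),
    bp=rng.standard_normal((N, N)),
)
```

Each row of `S_norm` sums to one, so row *i* can be read as a distribution of attention from node *i* over all nodes.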

In the temporal dimension, there are correlations within each motor imagery trial: brain waves are transmitted across the cerebral cortex and the active areas of the brain change over time, so the collected EEG data also change over time. Therefore, a temporal attention mechanism is used to capture the dynamic temporal information. It is defined as:

$$\begin{aligned}E &= V_e*\sigma (((\chi ^{(l-1)})^{T}M_1)M_2(M_3\chi ^{(l-1)})+b_q), \\\quad E^{'}_{m,n} &= \frac{\mathrm{{exp}}(E_{m,n})}{\sum \limits ^{T_{l-1}}_{n=1}\mathrm{{exp}}(E_{m,n})}, \end{aligned}$$

(11)

where \(V_e, b_q \in {\mathbb {R}}^{T_{l-1}\times T_{l-1}}\), \(M_1 \in {\mathbb {R}}^N\), \(M_2 \in {\mathbb {R}}^{C_{l-1}\times N}\), and \(M_3 \in {\mathbb {R}}^{C_{l-1}}\). \(E_{m,n}\) denotes the strength of the correlation between the motor imagery network at time steps *m* and *n*. *E* is normalized by the softmax function, so the temporal attention matrix can be applied directly to the input.
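A sketch of Eq. (11) under the same kind of assumptions follows: \(\sigma\) is taken to be the sigmoid function, \(V_e*\sigma (\cdot )\) is taken to be a matrix multiplication, and the parameters are random stand-ins. Applying the attention to the input by multiplying along the time axis is also an assumption about what "directly applied" means.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax_rows(m):
    """Row-wise softmax, so that each row m sums to 1 over n."""
    e = np.exp(m - m.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def temporal_attention(x, M1, M2, M3, Ve, bq):
    """x: (N, C, T). Returns the normalized attention E' of shape (T, T)."""
    xt = x.transpose(2, 1, 0)            # chi^T with shape (T, C, N)
    lhs = (xt @ M1) @ M2                 # (T, C) then (T, N)
    rhs = np.einsum("c,nct->nt", M3, x)  # M3 contracts the channel axis: (N, T)
    E = Ve @ sigmoid(lhs @ rhs + bq)     # (T, N) @ (N, T) -> (T, T)
    return softmax_rows(E)

rng = np.random.default_rng(0)
N, C, T = 5, 3, 7                        # nodes, channels, time steps
x = rng.standard_normal((N, C, T))
E_norm = temporal_attention(
    x,
    M1=rng.standard_normal(N),
    M2=rng.standard_normal((C, N)),
    M3=rng.standard_normal(C),
    Ve=rng.standard_normal((T, T)),
    bq=rng.standard_normal((T, T)),
)

# Assumed way of applying the attention: reweight the time axis of the input.
x_hat = x @ E_norm                       # (N, C, T) @ (T, T) -> (N, C, T)
```

The reweighted input `x_hat` keeps the shape of `x`, so it can be passed to the next stage without any reshaping.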

### Spatial–temporal graph convolution

The spatial–temporal convolution consists of a graph convolution in the spatial dimension and a standard convolution in the temporal dimension, so it extracts both the spatial and the temporal features.

The spatial features are extracted by aggregating information from neighboring nodes, for which we use graph convolution. Graph convolution is based on the graph Laplacian and the graph Fourier transform. The graph Laplacian can be defined as:

$$\begin{aligned} L = I - D^{-1/2}AD^{-1/2}, \end{aligned}$$

(12)

where \(A\in {\mathbb {R}}^{N\times N}\) is the adjacency matrix associated with the graph, \(D \in {\mathbb {R}}^{N\times N}\) is the diagonal degree matrix, and \(I\in {\mathbb {R}}^{N\times N}\) is the identity matrix. Because *L* is a real symmetric positive semidefinite matrix, it can be decomposed as \(L = U\Lambda U^T\), where the columns of *U* are the eigenvectors of *L* and \(\Lambda \in {\mathbb {R}}^{N\times N}\) is the diagonal matrix of eigenvalues, which represent the frequencies of their associated eigenvectors. Let \(x \in {\mathbb {R}}^N\) be a signal defined on the vertices of a graph *G*; the graph Fourier transform of the signal is defined as \({\hat{x}} = U^Tx\). Graph convolution replaces the classical convolution operator with linear operators that diagonalize in the Fourier domain, and can be defined as:

$$\begin{aligned} g_\theta (L)x = g_\theta (U\Lambda U^T)x = Ug_\theta (\Lambda )U^Tx, \end{aligned}$$

(13)

where \(\theta\) is a vector of Fourier coefficients and \(g_\theta\) is the filter. To reduce the computational complexity, \(g_\theta\) can be approximated by a truncated expansion in terms of Chebyshev polynomials [32]:

$$\begin{aligned} g_{\theta }({\Lambda }) = \sum ^{k-1}_{p=0}{\theta }_{p}T_{p}({\tilde{\Lambda }}), \end{aligned}$$

(14)

where *k* is the order of the Chebyshev expansion, \(\theta \in {\mathbb {R}}^k\) is the vector of Chebyshev coefficients with entries \(\theta _p\), and \(T_{p}({\tilde{\Lambda }})\in {\mathbb {R}}^{N\times N}\) is the Chebyshev polynomial of order *p* evaluated at \({\tilde{\Lambda }} = 2{\Lambda }/{\lambda }_{max} -I\), whose diagonal entries lie in \([-1,1]\). The *j*th output feature can then be calculated as:

$$\begin{aligned} y_j = \sum ^{F_{in}}_{i=1}g_{\theta _{i,j}}(L)x_i, \end{aligned}$$

(15)

where \(x_i \in {\mathbb {R}}^N\) denotes the *i*th column (feature map) of the input matrix and \(F_{in}\) equals the input dimension. The outputs are collected into a feature matrix \(Y = [y_1,y_2,\ldots ,y_{F_{out}}] \in {\mathbb {R}}^{N \times F_{out}}\). In this work, we generalize the above definition to nodes with multiple channels: the *l*th layer’s input is \(X^{(l-1)} = (x_1,x_2,\ldots ,x_{T_{l-1}}) \in {\mathbb {R}}^{N\times C_{l-1}\times T_{l-1}}\), where \(C_{l-1}\) denotes the number of channels and \(T_{l-1}\) denotes the temporal length of the *l*th layer’s input.
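The chain from Eq. (12) to Eq. (14) can be verified on a toy graph. The sketch below builds the normalized Laplacian, applies the Chebyshev filter through the three-term recurrence \(T_p(z) = 2zT_{p-1}(z) - T_{p-2}(z)\), and compares the result with the spectral form \(Ug_\theta (\Lambda )U^Tx\). The ring graph, the signal, and the coefficients are hypothetical. For simplicity \(\lambda _{max}\) is computed exactly here, which a practical implementation might replace with an upper bound.

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2}, Eq. (12); isolated nodes get zero scaling."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def chebyshev_filter(L, x, theta):
    """Compute sum_p theta_p T_p(L_tilde) x with the Chebyshev recurrence."""
    lam_max = np.linalg.eigvalsh(L).max()
    L_tilde = 2.0 * L / lam_max - np.eye(L.shape[0])
    t_prev, t_curr = x, L_tilde @ x  # T_0(L_tilde) x and T_1(L_tilde) x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_curr
    for p in range(2, len(theta)):
        t_prev, t_curr = t_curr, 2.0 * (L_tilde @ t_curr) - t_prev
        out = out + theta[p] * t_curr
    return out

def spectral_filter(L, x, theta):
    """Reference form U g_theta(Lambda) U^T x, with g_theta from Eq. (14)."""
    lam, U = np.linalg.eigh(L)
    lam_tilde = 2.0 * lam / lam.max() - 1.0
    g = np.polynomial.chebyshev.chebval(lam_tilde, theta)
    return U @ (g * (U.T @ x))

# Hypothetical 5-node ring graph, signal, and filter coefficients (k = 3).
A = np.zeros((5, 5))
for i in range(5):
    A[i, (i + 1) % 5] = A[(i + 1) % 5, i] = 1.0
L = normalized_laplacian(A)
x = np.random.default_rng(0).standard_normal(5)
theta = np.array([0.5, 0.3, 0.2])

y_cheb = chebyshev_filter(L, x, theta)
y_spec = spectral_filter(L, x, theta)
```

The two outputs agree to numerical precision. The recurrence needs only matrix-vector products with the scaled Laplacian, which is where the reduction in computational cost comes from.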

After the graph convolution has captured the neighboring information of each node in the spatial dimension, a standard two-dimensional convolution layer is used in the temporal dimension to extract the temporal information. The *r*th convolution layer can be defined as:

$$\begin{aligned} \chi ^{(r)}_h = {\rm ReLU}(\Phi *({\rm ReLU}(g_\theta *_G {\hat{\chi }}^{(r-1)}_{h}))), \end{aligned}$$

(16)

where \(\Phi\) denotes the parameters of the temporal convolution kernel, \(*\) represents the standard convolution operation, and ReLU is the activation function.
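The structure of Eq. (16), a graph convolution followed by a temporal convolution with a ReLU after each, can be sketched in NumPy. This is a simplified illustration, not the implementation used in this work. The tensor layouts, the zero padding, the kernel width, and the helper names are assumptions, the attention-adjusted input is replaced by a random tensor, and the temporal step is written as a cross-correlation, as is usual in deep learning libraries.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def graph_conv(x, cheb_polys, theta):
    """x: (N, C_in, T); cheb_polys: K matrices T_p(L_tilde) of shape (N, N);
    theta: (K, C_in, C_out). Returns (N, C_out, T)."""
    out = 0.0
    for T_p, theta_p in zip(cheb_polys, theta):
        # Mix nodes with T_p, then mix channels with theta_p, per time step.
        out = out + np.einsum("nm,mct,cd->ndt", T_p, x, theta_p)
    return out

def temporal_conv(x, phi):
    """x: (N, C_in, T); phi: (C_out, C_in, W) with odd W. Zero-padded in time."""
    n, _, t = x.shape
    c_out, _, w = phi.shape
    pad = w // 2
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad)))
    out = np.zeros((n, c_out, t))
    for s in range(w):
        out += np.einsum("dc,nct->ndt", phi[:, :, s], xp[:, :, s:s + t])
    return out

def st_conv_block(x, cheb_polys, theta, phi):
    """ReLU(Phi * ReLU(graph convolution of x)), mirroring Eq. (16)."""
    return relu(temporal_conv(relu(graph_conv(x, cheb_polys, theta)), phi))

# Hypothetical 4-node ring graph; every node has degree 2.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L = np.eye(4) - A / 2.0
L_tilde = 2.0 * L / np.linalg.eigvalsh(L).max() - np.eye(4)
cheb_polys = [np.eye(4), L_tilde]        # T_0 and T_1, so K = 2

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 2, 10))      # N = 4, C_in = 2, T = 10
theta = rng.standard_normal((2, 2, 3))   # K = 2, C_in = 2, C_out = 3
phi = rng.standard_normal((3, 3, 3))     # C_out = 3, C_in = 3, W = 3
out = st_conv_block(x, cheb_polys, theta, phi)
```

With zero padding and an odd kernel width, the temporal length is preserved, so the output has shape `(4, 3, 10)` and can be fed to another block of the same form.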