Table 2 Comparative Analysis of Cognitive Behavior Analysis Techniques Across Data Modalities

From: Machine learning for cognitive behavioral analysis: datasets, methods, paradigms, and research directions

Task: Lie/Deception Detection
Data modalities: Audio
Dataset type: Unimodal
Feature extraction: Mel-Frequency Cepstral Coefficients (MFCC) [17]; spectral kurtosis, MFCC, spectral spread, spectral centroid, and tonal power ratio, together with blood pressure and respiration rate [66]
Description: A linear-kernel Support Vector Machine (SVM) classifier was applied to the processed speech signals, detecting lies and truths with accuracies of 88.23% and 84.52%, respectively [17]. The MMO-DBN method [66] combines the Monarch Butterfly Optimization [95] and Moth Search [91] algorithms with a deep belief network, reaching an accuracy of 98.4%.
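
A minimal sketch of this kind of MFCC-plus-linear-SVM pipeline (the file names, labels, and 13-coefficient setting are illustrative assumptions, not details from [17]):

```python
# Hypothetical sketch: MFCC features + linear-kernel SVM for lie/truth audio.
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_features(path, n_mfcc=13):
    """Load a speech clip and summarise its MFCCs as a fixed-length vector."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    # Mean and std over time give one vector per clip.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# 'paths' and binary 'labels' (1 = lie, 0 = truth) are placeholders.
paths, labels = ["clip_001.wav", "clip_002.wav"], np.array([1, 0])
X = np.vstack([mfcc_features(p) for p in paths])
clf = SVC(kernel="linear").fit(X, labels)
print(clf.predict(X))
```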

Task: Lie/Deception Detection
Data modalities: Images
Dataset type: Unimodal
Feature extraction: Facial features extracted using OpenFace [39]
Description: A fraud-detection framework identifies persons acting dishonestly in video clips by extracting the proportions of their facial micro-expressions [38]; an expression database with five expressions (Happiness/Joy, Surprise, Anger, Disgust/Contempt, and Sadness) yielded a classification accuracy of 85%. A Long Short-Term Memory (LSTM) network trained on facial videos from the Real-life Trial (RLT), Silesian Deception, and Bag-of-Lies datasets classified facial features with an accuracy of 89.49% [39].
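
A minimal Keras sketch of an LSTM over per-frame facial-feature sequences of the kind OpenFace exports; the sequence length and feature dimension are assumptions:

```python
# Hypothetical sketch: LSTM over per-frame facial-feature sequences
# (e.g. vectors exported by OpenFace). Shapes are illustrative assumptions.
import numpy as np
import tensorflow as tf

seq_len, n_feats = 100, 709          # frames per clip, feature size (assumed)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(seq_len, n_feats)),
    tf.keras.layers.LSTM(64),                       # summarise the temporal sequence
    tf.keras.layers.Dense(1, activation="sigmoid")  # lie vs. truth
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Random placeholder data standing in for real facial-feature sequences.
X = np.random.rand(8, seq_len, n_feats).astype("float32")
y = np.random.randint(0, 2, size=(8,))
model.fit(X, y, epochs=1, verbose=0)
```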

Task: Lie/Deception Detection
Data modalities: Audio and video
Dataset type: Multimodal
Feature extraction: Verbal features: unigrams and bigrams derived from a bag-of-words representation [18]. Non-verbal features: eye, eyebrow, and mouth movements (facial expressions) and hand movements and trajectories (hand gestures)
Description: A decision tree algorithm was trained on these features to classify truth and deception with an accuracy of up to 75%.
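
A minimal sketch of the unigram/bigram bag-of-words plus decision-tree setup; the transcripts and labels are placeholders:

```python
# Hypothetical sketch: unigram/bigram bag-of-words + decision tree
# for truth/deception classification.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

transcripts = ["i was home all night", "honestly i never saw the money"]
labels = [0, 1]  # 0 = truth, 1 = deception (illustrative)

vec = CountVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
X = vec.fit_transform(transcripts)
clf = DecisionTreeClassifier().fit(X, labels)
print(clf.predict(vec.transform(["i was home"])))
```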

Task: Lie/Deception Detection
Data modalities: Audio, video, and text
Dataset type: Multimodal
Feature extraction: Improved Dense Trajectories (video), Mel-Frequency Cepstral Coefficients (MFCC) (audio), and GloVe vector representations of transcripts (text)
Description: A linear SVM was applied to classify truth and deception with an accuracy of 87.73%.
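
A minimal sketch of early fusion, concatenating per-modality feature vectors before a linear SVM; the feature arrays are random stand-ins for the IDT, MFCC, and GloVe features:

```python
# Hypothetical sketch: early fusion by concatenating per-modality feature
# vectors before a linear SVM. All arrays are placeholders.
import numpy as np
from sklearn.svm import LinearSVC

n = 20
video_feats = np.random.rand(n, 128)   # stand-in for encoded IDT features
audio_feats = np.random.rand(n, 26)    # stand-in for MFCC statistics
text_feats  = np.random.rand(n, 300)   # stand-in for averaged GloVe vectors
y = np.random.randint(0, 2, size=n)

X = np.hstack([video_feats, audio_feats, text_feats])  # one fused vector per clip
clf = LinearSVC().fit(X, y)
print(clf.score(X, y))
```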

Task: Lie/Deception Detection
Data modalities: Audio, video, and EEG
Dataset type: Multimodal
Feature extraction: Attention-enhanced frequency-distributed spectrograms (audio), a two-stream CNN (video frames), and a Bi-LSTM (EEG)
Description: The study investigates the Bag-of-Lies dataset using audio, video, and EEG data, applying late fusion over a two-stream CNN (video), a CNN on attention-enhanced frequency-distributed spectrograms (audio), and a Bi-LSTM (EEG) to detect lies, achieving 83.5% accuracy with multimodal fusion.
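
A minimal sketch of the late-fusion step, averaging per-modality class probabilities; the arrays stand in for the softmax outputs of the audio, video, and EEG models:

```python
# Hypothetical sketch: late fusion by averaging per-modality class
# probabilities. Values are placeholders for real model outputs.
import numpy as np

p_audio = np.array([[0.3, 0.7], [0.8, 0.2]])
p_video = np.array([[0.4, 0.6], [0.9, 0.1]])
p_eeg   = np.array([[0.2, 0.8], [0.6, 0.4]])

p_fused = (p_audio + p_video + p_eeg) / 3.0   # unweighted late fusion
pred = p_fused.argmax(axis=1)                 # 0 = truth, 1 = lie (illustrative)
print(p_fused, pred)
```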

Task: Lie/Deception Detection
Data modalities: Audio, video, and EEG
Dataset type: Multimodal
Feature extraction: Audio frames; LBP face images concatenated from 20 frames per video; concatenated EEG channels
Description: In [40], LieNet, a deep convolutional neural network, is developed to detect multiscale variations of dishonesty; preprocessed audio, video, and EEG signals are fed individually into LieNet for feature extraction. The framework is trained with data-augmentation methods, achieving high accuracy on the Bag-of-Lies (BOL), Real-life Trial (RLT), and MU3D databases. Other deception-detection techniques are also reported in the literature [41, 42].
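
A minimal sketch of the LBP preprocessing, computing an LBP map per face frame and concatenating 20 maps into one input image; the frames are placeholders and the LBP parameters are assumptions:

```python
# Hypothetical sketch: Local Binary Pattern (LBP) maps for 20 face frames,
# concatenated into one image per video. Face detection/cropping is omitted;
# random arrays stand in for grayscale face crops.
import numpy as np
from skimage.feature import local_binary_pattern

frames = [(np.random.rand(64, 64) * 255).astype(np.uint8) for _ in range(20)]
lbp_maps = [local_binary_pattern(f, P=8, R=1, method="uniform") for f in frames]
video_image = np.hstack(lbp_maps)   # 20 LBP maps side by side -> one network input
print(video_image.shape)            # (64, 1280)
```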

Task: Lie/Deception Detection
Data modalities: Audio, video, and micro-expression features
Dataset type: Multimodal
Feature extraction: 3D-CNN (video) [43]; CNN and Word2Vec (text); openSMILE toolkit (audio) [44]; 39 manually annotated micro-expressions
Description: [43] proposes a neural network model for deceit detection using audio, video, text, and micro-expression features. Features are extracted using a 3D-CNN, a CNN, the openSMILE toolkit, and binary annotations; the fused features are fed to a multilayer perceptron for classification, achieving a maximum accuracy of 96.14%.
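
A minimal sketch of fusing per-modality features into a multilayer perceptron; all feature sizes except the 39 micro-expression annotations are assumptions:

```python
# Hypothetical sketch: fused multimodal features fed to a multilayer
# perceptron. Feature arrays are placeholders for the real extractor outputs.
import numpy as np
from sklearn.neural_network import MLPClassifier

n = 30
fused = np.hstack([
    np.random.rand(n, 64),              # video (3D-CNN) embedding, size assumed
    np.random.rand(n, 64),              # text (CNN + Word2Vec) embedding, size assumed
    np.random.rand(n, 88),              # audio (openSMILE functionals), size assumed
    np.random.randint(0, 2, (n, 39)),   # 39 binary micro-expression annotations
])
y = np.random.randint(0, 2, size=n)
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300).fit(fused, y)
print(clf.score(fused, y))
```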

Task: Lie/Deception Detection
Data modalities: Audio, video, EEG, and gaze
Dataset type: Multimodal
Feature extraction: LBP features from 20 frames per video; zero-crossing rate, spectral centroid, spectral bandwidth, spectral roll-off, chroma frequencies, and MFCC (audio); PyGaze (gaze); 100 points per channel read from a CSV file (EEG)
Description: The research presented in [19] collected data from four modalities and analysed each with different ML models: LBP features with SVM, random forest, and MLP classifiers for video; frequency-based properties with random forest and KNN for audio; a CNN-based classifier and random forest/MLP for EEG; and fixations, eye blinks, and pupil size as features for gaze.
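
A minimal sketch of the listed frequency-based audio features, summarised per clip with librosa and classified with a random forest; file names and labels are placeholders:

```python
# Hypothetical sketch: the frequency-based audio features named above,
# averaged over time and fed to a random forest.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def audio_features(path):
    y, sr = librosa.load(path, sr=None)
    feats = [
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_bandwidth(y=y, sr=sr),
        librosa.feature.spectral_rolloff(y=y, sr=sr),
        librosa.feature.chroma_stft(y=y, sr=sr),
        librosa.feature.mfcc(y=y, sr=sr),
    ]
    # Mean over time for each feature row -> one fixed-length vector per clip.
    return np.concatenate([f.mean(axis=1) for f in feats])

paths, labels = ["s1.wav", "s2.wav"], [0, 1]      # placeholders
X = np.vstack([audio_features(p) for p in paths])
clf = RandomForestClassifier().fit(X, labels)
```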

Task: Stress/Emotion Detection
Data modalities: EEG
Dataset type: Unimodal
Feature extraction: Differential Entropy (DE), Power Spectral Density (PSD), Differential Asymmetry (DASM), Differential Caudality (DCAU), and Rational Asymmetry (RASM)
Description: In [21], DBNs classified positive, neutral, and negative emotions from EEG data band-pass filtered between 0.3 and 50 Hz, using DE, PSD, DASM, DCAU, and RASM features and achieving an average accuracy of 86.08%; SVM, LR, and KNN were also evaluated as classifiers.
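
A minimal sketch of the differential entropy feature for one band-filtered EEG channel, using the standard Gaussian closed form DE = 0.5 ln(2*pi*e*sigma^2); the sampling rate and filter order are assumptions:

```python
# Hypothetical sketch: differential entropy (DE) of a band-limited EEG channel.
import numpy as np
from scipy.signal import butter, filtfilt

def band_de(x, fs, low, high, order=4):
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    xf = filtfilt(b, a, x)                       # band-limited signal
    # Gaussian closed form: DE = 0.5 * ln(2 * pi * e * variance).
    return 0.5 * np.log(2 * np.pi * np.e * np.var(xf))

fs = 200                                         # sampling rate (assumed)
eeg = np.random.randn(10 * fs)                   # placeholder EEG channel
print(band_de(eeg, fs, 8, 13))                   # DE in the alpha band
```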

Task: Stress/Emotion Detection
Data modalities: EEG
Dataset type: Unimodal
Feature extraction: Empirical Mode Decomposition (EMD), Discrete Wavelet Transform (DWT), and a combination of the two (DWT-EMD)
Description: In [52], EEG characteristics are extracted using EMD, DWT, and DWT-EMD, and KNN, SVM, and ANN classifiers are used to distinguish the intrinsic properties of real, neutral, and performed smiles, reaching average accuracies of 94.3% and 84.1% with DWT-EMD and an ANN in the alpha and beta bands, respectively.
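
A minimal sketch of DWT-based EEG features with PyWavelets; the wavelet family and decomposition level are assumptions rather than the settings of [52]:

```python
# Hypothetical sketch: DWT sub-band energies as EEG features.
import numpy as np
import pywt

eeg = np.random.randn(1024)                      # placeholder EEG segment
coeffs = pywt.wavedec(eeg, "db4", level=4)       # [cA4, cD4, cD3, cD2, cD1]
# Summarise each sub-band by its energy -> a small feature vector.
features = np.array([np.sum(c ** 2) for c in coeffs])
print(features)
```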

Task: Stress/Emotion Detection
Data modalities: ECG
Dataset type: Unimodal
Feature extraction: Peak detection followed by HRV feature extraction
Description: On the MAUS dataset [27], statistical and frequency-domain HRV features are extracted, and an SVM performs binary classification, achieving 71.6% accuracy for ECG under LOSO and mixed-subject fivefold cross-validation.
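
A minimal sketch of peak detection followed by two standard time-domain HRV features (SDNN and RMSSD); the sampling rate and peak-detection thresholds are illustrative:

```python
# Hypothetical sketch: R-peak detection and time-domain HRV features from ECG.
import numpy as np
from scipy.signal import find_peaks

fs = 256                                         # sampling rate (assumed)
ecg = np.random.randn(60 * fs)                   # placeholder 60 s ECG trace
# R-peaks: prominent maxima at least 0.4 s apart (~150 bpm upper bound).
peaks, _ = find_peaks(ecg, distance=int(0.4 * fs), prominence=1.0)
rr = np.diff(peaks) / fs * 1000.0                # RR intervals in ms
sdnn = np.std(rr)                                # overall variability
rmssd = np.sqrt(np.mean(np.diff(rr) ** 2))       # short-term variability
print(sdnn, rmssd)
```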

Task: Stress/Emotion Detection
Data modalities: PPG (wrist)
Dataset type: Unimodal
Feature extraction: Peak detection followed by HRV feature extraction
Description: On the MAUS dataset [27], statistical and frequency-domain HRV features are extracted, and an SVM performs binary classification, achieving 66.7% accuracy for wrist PPG under LOSO and mixed-subject fivefold cross-validation.

Task: Stress/Emotion Detection
Data modalities: PPG (fingertip)
Dataset type: Unimodal
Feature extraction: Peak detection followed by HRV feature extraction
Description: On the MAUS dataset [27], statistical and frequency-domain HRV features are extracted, and an SVM performs binary classification, achieving 59.9% accuracy for fingertip PPG under LOSO and mixed-subject fivefold cross-validation.

Task: Stress/Emotion Detection
Data modalities: Text
Dataset type: Unimodal
Feature extraction: GloVe embeddings
Description: The cognitive approach to psychotherapy aims to modify negative thoughts; NLP was employed to create schemas from the cognitive processes demonstrated by healthy individuals. These were then categorised into nine groups and mapped using GloVe embeddings with KNN, SVM, and RNN classifiers.
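
A minimal sketch of averaged GloVe vectors feeding a KNN classifier; "glove.6B.100d.txt" is the standard pre-trained GloVe release, and the texts and labels are placeholders:

```python
# Hypothetical sketch: averaged GloVe word vectors as text features for KNN.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def load_glove(path):
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vecs[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vecs

def embed(text, vecs, dim=100):
    words = [vecs[w] for w in text.lower().split() if w in vecs]
    return np.mean(words, axis=0) if words else np.zeros(dim)

glove = load_glove("glove.6B.100d.txt")
texts, labels = ["nothing ever works out", "today went fine"], [0, 1]  # placeholders
X = np.vstack([embed(t, glove) for t in texts])
clf = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
```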

Task: Stress/Emotion Detection
Data modalities: ECG and GSR
Dataset type: Multimodal
Feature extraction: ECG: statistical and frequency-domain HRV features; GSR: statistical features
Description: On the SWELL-KW dataset [22], stress detection was performed on the ECG and GSR modalities after preprocessing and feature extraction; KNN and SVM classifiers achieved 66.52% and 72.82% accuracy, respectively.

Task: Abnormal Behaviour Detection
Data modalities: Text
Dataset type: Unimodal
Feature extraction: Bag of Words, SkipGram, GloVe
Description: The corpus was taken from the Koko platform and contains 500,000 posts on mental-health issues, annotated into three classes: thinking errors (such as black-and-white thinking and catastrophising), emotions (including anger and anxiety), and situations (such as bereavement and work). Posts can carry multiple labels, and different deep-learning techniques were used with word embeddings to classify them; the CNN-GloVe model achieved the highest F1 score, 57.8%.
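
A minimal sketch of the multi-label setup (one post can carry several labels), using a TF-IDF plus one-vs-rest stand-in rather than the paper's CNN-GloVe model; posts and labels are placeholders:

```python
# Hypothetical sketch: multi-label text classification with a simple
# TF-IDF + one-vs-rest pipeline standing in for CNN-GloVe.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

posts = ["everything is ruined and i am furious",
         "lost my job last week",
         "feeling anxious about everything"]
labels = [["thinking_error", "emotion"], ["situation"], ["emotion"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)                               # binary indicator matrix
X = TfidfVectorizer().fit_transform(posts)
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)   # one classifier per label
```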

Task: Abnormal Behaviour Detection
Data modalities: Images
Dataset type: Unimodal
Feature extraction: Social force flow [31]; the interaction force is mapped onto the image plane for every pixel of every frame
Description: The social force concept locates abnormal behaviour in crowd footage by overlaying each frame with a grid of particles, advecting them with the space–time average of the optical flow, and measuring the interaction forces between particles treated as persons. The method achieved 94% accuracy using a bag-of-words approach to categorise frames as normal or abnormal.
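
A heavily simplified sketch of the social-force idea: dense optical flow per frame, a smoothed flow as each particle's desired velocity, and their difference as a crude interaction-force magnitude; the parameters are illustrative, not those of [31]:

```python
# Hypothetical sketch: per-pixel interaction-force magnitude from optical flow.
import cv2
import numpy as np

def interaction_force(prev_gray, next_gray, tau=0.5):
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    desired = cv2.GaussianBlur(flow, (15, 15), 0)   # spatial average as desired velocity
    force = (desired - flow) / tau                  # simplified force estimate
    return np.linalg.norm(force, axis=2)            # per-pixel magnitude map

f1 = np.random.randint(0, 255, (120, 160), dtype=np.uint8)  # placeholder frames
f2 = np.random.randint(0, 255, (120, 160), dtype=np.uint8)
print(interaction_force(f1, f2).mean())
```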

Task: Abnormal Behaviour Detection
Data modalities: EEG
Dataset type: Unimodal
Feature extraction: Quadratic time–frequency distribution (QTFD)
Description: The study in [50] uses the quadratic time–frequency distribution (QTFD) technique to analyse EEG signals and track changes in their spectral characteristics over time, extracting time–frequency features for subject-dependent SVM classification of emotions on a 2D arousal–valence plane.

Task: Abnormal Behaviour Detection
Data modalities: EEG
Dataset type: Unimodal
Feature extraction: Power Spectral Density and the Burg autoregressive model [51]
Description: A technique proposed for emotion recognition combines dynamic functional network patterns with regional brain activations computed using Power Spectral Density and the Burg autoregressive model. The method achieved up to 90.3% accuracy in differentiating true/genuine versus neutral, true/genuine versus fake, and neutral versus fake emotions [51].
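
A minimal sketch of an AR-based PSD for one EEG channel, estimating Burg coefficients with statsmodels and evaluating the AR spectrum P(f) = sigma^2 / |1 - sum_k a_k e^{-i 2 pi f k / fs}|^2; the AR order and sampling rate are assumptions:

```python
# Hypothetical sketch: Burg AR coefficients -> autoregressive PSD estimate.
import numpy as np
from statsmodels.regression.linear_model import burg

fs = 128                                  # sampling rate (assumed)
x = np.random.randn(4 * fs)               # placeholder EEG segment
a, sigma2 = burg(x, order=8)              # AR coefficients and noise variance

freqs = np.linspace(0, fs / 2, 256)
k = np.arange(1, len(a) + 1)
denom = np.abs(1 - np.exp(-2j * np.pi * np.outer(freqs / fs, k)) @ a) ** 2
psd = sigma2 / denom
print(freqs[np.argmax(psd)])              # peak frequency of the AR spectrum
```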

Task: Abnormal Behaviour Detection
Data modalities: EEG
Dataset type: Unimodal
Feature extraction: DWT, EMD, and DWT-EMD
Description: In [52], SVM, KNN, and ANN classifiers were applied to EEG data to identify genuine smiles, fake/acted smiles, and neutral expressions. EEG features were extracted using three time–frequency analysis techniques (DWT, EMD, and DWT-EMD) across three frequency bands. For distinguishing genuine from fake emotional expressions, DWT-EMD gave the highest classification accuracy in the alpha band: 94.3% with ANN, 92.4% with SVM, and 83.8% with KNN.

Task: Abnormal Behaviour Detection
Data modalities: ECG, EDA, EMG, BVP, accelerometer, respiration, and temperature
Dataset type: Multimodal
Feature extraction: Forward selection
Description: In [54], forward selection was used for feature selection and SMOTE to rebalance the imbalanced WESAD dataset; non-linear algorithms such as GBDT, RF, ET, and DT were applied, with split quality evaluated via Gini impurity or Friedman MSE.
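
A minimal sketch of SMOTE rebalancing followed by a gradient-boosted tree classifier (scikit-learn's GBDT uses the Friedman MSE split criterion by default); the data is a random stand-in for WESAD features:

```python
# Hypothetical sketch: SMOTE oversampling + gradient-boosted trees.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import GradientBoostingClassifier

X = np.random.rand(100, 12)                        # placeholder feature matrix
y = np.array([0] * 85 + [1] * 15)                  # imbalanced labels
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
clf = GradientBoostingClassifier().fit(X_bal, y_bal)
print(np.bincount(y_bal))                          # classes now balanced
```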

Task: Abnormal Behaviour Detection
Data modalities: ECG, EDA, EMG, BVP, accelerometer, respiration, and temperature
Dataset type: Multimodal
Feature extraction: PCA, Quantile Transformer, and Standard Scaler preprocessing
Description: This study applies machine learning and deep learning to bio-signals from the WESAD dataset to detect stress, preprocessing with PCA, a Quantile Transformer, and a Standard Scaler, and evaluating six machine-learning methods for binary classification under leave-one-subject-out cross-validation to avoid subject-specific bias [55].
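
A minimal sketch of the preprocessing chain in a scikit-learn pipeline under leave-one-subject-out cross-validation via LeaveOneGroupOut; the data, subject grouping, and SVM choice are placeholders:

```python
# Hypothetical sketch: preprocessing pipeline + leave-one-subject-out CV.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer, StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

X = np.random.rand(90, 20)                        # placeholder features
y = np.random.randint(0, 2, size=90)              # placeholder stress labels
subjects = np.repeat(np.arange(9), 10)            # 9 subjects, 10 samples each

pipe = make_pipeline(QuantileTransformer(n_quantiles=50),
                     StandardScaler(), PCA(n_components=5), SVC())
scores = cross_val_score(pipe, X, y, cv=LeaveOneGroupOut(), groups=subjects)
print(scores.mean())
```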

Task: Abnormal Behaviour Detection
Data modalities: Accelerometer, EDA, and temperature
Dataset type: Multimodal
Feature extraction: Statistical features such as mean, standard deviation, dynamic range, and minimum and maximum values
Description: A stress-tracking system based on a GRU RNN is proposed for situations where not all modalities are reliable stress predictors. The system performs binary classification using only the ACC, EDA, and TEMP signals, with statistical parameters for feature engineering; the GRU mitigates the vanishing-gradient problem of plain RNNs, and the selected indicators are used to distinguish between stress and non-stress conditions [57].
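
A minimal Keras sketch of a GRU-based binary stress classifier over windows of statistical features; the window length and feature count are assumptions:

```python
# Hypothetical sketch: GRU over windows of ACC/EDA/TEMP statistical features.
import numpy as np
import tensorflow as tf

win_len, n_feats = 30, 12          # time steps per window, features per step (assumed)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(win_len, n_feats)),
    tf.keras.layers.GRU(32),                        # gated recurrence eases vanishing gradients
    tf.keras.layers.Dense(1, activation="sigmoid")  # stress vs. non-stress
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

X = np.random.rand(16, win_len, n_feats).astype("float32")  # placeholder windows
y = np.random.randint(0, 2, size=(16,))
model.fit(X, y, epochs=1, verbose=0)
```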