Table 2 Comparative Analysis of Cognitive Behavior Analysis Techniques Across Data Modalities

From: Machine learning for cognitive behavioral analysis: datasets, methods, paradigms, and research directions

Task: Lie/Deception Detection
Data modalities: Audio
Dataset type: Unimodal
Feature extraction: Mel-Frequency Cepstral Coefficients (MFCC) [17]; spectral kurtosis, MFCC, spectral spread, spectral centroid, and tonal power ratio, together with blood pressure and respiration rate [66]
Description: A linear-kernel Support Vector Machine (SVM) classifier was applied to the processed speech signals, detecting lies and truths with accuracies of 88.23% and 84.52%, respectively [17]. The MMO-DBN method [66] combines the Monarch Butterfly Optimization [95] and Moth Search [91] algorithms with a deep belief network, reaching an accuracy of 98.4%.
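
A minimal sketch of this kind of MFCC-plus-linear-SVM pipeline (the file names, labels, and 13-coefficient setting are illustrative assumptions, not details from [17]):

```python
# Hypothetical sketch: MFCC features + linear-kernel SVM for lie/truth audio.
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_features(path, n_mfcc=13):
    """Load a speech clip and summarise its MFCCs as a fixed-length vector."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    # Mean and std over time give one vector per clip.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# 'paths' and binary 'labels' (1 = lie, 0 = truth) are placeholders.
paths, labels = ["clip_001.wav", "clip_002.wav"], np.array([1, 0])
X = np.vstack([mfcc_features(p) for p in paths])
clf = SVC(kernel="linear").fit(X, labels)
print(clf.predict(X))
```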

Task: Lie/Deception Detection
Data modalities: Images
Dataset type: Unimodal
Feature extraction: Facial features extracted using OpenFace [39]
Description: A fraud-detection framework identifies persons acting dishonestly in video clips by extracting the proportions of their facial micro-expressions [38]; an expression database with five expressions (Happiness/Joy, Surprise, Anger, Disgust/Contempt, and Sadness) yielded a classification accuracy of 85%. A Long Short-Term Memory (LSTM) network trained on facial videos from the Real-life Trial (RLT), Silesian Deception, and Bag-of-Lies datasets classified facial features with an accuracy of 89.49% [39].
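
A minimal Keras sketch of an LSTM over per-frame facial-feature sequences of the kind OpenFace exports; the sequence length and feature dimension are assumptions:

```python
# Hypothetical sketch: LSTM over per-frame facial-feature sequences
# (e.g. vectors exported by OpenFace). Shapes are illustrative assumptions.
import numpy as np
import tensorflow as tf

seq_len, n_feats = 100, 709          # frames per clip, feature size (assumed)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(seq_len, n_feats)),
    tf.keras.layers.LSTM(64),                       # summarise the temporal sequence
    tf.keras.layers.Dense(1, activation="sigmoid")  # lie vs. truth
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Random placeholder data standing in for real facial-feature sequences.
X = np.random.rand(8, seq_len, n_feats).astype("float32")
y = np.random.randint(0, 2, size=(8,))
model.fit(X, y, epochs=1, verbose=0)
```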

Task: Lie/Deception Detection
Data modalities: Audio and video
Dataset type: Multimodal
Feature extraction: Verbal features: unigrams and bigrams derived from a bag-of-words representation [18]. Non-verbal features: eye, eyebrow, and mouth movements (facial expressions) and hand movements and trajectories (hand gestures)
Description: A decision tree algorithm was trained on these features to classify truth and deception with an accuracy of up to 75%.
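
A minimal sketch of the unigram/bigram bag-of-words plus decision-tree setup; the transcripts and labels are placeholders:

```python
# Hypothetical sketch: unigram/bigram bag-of-words + decision tree
# for truth/deception classification.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

transcripts = ["i was home all night", "honestly i never saw the money"]
labels = [0, 1]  # 0 = truth, 1 = deception (illustrative)

vec = CountVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
X = vec.fit_transform(transcripts)
clf = DecisionTreeClassifier().fit(X, labels)
print(clf.predict(vec.transform(["i was home"])))
```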

Task: Lie/Deception Detection
Data modalities: Audio, video, and text
Dataset type: Multimodal
Feature extraction: Improved Dense Trajectories (video), Mel-Frequency Cepstral Coefficients (MFCC) (audio), and GloVe vector representations of transcripts (text)
Description: A linear SVM was applied to classify truth and deception with an accuracy of 87.73%.
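
A minimal sketch of early fusion, concatenating per-modality feature vectors before a linear SVM; the feature arrays are random stand-ins for the IDT, MFCC, and GloVe features:

```python
# Hypothetical sketch: early fusion by concatenating per-modality feature
# vectors before a linear SVM. All arrays are placeholders.
import numpy as np
from sklearn.svm import LinearSVC

n = 20
video_feats = np.random.rand(n, 128)   # stand-in for encoded IDT features
audio_feats = np.random.rand(n, 26)    # stand-in for MFCC statistics
text_feats  = np.random.rand(n, 300)   # stand-in for averaged GloVe vectors
y = np.random.randint(0, 2, size=n)

X = np.hstack([video_feats, audio_feats, text_feats])  # one fused vector per clip
clf = LinearSVC().fit(X, y)
print(clf.score(X, y))
```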

Task: Lie/Deception Detection
Data modalities: Audio, video, and EEG
Dataset type: Multimodal
Feature extraction: Attention-enhanced frequency-distributed spectrograms (audio), a two-stream CNN (video frames), and a Bi-LSTM (EEG)
Description: The study investigates the Bag-of-Lies dataset using audio, video, and EEG data, applying late fusion over a two-stream CNN (video), a CNN on attention-enhanced frequency-distributed spectrograms (audio), and a Bi-LSTM (EEG) to detect lies, achieving 83.5% accuracy with multimodal fusion.
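
A minimal sketch of the late-fusion step, averaging per-modality class probabilities; the arrays stand in for the softmax outputs of the audio, video, and EEG models:

```python
# Hypothetical sketch: late fusion by averaging per-modality class
# probabilities. Values are placeholders for real model outputs.
import numpy as np

p_audio = np.array([[0.3, 0.7], [0.8, 0.2]])
p_video = np.array([[0.4, 0.6], [0.9, 0.1]])
p_eeg   = np.array([[0.2, 0.8], [0.6, 0.4]])

p_fused = (p_audio + p_video + p_eeg) / 3.0   # unweighted late fusion
pred = p_fused.argmax(axis=1)                 # 0 = truth, 1 = lie (illustrative)
print(p_fused, pred)
```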

Task: Lie/Deception Detection
Data modalities: Audio, video, and EEG
Dataset type: Multimodal
Feature extraction: Audio frames; LBP face images concatenated from 20 frames per video; concatenated EEG channels
Description: In [40], LieNet, a deep convolutional neural network, is developed to detect multiscale variations of dishonesty; preprocessed audio, video, and EEG signals are fed individually into LieNet for feature extraction. The framework is trained with data-augmentation methods, achieving high accuracy on the Bag-of-Lies (BOL), Real-life Trial (RLT), and MU3D databases. Other deception-detection techniques are also reported in the literature [41, 42].
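
A minimal sketch of the LBP preprocessing, computing an LBP map per face frame and concatenating 20 maps into one input image; the frames are placeholders and the LBP parameters are assumptions:

```python
# Hypothetical sketch: Local Binary Pattern (LBP) maps for 20 face frames,
# concatenated into one image per video. Face detection/cropping is omitted;
# random arrays stand in for grayscale face crops.
import numpy as np
from skimage.feature import local_binary_pattern

frames = [(np.random.rand(64, 64) * 255).astype(np.uint8) for _ in range(20)]
lbp_maps = [local_binary_pattern(f, P=8, R=1, method="uniform") for f in frames]
video_image = np.hstack(lbp_maps)   # 20 LBP maps side by side -> one network input
print(video_image.shape)            # (64, 1280)
```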

Task: Lie/Deception Detection
Data modalities: Audio, video, and micro-expression features
Dataset type: Multimodal
Feature extraction: 3D-CNN (video) [43]; CNN and Word2Vec (text); openSMILE toolkit (audio) [44]; 39 manually annotated micro-expressions
Description: [43] proposes a neural network model for deceit detection using audio, video, text, and micro-expression features. Features are extracted using a 3D-CNN, a CNN, the openSMILE toolkit, and binary annotations; the fused features are fed to a multilayer perceptron for classification, achieving a maximum accuracy of 96.14%.
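
A minimal sketch of fusing per-modality features into a multilayer perceptron; all feature sizes except the 39 micro-expression annotations are assumptions:

```python
# Hypothetical sketch: fused multimodal features fed to a multilayer
# perceptron. Feature arrays are placeholders for the real extractor outputs.
import numpy as np
from sklearn.neural_network import MLPClassifier

n = 30
fused = np.hstack([
    np.random.rand(n, 64),              # video (3D-CNN) embedding, size assumed
    np.random.rand(n, 64),              # text (CNN + Word2Vec) embedding, size assumed
    np.random.rand(n, 88),              # audio (openSMILE functionals), size assumed
    np.random.randint(0, 2, (n, 39)),   # 39 binary micro-expression annotations
])
y = np.random.randint(0, 2, size=n)
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300).fit(fused, y)
print(clf.score(fused, y))
```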

Task: Lie/Deception Detection
Data modalities: Audio, video, EEG, and gaze
Dataset type: Multimodal
Feature extraction: LBP features from 20 frames per video; zero-crossing rate, spectral centroid, spectral bandwidth, spectral roll-off, chroma frequencies, and MFCC (audio); PyGaze (gaze); 100 points per channel read from a CSV file (EEG)
Description: The research presented in [19] collected data from four modalities and analysed each with different ML models: LBP features with SVM, random forest, and MLP classifiers for video; frequency-based properties with random forest and KNN for audio; a CNN-based classifier and random forest/MLP for EEG; and fixations, eye blinks, and pupil size as features for gaze.
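
A minimal sketch of the listed frequency-based audio features, summarised per clip with librosa and classified with a random forest; file names and labels are placeholders:

```python
# Hypothetical sketch: the frequency-based audio features named above,
# averaged over time and fed to a random forest.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def audio_features(path):
    y, sr = librosa.load(path, sr=None)
    feats = [
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_bandwidth(y=y, sr=sr),
        librosa.feature.spectral_rolloff(y=y, sr=sr),
        librosa.feature.chroma_stft(y=y, sr=sr),
        librosa.feature.mfcc(y=y, sr=sr),
    ]
    # Mean over time for each feature row -> one fixed-length vector per clip.
    return np.concatenate([f.mean(axis=1) for f in feats])

paths, labels = ["s1.wav", "s2.wav"], [0, 1]      # placeholders
X = np.vstack([audio_features(p) for p in paths])
clf = RandomForestClassifier().fit(X, labels)
```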

Task: Stress/Emotion Detection
Data modalities: EEG
Dataset type: Unimodal
Feature extraction: Differential Entropy (DE), Power Spectral Density (PSD), Differential Asymmetry (DASM), Differential Caudality (DCAU), and Rational Asymmetry (RASM)
Description: In [21], DBNs classified positive, neutral, and negative emotions from EEG data band-pass filtered between 0.3 and 50 Hz, using DE, PSD, DASM, DCAU, and RASM features and achieving an average accuracy of 86.08%; SVM, LR, and KNN were also evaluated as classifiers.
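
A minimal sketch of the differential entropy feature for one band-filtered EEG channel, using the standard Gaussian closed form DE = 0.5 ln(2*pi*e*sigma^2); the sampling rate and filter order are assumptions:

```python
# Hypothetical sketch: differential entropy (DE) of a band-limited EEG channel.
import numpy as np
from scipy.signal import butter, filtfilt

def band_de(x, fs, low, high, order=4):
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    xf = filtfilt(b, a, x)                       # band-limited signal
    # Gaussian closed form: DE = 0.5 * ln(2 * pi * e * variance).
    return 0.5 * np.log(2 * np.pi * np.e * np.var(xf))

fs = 200                                         # sampling rate (assumed)
eeg = np.random.randn(10 * fs)                   # placeholder EEG channel
print(band_de(eeg, fs, 8, 13))                   # DE in the alpha band
```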

Task: Stress/Emotion Detection
Data modalities: EEG
Dataset type: Unimodal
Feature extraction: Empirical Mode Decomposition (EMD), Discrete Wavelet Transform (DWT), and a combination of the two (DWT-EMD)
Description: In [52], EEG characteristics are extracted using EMD, DWT, and DWT-EMD, and KNN, SVM, and ANN classifiers are used to distinguish the intrinsic properties of real, neutral, and performed smiles, reaching average accuracies of 94.3% and 84.1% with DWT-EMD and an ANN in the alpha and beta bands, respectively.
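
A minimal sketch of DWT-based EEG features with PyWavelets; the wavelet family and decomposition level are assumptions rather than the settings of [52]:

```python
# Hypothetical sketch: DWT sub-band energies as EEG features.
import numpy as np
import pywt

eeg = np.random.randn(1024)                      # placeholder EEG segment
coeffs = pywt.wavedec(eeg, "db4", level=4)       # [cA4, cD4, cD3, cD2, cD1]
# Summarise each sub-band by its energy -> a small feature vector.
features = np.array([np.sum(c ** 2) for c in coeffs])
print(features)
```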

Task: Stress/Emotion Detection
Data modalities: ECG
Dataset type: Unimodal
Feature extraction: Peak detection followed by HRV feature extraction
Description: On the MAUS dataset [27], statistical and frequency-domain HRV features are extracted, and an SVM performs binary classification, achieving 71.6% accuracy for ECG under LOSO and mixed-subject fivefold cross-validation.
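
A minimal sketch of peak detection followed by two standard time-domain HRV features (SDNN and RMSSD); the sampling rate and peak-detection thresholds are illustrative:

```python
# Hypothetical sketch: R-peak detection and time-domain HRV features from ECG.
import numpy as np
from scipy.signal import find_peaks

fs = 256                                         # sampling rate (assumed)
ecg = np.random.randn(60 * fs)                   # placeholder 60 s ECG trace
# R-peaks: prominent maxima at least 0.4 s apart (~150 bpm upper bound).
peaks, _ = find_peaks(ecg, distance=int(0.4 * fs), prominence=1.0)
rr = np.diff(peaks) / fs * 1000.0                # RR intervals in ms
sdnn = np.std(rr)                                # overall variability
rmssd = np.sqrt(np.mean(np.diff(rr) ** 2))       # short-term variability
print(sdnn, rmssd)
```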

Task: Stress/Emotion Detection
Data modalities: PPG (wrist)
Dataset type: Unimodal
Feature extraction: Peak detection followed by HRV feature extraction
Description: On the MAUS dataset [27], statistical and frequency-domain HRV features are extracted, and an SVM performs binary classification, achieving 66.7% accuracy for wrist PPG under LOSO and mixed-subject fivefold cross-validation.

Task: Stress/Emotion Detection
Data modalities: PPG (fingertip)
Dataset type: Unimodal
Feature extraction: Peak detection followed by HRV feature extraction
Description: On the MAUS dataset [27], statistical and frequency-domain HRV features are extracted, and an SVM performs binary classification, achieving 59.9% accuracy for fingertip PPG under LOSO and mixed-subject fivefold cross-validation.

Task: Stress/Emotion Detection
Data modalities: Text
Dataset type: Unimodal
Feature extraction: GloVe embeddings
Description: The cognitive approach to psychotherapy aims to modify negative thoughts; NLP was employed to create schemas from the cognitive processes demonstrated by healthy individuals. These were then categorised into nine groups and mapped using GloVe embeddings with KNN, SVM, and RNN classifiers.
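
A minimal sketch of averaged GloVe vectors feeding a KNN classifier; "glove.6B.100d.txt" is the standard pre-trained GloVe release, and the texts and labels are placeholders:

```python
# Hypothetical sketch: averaged GloVe word vectors as text features for KNN.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def load_glove(path):
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vecs[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vecs

def embed(text, vecs, dim=100):
    words = [vecs[w] for w in text.lower().split() if w in vecs]
    return np.mean(words, axis=0) if words else np.zeros(dim)

glove = load_glove("glove.6B.100d.txt")
texts, labels = ["nothing ever works out", "today went fine"], [0, 1]  # placeholders
X = np.vstack([embed(t, glove) for t in texts])
clf = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
```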

Task: Stress/Emotion Detection
Data modalities: ECG and GSR
Dataset type: Multimodal
Feature extraction: ECG: statistical and frequency-domain HRV features; GSR: statistical features
Description: On the SWELL-KW dataset [22], stress detection was performed on the ECG and GSR modalities after preprocessing and feature extraction; KNN and SVM classifiers achieved 66.52% and 72.82% accuracy, respectively.

Task: Abnormal Behaviour Detection
Data modalities: Text
Dataset type: Unimodal
Feature extraction: Bag of Words, SkipGram, GloVe
Description: The corpus was taken from the Koko platform and contains 500,000 posts on mental-health issues, annotated into three classes: thinking errors (such as black-and-white thinking and catastrophising), emotions (including anger and anxiety), and situations (such as bereavement and work). Posts can carry multiple labels, and different deep-learning techniques were used with word embeddings to classify them; the CNN-GloVe model achieved the highest F1 score, 57.8%.
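
A minimal sketch of the multi-label setup (one post can carry several labels), using a TF-IDF plus one-vs-rest stand-in rather than the paper's CNN-GloVe model; posts and labels are placeholders:

```python
# Hypothetical sketch: multi-label text classification with a simple
# TF-IDF + one-vs-rest pipeline standing in for CNN-GloVe.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

posts = ["everything is ruined and i am furious",
         "lost my job last week",
         "feeling anxious about everything"]
labels = [["thinking_error", "emotion"], ["situation"], ["emotion"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)                               # binary indicator matrix
X = TfidfVectorizer().fit_transform(posts)
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)   # one classifier per label
```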

Task: Abnormal Behaviour Detection
Data modalities: Images
Dataset type: Unimodal
Feature extraction: Social force flow [31]; the interaction force is mapped onto the image plane for every pixel of every frame
Description: The social force concept locates abnormal behaviour in crowd footage by overlaying each frame with a grid of particles, advecting them with the space–time average of the optical flow, and measuring the interaction forces between particles treated as persons. The method achieved 94% accuracy using a bag-of-words approach to categorise frames as normal or abnormal.
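
A heavily simplified sketch of the social-force idea: dense optical flow per frame, a smoothed flow as each particle's desired velocity, and their difference as a crude interaction-force magnitude; the parameters are illustrative, not those of [31]:

```python
# Hypothetical sketch: per-pixel interaction-force magnitude from optical flow.
import cv2
import numpy as np

def interaction_force(prev_gray, next_gray, tau=0.5):
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    desired = cv2.GaussianBlur(flow, (15, 15), 0)   # spatial average as desired velocity
    force = (desired - flow) / tau                  # simplified force estimate
    return np.linalg.norm(force, axis=2)            # per-pixel magnitude map

f1 = np.random.randint(0, 255, (120, 160), dtype=np.uint8)  # placeholder frames
f2 = np.random.randint(0, 255, (120, 160), dtype=np.uint8)
print(interaction_force(f1, f2).mean())
```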

Task: Abnormal Behaviour Detection
Data modalities: EEG
Dataset type: Unimodal
Feature extraction: Quadratic time–frequency distribution (QTFD)
Description: The study in [50] uses the quadratic time–frequency distribution (QTFD) technique to analyse EEG signals and track changes in their spectral characteristics over time, extracting time–frequency features for subject-dependent SVM classification of emotions on a 2D arousal–valence plane.

Task: Abnormal Behaviour Detection
Data modalities: EEG
Dataset type: Unimodal
Feature extraction: Power Spectral Density and the Burg autoregressive model [51]
Description: A technique proposed for emotion recognition combines dynamic functional network patterns with regional brain activations computed using Power Spectral Density and the Burg autoregressive model. The method achieved up to 90.3% accuracy in differentiating true/genuine versus neutral, true/genuine versus fake, and neutral versus fake emotions [51].
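
A minimal sketch of an AR-based PSD for one EEG channel, estimating Burg coefficients with statsmodels and evaluating the AR spectrum P(f) = sigma^2 / |1 - sum_k a_k e^{-i 2 pi f k / fs}|^2; the AR order and sampling rate are assumptions:

```python
# Hypothetical sketch: Burg AR coefficients -> autoregressive PSD estimate.
import numpy as np
from statsmodels.regression.linear_model import burg

fs = 128                                  # sampling rate (assumed)
x = np.random.randn(4 * fs)               # placeholder EEG segment
a, sigma2 = burg(x, order=8)              # AR coefficients and noise variance

freqs = np.linspace(0, fs / 2, 256)
k = np.arange(1, len(a) + 1)
denom = np.abs(1 - np.exp(-2j * np.pi * np.outer(freqs / fs, k)) @ a) ** 2
psd = sigma2 / denom
print(freqs[np.argmax(psd)])              # peak frequency of the AR spectrum
```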

Task: Abnormal Behaviour Detection
Data modalities: EEG
Dataset type: Unimodal
Feature extraction: DWT, EMD, and DWT-EMD
Description: In [52], SVM, KNN, and ANN classifiers were applied to EEG data to identify genuine smiles, fake/acted smiles, and neutral expressions. EEG features were extracted using three time–frequency analysis techniques (DWT, EMD, and DWT-EMD) across three frequency bands. For distinguishing genuine from fake emotional expressions, DWT-EMD gave the highest classification accuracy in the alpha band: 94.3% with ANN, 92.4% with SVM, and 83.8% with KNN.

Task: Abnormal Behaviour Detection
Data modalities: ECG, EDA, EMG, BVP, accelerometer, respiration, and temperature
Dataset type: Multimodal
Feature extraction: Forward selection
Description: In [54], forward selection was used for feature selection and SMOTE to rebalance the imbalanced WESAD dataset; non-linear algorithms such as GBDT, RF, ET, and DT were applied, with split quality evaluated via Gini impurity or Friedman MSE.
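
A minimal sketch of SMOTE rebalancing followed by a gradient-boosted tree classifier (scikit-learn's GBDT uses the Friedman MSE split criterion by default); the data is a random stand-in for WESAD features:

```python
# Hypothetical sketch: SMOTE oversampling + gradient-boosted trees.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import GradientBoostingClassifier

X = np.random.rand(100, 12)                        # placeholder feature matrix
y = np.array([0] * 85 + [1] * 15)                  # imbalanced labels
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
clf = GradientBoostingClassifier().fit(X_bal, y_bal)
print(np.bincount(y_bal))                          # classes now balanced
```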

Task: Abnormal Behaviour Detection
Data modalities: ECG, EDA, EMG, BVP, accelerometer, respiration, and temperature
Dataset type: Multimodal
Feature extraction: PCA, Quantile Transformer, and Standard Scaler preprocessing
Description: This study applies machine learning and deep learning to bio-signals from the WESAD dataset to detect stress, preprocessing with PCA, a Quantile Transformer, and a Standard Scaler, and evaluating six machine-learning methods for binary classification under leave-one-subject-out cross-validation to avoid subject-specific bias [55].
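
A minimal sketch of the preprocessing chain in a scikit-learn pipeline under leave-one-subject-out cross-validation via LeaveOneGroupOut; the data, subject grouping, and SVM choice are placeholders:

```python
# Hypothetical sketch: preprocessing pipeline + leave-one-subject-out CV.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer, StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

X = np.random.rand(90, 20)                        # placeholder features
y = np.random.randint(0, 2, size=90)              # placeholder stress labels
subjects = np.repeat(np.arange(9), 10)            # 9 subjects, 10 samples each

pipe = make_pipeline(QuantileTransformer(n_quantiles=50),
                     StandardScaler(), PCA(n_components=5), SVC())
scores = cross_val_score(pipe, X, y, cv=LeaveOneGroupOut(), groups=subjects)
print(scores.mean())
```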

Task: Abnormal Behaviour Detection
Data modalities: Accelerometer, EDA, and temperature
Dataset type: Multimodal
Feature extraction: Statistical features such as mean, standard deviation, dynamic range, and minimum and maximum values
Description: A stress-tracking system based on a GRU RNN is proposed for situations where not all modalities are reliable stress predictors. The system performs binary classification using only the ACC, EDA, and TEMP signals, with statistical parameters for feature engineering; the GRU mitigates the vanishing-gradient problem of plain RNNs, and the selected indicators are used to distinguish between stress and non-stress conditions [57].
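
A minimal Keras sketch of a GRU-based binary stress classifier over windows of statistical features; the window length and feature count are assumptions:

```python
# Hypothetical sketch: GRU over windows of ACC/EDA/TEMP statistical features.
import numpy as np
import tensorflow as tf

win_len, n_feats = 30, 12          # time steps per window, features per step (assumed)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(win_len, n_feats)),
    tf.keras.layers.GRU(32),                        # gated recurrence eases vanishing gradients
    tf.keras.layers.Dense(1, activation="sigmoid")  # stress vs. non-stress
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

X = np.random.rand(16, win_len, n_feats).astype("float32")  # placeholder windows
y = np.random.randint(0, 2, size=(16,))
model.fit(X, y, epochs=1, verbose=0)
```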