Multimodal Tools for Speech and Language Processing


NLP tools

·      Word embeddings: GloVe, Word2Vec, BERT, ELMo, RoBERTa

·      Stanford NLP software

·      Unigrams, bigrams, trigrams
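Unigram, bigram, and trigram features can be extracted in a few lines of plain Python (NLTK's `nltk.util.ngrams` does the same job); this sketch assumes simple whitespace tokenization:

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```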

·      Linguistic Inquiry and Word Count (LIWC)

·      POS tags: NLTK toolkit

·      Morphological analysis:

o   Polyglot

o   Morfessor

o   LegaliPy

·      Flesch reading ease and other readability formulas (Kincaid et al 1975)
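The Flesch reading ease score is a closed-form formula over sentence, word, and syllable counts; this sketch implements it directly, with a rough vowel-group heuristic for syllables (real tools such as textstat use a pronouncing dictionary):

```python
import re

def count_syllables(word):
    """Rough heuristic: count vowel groups; a pronouncing dictionary is more accurate."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    """206.835 - 1.015*(words/sentences) - 84.6*(syllables/words); higher = easier."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```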

·      Speciteller: Specificity score (Li and Nenkova 2015)

·      Concreteness score (Brysbaert et al 2014)

·      Dictionary of Affect in Language and its revised 2009 version (Whissell 1989; Whissell 2009)

·      Hedge words and phrases (Ulinski et al 2018)

·      textstat: tools to extract readability measures from text (readability, complexity, and grade level)

·      Tools to restore punctuation in unpunctuated text or ASR output:

o   Punctuator

o   Bert-restore-punctuation

o   fastPunct

o   Ottokart/punctuator2

·      Information on NLP for Chinese data

·      Polyglot (multilingual text-processing toolkit with models built from Wikipedia in 136 languages)

·      Other useful text features:

o   Number of filled pauses

o   Response latency

o   False starts and other speech disfluencies

o   Repetitions

o   Lexical diversity: determined by type/token ratio

o   Creativity: similarity of this response to other responses
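Two of the text features above are trivial to compute once the transcript is tokenized; this sketch shows the type/token ratio and a filled-pause count (the pause word list here is illustrative, not exhaustive, and note that raw TTR is sensitive to text length):

```python
FILLED_PAUSES = {"uh", "um", "er", "ah"}  # illustrative set only

def type_token_ratio(tokens):
    """Lexical diversity: distinct word types divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def count_filled_pauses(tokens):
    """Count tokens that are filled pauses (case-insensitive)."""
    return sum(1 for t in tokens if t.lower() in FILLED_PAUSES)
```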

·      Sentiment lexicons

o   The General Inquirer (Stone et al. 1966)

§  Positive (1915), Negative (2291), Strong vs Weak, Pleasure, Pain, etc.

o   MPQA Subjectivity Cues Lexicon

§  2718 positive, 4912 negative

o   Bing Liu Opinion Lexicon

§  2006 positive, 4783 negative

o   Product reviews on Amazon

§  Multidomain sentiment analysis dataset

§  Amazon product data, 143 million reviews

o   Movie reviews on IMDB

§  Cornell movie review data, labeled with sentiment polarity, scale, and subjectivity

§  Large Movie Review Dataset v1.0, 25k movie reviews

§  IMDB Movie Reviews Dataset, 50k movie reviews

§  Bag of Words Meets Bags of Popcorn, 50k movie reviews

o   Reviews from Rotten Tomatoes

§  Stanford Sentiment Treebank, 11k reviews

o   Tweets with emoticons

§  Sentiment140, 160k tweets

o   Twitter data on US airlines

§  Twitter US Airline Sentiment, with negative reasons (e.g. “rude service”)

o   Paper reviews

§  Paper Reviews

o   SentiWordNet

§  WordNet synsets automatically labeled with positivity, negativity, and objectiveness

o   NRC Word-emotion Association Lexicon (Mohammad and Turney 2011)

§  Labeled by Turkers for joy, sadness, anger, fear, trust, disgust, anticipation, and surprise

o   Lexicon of Valence, Arousal, and Dominance (Warriner et al 2013)

§  AMT ratings of 14k words

o   Sentiment in Twitter (Go et al 2009; Kouloumpis et al 2011)

o   Emoji in Twitter (Felbo et al 2017)

o   Attention Modeling for Targeted Sentiment (Liu and Zhang 2017)

o   BERT in Sentiment Analysis (Google AI Language)
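As a minimal illustration of how the lexicons above are applied, this sketch scores tokens against a toy polarity lexicon; the word sets here are invented for the example, and real work would load the MPQA, Bing Liu, or SentiWordNet entries instead:

```python
# Toy lexicon for illustration; substitute MPQA / Bing Liu / SentiWordNet entries.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "awful", "hate"}

def lexicon_polarity(tokens):
    """Crude polarity in [-1, 1]: (#positive - #negative) / #tokens."""
    pos = sum(1 for t in tokens if t.lower() in POSITIVE)
    neg = sum(1 for t in tokens if t.lower() in NEGATIVE)
    return (pos - neg) / len(tokens) if tokens else 0.0
```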


Speech approaches

·      Aeneas: text/speech alignment

·      MFCC features
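Libraries such as librosa or openSMILE extract MFCCs directly; the numpy sketch below walks through the standard pipeline (pre-emphasis, framing, windowing, power spectrum, mel filterbank, log, DCT) so the feature is not a black box. Frame and filterbank sizes are common defaults, not requirements:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, frame_len=400, hop=160, n_mels=26, n_ceps=13):
    # Pre-emphasis boosts high frequencies
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Frame the signal and apply a Hamming window
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank spanning 0 Hz to the Nyquist frequency
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # Log filterbank energies, then DCT to decorrelate; keep the first n_ceps coefficients
    feats = np.log(power @ fbank.T + 1e-10)
    return dct(feats, type=2, axis=1, norm="ortho")[:, :n_ceps]
```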

·      Acoustic-prosodic features

o   OpenSMILE

o   Parselmouth

o   Praat

o   Prosodic labeling and detection

o   Prosodic analysis: AuToBI – A Tool for Automatic ToBI annotation

§  PyToBI: ToBI labeling with Python

o   Video series on speech acoustics

·      ASR

o   Kaldi

o   Google Cloud Speech-to-Text

o   And more:

o   Basic information:




·      TTS

o   Simon King Merlin video tutorial:


o   WaveNet from Google DeepMind

o   Tacotron 1 and 2

·      Noise reduction:

o   Calculating spectral centroids

o   MFCC features (tutorial)

o   Median filtering

o   To remove background noise or music: Spleeter, Descript, Audacity

o   Denoising script (multiple methods included)
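Two of the operations above take only a few lines with numpy/scipy; this sketch removes impulsive click noise with a short median filter and computes a spectral centroid (the amplitude-weighted mean frequency of the magnitude spectrum). The signal is synthetic, for illustration only:

```python
import numpy as np
from scipy.signal import medfilt

rng = np.random.default_rng(0)
t = np.arange(1000) / 1000.0                 # 1 s at 1 kHz sampling
clean = np.sin(2 * np.pi * 5 * t)            # 5 Hz sine

# Inject impulsive ("click") noise at 20 random samples
noisy = clean.copy()
clicks = rng.choice(1000, size=20, replace=False)
noisy[clicks] += rng.choice([-3.0, 3.0], size=20)

# A short median filter removes isolated impulses while preserving the waveform
denoised = medfilt(noisy, kernel_size=5)

# Spectral centroid: amplitude-weighted mean frequency
spec = np.abs(np.fft.rfft(clean))
freqs = np.fft.rfftfreq(len(clean), d=t[1] - t[0])
centroid = np.sum(freqs * spec) / np.sum(spec)
```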

·      Old and new speech software:  

o   SoX audio conversion software


·      Spectrogram reading practice




Visual features

·      Fisher Vector encoding (FV)

·      Vector of Locally Aggregated Descriptors (VLAD)

·      Facial expression detection (FED)
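The VLAD encoding step is compact enough to sketch in numpy: each local descriptor is assigned to its nearest cluster center, the residuals are summed per center, and the concatenated result is L2-normalized. This assumes the centers have already been learned (e.g. with k-means):

```python
import numpy as np

def vlad(descriptors, centers):
    """VLAD: concatenated, L2-normalized sums of residuals of each
    descriptor from its nearest cluster center."""
    # Assign each descriptor to its nearest center (squared Euclidean distance)
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    k, dim = centers.shape
    v = np.zeros((k, dim))
    for i, c in enumerate(assign):
        v[c] += descriptors[i] - centers[c]
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
</```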


Statistical measures and z-score normalization

·      Pearson’s correlation

·      Krippendorff’s alpha

·      More here on ANOVA, Kruskal-Wallis test, regression, t-tests, Wilcoxon signed-rank test, F1 scores and z-score normalization:
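Pearson's correlation and z-score normalization are one-liners with scipy/numpy; the data here are made up for illustration:

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

r, p = pearsonr(x, y)  # Pearson's correlation coefficient and its p-value

def zscore(a):
    """z-score normalization: shift to zero mean, scale to unit standard deviation."""
    return (a - a.mean()) / a.std()

z = zscore(y)
```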


Machine Learning

·      Weka

·      Scikit-learn
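A minimal scikit-learn pipeline for text classification, shown here on an invented six-example sentiment toy set (bag-of-words counts feeding logistic regression):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled data for illustration only
texts = ["great movie", "wonderful acting", "terrible plot",
         "awful film", "great plot", "terrible acting"]
labels = [1, 1, 0, 0, 1, 0]

# CountVectorizer turns strings into bag-of-words vectors;
# the pipeline applies it automatically at fit and predict time
clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(texts, labels)
pred = clf.predict(["wonderful movie", "awful plot"])
```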

·      Deep learning models

o   ChatGPT – GPT-3.5, GPT-4 (zero-shot, fine-tuned) (OpenAI 2023)

o   Llama 2 (zero-shot, fine-tuned) (Touvron et al 2023)

o   PaLM 2 (Anil et al 2023)

o   Alpaca-LoRA (Hu et al 2021)

o   Transfer learning with a teacher-student network – many papers

o   Many BERT uses

o   Multi-Modality Multi-Loss Fusion Network: an end-to-end model that jointly optimizes feature extraction and the learning objective; applied across multiple multimodal corpora (Multimodal Learning with Transformers: A Survey, Xu, Zhu, and Clifton 2023)

o   ImageBind

o   Tensor Fusion Network (TFN): for emotion recognition, implemented in PyTorch (PyTorch tutorials available)
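The core TFN operation is an outer product of the per-modality embeddings, each augmented with a constant 1 so that unimodal, bimodal, and trimodal interaction terms all appear in the fused tensor; a minimal numpy sketch of just that step (the full model wraps it in learned subnetworks):

```python
import numpy as np

def tensor_fusion(z_audio, z_video, z_text):
    """Outer product of 1-augmented modality embeddings: the resulting
    tensor contains unimodal, bimodal, and trimodal interaction terms."""
    a = np.append(z_audio, 1.0)
    v = np.append(z_video, 1.0)
    t = np.append(z_text, 1.0)
    return np.einsum("i,j,k->ijk", a, v, t)
```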

o   MulT: a transformer encoder with cross-modal attention

o   Reading-comprehension datasets:

§  MultiSpanQA (Li et al 2022)

§  SQuAD (Rajpurkar et al 2016)

§  Quoref (Dasigi et al 2019)

·      Some other potentially useful papers:









