Fig. 1
From: Identifying tweets of personal health experience through word embedding and LSTM neural network

The pipeline used to generate the vocabulary and vector space model. A corpus of 22 million unlabeled tweets was collected and preprocessed to remove certain punctuation, duplicates, non-English tweets, and tweets with URLs. A collection of unique terms was compiled to generate a vocabulary, and a vector space model was created from the preprocessed tweets.
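
The filtering and vocabulary steps in this caption can be illustrated with a minimal sketch. It is not the authors' code. The function names (`preprocess`, `build_vocabulary`), the `is_english` hook, the lowercasing, the URL pattern, and the choice of which punctuation to strip are all assumptions made for illustration, since the caption does not specify them. Training the vector space model itself is left out.

```python
import re
import string
from collections import Counter

# Assumption: a tweet "has a URL" if it matches this simple pattern.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)

# Assumption: strip ASCII punctuation but keep '#', '@' and apostrophes.
# The caption only says "certain" punctuation was removed.
PUNCT_TABLE = str.maketrans(
    "", "", "".join(c for c in string.punctuation if c not in "#@'")
)


def preprocess(tweets, is_english=lambda text: True):
    """Drop URL tweets, non-English tweets, and duplicates; strip punctuation.

    `is_english` is a hypothetical hook; plug in a language detector of
    your choice. By default every tweet is accepted.
    """
    seen = set()
    cleaned = []
    for text in tweets:
        if URL_RE.search(text):
            continue  # remove tweets that contain URLs
        if not is_english(text):
            continue  # remove non-English tweets
        # Lowercasing and whitespace collapsing are assumptions.
        normalized = " ".join(text.translate(PUNCT_TABLE).lower().split())
        if not normalized or normalized in seen:
            continue  # remove empty results and duplicates
        seen.add(normalized)
        cleaned.append(normalized)
    return cleaned


def build_vocabulary(cleaned_tweets):
    """Map each unique term to an index, most frequent terms first."""
    counts = Counter(tok for tweet in cleaned_tweets for tok in tweet.split())
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term: idx for idx, (term, _) in enumerate(ordered)}


sample = [
    "I have the flu!",
    "I have the flu!",
    "Check this http://x.co flu",
    "Flu shots, today.",
]
cleaned = preprocess(sample)
vocab = build_vocabulary(cleaned)
```

On the four sample tweets, the duplicate and the URL tweet are dropped, leaving two cleaned tweets whose tokens form the vocabulary. The cleaned tweets would then be the input to whatever embedding method builds the vector space model.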