Analysis of associations between emotions and activities of drug users and their addiction recovery tendencies from social media posts using structural equation modeling

Background
Addiction to drugs and alcohol is one of the significant factors underlying the decline in life expectancy in the US. Several context-specific reasons influence drug use and recovery. In particular, emotional distress, physical pain, relationships, and self-development efforts are known to be among the factors associated with addiction recovery. Unfortunately, many of these factors are not directly observable, and quantifying and assessing their impact can be difficult. Based on social media posts of users engaged in substance use and recovery on the forum Reddit, we employed two psycholinguistic tools, Linguistic Inquiry and Word Count and Empath, together with the activities of substance users on various Reddit sub-forums, to analyze behavior underlying addiction recovery and relapse. We then employed a statistical analysis technique called structural equation modeling to assess the effects of these latent factors on recovery and relapse.
Results
We found that both emotional distress and physical pain significantly influence addiction recovery behavior. Self-development activities and social relationships of the substance users were also found to enable recovery. Furthermore, within the context of self-development activities, those related to the mental and physical well-being of substance users were found to be positively associated with addiction recovery. We also determined that a lack of social activities and physical exercise can enable a relapse. Moreover, geography, especially life in rural areas, appears to have a greater correlation with addiction relapse.
Conclusions
The paper describes how observable variables can be extracted from social media and then used to model important latent constructs that impact addiction recovery and relapse. We also report factors that impact self-induced addiction recovery and relapse.
To the best of our knowledge, this paper represents the first use of structural equation modeling of social media data with the goal of analyzing factors influencing addiction recovery.


Introduction
Substance use constitutes a major contemporary health epidemic. There were 70,237 substance use overdose deaths in 2017, a 9.6% increase from 2016 [1]. In the US, abuse of alcohol and other illicit drugs is estimated to cost over $740 billion annually through lost work productivity, health care, and crime [2]. Substance use can also increase the risk of liver [3] and lung diseases [4], and especially of infectious diseases such as hepatitis B and C and HIV/AIDS [5].
Drug addiction used to be considered a moral or character flaw. This view has changed significantly, and addiction is now considered a chronic illness characterized by health deterioration, poor social functioning, and loss of control over substance use [6]. Substance use has also been established to change brain function and make a user crave drugs. The substance use journey typically begins with experimentation, and because of the perceived positive effects, a person becomes addicted. After an individual decides to break the addiction cycle, they typically experience physical and emotional withdrawals that are manifested through sadness, restlessness, anxiety, nausea, vomiting, sweating, and cramping. Depending on factors such as the substances used as well as the amount and duration of use, such symptoms typically last for 3-5 days and can be managed by medications, vitamins, and exercise [2]. The notion of "recovery" is polysemous in that it may be considered as an ongoing process or as a granular event [7]. Regardless, recovery is a long-term process requiring continuous effort and diligence [2]. Substance withdrawal management regimes that can lead to recovery from addiction involve managing both the physical and the emotional symptoms experienced by individuals as they give up drugs. To manage these symptoms, individuals are typically recommended to focus on self-development [8,9] with the help of their families and friends [2]. Many individuals, however, relapse into drug use because they fail to follow substance use disorder treatment regimens [10].
Though managing emotional and physical symptoms during drug withdrawals is manifestly important, these constructs are multifarious, latent (i.e., not directly observable), and difficult or impossible to measure directly. In this paper, we propose the use of structural equation modeling (SEM), a multivariate latent variable modeling technique, to estimate critical latent constructs (italicized hereafter) such as emotional distress, physical pain, self-development, and relationships by analyzing social media activities of substance users. Social media has generated recent interest as a novel source of information in drug abuse epidemiology [11][12][13][14][15][16][17][18][19][20][21][22][23][24][25]. Being semi-anonymous, social media consists of unfiltered and self-reported conversations and activities of an individual. Of the different social media platforms, we used drug use and recovery data available on Reddit. This social media forum is the fifth most visited website in the USA and has over 330 million active users [26]. Reddit is a community-based social media forum where the communities (called subreddits) are created based on common interest. Members of a subreddit can post, vote, and comment in it. Each subreddit has moderators who ensure that the content posted by its members is topically focused. At the time of writing, there are more than 138,000 subreddits on Reddit [26], with a number of subreddits focusing on recreational drug use (RDU) and drug addiction recovery (DAR).

Problem formulation and overview of proposed approach
Our aim was to determine the effect of emotional distress, physical pain, self-development efforts, relationships (of the drug user), and geographic disparities on drug addiction recovery and relapse, using SEM as a rigorous modeling methodology. Solving this problem required addressing the following sub-problems: first, we needed to identify and determine the instances of emotional distress, physical pain, self-development efforts, relationships, and geographic disparities in the social media posts and activity of the drug users. Then, we had to come up with a model to infer the relationships between the unobserved constructs (emotional distress, physical pain, self-development efforts, and relationships) and the observable construct drug addiction recovery (determined by observing if a user posted in a drug addiction recovery forum). Our approach consisted of the following steps: (1) we used two psychometrically validated dictionaries, namely, Linguistic Inquiry and Word Count (LIWC) and Empath, to identify instances of emotional distress, physical pain, relationships, self-development efforts, and geographic disparities present in the posts of the drug user. (2) We also utilized the forum activity of the users on Reddit to identify the instances of self-development efforts and relationships. (3) We applied SEM to identify and quantify the relationship between emotional distress, physical pain, self-development, relationships, and geographic disparities on one hand and drug addiction recovery and relapse on the other.
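Step (1) above can be illustrated with a minimal sketch of dictionary-based scoring. The tiny lexicon, the function name `category_scores`, and the example post are hypothetical and stand in for the much larger LIWC and Empath dictionaries, which are not reproduced here. The sketch assumes that a category score is the percentage of a post's tokens that match the category's word list; the actual tools may tokenize, stem, and normalize differently.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon for illustration only; these are NOT the
# LIWC or Empath word lists.
LEXICON = {
    "sad": {"sad", "cry", "grief", "lonely"},
    "anxiety": {"anxious", "afraid", "worried", "panic"},
    "health": {"sick", "nausea", "pain", "cramps"},
}

def category_scores(post, lexicon=LEXICON):
    """Return, per category, the percentage of tokens in `post` that match it."""
    tokens = re.findall(r"[a-z']+", post.lower())
    if not tokens:
        return {category: 0.0 for category in lexicon}
    counts = Counter(tokens)
    return {
        category: 100.0 * sum(counts[word] for word in words) / len(tokens)
        for category, words in lexicon.items()
    }

# Made-up example post, not taken from the dataset.
scores = category_scores("Day three: nausea and cramps, and I feel anxious and sad.")
print(scores)
```

Scores of this kind, computed per user, would then serve as the observed indicators of the latent variables in the SEM.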

Prior work
A number of recent works have utilized data from social media in conjunction with methods from machine learning and natural language processing to study and understand patterns associated with a diverse set of health-related issues, such as influenza [27], mental health [28], and suicidal ideation [29]. In terms of studying substance abuse, early works focused on manual identification of themes and tonality of the drug use posts on social media [12,13]. The image-based social media platform Instagram was analyzed to conduct content analysis for codeine misuse in [14]. Studies have also investigated the use of social media for examining geographic differences in opioid-related discussions [15] and identified topics related to substance delivery methods, drug types, and other factors associated with recreational drug use [16]. In [17], transductive classification was applied to identify opioid addicts on Twitter. Other works have identified opioid use related tweets [18] and studied information sharing amongst drug users on Reddit [19]. Drug addiction recovery has been the focus of far fewer works. Among the latter, in our previous work, Eshleman et al. [20], random forests were used with subreddit activity as features to identify users open to addiction recovery interventions in a predictive setting. The Gini impurity criterion, which measures how often a random element from a set would be labeled incorrectly if labeled according to the distribution of labels in the set, was used to rank the different subreddits on the basis of their importance. This analysis found correlations between subreddit categories, such as mental health, spirituality, and relationships, and addiction recovery behavior. The SEM model in the current work was developed using two latent variables, "relationships" and "mental and physical well-being", both of which were directly inspired by findings reported in [20].
In particular, we used user activity in the subreddits "relationships", "relationship_advice", "parenting", and "childfree" to reflect the latent variable "relationships". Similarly, we used subreddits such as "meditation", "yoga", "gainit", "bodyweightfitness", and "running" to estimate the latent variable "mental and physical well-being". In other works, MacLean et al. [21] used a trans-theoretical model of behavior change to predict the stages of addiction recovery and relapse. Lu et al. [22] used the Cox regression model to identify transitions to addiction recovery subreddits. Chancellor et al. [23] studied recovery-related posts on Reddit to identify clinically unverified treatments for drug withdrawal popular amongst drug users on Reddit. Rubya et al. [24] investigated how users in online recovery communities enact anonymity. Finally, Tamersoy et al. [25] studied Reddit forums to characterize smoking and drinking abstinence and were able to predict long-term and short-term abstinence.
The current work addresses two outstanding issues in the state of the art of this problem domain. First, drug addiction recovery and relapse involve (latent) variables that cannot be directly measured and have to be inferred from observable variables. Second, the addiction and recovery processes involve a complex interplay of relationships between the observed and latent variables, which needs to be characterized. Current methods in the area rely on variables that have to be explicitly measured and consequently are incapable of addressing these two issues. We demonstrate how SEM can be a powerful framework to test, evaluate, and characterize multivariate causal relationships in addiction recovery and relapse where both observable and latent factors are involved.
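The Gini impurity criterion mentioned above in connection with [20] has a standard closed form: for class proportions p_k in a set, the impurity is 1 - sum_k p_k^2. The sketch below computes it, together with the impurity decrease of a single split, which is the quantity a random forest accumulates per feature to rank importance. The labels and the split are made up for illustration and are not taken from [20].

```python
from collections import Counter

def gini_impurity(labels):
    """Probability of mislabeling a random element drawn from `labels`
    when labels are assigned according to the label distribution."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Reduction in Gini impurity obtained by splitting `parent` into two children."""
    n = len(parent)
    return (
        gini_impurity(parent)
        - (len(left) / n) * gini_impurity(left)
        - (len(right) / n) * gini_impurity(right)
    )

# Hypothetical labels: "r" = posted in a recovery subreddit, "n" = did not.
parent = ["r", "r", "n", "n"]
print(gini_impurity(parent))                             # 0.5
print(impurity_decrease(parent, ["r", "r"], ["n", "n"]))  # 0.5
```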

Summary statistics
In Table 1 and Fig. 1 we present the correlations between the LIWC indicators in the withdrawal management model. From these data we observe that the majority of the LIWC variables are positively correlated with each other. We also observe some correlations that are less obvious. For example, the second (0.78) and third (0.72) highest correlations were for the category pairs "swear" and "sexual", and "anger" and "sexual", respectively. As displayed in Table 2, these high correlations were due to expletives common to these categories. We also see that the LIWC category "health" had high correlation values with categories such as "negative emotion" (0.39), "sad" (0.25), and "anxiety" (0.28). This indicates that users in our dataset usually talked about health (physical symptoms) in the context of negative emotions, as may be expected for users experiencing withdrawals.
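A pairwise correlation table of the kind described above can be sketched as follows. The text does not state which correlation coefficient was used, so the Pearson coefficient here is an assumption. The column values are toy numbers chosen only to make the example runnable; they are not the values behind Table 1.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Return a nested dict holding the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Toy per-user category scores (hypothetical values).
columns = {
    "anger": [1.0, 2.0, 3.0, 4.0],
    "swear": [2.0, 4.0, 6.0, 8.0],
    "health": [1.0, 3.0, 2.0, 4.0],
}
matrix = correlation_matrix(columns)
print(round(matrix["anger"]["swear"], 2), round(matrix["anger"]["health"], 2))
```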
In Table 2 we compare the values of the indicators for "emotional distress" and "physical pain" between the users who posted and those who did not post in DAR subreddits. The corresponding table for Empath variables is presented in Additional file 1: Table S1. We used LIWC to determine the value of each indicator for the posts of drug users in our dataset. Then the distribution of the indicator values for the set of users who posted in a DAR subreddit was compared with that for the set of users who did not, with the null hypothesis being that there was no difference between the distributions. The Mann-Whitney U-test [30], a non-parametric test, was used to compare the distributions, and we observed statistically significant differences between the two sets of users for each observable variable. The values of the indicators of the latent variable "emotional distress" were found to be higher for the users who displayed addiction recovery behavior. Posts corresponding to addiction recovery behavior typically had higher values for the LIWC categories "feel" (20%, p < 0.005), "anger" (22.2%, p < 0.005), "authentic" (9%, p < 0.005), "sexual" (13.3%, p < 0.005), "negative emotion" (20%, p < 0.005), "sad" (25%, p < 0.005), "affect" (7.5%, p < 0.005), "anxiety" (26.0%, p < 0.005), and "swear" (16.6%, p < 0.005) as compared to the other LIWC categories used by us (Table 2).
Similarly, the values for the indicators of the latent variable "physical pain" were higher for the users who displayed addiction recovery behavior. Accordingly, our data show that drug users complained about their health and physical discomforts during the withdrawal phase. Correspondingly, these posts were found to have higher values for the relevant LIWC categories: "body" (5.4%, p < 0.005), "health" (30.3%, p < 0.005), "biology" (13.3%, p < 0.005), and "death" (28.5%, p < 0.005) (Table 2).

Fig. 1
Correlation diagram of the LIWC variables present in the LIWC withdrawal management model (see also Table 1). Positive correlations are color-coded in blue and negative correlations in red. The size of each square represents the magnitude of the correlations. As this visualization indicates, every variable-pair in the model was positively correlated. The two highest correlation values were observed for the variable pairs "anger" and "swear" followed by "anger" and "sexual".

Figure 2 displays the final LIWC withdrawal management model with factor loadings (the values of the correlations are not displayed in the figure to maintain clarity). In Fig. 2, the effect of the variables "emotional distress" and "physical pain" on drug addiction recovery behavior is studied. We estimated the latent variable "emotional distress" with nine LIWC categories: "negative emotion", "sad", "anger", "anxiety", "feel", "affect", "swear", "sexual", and "authentic". The latent variable "physical pain" was estimated using four indicators: "biology", "death", "health", and "body". All of the paths in the model were found to be statistically significant. Both "emotional distress" and "physical pain" were found to influence addiction recovery behavior. However, "emotional distress" was found to be more evident in withdrawal than "physical pain"; all of the indicator variables for "emotional distress" were found to have a strong effect on withdrawal, with the LIWC categories "anger" and "swear" being the two most significant indicators.
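The group comparison described above can be sketched with a small pure-Python Mann-Whitney U computation. This is an illustration under stated assumptions rather than the procedure actually used: the p-value comes from the large-sample normal approximation without tie or continuity corrections, and the two samples are made-up scores, not data from Table 2. In practice a statistics library would normally be used instead.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U statistic of x, two-sided p-value by normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u1 - mean_u) / sd_u
    return u1, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical "anxiety" scores: DAR posters vs. non-posters.
u_stat, p_value = mann_whitney_u([5.0, 6.0, 7.0], [1.0, 2.0, 3.0])
print(u_stat, p_value)
```

The U statistic counts the pairs in which a value from the first group exceeds a value from the second (ties count one half), so it depends only on ranks and makes no normality assumption about the scores.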
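The structure of the LIWC withdrawal management model described above can be written down compactly. The sketch below assembles a model description in lavaan-style syntax (`=~` for "is measured by", `~` for "is regressed on"), which is accepted by the R package lavaan and, in similar form, by the Python package semopy. The text does not name the software used, so this syntax, the identifier-style variable names, and the outcome name `recovery` are assumptions. The covariances between indicators that the final model includes are omitted.

```python
# Indicator lists as named in the text, rewritten as identifiers.
MEASUREMENT = {
    "emotional_distress": [
        "negative_emotion", "sad", "anger", "anxiety", "feel",
        "affect", "swear", "sexual", "authentic",
    ],
    "physical_pain": ["biology", "death", "health", "body"],
}

def build_model_spec(measurement, outcome):
    """Build a lavaan-style description: one measurement line per latent
    variable, then one structural line regressing the outcome on all latents."""
    lines = [
        f"{latent} =~ {' + '.join(indicators)}"
        for latent, indicators in measurement.items()
    ]
    lines.append(f"{outcome} ~ {' + '.join(measurement)}")
    return "\n".join(lines)

# "recovery" is a hypothetical name for the observed outcome
# (whether a user posted in a DAR subreddit).
spec = build_model_spec(MEASUREMENT, "recovery")
print(spec)
```

The same helper could describe the other models in the paper by swapping in their latent variables and indicators.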

Path analysis
RMSEA, SRMR, CFI, and TLI were used to assess the model fit. The results based on the hypothesized model indicated a decent fit with RMSEA = 0.08, TLI = 0.90, CFI = 0.95, and SRMR = 0.07. The relatively higher value observed for the RMSEA was due to the covariance between the LIWC categories. These covariances increased the number of paths that had to be estimated in the model, reduced the degrees of freedom of the model, and led to relatively higher RMSEA values. The values for the TLI, CFI, and SRMR indices all indicate high-quality model fit. Table 3 summarizes the results of the final SEM model.

Fig. 2
The LIWC withdrawal management model. Ellipses indicate latent variables, rectangles represent observed variables, a straight line with one arrowhead represents a direct effect, and a curved line represents covariance. As indicated by this model, emotional and physical pain positively affect the recovery propensity of a drug user. However, for the LIWC indicators, emotional factors were found to be more important than physical factors.

Summary statistics
In Table 4 and Fig. 3 we present the correlations between the Empath indicators for the withdrawal management model. Similar to the LIWC variables, all of the Empath variables in the model were also found to be positively correlated with each other, with the categories "pain" and "shame" (0.89) followed by "suffering" and "hate" (0.71) having the highest correlation values. The Empath category "suffering" was also found to be correlated with "medical_emergency" (0.22), "weakness" (0.25), "health" (0.34), and "pain" (0.69), indicating that users in the withdrawal phase discussed physical symptoms in the context of distress.

Fig. 3
Correlation diagram of the Empath variables present in the Empath withdrawal management model (see also Table 4). Positive correlations are color-coded in blue and negative correlations in red. The size of each square represents the magnitude of the correlations. As this visualization indicates, every variable-pair in the model is positively correlated. The two highest correlation values were observed for the variable-pairs "pain" and "shame" followed by "suffering" and "hate".

In Additional file 1: Table S1 we compare the values of the Empath-based indicators for "emotional distress" and "physical pain" between the users who posted and those who did not post in DAR subreddits. Figure 4 displays the Empath indicator-based withdrawal management model with factor loadings (the values of the correlations are not displayed in the figure to maintain clarity). In this figure, the effect of "emotional distress" and "physical pain" on drug addiction recovery behavior is studied. We estimated the latent variable "emotional distress" with four Empath categories: "negative_emotion", "hate", "shame", and "suffering". The latent variable "physical pain" was estimated using four indicators: "pain", "medical_emergency", "weakness", and "health". All of the paths in the model were found to be statistically significant. As was the case for the model built using LIWC indicators, both "emotional distress" and "physical pain" were found to influence addiction recovery behavior.
All of the indicators for "emotional distress" had a strong positive effect, with "shame" and "suffering" being the most contributory. Similarly, all of the indicators for "physical pain" had a strong positive effect, with "pain" having the highest effect. As opposed to the LIWC model, however, "physical pain" was found to be more evident in withdrawal than "emotional distress". The model quality was determined using RMSEA, SRMR, CFI, and TLI. The hypothesized model indicated a good fit with RMSEA = 0.07, TLI = 0.96, CFI = 0.98, and SRMR = 0.03. Similar to the LIWC model, the relatively higher value observed for the RMSEA was due to the covariance between the Empath categories. The values for the TLI, CFI, and SRMR indices all indicate high-quality model fit. Table 5 summarizes this SEM model.
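Three of the fit indices reported for the withdrawal management models can be computed from chi-square statistics using their standard textbook definitions, as sketched below. The inputs are the chi-square and degrees of freedom of the fitted model and of the baseline (independence) model, plus the sample size. The numbers in the example are invented, since the underlying chi-square values are not given in the text. Some software uses N rather than N - 1 in the RMSEA denominator. SRMR is omitted because it requires the observed and model-implied covariance matrices.

```python
import math

def rmsea(chi2, df, n):
    """Root mean square error of approximation for a model fitted to n cases."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative fit index relative to the baseline (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, 0.0)
    denom = max(d_model, d_base)
    return 1.0 if denom == 0 else 1.0 - d_model / denom

def tli(chi2_model, df_model, chi2_base, df_base):
    """Tucker-Lewis index, which compares chi-square/df ratios."""
    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    return (ratio_base - ratio_model) / (ratio_base - 1.0)

# Invented values for illustration only.
print(rmsea(150.0, 50, 1001))       # about 0.045
print(cfi(150.0, 50, 2100.0, 100))  # about 0.95
print(tli(150.0, 50, 2100.0, 100))  # about 0.90
```

The RMSEA formula also shows why freeing additional covariance paths can raise it: for a fixed excess of chi-square over the degrees of freedom, a smaller df in the denominator gives a larger value.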

Fig. 4
The Empath indicators-based withdrawal management model. Ellipses indicate latent variables, rectangles represent observed variables, a straight line with one arrowhead represents a direct effect, and a curved line represents covariance. As indicated by this model, emotional and physical pain were found to positively influence the propensity of a drug user to recover. Unlike the model built using LIWC indicators, for the Empath indicators physical factors were found to be more important than emotional factors in recovery.

The recovery efforts model obtained using subreddit activities

Analysis of subreddit activities
In Fig. 5 and Additional file 1: Table S4 we present the correlations between the forum activities used in the SEM model for recovery efforts. From the figure and table, we observed that, unlike for the LIWC variables, the correlation values between the forum activities across different subreddits were low. The highest correlation was between the forums "careerguidance" and "resumes" (0.3), followed by "entrepreneur" and "careerguidance" (0.2). The comparison of the forum activity for the users who posted and did not post in a DAR subreddit was conducted in a manner similar to that described for the withdrawal management model (Table 6). The values of the subreddit activities corresponding to the latent variable "mental and physical well-being" were higher for users who displayed addiction recovery behavior. Some of these subreddits were: "fitness" (66.6%, p < 0.005), "meditation" (85.7%, p < 0.005), "yoga" (85.7%, p < 0.005), "gainit" (66.6%, p < 0.005), "bodyweightfitness" (100%, p < 0.005), and "running" (75.8%, p < 0.005) (Table 6). Similarly, the values for the subreddit activities corresponding to the latent variable "career" were higher for users who displayed addiction recovery behavior. Some of these subreddits were: "jobs" (96.2%, p < 0.005), "entrepreneur" (66.6%, p < 0.005), "careerguidance" (66.6%, p < 0.005), and "resumes" (66.6%, p < 0.005). Finally, the values of the subreddit activities corresponding to the latent variable "relationships" were also found to be higher for users who displayed addiction recovery behavior. Examples of subreddits for which enhanced activity was observed included: "relationships" (66.6%, p < 0.005), "relationship_advice" (50%, p < 0.005), "parenting" (50%, p < 0.005), and "childfree" (66.6%, p < 0.005) (Table 6). Figure 6 shows the subreddit activity-based recovery model with factor loadings (the values of the correlations are not displayed in the figure to maintain clarity).
In it, the effect of "mental and physical well-being", "career", and "relationships" on drug addiction recovery behavior is studied. We estimated the latent variable "mental and physical well-being" with six indicators: "fitness", "meditation", "yoga", "gainit", "bodyweightfitness", and "running". The latent variable "career" was estimated using four indicators: "jobs", "entrepreneur", "careerguidance", and "resumes". Finally, the latent variable "relationships" was estimated using the following four indicators: "relationship_advice", "relationships", "parenting", and "childfree". The effect of "mental and physical well-being" and "relationships" on addiction recovery behavior was found to be statistically significant and positive, whereas the effect of "career" on addiction recovery behavior was negative and statistically insignificant. All of the indicator variables for "mental and physical well-being" had a strong positive effect, with "fitness" and "bodyweightfitness" being the most contributory. Similarly, the indicator variables for "relationships" also had a strong positive effect on "relationships" (except "childfree", which was statistically insignificant). "relationship_advice" had the highest effect on "relationships", followed by the subreddit "relationships". Between "relationships" and "mental and physical well-being", "relationships" was found to be more important for addiction recovery behavior. The fit indices for the final model indicated a good fit: RMSEA = 0.02, TLI = 0.90, CFI = 0.92, and SRMR = 0.02.
Table 7 summarizes the SEM model.

Fig. 5
Correlation diagram of the subreddit activities present in the recovery efforts model (see also Additional file 1: Table S4). Positive correlations are color-coded in blue and negative correlations in red. The size of each square represents the magnitude of the correlations. As this visualization indicates, every variable-pair in the model is positively correlated. The two highest correlation values were observed for the variable-pairs "careerguidance" and "resumes" followed by "careerguidance" and "entrepreneur".

Summary statistics
In Table 8 and Fig. 7 we present the correlations observed between the LIWC indicators in the relapse model. All of the LIWC variables were found to be positively correlated with each other, with the highest correlation observed for the categories "you′" and "female′" (0.76) followed by "you′" and "male′" (0.72). In Additional file 1: Table S3 we compare the values of the LIWC-based indicators for "anti-social", "motion′" (lack of physical activity), and "religion′" (lack of religiosity) between the users who relapsed and those who did not.

Fig. 6
The SEM model for addiction recovery using subreddit activities. Mental and physical well-being (MPWB) and relationships were found to positively influence addiction recovery behavior. Career/job prospects negatively affected recovery behavior; however, the effect was statistically insignificant.

Table 7 Latent variable factor structure, direct effects, and covariances of the final subreddit activity-based recovery SEM model
The symbol '-> ' represents a path or direct effect in the model. "Relationships" have a positive impact on addiction recovery. "Mental and physical well-being" (MPWB) also has a positive impact on addiction recovery. However, the impact of "career" was negative and statistically insignificant.

Fig. 7
Correlation diagram of the LIWC variables present in the LIWC relapse model (see also Table 8). Positive correlations are color-coded in blue and negative correlations in red. The size of each square represents the magnitude of the correlations. As this visualization indicates, every variable-pair in the model is positively correlated. The two highest correlation values were observed for the variable-pairs "you′" and "female′" followed by "you′" and "male′".

Relationships between variables
Figure 8 shows the final LIWC-based relapse model with factor loadings (the values of the correlations are not displayed in the figure to maintain clarity). In this figure, the effect of "anti-social", "motion′" (lack of physical activity), and "religion′" (lack of religiosity) on relapse behavior is studied. We estimated the latent variable "anti-social" using the negation of the following six LIWC categories: "friend", "we", "shehe", "you", "male", and "female". The effects of "anti-social" and of the negation variables "motion′" and "religion′" were found to increase relapse behavior and were statistically significant. The effect of the negation variable "tone′" (lack of positive emotion) on relapse behavior was negative and statistically insignificant. All of the indicator variables for "anti-social" had a strong positive effect, with "you′" and "male′" being the most contributory. "Anti-social" was found to have the highest effect on relapse behavior. The fit indices for the final model indicated a good fit: RMSEA = 0.07, TLI = 0.96, CFI = 0.98, and SRMR = 0.03. Table 9 summarizes the model.

Fig. 8
Final model of factors for the LIWC relapse model. "Anti-social", "religion′", and "motion′" were found to positively influence relapse behavior. "Tone′" negatively affected relapse behavior; however, its effect was statistically insignificant.

Table 9 Latent variable structure, direct effects, and covariances of the LIWC-based SEM model for relapse
The symbol '-> ' is used to represent a path or direct effect in our SEM model. The negation of a variable is indicated by a prime. "Anti-social", "motion′", and "religion′" had a positive impact on relapse behavior.

Summary statistics
In Additional file 1: Table S2 we compare the values of the Empath-based indicators for the negation variables "positive emotion′" (lack of positive emotion), "career′" (lack of career interests), and "urban′" (lack of urban facilities) between the users who relapsed and those who did not. In Table 10 and Fig. 9 we present the correlations between the Empath indicators present in the relapse model. Similar to the LIWC variables, all of the Empath variables in the model were found to be positively correlated with each other, with the categories "joy′" and "zest′" (0.95) followed by "white_collar_job′" and "blue_collar_job′" (0.69) having the highest correlation values. Figure 10 displays the Empath indicator-based relapse model with factor loadings (the values of the correlations are not displayed in the figure to maintain clarity). In this figure, the effect of "positive emotion′", "career′", and "urban′" on relapse behavior is shown. We estimated the latent variable "positive emotion′" with the negation of the following Empath indicators: "joy", "zest", "cheerfulness", and "positive emotion". The latent variable "career′" was estimated using the negation of three Empath indicators: "blue_collar_job", "white_collar_job", and "office". All of the paths in the model were found to be statistically significant. "Positive emotion′", "career′", and "urban′" were all found to lead to relapse, and their effects were statistically significant.
The indicator variables for "positive emotion′" were found to have a strong effect, with "joy′" and "zest′" being the most contributory. Similarly, all of the indicators for "career′" also had a strong effect, with "white_collar_job′" and "office′" being the most contributory. The fit indices indicated a good fit for this model: RMSEA = 0.04, TLI = 0.98, CFI = 0.99, and SRMR = 0.07. This model is summarized in Table 11.

Fig. 9
Correlation diagram of the Empath variables present in the Empath relapse model (see also Table 10). Positive correlations are color-coded in blue and negative correlations in red. The size of each square represents the magnitude of the correlations. As this visualization indicates, every variable-pair in the model was positively correlated. The two highest correlation values were observed for the variable-pairs "joy′" and "zest′" followed by "white_collar_job′" and "blue_collar_job′".

The role of emotional distress and physical pain in withdrawal management
We observed that both emotional distress and physical pain played a significant role for redditors who displayed addiction recovery and relapse related behavior. To understand the reason behind this observation, we further investigated the posts from individuals discussing their withdrawals from drugs. We observed that users typically experienced both physical pain and emotional distress during withdrawal. We also often observed that users had employed chemical treatments such as methadone and Suboxone, alternative therapies such as kratom, Xanax, and loperamide, as well as other supplements known to suppress the physical symptoms of withdrawal.
We found interventions for assuaging emotional distress to be less prevalent. In Table 12 we present example posts describing some of the measures taken by individuals to suppress physical pain and discomfort. Interestingly, many users who had successfully managed their withdrawal process and were well into recovery were observed to display a sense of loss after giving up their drug of choice. Paraphrased examples of posts describing such behavior are shown in Table 13.

Fig. 10
The Empath indicator-based relapse model. Ellipses indicate latent variables, rectangles represent observed variables, a straight line with one arrowhead represents a direct effect, and a curved line represents covariance. As indicated by this model, "positive emotion′", "career′", and "urban′" were found to positively influence the relapse behavior of a drug user.

Table 12 Paraphrased posts discussing different therapies utilized by the drug users to suppress physical discomforts during withdrawals

Mental and physical well-being
Both mental and physical well-being were found to have a positive effect on addiction recovery behavior. Physical activities are known to increase the production of dopamine, noradrenaline, and serotonin and can act as mechanisms for a natural high [31][32][33][34][35][36][37][38][39]. Many initiatives such as "lace-'em-up" have demonstrated the importance of physical activity for recovering addicts [40]. Our work confirms that similar conclusions can be drawn by analyzing social media data. In Table 14 we display paraphrased excerpts from posts demonstrating the positive effects of mental and physical activities on addiction recovery behavior.

Relationships
We found that "relationships" had a positive effect on addiction recovery. Unsurprisingly, friends and family play an important role in the addiction recovery efforts of an individual. Several reasons underlie this finding. First, the stigma associated with drug use causes an individual to feel shame and fear discrimination. Consequently, they do not feel safe discussing their issues with co-workers or strangers. It has been shown that addicts and recovering addicts feel comfortable sharing their addictions and recovery journey with friends and family [41]. Research has also highlighted the willingness and positive outcomes of users undergoing addiction recovery efforts with the help and support of drug-free friends, family members, and significant others [42]. Our analysis of social media data led to similar conclusions. In Table 15 we share excerpts from posts depicting the different ways friends and family affect addiction recovery behavior.

Jobs and career
We observed a negative, albeit statistically insignificant, effect of career/job opportunities on addiction recovery behavior. As noted in the "Research design and methods" section, the addiction literature is ambiguous on the effect of profession on addiction recovery. To highlight this point, we present example posts showing both the negative and positive aspects of profession on addiction recovery in Table 16.

Supporting addiction recovery and personalized addiction recovery care
Personalized addiction recovery treatments have been found to be essential for successful abstinence [43,44]. Our results identifying the impact of family and friends, self-development efforts, emotional distress, and physical pain on addiction recovery can be utilized to provide direction for a person's recovery. For example, an individual in the initial stages of abstinence may be asked to focus on mental and physical well-being and, at least for some time, to stay away from high pressure situations (new jobs or returning to a previous stressful job). Their family and friends could also be made aware of their role in an individual's recovery and how they can provide a safe non-judgmental space for the afflicted individual. Additionally, efforts could be made to manage emotional pains and cravings during and after the withdrawal period.

Conclusions
Table 16 Example posts showing both the negative and positive aspects of profession on addiction recovery

Hey everyone. How do you guys handle a high pressure career in recovery, particularly early recovery. I've seen fellow redditors who are in the corporate grind. I work a Wall Street job, with unpredictable and stressful hours. I am 10 days clean now, but the timing and pressure keeps on triggering me to use again. If anyone has any experience they can share, would be much appreciated. It's an extremely well paying job and I don't want to just walk away from it. Thanks guys

Tomorrow will be day 10 from snorting dope and honestly it's been great! I also got a 2nd full time job at night last month so which keeps me busy and helps me sustain myself. Feels great to have some money for once! I don't know why but this feels like the time it will actually work out

I cleaned up about 3 years ago entirely on my own will power. I found my calling-my dream job. It helped me stay busy and get over my cravings. The enjoyment I felt moving forward in my career was so much more enthralling than getting high off any other drug

In this paper, we have described a framework that uses SEM to analyze and quantify latent constructs for modeling addiction recovery behavior using data from social media. The paper presents different SEM models to quantify the relationship between a number of observable and latent variables and their link to substance addiction.
To the best of our knowledge, this is the first study to utilize social media data and SEM to measure the latent constructs associated with substance abuse and recovery. Our results underscore the value of information present on social media platforms like Reddit to the study of substance misuse and design of interventions.

Data source and participants
We used a set of 117 recreational drug use (RDU) subreddits and 29 drug addiction recovery (DAR) subreddits reported in our prior works to identify users discussing drug use and recovery on Reddit [20,45]. In [20] we utilized the word2vec algorithm [46] to create a term embedding space. In this space, related terms were grouped using an iterative set expansion technique to construct drug-use and addiction-recovery lexicons. These lexicons were subsequently employed to characterize the different subreddits, following which bi-clustering was used to cluster the different RDU and DAR subreddits. These bi-clusters were further manually curated to arrive at the two sets of RDU and DAR subreddits. For this paper, we further identified 170,097 unique users discussing their drug use and recovery in these two RDU and DAR subreddit sets. For each of these users we retrieved their 1000 most recent posts (the specific number of retrieved posts was platform imposed) using the PRAW API [47]. Finally, we filtered out those users who had fewer than five non-empty posts in the RDU and DAR subreddit sets. As a consequence of this filtering, we ended up with a set of 7025 users consisting of 2679 users who posted in both RDU and DAR subreddits, and 4346 users who posted only in an RDU subreddit. In Table 17 we present example posts in different RDU and DAR subreddits.
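The filtering step described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the subreddit names, the tuple layout of a post, and the function name `select_users` are hypothetical, and the handling of users who posted only in DAR subreddits (dropped here) is an assumption, since the text does not describe that group.

```python
from collections import defaultdict

# Hypothetical placeholder names; the curated sets in the paper contain
# 117 RDU and 29 DAR subreddits.
RDU = {"rdu_example_a", "rdu_example_b"}
DAR = {"dar_example_a"}
MIN_POSTS = 5  # threshold stated in the text


def select_users(posts, rdu=RDU, dar=DAR, min_posts=MIN_POSTS):
    """posts: iterable of (user, subreddit, text) tuples.

    Returns {user: "RDU+DAR" or "RDU only"} for users with at least
    `min_posts` non-empty posts in the RDU and DAR subreddit sets.
    """
    counts = defaultdict(lambda: {"rdu": 0, "dar": 0})
    for user, subreddit, text in posts:
        if not text or not text.strip():
            continue  # empty posts do not count towards the threshold
        name = subreddit.lower()
        if name in rdu:
            counts[user]["rdu"] += 1
        elif name in dar:
            counts[user]["dar"] += 1

    selected = {}
    for user, c in counts.items():
        if c["rdu"] + c["dar"] < min_posts:
            continue  # too few posts in the two subreddit sets
        if c["rdu"] and c["dar"]:
            selected[user] = "RDU+DAR"
        elif c["rdu"]:
            selected[user] = "RDU only"
        # DAR-only users are dropped here (assumption, see lead-in)
    return selected
```

Counting per user before applying the threshold keeps the filter independent of the order in which posts are retrieved.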

Overview of modeling and analysis
In Fig. 11 we display the key steps of our analysis process. We used LIWC or Empath to analyze the posts of the users in our dataset to extract language features, such as negative emotions, anxiety, and pain, associated with the recovery/relapse behavior of drug users. We next hypothesized certain unobserved (latent) variables for the observed features as well as the relationships between observed and latent variables. The model and its goodness of fit were iteratively analyzed and refined using SEM to obtain the final path diagram displaying the interrelationships between latent and observed variables and recovery/relapse behavior. In the following, we describe each of the modeling steps.

Linguistic feature specification using LIWC and Empath
LIWC [48] and Empath [49] are text analysis tools developed to measure psychological, cognitive, emotional, and behavioral components in a given text sample using human-validated dictionaries. Given a piece of text, these dictionaries can be utilized to make complex determinations, such as calculating the percentage of terms related to sadness, religion, finance, negative emotions, or physical activity. In particular, LIWC outputs the percentage of total words that belong to 90 unique categories defined therein. Empath operates similarly and uses over 200 categories. Empath can also be used to create new categories by defining appropriate seed terms. Our research used the existing categories of Empath.
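To make the idea of a dictionary-based category score concrete, the sketch below computes the percentage of tokens in a text that fall into each category. It is only an illustration of the general principle: the two-category lexicon is a hypothetical toy, the tokenizer is a simple regular expression, and the actual LIWC and Empath dictionaries and matching rules are considerably richer.

```python
import re

# Hypothetical toy lexicon, not taken from LIWC or Empath.
TOY_LEXICON = {
    "sad": {"sad", "cry", "grief"},
    "health": {"pain", "sick", "clinic"},
}


def category_percentages(text, lexicon=TOY_LEXICON):
    """Return, per category, the percentage of tokens found in that category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {name: 0.0 for name in lexicon}
    return {
        name: 100.0 * sum(tok in words for tok in tokens) / len(tokens)
        for name, words in lexicon.items()
    }
```

Applied to every user's posts, a function of this kind yields one row of category values per user, which is the form of input the subsequent modeling steps operate on.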

Basic concepts and definitions of structural equation modeling
In this section we describe the essential terms and concepts used in SEM. SEM is also referred to as the analysis of covariance structures, as model fitting is accomplished by utilizing the observed covariances of the variables. For a detailed explanation of SEM, the reader is referred to [50]. SEM models are represented graphically as path diagrams that depict the relationships among variables. In SEM terminology, observed variables (manifest variables) are those variables that are present in the dataset and can be measured. These variables are represented as rectangles in a path diagram. By contrast, latent variables are not directly observable. Latent variables can be interpreted as the causes of manifest variables and are represented as ovals in the path diagram. In these diagrams, putative relationships between two variables are represented as directed edges (paths) weighted by path coefficients that are analogous to regression coefficients. Latent variables or error terms that co-vary are joined by curved arrows in the path diagram. SEM designates two other sets of variables: exogenous variables are determined outside of the model and have no paths pointing to them, while endogenous variables are determined by the system of equations and have at least one path pointing to them. Both exogenous and endogenous variables can be observable or latent. Finally, for a specific model, its degrees of freedom (d) denote the number of model parameters that are allowed to vary. Specifically, d is the difference between the number of possible parameters that can be estimated and the number of parameters actually estimated. The number of possible parameters is quadratic in p, the number of observed variables, while the number of estimated parameters consists of all the paths (direct effects, correlations, error terms) being estimated in the model. A model is considered to be under-identified, just-identified, or over-identified if d < 0, d = 0, and d > 0 respectively.
To estimate and evaluate the relationships in the model correctly we need to have d > 0.
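As a small worked example of the identification rule, the sketch below uses the standard count of p(p + 1)/2 distinct variances and covariances for p observed variables as the number of possible parameters. The function names and the figure of 30 free parameters are hypothetical and chosen only for illustration; they are not the parameter count of any model in this paper.

```python
def degrees_of_freedom(p, n_free_params):
    """d = p(p + 1)/2 - q, where p(p + 1)/2 is the number of distinct
    variances and covariances of p observed variables and q is the
    number of parameters estimated in the model."""
    return p * (p + 1) // 2 - n_free_params


def identification(d):
    """Classify a model by its degrees of freedom."""
    if d < 0:
        return "under-identified"
    if d == 0:
        return "just-identified"
    return "over-identified"


# 14 observed variables give 105 distinct (co)variances; with a
# hypothetical 30 free parameters the model has d = 75.
d_example = degrees_of_freedom(14, 30)
```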
It is important to clarify the relationship between SEM and another popular graph-based probabilistic reasoning framework, called Bayesian Networks (BN). We begin by noting that SEM does not denote a single technique; it refers to a family of related procedures. This family can be broadly characterized in terms of taking three inputs and generating three outputs [51]. The inputs are: (1) one or more qualitative causal hypotheses, (2) a set of questions about causal relations among variables of interest, and (3) a model instance. The outputs of SEM are: (1) estimates of model parameters for hypothesized effects, (2) a set of logical implications of the model that can be tested in the data, and (3) a measure of how well the testable implications of the model are supported by the data. The point of SEM is to test a theory by specifying a model that represents predictions of the aforementioned theory from among plausible constructs measured with appropriate observed variables. BN represent dependencies among sets of random variables as (causal) graphs which are traversed to update conditional probabilities of events. The ideas underlying BN have been extended to the broader problem of causal inference under a framework called the structural causal model (SCM), which is subsumed under the umbrella of SEM [52]. In our problem context, a direct application of BN entails limitations. In particular, BN cannot differentiate between causal and non-causal relationships without intervention from a domain expert [53]. Furthermore, it is non-trivial to employ BN while differentiating between latent and observed variables, which is a core requirement in our research. Finally, the output of BN is known not to be well suited for theoretical explanations [54].

The process of structural equation modeling
SEM is an iterative process and involves the following steps: (1) Model specification: At this step a researcher hypothesizes the latent variables, the observed variables, and the relationships between them. (2) Estimation: The proposed model structure is estimated by using covariance analysis to solve a system of equations representing the interrelationships in the system. (3) Evaluation of model fit: The model fit can be evaluated using a variety of measures, such as, the comparative fit index (CFI), the Tucker Lewis index (TLI), root mean square error of approximation (RMSEA), and standardized root mean square residual (SRMR). (4) Model re-specification: If the initial fit is not deemed to be adequate, the model is modified and the above steps iterated.

SEM estimation
In the estimation step the difference between the sample covariance (C) and the model-predicted covariance (C(θ)) is minimized. The underlying idea is that the covariance matrix of the observed variables is a function of a set of parameters. If the parameters are correctly estimated (i.e. the model is correct) then the population covariance matrix will be exactly reproduced as shown in Eq. (1), where θ denotes the vector of model parameters.

$$C = C(\theta) \qquad (1)$$
The standard form of the structural equation relating the endogenous and exogenous variables is:

$$y = By + \Gamma x + \zeta \qquad (2)$$

In Eq. (2), y (n × 1) denotes the n dependent or endogenous variables, x (m × 1) denotes the m exogenous variables, and ζ (n × 1) denotes the specification errors. The matrix B (n × n) denotes the coefficients of the regression of y variables on other y variables, with zeros on the diagonal, which implies a variable cannot cause itself. The matrix Γ (n × m) denotes the coefficients of regression of the endogenous variables on the exogenous variables. A maximum likelihood function is used to fit the structural model equations by minimizing the fitting function (F_ML) shown in Eq. (3):

$$F_{ML} = \log\lvert C(\theta)\rvert + \mathrm{tr}\left(S\,C(\theta)^{-1}\right) - \log\lvert S\rvert - p \qquad (3)$$

In Eq. (3), S is the sample covariance matrix, |.| denotes the determinant, tr(.) denotes the trace of a matrix, and p is the number of observed variables. Additionally, in SEM, it is assumed that C(θ) and S are positive-definite, which means they are non-singular.
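The fitting function referred to as Eq. (3) can be evaluated numerically as sketched below. The sketch assumes the standard maximum-likelihood discrepancy, log|C(θ)| + tr(S C(θ)^-1) − log|S| − p, with p the number of observed variables; the function name `f_ml` and the use of NumPy are choices made for illustration and are not taken from the paper.

```python
import numpy as np


def f_ml(S, C_theta):
    """Maximum-likelihood discrepancy between the sample covariance S
    and the model-implied covariance C_theta (both p x p)."""
    S = np.asarray(S, dtype=float)
    C_theta = np.asarray(C_theta, dtype=float)
    p = S.shape[0]

    sign_c, logdet_c = np.linalg.slogdet(C_theta)
    sign_s, logdet_s = np.linalg.slogdet(S)
    # A positive determinant is necessary (though not sufficient) for
    # positive-definiteness; it is checked here so the logs are defined.
    if sign_c <= 0 or sign_s <= 0:
        raise ValueError("S and C_theta must have positive determinants")

    # tr(S C^-1) equals tr(C^-1 S); solve() avoids forming the inverse.
    trace_term = np.trace(np.linalg.solve(C_theta, S))
    return logdet_c + trace_term - logdet_s - p
```

The value is zero when the model-implied matrix reproduces S exactly and grows as the two matrices diverge, which is why an optimizer over θ can use it directly as the objective.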

Employing SEM for social media data modeling: an operational explanation
In this section, we explain the progression of our analysis process from Reddit posts to a final SEM model. As the specific context, we describe the withdrawal management modeling process using LIWC indicators. To generate this model, we used 209,804 posts from 7025 drug users. The withdrawal management model involved nine LIWC categories: "negative emotion", "sad", "anger", "anxiety", "feel", "affect", "swear", "sexual", and "authentic", which were postulated to capture the emotive underpinnings of a post. Similarly, the four LIWC categories: "biology", "death", "health", and "body" were postulated to describe physical discomfort. In Table 2 we present example posts and the terms identified by LIWC for the aforementioned categories. We also present post-specific LIWC category values in the table. Also, Additional file 2: Table S1 contains the LIWC category values for a sample set of 1000 users engaged in substance use. Finally, the (binary) variable "recovery" was the outcome variable of the model; it was set to 1 if an individual posted in a DAR subreddit, else it was set to 0. As explained in Fig. 11, the posts of these users were analyzed using LIWC to generate the matrix M of size 7025 × 14.
In SEM, variables that can be measured constitute the observable variables. In our context (Fig. 2) this role was fulfilled by the thirteen LIWC categories listed above (these variables are represented as rectangles in the path diagram shown in Fig. 2). Our hypothesis was that the latent variable "emotional distress" (latent variables are represented as ovals in Fig. 2) could be measured using the LIWC categories: "negative emotion", "sad", "anger", "anxiety", "feel", "affect", "swear", "sexual", and "authentic", while the latent variable "physical pain" could be measured via the LIWC categories: "biology", "death", "health", and "body". Finally, we hypothesized that these two latent variables had a direct effect on the recovery behavior as reflected by the Reddit posts of drug users. We measured the recovery behavior (observed variable) by using a binary variable "recovery" which was set to 1 if a user was found to have posted in a drug addiction recovery forum. Otherwise, this variable was set to 0. The reader may also note that "emotional distress" and "physical pain" were the only exogenous variables in the model, since no paths point to them; the rest of the variables were endogenous.
Next, in the SEM estimation step the difference between the population covariance (C), i.e., the covariance observed in the LIWC variables and the "recovery" variable for the population of 7025 drug users, and the hypothesized-model-predicted covariance (C(θ)) was minimized. For our dataset, the standard form of the structural equation (Eq. (2)) relating the endogenous and exogenous variables took the following form:

$$y_{(14 \times 1)} = B_{(14 \times 14)}\, y_{(14 \times 1)} + \Gamma_{(14 \times 2)}\, x_{(2 \times 1)} + \zeta_{(14 \times 1)} \qquad (4)$$

In Eq. (4), y (14 × 1) denotes the 14 endogenous variables (13 LIWC categories and 1 recovery variable), x (2 × 1) denotes the 2 exogenous variables ("emotional distress" and "physical pain"), and ζ (14 × 1) denotes the specification errors. The matrix B (14 × 14) denotes the effect of the endogenous variables on other endogenous variables, while the matrix Γ (14 × 2) denotes the coefficients of regression of the endogenous variables on the exogenous variables. The maximum likelihood function explained in Eq. (3) is used to fit the structural model equations by minimizing the fitting function (F_ML) and obtain the model shown graphically in Fig. 2.
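The hypothesized structure of the withdrawal management model can be written down compactly as a model description. The sketch below builds such a description in lavaan-style syntax, where `=~` reads "is measured by" and `~` denotes a regression. The column names and the helper `build_model_spec` are hypothetical, and the paper does not state which SEM software or syntax was used, so this is an assumption made purely for illustration.

```python
# Hypothetical column names for the 13 LIWC indicators described above.
MEASUREMENT = {
    "emotional_distress": [
        "negative_emotion", "sad", "anger", "anxiety", "feel",
        "affect", "swear", "sexual", "authentic",
    ],
    "physical_pain": ["biology", "death", "health", "body"],
}


def build_model_spec(measurement, outcome):
    """Return a lavaan-style description: one measurement line per
    latent variable, then the outcome regressed on all latents."""
    lines = [
        f"{latent} =~ {' + '.join(indicators)}"
        for latent, indicators in measurement.items()
    ]
    lines.append(f"{outcome} ~ {' + '.join(measurement)}")
    return "\n".join(lines)


SPEC = build_model_spec(MEASUREMENT, "recovery")
```

A description of this form, together with the 7025 × 14 data matrix, is what an SEM package would typically take as input for estimation.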

Model evaluation
In SEM, the model fit is evaluated by examining the difference between the sample covariance (C) and the covariance (C(θ)) computed using the model. The goal is to minimize the difference between C and C(θ). The simplest fitting function for SEM models is the Chi-square fit χ² = (N − 1)F_ML. However, this function is affected by sample size; large sample sizes may increase the χ² value even if the difference between C and C(θ) is small, and small sample sizes may lead to Type II errors [50]. The χ² function, however, is used as part of other fitting functions. Typically, these fitting functions are of three types: relative goodness-of-fit functions, parsimony functions, and functions that determine absolute (standalone) fit.
Examples of relative goodness-of-fit functions include the CFI (Eq. 5) and TLI (Eq. 6) measures. These measures compare the proposed model against a baseline model where all variables are allowed to have a variance, but none are allowed to co-vary. For both CFI and TLI, goodness of fit values above 0.90 denote high-quality agreement [55].
In Eqs. (5) and (6), the baseline model is indicated by the subscript B while the subscript I denotes the proposed model. The degree of freedom is denoted by d.
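The two relative fit indices can be computed from the χ² values and degrees of freedom of the proposed (I) and baseline (B) models. The sketch below uses definitions that are common in the SEM literature, CFI = 1 − max(χ²_I − d_I, 0) / max(χ²_I − d_I, χ²_B − d_B, 0) and TLI = (χ²_B/d_B − χ²_I/d_I) / (χ²_B/d_B − 1); whether these match the paper's Eqs. (5) and (6) term for term is an assumption, and the function names are hypothetical.

```python
def cfi(chi2_i, d_i, chi2_b, d_b):
    """Comparative fit index of the proposed model (I) against the
    baseline model (B), using a commonly cited definition."""
    numerator = max(chi2_i - d_i, 0.0)
    denominator = max(chi2_i - d_i, chi2_b - d_b, 0.0)
    if denominator == 0.0:
        return 1.0  # neither model shows misfit beyond its degrees of freedom
    return 1.0 - numerator / denominator


def tli(chi2_i, d_i, chi2_b, d_b):
    """Tucker-Lewis index, using a commonly cited definition."""
    ratio_b = chi2_b / d_b
    ratio_i = chi2_i / d_i
    return (ratio_b - ratio_i) / (ratio_b - 1.0)
```

For instance, with hypothetical values χ²_I = 150 on 75 degrees of freedom and χ²_B = 1591 on 91 degrees of freedom, the CFI is 1 − 75/1500 = 0.95, above the 0.90 threshold mentioned above.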
The RMSEA [see Eq. (7)] constitutes an example of a parsimony-based fitting measure. The RMSEA takes into account the complexity of the model by penalizing models with lower degrees of freedom, since such models lead to higher values of RMSEA. RMSEA values less than 0.01, 0.05, and 0.08 are respectively considered to indicate excellent, good, or mediocre fit [55].
In the above equation, n denotes the sample size. Finally, SRMR [see Eq. (8)] is an example of an absolute fit index. SRMR is the average of standardized residuals between the observed and the model computed covariance matrices. An advantage of using SRMR over CFI, TLI, and RMSEA is that it is independent of the sample size.
In the above equation C ii and C jj are the observed standard deviations and p is the number of observed variables. Usually, SRMR values of less than 0.08 are considered to denote models of adequate quality [55].
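The RMSEA and SRMR can be sketched in the same way. The definitions used below are commonly cited forms, RMSEA = sqrt(max(χ² − d, 0) / (d(n − 1))) and SRMR = sqrt(2 Σ_i Σ_{j≤i} [(s_ij − c_ij) / (s_i s_j)]² / (p(p + 1))), with s_i the observed standard deviation of variable i. Some texts use n rather than n − 1 in the RMSEA, so the exact agreement with the paper's Eqs. (7) and (8) is an assumption, as are the function names.

```python
import math


def rmsea(chi2, d, n):
    """Root mean square error of approximation for a model with
    chi-square `chi2`, `d` degrees of freedom, and sample size `n`."""
    return math.sqrt(max(chi2 - d, 0.0) / (d * (n - 1)))


def srmr(S, C_theta):
    """Standardized root mean square residual between the observed
    covariance matrix S and the model-implied matrix C_theta,
    both given as p x p nested lists."""
    p = len(S)
    total = 0.0
    for i in range(p):
        for j in range(i + 1):  # lower triangle including the diagonal
            sd_i = math.sqrt(S[i][i])
            sd_j = math.sqrt(S[j][j])
            total += ((S[i][j] - C_theta[i][j]) / (sd_i * sd_j)) ** 2
    return math.sqrt(2.0 * total / (p * (p + 1)))
```

Dividing by d inside the RMSEA is what produces the parsimony penalty: for the same excess χ² − d, a model with fewer degrees of freedom receives a larger value.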

Modeling withdrawal management and recovery
Withdrawal from drug addiction is accompanied by physical discomforts and negative emotions. Sedatives, opioids, and alcohol are known to cause intense physical discomforts during withdrawals, while withdrawal from substances such as marijuana and stimulants causes emotional negativity [56]. Physical symptoms during the process of withdrawal include muscle aches, runny nose, dilated pupils, piloerection, insomnia, sweating, yawning, shivering, pain, cramps, weight loss, toothache, colds, and sometimes even mortality [57][58][59]. Emotional distress and negativity during withdrawal are characterized by aggression, anxiety, and loss of temper [60][61][62]. The medical approach to managing withdrawal symptoms typically involves gradually tapering doses of drug agonists to diminish the bodily discomforts and prevent a relapse. However, there are no clear methods to measure and compare the intensity of either emotional distress or physical pain during withdrawal. In the following we describe the development of SEM models to determine the effect and importance of "emotional distress" and "physical pain" in withdrawal management using linguistic features determined using both LIWC and Empath.

Determining observed variables using LIWC
We used nine LIWC categories: "negative emotion", "sad", "anger", "anxiety", "feel", "affect", "swear", "sexual", and "authentic" to measure the latent variable "emotional distress". Examples of terms in each of the categories are presented in Table 18. The categories "negative emotion", "sad", "anger", and "anxiety" consisted of terms that had a negative connotation or valence and reflected negative thoughts. The category "feel" consisted of terms related to bodily sensations, while the category "affect" consisted of terms having both a negative and a positive connotation. We included the LIWC category "swear" as one of the indicators for "emotional distress" because we noticed that it was common for drug users to employ expletives to express their physical and emotional anguish. We also included the LIWC category "sexual" as one of our indicators for "emotional distress" for analogous reasons. "Authentic" was a summary variable and was calculated as a single value for a given text input. The algorithm in LIWC for determining the authenticity of a text was developed based on studies of deceptive and truthful communications [48,63]; it determines the openness, honesty, and disclosure of a given body of text. Consequently, there are no example terms for "authentic" in Table 18. To reflect the latent variable "physical pain", we used the following four LIWC categories: "biology", "death", "health", and "body". Example terms in each of these categories are presented in Table 18. The category "biology" contained terms related to human biology and biological activities. Terms representing death were present in the category "death" (bury, coffin, kill). The category "health" consisted of a number of terms related to medicine and the health of an individual. The category "body" consisted of terms related to body parts and bodily functions. Additional file 2: Table S1 contains the LIWC category values for a sample set of users engaged in substance use.
Finally, the (binary) variable "recovery" was the outcome variable of the model; it was set to 1 if an individual posted in a DAR subreddit else it was set to 0.

Determining observed variables using Empath
We used four Empath categories: "negative_emotion", "hate", "shame", and "suffering" to measure the latent variable "emotional distress". Examples of terms in each of the categories are presented in Table 18. The categories "negative_emotion", "hate", "shame", and "suffering" all consisted of terms that had a negative undertone and reflected negative feelings. To reflect the latent variable "physical pain", we used the following four Empath categories: "pain", "medical_emergency", "health", and "weakness" (see Table 18 for examples). The category "pain" contained terms related to physical discomfort. Terms representing a medical emergency were present in the category "medical_emergency". The category "health" consisted of a number of terms related to the health of an individual and the category "weakness" consisted of terms related to lack of strength of an individual. Again, the (binary) variable "recovery" was the outcome variable of the model; it was set to 1 if an individual posted in a DAR subreddit else it was set to 0.

The SEM model for recovery
Self-development efforts and relationships have been found to be indispensable for drug addiction recovery [65]. Family support, especially for adolescents in long term residential programs, has been shown to be necessary for successful recovery from addiction [66]. Studies have also shown that having strong social and family resources improves the chances of addiction recovery [67][68][69][70]. Self-development efforts encompassing activities that lead to mental and physical well-being, such as regular exercise, meditation, and yoga, have been observed to help heal the body and mind [71,72]. Such activities have also been shown to address psychological and physiological needs of a recovering addict by reducing negative feelings and preventing weight gain following abstinence. Additionally, regular exercise is known to alleviate physical and mental stress. It is also known to positively alter the brain chemistry as it releases endorphins and creates a natural high, similar to the one experienced when an individual uses drugs. Studies have shown that the addition of exercise as a lifestyle change leads to abstinence or reduction in drug use [31][32][33][34]. Meditation and yoga have also been shown to help individuals in their withdrawals and addiction by exerting a calming effect during their period of struggles [35][36][37]. Professional activities constitute another aspect of self-development. However, the literature on the importance of jobs and career on addiction recovery is ambiguous: some sources suggest that a stable job helps provide recovering addicts with income and health benefits, improved mental health, and a purpose in their life. For example, Flynn et al. [72] found job/career to be one of the fundamental personal motivations for a recovering addict to stay sober. The importance of vocational rehabilitation and job search as one of the services in the social model of recovery has also been noted [73].
Other works have found that employed individuals undergoing recovery are more engaged in recovery activities and are more likely to abstain from substance use [74][75][76][77]. However, studies have also found that returning to old jobs, or stress experienced at work, can lead to drug use and relapse [76]. Amongst these, Buczkowski et al. identified a smoking environment at work as one of the triggers for smoking relapse [77]. The stress associated with changing jobs has been cited to lead to substance use relapse [78][79][80][81][82]. Furthermore, the social stigma associated with drug addiction has been found to play a major role in the unwillingness of working individuals to opt for recovery interventions [83]. Finally, since employers are prejudiced against recovering addicts applying for jobs, such situations can also lead to a relapse or unwillingness to come out as an addict [83]. For the aforementioned reasons, self-development efforts and relationships play a pivotal role in withdrawal management and drug addiction recovery. We therefore construct a SEM model to determine the effect and importance of the latent variables "mental and physical well-being", "career", and "relationships" in drug addiction recovery. To estimate these latent variables, we utilized the forum activity of the drug users in multiple subreddits related to self-development efforts and relationships. We used the number of times an individual posted in the following subreddits: "fitness", "meditation", "yoga", "gainit", "bodyweightfitness", and "running" to estimate the latent variable "mental and physical well-being". Similarly, we used the posts in the subreddits: "jobs", "entrepreneur", "careerguidance", and "resumes" to estimate the latent variable "career". As indicator variables for "relationships" we used the posts in the four subreddits: "relationship_advice", "relationships", "parenting", and "childfree". Finally, our outcome variable for the model was "recovery".
The SEM model captures the effect of these variables on addiction recovery.
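The activity-based indicators described above are simple post counts per subreddit. A minimal sketch of how one user's row of indicators could be assembled is given below; the dictionary keys, the function name `indicator_row`, and the input format (one subreddit name per post) are hypothetical, and only the subreddits named in the text are included.

```python
from collections import Counter

# Subreddits named in the text, grouped by the latent variable they indicate.
INDICATOR_SUBREDDITS = {
    "mental_and_physical_well_being": [
        "fitness", "meditation", "yoga", "gainit",
        "bodyweightfitness", "running",
    ],
    "career": ["jobs", "entrepreneur", "careerguidance", "resumes"],
    "relationships": [
        "relationship_advice", "relationships", "parenting", "childfree",
    ],
}


def indicator_row(user_subreddits):
    """user_subreddits: list of subreddit names, one entry per post
    made by a single user. Returns the post count per indicator
    subreddit; posts made elsewhere are ignored."""
    counts = Counter(name.lower() for name in user_subreddits)
    return {
        sub: counts[sub]
        for subs in INDICATOR_SUBREDDITS.values()
        for sub in subs
    }
```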

Modeling addiction relapse
As described above, the variables "emotional distress", "physical pain", "relationships", and "self-development" were found to play a critical role in addiction recovery. In addition to these factors, we also found religion and geographic disparities to influence the process of recovery. These results are supported by previous work in the field of relapse where it was found that recovering individuals display higher levels of religious faith [84][85][86][87]. Similarly, researchers have observed that addicts living in a rural setting have a higher chance of relapse as compared to their urban counterparts [88][89][90][91] because of limited access to relapse prevention facilities and preventive medications. In the following, we describe models that study the effect of the aforementioned latent variables along with demographic setting for drug users who undergo relapse. We defined relapse as the event of an individual posting in an RDU subreddit after posting in a DAR subreddit. Individuals who never posted in an RDU subreddit after posting in a DAR subreddit were defined to be in (continued) recovery. Based on these definitions, 2363 individuals in our dataset were found to have relapsed, while 1355 users displayed continued recovery. To study users who relapsed while minimizing the impact of stray postings, we investigated only those users who had at least five posts in succession in a DAR subreddit before they were defined to have relapsed. Similarly, to study users who displayed signs of continued recovery, we investigated only those who had at least five posts in DAR subreddits before they stopped posting. As a consequence of this filtering, we ended up with a total of 174 users, of whom 108 were identified to have relapsed while 66 users were identified to have continued their recovery journey till our observations concluded.
Also, to extract relapse specific information, we scaled the values for LIWC and Empath categories by dividing them by the number of days between the post under investigation and the day when the user was defined to have relapsed.
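The labeling and scaling rules above can be sketched as follows. This is one possible reading of the definitions rather than the authors' code: posts outside the RDU and DAR sets are ignored when checking for posts "in succession", a user whose first RDU post after a DAR post follows fewer than five consecutive DAR posts is excluded, and a post made on or after the relapse day is rejected because the text does not say how a zero-day gap was handled. The function names are hypothetical.

```python
from datetime import date


def label_user(kinds, min_dar_run=5):
    """kinds: chronologically ordered labels, one per post, each
    "RDU", "DAR", or anything else (ignored).

    Returns "relapse", "recovery", or None if the user is excluded."""
    run = 0  # length of the current run of consecutive DAR posts
    for kind in kinds:
        if kind == "DAR":
            run += 1
        elif kind == "RDU":
            if run >= min_dar_run:
                return "relapse"
            if run > 0:
                return None  # relapse after too short a DAR run: excluded
            # run == 0: drug-use posts before any recovery posts
    return "recovery" if run >= min_dar_run else None


def scale_by_days_to_relapse(value, post_day, relapse_day):
    """Divide a category value by the number of days between the post
    and the day the user was defined to have relapsed."""
    days = (relapse_day - post_day).days
    if days <= 0:
        raise ValueError("post must precede the relapse day")
    return value / days
```

Under this scaling, posts made closer to the relapse day contribute larger values than posts with the same raw category value made long before it.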

Determining observed variables using LIWC for modeling relapse
While modeling users who relapsed we observed a limitation of using psycholinguistic dictionaries such as LIWC and Empath. Anti-social behavior and a lack of religious expression, physical exercise, and positive emotion increase the chances of a relapse. However, using these dictionaries we could only obtain a value for the presence of such categories, i.e., the absence of such psycholinguistic information was not represented via any appropriate categories. To overcome this weakness and to build a model for relapse using LIWC, we generated values for such (absent, in LIWC or Empath) variables by subtracting the numeric weight of the corresponding LIWC/Empath categories from 1. For example, if a post had a value of 0.2 for the category "friend", we calculated the value of "friend΄" (i.e. the negation of the category "friend") to be 0.8 (hereafter, such variables are referred to as negated variables and denoted by a prime). We used the negation of the following six LIWC categories: "friend", "we", "shehe", "you", "male", and "female" to represent and study the latent variable "anti-social". To model the lack of physical exercise and religious expression we used the negation of the LIWC categories "motion" and "religion". The (binary) variable "relapse" was the outcome variable in our model; it was set to 1 if an individual relapsed, else it was set to 0.
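The construction of negated variables amounts to a one-line transformation, sketched below. The sketch assumes that category values have already been scaled to lie between 0 and 1, as in the worked example in the text (0.2 becomes 0.8); the function name and the use of an ASCII apostrophe as the prime marker are hypothetical.

```python
def negate_categories(scores, categories):
    """Return {name + "'": 1 - value} for each chosen category.

    scores: mapping from category name to a value assumed to be in [0, 1].
    """
    negated = {}
    for name in categories:
        value = scores[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"value for {name!r} is outside [0, 1]")
        negated[name + "'"] = 1.0 - value
    return negated


# The six LIWC categories whose negations indicate "anti-social".
ANTI_SOCIAL = ["friend", "we", "shehe", "you", "male", "female"]
```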

Determining observed variables using Empath
We used Empath to model the relapse behavior as a consequence of lack of positive emotion, career interests, and urban facilities. Similar to obtaining the values of LIWC categories for modeling relapse, we used the negation of the following four Empath categories: "joy", "zest", "cheerfulness", and "positive emotion" to study the latent variable "positive emotion΄" (lack of positive emotion). To model "career΄" (i.e., the lack of career development), we used the negation of the following three Empath categories: "blue_collar_job", "white_collar_job", and "office". Finally, to model "urban΄" (i.e., the lack of an urban setting and facilities) we used the negation of the Empath category "urban". The (binary) variable "relapse" was the outcome variable in our model; it was set to 1 if an individual relapsed, else it was set to 0.

The SEM model for relapse of addiction
In this model we estimated the effect of factors including the social and physical activities of a drug user, their positive or negative emotions, recourse to religion, career-related activities, and location (urban or rural) on relapse by employing linguistic characteristics determined using LIWC and Empath. The relapse behavior was itself measured using the observed variable "relapse". The latent variable "anti-social" was estimated using six negated LIWC categories ("friend΄", "we΄", "shehe΄", "you΄", "male΄", and "female΄") and two observed negated variables "motion΄" and "religion΄". The Empath model estimated the latent negation variable "positive emotion΄" using four negated categories ("joy΄", "zest΄", "cheerfulness΄", and "positive emotion΄"). Similarly, the latent negated variable "career΄" was estimated using three negated categories ("blue_collar_job΄", "white_collar_job΄", and "office΄"). Finally, the variable "urban΄" corresponding to the location of the user was an observed variable in the model. The models obtained using the LIWC and Empath variables are described in the "Results" section.