Evaluation of clustering and topic modeling methods over health-related tweets and emails (2024)


Artif Intell Med. Author manuscript; available in PMC 2022 Jul 1.

PMCID: PMC9040385; NIHMSID: NIHMS1704563; PMID: 34127235

The publisher's final edited version of this article is available at Artif Intell Med.

Abstract

Background:

The Internet provides different tools for communicating with patients, such as social media (e.g., Twitter) and email platforms. These platforms provide new data sources that shed light on patient experiences with health care and improve our understanding of patient-provider communication. Several existing topic modeling and document clustering methods have been adapted to analyze these new free-text data automatically. However, both tweets and emails are often composed of short texts, and existing topic modeling and clustering approaches have suboptimal performance on these short texts. Moreover, research over health-related short texts using these methods has become difficult to reproduce and benchmark, partially due to the absence of a detailed comparison of state-of-the-art topic modeling and clustering methods on these short texts.

Methods:

We trained eight state-of-the-art topic modeling and clustering algorithms on short texts from two health-related datasets (tweets and emails): Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), LDA with Gibbs Sampling (GibbsLDA), Online LDA, Biterm Model (BTM), Online Twitter LDA, and Gibbs Sampling for Dirichlet Multinomial Mixture (GSDMM), as well as the k-means clustering algorithm with two different feature representations: TF-IDF and Doc2Vec. We used cluster validity indices to evaluate the performance of topic modeling and clustering: two internal indices (i.e., assessing the goodness of a clustering structure without external information) and five external indices (i.e., comparing the results of a cluster analysis to externally provided class labels).

Results:

Overall, for numbers of clusters (k) from 2 to 50, Online Twitter LDA and GSDMM achieved the best performance in terms of internal indices, while LSI and k-means with TF-IDF had the highest external indices. Also, for k = 2, most of the methods could respect the initial class distribution of the tweets (N = 286,971; HPV represents 94.6% of tweets and Lynch syndrome represents 5.4%). However, we found that model performance varies with the source of data and with hyper-parameters such as the number of topics and the number of iterations used to train the models. We also conducted an error analysis using the Hamming loss metric, for which the poorest value was obtained by GSDMM on both datasets.

Conclusions:

Researchers hoping to group or classify health-related short-text data need to select the most suitable topic modeling and clustering methods for their specific research questions. Therefore, we presented a comparison of the most commonly used topic modeling and clustering algorithms over two health-related, short-text datasets using both internal and external clustering validation indices. Internal indices suggested Online Twitter LDA and GSDMM as the best, while external indices suggested LSI and k-means with TF-IDF as the best. In summary, our work suggests that researchers can improve their analysis of model performance by using a variety of metrics, since there is not a single best metric.

Keywords: Topic modeling, Clustering, Internal cluster indices, External cluster indices, Natural language processing

1. Introduction

Numerous social networking and microblogging platforms have emerged in the last decade. Social networks such as Twitter enable users to interact with each other and share information on a wide range of topics. Twitter is one of the most popular social media platforms, covering all types of content, including health-related texts. Twitter enables users to write short messages, called “tweets”, composed of up to 280 characters (140 characters before September 2017). Tweets are often adopted to share opinions, feelings, thoughts, and personal activities. With over 500 million tweets posted each day, Twitter has become a very valuable data resource for real-world insights. In the healthcare domain, Twitter has also been adopted by users to share their personal health status and their experience with care and treatment options with other users with similar conditions/diseases and symptoms, as well as more broadly to share and seek health information of interest, attracting the attention of clinical and biomedical researchers with the ultimate goal of improving patients’ outcomes [130,40,154,139]. Various existing studies have demonstrated the use of Twitter as a low-cost data source for public health surveillance [138,108], such as for influenza vaccination [61], mental health [32,155], human papillomavirus (HPV) vaccination [153], tobacco [99,31], opioids [83], public mood [104], suicide [23], etc.

Furthermore, email is becoming popular in health care to establish and improve interactions between patients and healthcare professionals [36]. Emails allow patients to participate more actively in their health care, which can improve the quality and accessibility of health services [20,18]. Indeed, patient-physician email communication has been addressed in various studies [13], such as for detection of depression [137], rural family health practice [27], multiple sclerosis [51], disease prevention [126], coordination of healthcare appointments [18], and communication between healthcare professionals [106], among others. Several of these studies found positive effects of the use of emails, such as improvements in clinic efficiency and cost-effectiveness [39,48,27].

As a result, Twitter and emails have created a vast amount of short texts. Several natural language processing (NLP) methods, such as topic modeling and clustering, have been adopted to digest and assess these short texts, allowing us to infer patients’ interests, track new health-related stories, and identify emerging health topics. Clustering seeks to split documents into a certain number of groups based on a similarity metric. Topic modeling seeks to discover latent topics that describe the collection of documents, where a topic represents a group of words that frequently occur together. Numerous works have used classic clustering methods (e.g., k-means) on short texts such as tweets [120,86,90,148]. Diverse topic modeling methods have also been proposed to analyze short texts from different fields. Two of the most popular methods are latent Dirichlet allocation (LDA) [22] and latent semantic indexing (LSI) [56]. There exist various LDA-based techniques applied to text from various domains, including biomedicine [67]. Also, several recent approaches have adopted the Dirichlet Mixture Model for short text clustering [149,150,75]. Despite the abundance of NLP techniques available in the literature, there are several challenges when analyzing tweets [10]: significant noise and users’ inconsistent tweeting behaviours prevent researchers from leveraging the full potential of the information carried in tweets.

Moreover, health research using Twitter and emails is difficult to benchmark because of the lack of comparisons between the various existing applications. As of December 2020, we identified two recent studies that compared several topic modeling and clustering methods on several short text datasets. The first study [117] evaluated nine topic modeling methods based on DMM, global word co-occurrence, and self-aggregation. They found that simpler methods such as GSDMM [149] and BTM [147,30] were the most suitable with respect to effectiveness and efficiency. The second study [33] evaluated the performance of four classic clustering algorithms (with four different feature representations such as TF-IDF and Doc2Vec) and a topic modeling method (LDA). The experiments showed that the best performance was achieved by k-means with the Doc2Vec representation. However, there exist several gaps in these two studies: (1) [117] did not consider LSI or any LDA-based method, (2) [117] did not consider any classic clustering algorithm, (3) [33] considered only LDA for topic modeling, (4) both used small datasets (≤ 30K docs), (5) both used external validity indices only (i.e., comparing the results of a cluster analysis to externally provided class labels), and (6) both used a predefined number of topics for the evaluation, since each dataset was previously annotated.

In this paper, we seek to fill the gaps previously mentioned in order to discover how effectively several standard topic modeling and clustering methods perform on health-related tweets and emails. Therefore, we evaluate the performance of several state-of-the-art topic modeling and clustering algorithms (including those suggested in [117,33]) on short texts from two health-related datasets. The first dataset is composed of tweets (≤ 290K docs) and the second is composed of emails (50K docs). We consider individual tweets and emails as single documents, respectively. We include seven topic modeling approaches: LSI, LDA, GibbsLDA [142], Online LDA [57], BTM [147,30], Online Twitter LDA [76], and GSDMM; as well as the k-means clustering algorithm with two different feature representations: TF-IDF and Doc2Vec. We use cluster validity indices to evaluate the performance of topic modeling and clustering: two internal indices (i.e., assessing the goodness of a clustering structure without external information) and five external validity indices.

The remainder of the paper is organized as follows. We will review the literature in the “Related work” section. We will explain our approach in the “Methods” section. The results of the experiments and evaluations of the topic modeling and clustering applications will be presented in the “Experiments and results” section. We will discuss the obtained findings in the “Discussion” section. Finally, we will conclude the current work and present future directions in the “Conclusions” section.

2. Related work

In this section, we review related work on short text clustering, topic modeling, and validity indices.

2.1. Clustering of short texts

Clustering is an unsupervised machine learning method that seeks to partition objects into a certain number of clusters (i.e., groups or subsets) based on a similarity metric. Generally, clustering methods applied to text data are based on vector representations, such as bag-of-words (BoW) or term frequency-inverse document frequency (TF-IDF), and then group texts based on their similarity [85,89,90,21,84]. These techniques are frequently applied to several information retrieval tasks such as event detection [65,93,115,82] and text summarization [105,128,86,148]. There are several works using classic clustering methods on short texts such as tweets; for instance, a study [120] compared three well-known clustering algorithms, k-means, Singular Value Decomposition, and Affinity Propagation, on over 600 tweets and found that Affinity Propagation [46] had the best performance. However, its complexity is quadratic in the number of documents, so Affinity Propagation is not suitable for larger datasets. Other approaches focused on variations of classic clustering techniques considering several tweet components such as texts, hashtags, users, and temporal aspects (e.g., stream clustering) [129,86,75].

Short text clustering represents a big challenge due to data sparsity, since most words co-occur only once or twice in the dataset [10]. Several sparseness-resistant methods have been proposed to address this challenge, such as text augmentation [19,156], topic modeling [147,30], neural networks [145,52], and the Dirichlet Mixture Model [149,150,75]. Data augmentation methods seek to enrich the data representation with external resources, such as Wikipedia [19,146]; similar words obtained by exploiting related text documents [69,133,35,156]; or the incorporation of semantic features from ontologies, terminologies, and dictionaries, such as WordNet, DBpedia, and Freebase [59,45,26,141].

Moreover, recent approaches based on low-dimension representations with neural networks [145] proved to be effective at tackling the sparsity problem in short text clustering [135,38,47,52], for instance using word embeddings [96,110], sentence embeddings [77,72], and document embeddings [34]. Also, several studies explored sophisticated models for short text clustering. For instance, one work proposed a Dirichlet Multinomial Mixture model-based approach for short text clustering (GSDMM) [149], which also infers the number of clusters and obtained the best performance when compared with clustering algorithms such as k-means [92], Hierarchical Agglomerative Clustering (HAC) [94], and DMAFP [60].

2.2. Topic modeling

Topic modeling is also an unsupervised machine learning method, mainly based on statistical properties of the data, that discovers “topics” describing the collection of documents. Topic modeling methods seek to extract topics from a set of documents based on statistical techniques. Each topic is defined as a distribution over a set of words. Diverse topic modeling methods have been proposed to analyze texts from different fields including politics, medicine, and psychology. Two of the most popular methods are latent Dirichlet allocation (LDA) [22] and latent semantic indexing (LSI, a.k.a. LSA) [56]. A recent work has exhaustively listed LDA-based techniques proposed from 2003 to 2016 and applied to text from various domains, including biomedicine [67].

Some topic modeling algorithms were proposed to work on general health and medical text [70,71]. Moreover, other proposed methods specifically aimed to predict therapy outcomes from emails sent by patients under treatment for a social anxiety disorder [58]; predict protein-protein [17] and gene-drug [143] relations from the biomedical literature; discover concepts in patients’ health records [16]; detect depression [124,137]; recognize genuine suicide notes from notes written by healthy subjects [111]; classify patient issues from their experience and the result of using a particular drug [68]; improve the automatic classification of patient portal messages through the use of semantic features and word context [132]; identify patterns of events from medical reports of brain cancer patients [15]; model treatment behaviors [62,63] and treatment activities [28]; determine patient mortality [49]; discover models of disease and phenotypes [112]; extract biological terminology [97]; and discover biological processes [81]; among others. Topic modeling methods have also been applied over health-related tweets to identify latent health topics [107]. Hybrid approaches have also allowed extracting health trends in tweets by integrating visualization approaches with classical topic models [114,113]. Also, diverse studies addressed specific tasks such as grouping opinions about HPV vaccines, also leveraging community structure methods [134]; identifying common obesity-related themes through a combination of geographic information systems and topic modeling methods [50]; and identifying the associations of Zika-related topics, such as attitudes, knowledge, and behaviors [44].

Like clustering methods, topic modeling can also be used for clustering by giving a probability distribution over a number of topics for each document. Indeed, clustering and topic modeling methods have been used for clustering tasks and have been compared in different studies. For instance, the authors in [117] proposed three categories of topic modeling based on: (1) DMM, (2) global word co-occurrence, and (3) self-aggregation. They then compared nine different topic modeling techniques from these categories: (1) GSDMM [149], LF-DMM [102], GPU-DMM [80], GPU-PDMM [79]; (2) BTM [30], WNTM [158]; and (3) SATM [118], PTM [157]. They found that: (i) strategies that use word embeddings (LF-DMM, GPU-DMM, and GPU-PDMM) are very promising for short text topic modeling, (ii) the highest computation costs were obtained with LF-DMM and GPU-PDMM (i.e., they are not suitable for large datasets), and (iii) simpler methods (GSDMM and BTM) are the most suitable with respect to effectiveness and efficiency. For the clustering task, GSDMM achieved the best results. Another related work [116] described an approach with Gibbs sampling called PYPM. This model was tested on four short text datasets and compared with five well-known techniques (Non-negative Matrix Factorization [78], LDA, DMAFP, GSDMM, and FGSDMM [151]). Results showed that PYPM had the best results, followed by GSDMM.

Moreover, another recent study [33] evaluated the performance of four clustering algorithms (k-means, k-medoids, Hierarchical Agglomerative Clustering, and Non-negative Matrix Factorization) and a topic modeling method (LDA) on short texts from social networks such as Twitter and Reddit. The paper also evaluated four different feature representations, including TF-IDF, Word2Vec, Word2Vec weighted with the top 1,000 TF-IDF scores, and Doc2Vec. The experiments showed that the best performance was achieved by k-means with Doc2Vec on both datasets.

Of note, topic modeling methods can be evaluated from several aspects, such as cluster evaluation, topic coherence, and classification evaluation. To compare clustering and topic modeling methods, we need to apply cluster validity indices. For this purpose, after using topic modeling to compute topic probabilities, the maximum topic probability of each document is selected to get the cluster label of each document [24,88,144,147,116,117]. Then, cluster validity indices are applied to evaluate their performances.

2.3. Validity indices

Cluster validity indices are metrics to validate clustering results and to find natural structures for a given dataset [152,14]. In other words, validity indices seek to find optimal partitions that are well compacted and well separated from other partitions [54]. There are two kinds of cluster evaluation metrics, called external and internal validation indices [53]. External indices measure quality based on ground-truth labels, for instance Rand [119], Adjusted Rand Index [64], Fowlkes-Mallows, and Variation of Information [11]. Internal indices evaluate the result based on information intrinsic to the data alone. The latter are useful when no annotated dataset is available, and the usual approach focuses on evaluating the compactness and separation of clusters, as in Dunn [37] and Calinski-Harabasz [25]. Several studies on cluster validity indices concluded that there is not a single best metric [42,95,14]. They also found that the performance of cluster validity indices decreases considerably when there is noise or clusters overlap.

Recent works have compared topic modeling and clustering methods on short text clustering. Most of them used annotated datasets for the experiments; therefore, they mainly used external indices to measure the quality of clusters, such as Homogeneity (H) [122], Completeness (C) [122], V-Measure (V) [122], Adjusted Rand Index (ARI) [64], Normalized Mutual Information (NMI) [29], Adjusted Mutual Information (AMI), Accuracy (ACC), F-measure, Entropy, and Purity. For instance, the authors that introduced GSDMM [149] used five external indices, H, C, ARI, NMI, and AMI, to evaluate their model. Another study proposed an online semantic-enhanced Dirichlet model for short text clustering (OSDM) [75] and considered several indices such as NMI, V, H, and ACC to validate their results. In [52] the authors reported the evaluation of various text representations and self-training methods with ACC and NMI. A recent study [117] compared nine topic modeling techniques on clustering tasks using two external metrics: NMI and Purity. Moreover, a recent work evaluated one topic modeling method and four classic clustering methods [33] using three external indices: NMI, AMI, and ARI.

On the other hand, there are also several commonly used internal indices for assessing the goodness of short text clustering, such as Calinski-Harabasz (CH) [25], Silhouette Coefficient (SC) [123], Dunn [37], Duda [43], and Elbow [136], among others. The internal indices can also be used to determine the optimal number of clusters in the data [98]. Internal indices, in comparison to external ones, usually detect improvements in the clustering distribution, which has positive implications for system evaluation [66]. In our previous work [87], we evaluated topic modeling and clustering methods using only tweets and two internal indices (CH and SC). However, most of the previously cited studies that compared topic modeling and clustering methods did not use internal validity indices to evaluate their results.

Therefore, in this paper, we use seven validity indices: five external (NMI, ARI, H, C, and V) and two internal (CH and SC). We selected the five most common external indices used in the literature that are also independent of the absolute values of the labels, in contrast to F-measure, ACC, and others. Moreover, we included two internal indices: the CH index, which several works have shown to be effective [12], and SC, which is one of the most well-known measures and provides graphical representations of how well each element has been classified. We used the implementation of these metrics in sklearn [109] in the experimental study.

3. Methods

This section describes our study for evaluating state-of-the-art topic modeling and clustering methods to automatically extract relevant topics from health-related tweets and emails. Figure 1 outlines our approach with the basic steps for this evaluation. In this section we describe: (1) the two datasets used, tweets and emails; (2) the applications based on topic modeling and clustering algorithms; and (3) the validity indices used to assess the clusters defined by the algorithms we studied.

Figure 1. Workflow of our approach to compare state-of-the-art topic modeling and clustering methods over health-related tweets and emails.

3.1. Datasets

3.1.1. Tweets dataset

Our tweets dataset is an unbalanced collection composed of two subsets: human papillomavirus (HPV) and Lynch syndrome. HPV represents 94.6% of all tweets, while Lynch syndrome represents 5.4%. The extraction strategy considered keywords and hashtags containing common generic HPV and Lynch syndrome names and colloquial terms. Table 1 shows a description of our tweets collection. This dataset contains a total of 286,971 tweets of at most 140 characters. Table 2 shows a sample of eight tweets related to HPV and Lynch syndrome extracted from our tweets dataset. The average number of characters per tweet is 60.6 and the average number of tokens is 5.5. We applied several rules to preprocess the tweets collection: 1) lowercasing the text; 2) removing duplicated tweets; 3) removing stop-words; and 4) removing links from the tweets. A minimal sketch of these rules follows.
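
The sketch below applies the four preprocessing rules in order; the stop-word list, whitespace tokenizer, and URL pattern are illustrative stand-ins, since the study does not specify its exact tooling.

```python
# Minimal sketch of the four preprocessing rules; the stop-word list and
# URL pattern are assumptions, not the study's exact choices.
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "be", "may"}
URL_RE = re.compile(r"https?://\S+")

def preprocess(tweets):
    seen, cleaned = set(), []
    for tweet in tweets:
        text = URL_RE.sub("", tweet.lower())                       # rules 1 and 4
        tokens = [t for t in text.split() if t not in STOP_WORDS]  # rule 3
        key = " ".join(tokens)
        if key and key not in seen:                                # rule 2: drop duplicates
            seen.add(key)
            cleaned.append(tokens)
    return cleaned

print(preprocess(["Earlier HPV vaccination may be more beneficial http://t.co/x",
                  "earlier hpv vaccination may be more beneficial"]))
# -> [['earlier', 'hpv', 'vaccination', 'more', 'beneficial']]
```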

Table 1

Details of our health-related tweets dataset.

| Subset | HPV | Lynch syndrome |
|---|---|---|
| No. of tweets | 271,533 | 15,438 |
| No. of users | 99,227 | 4,492 |
| Collection period | Jan 2014 - Mar 2016 | Oct 2016 - Nov 2017 |
| No. of unique hashtags | 14,875 | 1,649 |
| No. of tweets with hashtag | 115,859 | 10,224 |
| No. of tokens before preprocessing | 1,767,920 | 147,144 |
| No. of tokens after preprocessing | 1,042,063 | 96,437 |
| Tokens per tweet after preprocessing (mean ± SD) | 9.55 ± 3.85 | 8.86 ± 3.01 |

Table 2

Examples of tweets with HPV and lynch syndrome as contents in ourtweets dataset.

| Dataset | Example of tweets |
|---|---|
| HPV | Continue to have incapacitating symptoms and remain unable to attend school or work |
| HPV | This is Cervical Health Awareness Month! Good news- HPV, the main cause of cervical cancer, is vaccine-preventable |
| HPV | You learn something new every day! Did you know that the cells of the cervix change with age in women? Immunity... |
| HPV | Earlier HPV vaccination may be more beneficial |
| Lynch | Research: Pain evaluation during gynaecological surveillance in women with Lynch syndrome |
| Lynch | #LynchSyndrome is an inherited condition which can predispose women to an increased risk of #endometrial #cancer #Lynchsyndr… |
| Lynch | Exact happened to my hubs– all symptoms kidney stone, no visual– get cultured urine test, test 4 LynchSyndro… |
| Lynch | FDA apprvd Keytruda #immunotherapy- 1st drug to treat cancer based on #tumorgenetics |
Annotation of tweets:

We only annotated tweets that contained hashtags; thus, a total of 126,083 tweets were labeled (115,859 HPV and 10,224 Lynch syndrome). The annotation was semi-automatically performed. First, we selected the most frequent hashtags. We then manually selected the most relevant hashtags and grouped them by their semantic similarity; for instance, “#vaccine”, “#vaccines”, and “#vax” represented a single group. We then manually selected the top 50 most frequent hashtag groups. Table 3 shows a sample of the top 10 groups of hashtags. Finally, we created a script to automatically annotate the tweets with the groups (labels) previously created (a sketch of this step follows Table 3). Of note, for our evaluation, the dataset was annotated into different numbers of topics: first into 2 topics, then 3, and so on up to 50 topics.

Table 3

Top 10 groups of the most frequent hashtags from our tweetsdataset.

| Top | Group of similar hashtags |
|---|---|
| 1 | #hpv, #hpvrelated, #hpvfacts, #humanpapillomavirus, #hpvassociated, #knowhpv |
| 2 | #lynch, #lynchsyndrome, #lynchsyndromeawareness, #lynchs |
| 3 | #vaccine, #vaccines, #vax, #vaxxed, #vaccination, #vaccinations |
| 4 | #cancer, #cancers |
| 5 | #cervical, #cervicalhealth, #cervicalcancer, #cervicalcancerawareness, #cervicalhealthawareness |
| 6 | #gardasil, #gardasil9, #gardasilvaccine |
| 7 | #health, #healthcare, #salud, #publichealth, #healthy |
| 8 | #learntherisk |
| 9 | #study |
| 10 | #cdc, #cdcwhistleblower |
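
As an illustration of the annotation script described above, the hypothetical sketch below assigns a tweet the label of the first hashtag group it matches; the two groups shown come from Table 3, and the first-match rule is our simplification of the actual script.

```python
# Hypothetical sketch of the hashtag-group annotation step; only two of
# the 50 groups are shown, and the first-match rule is our simplification.
HASHTAG_GROUPS = {
    "hpv": {"#hpv", "#hpvrelated", "#hpvfacts", "#humanpapillomavirus"},
    "vaccine": {"#vaccine", "#vaccines", "#vax", "#vaxxed"},
}

def annotate(tweet_tokens):
    hashtags = {t for t in tweet_tokens if t.startswith("#")}
    for label, group in HASHTAG_GROUPS.items():
        if hashtags & group:      # tweet contains a hashtag from this group
            return label
    return None                   # no matching hashtag: tweet left unlabeled

print(annotate(["earlier", "#hpv", "vaccination"]))  # -> 'hpv'
```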

3.1.2. Emails dataset

We accessed the 50,000 available emails sent by patients with prostate cancer to their health care providers. The emails are part of a clinical data warehouse at a tertiary academic care center, covering 2010 to 2019 [127]. Table 4 shows a description of the emails collection. We processed the emails to preserve the confidentiality, integrity, and availability of protected health information (PHI). Thus, we transformed the free-text emails into uniformly formatted emails such that: 1) all text was lowercased; 2) generic tokens replaced specific occurrences of dates, days, times, email addresses, and URLs; 3) named entities such as people, organizations, and locations were replaced with generic tokens. A sketch of these rules follows.
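
The sketch below applies the three de-identification rules, assuming spaCy's en_core_web_sm model for named-entity recognition; the study does not state which NER tool or token inventory was actually used.

```python
# Hypothetical de-identification sketch; spaCy and the tag set below are
# assumptions, since the study's exact tooling is not specified.
import re
import spacy

nlp = spacy.load("en_core_web_sm")

def deidentify(text):
    out = text
    # rules 2 and 3: replace entities, dates, and times with generic tokens;
    # iterating in reverse keeps character offsets valid while editing
    for ent in reversed(nlp(text).ents):
        if ent.label_ in {"PERSON", "ORG", "GPE", "DATE", "TIME"}:
            out = out[:ent.start_char] + f"<{ent.label_.lower()}>" + out[ent.end_char:]
    out = re.sub(r"https?://\S+", "<url>", out)       # rule 2: URLs
    out = re.sub(r"\S+@\S+\.\S+", "<email>", out)     # rule 2: email addresses
    return out.lower()                                # rule 1: lowercase

print(deidentify("Dr. Smith, my PSA results from January 3 worry me."))
```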

Table 4

Details of our emails dataset.

| No. of emails | 50,000 |
|---|---|
| No. of patients | 4,535 |
| Collection period | Jan 2010 - Jan 2019 |
| No. of unique tokens before preprocessing | 74,988 |
| No. of unique tokens after preprocessing | 42,868 |
| Tokens per email after preprocessing (mean ± SD) | 31.78 ± 32.72 |
Annotation of emails:

We annotated the vast collection of unstructured, free-text emails by labeling each document with topics, using a method similar to the one used for the tweets. After thoroughly reading several hundred emails, we defined 2, then 3, and up to 50 topics by grouping tokens with significant semantic information together (e.g., germ, infection) and then labeled the emails based on the most frequent token(s) corresponding to the topics that we defined. Table 5 lists 10 of the most frequently occurring topics. We removed 13 emails from the analysis because they did not contain enough meaningful tokens after processing to assign them one of our external labels.

Table 5

Top 10 groups of similar words of the most frequent topics inemails.

| Top | Groups of similar words |
|---|---|
| 1 | psa, test, results, scan, hba1c |
| 2 | surgery, prostate, urology |
| 3 | blood, pain, rash, nausea, symptoms, uti |
| 4 | prescription, mg, dose, ml, injection |
| 5 | obesity, nutritionist, weight |
| 6 | cancer, oncology, mass, masses |
| 7 | appointment, date, times |
| 8 | thank, thanks, dear, hello, hi, sincerely |
| 9 | germ, sick, infection |
| 10 | doctors, doctor, nurse |

3.2. Applications

We used several publicly available implementations of topic modeling and clustering methods. To cover every aspect, we briefly describe all of the systems used in our experiments.

3.2.1. Topic modeling

We set up seven well-known methods suited for short texts; a training sketch follows the list.

  • Latent Semantic Indexing (LSI): a well-known information retrieval algorithm [41,6]. LSI has been applied to a wide variety of learning tasks, such as search and retrieval, classification, and filtering. LSI applies singular value decomposition to the vector space spanned by the documents to describe latent semantics within the collection of documents.

  • Latent Dirichlet Allocation (LDA): a generative probabilistic model seeking to describe a set of observations as a mixture of distinct categories [22,5]. An observation is a document, which represents a mixture of topics. Each topic is represented as a mixture of distributions of words. With these distributions, one can compute the probability that a document belongs to a topic based on the words used in that document. LDA uses a Bayesian approach [22] to learn the distributions: the set of topics, word probabilities, etc.

  • LDA with Gibbs Sampling (GibbsLDA): Gibbs Sampling is another technique for parameter estimation and inference of the distributions defined in the LDA model [142,3]. It was designed to analyze hidden and latent topic structures of large-scale datasets, including large collections of documents from the Web. LDA with Gibbs Sampling has been shown to be comparable to k-means in terms of computational costs and execution time.

  • Online LDA: the online variational Bayes algorithm for LDA (Online LDA) is based on online stochastic optimization [57,7], which has been shown to find good parameter estimates much faster than batch algorithms on large datasets. Online LDA analyzes a massive number of documents without storing them in a dataset; each document can arrive in a stream and be discarded after one look.

  • Biterm (BTM): topics are learnt from short texts by directly modeling the generation of word co-occurrence patterns (i.e., biterms) in the text corpus [147,1]. Each latent topic is represented by a significance probability as well as a probability distribution over a vocabulary. Experimental results showed that BTM produces discriminative topic representations as well as more coherent topics for short texts.

  • Online Twitter LDA: tracks emerging events in microblogs [76,8]. In contrast to other LDA algorithms, this method employs a built-in update mechanism that uses time slices and creates a dynamic vocabulary. In every update, words that do not reach a frequency threshold are removed, and a new word is added when it reaches the threshold. This allows the model to study topic evolution and detect emerging topics over time. The input is discretized time slices and documents partitioned into these slices. Thus, the model is able to process the input and update itself periodically; generate comparable topics throughout different time slices, which allows measuring topic shift evolution; and ensure sensitivity to changes of topic over time. There are two main differences with Online LDA. First, convergence is handled by the introduction of a new parameter called the contribution factor, which reduces the influence of the previous model. Second, while Online LDA assumes a fixed vocabulary, Online Twitter LDA considers that it is not possible to compute the vocabulary ahead of time and, thus, constructs a dynamic vocabulary.

  • GSDMM: a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model (DMM) applied to short text [149,101]. DMM is a probabilistic generative model for documents and embodies two assumptions about the generative process [103]: (1) the documents are generated by a mixture model, and (2) there is a one-to-one correspondence between mixture components and clusters. Thus, GSDMM assumes each document belongs to a single topic, which is a suitable assumption for some short texts. Given an initial number of topics, this algorithm groups documents and extracts the topic structures that are present in the dataset. If the number of topics is set to a high value, the model is able to automatically learn the number of topics.
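
As a minimal sketch of how any of these topic models yields a hard clustering, the example below trains gensim's LdaModel on a toy corpus with the LDA hyper-parameters from Section 3.3.1 (α = 0.05, β = 0.01, the latter passed as eta in gensim) and labels each document with its most probable topic; the toy corpus is ours, not the study's data.

```python
# Sketch: train LDA and turn per-document topic distributions into
# cluster labels via argmax. The three toy documents are illustrative.
from gensim import corpora
from gensim.models import LdaModel

texts = [["hpv", "vaccine", "cervical", "cancer"],
         ["lynch", "syndrome", "genetic", "cancer"],
         ["hpv", "vaccination", "prevention"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus, id2word=dictionary, num_topics=2,
               alpha=0.05, eta=0.01, iterations=100, random_state=0)

# the most probable topic of each document becomes its cluster label
labels = [max(lda.get_document_topics(bow), key=lambda p: p[1])[0]
          for bow in corpus]
print(labels)  # e.g. [0, 1, 0]
```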

3.2.2. Clustering

In this study, we use one of the most well-known algorithms, k-means (with k-means++ initialization) [91,4], with two different dataset representations, TF-IDF and Doc2Vec; a sketch of both pipelines follows the list.

  • TF-IDF [125,9]: term frequency-inverse document frequency, a statistical measure intended to reflect how important a word is for a document in a collection of documents.

  • Doc2Vec: a simple extension of word2vec that learns embeddings of word sequences rather than words alone [77,2]. It has been shown to outperform similar embedding techniques in terms of accuracy and computational cost.
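
The sketch below shows both pipelines on a toy corpus: k-means over a TF-IDF matrix restricted to the 100 most frequent words (scikit-learn) and over 100-dimensional Doc2Vec vectors (gensim); the documents and any parameter values beyond those stated in Section 3.3.1 are illustrative assumptions.

```python
# Sketch of the two clustering pipelines: k-means++ over TF-IDF
# (100 most frequent words) and over 100-dimensional Doc2Vec embeddings.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = ["hpv vaccine cervical cancer", "lynch syndrome genetic cancer",
        "hpv vaccination prevention", "lynch syndrome screening"]

# TF-IDF restricted to the 100 most frequent terms
X_tfidf = TfidfVectorizer(max_features=100).fit_transform(docs)
labels_tfidf = KMeans(n_clusters=2, init="k-means++",
                      random_state=0, n_init=10).fit_predict(X_tfidf)

# 100-dimensional Doc2Vec document vectors
tagged = [TaggedDocument(d.split(), [i]) for i, d in enumerate(docs)]
d2v = Doc2Vec(tagged, vector_size=100, min_count=1, epochs=50, seed=0)
X_d2v = [d2v.dv[i] for i in range(len(docs))]
labels_d2v = KMeans(n_clusters=2, init="k-means++",
                    random_state=0, n_init=10).fit_predict(X_d2v)

print(labels_tfidf, labels_d2v)
```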

3.3. Analysis of applications

Note that to evaluate the performance of clustering and topic modeling methods we need to apply cluster validity indices. Thus, after executing topic modeling to compute topic probabilities, the highest probability of each document is selected to get the cluster label of each document [117]. Then, internal and external validity indices are applied to assess their performances. Therefore, we performed an evaluation of all topic modeling and clustering applications in terms of: 1) experiments and results over the tweets dataset, with a complete calculation of the two internal and five external indices, and 2) experiments and results over the emails dataset, also with a comparison of the internal and external indices. In the subsections below, we explain the configuration for our evaluation, as well as the seven indices used.

3.3.1. Configuration

For topic modeling, “k” (i.e., the number of topics) ranges from 2 to 50. In our work, topic modeling results are used to assign tweets and emails to a particular topic. Each tweet/email is represented by a feature vector, where each component of the vector is the probability that the tweet/email belongs to a given topic. For instance, k=2 implies the size of the feature vector is 2, while for k=50 it is 50. We then use the argmax function to determine the most prominent topic of each tweet.

The clustering algorithm, k-means, uses two document representations: TF-IDF and Doc2Vec. We tested various sizes of feature vectors (bag-of-words): the 100, 200, 500, and 1,000 most frequent words after preprocessing (stop-word removal, deletion of punctuation, and correction of misspelled words), considering only noun words. We determined that results were very similar across the four variations. Therefore, in this paper, we consider the 100 most frequent words as the number of features for TF-IDF and Doc2Vec, with “k” (i.e., the number of clusters) also ranging from 2 to 50.

Parameters of topic modeling are set as suggested in previous studies to get the optimal performance on short texts. For LDA, the hyper-parameters are set to α = 0.05 and β = 0.01 as suggested in [147]. For GibbsLDA [147], α = 0.05 and β = 0.01. For BTM [147], the parameter settings are α = 50/k and β = 0.01, where k represents the number of topics. Online LDA [57] uses α = 1.0/k, β = 0.01, and θ = 1. Online Twitter LDA [76] sets α = 0.001, β = 0.01, and c (contribution factor) = 0.5. GSDMM [149] is set with α = 0.1 and β = 0.1. Note that we make no distinction in the use of “k” when discussing the number of clusters and the number of topics.

We evaluated all topic modeling and clustering algorithms using 100, 500, and 1,000 iterations. The initial number of iterations is recommended in [55] and is a default value in the applications.

3.3.2. Internal indices

We used two internal measures: the Calinski-Harabasz index (CH) [25] and the Silhouette Coefficient (SC) [123]. The CH index has been shown in several works to be an effective measure for determining the most appropriate number of clusters [12]. On the other hand, SC is one of the most well-known measures and provides graphical representations of how well each element has been classified. Next, we explain the principles of the internal indices; a short computational sketch follows this subsection.

  • Calinski-Harabasz: also known as the Variance Ratio Criterion. A higher CH value indicates that the model has well-defined clusters. The $CH_k$ value is given by the ratio between the average inter-cluster dispersion matrix ($B_k$) and the intra-cluster dispersion matrix ($W_k$), as defined in Formula 1:

$$CH_k = \frac{B_k}{W_k} \times \frac{n - k}{k - 1} \tag{1}$$

where $n$ is the total number of points and $k$ the number of clusters. The $B_k$ value is based on the distance between clusters and is defined as:

$$B_k = \sum_{i=1}^{k} n_i \, \mathrm{dist}^2(c_i, c)$$

where $n_i$ is the number of elements of cluster $C_i$, $c_i$ is the center of $C_i$, and $c$ is the center of the complete dataset. $W_k$ is based on the distance within clusters and is defined as:

$$W_k = \sum_{i=1}^{k} \sum_{x \in C_i} \mathrm{dist}^2(c_i, x)$$

where $x$ is a point of cluster $C_i$. Note that to obtain well-separated and compact clusters, $B_k$ is maximized and $W_k$ minimized. Therefore, the maximum value of CH indicates a suitable partition for the dataset.

  • Silhouette Coefficient: describes the separation distance between clusters. A width is computed for each point, which depends on its membership cluster. The widths are then averaged over all observations for each k. The SC value has a range of [−1, 1], where −1 represents poor clustering quality or poorly defined clusters and 1 represents high clustering quality or well-defined clusters. The $SC_k$ value is defined in Formula 2:

$$SC_k = \frac{1}{n} \sum_{i=1}^{n} \frac{b_i - a_i}{\max(a_i, b_i)} \tag{2}$$

where $n$ represents the total number of elements, $a_i$ is the average distance between an element $i$ and all other elements within the same cluster, and $b_i$ represents the average distance between the element $i$ and all elements in the nearest cluster.

In summary, higher clustering quality of a particular algorithm tends to yield higher predictive performance on information retrieval tasks. For this reason, we seek to identify the algorithms that maximize the overall clustering quality (i.e., the internal indices).
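
A minimal sketch of both internal indices, using the sklearn implementations mentioned in Section 2.3; the synthetic two-blob data below simply stands in for the document feature vectors and predicted cluster labels.

```python
# Sketch: compute CH and SC for a toy clustering; higher values of both
# indicate more compact, better-separated clusters.
import numpy as np
from sklearn.metrics import calinski_harabasz_score, silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (50, 5)),    # cluster 0
               rng.normal(3.0, 0.1, (50, 5))])   # cluster 1
labels = np.array([0] * 50 + [1] * 50)

print("CH:", calinski_harabasz_score(X, labels))
print("SC:", silhouette_score(X, labels))
```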

3.3.3. External indices

Note that external evaluation measures can be applied when class labels for each data point in some evaluation set can be determined a priori. We used five well-known external measures for evaluation over the annotated datasets: Normalized Mutual Information (NMI) [131], Adjusted Rand Index (ARI) [64], V-measure (V) [122], Homogeneity (H) [122], and Completeness (C) [122]. A computational sketch follows the list below.

  • Normalized Mutual Information: a normalization of the Mutual Information score, which scales the result between 0 (no mutual information) and 1 (perfect correlation), as defined in Formula 3:

$$NMI(Y, C) = \frac{2 \times I(Y; C)}{H(Y) + H(C)} \tag{3}$$

where $Y$ represents the class values, $C$ the cluster labels, $H$ the entropy, and $I(Y; C)$ the Mutual Information between $Y$ and $C$, defined as:

$$I(Y; C) = H(Y) - H(Y \mid C)$$

where $H(Y)$ is the entropy of the class labels, and $H(Y \mid C)$ is the entropy of the class labels within each cluster.

  • Adjusted Rand Index: the Rand index (RI) computes a similarity measure between two clusterings by considering all pairs of samples and counting pairs that are assigned to the same or different clusters in the predicted and true clusterings. RI gives a value between 0 and 1, where 1 indicates that the two clusterings are identical. This measure can be seen as the percentage of correct decisions made by the algorithm and can be expressed as:

$$RI = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TP$ is the number of true positives, $TN$ is the number of true negatives, $FP$ is the number of false positives, and $FN$ is the number of false negatives. The Adjusted Rand Index rescales the RI, considering that random chance will cause some objects to occupy the same clusters. The ARI is calculated using Formula 4:

$$ARI = \frac{RI - \mathrm{Expected\_RI}}{\max(RI) - \mathrm{Expected\_RI}} \tag{4}$$

  • V-measure: an entropy-based measure which explicitly measures how successfully the criteria of homogeneity and completeness have been satisfied. V-measure is computed as the harmonic mean of the distinct homogeneity and completeness scores.

  • Homogeneity: a cluster has perfect homogeneity if all members of that cluster have the same external label. That is, the class distribution within each cluster contains only one class or, equivalently, has zero entropy. We determine how close a given clustering is to this ideal by examining the conditional entropy of the class distribution given the proposed clustering.

  • Completeness: similar to homogeneity, as it also describes how well the elements of the same external class are assigned to a single cluster. To evaluate completeness, we examine the distribution of cluster assignments within each class. Completeness is formally defined as the conditional entropy of the proposed cluster distribution given the external class label.
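
A minimal sketch of the five external indices, computed with the sklearn implementations cited in Section 2.3 against toy ground-truth and predicted labels:

```python
# Sketch: the five external indices on toy labels; each function compares
# a predicted cluster assignment against externally provided class labels.
from sklearn.metrics import (adjusted_rand_score, completeness_score,
                             homogeneity_score,
                             normalized_mutual_info_score, v_measure_score)

y_true = [0, 0, 0, 1, 1, 1]   # external class labels
y_pred = [0, 0, 1, 1, 1, 1]   # cluster labels produced by a model

print("NMI:", normalized_mutual_info_score(y_true, y_pred))
print("ARI:", adjusted_rand_score(y_true, y_pred))
print("V:", v_measure_score(y_true, y_pred))
print("H:", homogeneity_score(y_true, y_pred))
print("C:", completeness_score(y_true, y_pred))
```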

4. Experiments and results

The focus of this study is to compare the performance of the applications described above using internal and external indices over the tweets and emails datasets. The next sections present the results obtained for k = {2, 5, 10, 50}.

4.1. Results on the tweets dataset

4.1.1. Internal indices

We perform experiments using CH and SC to measure the performance of the topic modeling and clustering algorithms. Tables 6, 7, 8, and 9 show the CH and SC results for the seven topic modeling methods and the k-means algorithm with Doc2Vec and TF-IDF for 2, 5, 10, and 50 clusters/topics (“k”), respectively. Of note, CH and SC evaluate the clusters/topics based on two aspects: the similarity of tweets within the same cluster (cohesion), and the difference between the tweets of different clusters.

Table 6

Internal index results for k=2.

| Method | CH (100 it.) | SC (100 it.) | CH (500 it.) | SC (500 it.) | CH (1,000 it.) | SC (1,000 it.) |
|---|---|---|---|---|---|---|
| LSI | 86,260 | 0.41 | 86,260 | 0.40 | 86,260 | 0.40 |
| BTM | 634,412 | 0.74 | 661,242 | 0.75 | 604,629 | 0.72 |
| LDA | 515,737 | 0.68 | 486,522 | 0.68 | 521,198 | 0.69 |
| GibbsLDA | 10,767,060 | 0.97 | 10,514,730 | 0.98 | 9,932,722 | 0.97 |
| Online LDA | 849,068 | 0.77 | 834,351 | 0.76 | 938,428 | 0.78 |
| Online Twitter LDA | 50,110,500 | 0.99 | 53,291,260 | 0.99 | 50,730,260 | 0.99 |
| GSDMM | 16,232,540 | 0.97 | 18,030,810 | 0.97 | 17,961,280 | 0.97 |
| k-means+Doc2Vec | 31,196 | 0.20 | 31,196 | 0.20 | 31,196 | 0.20 |
| k-means+TF-IDF | 5,764 | 0.04 | 5,764 | 0.04 | 5,764 | 0.04 |

Table 7

Internal index results for k=5.

| Method | CH (100 it.) | SC (100 it.) | CH (500 it.) | SC (500 it.) | CH (1,000 it.) | SC (1,000 it.) |
|---|---|---|---|---|---|---|
| LSI | 50,641 | 0.40 | 51,961 | 0.40 | 51,961 | 0.40 |
| BTM | 165,515 | 0.60 | 171,041 | 0.60 | 175,937 | 0.61 |
| LDA | 69,526 | 0.37 | 71,240 | 0.38 | 71,741 | 0.38 |
| GibbsLDA | 967,683 | 0.91 | 1,016,640 | 0.93 | 1,010,773 | 0.92 |
| Online LDA | 173,255 | 0.61 | 170,659 | 0.60 | 185,005 | 0.62 |
| Online Twitter LDA | 6,554,339 | 0.98 | 10,117,270 | 0.98 | 10,989,530 | 0.99 |
| GSDMM | 2,208,731 | 0.91 | 2,660,268 | 0.92 | 2,302,045 | 0.91 |
| k-means+Doc2Vec | 13,998 | 0.04 | 13,998 | 0.04 | 13,998 | 0.04 |
| k-means+TF-IDF | 4,722 | 0.07 | 4,722 | 0.07 | 4,722 | 0.07 |

Table 8

Internal index results for k=10.

| Method | CH (100 it.) | SC (100 it.) | CH (500 it.) | SC (500 it.) | CH (1,000 it.) | SC (1,000 it.) |
|---|---|---|---|---|---|---|
| LSI | 25,346 | 0.34 | 25,539 | 0.34 | 25,532 | 0.35 |
| BTM | 55,110 | 0.41 | 61,790 | 0.43 | 67,856 | 0.53 |
| LDA | 24,498 | 0.27 | 24,102 | 0.27 | 24,860 | 0.27 |
| GibbsLDA | 239,457 | 0.83 | 266,065 | 0.85 | 272,035 | 0.86 |
| Online LDA | 70,824 | 0.54 | 69,418 | 0.55 | 69,892 | 0.55 |
| Online Twitter LDA | 1,400,045 | 0.96 | 1,925,903 | 0.97 | 2,035,547 | 0.97 |
| GSDMM | 878,294 | 0.88 | 957,856 | 0.89 | 955,849 | 0.89 |
| k-means+Doc2Vec | 7,617 | 0.02 | 7,617 | 0.02 | 7,617 | 0.02 |
| k-means+TF-IDF | 3,758 | 0.07 | 3,758 | 0.07 | 3,758 | 0.07 |

Table 9

Internal index results for k=50.

| Method | CH (100 it.) | SC (100 it.) | CH (500 it.) | SC (500 it.) | CH (1,000 it.) | SC (1,000 it.) |
|---|---|---|---|---|---|---|
| LSI | 3,894 | 0.21 | 3,925 | 0.22 | 3,907 | 0.19 |
| BTM | 9,501 | 0.35 | 10,089 | 0.37 | 10,507 | 0.37 |
| LDA | 3,006 | 0.14 | 2,801 | 0.12 | 2,960 | 0.14 |
| GibbsLDA | 18,188 | 0.65 | 21,256 | 0.68 | 22,014 | 0.69 |
| Online LDA | 9,835 | 0.44 | 10,322 | 0.45 | 10,299 | 0.46 |
| Online Twitter LDA | 39,014 | 0.79 | 62,749 | 0.85 | 66,051 | 0.87 |
| GSDMM | 162,280 | 0.88 | 173,979 | 0.88 | 169,109 | 0.89 |
| k-means+Doc2Vec | 2,028 | −0.02 | 2,028 | −0.02 | 2,028 | −0.02 |
| k-means+TF-IDF | 2,200 | 0.17 | 2,200 | 0.17 | 2,200 | 0.17 |

In addition, Figure 2 shows a general overview of the performance of the applications based on SC and CH for 100, 500, and 1,000 iterations, and for “k” ranging from 2 to 50. In all cases, the best values are obtained by Online Twitter LDA followed by GSDMM.

Figure 2. Silhouette Coefficient and Calinski-Harabasz metrics with 100, 500, and 1,000 iterations, for “k” ranging from 2 to 50.

4.1.2. External indices

We perform experiments using the five external indices (NMI, ARI, V, H, and C) to measure the performance of the topic modeling and clustering algorithms. Tables 10, 11, 12, and 13 depict the NMI, ARI, V, H, and C values obtained with all topic modeling and clustering methods for 2, 5, 10, and 50 clusters/topics (“k”), respectively. Of note, external indices measure the extent to which cluster labels match externally supplied class labels.

Table 10

External index results for k=2.

Columns give NMI, ARI, V, H, and C for 100, 500, and 1,000 iterations (left to right).

| Method | NMI | ARI | V | H | C | NMI | ARI | V | H | C | NMI | ARI | V | H | C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LSI | 0.05 | −0.02 | 0.09 | 0.16 | 0.07 | 0.05 | −0.02 | 0.09 | 0.16 | 0.07 | 0.05 | −0.02 | 0.09 | 0.159 | 0.07 |
| LDA | 0.00 | 0.02 | 0.01 | 0.01 | 0.01 | 0.05 | 0.07 | 0.09 | 0.16 | 0.07 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 |
| GibbsLDA | 0.02 | 0.01 | 0.03 | 0.05 | 0.02 | 0.02 | 0.01 | 0.04 | 0.08 | 0.03 | 0.02 | −0.00 | 0.04 | 0.07 | 0.03 |
| Online LDA | 0.03 | 0.06 | 0.07 | 0.12 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.06 | 0.09 | 0.12 | 0.21 | 0.09 |
| BTM | 0.11 | 0.23 | 0.24 | 0.38 | 0.17 | 0.11 | 0.23 | 0.23 | 0.37 | 0.17 | 0.09 | 0.14 | 0.18 | 0.30 | 0.13 |
| Online Twitter LDA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| GSDMM | 0.17 | 0.13 | 0.17 | 0.12 | 0.28 | 0.16 | 0.12 | 0.16 | 0.11 | 0.26 | 0.05 | −0.06 | 0.05 | 0.04 | 0.08 |
| k-means + Doc2Vec | 0.00 | −0.01 | 0.00 | 0.00 | 0.00 | 0.00 | −0.01 | 0.00 | 0.00 | 0.00 | 0.00 | −0.01 | 0.00 | 0.00 | 0.00 |
| k-means + TF-IDF | 0.04 | −0.04 | 0.09 | 0.14 | 0.06 | 0.04 | −0.04 | 0.09 | 0.14 | 0.06 | 0.04 | −0.04 | 0.09 | 0.14 | 0.06 |

Table 11

External index results for k=5.

Columns give NMI, ARI, V, H, and C for 100, 500, and 1,000 iterations (left to right).

| Method | NMI | ARI | V | H | C | NMI | ARI | V | H | C | NMI | ARI | V | H | C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LSI | 0.27 | 0.09 | 0.19 | 0.21 | 0.17 | 0.28 | 0.09 | 0.19 | 0.22 | 0.18 | 0.28 | 0.09 | 0.19 | 0.22 | 0.18 |
| LDA | 0.10 | 0.03 | 0.07 | 0.08 | 0.07 | 0.13 | 0.04 | 0.09 | 0.09 | 0.08 | 0.13 | 0.03 | 0.09 | 0.09 | 0.08 |
| GibbsLDA | 0.09 | 0.03 | 0.07 | 0.08 | 0.06 | 0.13 | 0.05 | 0.09 | 0.10 | 0.09 | 0.14 | 0.05 | 0.09 | 0.11 | 0.09 |
| Online LDA | 0.09 | 0.03 | 0.07 | 0.07 | 0.06 | 0.08 | 0.02 | 0.05 | 0.06 | 0.05 | 0.07 | 0.01 | 0.05 | 0.06 | 0.05 |
| BTM | 0.35 | 0.15 | 0.25 | 0.27 | 0.25 | 0.37 | 0.15 | 0.27 | 0.28 | 0.26 | 0.35 | 0.14 | 0.26 | 0.27 | 0.25 |
| Online Twitter LDA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.01 | 0.01 | 0.01 |
| GSDMM | 0.23 | 0.11 | 0.23 | 0.21 | 0.25 | 0.24 | 0.13 | 0.24 | 0.22 | 0.26 | 0.24 | 0.14 | 0.24 | 0.23 | 0.25 |
| k-means + Doc2Vec | 0.01 | −0.01 | 0.01 | 0.01 | 0.01 | 0.01 | −0.01 | 0.01 | 0.01 | 0.01 | 0.01 | −0.01 | 0.01 | 0.01 | 0.01 |
| k-means + TF-IDF | 0.47 | 0.29 | 0.36 | 0.36 | 0.35 | 0.47 | 0.29 | 0.36 | 0.36 | 0.35 | 0.47 | 0.29 | 0.36 | 0.36 | 0.35 |

Table 12

External index results for k=10.

Columns give NMI, ARI, V, H, and C for 100, 500, and 1,000 iterations (left to right).

| Method | NMI | ARI | V | H | C | NMI | ARI | V | H | C | NMI | ARI | V | H | C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LSI | 0.59 | 0.17 | 0.31 | 0.37 | 0.27 | 0.59 | 0.17 | 0.32 | 0.37 | 0.28 | 0.58 | 0.17 | 0.31 | 0.36 | 0.27 |
| LDA | 0.16 | 0.02 | 0.08 | 0.10 | 0.07 | 0.16 | 0.02 | 0.08 | 0.10 | 0.07 | 0.19 | 0.03 | 0.09 | 0.12 | 0.08 |
| GibbsLDA | 0.14 | 0.02 | 0.07 | 0.09 | 0.06 | 0.19 | 0.04 | 0.10 | 0.12 | 0.08 | 0.21 | 0.04 | 0.10 | 0.13 | 0.09 |
| Online LDA | 0.16 | 0.03 | 0.08 | 0.10 | 0.07 | 0.16 | 0.02 | 0.08 | 0.10 | 0.07 | 0.12 | 0.02 | 0.06 | 0.08 | 0.05 |
| BTM | 0.40 | 0.08 | 0.21 | 0.25 | 0.18 | 0.42 | 0.09 | 0.22 | 0.26 | 0.19 | 0.44 | 0.11 | 0.24 | 0.28 | 0.21 |
| Online Twitter LDA | 0.01 | 0.00 | 0.00 | 0.01 | 0.00 | 0.03 | 0.00 | 0.01 | 0.02 | 0.01 | 0.04 | 0.00 | 0.02 | 0.02 | 0.01 |
| GSDMM | 0.24 | 0.12 | 0.24 | 0.21 | 0.27 | 0.22 | 0.10 | 0.22 | 0.19 | 0.26 | 0.22 | 0.12 | 0.22 | 0.20 | 0.25 |
| k-means + Doc2Vec | 0.04 | 0.00 | 0.02 | 0.02 | 0.01 | 0.04 | 0.00 | 0.02 | 0.02 | 0.01 | 0.04 | 0.00 | 0.02 | 0.02 | 0.01 |
| k-means + TF-IDF | 0.53 | 0.18 | 0.29 | 0.33 | 0.26 | 0.53 | 0.18 | 0.29 | 0.33 | 0.26 | 0.53 | 0.18 | 0.29 | 0.33 | 0.26 |

Table 13

External index results for k=50.

Columns give NMI, ARI, V, H, and C for 100, 500, and 1,000 iterations (left to right).

| Method | NMI | ARI | V | H | C | NMI | ARI | V | H | C | NMI | ARI | V | H | C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LSI | 0.87 | 0.10 | 0.30 | 0.38 | 0.25 | 0.87 | 0.11 | 0.31 | 0.38 | 0.26 | 0.84 | 0.09 | 0.29 | 0.37 | 0.24 |
| LDA | 0.40 | 0.01 | 0.13 | 0.17 | 0.10 | 0.47 | 0.01 | 0.15 | 0.20 | 0.12 | 0.44 | 0.01 | 0.15 | 0.19 | 0.12 |
| GibbsLDA | 0.40 | 0.01 | 0.13 | 0.17 | 0.10 | 0.44 | 0.01 | 0.14 | 0.19 | 0.11 | 0.45 | 0.01 | 0.14 | 0.20 | 0.11 |
| Online LDA | 0.36 | 0.01 | 0.11 | 0.16 | 0.09 | 0.36 | 0.01 | 0.11 | 0.16 | 0.09 | 0.34 | 0.00 | 0.11 | 0.15 | 0.09 |
| BTM | 0.62 | 0.03 | 0.21 | 0.27 | 0.17 | 0.64 | 0.02 | 0.22 | 0.28 | 0.18 | 0.63 | 0.02 | 0.21 | 0.28 | 0.17 |
| Online Twitter LDA | 0.19 | 0.00 | 0.06 | 0.08 | 0.04 | 0.28 | 0.00 | 0.09 | 0.12 | 0.07 | 0.34 | 0.00 | 0.11 | 0.15 | 0.08 |
| GSDMM | 0.22 | 0.04 | 0.22 | 0.18 | 0.28 | 0.22 | 0.04 | 0.22 | 0.19 | 0.28 | 0.23 | 0.04 | 0.23 | 0.19 | 0.30 |
| k-means + Doc2Vec | 0.15 | 0.00 | 0.05 | 0.07 | 0.04 | 0.15 | 0.00 | 0.05 | 0.07 | 0.04 | 0.15 | 0.00 | 0.05 | 0.07 | 0.04 |
| k-means + TF-IDF | 0.69 | 0.04 | 0.23 | 0.30 | 0.18 | 0.69 | 0.04 | 0.23 | 0.30 | 0.18 | 0.69 | 0.04 | 0.23 | 0.30 | 0.18 |

We also plotted the values obtained with NMI, ARI, V, H, and C with 100 iterations only, as shown in Figure 3. In general, the best values are obtained by LSI followed by k-means with TF-IDF. Also, note that Online Twitter LDA significantly decreased its performance in comparison to the values obtained in the internal indices evaluation: it obtained the lowest performance, while LSI, BTM, and k-means with TF-IDF improved and are, in general, the three methods with the best performance.

Figure 3. Normalized Mutual Information, Adjusted Rand Index, V-measure, Homogeneity, and Completeness metrics with 100 iterations for “k” ranging from 2 to 50 over the tweets dataset.

4.2. Results on the emails dataset

4.2.1. Internal indices

We trained each model for integer values of k ranging from 2 to 50, for 100 iterations, and measured each model’s performance with SC and CH scores. Online Twitter LDA was initially the best performing model in terms of both CH and SC, with scores of 8.2 million and 0.94, respectively, but in contrast to the results of the previous experiments, we saw a greater decrease in the model’s performance, relative to LDA and Online LDA, as k increased. The GSDMM model had the most stable, high-level performance, with an SC that never dropped below 0.86, well above the next best performing model, LDA, which had an SC score of 0.65 (for k = 50). LDA and Online LDA achieved the second and third best SC scores once k exceeded 30. The LSI model had the worst performance, achieving negative SC values for almost all values of k.

The CH scores rapidly and substantially decreased with increasing values of k for all models. The rates of decrease were not uniform: Online Twitter LDA started out well above the rest at 8.2 million for k = 2 and dropped to 5,719 for k = 50, behind the LDA model, which had a CH score of 8,062. The GSDMM model became the best performing model once k exceeded 12 and was the only model for which all CH scores were greater than 10,000.

Table 14 lists the values of the SC and CH scores for the models with 2, 5, 10, and 50 clusters/topics, while Figure 4 illustrates the SC and CH values for all values of k.

Figure 4. Silhouette Coefficient and Calinski-Harabasz metrics over the emails dataset with 100 iterations, for “k” ranging from 2 to 50.

Table 14

Internal index results after 100 iterations.

| Method | CH (k=2) | SC (k=2) | CH (k=5) | SC (k=5) | CH (k=10) | SC (k=10) | CH (k=50) | SC (k=50) |
|---|---|---|---|---|---|---|---|---|
| LSI | 1,316 | 0.06 | 529 | −0.19 | 696 | −0.22 | 517 | −0.24 |
| BTM | 121,814 | 0.62 | 19,884 | 0.33 | 9,200 | 0.24 | 1,173 | 0.11 |
| LDA | 237,684 | 0.74 | 118,955 | 0.75 | 54,848 | 0.75 | 8,062 | 0.65 |
| GibbsLDA | 59 | 0.04 | 1,128 | −0.02 | 706 | −0.02 | 103 | −0.07 |
| Online LDA | 222,296 | 0.76 | 128,584 | 0.78 | 47,653 | 0.70 | 7,974 | 0.64 |
| Online Twitter LDA | 8,197,541 | 0.99 | 699,231 | 0.95 | 136,077 | 0.88 | 5,719 | 0.56 |
| GSDMM | 1,576,441 | 0.94 | 342,109 | 0.90 | 122,952 | 0.87 | 27,889 | 0.86 |
| k-means+Doc2Vec | 13 | 0.48 | 13 | 0.52 | 21 | 0.32 | 77 | 0.09 |
| k-means+TF-IDF | 556 | 0.01 | 401 | 0.01 | 260 | −0.01 | 88 | 0.01 |

4.2.2. External indices

We also analyzed the same email clustering/topic models with external validity indices. Table 15 displays the NMI, ARI, V, H, and C values for models with 2, 5, 10, and 50 clusters/topics. It is worth noting that all metrics of external validity range between 0 and 1. Figure 5 illustrates our findings for all values of k between 2 and 50. Notably, we see that k-means with the TF-IDF embedding and LSI are the best performing algorithms across all metrics, while Online Twitter LDA and k-means with the Doc2Vec embedding showed the worst performance consistently across all 5 metrics of external validity. Overall, we see that all models improve along measures of external validity with increasing values of k. This trend is rather gradual, starting at values 0.01–0.05 and increasing by one or two hundredths with each value of k, reaching values between 0.04 and 0.08. Two exceptions to this trend are the best models, LSI and k-means with TF-IDF, which start at 0.01–0.02 for k = 2, quickly increase to values of 0.12–0.20 for k = 10, and then gradually increase to reach values of 0.18–0.30. Another two exceptions are the worst models, Online Twitter LDA and k-means with Doc2Vec, which start at values less than or equal to 0.01 and never grow larger than 0.02 for any value of k.

Figure 5. Normalized Mutual Information, Adjusted Rand Index, V-measure, Homogeneity, and Completeness indices with 100 iterations for “k” ranging from 2 to 50 over the emails dataset.

Table 15

External index results after 100 iterations.

Columns give NMI, ARI, V, H, and C for k = 2 (left) and k = 5 (right).

| Method | NMI | ARI | V | H | C | NMI | ARI | V | H | C |
|---|---|---|---|---|---|---|---|---|---|---|
| LSI | 0.02 | 0.01 | 0.01 | 0.02 | 0.01 | 0.09 | 0.10 | 0.16 | 0.09 | 0.12 |
| LDA | 0.02 | 0.03 | 0.02 | 0.02 | 0.02 | 0.03 | 0.02 | 0.03 | 0.03 | 0.03 |
| GibbsLDA | 0.05 | 0.08 | 0.05 | 0.05 | 0.05 | 0.02 | 0.02 | 0.02 | 0.03 | 0.02 |
| Online LDA | 0.03 | 0.05 | 0.03 | 0.03 | 0.03 | 0.02 | 0.01 | 0.02 | 0.02 | 0.02 |
| BTM | 0.04 | 0.08 | 0.05 | 0.04 | 0.05 | 0.04 | 0.03 | 0.05 | 0.04 | 0.04 |
| Online Twitter LDA | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 | <0.01 | <0.01 | <0.01 | <0.01 | <0.01 |
| GSDMM | 0.03 | 0.06 | 0.03 | 0.03 | 0.03 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 |
| k-means + Doc2Vec | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | <0.01 | <0.01 | <0.01 | <0.01 | <0.01 |
| k-means + TF-IDF | <0.01 | <0.01 | <0.01 | <0.01 | <0.01 | 0.08 | 0.07 | 0.14 | 0.08 | 0.10 |

Columns give NMI, ARI, V, H, and C for k = 10 (left) and k = 50 (right).

| Method | NMI | ARI | V | H | C | NMI | ARI | V | H | C |
|---|---|---|---|---|---|---|---|---|---|---|
| LSI | 0.18 | 0.19 | 0.24 | 0.18 | 0.20 | 0.21 | 0.21 | 0.30 | 0.21 | 0.25 |
| LDA | 0.06 | 0.04 | 0.06 | 0.06 | 0.06 | 0.06 | 0.02 | 0.07 | 0.07 | 0.07 |
| GibbsLDA | 0.05 | 0.02 | 0.05 | 0.05 | 0.05 | 0.08 | 0.02 | 0.08 | 0.09 | 0.08 |
| Online LDA | 0.03 | 0.01 | 0.03 | 0.03 | 0.03 | 0.03 | 0.01 | 0.04 | 0.04 | 0.04 |
| BTM | 0.09 | 0.03 | 0.10 | 0.09 | 0.09 | 0.13 | 0.04 | 0.14 | 0.14 | 0.14 |
| Online Twitter LDA | <0.01 | <0.01 | <0.01 | <0.01 | <0.01 | 0.02 | <0.01 | 0.02 | 0.02 | 0.02 |
| GSDMM | 0.05 | 0.02 | 0.05 | 0.05 | 0.05 | 0.02 | 0.06 | 0.09 | 0.06 | 0.08 |
| k-means + Doc2Vec | 0.01 | <0.01 | 0.01 | 0.01 | 0.01 | 0.01 | <0.01 | 0.02 | 0.02 | 0.02 |
| k-means + TF-IDF | 0.13 | 0.12 | 0.18 | 0.13 | 0.15 | 0.21 | 0.08 | 0.22 | 0.21 | 0.22 |

5. Discussion

5.1. Tweets dataset

A popular area of study is supervised algorithms using unbalanced datasets. However, skewed distributions also affect the learning process in unsupervised methods, especially clustering methods [100] based on centroids [140,73]. Despite numerous proposed solutions, effectiveness is reduced when the groups have highly different sizes [74]; however, most of the models we used proved capable of creating one group of tweets bigger than the other, reflecting the unbalanced nature of the tweets dataset (94.6% and 5.4% of tweets related to HPV and Lynch syndrome, respectively).

Online Twitter LDA followed by GSDMM obtained the highest values of SC and CH, which indicates that those clusterings were more compact, more dense (within the cluster), and better separated than those of all other models. In general, topic models outperformed clustering methods in terms of CH and SC, which provides insights into the interconnected nature of medical communication. This problem is well suited to LDA and to methods proposed as improvements of LDA, such as Online LDA.

For the evaluation with the external indices, a subset of tweets with hashtags was used. The external indices showed that LSI followed by k-means with TF-IDF obtained the best results. Note that Online Twitter LDA significantly decreased its performance in comparison to the values obtained in the internal indices evaluation. We did not find an obvious relationship between the number of iterations and the performance of the different experimental configurations. In several cases, the performance is proportional to the number of iterations, although this is not a common pattern for all algorithms.

Several findings from the applications of topic modeling and clustering methods confirmed that many in society are using Twitter to share past and current experiences of a disease (HPV and Lynch syndrome in our case), symptoms, treatment information, side effects, emotions, research, among others, as depicted in Table 2 and Figure 6. Figure 6 shows the most important topics extracted from two clusters created with k-means, and Table 2 presents tweets extracted from these two clusters.

Figure 6. Most important topics extracted from two clusters created with k-means with TF-IDF.

5.2. Emails dataset

In contrast to the tweets, the emails in our dataset represent a smaller and more homogeneous domain of language. Each email was sent by a patient with prostate cancer (or their caregiver) to a health care provider. Careful consideration of the broader context of our modeling task can explain the findings of our experiments using emails, as well as shed light on the results of our experiments with the tweets data.

The questions and concerns that arise as patients undergo treatment for prostate cancer, from scheduling procedures to managing a sudden crisis, are rarely discrete issues. This poses a substantial problem for clustering algorithms that search for perfectly separated clusters. k-means (with either embedding) is such an algorithm, which helps explain why it did not find internally meaningful clusters for any value of k. The nature of the emails may also help explain why the LSI model, which searches for a fixed low-dimensional representation, generated internally inconsistent labels. In contrast, the LDA approach models a document as a mixture of topics and seems to naturally represent, for example, an email that is primarily about a family member introducing himself as the patient’s new primary caregiver while also mentioning several previously unreported health issues. We see in our results that the three best performing models, with respect to internal indices, were variations of LDA.

Despite the dramatic difference in model performance over tweets and emails with respect to internal indices, we observed very similar patterns in performance with respect to external indices. Notably, LSI and k-means with TF-IDF performed very well despite having mostly negative or near-zero SC scores, respectively. BTM was always among the top 3 or 4 models, while Online Twitter LDA had the best and worst measures of internal and external consistency, respectively, in both experimental designs.

Comparison to related studies:

Recent works have compared topic modeling and clustering methods on short-text clustering using the same external validity indices. In [117], GSDMM obtained the highest values and was thus suggested as one of the best methods by external indices (NMI and Purity), while in our work GSDMM was suggested to be one of the best by internal indices (SC and CH) and the best for tweets only by one external index (C). In [117], GSDMM was the best on 3 out of 6 datasets; its NMI values varied from 0.3 to 0.8. Also, for a given partition, GSDMM in several cases obtained the highest results in terms of NMI, while other external indices (e.g., Purity) favored another method.

In [33], k-means+Doc2Vec was suggested to be one of the best methods by external indices (NMI and ARI), while in our work it was suggested to be one of the worst, also by external indices (NMI, ARI, H, C, and V). In [33], k-means+Doc2Vec was one of the best on the 3 datasets used, with NMI and ARI values varying from 0.03 to 0.69 and from 0.03 to 0.71, respectively. In [33], k-means+TF-IDF achieved the worst results, while in our work it was one of the best when evaluating external validity indices.

Both studies used small datasets (≤ 30K documents). Both used a fixed number of topics for comparisons, since each dataset was already annotated. Neither study considered LSI (the best by external indices in our work) or Online Twitter LDA (the best by internal indices).

Note that several studies have shown that there is no unique metric to validate clustering results [42,95,14], and that the performance of metrics decreases notably with noise or overlapping clusters. Also, internal indices, in comparison to external ones, usually detect improvements in the clustering distribution, which has positive implications for system evaluation [66].

5.3. Error analysis

We also conducted an analysis of the types of error patterns found in short-text clustering tasks. For this, we used the Hamming loss metric, a loss function whose optimal value is zero (i.e., a smaller Hamming distance to the external classes indicates better performance) and whose upper bound is one. Hamming loss measures the fraction of wrong labels over the total number of labels. It is relevant for unbalanced classification tasks as well as for multi-label classification. Thus, we computed the Hamming loss on the tweets and emails datasets. Note that this function depends on the labels of each document (tweet/email); thus, the interpretation of the output is very similar to that of the external validity indices.
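A minimal sketch of this metric, assuming scikit-learn and hypothetical labels: hamming_loss returns the fraction of mismatched labels, so 0 is optimal and 1 is the upper bound, as described above.

    from sklearn.metrics import hamming_loss

    y_true = [0, 1, 2, 1, 0]  # hypothetical external classes
    y_pred = [0, 2, 2, 1, 1]  # hypothetical cluster assignments
    print(hamming_loss(y_true, y_pred))  # 2 wrong out of 5 -> 0.4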

For instance, Figure 7 illustrates the Hamming loss for emails for all models trained with 100 iterations. We found remarkably stable values of loss for all models for values of k > 15. The stable values of loss are consistent with the other external indices of validity: LSI and k-means with a TF-IDF representation were the first and second best performing models, and the variants of LDA along with GSDMM were among the poorest performing models.

Figure 7. Hamming loss for topic modeling and clustering methods trained for 100 iterations and k ranging from 2 to 50 over the emails dataset.

We then manually evaluated a group of tweets and emails to assess the assignment of clusters. Several factors caused errors when assigning emails or tweets to their respective clusters: (1) most of the tweets and emails contained misspellings that received a part-of-speech category of "noun" or "unknown" and were therefore considered in the clustering tasks; (2) tweets and emails contain terms created by patients, such as "onco" instead of "oncology", which also affected the grouping of texts; (3) many tweets contain hashtags composed of two or more words, for instance "#hpvvaccine"; (4) lack of context (e.g., semantic information); indeed, n-gram terms (n ≥ 2) as features provide more context for clustering than single-word terms, for instance "hpv vaccination" and "flu vaccination" rather than "vaccination" (see the sketch below); and (5) the subjectivity of deciding to which cluster a tweet/email belongs; the larger the number of clusters, the more subjective this task becomes.
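As an illustration of factor (4), here is a minimal sketch of n-gram TF-IDF feature extraction with scikit-learn (the two example texts are hypothetical); with ngram_range=(1, 2), "hpv vaccination" and "flu vaccination" survive as distinct features instead of collapsing into the single term "vaccination".

    from sklearn.feature_extraction.text import TfidfVectorizer

    texts = ["hpv vaccination side effects",   # hypothetical tweets
             "flu vaccination appointment"]
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
    X = vectorizer.fit_transform(texts)
    # Vocabulary now includes "hpv vaccination" and "flu vaccination".
    print(sorted(vectorizer.vocabulary_))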

Our study has known limitations. First, the annotation of both datasets was performed semi-automatically, which directly affected the values of the external indices compared to the internal indices. We selected the most frequent hashtags (tweets) and words (emails); we then manually selected the most relevant hashtags/words and grouped them by their semantic similarity; finally, we executed a script that automatically annotated the tweets/emails for each number of clusters (i.e., the dataset was labeled with two groups when k = 2, then with three groups when k = 3, and so on up to k = 50). Second, for the external validation of the tweets dataset, we only considered tweets containing hashtags, which potentially affected the different results between internal and external indices. A possible solution could be to also consider visualization methods that can intuitively reflect the validity of clustering results.
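The annotation step might look like the following hypothetical sketch; the hashtag groups, function name, and labels are illustrative assumptions, not the actual script used in the study.

    # Hand-curated hashtag groups for k = 2 (hypothetical).
    hashtag_groups_k2 = {0: {"#hpv", "#hpvvaccine"},       # HPV
                         1: {"#lynchsyndrome", "#lynch"}}  # Lynch syndrome

    def annotate(tweet_hashtags, groups):
        # Label a tweet by the first group sharing one of its hashtags.
        for label, tags in groups.items():
            if tweet_hashtags & tags:
                return label
        return None  # tweet left out of external validation for this k

    print(annotate({"#hpvvaccine", "#cancer"}, hashtag_groups_k2))  # -> 0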

6. Conclusions

In this paper, we conducted a detailed comparison of different topic modeling techniques and a document clustering method on short texts from two health-related datasets, the first composed of tweets and the second of emails. We set up LSI, LDA, GibbsLDA, Online LDA, BTM, Online Twitter LDA, GSDMM, and k-means based on TF-IDF and Doc2Vec document vectorizations. We evaluated our models with two internal indices and five external indices. The two internal indices were the Calinski-Harabasz index and the Silhouette Coefficient; Online Twitter LDA obtained the best results, which indicates it created more consistent clusters of topics for tweets and emails. The five external indices were Normalized Mutual Information, Adjusted Rand Index, V-measure, Homogeneity, and Completeness; these indices were evaluated using a ground-truth dataset, and methods based on term and document frequencies, such as LSI and k-means with TF-IDF, obtained the best performance. Overall, this comparison provides encouraging results towards the application of topic modeling and clustering over short health-related texts from tweets and emails.

As a rapidly growing number of machine learning methods for natural language processing become easier to implement for experts and novices alike, our study showed us that thoughtful analysis of language models along several dimensions is essential to determine whether one has arrived at a significant result. We observed notable variation in performance metrics attributable to sometimes subtle differences in model assumptions or computational methods alone. Moreover, we showed additional variation in performance when using data generated from a different process but within the same domain. For us, one or two cutoffs would not have given sufficient information to evaluate model performance. Our work suggests researchers can improve their analysis of model performance by using a variety of metrics.

We provide this benchmark over different datasets to help other researchers determine whether their topic modeling and clustering methods are well suited to investigate healthcare questions such as: what health topics are most often discussed in tweets and email threads, or what kinds of conversations are occurring between healthcare professionals and patients.

As future work, other methods could be considered for evaluation, such as recent methods based on data augmentation and deep neural networks. Given the error patterns found in the clustering process, we shall also further investigate how to better leverage the selection of more informative features. For example, we shall include n-gram terms (n ≥ 2) and adjectives as features for the methods. For the emails dataset, it would be interesting to consider the characteristics of the people writing the emails: whether the writer is the patient or the patient's family/caretaker, their age, and how topics vary with the course of the disease. Finally, given the limitations associated with the dataset annotation, we shall annotate a subset of tweets and emails and compute the inter-annotator agreement, in order to manually assess the validity indices as well as the error within the clustering process.

Acknowledgements

A portion of the research reported in this publication was supported by theNational Cancer Institute of the National Institutes of Health under Award NumberR01CA183962. The content is solely the responsibility of the authors and does notnecessarily represent the official views of the National Institutes of Health.

Abbreviations

ARI: Adjusted Rand Index
BOW: Bag-of-Words
BTM: Biterm Topic Model
C: Completeness
CH: Calinski-Harabasz
FN: False Negatives
FP: False Positives
H: Homogeneity
HPV: Human Papillomavirus
LDA: Latent Dirichlet Allocation
LSI: Latent Semantic Indexing
NLP: Natural Language Processing
NMI: Normalized Mutual Information
PHI: Protected Health Information
RI: Rand Index
SC: Silhouette Coefficient
TF-IDF: Term Frequency-Inverse Document Frequency
TN: True Negatives
TP: True Positives
V: V-measure

Footnotes

Conflict of interest

The authors declare that they have no conflict of interest.

1. A hashtag is a word or phrase preceded by a hash sign (#) used to identify messages on a specific topic.

2. Most of the implementations were taken from gensim version 3.8.3 [121], an open-source library for unsupervised topic modeling.

Publisher's Disclaimer: This is a PDF file of an uneditedmanuscript that has been accepted for publication. As a service to our customerswe are providing this early version of the manuscript. The manuscript willundergo copyediting, typesetting, and review of the resulting proof before it ispublished in its final form. Please note that during the production processerrors may be discovered which could affect the content, and all legaldisclaimers that apply to the journal pertain.

Contributor Information

Juan Antonio Lossio-Ventura, Stanford Center for Biomedical Informatics Research, StanfordUniversity, 1265 Welch Road, 94305-5479, Stanford, California, USA.

Sergio Gonzales, Stanford Center for Biomedical Informatics Research, StanfordUniversity, 1265 Welch Road, 94305-5479, Stanford, California, USA.

Juandiego Morzan, Universidad del Pacífico, Av. Salaverry 2020, JesúsMaría, 15072, Lima, Peru.

Hugo Alatrista-Salas, Universidad del Pacífico, Av. Salaverry 2020, JesúsMaría, 15072, Lima, Peru.

Tina Hernandez-Boussard, Stanford Center for Biomedical Informatics Research, StanfordUniversity, 1265 Welch Road, 94305-5479, Stanford, California, USA.

Jiang Bian, Health Outcomes & Biomedical Informatics, College of Medicine,University of Florida, 2004 Mowry Road, 32610, Gainesville, FL, USA.

References

1. Biterm. https://github.com/xiaohuiyan/BTM.[Online; accessed December 15,2019].

2. Doc2Vec. https://radimrehurek.com/gensim_3.8.3/models/doc2vec.html.[Online; accessed December 15,2019].

3. GibbsLDA. https://nlp.stanford.edu/static/software/tmt/tmt-0.4/.[Online; accessed December 15,2019].

5. LDA. https://radimrehurek.com/gensim_3.8.3/models/ldamodel.html.[Online; accessed December 15,2019].

6. LSI. https://radimrehurek.com/gensim_3.8.3/models/lsimodel.html.[Online; accessed December 15,2019].

7. Online LDA.https://radimrehurek.com/gensim_3.8.3/models/ldamulticore.html.[Online; accessed December 15,2019]. [Google Scholar]

8. Online Twitter LDA.https://github.com/jhlau/online_twitter_lda.[Online; accessed December 15,2019].

9. TF-IDF.https://radimrehurek.com/gensim_3.8.3/models/tfidfmodel.html.[Online; accessed December 15,2019]. [Google Scholar]

10. Aggarwal CC and Zhai C. A survey of text clusteringalgorithms. In Mining text data, pages77–128.Springer,2012. [Google Scholar]

11. Amigó E, Gonzalo J, Artiles J, and Verdejo F. A comparison of extrinsic clustering evaluationmetrics based on formal constraints. Inf. Retr.,12(4):461–486,Aug. 2009. [Google Scholar]

12. Anderson MJ. A new method for non-parametric multivariateanalysis of variance. Austral ecology,26(1):32–46,2001. [Google Scholar]

13. Antoun J. Electronic mail communication between physiciansand patients: a review of challenges and opportunities.Family practice,33(2):121–126,2016. [PubMed] [Google Scholar]

14. Arbelaitz O, Gurrutxaga I, Muguerza J, PéRez JM, and Perona I. An extensive comparative study of clustervalidity indices. Pattern Recognition,46(1):243–256,2013. [Google Scholar]

15. Arnold C and Speier W. A topic model of clinical reports. InProceedings of the 35th International ACM SIGIR Conference onResearch and Development in Information Retrieval, SIGIR‘12, page 1031–1032,New York, NY, USA, 2012.Association for Computing Machinery. [Google Scholar]

16. Arnold CW, El-Saden SM, Bui AA, and Taira R. Clinical case-based retrieval using latent topicanalysis. In AMIA annual symposium proceedings, volume 2010, page 26.American Medical Informatics Association,2010. [PMC free article] [PubMed] [Google Scholar]

17. Aso T and Eguchi K. Predicting protein-protein relationships fromliterature using latent topics. In Genome Informatics 2009: Genome Informatics Series Vol.23, pages 3–12.World Scientific,2009. [PubMed] [Google Scholar]

18. Atherton H, Sawmynaden P, Sheikh A, Majeed A, and Car J. Email for clinical communication betweenpatients/caregivers and healthcare professionals.Cochrane Database of Systematic Reviews,(11), 2012. [PubMed] [Google Scholar]

19. Banerjee S, Ramanathan K, and Gupta A. Clustering short texts using wikipedia. InProceedings of the 30th Annual International ACM SIGIR Conferenceon Research and Development in Information Retrieval, SIGIR‘07, page 787–788,New York, NY, USA, 2007.Association for Computing Machinery. [Google Scholar]

20. Bergmo TS, Kummervold PE, Gammon D, and Dahl LB. Electronic patient–provider communication:Will it offset office visits and telephone consultations in primarycare?International journal of medical informatics,74(9):705–710,2005. [PubMed] [Google Scholar]

21. Bicalho P, Pita M, Pedrosa G, Lacerda A, and Pappa GL. A general framework to expand short text fortopic modeling. Information Sciences,393:66–81,2017. [Google Scholar]

22. Blei DM, Ng AY, and Jordan MI. Latent dirichlet allocation.Journal of machine Learning research,3(Jan):993–1022,2003. [Google Scholar]

23. Braithwaite SR, Giraud-Carrier C, West J, Barnes MD, and Hanson CL. Validating machine learning algorithms fortwitter data against established measures of suicidality.JMIR mental health,3(2):e21,2016. [PMC free article] [PubMed] [Google Scholar]

24. Cai D, Mei Q, Han J, and Zhai C. Modeling hidden topics on document manifold. InProceedings of the 17th ACM Conference on Information andKnowledge Management, CIKM ‘08, page911–920, New York, NY,USA, 2008. Association for ComputingMachinery. [Google Scholar]

25. Caliński T and Harabasz J. A dendrite method for clusteranalysis. Communications in Statistics-theory and Methods,3(1):1–27,1974. [Google Scholar]

26. Cano AE, Varga A, Rowe M, Ciravegna F, and He Y. Harnessing linked knowledge sources for topic classification in social media. In Proceedings of the24th ACM Conference on Hypertext and Social Media, HT‘13, page 41–50,New York, NY, USA, 2013.Association for Computing Machinery. [Google Scholar]

27. Chang F, Paramsothy T, Roche M, and Gupta NS. Patient, staff, and clinician perspectives onimplementing electronic communications in an interdisciplinary rural familyhealth practice. Primary health care research & development,18(2):149–160,2017. [PubMed] [Google Scholar]

28. Chen JH, Goldstein MK, Asch SM, Mackey L, and Altman RB. Predicting inpatient clinical order patterns withprobabilistic topic models vs conventional order sets.Journal of the American Medical Informatics Association,24(3):472–480,2017. [PMC free article] [PubMed] [Google Scholar]

29. Chen W-Y, Song Y, Bai H, Lin C-J, and Chang EY. Parallel spectral clustering in distributedsystems. IEEE transactions on pattern analysis and machine intelligence,33(3):568–586,2010. [PubMed] [Google Scholar]

30. Cheng X, Yan X, Lan Y, and Guo J. Btm: Topic modeling over shorttexts. IEEE Transactions on Knowledge and Data Engineering,26(12):2928–2941,2014. [Google Scholar]

31. Chu K-H, Unger JB, Allem J-P, Pattarroyo M, Soto D, Cruz TB, Yang H, Jiang L, and Yang CC. Diffusion of messages from an electroniccigarette brand to potential users through twitter.PloS one,10(12):e0145387,2015. [PMC free article] [PubMed] [Google Scholar]

32. Coppersmith G, Harman C, and Dredze M. Measuring post traumatic stress disorder in twitter. In Proceedings of the AAAI Eighth InternationalConference on Weblogs and Social Media, ICWSM 2014,Ann Arbor, Michigan, USA, June 1–4,2014., 2014. [Google Scholar]

33. Curiskis SA, Drake B, Osborn TR, and Kennedy PJ. An evaluation of document clustering and topicmodelling in two online social networks: Twitter and reddit.Information Processing & Management,57(2):102034,2020. [Google Scholar]

34. Dai AM, Olah C, and Le QV. Document embedding with paragraphvectors. arXiv preprint arXiv:1507.07998,2015. [Google Scholar]

35. Dai Z, Sun A, and Liu X-Y. Crest: Cluster-based representation enrichment for short text classification. In Pacific-Asia Conference onKnowledge Discovery and Data Mining, pages256–267. Springer,2013. [Google Scholar]

36. Dash J, Haller DM, Sommer J, and Perron NJ. Use of email, cell phone and text message betweenpatients and primary-care physicians: cross-sectional study in afrench-speaking part of switzerland. BMC health services research,16(1):549,2016. [PMC free article] [PubMed] [Google Scholar]

37. Davies DL and Bouldin DW. A cluster separation measure.IEEE Trans. Pattern Anal. Mach. Intell.,1(2):224–227,Feb. 1979. [PubMed] [Google Scholar]

38. De Boom C, Van Canneyt S, Demeester T, and Dhoedt B. Representation learning for very short textsusing weighted word embedding aggregation. Pattern Recogn. Lett.,80(C):150–156,Sept. 2016. [Google Scholar]

39. de Jong CC, Ros WJ, and Schrijvers G. The effects on health behavior and healthoutcomes of internet-based asynchronous communication between healthproviders and patients with a chronic condition: a systematicreview. Journal of medical Internet research, 16(1):e19,2014. [PMC free article] [PubMed] [Google Scholar]

40. De Martino I, D’Apolito R, McLawhorn AS, Fehring KA, Sculco PK, and Gasparini G. Social media for patients: benefits anddrawbacks. Current reviews in musculoskeletal medicine,10(1):141–145,2017. [PMC free article] [PubMed] [Google Scholar]

41. Deerwester S, Dumais ST, Furnas GW, Landauer TK, and Harshman R. Indexing by latent semanticanalysis. J. of the American society for information science,41(6):391–407,1990. [Google Scholar]

42. Dimitriadou E, Dolničar S, and Weingessel A. An examination of indexes for determining thenumber of clusters in binary data sets.Psychometrika,67(1):137–159,2002. [Google Scholar]

43. Duda RO, Hart PE, et al. Pattern classification and scene analysis, volume3. Wiley; New York, 1973. [Google Scholar]

44. Farhadloo M, Winneg K, Chan M-PS, Jamieson KH, and Albarracin D. Associations of topics of discussion on twitterwith survey measures of attitudes, knowledge, and behaviors related to zika:probabilistic study in the united states. JMIR public health and surveillance,4(1):e16,2018. [PMC free article] [PubMed] [Google Scholar]

45. Fodeh S, Punch B, and Tan P-N. On ontology-driven document clustering using coresemantic features. Knowledge Information System,28(2):395–421,Aug. 2011. [Google Scholar]

46. Frey BJ and Dueck D. Clustering by passing messages between datapoints. science,315(5814):972–976,2007. [PubMed] [Google Scholar]

47. Ganguly D and Ghosh K. Contextual word embedding: A case study in clustering tweets about emergency situations. In CompanionProceedings of the The Web Conference 2018, WWW ‘18, page73–74, Republic and Canton ofGeneva, CHE, 2018. International World WideWeb Conferences Steering Committee. [Google Scholar]

48. Garrido T, Meng D, Wang JJ, Palen TE, and Kanter MH. Secure e-mailing between physicians and patients:transformational change in ambulatory care. The Journal of ambulatory care management,37(3):211,2014. [PMC free article] [PubMed] [Google Scholar]

49. Ghassemi M, Naumann T, Doshi-Velez F, Brimmer N, Joshi R, Rumshisky A, and Szolovits P. Unfolding physiological state: Mortality modelling in intensive care units. In Proceedings of the 20th ACMSIGKDD International Conference on Knowledge Discovery and Data Mining, KDD‘14, page 75–84,New York, NY, USA, 2014.Association for Computing Machinery. [PMC free article] [PubMed] [Google Scholar]

50. Ghosh D and Guha R. What are we ‘tweeting’aboutobesity? mapping tweets with topic modeling and geographic informationsystem. Cartography and geographic information science,40(2):90–102,2013. [PMC free article] [PubMed] [Google Scholar]

51. Haase R, Schultheiss T, Kempcke R, Thomas K, and Ziemssen T. Use and acceptance of electronic communication bypatients with multiple sclerosis: a multicenter questionnairestudy. Journal of medical Internet research, 14(5):e135,2012. [PMC free article] [PubMed] [Google Scholar]

52. Hadifar A, Sterckx L, Demeester T, and Develder C. A self-training approach for short text clustering. In Proceedings of the 4th Workshop onRepresentation Learning for NLP (RepL4NLP-2019), pages194–199,2019. [Google Scholar]

53. Halkidi M, Batistakis Y, and Vazirgiannis M. Cluster validity methods: Part i.SIGMOD Rec.,31(2):40–45,June2002. [Google Scholar]

54. Halkidi M and Vazirgiannis M. A density-based cluster validity approach usingmulti-representatives. Pattern Recognition Letters,29(6):773–786,2008. [Google Scholar]

55. Halko N, Martinsson P-G, and Tropp JA. Finding structure with randomness: Probabilisticalgorithms for constructing approximate matrixdecompositions. SIAM review,53(2):217–288,2011. [Google Scholar]

56. Hinton GE and Salakhutdinov RR. Reducing the dimensionality of data with neuralnetworks. science,313(5786):504–507,2006. [PubMed] [Google Scholar]

57. Hoffman M, Bach F, and Blei D. Online learning for latent dirichletallocation. advances in neural information processing systems,23:856–864,2010. [Google Scholar]

58. Hoogendoorn M, Berger T, Schulz A, Stolz T, and Szolovits P. Predicting social anxiety treatment outcome basedon therapeutic email conversations. IEEE journal of biomedical and health informatics,21(5):1449–1459,2016. [PMC free article] [PubMed] [Google Scholar]

59. Hu X, Sun N, Zhang C, and Chua T-S. Exploiting internal and external semantics for the clustering of short texts using world knowledge. InProceedings of the 18th ACM Conference on Information andKnowledge Management, CIKM ‘09, page919–928, New York, NY,USA, 2009. Association for ComputingMachinery. [Google Scholar]

60. Huang R, Yu G, Wang Z, Zhang J, and Shi L. Dirichlet process mixture model for documentclustering with feature partition. IEEE Transactions on knowledge and data engineering,25(8):1748–1759,2012. [Google Scholar]

61. Huang X, Smith MC, Jamison AM, Broniatowski DA, Dredze M, Quinn SC, Cai J, and Paul MJ. Can online self-reports assist in real-timeidentification of influenza vaccination uptake? a cross-sectional study ofinfluenza vaccine-related tweets in the usa,2013–2017. BMJ open,9(1):e024018,2019. [PMC free article] [PubMed] [Google Scholar]

62. Huang Z, Dong W, Duan H, and Li H. Similarity measure between patient traces forclinical pathway analysis: problem, method, andapplications. IEEE journal of biomedical and health informatics,18(1):4–14,2013. [PubMed] [Google Scholar]

63. Huang Z, Lu X, and Duan H. Latent treatment pattern discovery for clinicalprocesses. Journal of medical systems,37(2):9915,2013. [PubMed] [Google Scholar]

64. Hubert L and Arabie P. Comparing partitions.Journal of Classification,2:193–218,1985. [Google Scholar]

65. Ifrim G, Shi B, and Brigadir I. Event detection in twitter using aggressive filtering and hierarchical tweet clustering. In Second Workshop onSocial News on the Web (SNOW), Seoul,Korea, 8 April 2014. ACM,2014. [Google Scholar]

66. Ingaramo D, Pinto D, Rosso P, and Errecalde M. Evaluation of internal validity measures in short-text corpora. In International Conference on Intelligent TextProcessing and Computational Linguistics, pages555–567. Springer,2008. [Google Scholar]

67. Jelodar H, Wang Y, Yuan C, Feng X, Jiang X, Li Y, and Zhao L. Latent dirichlet allocation (lda) and topicmodeling: models, applications, a survey. Multimedia Tools and Applications,78(11):15169–15211,2019. [Google Scholar]

68. Jiang Y, Liao QV, Cheng Q, Berlin RB, and Schatz BR. Designing and evaluating a clustering system fororganizing and integrating patient drug outcomes in personal healthmessages. In AMIA Annual Symposium Proceedings, volume 2012, page 417.American Medical Informatics Association,2012. [PMC free article] [PubMed] [Google Scholar]

69. Jin O, Liu NN, Zhao K, Yu Y, and Yang Q. Transferring topical knowledge from auxiliary long texts for short text clustering. In Proceedings of the 20thACM International Conference on Information and Knowledge Management, CIKM‘11, page 775–784,New York, NY, USA, 2011.Association for Computing Machinery. [Google Scholar]

70. Karami A, Gangopadhyay A, Zhou B, and Karrazi H. Flatm: A fuzzy logic approach topic model for medical documents. In 2015 Annual Conference of the NorthAmerican Fuzzy Information Processing Society (NAFIPS) held jointly with2015 5th World Conference on Soft Computing (WConSC), pages1–6. IEEE,2015. [Google Scholar]

71. Karami A, Gangopadhyay A, Zhou B, and Kharrazi H. Fuzzy approach topic discovery in health andmedical corpora. International Journal of Fuzzy Systems,20(4):1334–1345,2018. [Google Scholar]

72. Kiros R, Zhu Y, Salakhutdinov R, Zemel RS, Torralba A, Urtasun R, and Fidler S. Skip-thought vectors. In Proceedingsof the 28th International Conference on Neural Information ProcessingSystems - Volume 2, NIPS’15, page3294–3302, Cambridge, MA,USA, 2015. MITPress. [Google Scholar]

73. Krawczyk B. Learning from imbalanced data: open challengesand future directions. Progress in Artificial Intelligence,5(4):221–232,2016. [Google Scholar]

74. Krawczyk B, Minku LL, Gama J, Stefanowski J, and Woźniak M. Ensemble learning for data stream analysis: Asurvey. Information Fusion,37:132–156,2017. [Google Scholar]

75. Kumar J, Shao J, Uddin S, and Ali W. An online semantic-enhanced Dirichlet model for short text stream clustering. In Proceedings of the 58thAnnual Meeting of the Association for Computational Linguistics,pages 766–776, Online,July 2020. Association for ComputationalLinguistics. [Google Scholar]

76. Lau JH, Collier N, and Baldwin T. On-line trend analysis with topic models: twitter trends detection topic model online. In Proceedings of the 24thInternational Conference on Computational Linguistics, COLING‘12, pages1519–1534,2012. [Google Scholar]

77. Le Q and Mikolov T. Distributed representations of sentences and documents. In Proceedings of the 31st InternationalConference on International Conference on Machine Learning - Volume 32,ICML’14, pageII–1188–II–1196.JMLR.org, 2014. [Google Scholar]

78. Lee DD and Seung HS. Algorithms for non-negative matrix factorization. In Proceedings of the 13th InternationalConference on Neural Information Processing Systems,NIPS’00, page535–541, Cambridge, MA,USA, 2000. MITPress. [Google Scholar]

79. Li C, Duan Y, Wang H, Zhang Z, Sun A, and Ma Z. Enhancing topic modeling for short texts withauxiliary word embeddings. ACM Trans. Inf. Syst., 36(2), Aug.2017. [Google Scholar]

80. Li C, Wang H, Zhang Z, Sun A, and Ma Z. Topic modeling for short texts with auxiliary word embeddings. In Proceedings of the 39th International ACMSIGIR Conference on Research and Development in Information Retrieval, SIGIR‘16, page 165–174,New York, NY, USA, 2016.Association for Computing Machinery. [Google Scholar]

81. Liu B, Liu L, Tsykin A, Goodall GJ, Green JE, Zhu M, Kim CH, and Li J. Identifying functional mirna–mrnaregulatory modules with correspondence latent dirichletallocation. Bioinformatics,26(24):3105–3111,2010. [PMC free article] [PubMed] [Google Scholar]

82. Lo SL, Chiong R, and Cornforth D. An unsupervised multilingual approach for onlinesocial media topic identification. Expert Systems with Applications,81:282–298,2017. [Google Scholar]

83. Lossio-Ventura JA and Bian J. An inside look at the opioid crisis over twitter. In 2018 IEEE International Conference onBioinformatics and Biomedicine (BIBM), pages1496–1499. IEEE,2018. [Google Scholar]

84. Lossio-Ventura JA, Bian J, Jonquet C, Roche M, and Teisseire M. A novel framework for biomedical entity senseinduction. Journal of biomedical informatics,84:31–41,2018. [PMC free article] [PubMed] [Google Scholar]

85. Lossio Ventura JA, Hacid H, Ansiaux A, and Maag ML. Conversations reconstruction in the social web.In Proceedings of the 21st International Conference on World WideWeb, WWW ‘12 Companion, pages573–574, New York, NY,USA, 2012. ACM. [Google Scholar]

86. Lossio-Ventura JA, Hacid H, Roche M, and Poncelet P. Communication overload management through social interactions clustering. In Proceedings of the 31stAnnual ACM Symposium on Applied Computing, SAC ‘16, page1166–1169, New York, NY,USA, 2016. Association for ComputingMachinery. [Google Scholar]

87. Lossio-Ventura JA, Morzan J, Alatrista-Salas H, Hernandez-Boussard T, and Bian J. Clustering and topic modeling over tweets: A comparison over a health dataset. In 2019 IEEE InternationalConference on Bioinformatics and Biomedicine, BIBM’19.IEEE Computer Society, 2019 (inpress). [PMC free article] [PubMed] [Google Scholar]

88. Lu Y, Mei Q, and Zhai C. Investigating task performance of probabilistictopic models: An empirical study of plsa and lda.Information Retrieval,14(2):178–203,Apr. 2011. [Google Scholar]

89. Lu Y, Zhang P, Liu J, Li J, and Deng S. Health-related hot topic detection in onlinecommunities using text clustering. Plos one, 8(2):e56221,2013. [PMC free article] [PubMed] [Google Scholar]

90. Ma L, Wang Z, and Zhang Y. Extracting depression symptoms from social networks and web blogs via text mining. In International Symposium onBioinformatics Research and Applications, pages325–330. Springer,2017. [Google Scholar]

91. MacQueen Jet al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeleysymposium on mathematical statistics and probability, volume1, pages 281–297.Oakland, CA, USA, 1967. [Google Scholar]

92. MacQueen JB. Some methods for classification and analysis ofmultivariate observations. In Cam LML and Neyman J, editors, Proceedings of the fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1,pages 281–297. Universityof California Press, 1967. [Google Scholar]

93. Manaskasemsak B, Chinthanet B, and Rungsawang A. Graph clustering-based emerging event detection from twitter data stream. In Proceedings of the FifthInternational Conference on Network, Communication and Computing, ICNCC‘16, page 37–41,New York, NY, USA, 2016.Association for Computing Machinery. [Google Scholar]

94. Manning CD, Schütze H, and Raghavan P. Introduction to information retrieval.Cambridge university press,2008. [Google Scholar]

95. Maulik U and Bandyopadhyay S. Performance evaluation of some clusteringalgorithms and validity indices. IEEE Transactions on pattern analysis and machine intelligence,24(12):1650–1654,2002. [Google Scholar]

96. Mikolov T, Sutskever I, Chen K, Corrado G, and Dean J. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26thInternational Conference on Neural Information Processing Systems - Volume2, NIPS’13, page3111–3119, Red Hook, NY,USA, 2013. Curran AssociatesInc. [Google Scholar]

97. Millar JR, Peterson GL, and Mendenhall MJ. Document clustering and visualization with latent dirichlet allocation and self-organizing maps. InTwenty-Second International FLAIRS Conference,2009. [Google Scholar]

98. Milligan GW and Cooper MC. An examination of procedures for determining thenumber of clusters in a data set.Psychometrika,50(2):159–179,1985. [Google Scholar]

99. Myslín M, Zhu S-H, Chapman W, and Conway M. Using twitter to examine smoking behavior andperceptions of emerging tobacco products. Journal of medical Internet research,15(8):e174,2013. [PMC free article] [PubMed] [Google Scholar]

100. Nguwi Y-Y and Cho S-Y. An unsupervised self-organizing learning withsupport vector ranking for imbalanced datasets.Expert Systems with Applications,37(12):8303–8312,2010. [Google Scholar]

101. Nguyen DQ. jLDADMM: A Java package for the LDA and DMM topicmodels. arXiv preprint arXiv:1808.03835,2018. [Google Scholar]

102. Nguyen DQ, Billingsley R, Du L, and Johnson M. Improving topic models with latent feature wordrepresentations. Transactions of the Association for Computational Linguistics,3:299–313,2015. [Google Scholar]

103. Nigam K, McCallum AK, Thrun S, and Mitchell T. Text classification from labeled and unlabeleddocuments using em. Machine learning,39(2–3):103–134,2000. [Google Scholar]

104. Ofoghi B, Mann M, and Verspoor K. Towards early discovery of salient health threats: A social media emotion classification technique. InBiocomputing 2016: Proceedings of the PacificSymposium, pages 504–515.World Scientific, 2016. [PubMed] [Google Scholar]

105. Olariu A. Hierarchical clustering in improving microblog stream summarization. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 424–435. Springer, 2013. [Google Scholar]

106. Pappas Y, Atherton H, Sawmynaden P, and Car J. Email for clinical communication betweenhealthcare professionals. Cochrane Database of Systematic Reviews, (9),2012. [PubMed] [Google Scholar]

107. Paul MJ and Dredze M. Discovering health topics in social media usingtopic models. PloS one,9(8), 2014. [PMC free article] [PubMed] [Google Scholar]

108. Paul MJ and Dredze M. Social monitoring for publichealth. Synthesis Lectures on Information Concepts, Retrieval, and Services,9(5):1–183,2017. [Google Scholar]

109. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, and duch*esnay E. Scikit-learn: Machine learning inpython. Journal of Machine Learning Research,12:2825–2830,Nov. 2011. [Google Scholar]

110. Pennington J, Socher R, and Manning CD. Glove: Global vectors for word representation.In Proceedings of the 2014 conference on empirical methods in naturallanguage processing (EMNLP), pages1532–1543,2014. [Google Scholar]

111. Pestian J, Nasrallah H, Matykiewicz P, Bennett A, and Leenaars A. Suicide note classification using naturallanguage processing: A content analysis. Biomedical informatics insights,3:BII–S4706,2010. [PMC free article] [PubMed] [Google Scholar]

112. Pivovarov R, Perotte AJ, Grave E, Angiolillo J, Wiggins CH, and Elhadad N. Learning probabilistic phenotypes fromheterogeneous ehr data. Journal of biomedical informatics,58:156–165,2015. [PMC free article] [PubMed] [Google Scholar]

113. Prasad KR, Mohammed M, and Noorullah R. Visual topic models for healthcare dataclustering. Evolutionary Intelligence,pages 1–18,2019. [Google Scholar]

114. Prasad KR, Mohammed M, and Noorullah RM. Hybrid topic cluster models for social healthcaredata. International Journal of Advanced Computer Science and Applications, 10(11),2019. [Google Scholar]

115. Preoţiuc-Pietro D, Srijith PK, Hepple M, and Cohn T. Studying the temporal dynamics of word co-occurrences: An application to event detection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4380–4387, 2016. [Google Scholar]

116. Qiang J, Li Y, Yuan Y, and Wu X. Short text clustering based on pitman-yor processmixture model. Applied Intelligence,48(7):1802–1812,2018. [Google Scholar]

117. Qiang J, Qian Z, Li Y, Yuan Y, and Wu X. Short text topic modeling techniques,applications, and performance: a survey. IEEE Transactions on Knowledge and Data Engineering,2020. [Google Scholar]

118. Quan X, Kit C, Ge Y, and Pan SJ. Short and sparse text topic modeling via self-aggregation. In Proceedings of the 24thInternational Conference on Artificial Intelligence,IJCAI’15, page2270–2276. AAAI Press,2015. [Google Scholar]

119. Rand WM. Objective criteria for the evaluation ofclustering methods. Journal of the American Statistical association,66(336):846–850,1971. [Google Scholar]

120. Rangrej A, Kulkarni S, and Tendulkar AV. Comparative study of clustering techniques for short text documents. In Proceedings of the 20th InternationalConference Companion on World Wide Web, WWW ‘11, page111–112, New York, NY,USA, 2011. Association for ComputingMachinery. [Google Scholar]

121. Rehurek R and Sojka P. Software framework for topic modelling with large corpora. In In Proceedings of the LREC 2010 workshop onnew challenges for NLP frameworks. Citeseer,2010. [Google Scholar]

122. Rosenberg A and Hirschberg J. V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 JointConference on Empirical Methods in Natural Language Processing andComputational Natural Language Learning (EMNLP-CoNLL), pages410–420, Prague, CzechRepublic, June 2007. Associationfor Computational Linguistics. [Google Scholar]

123. Rousseeuw PJ. Silhouettes: a graphical aid to theinterpretation and validation of cluster analysis.Journal of computational and applied mathematics,20:53–65,1987. [Google Scholar]

124. Rude S, Gortner E-M, and Pennebaker J. Language use of depressed anddepression-vulnerable college students. Cognition & Emotion,18(8):1121–1133,2004. [Google Scholar]

125. Salton G and Buckley C. Term-weighting approaches in automatic textretrieval. Information processing & management,24(5):513–523,1988. [Google Scholar]

126. Sawmynaden P, Atherton H, Majeed A, and Car J. Email for the provision of information on diseaseprevention and health promotion. Cochrane Database of Systematic Reviews, (11),2012. [PubMed] [Google Scholar]

127. Seneviratne MG, Seto T, Blayney DW, Brooks JD, and Hernandez-Boussard T. Architecture and implementation of a clinicalresearch data warehouse for prostate cancer.eGEMs, 6(1),2018. [PMC free article] [PubMed] [Google Scholar]

128. Shou L, Wang Z, Chen K, and Chen G. Sumblr: continuous summarization of evolving tweet streams. In Proceedings of the 36th international ACMSIGIR conference on Research and development in informationretrieval, pages 533–542,2013. [Google Scholar]

129. Shou L, Wang Z, Chen K, and Chen G. Sumblr: Continuous summarization of evolving tweet streams. In Proceedings of the 36th International ACMSIGIR Conference on Research and Development in Information Retrieval, SIGIR‘13, page 533–542,New York, NY, USA, 2013.Association for Computing Machinery. [Google Scholar]

130. Sinnenberg L, Buttenheim AM, Padrez K, Mancheno C, Ungar L, and Merchant RM. Twitter as a tool for health research: asystematic review. American J. of public health,107(1):e1–e8,2017. [PMC free article] [PubMed] [Google Scholar]

131. Strehl A and Ghosh J. Cluster ensembles—a knowledge reuseframework for combining multiple partitions. Journal of machine learning research,3(Dec):583–617,2002. [Google Scholar]

132. Sulieman L, Gilmore D, French C, Cronin RM, Jackson GP, Russell M, and Fabbri D. Classifying patient portal messages usingconvolutional neural networks. Journal of biomedical informatics,74:59–70,2017. [PubMed] [Google Scholar]

133. Sun A. Short text classification using very few words.In Proceedings of the 35th International ACM SIGIR Conference onResearch and Development in Information Retrieval, SIGIR‘12, page 1145–1146,New York, NY, USA, 2012.Association for Computing Machinery. [Google Scholar]

134. Surian D, Nguyen DQ, Kennedy G, Johnson M, Coiera E, and Dunn AG. Characterizing twitter discussions about hpvvaccines using topic modeling and community detection.Journal of Medical Internet Research,18(8):e232,2016. [PMC free article] [PubMed] [Google Scholar]

135. Tian F, Gao B, Cui Q, Chen E, and Liu T-Y. Learning deep representations for graph clustering. In Proceedings of the Twenty-Eighth AAAIConference on Artificial Intelligence, AAAI’14, page1293–1299. AAAI Press,2014. [Google Scholar]

136. Tibshirani R, Walther G, and Hastie T. Estimating the number of clusters in a data setvia the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology),63(2):411–423,2001. [Google Scholar]

137. Van der Zanden R, Curie K, Van Londen M, Kramer J, Steen G, and Cuijpers P. Web-based depression treatment: Associations of clients' word use with adherence and outcome. Journal of affective disorders, 160:10–13, 2014. [PubMed] [Google Scholar]

138. Ventola CL. Social media and health care professionals:benefits, risks, and best practices. Pharmacy and Therapeutics,39(7):491,2014. [PMC free article] [PubMed] [Google Scholar]

139. Vraga EK, Stefanidis A, Lamprianidis G, Croitoru A, Crooks AT, Delamater PL, Pfoser D, Radzikowski JR, and Jacobsen KH. Cancer and social media: A comparison of trafficabout breast cancer, prostate cancer, and other reproductive cancers ontwitter and instagram. Journal of health communication,23(2):181–189,2018. [PubMed] [Google Scholar]

140. Wang Y and Chen L. Multi-exemplar based clustering for imbalanced data. In 2014 13th International Conference on ControlAutomation Robotics & Vision (ICARCV), pages1068–1073. IEEE,2014. [Google Scholar]

141. Wei T, Lu Y, Chang H, Zhou Q, and Bao X. A semantic approach for text clustering usingwordnet and lexical chains. Expert Syst. Appl.,42(4):2264–2275,Mar. 2015. [Google Scholar]

142. Wei X and Croft WB. Lda-based document models for ad-hoc retrieval.In Proceedings of the 29th annual international ACM SIGIR conferenceon Research and development in information retrieval, pages178–185. ACM,2006. [Google Scholar]

143. Wu Y, Liu M, Zheng WJ, Zhao Z, and Xu H. Ranking gene-drug relationships in biomedicalliterature using latent dirichlet allocation. InBiocomputing 2012, pages422–433. WorldScientific, 2012. [PMC free article] [PubMed] [Google Scholar]

144. Xie P and Xing EP. Integrating document clustering and topic modeling. In Proceedings of the Twenty-Ninth Conferenceon Uncertainty in Artificial Intelligence, UAI’13, page694–703, Arlington, Virginia,USA, 2013. AUAIPress. [Google Scholar]

145. Xu J, Xu B, Wang P, Zheng S, Tian G, and Zhao J. Self-taught convolutional neural networks forshort text clustering. Neural Networks,88:22–31,2017. [PubMed] [Google Scholar]

146. Xu T and Oard DW. Wikipedia-based topic clustering formicroblogs. Proceedings of the American Society for Information Science and Technology,48(1):1–10,2011. [Google Scholar]

147. Yan X, Guo J, Lan Y, and Cheng X. A biterm topic model for short texts. InProc of the 22nd Int Conference on World Wide Web, WWW‘13, pages1445–1456, New York, NY,USA, 2013. ACM. [Google Scholar]

148. Yin J, Chao D, Liu Z, Zhang W, Yu X, and Wang J. Model-based clustering of short text streams.In Proc of the 24th ACM SIGKDD Int Conference on Knowledge Discovery& Data Mining, KDD ‘18, pages2634–2642, New York, NY,USA, 2018. ACM. [Google Scholar]

149. Yin J and Wang J. A dirichlet multinomial mixture model-based approach for short text clustering. In Proceedings of the 20th ACMSIGKDD International Conference on Knowledge Discovery and Data Mining, KDD‘14, page 233–242,New York, NY, USA, 2014.Association for Computing Machinery. [Google Scholar]

150. Yin J and Wang J. A model-based approach for text clustering with outlier detection. In 2016 IEEE 32nd International Conference onData Engineering (ICDE), pages625–636. IEEE,2016. [Google Scholar]

151. Yin J and Wang J. A text clustering algorithm using an online clustering scheme for initialization. In Proceedings of the 22ndACM SIGKDD International Conference on Knowledge Discovery and Data Mining,KDD ‘16, page1995–2004, New York, NY,USA, 2016. Association for ComputingMachinery. [Google Scholar]

152. Žalik KR and Žalik B. Validity index for clusters of different sizesand densities. Pattern Recognition Letters,32(2):221–234,2011. [Google Scholar]

153. Zhang H, Wheldon C, Dunn AG, Tao C, Huo J, Zhang R, Prosperi M, Guo Y, and Bian J. Mining twitter to assess the determinants ofhealth behavior toward human papillomavirus vaccination in the unitedstates. Journal of the American Medical Informatics Association,27(2):225–235,2020. [PMC free article] [PubMed] [Google Scholar]

154. Zhang L, Hall M, and Bastola D. Utilizing twitter data for analysis ofchemotherapy. International journal of medical informatics,120:92–100,2018. [PubMed] [Google Scholar]

155. Zhao Y, Guo Y, He X, Huo J, Wu Y, Yang X, and Bian J. Assessing mental health signals among sexual and gender minorities using twitter data. In 2018 IEEEInternational Conference on Healthcare Informatics Workshop(ICHI-W), pages 51–52.IEEE, 2018. [PMC free article] [PubMed] [Google Scholar]

156. Zheng CT, Liu C, and San Wong H. Corpus-based topic diffusion for short textclustering. Neurocomputing,275:2444–2458,2018. [Google Scholar]

157. Zuo Y, Wu J, Zhang H, Lin H, Wang F, Xu K, and Xiong H. Topic modeling of short texts: A pseudo-document view. In Proceedings of the 22nd ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining, KDD‘16, page 2105–2114,New York, NY, USA, 2016.Association for Computing Machinery. [Google Scholar]

158. Zuo Y, Zhao J, and Xu K. Word network topic model: A simple but generalsolution for short and imbalanced texts. Knowl. Inf. Syst.,48(2):379–398,Aug. 2016. [Google Scholar]
