An Automated Multiscale Map of Conversations: Mothers and Matters
About Cafemom Dataset
As with many social media platforms, Cafemom contains a huge repository of data. As mentioned before, Cafemom’s discussion boards are divided into groups, and while some portions are open to the public, a majority of the groups are private. To access the data in these groups, we create a membership profile. Once accepted into these groups, we proceed to crawl all the data from groups that refer to vaccination.
The resulting corpus contains 139,457 threads spanning 18 groups, with a total of 1,700,086 posts from 27,790 users over a span of around 5 years, from Feb 6th 2007 to Apr 24th 2012. During this time, there were a total of 18,498,306 thread views, both from users and from readers (people who read but do not post).
In the figure on the right, panel a) depicts the relationship between threads and their lengths (number of posts). 85% of the threads have between 1 and 20 posts. Only 1% of the threads in the corpus have more than 100 posts. We also find that 98% of the threads are under 6 months of age, which suggests that threads addressed issues relevant to specific events within a limited time frame. Figure 1.b) shows the relationship between users and their posts. 79% of the users posted between 1 and 20 posts. The most active members, who make up 1% of the users in the dataset and include a subset of group administrators employed by Cafemom, posted more than 1000 posts each.
Site Topology and Characteristics
We use the time stamps of posts to examine the growth of user interactions. From 2007 to 2008, the user activity shown in Fig. c) increases. After peaking at the end of 2008, user activity begins to decrease gradually. The novelty of Cafemom as a social platform exclusively for mothers and mothers-to-be when it first started in 2006 contributed to its rapid rise in popularity during its first few years. As shown in Fig. d), post activity peaked during 2010 and then declined sharply. Possible explanations are that users were migrating to other groups on Cafemom or to other social networks such as Facebook.
To get a broader view of user content, we employ Latent Dirichlet Allocation (LDA), an unsupervised method of topic discovery. In LDA, a document (here, a thread) is modeled as a mixture of topics, and each word in the document is attributable to one of those topics. We generate a set of ten well-defined topics, available here: File:TopicModeling Results.zip. This folder contains:
- List of ten topics in file cafemomthread_topickeys
- Topic Probabilities per document for all 139,457 threads in file cafemomthread_doctopics
After performing topic modeling on threads, we categorize the threads under these topics. If the largest topic proportion of a thread is greater than 0.3, the thread is associated with that topic alone. We chose 0.3 as the threshold because all such threads were found to have comparatively low probabilities for their other topics.
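The thresholding rule above can be sketched as follows. This is a minimal illustration, not the code used to build the dataset: the function name `assign_topic` and the in-memory list of proportions are assumptions, and the layout of the cafemomthread_doctopics file is deliberately not parsed here.

```python
def assign_topic(proportions, threshold=0.3):
    """Return the index of the dominant topic, or None if no topic qualifies.

    `proportions` holds one thread's topic proportions (hypothetical input).
    The thread is assigned only when its largest proportion exceeds
    `threshold` (0.3 in the text above).
    """
    best = max(range(len(proportions)), key=proportions.__getitem__)
    return best if proportions[best] > threshold else None


# Illustrative values only, not taken from the dataset.
focused_thread = [0.05, 0.62, 0.03, 0.05, 0.05, 0.05, 0.05, 0.04, 0.03, 0.03]
diffuse_thread = [0.1] * 10

print(assign_topic(focused_thread))  # index of the dominant topic
print(assign_topic(diffuse_thread))  # no topic passes the threshold
```

Threads for which the function returns None are simply left uncategorized in this sketch.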
The 10 topics found highlight distinct themes in our data set. To examine each topic in greater detail and granularity, we perform sub-topic modeling on the threads associated with each topic. The results give a set of five well-defined sub-topics for each topic. By performing sub-topic modeling, we can see that there is an overlap between topics, which suggests that users voice their concern about a particular topic in more than one arena. For example, one can see that users concerned about vaccination were also discussing it in the context of autism.
Thus, the interplay between topics becomes visible through sub-topic modeling. Results from sub-topic modeling are available here: File:Sub-TopicModeling Results.zip. This folder contains:
- List of five sub-topics generated for each topic in file cafemomthreadtopic#_topickeys
- Sub-topic Probabilities per document for all threads that were categorized under the main topic in file cafemomthreadtopic#_doctopics
The posts that had previously been categorized under topics are further divided into 6-month time slots beginning on Feb 6th 2007. For all posts falling in a given time slot, we tokenize the text using appropriate regular expressions, filter out the stop words, and create a bag of words. We calculate the term weight of each unigram in each time slot under each topic, and then sort the unigrams in order of decreasing term weight. This data-driven approach helps provide an accurate picture of which issues were relevant at different periods of time, and thus supports studying the evolution of user-generated content. Results for the top 20 unigrams in each time slot under each topic are available here: File:TopUnigrams.zip
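The pipeline described above can be sketched for the posts of a single topic as follows. Several details are assumptions made for illustration: the text does not say which term weight was used, so relative frequency within the time slot stands in for it, and the token pattern, the tiny stop-word list and the function names are hypothetical.

```python
import re
from collections import Counter, defaultdict
from datetime import date

START = date(2007, 2, 6)           # first day covered by the corpus
TOKEN_RE = re.compile(r"[a-z']+")  # assumed token pattern
STOP_WORDS = {"the", "a", "and", "to", "of", "is", "my", "for", "in", "i"}  # illustrative only


def slot_index(posted_on, start=START, months_per_slot=6):
    """Index of the 6-month slot a date falls in (slot 0 begins on `start`)."""
    months = (posted_on.year - start.year) * 12 + (posted_on.month - start.month)
    if posted_on.day < start.day:
        months -= 1  # the current month of the slot is not yet complete
    return months // months_per_slot


def top_unigrams(posts, k=20):
    """Map each time slot to its k highest-weighted unigrams.

    `posts` is an iterable of (date, text) pairs for one topic. The weight
    is the unigram's share of all kept tokens in the slot (an assumption).
    """
    counts = defaultdict(Counter)
    for posted_on, text in posts:
        tokens = [t for t in TOKEN_RE.findall(text.lower()) if t not in STOP_WORDS]
        counts[slot_index(posted_on)].update(tokens)
    ranked = {}
    for slot, counter in counts.items():
        total = sum(counter.values())
        # Sort by decreasing count, breaking ties alphabetically.
        best = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        ranked[slot] = [(term, n / total) for term, n in best]
    return ranked
```

Swapping in a different weighting scheme, such as TF-IDF across time slots, would only change the last loop.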
In addition to the online discussion boards, Cafemom has an underlying friendship network. Out of 27,790 users, 16,731 (60%) have friends on the site, and these users form the friendship network in our dataset. We use a fast greedy community finding approach to cluster the users in this network. After performing community finding on the 16,731 users, we eliminate users belonging to communities with fewer than 100 members, leaving us with 15,332 users. To perform community finding on this set of users, you can use the Network_Unipartite.txt file available in Unipartites.zip. Community finding on this set of users gives us 88 communities with a modularity of 0.5.
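To make the community finding step and the reported modularity concrete, here is a minimal pure-Python sketch of greedy modularity agglomeration on an undirected edge list. It is a naive illustration of the idea behind fast greedy methods, not the optimized algorithm and not necessarily the implementation used for this dataset. All function names are hypothetical, and the layout of Network_Unipartite.txt is not assumed, so edges are passed in as pairs.

```python
from collections import defaultdict


def greedy_communities(edges):
    """Merge communities pairwise while some merge still raises modularity."""
    m = len(edges)
    comm_degree = defaultdict(int)
    for u, v in edges:
        comm_degree[u] += 1
        comm_degree[v] += 1
    members = {node: {node} for node in comm_degree}  # start from singletons
    between = defaultdict(int)  # edge counts between pairs of communities
    for u, v in edges:
        if u != v:
            between[frozenset((u, v))] += 1
    while True:
        best_gain, best_pair = 0.0, None
        for pair, e_ab in between.items():
            a, b = tuple(pair)
            # Modularity gain of merging communities a and b.
            gain = e_ab / m - comm_degree[a] * comm_degree[b] / (2 * m * m)
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:
            break
        a, b = best_pair
        members[a] |= members.pop(b)  # merge b into a
        comm_degree[a] += comm_degree.pop(b)
        merged = defaultdict(int)
        for pair, e in between.items():
            if pair == frozenset((a, b)):
                continue  # these edges are now internal to a
            if b in pair:
                (other,) = pair - {b}
                merged[frozenset((a, other))] += e
            else:
                merged[pair] += e
        between = merged
    return list(members.values())


def modularity(edges, communities):
    """Q = sum over communities of (internal edges / m) - (degree sum / 2m)^2."""
    m = len(edges)
    label = {node: i for i, c in enumerate(communities) for node in c}
    internal, degree = defaultdict(int), defaultdict(int)
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            internal[label[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


def drop_small(communities, min_size=100):
    """Discard communities with fewer than `min_size` members."""
    return [c for c in communities if len(c) >= min_size]
```

On a network the size of this one, a library implementation of fast greedy community detection would be the practical choice, since the loop above rescans every community pair on each merge.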
We perform sub-community finding on the 5 biggest communities to break them down into smaller sub-communities with fewer than 1000 members, so that all communities are comparable in size. The 5 biggest communities have sizes 4030, 3508, 2572, 2546 and 1314 respectively. To perform sub-community finding on these communities, you can use the com#_uni.txt files available in Unipartites.zip. Sub-community finding on these 5 communities gives modularities of 0.53, 0.47, 0.65, 0.39 and 0.77 respectively. Our aim was to break down larger communities into smaller chunks in order to find more meaningful groups by examining user similarity based on topics. File:Unipartites.zip
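One way to organize the sub-community step is sketched below, under stated assumptions: `detect` is a hypothetical callable that takes an edge list and returns a list of node sets (any community finding routine could be plugged in), and each large community is re-clustered once on the edges that stay inside it. Whether the com#_uni.txt files were produced exactly this way is not stated in the text.

```python
def induced_edges(edges, members):
    """Keep only the edges whose two endpoints both lie in `members`."""
    members = set(members)
    return [(u, v) for u, v in edges if u in members and v in members]


def split_large_communities(edges, communities, detect, max_size=1000):
    """Re-cluster every community that has `max_size` or more members.

    Smaller communities are passed through unchanged. This is a single
    pass: sub-communities are not split again if they are still large.
    """
    refined = []
    for community in communities:
        community = set(community)
        if len(community) < max_size:
            refined.append(community)
            continue
        parts = [set(p) for p in detect(induced_edges(edges, community))]
        refined.extend(parts)
        # Users with no edge inside the community are kept as singletons.
        covered = set().union(*parts)
        refined.extend({user} for user in community - covered)
    return refined
```

The modularity of each split can then be computed on the induced edge list of the corresponding community, which is how per-community values such as those quoted above would be obtained in this setup.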