A number of predictors have been suggested to detect the most influential spreaders of information in online social media across various domains such as Twitter or Facebook. In particular, degree, PageRank, k-core and other centralities have been adopted to rank the spreading capability of users in information dissemination media. So far, validation of the proposed predictors has been done by simulating the spreading dynamics rather than following real information flow in social networks. Consequently, only model-dependent contradictory results have been achieved so far for the best predictor. Here, we address this issue directly. We search for influential spreaders by following the real spreading dynamics in a wide range of networks. We find that there are plausible situations where the widely-used degree and PageRank fail in ranking users’ influence. We find that the best spreaders are consistently located in the k-core across dissimilar social platforms such as Twitter, Facebook, Livejournal and scientific publishing. Furthermore, when the complete global network structure is unavailable, we find that the sum of the nearest neighbors’ degree is a reliable local proxy for user’s influence. Our analysis provides practical instructions for optimal design of strategies for “viral” information dissemination in relevant applications.
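As an illustration of two of the predictors named above, the following is a minimal sketch of a k-shell (k-core) decomposition and of the sum of the nearest neighbors' degrees on an undirected network. The function names and the toy edge list are assumptions made for this example, and this is not necessarily the implementation used in the study.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs, ignoring self-loops."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def k_shell_index(adj):
    """Return {node: k-shell index} by repeatedly peeling low-degree nodes."""
    degree = {node: len(neighbors) for node, neighbors in adj.items()}
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        # The next shell starts at the smallest degree still present.
        k = max(k, min(degree[node] for node in remaining))
        queue = [node for node in remaining if degree[node] <= k]
        while queue:
            node = queue.pop()
            if node not in remaining:
                continue
            remaining.discard(node)
            shell[node] = k
            # Removing a node lowers the degree of its surviving neighbors.
            for other in adj[node]:
                if other in remaining:
                    degree[other] -= 1
                    if degree[other] <= k:
                        queue.append(other)
    return shell


def neighbor_degree_sum(adj):
    """Return {node: sum of the degrees of its nearest neighbors}."""
    return {node: sum(len(adj[other]) for other in adj[node]) for node in adj}


# Toy network (hypothetical): a triangle a-b-c with a pendant node d on a.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("a", "d")]
adj = build_adjacency(edges)
shells = k_shell_index(adj)
local_scores = neighbor_degree_sum(adj)
```

The neighbor-degree sum only needs each node's immediate neighborhood, which is why it can serve as a local proxy when the global structure required by the k-shell decomposition is unavailable.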
The data that we used in this study can be downloaded here, compressed in .rar format:
(1) APS Dataset: This file contains the coauthorship and citations of all scientific papers published in American Physical Society (APS) journals until 2005, including Physical Review A, B, C, D, E and Physical Review Letters. Each node represents an author of scientific papers.
The files PR.txt – PRL.txt record the information of each paper in corresponding journals. Each record appears like:
<journal jcode="PRI" short="Phys. Rev. (Series I)">Physical Review (Series I)</journal>
<title>On the Relation between the Lengths of the Yard and the Meter</title>
<aff >Shannon Physical Laboratory, Colby University</aff>
<cpyrtdate date="1893" /><cpyrtholder>The American Physical Society</cpyrtholder>
The file citing_cited.csv records the citation of papers.
Each line means a paper (citing_paper_doi) has cited another paper (cited_paper_doi).
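A minimal sketch of how the citation file could be read to count the citations each paper receives. It assumes comma-separated lines with the citing DOI in the first column and the cited DOI in the second, as the file name suggests. Whether the file starts with a header row is not stated here, so `skip_header` is a hypothetical switch, and the DOIs in the sample are made-up placeholders.

```python
import csv
import io
from collections import Counter


def count_citations(lines, skip_header=False):
    """Count how many times each paper is cited.

    Assumes each row is: citing_paper_doi, cited_paper_doi.
    """
    received = Counter()
    reader = csv.reader(lines)
    if skip_header:
        next(reader, None)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        cited = row[1].strip()
        received[cited] += 1
    return received


# Placeholder rows standing in for citing_cited.csv (not real DOIs).
sample = io.StringIO("doi_B,doi_A\ndoi_C,doi_A\ndoi_C,doi_B\n")
citation_counts = count_citations(sample)
```

For the real file, an open file handle can be passed in place of the `io.StringIO` sample.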
I have also attached the data files I have processed. Papers and authors are indexed with integers. You can open them in Matlab.
1. Citation Data
2. Paper Information
timestamp, paper_doi_id, author_id
Each line means (author_id) has published a paper (paper_doi_id) at (timestamp). This dataset excludes PR.txt and PRI.txt since their publication time is too early (late 19th century and early 20th century).
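The coauthorship network can be derived from these records by linking every pair of authors who share a paper. The sketch below assumes the records have already been loaded as (timestamp, paper_doi_id, author_id) tuples. The function name and the sample values are illustrative only.

```python
from collections import defaultdict
from itertools import combinations


def coauthorship_edges(records):
    """Return the set of undirected author pairs that share at least one paper."""
    authors_by_paper = defaultdict(set)
    for _timestamp, paper_id, author_id in records:
        authors_by_paper[paper_id].add(author_id)

    edges = set()
    for authors in authors_by_paper.values():
        # Sorting gives each pair a canonical (smaller, larger) order.
        for a, b in combinations(sorted(authors), 2):
            edges.add((a, b))
    return edges


# Made-up records: paper 1 has three authors, paper 2 has two, paper 3 has one.
records = [
    (1990, 1, 10), (1990, 1, 11), (1990, 1, 12),
    (1995, 2, 10), (1995, 2, 11),
    (2000, 3, 13),
]
edges = coauthorship_edges(records)
```

Single-author papers contribute no edges, so such authors appear in the network only if they also coauthor another paper.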
(2) Facebook Dataset: This dataset is available online at http://socialnetworks.mpi-sws.org/data-wosn2009.html. It contains the friend relations of New Orleans Facebook social network as well as the wall posts records of users during a period of nearly two years. In the social network there are 63731 nodes with average degree 24.3. The total number of wall posts is 876992.
1. List of links
These files contain a list of all of the user-to-user links from the Facebook New Orleans networks. All links are treated as directed, even though they are undirected on Facebook.
Format: Gzipped ASCII. Each line contains two anonymized user identifiers, meaning the second user appeared in the first user’s friend list. Finally, the third column is a UNIX timestamp with the time of link establishment (if it could be determined, otherwise it is ‘\N’).
Data: Facebook Links (10.4MB)
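A minimal sketch for parsing one line of the links file, treating ‘\N’ as a missing timestamp. It assumes the three columns are separated by whitespace. The file name in the usage comment is hypothetical.

```python
import gzip


def parse_link_line(line):
    """Split one line into (first_user, second_user, timestamp or None)."""
    first, second, stamp = line.split()
    # '\N' marks a link whose creation time could not be determined.
    timestamp = None if stamp == r"\N" else int(stamp)
    return first, second, timestamp


def read_links(path):
    """Yield parsed links from a gzipped ASCII file."""
    with gzip.open(path, "rt", encoding="ascii") as handle:
        for line in handle:
            if line.strip():
                yield parse_link_line(line)


# Usage (hypothetical file name):
# links = list(read_links("facebook-links.txt.gz"))
```

User identifiers are kept as strings so that no assumption is made about how they were anonymized.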
2. List of wall posts
These files contain a list of all of the wall posts from the Facebook New Orleans networks.
Format: Gzipped ASCII. Each line contains two anonymized user identifiers, meaning the second user posted on the first user’s wall. The third column is a UNIX timestamp with the time of the wall post.
Data: Facebook Wall Posts (6.8MB)
(3) Twitter Dataset: This file contains the mention network and retweet relations extracted from the tweets sampled between January 23rd and February 8th, 2011 provided by Twitter (http://trec.nist.gov/data/tweets/). We are not allowed to distribute any private information about the Twitter users, so in the dataset each user is represented by an anonymized ID.
1. Mention Data
mention.txt records the mention relations between Twitter users.
Attributes are separated by commas without blank spaces.
Each line means: The user (user_id) mentioned another user (mentioned_user_id) in one tweet (tweet_id) at time (mention_time). (mention_time) appears like “Sun Jan 23 00:15:51 +0000 2011”. (mention_time_in_sec) is an integer translated from (mention_time), which means the number of seconds elapsed from a given time point to (mention_time).
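The time strings can be parsed with Python's standard library, as sketched below. The reference time point behind (mention_time_in_sec) is not specified here, so `reference` is a parameter the caller must supply, and the value used in the example is only an assumption.

```python
from datetime import datetime, timezone

# Matches strings such as "Sun Jan 23 00:15:51 +0000 2011".
# %a and %b expect English day and month names (the default C locale).
TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_tweet_time(text):
    """Parse a time string into a timezone-aware datetime."""
    return datetime.strptime(text.strip(), TIME_FORMAT)


def seconds_since(text, reference):
    """Return whole seconds elapsed from `reference` to the parsed time."""
    return int((parse_tweet_time(text) - reference).total_seconds())


# Assumed reference: midnight UTC on the first sampled day.
reference = datetime(2011, 1, 23, tzinfo=timezone.utc)
elapsed = seconds_since("Sun Jan 23 00:15:51 +0000 2011", reference)
```

Comparing the result with the (mention_time_in_sec) column for a few lines is one way to infer which reference point the dataset actually uses.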
2. Retweet Data
Adjacency list for the retweet data, which is the network used in the collective influence paper: Flaviano Morone and Hernán Makse, “Influence maximization in complex networks through optimal percolation”, Nature 524, 65-68 (2015).
retweet.txt records the retweet events between Twitter users.
Attributes are separated by commas without blank spaces.
Each line means: The user (retweet_user_id) retweeted a tweet from (origin_user_id) in one tweet (tweet_id) at time (retweet_time). Unfortunately, we cannot track the ID of the original tweet. The user (retweet_user_id) has (retweet_user_friendscount) friends and (retweet_user_followerscount) followers. (retweet_time) has a format like “Sun Jan 23 00:15:51 +0000 2011”. (retweet_time_in_sec) is also an integer translated from (retweet_time).
For further information or help with the files, please contact Hernan Makse.