Companies have made billions of dollars by turning everything we say, do, and look at online into an experiment in consumer profiling. Recently, some users have had enough, curtailing their use of social media or deleting their accounts completely. But that’s no guarantee of privacy, according to a new study. If you can be linked to other users, their activity can expose you, too. Now, computer scientists have shown that the Twitter streams of your 10 closest contacts can predict your future tweets even better than your own stream.
“It’s much easier than it looks” to figure out a person’s character from such second-hand surveillance, says David Garcia, a computational social scientist at the Medical University of Vienna in Austria who was not involved in the study.
Instead of predicting anyone’s actual tweets, researchers at the University of Vermont in Burlington estimated how predictable a person’s future words would be, using a measurement known as entropy. More entropy means more randomness and less repetition. They looked at the Twitter streams of 927 users, each of whom had 50 to 500 followers, as well as the 15 users each of them had tweeted at the most. In each individual’s stream, they calculated how much entropy the sequence of words contained. (On average, tweeters had more entropy than Ernest Hemingway, less than James Joyce.) They then plugged that number into a tool from information theory called Fano’s inequality to calculate how well a person’s stream could predict the first word in his or her next tweet. That upper bound on accuracy was, on average, 53%. But predicting each successive word is somewhat less accurate.
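As a rough illustration of that last step, the Python sketch below numerically inverts Fano’s inequality to turn an entropy estimate into an upper bound on prediction accuracy. It is not the researchers’ code: the function names, the vocabulary size, and the entropy values are assumptions chosen for illustration, and the study’s own method for estimating entropy from a stream of tweets is not reproduced here.

```python
import math


def binary_entropy(p):
    """Entropy, in bits, of a yes/no outcome that is 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def fano_entropy(p, vocab_size):
    """Largest entropy (bits per word) compatible with best-guess accuracy p."""
    return binary_entropy(p) + (1.0 - p) * math.log2(vocab_size - 1)


def max_predictability(entropy_bits, vocab_size, tol=1e-10):
    """Upper bound on next-word accuracy for a given entropy.

    Solves fano_entropy(p, vocab_size) == entropy_bits for p by bisection.
    On the interval [1/vocab_size, 1] that function only decreases, so the
    solution is unique.
    """
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2")
    if not 0.0 <= entropy_bits <= math.log2(vocab_size):
        raise ValueError("entropy must lie between 0 and log2(vocab_size)")
    lo, hi = 1.0 / vocab_size, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if fano_entropy(mid, vocab_size) > entropy_bits:
            lo = mid  # bound at mid is still too loose: accuracy can be higher
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Hypothetical inputs: a 5,000-word vocabulary and a few entropy values.
    for h in (4.0, 6.0, 8.0):
        bound = max_predictability(h, vocab_size=5000)
        print(f"entropy {h:.1f} bits/word -> accuracy at most {bound:.1%}")
```

With zero entropy the bound is 100%, and at the maximum possible entropy for a given vocabulary it falls to pure chance, 1 in vocab_size. Between those extremes, more entropy always means a lower ceiling on accuracy.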
Next, they calculated the upper bound for prediction based on the user’s stream, plus the streams of his or her 15 closest contacts. Accuracy rose to 60%. When they removed the user’s stream from the equation, that figure dropped to about 57%. That means that looking at the streams of a user’s contacts is nearly as good as including the user, and even better than surveilling the user alone, the researchers report today in Nature Human Behaviour. It took the streams of just 10 contacts to surpass the predictive accuracy of the individual’s own Twitter stream. For comparison, predicting what someone will write based on a random assortment of strangers’ tweets yields a maximum accuracy of 51%. (That’s nearly the 53% predictability using the person’s own tweets because there’s a lot of regularity in the English language and in what people tweet about.)
“We used some very interesting mathematics from information theory to say: If you had the perfect machine learning method, how well could you do?” says lead author James Bagrow, a data scientist at the University of Vermont in Burlington. Joanne Hinds, a psychologist at the University of Bath in the United Kingdom, calls the new approach “a unique method that goes beyond much of the existing work in this area.”
The results show that in principle, one could roughly predict what someone who’s not even on Twitter would tweet, Bagrow says. In reality, that would mean finding out who a person’s friends were offline and then finding those friends’ feeds on Twitter. But many apps ask for access to contact lists, and some have been known to share them. Facebook, for example, has used users’ contact lists to create “shadow profiles” of people not even on the network. Researchers have already used people’s own tweets to predict personality, depression, and political orientation. Hypothetical tweets based on friends’ tweets might allow the same inferences.
One practical limitation of this work is that it treats all words as equally informative, but some might say more about you than others, Bagrow says. If your friends tweet a lot about, say, gay rights, or follow only Republican politicians, that could be especially revealing of your sexuality or political orientation. Garcia has found that contacts on Friendster can predict one’s sexuality and relationship status, and contacts on Twitter can predict one’s location. “We have barely scratched the surface of what types of information can be revealed in this way,” Hinds says.
“What concerns me in terms of privacy,” Bagrow says, “is that there are so many ways that these big platforms are getting at data that I think people don’t realize.” Another thing people may not consider, he says: “When they give up their own data, they’re also giving up data on their friends. What we think is an individual choice in a social network is not really.”