If you spend much time working with computers then you’re likely familiar with metadata, or data about data. In the digital era metadata is relied upon for many of the tagging and categorization systems that are seen in popular web environments, such as Twitter, Digg, Delicious, Facebook, and so forth, and is more generally used to define, structure, and administrate data across all digital environments. I should state, upfront, that metadata is incredibly valuable: nothing that I’m going to write about should leave you with the suggestion that metadata should be removed from the digital landscape or could be removed. Instead I’m advocating for a responsible use of metadata.

In this post I will be drawing on a pair of examples to underscore just how much data is contained in popular metadata structures: the information divulged every time a person tweets on Twitter, and what your mobile phone operator may be giving up to third parties when you browse the web on your phone. In the latter case, especially, we see that metadata is not just important for routing data traffic but also responsible for disclosing a considerable amount of personal information. I’ll conclude by noting, once again, that our privacy regulators, commissioners, advocates, and researchers need additional funding if citizens are to have those parties regularly identify ‘bad’ metadata practices and seek rapid remedies before the data ends up being datamined for illicit or unjustifiable reasons.

Turn to Twitter

Twitter is a social networking service that lets individuals ‘tweet’ short messages to the public, and to each other, in 140 characters or fewer. I’ve used it for some time now and find it valuable for sharing short bursts of information (e.g. web links) as well as brief commentary and short ‘water-cooler style’ conversations. While it’s well known that the conversations people have using the social networking tool are generally public, how much data, really, could be involved in 140 characters? It turns out, there’s quite a bit divulged in the structure of a tweet.

The above image was put together by Raffi Krikorian and breaks down all of the information emitted when someone releases a tweet into the Twitter ecosystem; as should be evident, it holds a lot more than just the words that you send to the ‘net and the required identifying information. After the identifier of the tweet in the first line, the text of the message in the second, and other content-related information (e.g. is it a rebroadcast of a prior message, when was it created, etc.), we see that the user’s full biographical information is transmitted along with geocoding information, when the account was created, the number of favorites the user has, the number of followers the user has and how many people the user follows, the application that sent the tweet, the total number of tweets, and so on. Clearly, more is being broadcast than the content of a message, and this brings me to my point: the metadata associated with information exchanges, even using a system as ‘simple’ as Twitter, is extensive. The medium, as constituted through content and metadata, communicates the whole message, whereas a content analysis alone reveals only a surprisingly small part of what is communicated with each tweet.

This metadata can constitute personally identifiable information. Depending on the settings that a user has enabled or disabled it is possible to develop precise ideas of where the user is located in space, when they are communicating with people, who they are regularly in conversation with or see as influential (by examining the number of times they re-broadcast, or retweet, other people’s messages or communicate with particular other users), and where and how they likely use the system from (e.g. if you see Tweetie 2 as the application, then you know that they’re using an iPhone, and if the geolocation information is regularly changing then you can assume that they are using the service while mobile). Drawing on metadata, just in this limited example, it is possible to develop very personalized pictures of individuals. I would suggest that few people know just how much information is being transmitted when they choose to use this social networking tool, and the information divulged will only increase with the full-scale rollout of annotations.
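
To make the point concrete, here is a rough Python sketch of how little of a tweet is actually ‘content’. The payload below is something I made up for illustration: the field names are assumptions loosely modelled on the kinds of information described above, not a copy of Twitter’s actual schema, and the inference rules are deliberately crude.

```python
# Hypothetical, heavily trimmed tweet payload. The field names and values
# are illustrative assumptions, not Twitter's actual schema.
sample_tweet = {
    "id": 12345,
    "text": "Reading about metadata and privacy.",
    "created_at": "2010-04-20T18:30:00",
    "source": "Tweetie 2",
    "geo": {"lat": 48.4284, "lon": -123.3656},
    "in_reply_to_screen_name": "example_friend",
    "user": {
        "screen_name": "example_user",
        "description": "Grad student. Privacy, surveillance, coffee.",
        "location": "Victoria, BC",
        "created_at": "2008-01-15T09:00:00",
        "followers_count": 250,
        "friends_count": 180,
        "favourites_count": 12,
        "statuses_count": 3400,
    },
}


def flatten(payload, prefix=""):
    """Return a dict mapping dotted field paths to leaf values."""
    flat = {}
    for key, value in payload.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=path + "."))
        else:
            flat[path] = value
    return flat


def metadata_fields(payload, content_keys=("text",)):
    """Everything in the payload that is not the message content itself."""
    return {k: v for k, v in flatten(payload).items() if k not in content_keys}


def infer(payload):
    """Draw a few rough inferences of the kind described in the post."""
    notes = []
    if payload.get("geo"):
        notes.append("location at time of posting is known")
    if "Tweetie" in payload.get("source", ""):
        notes.append("probably posting from an iPhone")
    if payload.get("in_reply_to_screen_name"):
        notes.append("in conversation with " + payload["in_reply_to_screen_name"])
    return notes


if __name__ == "__main__":
    meta = metadata_fields(sample_tweet)
    print(f"1 content field, {len(meta)} metadata fields")
    for note in infer(sample_tweet):
        print("-", note)
```

Even in this toy payload the message text is one field among many, and three separate guesses about the sender fall out of the metadata alone.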

According to Sarah Perez’s article “This is What a Tweet Looks Like”, annotations will let third-party developers:

add any additional metadata to a Twitter post. That’s right, any data. And a tweet can have more than one annotation attached to it. This extra data will initially start off small –Twitter developer Marcel Molina said it will “probably” be around 512 bytes. But over time, it will gradually grow larger as Twitter rolls out the feature and scales up in order to support it. The company hopes to have it end up “around 2K,” says Molina. How developers use that extra space is entirely up to them – there can be one giant piece of extra data attached to a tweet or a thousand tiny ones.

What is most significant is that Twitter actually has no clue how, exactly, annotations will be used. This can be read in a very positive light – it means that the community can come together to develop new uses in an organic fashion – or a not-so-positive light – the company is abandoning the responsibility of preventing certain kinds of annotations at the API level and is not building in privacy by design. I’m not claiming that annotations should be prevented from coming into being, but I do strongly believe that, given the amount of information already contained in the metadata of tweets, it’s very possible for enterprising developers to include information that would further increase the amount of personally identifiable information that is trafficked. Further, given that few people will know how to identify what is contained in the packets sent to and from Twitter, end-users are unlikely to realize just what they’re unintentionally divulging.

On a note of the (lack of) corporate transparency at Twitter, I have yet (after hours of hunting) to find information like that revealed by Krikorian anywhere on Twitter’s website in an easily accessible, relatively easily understandable, format. Developer-speak is insufficient to meet any reasonable transparency requirement where you are providing a service to the public. This failure to be maximally transparent to a generally non-technically inclined public is a substantial corporate failing, and one that should be remedied by Twitter itself. Relying on third-party researchers to make information available should not be an acceptable way for the public to learn what they are, and are not, transmitting to the ‘net when they use a communications system. Unfortunately, we are often forced to rely on these enterprising minds because they are the only means through which the public can actually learn what their communications companies are, and are not, doing.

The Mobile Web Experience

If you have what’s commonly referred to as a ‘dumb-phone’, or a phone that has relatively limited functionality (e.g. not an iPhone, Blackberry, or Windows Mobile device) but that can still surf the web, then you’re potentially leaving your phone number (and more!) all over the web. At a recent security conference in Vancouver, CanSecWest, Collin Mulliner (PhD student at Security in Telecommunications at Technical University Berlin, Germany) revealed that a substantial amount of data is often leaked when people browse the web with their phones. What is particularly interesting is that the Mobile Subscriber Integrated Services Digital Network Number (MSISDN) – your phone number – is regularly being revealed to third parties as a result of how mobile operators appear to have configured their networks. Other information that is often provided to third parties includes the International Mobile Subscriber Identity (IMSI) – your SIM card’s unique identifier – and the International Mobile Equipment Identity (IMEI) – the phone’s unique ID – in addition to some data that cannot be correlated 100% to any known element of mobile networks.

When you browse the web on your mobile phone, it’s often the case that your request for a webpage goes through a proxy maintained by your mobile network operator before requesting data from the website that you’re interested in. This can be done for a variety of reasons, including data compression and to facilitate better experiences with mobile operators’ own websites, such as faster billing processes. The existence of these proxies was revealed when Mulliner found that the mobile phones he was testing didn’t have the information that was being sent to third-party websites. As an example, his personal phone didn’t have his customer ID, but when he configured his own website to capture the headers of incoming requests originating from his phone, he learned that his Vodafone/BILDmobil mobile phone revealed both his phone number and (what appears to be) his subscriber ID. This information was contained in the headers of the requests being received by his server. Other mobile carriers that are also leaking sensitive information include:

  • Rogers in Canada, which reveals the MSISDN (phone number)
  • H3G S.p.a. in Italy, which reveals the MSISDN
  • Orange in the UK, which reveals the MSISDN along with networking information that includes connection type, IP address, and Gateway IDs
  • Pelephone in Israel, which leaks the MSISDN, IMSI, and IMEI
  • Zain in Nigeria, which reveals whether you’re roaming and your MSISDN
  • Bharat Sanchar Nigam Ltd in India, which reveals the MSISDN, network access type, how the phone is paid for (e.g. prepaid or not), and IMSI

These are a handful of the mobile operators providing this information to a third-party website, and such information is likely being divulged because mobile operators’ proxy servers are attaching the information for internal purposes. Rogers, as an example, might want to attach this information because when visiting an internal Rogers website with the information attached some services might be provided more quickly and easily. Unfortunately, the information is appended to data packets regardless of whether you are visiting an internal Rogers website or my website, christopher-parsons.com. The theory that a proxy is responsible follows because, as noted by Mulliner, he has no log entries from smart phones that avoid pre-configured mobile proxies.

Commonly leaked data includes the MSISDN, IMSI, IMEI, access point name (APN), and customer or account ID. Also captured, though more rarely, is the roaming status type and whether the account is pre-paid or not. These are significant data leaks because the unique IDs – especially those not tied to a phone, but to a user or account instead – can be used to track mobile subscribers’ online behaviour. Moreover, with a phone number it is possible to do a reverse lookup to personally identify site visitors, or to send an SMS message to extract further information on web visitors.
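
For those curious what a check like this might look like on the receiving website, here is a minimal Python sketch that scans the headers of an incoming request for anything resembling a subscriber identifier. To be clear, this is my own illustration and not Mulliner’s code: the list of name hints, the digit-matching rule, and the `X-Example-…` header names in the sample are all assumptions made up for the example.

```python
import re

# Substrings in a header name that hint at a subscriber identifier.
# This list is an illustrative assumption, not a catalogue from the talk.
SUSPECT_NAME_HINTS = ("msisdn", "imsi", "imei", "subno", "line-id", "subscriber")

# A run of 10 to 15 digits, optionally prefixed by "+", is roughly the
# shape of a phone number, an IMSI, or an IMEI.
LONG_DIGITS = re.compile(r"\+?\d{10,15}")


def find_leaks(headers):
    """Return {header_name: value} for headers that look like they leak an ID."""
    leaks = {}
    for name, value in headers.items():
        lowered = name.lower()
        # Flag headers whose name mentions a known identifier type.
        name_hit = any(hint in lowered for hint in SUSPECT_NAME_HINTS)
        # Flag non-standard (X-) headers whose whole value is a long number.
        value_hit = lowered.startswith("x-") and bool(
            LONG_DIGITS.fullmatch(value.strip())
        )
        if name_hit or value_hit:
            leaks[name] = value
    return leaks


# Hypothetical request headers as a web server might see them after an
# operator proxy has added its own fields. All names and numbers are made up.
sample_headers = {
    "Host": "example.org",
    "User-Agent": "ExamplePhone/1.0",
    "X-Example-MSISDN": "15555550123",
    "X-Example-Roaming": "no",
    "X-Example-Account": "4915555550123",
}

if __name__ == "__main__":
    for name, value in find_leaks(sample_headers).items():
        print(f"possible leak: {name} = {value}")
```

The uncomfortable part is how little effort this takes: any site operator who logs request headers already has this data, whether or not they ever go looking for it.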

Mulliner has provided a webpage – which he states is not being used to capture logs of inbound traffic! – to test whether your mobile phone is divulging personal information. For the record: my Rogers iPhone is not divulging anything it shouldn’t be according to Mulliner’s test.

Takeaway

So, what should we take away from the work of both Krikorian and Mulliner? To begin, it should be evident that people with technical capabilities are interested in privacy, are interested in forcing companies to be transparent, and are successful in extracting the technical information required to reveal what various communications companies are up to. This is very, very important work, and runs counter to trite statements such as ‘technologists don’t care about privacy’. What needs to be done next, however, is to find additional avenues through which their findings can be effectively released to the policymakers, lawmakers, and regulators that are responsible for securing citizens’ privacy.

On the one hand, this requires a body of privacy legislators and regulators that is both proactive – what could potentially be appended in an annotation to a tweet, and what should be prevented from being appended per Canadian privacy law? – and fiercely and rapidly reactive – Rogers should be immediately required to alter their operating practices. The Canadian commissioners, along with some members of the CRTC, genuinely work to secure the privacy of Canadians but are (obviously!) unable to be everywhere at once: additional resources need to flow especially into the offices of the privacy commissioners so that they can have members of their office at key security conferences, telco meetings, and related events. Further, they often need broader mandates; Ontario’s commissioner, as an example, needs to see more fall under her auspices than just health breaches.

To secure privacy, which acts simultaneously as a community and individual good, we must see better resourcing of ‘official’ outlets such as the Commissioners. There has to be better funding of civil advocacy groups that often are involved in launching research projects and initiating legal fights over infringements of Canadians’ privacy. Moreover, enhanced funding for Canada’s research institutions is required, so that aspiring graduate students and enterprising faculty can work with civil society and government to address the seemingly endless infringements on Canadians’ rights to privacy. Without substantial influxes of funding, without a real commitment to protecting Canadians’ privacy, we will see occasional successes that are soured by the prevalent failure to account for, and respond to, the hidden infringements on our privacy that often manifest only after examining the metadata of traffic delivered to, and received from, the ‘net.