Technology, Thoughts & Trinkets

Touring the digital through type

Update to Virgin Media and Copyright DPI

Recently, I’ve heard back from Detica about CView and wanted to share the information that Detica has provided. CView is the copyright detection Deep Packet Inspection (DPI) appliance that Virgin Media will be trialling, and is intended to measure the number of copyright infringing files that cross Virgin’s network. The resulting index will let Virgin determine whether the content deals they sign with content producers have a noticeable impact on the amount of infringing P2P traffic on their network. Where such deals reduce infringements, we might expect Virgin to invest resources in agreements with content producers; if such agreements have no impact, then Virgin’s monies will likely be spent on alternate capital investments. I’ll note up front that I’ve sent some followup questions to seek additional clarity where the answers I received were somewhat hazy; such haziness appears to have stemmed from a miscommunication, and is likely attributable to a particular question that was poorly phrased. I will also state that I’m not willing to release the name of the person I’m speaking with at Detica, as I don’t think that their name is needed for public consumption, and releasing it would be an inappropriate disclosure of personal information.

The key question that is lurking in my own mind – if not that of others interested in the CView product – is whether or not the appliance can associate inspected data flows with individuals. In essence, I’m curious about whether or not CView has the ability to collect ‘personally identifiable information’ as outlined by the Privacy Commissioner of Canada in her recent findings on Bell’s use of DPI. In her findings, the Commissioner argues that because Bell customers’ subscriber IDs and IP addresses are temporarily collated, Bell does collect personal information.

In the case of Bell, this didn’t mean that they had to stop the collection, but that they had to adjust their privacy policies to reflect this collection (though it should be noted that any such association and collection will happen, with or without a DPI appliance, because Bell always associates a subscriber ID with dynamically assigned IP addresses).

Now, this means that my examination of the CView system and consideration of privacy is different from those approaching the system from the stance of the Regulation of Investigatory Powers Act (RIPA), which might be seen as putting me off-side of some privacy advocates. I don’t necessarily have an issue with that, and in fact think that strong, well-meaning discussion amongst the privacy community can be quite healthy – different levels of analysis and approaches are called for when facing particularly novel technological systems, and expecting a lockstep approach to these technologies and their accompanying politics is somewhat absurd. To simplify things, I’ll identify a privacy infringement (for my purposes, if not those of RIPA) as entailing:

  1. A collection, processing, storage, or analysis of data that is associated with an individual, or a very specific set of individuals;
  2. A case where whatever is collected, processed, stored, or analyzed is done so to influence the individual, or specific set of individuals, in a particular and reasonably direct manner;
  3. An instance of data anonymization where there is the strong likelihood that such anonymization is either intentionally compromised or unlikely to be effective.

In terms of the CView system, let’s first address the concern of anonymization. Specifically, we have to ask how stringent the anonymization system actually is. When I asked Detica about this process, they informed me that because the CView device is intended to produce a Copyright Infringement Index (aka the ‘Piracy Index’) by evaluating the overall filesharing on a network, identity information isn’t required for this objective. IP addresses are anonymized at the source/DPI device using a pseudo-random replacement algorithm, which also entails ignoring the external IP addresses. The key generation system is managed automatically by the device (and thus an ISP can’t muck around with the system), and keys are periodically cycled and redistributed. The keys are never made available outside of the device, and once a set of keys for a given time period are discarded they cannot be recovered – the process is irreversible. On this basis, we can argue that no subscriber ID is associated with the randomized replacement algorithm, there is no way to associate a subscriber ID with the pseudo-random number after the fact, and as such the anonymization system should serve its purpose. Of course, there is a concern that there is no such thing as a truly effective anonymization process – as noted by Paul Ohm – but I think that a more technical analysis of data logs would be required to figure out whether or not we could argue that Detica’s system is a failure. At the very least, they appear to be making a real effort to keep data sets anonymous and to do what they can to prevent privacy infringing behaviour.
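
To make the description above concrete, here is a minimal sketch of what a keyed, periodically rotated replacement of IP addresses could look like. Detica has not disclosed its algorithm, so everything here is an assumption made for illustration: the `RotatingPseudonymizer` class, the choice of HMAC-SHA256, the 16-character pseudonym length, and the example address are all hypothetical.

```python
import hashlib
import hmac
import secrets


class RotatingPseudonymizer:
    """Illustrative only: replaces IP addresses with keyed pseudonyms."""

    def __init__(self, key_bytes: int = 32) -> None:
        self._key_bytes = key_bytes
        # Assumption: the key comes from the OS random source and is held
        # only in memory, never written out or exported.
        self._key = secrets.token_bytes(key_bytes)

    def pseudonym(self, ip_address: str) -> str:
        # Same address + same key gives the same pseudonym, so transfers can
        # be counted per anonymous subscriber within one key period.
        digest = hmac.new(self._key, ip_address.encode("ascii"), hashlib.sha256)
        return digest.hexdigest()[:16]

    def rotate(self) -> None:
        # Replace the key; pseudonyms issued before this call can no longer
        # be linked to pseudonyms issued after it.
        self._key = secrets.token_bytes(self._key_bytes)


anonymizer = RotatingPseudonymizer()
first = anonymizer.pseudonym("192.0.2.10")
same_period = anonymizer.pseudonym("192.0.2.10")
anonymizer.rotate()
next_period = anonymizer.pseudonym("192.0.2.10")
```

Within one key period the same address maps to the same pseudonym, which is all a counting system needs; after rotation the mapping changes. Note that simply replacing a reference in Python does not securely erase the old key from memory, so a sketch like this says nothing about whether a real device’s discarded keys are truly unrecoverable.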

One of the questions I posed to Detica, which was related to CView identifying copyright infringing files, went as follows:

“…what method is used to identify content. Is Detica using a file hash-based identification process or fingerprinting system? I ask because broadly identifying protocol alone would render any analysis of P2P data traffic as inherently infringing somewhat problematic, given that P2P is also used for legitimate file transfers.”

The member of the company I wrote to said that they couldn’t go into the specifics of how the system performs identifications for commercial reasons – this is normal when dealing with what are effectively corporate secrets – and thus couldn’t speak to whether their system uses fingerprinting or hash-based analysis. They did say, however, that the system is conservative, insofar as it makes its assessments based on the assumption that transfers are legitimate unless there are reasonable grounds for determining otherwise. As I read/translate this statement, rather than classifying all P2P traffic as infringing, the system flags as infringing only content that can be matched against its index of infringing files. Whether this entails fingerprinting (where a fragment of a file can be identified as infringing, as in a mashup that includes a second or two of a song) or only matching whole files (as in a .mp3 file of Madonna’s ‘Like a Virgin’) is, however, unknown.
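
Since Detica would not say how identification works, the following is only a sketch of the simplest option raised in my question: whole-file hash matching with a conservative default. The index contents, the `classify` and `infringement_index` names, and the use of SHA-256 are assumptions for illustration, not a description of CView.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Hypothetical index of digests for files known to be infringing.
KNOWN_INFRINGING = frozenset({sha256_hex(b"bytes of a known infringing file")})


def classify(payload: bytes, index: frozenset = KNOWN_INFRINGING) -> str:
    # Conservative default: a transfer is presumed legitimate unless its
    # digest matches an entry in the index.
    if sha256_hex(payload) in index:
        return "infringing"
    return "presumed legitimate"


def infringement_index(transfers: list, index: frozenset = KNOWN_INFRINGING) -> float:
    # Fraction of observed transfers whose digest matched the index.
    if not transfers:
        return 0.0
    matches = sum(1 for t in transfers if classify(t, index) == "infringing")
    return matches / len(transfers)


transfers = [
    b"bytes of a known infringing file",           # exact copy: matches
    b"bytes of a known infringing file, remixed",  # altered copy: no match
    b"a Linux distribution image",                 # legitimate P2P transfer
]
share = infringement_index(transfers)
```

An exact-hash approach misses the altered copy, which is the gap that fingerprinting is meant to close; which of the two CView uses remains unknown.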

Detica’s responses maintain that their CView system is deployed in a passive mode (which is expected), and I’ve asked for clarification about whether or not it rests inline or offline – whether the appliances will perform traffic analysis in real time inline with the flow of data passing through the ISP’s network, or in a delayed fashion that sees the data traffic ‘offloaded’ out of the ISP’s network. I expect that it is a passive, inline appliance, but we’ll see. The company does maintain that “there is no persistence of any analysed content – Detica CView(tm) is a measurement system so could not be used as an evidence collection mechanism.” This means that the DPI appliance cannot be used, as designed, to identify individuals trading infringing material online, and thus cannot be effectively used to enforce any three-strikes law.

Ultimately, given that CView is engaging in network-level intelligence, without correlating IP addresses with a unique signature or code, let alone a subscriber ID, I’m not certain that this system is necessarily ‘privacy infringing’ as it’s presently configured and deployed. Does this mean that it can be used to subsequently insist on deeper penetration and analysis of who is trafficking following the establishment of a ‘piracy index’? Quite possibly – the political ramifications of having quantifiable network intelligence are vast. One of the reasons why DPI appliances in general are so interesting is how they are wrapped up in the politics of net neutrality, privacy, and copyright. Given that they sit at the crossroads of these digital issues, perhaps we need to develop a framework for engaging with these devices, as follows:

  1. What does the technology do, today? Does this constitute a privacy (or, preferably, constitutional rights grounded) infringement?
  2. What can the technology do, tomorrow? In light of what it can do, how should we advocate for strong protections to prevent our concerns from arising, and channel the technology towards ‘good’ outcomes?
  3. What needs to be put in place to ensure that the ‘good’ outcomes of tomorrow triumph over the possible ‘bad’ ones? Having asked this, provide resources to achieve the good and avoid the bad.

This is a simple schema (and one that actually deserves deeper analysis), but it parallels the approach I’ve come to adopt over the past few months. It is critical that we analytically distinguish between present realities and future possibilities, as well as between the issues of network neutrality, copyright, and privacy (among others) to develop sufficiently nuanced and complicated understandings and resolutions to the insertion of DPI appliances in ISP infrastructures. DPI is unlikely to go away; the aim now has to be to identify and proclaim ‘good’ uses of the technology and work to prevent the ‘bad’ uses from becoming prominent telecommunication practices.

3 Comments

  1. This is the metaphorical equivalent of gauging people’s religious affiliations by opening letters, to see if you find Christmas cards. Or perhaps determining the popularity of a failing political leader by searching P2P messages for the term “Gordon Brown is a tyrant”. You could even apply it to industrial espionage (as Phorm did, and Experian Hitwise do) and gather competitive intelligence about companies.

    I don’t care what the purposes are. I don’t care what the motivation is. I don’t care how subtle the filtering sophistry is.

    Virgin should not be examining the *content* of private/confidential communication traffic without a warrant. People are innocent until proven guilty, a minority use P2P protocols, a subset of those people engage in sharing copyright infringing media, and those that do commit civil not criminal offences.

    Therefore Virgin customers should not be subjected to intrusive communication surveillance. It is simply completely disproportionate, and utterly illegal.

  2. I would steadfastly maintain that this is NOT like opening letters. If the ISP were somehow compromising individuals’ encryption of data packets, that would constitute ‘opening letters’ – packets are (generally) sent in the clear, as postcards are, and are subject to interception at any point. Wireshark does a good job at demonstrating the general accessibility of data packet information.

    As it stands, of course, it’s a partnership between Virgin and Detica that is performing hash-based inspections of content – identifying unique signatures to determine whether packets are involved in transmitting copyright infringing work. I have big issues with this – I think that it could provoke subsequent political approaches to ‘resolving’ copyright infringement that I see as dangerous for society – and that the issue ought to be framed in either a constitutional-rights based approach, or from a consumer advocacy/copyright position. I do not, however, think that a privacy approach on its own is sufficient to critique what Virgin and Detica are engaging in.

    It’s this granular distinction of issues that are raised by the technology that I think is important to maintain. As previously stated, I’m less aware of UK law, but if DPI is ruled as unlawful then there is a substantial impact for broad consumer network provisioning that would follow. Now, whether this impact is seen as acceptable in the face of modes of network intelligence is an interesting question, but one that would need to be addressed were packet inspection technologies to be banned (something that I really can’t see as happening, truth be told).

  3. Having been involved with the design of systems that require anonymity and traceability (for the system to function within itself), I am acutely aware of some of the difficulties involved.

    What I dislike intensely is the “fob off” when it comes to the technical details needed so that anonymity can actually be assessed.

    For instance I’m acutely aware (apparently unlike many who should know better) that,

    “IP addresses are anonymized at the source/DPI device using a pseudo-random replacement algorithm, which also entails ignoring the external IP addresses. The key generation system is managed automatically by the device (and thus an ISP can’t muck around with the system), and keys are periodically cycled and redistributed. The keys are never made available outside of the device, and once a set of keys for a given time period are discarded they cannot be recovered – the process is irreversible.”

    is a complete load that you would expect to be emitted from an orifice not too distant from a pony tail…

    If you work it backwards you will see why,

    1, The claim that keys “cannot be recovered – the process is irreversible”.

    Might well be true but is also irrelevant. There are two ways to get to the key used for the “pseudo-random replacement algorithm”: the “deterministic” forwards direction and the “irreversible” reverse direction.

    If I know how the “key” is generated in the “forwards” direction, what is to stop me “recreating” the key in the forwards direction as opposed to the (supposedly) impossible reverse direction?

    The simple answer is: if I designed the system and there is no TRUE randomness with sufficient ENTROPY, then the answer is “nothing”…

    Likewise, even if there is a TRNG with sufficient entropy, how do I know that the value used to make the key is not encoded in the first records after the key change? (See work by Adam Young and Moti Yung on kleptography.)

    Further, how do I know the key generating value is not “spread spectrum modulated” onto the time stamps?

    Or any of many other covert channels.

    Even if not done deliberately by the system designers, how do I know that their implementation does not allow either the key or its generation values to be found by a CPU cache attack, etc.?

    That is, the “keys” may not be reversible, and “The keys are never made available outside of the device”, and thus they may be unavailable to Virgin. But are they really unavailable to the system designers or knowledgeable attackers?

    2, Saying “and once a set of keys for a given time period are discarded” is not really saying much.

    Are the keys kept in unpaged memory?

    Are they securely deleted/overwritten?

    How is memory “garbage collection” etc carried out?

    All of these can critically affect whether the key is really unrecoverable or not.

    3, As for “The key generation system is managed automatically by the device… …and keys are periodically cycled and redistributed”

    This makes the short hairs on my neck rise faster than a bolt of lightning.

    What on earth does “keys are periodically cycled and redistributed” mean?

    Does this actually mean that the developers load the “master keys” in to be used like an OTP, or that the keys are dependent on some value such as the unit’s serial number, etc.?

    And what if any relationship does this actually have with the “pseudo-random replacement algorithm”?

    Likewise what about, “The key generation system is managed automatically by the device”.

    Does it simply mean a “cron” type system generates a new key set?

    If so, at what level, and how are active connections dealt with, etc.?

    4, The quote says “IP addresses are anonymized” and “which also entails ignoring the external IP addresses”.

    I’m assuming the anonymized IP addresses are those of the P2P transaction that is currently in progress.

    However, what are the “external IP addresses”? Are they an ‘encrypted’ version of the P2P transaction IP addresses, and if so, where do they go and what other information accompanies them?

    For instance, if they are output with a sufficiently accurate time stamp, then Virgin may be able to identify the real IP address simply by looking it up in its “traffic management” logs, which obviously are “not covered by RIPA” due to that humongous exemption for managing the network…

    5, Finally we arrive at “IP addresses are anonymized at the source/DPI device using a pseudo-random replacement algorithm”.

    What is the “pseudo-random replacement algorithm”?

    It sounds hand wavingly good but is it say 3DES in ECB mode?

    How about an ARC4 stream generator?

    Or just a “developer’s kludge” that has not been analysed by anybody with the degree of skill required to give others confidence?

    The devil is in the details and this is all prior to any discussion about if examining the P2P data…
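
The third commenter’s first point, that an “irreversible” key offers little if it can be regenerated in the forwards direction, can be illustrated with a short sketch. Nothing here reflects how CView actually generates keys; the serial number, the period counter, and the `derived_key` function are hypothetical stand-ins for a key built only from predictable inputs.

```python
import hashlib
import secrets


def derived_key(serial_number: str, period: int) -> bytes:
    # Hypothetical flawed design: the key depends only on values that the
    # designer (or anyone who learns the scheme) already knows.
    material = f"{serial_number}:{period}".encode("ascii")
    return hashlib.sha256(material).digest()[:16]


def random_key() -> bytes:
    # Contrast: a key drawn from the operating system's random source,
    # which cannot be recomputed from public inputs.
    return secrets.token_bytes(16)


# The device generates its key for period 42 and later "discards" it.
device_key = derived_key("SN-0001", 42)

# Someone who knows the scheme never needs to reverse anything: they simply
# run the same computation forwards and obtain the identical key.
recreated_key = derived_key("SN-0001", 42)
keys_match = device_key == recreated_key
```

A key drawn from the operating system’s random source, as in `random_key`, has no such forwards path, though this does not address the covert-channel and memory-handling concerns the commenter also raises.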
