GDPR: why Facebook will have to delete 80% of its European users’ data
• Thiébaut Devergranne
Facebook has been at the center of privacy debacles for quite a long time, in Europe as much as in the US – the Cambridge Analytica scandal and the announcement of a data breach affecting most of its 2 billion users being just two recent examples. Last week’s testimony by Mark Zuckerberg again raised the question of whether the processing of personal data should be regulated in the United States, but the most important subject by far is Facebook’s GDPR compliance.
Indeed, the impact of GDPR on Facebook is so significant that it will likely force the company to delete massive amounts of its European users’ data. GDPR poses many complex problems, such as the often-discussed consent (article 6.1.a) and right to be forgotten (art. 17), but the real issue Facebook’s compliance teams will have to solve is data minimisation.
Let’s look at this new principle in detail (I) before turning to its operational impact on Facebook (II). It’s a great practical case for understanding the impact of GDPR on any business.
I – The new principle of data minimisation
A – What is data minimisation
« Personal data shall be: adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed (‘data minimisation’) » (article 5.1.c)
To truly understand the scope of this principle you have to look at Recital 39, which explains how the principle should be applied; notice how strict its interpretation is:
« The personal data should be adequate, relevant and limited to what is necessary for the purposes for which they are processed. This requires, in particular, ensuring that the period for which the personal data are stored is limited to a strict minimum. Personal data should be processed only if the purpose of the processing could not reasonably be fulfilled by other means. In order to ensure that the personal data are not kept longer than necessary, time limits should be established by the controller for erasure or for a periodic review. »
So two distinct legal rules come from Recital 39:
- the controller should collect as few data points as possible;
- these data points should be kept for as short a time as possible.
Those are two extremely important legal rules. Of course, in some cases the controller is obliged by law to collect specific data, as in KYC procedures. And that’s OK, since the law requires it. The last element to note, from a purely legal perspective, is that data minimisation also overlaps with another principle, set by article 5.1.e: storage limitation.
« Personal data shall be: kept in a form which permits identification of data subjects for no longer than is necessary for the purposes for which the personal data are processed »
This second principle targets the data that is « kept », so it complements and partly overlaps data minimisation, which is broader (data collected, stored, consulted, processed, etc.). That’s it for the principle itself. Let’s take a quick look at what the rules were before, then take a practical example to understand how data minimisation works in real life.
B – Before data minimisation : anything « non excessive »
GDPR brings a huge change. Before, Directive 95/46 stated that the controller could collect personal data as long as it was not excessive (article 6: « personal data must be (…) (c) adequate, relevant and not excessive in relation to the purposes for which they are collected and/or further processed; »). Now you can only collect personal data as long as it is strictly necessary. This has a major impact on the way information systems are built, in particular for big data, where companies tend to collect as much data as possible whatever its future use will be. If it’s personal data, GDPR ends that.
C – A practical example of data minimisation
Let’s take the practical example of a newsletter. If we apply the rules set by GDPR, we have to ask ourselves: what is the minimal data we need in order to run our newsletter? The answer is pretty straightforward: the email address. That’s it! Anything else should be carefully weighed. We may be able to argue for the first name, because newsletters are sometimes personalised. But even there, we can foresee situations where the first name is too much (ex: the newsletter is about sexual preferences, or any other subject within the scope of article 9; in that case data minimisation should be applied more strictly because of the increased risk that people are identified in a data breach). Other than the first name, no other data point seems to be required for a newsletter, unless a valid reason is put forward (ex: the IP address, if you want to record the user’s consent, though that also raises a lot of other questions).
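To make the newsletter example concrete, here is a minimal Python sketch of what data minimisation could look like at sign-up. The names `Subscriber`, `ALLOWED_FIELDS` and `minimise_signup` are our own assumptions for illustration, not part of any real newsletter tool. The idea is simply that every field not on the allow-list is dropped before anything is stored.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the only fields we can justify for a newsletter.
ALLOWED_FIELDS = {"email", "first_name"}


@dataclass(frozen=True)
class Subscriber:
    """The minimal record we keep for one newsletter subscriber."""
    email: str
    first_name: Optional[str] = None


def minimise_signup(form_data: dict) -> Subscriber:
    """Keep only allow-listed, non-empty fields from a sign-up form."""
    kept = {
        key: value
        for key, value in form_data.items()
        if key in ALLOWED_FIELDS and value
    }
    if "email" not in kept:
        raise ValueError("an email address is required")
    return Subscriber(**kept)
```

A form that also submits a birthday or a phone number would therefore still produce a record holding only the email address and, optionally, the first name.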
So whatever personal data processing a company carries out, care must be taken to ensure that as little personal data as possible is processed; otherwise the controller risks the very heavy fines of article 83.
Now let’s take a look at the operational challenge of implementation for Facebook.
II – How data minimisation will impact Facebook
The least we can say is that Facebook has a real challenge to solve in regard to the principle of minimisation, considering that it collects huge amounts of personal data. But first, let’s take a look at the personal data collected and processed by Facebook (A), before we ask ourselves about its compliance with GDPR (B).
A – What personal data is processed by Facebook ?
Facebook has made it pretty easy to export our own data (this is actually the implementation, or partial implementation, of the right of access to personal data), so let’s take a look at what kind of data is available. This can be done by clicking on settings, then « Download a copy » of one’s account data. It takes some time for Facebook to build the .zip file, but once it’s done Facebook delivers an archive that contains some of the user’s personal data.
Once the .zip is downloaded, it can be extracted so that one can browse one’s own data.
B – Our dataset is incomplete!
The first conclusion we can draw here is that the dataset given to the user is incomplete. To take just one element, most of the personal data related to user profiling has been excluded. It seems obvious that Facebook carries out a lot more personal data processing for advertising (such as the data necessary to build personalised audiences, which probably requires advanced user profiling). A simple way to see that is to turn to the advertising side of the platform. There, advertisers can target super-specific groups of people, such as people who are away from their family.
Or people who recently returned from a trip, or frequent travellers, which likely implies, from a technical perspective, that a user profile for ad targeting has been created at some point.
If that’s the case (we would need to audit the database infrastructure to be sure), Facebook will be obliged to make all that data accessible to the user. In that regard, article 12 requires the controller to provide information by electronic means and to facilitate the exercise of users’ access rights:
1. The controller shall take appropriate measures to provide any information referred to in Articles 13 and 14 and any communication under Articles 15 to 22 and 34 relating to processing to the data subject in a concise, transparent, intelligible and easily accessible form, using clear and plain language, in particular for any information addressed specifically to a child. The information shall be provided in writing, or by other means, including, where appropriate, by electronic means. When requested by the data subject, the information may be provided orally, provided that the identity of the data subject is proven by other means.
2. The controller shall facilitate the exercise of data subject rights under Articles 15 to 22. In the cases referred to in Article 11(2), the controller shall not refuse to act on the request of the data subject for exercising his or her rights under Articles 15 to 22, unless the controller demonstrates that it is not in a position to identify the data subject.
To avoid that, Facebook could argue that the data related to advertising is not actually personal data and therefore need not be included in the communication imposed by articles 12 and 15. Since we don’t have the data here to discuss it (and to be clear, we haven’t asked Facebook for it), it’s not possible to give a definitive answer.
However, when it comes to data privacy, advertisers often live under the illusion that if they hash a user id and put it into « secure compartments », the data will lose its quality of being personal data, preventing GDPR from applying. Unfortunately this logic doesn’t work from a legal perspective, the definition of personal data being pretty clear on that matter (art. 4):
‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;
And Recital 26 settled the question of pseudonymisation by stating that pseudonymised data is still personal data:
Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person. To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.
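To see why hashing alone does not take data out of the scope of GDPR, consider the following Python sketch. It assumes a plain, unsalted SHA-256 hash of a user id; the function names and the ids are hypothetical, and we do not know how any real advertising platform actually stores its identifiers. Anyone holding the list of user ids can recompute the hashes and link a hashed record back to a person, which is the kind of « additional information » the recital has in mind.

```python
import hashlib
from typing import List, Optional


def pseudonymise(user_id: str) -> str:
    """Replace a user id with its SHA-256 hex digest (no salt, no key)."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()


def reidentify(hashed_id: str, known_user_ids: List[str]) -> Optional[str]:
    """Link a hashed id back to a user by re-hashing every known id."""
    lookup = {pseudonymise(uid): uid for uid in known_user_ids}
    return lookup.get(hashed_id)


# Hypothetical ids, for illustration only.
hashed = pseudonymise("user-1002")
print(reidentify(hashed, ["user-1001", "user-1002", "user-1003"]))
```

Because the hash is deterministic, the controller (or anyone else with the id list) can single out the user again, so the hashed record remains information on an identifiable person.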
So we are unable to assess here how much data minimisation could impact Facebook in regard to advertisement, or other items that are not shown in the dataset sent back to us (website logs…).
C – The user’s contacts (probably delete 100%)
So, back to what we can audit: let’s start with the user’s contacts. Here the contacts are not the user’s friends but the list of email addresses imported when a user connected his Facebook account with his email account. These contacts seem to have been collected by Facebook through interconnection with email services. Under GDPR this raises the question of informing the contacts themselves (article 14) as much as that of minimisation.
It seems that all the user’s contacts are kept, apparently with no time limit. If that’s the case, it is contrary to the principle of data minimisation set by art. 5.1.c. It is also contrary to the principle of storage limitation (article 5.1.e: « kept in a form which permits identification of data subjects for no longer than is necessary for the purposes for which the personal data are processed; »). So from a GDPR perspective, if a user connects his email account and sends an invite to all his friends, the controller would have to delete the contact list right after the operation.
So Facebook will likely have to delete all the contacts of all its users to comply with GDPR.
D – The timeline (probably delete 80%)
The next element available is the timeline: every event since the user registered is kept by Facebook.
Here again, the content seems to be kept without any time limit. Let’s examine whether this respects the provisions of article 5.1.c.
The main point a user could make is that keeping this data is neither « relevant » nor « limited to what is necessary » (as required by art. 5.1.c) for the purposes for which it is processed. Indeed, it’s unusual to look back at content that was published 5 or 10 years ago, and Facebook is not built for that purpose. This argument can be strongly defended, since the point of a social network is to keep users engaged with people in the present moment, not in the past. It therefore seems hard to justify keeping these data points.
In response, Facebook could argue that users want to have access to this data (like the memories shown to the user from time to time). And it would be fair to accept that a few major events could be kept for these services. But what about the entire timeline?
The most logical solution to solve that apparent dilemma comes from article 25 which imposes « data protection by design and by default » :
1. Taking into account the state of the art, the cost of implementation and the nature, scope, context and purposes of processing as well as the risks of varying likelihood and severity for rights and freedoms of natural persons posed by the processing, the controller shall, both at the time of the determination of the means for processing and at the time of the processing itself, implement appropriate technical and organisational measures, such as pseudonymisation, which are designed to implement data-protection principles, such as data minimisation, in an effective manner and to integrate the necessary safeguards into the processing in order to meet the requirements of this Regulation and protect the rights of data subjects.
2. The controller shall implement appropriate technical and organisational measures for ensuring that, by default, only personal data which are necessary for each specific purpose of the processing are processed. That obligation applies to the amount of personal data collected, the extent of their processing, the period of their storage and their accessibility. In particular, such measures shall ensure that by default personal data are not made accessible without the individual’s intervention to an indefinite number of natural persons.
So the most logical way to ensure compliance for the timeline would be to let the user choose how long the data should be kept, with a short default such as 3 to 6 months and a maximum limit (ex: 1 year). Past that time limit, only major events could be kept.
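As an illustration of this « by default » logic, here is a short Python sketch. Everything in it is an assumption made for the example: the `TimelineEvent` type, the `is_major` flag, and the figures (a default of about 6 months and a ceiling of 1 year, expressed in days). It only shows how a user’s choice could be capped by a maximum, with a protective default applied when no choice was made.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

DEFAULT_RETENTION = timedelta(days=180)  # assumed default, about 6 months
MAX_RETENTION = timedelta(days=365)      # assumed ceiling, 1 year


@dataclass
class TimelineEvent:
    created_at: datetime
    is_major: bool = False  # hypothetical flag for events worth keeping


def effective_retention(user_choice: Optional[timedelta]) -> timedelta:
    """Use the short default unless the user chose otherwise, capped at the maximum."""
    if user_choice is None:
        return DEFAULT_RETENTION
    return min(user_choice, MAX_RETENTION)


def should_delete(event: TimelineEvent,
                  user_choice: Optional[timedelta],
                  now: datetime) -> bool:
    """An event is deleted once it is older than the effective retention period."""
    if event.is_major:
        return False
    return now - event.created_at > effective_retention(user_choice)
```

With these assumed values, a user who asks for ten years of retention still gets one year, and a user who never touches the setting gets roughly six months.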
E – Photos (probably delete 100% of the metadata and probably delete 50% of the photos)
Photos present another interesting challenge in regard to the data collected. On the one hand, they are a core feature of Facebook and users tend to store photos long term. On the other hand, Facebook stores a lot of metadata related to these photos. Let’s discuss both elements.
For the photos themselves, a solution similar to the one for the timeline seems most appropriate: implement data minimisation through article 25, which imposes data protection by design and by default. In concrete terms, that means letting the user choose how long the photos will be stored, but setting a short time frame by default. 1 to 3 years seems appropriate as a default, with the user allowed to raise it to a maximum of 5 years, after which the photos would be deleted by default (or the user could be asked about storing these images for a longer time frame). Storing Facebook photos for longer time frames poses real privacy problems because a lot of images show people other than the user himself. The privacy of these people should therefore also be taken into consideration in regard to data minimisation. For these reasons, a hard limit should be set at some point after which all the images are deleted in all cases (ex: 7 years seems fair).
Facebook also stores a lot of metadata related to these photos, again apparently with no time limit. For example, for this photo taken in 2014, we can see the IP address, the precise location of the image, data related to the camera settings (f-stop, exposure…), the camera model and the time the image was taken.
Again, to assess compliance with the principle of data minimisation we have to ask ourselves two questions: are these data points strictly necessary for the purposes for which they are processed? And do we absolutely need to keep these data points over time? Both questions invite a negative answer.
Facebook could indeed argue that the service uses features related to this data (a user can see where the picture was taken). However, considering the volume of this data and the privacy risks its storage creates, users could make a strong case that it is not really necessary to operate the service. It is our opinion that this metadata should be deleted entirely.
F – Friends (delete 100% of old requests)
Another interesting data point is the list of friends, which is the core of the Facebook application.
The dataset we downloaded also shows the following:
- Sent Friend Requests
- Received Friend Requests
- Declined Friend Requests
- Removed Friends
- Friend Peer Group
Now to be fair to Facebook, if there’s one thing we really can’t do without on Facebook, it’s Facebook friends. These data points are absolutely necessary for the service to operate. However, there’s more to say about the following:
- Sent Friend Requests: these seem to be kept without any time limit, which seems excessive and not compliant with data minimisation. A period of 2 months seems reasonable, after which the data should be erased.
- Received Friend Requests: same remark as for sent requests.
- Declined Friend Requests: same remark as for the two previous ones.
- Removed Friends: here again there’s no apparent need to keep that data, and it should probably be deleted, maybe after a short period such as 2 months.
- Followees: this one has to be kept for operations as long as the user is following someone.
G – Messages (probably delete 90%)
All the messages sent by the user since the creation of the account are also kept on Facebook, so in my own case we can find messages dating from 2011.
The messages exchanged between users add up to a huge quantity of personal data. The main issue here is that these conversations are kept for as long as the user has an account. Again, this seems clearly contrary to article 5.1.c, since the data collected and processed has to be limited to what is necessary in relation to the purposes for which it is processed. To that extent, it’s reasonable to consider that the purpose of messaging is to exchange live messages, not to archive conversations forever. And it’s probably reasonable to consider that few users really go back to very old conversations (doing so is also very impractical on Facebook). For these reasons a reasonable limit, respectful of data minimisation, could be around one month, after which the data would have to be deleted.
H – Events (probably delete 80%)
Events follow the same logic: we can find the data related to each event recorded since the beginning of the account.
As with the previous elements, we can’t see any really good reason to keep that data in order to provide the services Facebook actually offers. A time frame more respectful of data minimisation could be around 3 months.
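The time limits suggested in sections F to H can be gathered into a single retention schedule. The Python sketch below is only an illustration under our own assumptions: the category names, the record layout (a dict with `category` and `created_at`) and the approximation of a month as 30 days are all hypothetical, and say nothing about how Facebook actually stores this data.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Assumed schedule, taken from the periods proposed in this article.
RETENTION: Dict[str, timedelta] = {
    "friend_request": timedelta(days=60),  # about 2 months
    "removed_friend": timedelta(days=60),  # about 2 months
    "message": timedelta(days=30),         # about 1 month
    "event": timedelta(days=90),           # about 3 months
}


def purge(records: List[dict], now: datetime) -> List[dict]:
    """Return only the records still within their category's time limit.

    Categories without an entry in RETENTION (for example current
    friends or followees) are kept, since the service needs them.
    """
    kept = []
    for record in records:
        limit = RETENTION.get(record["category"])
        if limit is None or now - record["created_at"] <= limit:
            kept.append(record)
    return kept
```

Run periodically, such a job would turn the « time limits for erasure » of Recital 39 into something the controller can actually demonstrate.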
We have barely scratched the surface of the problems posed by the GDPR compliance process, and yet we can already see how complex and challenging it is, in particular for companies like Facebook that collect massive amounts of data.
There has been a lot of discussion about GDPR implementation, but the really challenging aspects of compliance are often forgotten or set aside. As we have seen, data minimisation is one of the real challenges most companies will have to tackle.