You have probably heard of the GDPR, and you may also have heard of big data, often characterised by the three Vs (Volume, Velocity and Variety). The term refers to the huge amount of digital information about individuals that public and private organisations collect, store and analyse for various purposes. In this digital era, where the number of people using digital services and tools is higher than ever before, opportunities abound to collect large amounts of data for statistical purposes and for identifying behavioural patterns.
This data can be used by governments for decision-making in national defence and policy analysis, or by companies to optimize their products and services, for example through targeted advertisements based on individual preferences. Sectors that rely on it include retail, transportation, healthcare, insurance, and media and entertainment, as well as public-sector activities such as medical research and demographic statistics.
The collection, storage, analysis and use of large amounts of data to produce usable outcomes is in conflict with what Article 8(2) of the Charter of Fundamental Rights and the GDPR guarantee individuals. Personal data should be protected, processed fairly for specified purposes and not kept longer than necessary. At the same time, the collection and analysis of huge amounts of data can be useful in many cases. Hence, companies should incorporate security, privacy and technical measures in their internal processes and services right from the start, in order to guarantee data subjects their rights.
Compliant Big Data Collection Under GDPR
Big data aims at collecting as much data as possible in order to analyze it and base decisions on it. The GDPR, on the other hand, requires that only the minimum amount of personal data necessary be used, and only for clear purposes. These protective principles apply to the processing of personal data and are set out in Article 5 of the GDPR. One such principle states that processing must be lawful, fair and transparent in relation to the data subject, i.e. the person whose data is used (Article 5(1)(a) GDPR).
This means that organisations must evaluate whether a given use of personal data is within the reasonable expectations of the data subject concerned, a requirement that clearly contradicts common big data practice. The purpose of the collection should be explained to the data subject through a clear privacy notice that is concise, written in plain language and easily accessible.
In some instances, personal data can be further processed for purposes other than the original one. This is not necessarily incompatible with the GDPR. Compatibility needs to be assessed on a case-by-case basis, considering the relationship between the purposes, the expectations of the data subject at the time of collection, the context and the nature of the data. This is outlined in Opinion 03/2013 on purpose limitation by the Article 29 Working Party (predecessor of the European Data Protection Board).
Furthermore, the purpose of the collection should be specified and lawful, and any further processing should not be incompatible with the original purpose. However, this is hard to achieve when the processing relies on a legal ground such as consent, since consent is only valid if it is given for a known (disclosed) purpose. Problems arise when the intended purpose is not clarified or when personal data is analyzed for unstated reasons.
Do data subjects own their (big) data?
Many organisations assume they are GDPR compliant since they use the personal data lawfully and fairly but forget to either delete unused data or to articulate why they need to collect and process particular datasets. According to Article 5(1)(e), personal data should not be stored longer than necessary. Organisations should therefore set retention periods and implement automatic erasure of the data after the period expires.
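As an illustration of retention periods with automatic erasure, here is a minimal Python sketch. The purposes, the retention periods and the record layout (`purpose`, `collected_at`) are assumptions made for this example and are not prescribed by the GDPR. Real values must come from the organisation's own retention policy.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods per processing purpose (assumed values).
RETENTION = {
    "newsletter": timedelta(days=365),
    "order_history": timedelta(days=365 * 7),
}


def purge_expired(records, now=None):
    """Return only the records whose retention period has not yet expired."""
    now = now or datetime.now(timezone.utc)
    kept = []
    for record in records:
        limit = RETENTION.get(record["purpose"])
        # A record without a documented purpose has no retention period
        # in this sketch, so it is dropped as well.
        if limit is not None and now - record["collected_at"] <= limit:
            kept.append(record)
    return kept
```

A job like this could run on a schedule against each data store. Dropping records with no documented purpose is a design choice of this sketch. It reflects the point above that organisations should be able to say why they hold each dataset.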
Data subjects have the right to access, rectify and erase their personal data, as well as to restrict its processing (Articles 15 to 18 GDPR). This means organisations should be able to dig into the large amount of data stored across several different systems to locate and/or erase the data belonging to a given data subject. Many tools make it easier to categorize data, and metadata management can be used to catalog data assets (e.g. Talend, Apache Atlas, Collibra and AtScale). Analysing data collected on the legal basis of consent is risky, since the data subject can withdraw their consent and ask for erasure. Organisations should therefore consider whether another legal basis, such as legitimate interest, is more appropriate for the processing.
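To make the locate-and-erase requirement concrete, the following Python sketch searches several data stores for one data subject and removes their records. The in-memory `stores` dictionary and the `subject_id` field are hypothetical stand-ins. A real deployment would query actual databases, and it would typically rely on a data catalog to know where personal data lives.

```python
def find_subject_data(stores, subject_id):
    """Map each store name to the records it holds for one data subject."""
    return {
        name: [r for r in records if r.get("subject_id") == subject_id]
        for name, records in stores.items()
    }


def erase_subject_data(stores, subject_id):
    """Remove the subject's records from every store; return how many were removed."""
    removed = 0
    for name, records in stores.items():
        remaining = [r for r in records if r.get("subject_id") != subject_id]
        removed += len(records) - len(remaining)
        stores[name] = remaining  # replace the store's contents in place
    return removed
```

The sketch assumes every system shares a common subject identifier, which is often the hardest part to achieve in practice.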
Generally, organisations circulate data on a global scale, to their customers, partners or subcontractors. Data controllers, the organisations responsible for determining the purposes and means of the processing, must ensure that data is transferred in line with GDPR safeguards and add supplementary measures where needed. On this topic, the transfer of data to the US has been dramatically limited since the CJEU (Court of Justice of the European Union) issued its judgment in the so-called Schrems II case (C-311/18). While official solutions are awaited, the current consensus is that organisations should implement technical measures to supplement those they currently rely on. This can take the form of encrypting data before exporting it from the EU and keeping the encryption keys in the EU.
Can anonymization be the solution?
The GDPR stresses the difference between pseudonymization and anonymization. According to the definition in Article 4(5) of the GDPR, pseudonymization is the processing of personal data in such a way that it can no longer be attributed to a specific data subject without the use of additional information, typically by substituting direct identifiers. Anonymization, on the other hand, refers to the practice of rendering data unidentifiable in such a way that it is impossible to reconstruct the identity of the data subject. Anonymized data falls outside the scope of the GDPR, provided the anonymization is carried out effectively.
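One common way to pseudonymize a direct identifier is a keyed hash. The Python sketch below uses HMAC-SHA256 from the standard library. The function name and the example key are assumptions for illustration only. The secret key plays the role of the "additional information", so it must be stored separately from the data. The output is still personal data under the GDPR, because whoever holds the key can link it back to the individual.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Illustrative usage; in practice the key would come from a key management
# system and never be hard-coded.
key = b"example-key-kept-separately"
token = pseudonymize("jane.doe@example.com", key)
```

The same identifier always maps to the same token under a given key, which keeps datasets joinable for analysis without exposing the identifier itself.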
The Article 29 Working Party stated in its Opinion 05/2014 that organisations need to evaluate the robustness of their anonymization techniques and ensure that they are secure and that re-identifying individuals is not possible. Failure to do so could result in a situation similar to the one in which Netflix released supposedly anonymous movie ratings from around 500,000 individuals, part of which were de-anonymized by cross-referencing them with information from IMDb. Privacy models such as k-anonymity can help an organisation reduce the re-identification risk in the data it releases.
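As a rough illustration of k-anonymity, the Python sketch below computes the k value of a small dataset, i.e. the size of the smallest group of rows that share the same quasi-identifier values. The column names and rows are made up for this example. A dataset is k-anonymous if every such group contains at least k rows.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


# Hypothetical generalised records: age bands and truncated postcodes.
rows = [
    {"age": "30-39", "postcode": "114**", "film": "A"},
    {"age": "30-39", "postcode": "114**", "film": "B"},
    {"age": "40-49", "postcode": "115**", "film": "A"},
    {"age": "40-49", "postcode": "115**", "film": "C"},
]
k = k_anonymity(rows, ["age", "postcode"])
```

A check like this only measures one property. k-anonymity on its own does not protect against every attack, for example when all rows in a group share the same sensitive value, so it is best treated as one input to the robustness evaluation rather than proof of anonymization.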
Machine Learning and Artificial Intelligence
Organisations can use large amounts of data either to make automated decisions or to profile individuals using sophisticated algorithms. This is one of the biggest use cases of big data and is commonly referred to as machine learning or artificial intelligence. It is a double-edged sword: these techniques can be useful in areas like medical research or pollution control, while at the same time they might invasively predict an individual’s likelihood of falling ill and, as a consequence, lead to the refusal of a loan or health insurance.
The data subject should also be able to understand the logic of how the data will be processed and how it will affect them. Article 13(2)(f) of the GDPR gives them the right to receive “meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing”. In addition to this obligation imposed on the data controller, Article 22(3) provides data subjects with the right to obtain human intervention on the part of the controller, to express their point of view and to contest the [automated] decision. The GDPR’s flexibility clauses allow member states to craft further restrictions. This is the case in the French data protection act, which makes it compulsory for data controllers to provide explanations as to how the algorithm works.
In the area of digital advertising and real-time bidding, organisations should not target individuals without making them aware that they are being subjected to tailored advertising. The analysis of data should not be used to manipulate individuals via political messages or messages tailored to their personality. The algorithms behind the processing should not have a discriminatory effect. It has frequently been argued that such effects occur in decisions that involve processing personal data such as an individual’s residential locality (postcode), gender, sexual orientation or race. Organisations often unknowingly collect or process sensitive data, which is regulated by Article 9 of the GDPR, exposing themselves to compliance risks.
Big Data and the GDPR
Big data is important for organisations of any kind to analyse their data assets and improve their processes and products. But returning control to the data subject, as made possible by the GDPR, means that organisations now face different problems when collecting and analysing data. While processing data in a compliant manner comes with design challenges, it positively impacts data subjects’ confidence in the organisation they entrust their data to. Data subjects who trust an organisation are more likely to give their consent.
At the end of the day, big data is part of our lives in this digital age. It is useful for many practical applications and can lead to great development of both organisations and countries. However, if misused, it can lead to a general distrust amongst the public and can have a detrimental effect.
Jeren Agh graduated with a Master of Laws degree from Stockholm University. In her master's thesis, she explored the effects of the CLOUD Act, its collision with the GDPR and its impact on the Privacy Shield.