On data protection and anonymization

The Voices of the 20th Century Archive specifically collects materials from interview research to make them available to registered users. The level of access is determined on the one hand by the restrictions set according to the intentions of the depositing researchers, and on the other hand by data protection considerations arising from the nature of the deposited documents themselves. The RDC is responsible for developing and implementing a system of conditions that meets both donor requests and privacy principles.

We need to take into account several issues to decide what data protection rules and considerations apply to a collection or to individual documents and interviews within a collection. The guiding principles are obviously provided by the general regulations, such as the GDPR, as well as our internal institutional provisions implementing European and domestic regulations. At the same time, we need to examine the specific risks involved in different data protection settings on a case-by-case basis. The main goal is to protect the interviewee and other persons mentioned in the interview by ensuring that their identities are not revealed. In order to do this, a number of circumstances should be reviewed to determine the appropriate means and the desired level of protection.

It is important to consider the topic of the research, which can be seemingly neutral or noticeably sensitive from a data protection perspective. For example, Ferenc Erős and András Kovács conducted life-course interviews with second-generation Holocaust survivors in their study of Jewish identity. It is therefore known from the outset that the interviewees belong to a historically persecuted minority. Although they are identified not by name but by a code in the document title, they mention a number of facts in the interviews from which their personal identity can be deduced. They also talk about other people, often likewise of Jewish origin, whose identities must be made unrecognizable as well.

One also has to determine whether the person mentioned in a recorded conversation is an ordinary person or a public figure, such as a politician, a famous scientist, or an artist. While an ordinary person should be protected, the archivist usually does not need to take precautions in the case of a public figure. On a technical level, to make ordinary people unidentifiable, it is enough to modify a few details (their name, place of residence, workplace, the name of their profession, etc.), so that they cannot be identified on the basis of either a single piece or a combination of their personal data. A well-known person, by contrast, can sometimes be identified from a story that has received some publicity.

Another factor to consider is the time of data collection. The Erős research took place in the 1980s, so the respondents are likely still alive, and their personal rights must be protected. Moreover, since the researchers used the so-called snowball method of sampling (that is, new participants were recruited from the interviewees' circle of acquaintances), the disclosure of the identity of one informant could result in the “unmasking” of others.

The RDC staff have developed an anonymization method that allows the data to be redacted in the interview texts with minimal loss of information. When coding the proper names and other text items that could identify the protected persons in the interviews, we provide descriptions in brackets to fill the gap and help interpret the stories. For example, in the case of geographical names or occupations, we provide categories one taxonomic level higher, i.e. we replace specific data (through which someone could be recognized) with more general descriptions. Another means of reducing information loss is to record the deleted passages, together with the codes that replace them in the text, in a table. This allows us to apply the same code to identical names occurring in various documents within the same collection, so that the relationships that are important for interpreting the interviews are preserved. Another advantage of maintaining a code table is that any coded passage can be easily decoded and the original text found if needed.
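
The code-table approach described above can be illustrated with a minimal sketch. It is not the RDC's actual tooling: the class name CodeTable, the code format (category plus running number), the bracket layout, and the sample name and description are all hypothetical and chosen for illustration only. The sketch shows how one shared table yields the same code for the same name across documents and lets a coded passage be traced back to the original.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CodeTable:
    """Hypothetical collection-wide table: original passage -> (code, description)."""
    entries: dict = field(default_factory=dict)
    counters: dict = field(default_factory=dict)  # last number used per category

    def code_for(self, original, category, description):
        # Reuse the existing code if this passage was already coded,
        # so identical names get identical codes in every document.
        if original not in self.entries:
            number = self.counters.get(category, 0) + 1
            self.counters[category] = number
            self.entries[original] = (f"{category}{number:02d}", description)
        return self.entries[original]

    def redact(self, text, original, category, description):
        # Replace whole-word occurrences with "[CODE: more general description]".
        code, desc = self.code_for(original, category, description)
        pattern = r"\b" + re.escape(original) + r"\b"
        return re.sub(pattern, lambda match: f"[{code}: {desc}]", text)

    def decode(self, text):
        # Restore the original passages from the table when access is justified.
        for original, (code, desc) in self.entries.items():
            text = text.replace(f"[{code}: {desc}]", original)
        return text


# Invented example data; "Anna Kis" is not a real interviewee.
table = CodeTable()
doc1 = table.redact("Anna Kis moved to Győr.", "Anna Kis", "PERSON", "interviewee's cousin")
doc1 = table.redact(doc1, "Győr", "PLACE", "city in western Hungary")
doc2 = table.redact("I met Anna Kis again.", "Anna Kis", "PERSON", "interviewee's cousin")
print(doc1)
print(doc2)
```

Because both documents are processed with the same table, the person receives the same code in each, which preserves the relationship between the two stories while the table itself can be stored separately under stricter access conditions.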

Our basic goal is to code only the necessary minimum. That is, we do not automatically eliminate all proper names or other specific details related to individuals from the interview texts. Likewise, we replace with approximate time designations only those dates that could easily identify a person, for example, the dates of biographical facts recorded in personal documents. We therefore anonymize data only if they, alone or in combination, are likely to contribute to the recognition of the interviewees.
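
As an illustration of replacing an identifying date with an approximate time designation, the following sketch generalizes a year to a part of its decade. The function name approximate_year and the early/mid/late granularity are assumptions made for this example; the source does not specify how coarse the approximation is in practice.

```python
def approximate_year(year: int) -> str:
    """Return a bracketed, approximate designation for a year (hypothetical scheme)."""
    decade = year - year % 10
    position = year % 10
    if position <= 3:
        part = "early"   # years ending in 0-3
    elif position <= 6:
        part = "mid"     # years ending in 4-6
    else:
        part = "late"    # years ending in 7-9
    return f"[{part} {decade}s]"


print(approximate_year(1952))  # [early 1950s]
```

In line with the principle of coding only the necessary minimum, such a function would be applied only to dates judged to be identifying, not to every date in a transcript.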

Our anonymization methods and guidelines were developed within the framework of our internship program, building on previous practices at the RDC. The results were compiled in a comprehensive manual to facilitate the work, which is often carried out by external staff. As a next step, following consultations and collaborations with staff of other European interview archives, we will develop techniques for partially automated, i.e. machine-assisted, anonymization. We expect this development to bring increased consistency and, above all, to enable us to process a significantly larger volume of interview materials than before and to make them available to our audiences as soon as possible.