The pseudo debate about pseudonymised data
Effective medical research and management of the NHS are both at risk because journalists and policy makers can't seem to explain how patients' identities are protected when we use their data.
Every time the topic of pseudonymisation appears in print it seems to be treated as a conspiracy to violate patient confidentiality. Those of us who want to use patient-level information to improve the NHS or to do better research are cast as arch villains threatening to make sensitive patient data public.
The Guardian article on the 100,000 Genomes Project does it again.
The piece alleges that the cack-handed way the Department of Health has explained how the data will be used is part of an unaccountable conspiracy to misuse people’s most sensitive medical data.
But this is nonsense. The root cause of the problem here is a failure of the policy makers and journalists to understand or explain what pseudonymisation is.
And it isn’t that hard. Explaining it is probably easier than spelling it. The claim in the article is that the data associated with the genome project will be pseudonymised and not anonymised, thereby risking disclosure of sensitive personal details. But the thing about pseudonymisation is that it is the most effective way to make data useful while protecting confidentiality.
The thing about anonymous data is that it is pretty useless. If we have two anonymous datasets, one about what diseases patients suffer from and another about their genetic makeup, we can do almost nothing useful with them. We can’t, for example, tell that patients with a particular marker in their genes are more likely to get breast cancer, because we can’t link the records in one dataset to the matching records in the other. The same goes for NHS activity data: if we can’t join up the datasets that record patients’ interactions with the NHS to the datasets recording whether they die early, we have no way of finding out which NHS activities keep them alive for longer.
The way we can link the datasets to derive useful insights without risking patients’ privacy is pseudonymisation. Researchers and NHS managers don’t need to know the identities of the patients, just whether the records from different datasets are referring to the same patient. This is done by replacing the more identifiable parts of the data (like date of birth and postcode) with long random-looking numbers in a way that guarantees the same number is generated for the same patient in the different databases. This provides strong protection for the identity of the patients without making the data useless when looking for patterns.
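To make that concrete, here is a minimal sketch in Python of one way such a scheme could work. It is an illustration, not a description of how the NHS or the genome project actually does it. It assumes a keyed hash (HMAC-SHA256) with a secret key held by whoever does the pseudonymising; the key, the field names and the sample records are all made up. It also simplifies by treating date of birth plus postcode as enough to pick out one patient, which a real scheme could not rely on.

```python
import hashlib
import hmac

# Hypothetical secret key. In this sketch it is assumed to be held only by
# the body that pseudonymises the data, never by the analysts.
SECRET_KEY = b"example-key-not-for-real-use"

IDENTIFYING_FIELDS = ("date_of_birth", "postcode")


def pseudonymise(date_of_birth: str, postcode: str, key: bytes = SECRET_KEY) -> str:
    """Turn identifying details into a long random-looking string.

    The same patient details always give the same string, so records can
    still be matched across datasets.
    """
    # Normalise first so "LS1 4AP" and "ls14ap" produce the same pseudonym.
    normalised = f"{date_of_birth.strip()}|{postcode.replace(' ', '').upper()}"
    return hmac.new(key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


def strip_identifiers(record: dict) -> dict:
    """Replace the identifying fields of a record with a pseudonym."""
    pseudonym = pseudonymise(record["date_of_birth"], record["postcode"])
    rest = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    return {"pseudonym": pseudonym, **rest}


# Made-up sample records, for illustration only.
genetic_records = [
    {"date_of_birth": "1970-03-14", "postcode": "LS1 4AP", "marker": "variant-A"},
    {"date_of_birth": "1982-11-02", "postcode": "M1 1AE", "marker": "none"},
]
diagnosis_records = [
    {"date_of_birth": "1970-03-14", "postcode": "ls14ap", "diagnosis": "breast cancer"},
]

# Each dataset is pseudonymised separately, before any analyst sees it.
genetic = [strip_identifiers(r) for r in genetic_records]
diagnoses = [strip_identifiers(r) for r in diagnosis_records]

# The analyst links the two datasets on the pseudonym alone.
diagnosis_by_pseudonym = {r["pseudonym"]: r["diagnosis"] for r in diagnoses}
linked = [
    {**r, "diagnosis": diagnosis_by_pseudonym.get(r["pseudonym"])}
    for r in genetic
]

for row in linked:
    print(row["pseudonym"][:12], row["marker"], row["diagnosis"])
```

The analyst working with `linked` can see that the patient with the genetic marker is the one with the diagnosis, but never sees a date of birth or postcode. Using a keyed hash rather than a plain one is a design choice in this sketch: without the key, nobody can rebuild the pseudonyms simply by hashing every plausible date of birth and postcode.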
For example, if we link records of patients’ interactions with the NHS to records of which patients were subsequently diagnosed with cancer, we can get great insights that lead to improving the way the NHS works and the lives of future patients. See this Macmillan report, for example.
It isn’t perfect. If you already know a lot about a particular individual and are prepared to put a great deal of effort in, it is possible to re-identify people. But this is hard, difficult to do accidentally, and illegal. Besides, there are far easier ways to get confidential information about people which already contains their identity. Most breaches of identity and all the harm from such breaches come from the careless way identifiable information is handled (check the conclusion of the Partridge Report or the Nuffield Council on Bioethics report).
So the furore about the use and handling of pseudo data is a complete red herring. The point is to carefully balance protection of identity against the benefits of using the data for research and management. The alternative is either to throw away the possibility of getting any useful insights or to use identifiable data, which poses a much larger risk to confidentiality.
The paranoia around the use of pseudo data seems to arise because neither journalists nor policy makers have understood this simple point.
There are only three options to resolve the public concern about confidentiality:
- We can allow only anonymous data to be used, which will send a strong reassuring message to the public. Unfortunately this is a false reassurance, since confidentiality breaches are unknown in the pseudo data and all happen from identifiable data handled carelessly by hospitals and GPs. We will also guarantee that the data is useless.
- We can use identifiable data for all analysis, research and improvement. This increases the risk of leaks and imposes an enormous burden of protective bureaucracy on analysts and managers who would rather spend their time searching for benefits.
- Or we can work with pseudonymised data and achieve big benefits with little risk to confidentiality.
By focussing on and exaggerating the risks of pseudonymisation, the debate obscures the key conclusion: if we want our health data to be useful, pseudonymisation is by far the best option. Even patients will agree if anyone bothers to explain it to them.