Wednesday, 14 August 2019

Good news: The AI won't kill you; Bad news: it won't solve your problems either.



AIs won't take over the world because they won't be able to work out how to do it. But they won't solve our big problems either. The hype machine behind current AI investment is founded on an unwarranted extrapolation from recent AI successes that doesn't apply to most real world problems. We need to learn where AI can work and where it can't or we will waste money on systems that can't possibly work. 

SciFi and scientific speculation about Artificial Intelligence (AI) have ignored a very fundamental limitation in how AIs work. This limitation has surprising parallels with why some of the less pleasant political philosophies the world has seen are doomed to fail. But this limitation has been–as far as I can tell–largely missed by serious speculators and fiction writers.

And the failure to understand the limitations of AI has consequences. On one hand there is a fair amount of worry about the dangers. On the other there is a dangerous and naive belief in their ability to solve many difficult problems. If I am right, neither view is justified.

I started thinking about this while reading Tom Chivers' book The AI Does Not Hate You (a really good read that covers a lot of the background and a lot of thinking from the community of nerds who have devoted a lot of time worrying about the consequences of AI). But I was also reading some Karl Popper books I skipped as a student (The Poverty of Historicism and The Open Society and its Enemies). And–somewhat amazingly, since Popper wrote these before AI had ever been invented–it turned out his ideas had some current relevance.

Let me explain.

An introduction: AIs in fiction
There are plenty of SciFi stories involving malevolent AIs. The year 1970 saw the movie Colossus: The Forbin Project (where the US basically creates SkyNet 27 years early to run nuclear defence only to find the computer able to outsmart the government and its creators). The Terminator added time travel to a similar scenario (there is nothing new under the sun). The Matrix added virtual reality simulations to a related idea.

What happened in the worlds created by Isaac Asimov is revealing. His Robot stories created robots with positronic brains that acted at least as intelligently as people. His unconnected series The Foundation Trilogy didn't have any AIs but did have a human psychohistorian who had worked out the laws determining the future path of human history. Asimov's Hari Seldon used "calculators" to solve his equations. But the underlying idea was that, given enough data, calculations could be done that would predict the future of society and, when required, steer it to a better outcome. This idea is very important even though it didn't, when originally written, involve a superintelligent AI.

Later Asimov (very ill-advisedly) retconned The Foundation stories to make them part of the same universe as the Robot stories and make a powerful AI the driver for all the messing with human history. In this case the AI was benevolent. But the same assumptions apply in his optimistic world as they do in the dystopian stories where the AI wants to kill us all or where it accidentally kills us all in pursuit of the goal of making more paperclips.

The fear expressed in many of the stories of rogue AIs (mirrored by the optimism in Asimov) is that, once we have created computers that can process information faster than we can, we will lose control and they will be able to dominate us (in Asimov for beneficial ends; in many other stories, for perverse or malicious ends). 

Why the assumptions are wrong
The underlying assumption in both the optimistic stories and the dystopian ones is that people are both not entirely rational and have a finite capacity for reasoning that can easily be exceeded once a critical mass of computing power can be assembled.

There are two parts to this belief that we can be outdone by a sufficiently powerful computer. One is the belief that rational thought unencumbered with emotion will produce better solutions to problems than human thinking. The other is that sufficiently powerful computers can become powerful enough to outmaneuver people in managing the world (either leading to a utopian paradise or a dystopian slave state depending on the orientation and goals of the AI).

The idea that better solutions will emerge if we purge the world of emotion is attractive to many naive scientists. But it is easy to refute because of the way the human brain works. The brain is, to some extent, compartmentalised with different physical areas dealing with different roles. Vision is controlled by different parts of the brain than smell or hearing. And human emotions, not to oversimplify, can be partly localised in specific parts of the brain. When these are damaged we can see what sort of person emerges. The famous case of Phineas Gage is often quoted as the archetype of what a person is like when their emotions are removed (he suffered a very specific brain injury in an industrial accident). But the emotionless person turns out not to be a super rational problem solver freed from the misleading siren calls of conflicting emotional drives. The emotionless person that emerged after Phineas Gage's injury turned out to be a complete mess incapable of making basic decisions or getting anything done. He was heavily studied, as have been other people with similar injuries. He turned out more like Buridan's Ass than a super rational thinker. It turns out that emotions are a vital part of human thinking and decision making and not a burden that gets in the way of logic as the Vulcans would have us believe. (Though the details of his recovery are often omitted and the case is more complex than often supposed and Spock is not entirely emotionless–don't @ me).

AI researcher David Gelernter wrote a whole book explaining that an AI can't hope to emulate human intelligence without incorporating emotions.

But the other part of the problem is more significant for understanding why AIs won't work the way SciFi worries about. 

Many thinkers have speculated that, when AI gets to a certain threshold of power, its ability to learn grows exponentially and it rapidly outgrows its designers. Hence the worry about taking over the human world. And this possibility has seemed far more plausible since the recent success of AI in surpassing human players at Chess, Go and (apparently) even poker.

We have known that computers could beat us at some board games for a long time. Chequers was basically solved by computers decades ago. But analysts thought for a long time that more complex games would resist the ability of computers to conquer them. Chess is far more complex than chequers (with more possible games than atoms in the known universe–see this Numberphile video for an explanation of those estimates) and some thought that only humans could think well enough to play the game effectively. Then an IBM creation beat the best player in the world (though the original Deep Blue was not an AI, just a very powerful chess computer using expert algorithms and a lot of computing power). Chess was hard because the search-space of possible moves was too large for any brute force searching to find the best possible moves regardless of computer power. Then, very recently, AlphaZero–a learning AI–was taught the rules of chess (but not chess-playing algorithms fine tuned by people to simplify the job of searching all the combinations of moves). And, in a short period, it became the strongest chess playing system in the world, just by playing itself and learning what worked. For a short time many predicted that Go would be resistant to that approach as it has another order of magnitude or two more combinatorial complexity than chess. But it, too, fell to AlphaGo, a similar AI trained with the rules of Go.

Interestingly in both cases it turns out that an AI has found new strategies that people have not thought of and plays the game in new ways that are very effective.

These triumphs were hailed as illustrating the coming AI singularity when a general AI will do the same exponential learning trick and take over the world. Go and Chess were conquered by an AI that improved at an exponential rate once it learned the basic rules of the games.

The reason why this was, to some, unexpected was because their thinking was based on the combinatorial complexity of the games. The barrier to computer progress was, they speculated, the fact that there are too many possible games to enumerate explicitly and therefore only some different sort of intelligence that could think strategically could win. This, some argued, was what distinguished human thinking.

This is bollocks.

There are only vague analogies between the real world and Chess or Go. While they have been thought to have similarities to warfare and can train people in some of the strategy of war, this is mostly untrue. Real warfare is not like playing Go: it is more like playing Go when you don't know how big the board is, have only a vague and often mistaken idea where your opponent's pieces are, often don't know where your own pieces are, and have to make decisions while being jabbed in the face with a broken wine bottle.

The real distinction between the world of human society and the world of chess (but also between many finite problems in the world of people) is not about combinatorial complexity. The real world is, usually, a lot more complicated and reflexive than Chess or Go. It is more uncertain. It has rules that are less clear or that may not even exist. It has more than two players. And it has, sometimes, no clear way of working out who won or even whether winning has any meaning.

This isn't just extra combinatorial complexity. If Go or chess were played on a bigger board with more complex rules they would add more complexity and that complexity would grow very quickly. But they would be just as susceptible to an AI learning how to play them (it might need another generation of processor or bigger memory chips but the same method would surely eventually yield a successful computer player.)

Not in the real world and not for many typical human problems.

To see why, consider how the all-conquering game AIs have learned how to play their games of Go and Chess. The games have a finite set of rules that define a valid game or sequence of moves. AIs have an unambiguous way of calculating the score at the end of the game and no doubt about which side won the game. AIs can play games and observe the outcomes. Even if they start with random moves they can learn by observing the outcomes and, therefore, what patterns usually yield victory. They can improve their recognition of those winning patterns by incorporating them in future play and playing further games. They can create new games as fast as they can compute their moves and every game adds to their knowledge of which patterns yield more victories. Learning AIs are, at their heart, pattern recognition engines that can process vast amounts of data. The limitation of their learning in finite games is how fast they can generate more possible games to learn from. Hence, the more computer power the faster they learn. And once they start learning the growth in expertise is exponential until they have exhausted their storage and memory capacity.
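
To make that learning loop concrete, here is a toy sketch in Python. It is emphatically not how AlphaZero works: it is a simple table of averages learned by self-play on a tiny stone-taking game (remove 1 to 3 stones from a pile, whoever takes the last stone wins). The game, the function names and the parameters are all my own illustrative choices. What it shares with the real thing is the point made above: fixed rules, an unambiguous winner, and as many practice games as the computer can generate.

```python
import random
from collections import defaultdict

ACTIONS = (1, 2, 3)  # stones a player may remove on their turn


def legal(stones):
    """Moves allowed by the rules for a pile of this size."""
    return [a for a in ACTIONS if a <= stones]


def train(games=20000, max_pile=12, epsilon=0.2, seed=0):
    """Learn move values purely from self-play outcomes (hypothetical toy setup)."""
    rng = random.Random(seed)
    value = defaultdict(float)  # (stones, action) -> average result for the mover
    visits = defaultdict(int)
    for _ in range(games):
        stones = rng.randint(1, max_pile)
        history = []  # moves in order; the two players alternate
        while stones > 0:
            moves = legal(stones)
            if rng.random() < epsilon:
                action = rng.choice(moves)  # explore: try a random move
            else:
                # exploit: play the move that has won most often so far
                action = max(moves, key=lambda a: value[(stones, a)])
            history.append((stones, action))
            stones -= action
        # Whoever made the final move took the last stone and won.
        result = 1.0
        for key in reversed(history):
            visits[key] += 1
            value[key] += (result - value[key]) / visits[key]  # running average
            result = -result  # the move before belonged to the other player
    return value


def best_move(value, stones):
    """The move with the best observed average outcome."""
    return max(legal(stones), key=lambda a: value[(stones, a)])


value = train()
print({s: best_move(value, s) for s in (1, 2, 3)})  # {1: 1, 2: 2, 3: 3}
```

With enough games a learner like this should also tend to discover the less obvious trick of leaving the opponent a multiple of four stones, though I have not asserted that in the output above. Every game finishes in a handful of moves and the winner is never in doubt, which is the only reason the loop works.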

But this sort of learning is only possible because the rules are finite and the outcomes certain. While some practical problems are like this (more later), most are not. In the real world both the rules and outcomes may be unclear.

Take, for example, the problem of extending the human life span. Maybe some combination of diet, physical exercise and intellectual challenge and genetic manipulation could extend the human lifespan to a median of 150 years. It isn't impossible. It doesn't violate the laws of biology or physics. What would a benevolent AI have to do to solve that problem? It could suggest a series of genetic interventions to extend the human lifespan. And it could, with some help from us, test those ideas. But it can't learn quickly what works because it has to wait 150 years to find out whether it achieved its goal. And even a full simulation of a functioning human body inside the AI doesn't solve this problem because that simulation can only be built correctly from actual observations about what happens in the real world. The AI can't learn faster than it can observe the outcome of experiments IRL. 

This is worth dwelling on not least because it helps distinguish which tasks might be soluble with a decent learning AI. Problems that are worth tackling will have unambiguous rules or inputs and unambiguous outcomes that can be checked against reality with little uncertainty or ambiguity. So finite games with clear rules are not a problem. Even games involving chance are in this domain even if they involve bluffing, which is why it seems that AIs can play poker. But poker has unambiguous rules and outcomes. The best bluffing strategy is, in principle, learnable even if it involves judging specific characteristics of the opponent (though a good strategy will win more often than a bad one even if played against unknown opponents). There can't be a "perfect" strategy with the amount of randomness involved, but good strategy can be learned.

Consider what sort of medical advances might fall to a learning computer. Some diagnostics (identifying breast cancer from breast scan images, for example) have very clear datasets and fairly clear known outcomes (not 100% certain outcomes, though, as we know from epidemiological analysis that we send far more suspect cancers for surgery than we should). But we can certainly compare a computer driven classification algorithm for suspected cancers to the work of experienced radiologists and we can check their work with other clinical results. But most medicine is not like that. Consider trying to train an AI to do the job of a GP. The inputs are vague and ambiguous (interactions with patients are often rambling and even the clinical history of the patient may be full of errors). The outputs–a diagnosis and a treatment–are hard to test for correctness and may also, frequently, be wrong. Worse still we may lack any good way to check the diagnoses or treatments to test whether they are correct. So, even if we trained an AI to mimic an existing GP we might not be able to tell whether we had created a Harold Shipman or a Florence Nightingale (OK, she was a nurse and a statistician not a GP, but you get the point).

In short, there is little hope of training effective AIs when the inputs are unconstrained, vague and variable; or where the outputs cannot be readily verified to be correct or even unambiguously good. We find this a tough problem even for human doctors.

And medicine is a narrow field of human activity. Life is much bigger and more variable. As is society, where individual lives interact in ways that create exponentially (in the proper mathematical sense) more unpredictable interactions. 

So how can AIs learn how to take over the world? There is no way for them to learn how to do so when there are no patterns to observe and learn from.

This touches on a philosophical debate that has had too little impact on either computer science or politics. Karl Popper called it The Poverty of Historicism. His book was a powerful demolition of the idea that history has predictable, teleological patterns. In particular, he was determined to show that the political philosophies that rely on history flowing towards a specific destination were both wrong and extremely dangerous. Many more people agree with the second part of that than have bothered to understand the first: historicism isn't just dangerous, it is wrong.

He argues:
"My proof consists of showing that no scientific predictor–whether a human scientist or a calculating machine–can possibly predict by scientific methods, its own future results."

And, while I'm simplifying Popper's argument a lot, the basic idea holds. An AI cannot learn how to run the world by recognising patterns and outcomes because there are no consistent patterns to observe. Just as importantly, even if there were such patterns, an AI could not learn them quickly as the implications of an intervention in the world might take a whole human lifespan to become apparent. That's a pretty slow learning loop. 

To put it simply: the apparently magnificent achievements of current AI tools are based on dramatic improvements in the algorithms for pattern recognition. But they probably won't work at all and certainly won't show unconstrained exponential improvement when there are no patterns to observe.

So what?
The implications of this are not all pessimistic. There are problems where AI can give us better solutions than we currently have. Specific problems where there are clear patterns and clear outcomes may well fall to AI (eg interpreting breast scans to identify the early stages of cancer or eye scans to spot incipient eye disease before it gets too bad to fix). But these are a fairly small subset of problems in medicine or life. And many of the big problems in the human world have none of the characteristics that would enable an AI to recognise patterns and solve anything.

It would certainly help if we had a better idea about where AI investment should be directed. If we put effort into problems current AI techniques are likely to be able to solve, we could see some big benefits. But the hype train appears to be overwhelming our judgement. The recently announced NHS decision to spend £250m on AI appears to be driven by exactly the same naive optimism that has invested huge sums into AI research in the past. And failed. Every time. DARPA spent a lot in the 1950s & 1960s but gave up in despair in the 1970s. Japan had a huge related programme in the 1980s followed by the EU and the UK (with the Alvey programme). All these were written off as failures by the mid 1990s. All suffered from overambition and a failure to identify which problems could be tackled given the tools available. There is little sign that the current boom has learned anything from this history.

Overoptimism may be the most important risk. We trust outputs produced by computers even when they do not deserve our trust. As Meredith Broussard argues in Artificial Unintelligence:

"One recurrent idea in this book is that computers are good at some things and very bad at others, and social problems arise from situations in which people misjudge how suitable a computer is for performing the task."

Even when we apply big algorithms we understand (we often don't understand the details behind learning AIs) we suffer from this problem. We trawl big datasets seeking new patterns, for example on the relationship between diet and health. We find new patterns. We believe the new patterns because some clever new Big Data algorithm has found them. But the problem with big data is that many of the "patterns" are noise. The number of apparently significant statistical correlations in large data sets grows faster than the dataset and a lot faster than the true patterns. The vast majority of the "new" patterns we see turn out to be noise when properly tested (see Ioannidis' famous paper Why Most Published Research Findings are False). Or Broussard's point on why the even more hyped science of Big data won't help:

"Here’s an open secret of the big data world: all data is dirty. All of it. Data is made by people going around and counting things or made by sensors that are made by people. In every seemingly orderly column of numbers, there is noise. There is mess. There is incompleteness. This is life. The problem is, dirty data doesn’t compute."
If we automate the search for new patterns with clever computer algorithms or AI we dramatically inhibit our critical faculties in assessing the results. The clever computer said it, so it must be true. Even when the dataset the computer used would be thrown out as irredeemably corrupt by any self-respecting scientist.
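
The point about noise is easy to demonstrate. The sketch below is my own illustration in Python, not taken from Ioannidis or Broussard: it fills a table with pure random numbers, so there are no true patterns at all, then trawls every pair of columns for a "significant" correlation. The cut-off of 1.96 divided by the square root of the number of rows is a rough approximation to the usual 5% significance threshold, and all the names are hypothetical.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def count_spurious(n_vars, n_rows=200, seed=0):
    """Return (pairs tested, pairs that look 'significant') for pure noise."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_vars)]
    threshold = 1.96 / math.sqrt(n_rows)  # rough 5% two-sided cut-off (approximation)
    pairs = list(itertools.combinations(columns, 2))
    hits = sum(1 for a, b in pairs if abs(pearson(a, b)) > threshold)
    return len(pairs), hits


for n_vars in (10, 20, 40):
    pairs, hits = count_spurious(n_vars)
    print(n_vars, "variables:", pairs, "pairs tested,", hits, "look significant")
```

Doubling the number of variables roughly quadruples the number of pairs tested (45, 190, 780 here), and at a 5% threshold something like one pair in twenty should pass by chance alone. Every one of those hits is noise, because noise is all the table contains.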
The biggest risk from AI is not that the AI will try to take over the world. It is that we will lower our natural skepticism and we will trust what the computer says even when it does not deserve our trust. 
We are also at risk of wasting vast amounts seeking magic bullets to solve problems we could solve for less money using known techniques. Many medics in the NHS have pointed out that providing hospitals with computers that don't take 30 minutes to wake up in the morning and which don't require 10 separate logins for a single clinic session might yield more good than £250m of speculative investment in AI. But the AI is headline friendly and gets the cash, while boring improvements in basic IT are just not newsworthy and get no investment.
I've rambled on for far too long already. But my basic conclusion is this: AI can solve some (narrow) problems and, if we are going to spend money, that's where we should direct it. But we should inoculate ourselves against the hype: AI won't solve the big, human problems we have and we should not waste money on programmes that assume it will.
Oh, and the idea that a future AI could take over the world is nonsense. 


Friday, 28 June 2019

How busy are English GPs?

Our GPs are probably grossly overworked. There is plenty of evidence that this is true, but recent data collected by NHS Digital paints an ambiguous picture. Many GPs have reacted to my analysis of that data with incredulity, claiming NHS Digital don't know what they are doing. Or that the data is meaningless and too much of a burden to collect. The real situation is more complex and won't be fixed by NHS Digital alone but requires GPs to pay more attention to how they collect data and why it is collected. And it is critically important that the system does a better job of collecting data about their activity or the case for higher primary care funding will fail as soon as it is examined by the Treasury. GPs, their system providers and NHS Digital all need to work together to create more useful data.


Recently released data poses some interesting questions for GPs


At the start of 2019, NHS Digital released the first public version of a new dataset on GP activity they had been collecting since the end of 2017 (and they are currently updating monthly). This dataset summarizes the number of consultations each day using data collected directly from the (major) providers of GP clinical systems (not all of the minor providers are covered yet). 


The public version of the dataset covers the CCG-level aggregate number of appointments and the status of those appointments in several different categories:

  • The appointment mode (face to face, telephone, visit, video etc.)
  • The appointment status (basically whether the patient turned up for a booked slot or not)
  • The type of staff who saw the patient (essentially whether the person who saw the patient was a GP or someone else)
  • The delay between the booking and when the appointment actually happened


The data coverage is good for most CCGs (only one has no data at all) and there are some big, interesting insights in the data. For example, the number of no-shows ("did not attend" or DNAs) is very strongly related to the length of time a patient has to wait for an appointment (see my analysis here and a discussion in the BMJ here including comments on why the results are different to some previous academic analyses of DNA causes).


But the analysis I did on the number of appointments per day done by GPs caused a much bigger kerfuffle. I want to be fair and present those criticisms alongside the analysis. It is possible that both the analysis and GPs' criticisms of it are right, but, if that is correct, then GPs, their systems providers and NHS Digital need to make some significant changes to how they work.


Before we start, a thought experiment


To put the results in context, it is worth doing a simple thought experiment.


Imagine a GP who does nothing other than see patients. She fills her entire 8-hour working day with 10-minute appointments with no breaks. Let's also assume that she works 48 weeks out of 52 (including bank holidays in the 4 weeks of downtime). And let's assume that each day worked consists of 48 appointments (this is how many fit into an 8-hour day if she takes no breaks between appointments or for lunch and does any other work after hours or at weekends). That means she does 11,520 appointments per year or about 44 per weekday in the year (she doesn't work every weekday because of holidays).


That 44 appointments per day doesn't sound like a sustainable workload. So let's assume a 1hr lunch break and gaps between appointments that bring the average work done down to 5 appointments per hour. Now she does 8,400 appointments per year or about 32 per weekday averaged over the year. That's probably still not sustainable every year but it is a thought experiment and we can make the simplifying assumption that GPs are superhuman. Bear this number in mind when we look at the actual reported workload from the NHS Digital data.
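
The arithmetic behind those two figures is easy to check. Here is a minimal Python sketch, assuming 260 weekdays in a year (52 weeks of 5 days) and a 5-day working week; the function and constant names are mine, purely for illustration.

```python
WEEKS_WORKED = 48                        # 52 weeks minus 4 weeks of downtime
DAYS_PER_WEEK = 5
WEEKDAYS_PER_YEAR = 52 * DAYS_PER_WEEK   # 260, whether worked or not


def yearly_appointments(hours_seeing_patients, appointments_per_hour):
    """Total appointments in a year for the imaginary GP."""
    per_day = hours_seeing_patients * appointments_per_hour
    return per_day * DAYS_PER_WEEK * WEEKS_WORKED


def per_weekday(yearly_total):
    """Average over every weekday in the year, including holidays."""
    return yearly_total / WEEKDAYS_PER_YEAR


no_breaks = yearly_appointments(8, 6)    # 8 hours of back-to-back 10-minute slots
with_breaks = yearly_appointments(7, 5)  # 1hr lunch, then 5 appointments per hour

print(no_breaks, round(per_weekday(no_breaks)))      # 11520 44
print(with_breaks, round(per_weekday(with_breaks)))  # 8400 32
```

Averaging over all 260 weekdays rather than the 240 days actually worked is deliberate: it matches how the NHS Digital figures are averaged later on.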


And don't forget that a real GP also has a ton of paperwork to process plus they need time for professional development, time to train new GPs and other staff and time to manage the practice. So lots of other work to fit into the actual working day alongside the 8-hr shift spent actually doing appointments. And remember, this analysis simply takes the total number of appointments recorded in a CCG divided by the number of weekdays and the number of FTE GPs; it ignores all the other work and all the extra time GPs put in beyond normal hours or at weekends.


What does the NHS Digital data show?


The original claim that prompted me to look more closely at the NHS Digital data was a simple calculation about the number of appointments per GP based on the headline numbers. NHS Digital estimate that about 307m appointments were recorded in the year to April (they adjust the total for missing data and their adjustment seems to differ from mine by about 10m/year but this is not a huge gap given other uncertainties in the data). Dividing their total by the number of working days in the year and by the March estimate of 33,425 FTE GPs in England suggests that the average GP does about 35 appointments per working day. This seems to confirm, when compared to my thought experiment above, that the typical workload is well into unsustainable territory with a side order of "if we keep this up we are all going to die of overwork".


But that isn't what the dataset actually says. GPs have a lot of other staff who handle >40% of the patient contacts according to the data (this is the national average but see the map below for the variation among different CCGs in the % seen by non-GPs).




If you take that into account, the data actually says the average GP does something closer to 20 appointments per day. 
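
As a rough check on those headline numbers, here is a small Python sketch. It assumes 260 weekdays in the year and treats the ">40%" share handled by other staff as exactly 40%; both are my simplifications rather than figures from the dataset, and the names are illustrative.

```python
TOTAL_APPOINTMENTS = 307_000_000  # NHS Digital estimate for the year to April
FTE_GPS = 33_425                  # March estimate of FTE GPs in England
WEEKDAYS_PER_YEAR = 260           # assumption: 52 weeks x 5 weekdays
NON_GP_SHARE = 0.40               # assumption: ">40%" rounded down to 40%


def appointments_per_gp_per_day(total, gps, days, gp_share=1.0):
    """Average appointments per FTE GP per weekday."""
    return total * gp_share / days / gps


headline = appointments_per_gp_per_day(
    TOTAL_APPOINTMENTS, FTE_GPS, WEEKDAYS_PER_YEAR
)
adjusted = appointments_per_gp_per_day(
    TOTAL_APPOINTMENTS, FTE_GPS, WEEKDAYS_PER_YEAR, gp_share=1 - NON_GP_SHARE
)

print(round(headline, 1), round(adjusted, 1))  # 35.3 21.2
```

Because the true non-GP share is a little above 40%, the real adjusted figure sits slightly below the 21.2 this sketch prints, which is why I describe it as closer to 20.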


20 appointments/day is the England average. But, usefully, we have the staff census data that tells us how many FTE GPs there are in each CCG and the appointment counts are available for each CCG as well. So I can do a CCG-level analysis and look at the variation across the country (taking into account the number of GPs in the CCG and the % of appointments with a GP). If we generously assume that activity where the staff type for the staff fulfilling the appointment is "unknown" is done by GPs we get the following distribution:




(Note: each block on this chart is a single CCG and the activity data is based on activity done on weekdays in the first 4 months of 2019. Also note that an interactive version of the dataset and many analyses is available here and you can test how varying some of my assumptions changes the results.)


In about 8 CCGs the GPs appear to be seeing an average of about 26-30 patients per day. The commonest CCG average is, however, closer to 20/day. Also, don't forget, the results are averaged over at least a month and reflect the average activity on all available days, not just the busy days where all the GP time is devoted to seeing patients.


Don't forget, though, that this is the average across a whole CCG: the variation among practices within each CCG is probably large so there will probably be some practices where the GPs do >30/day even when the CCG average is just 20.


Some GPs were indignant when I put an early version of this chart on Twitter. The typical response was "20 a day! I do far more work than that in a typical day. NHS Digital don't know what they are doing. The data is obviously rubbish…". I'm not entirely convinced that all the criticisms are fair, but it is worth looking at some of them to see if there are lessons to be learned.


Possible problems with the data


This just doesn't reflect my typical day
This could be because the data is just wrong, but we will come to that possibility later.


Another possible explanation is that what GPs remember about how busy they are isn't an accurate reflection of the average number of appointments across the whole month or year. If, for example, what GPs remember are the days when they are focussed on seeing patients but they also devote other days to practice management, paperwork, training or some other work that doesn't involve booked appointments, then their memory won't match the averages reported here: the results will reflect the total appointments divided by the total available time for all work not just the typical day when the GP is dedicated to seeing patients.


This work (and my initial thought experiment) assumes a 40hr week. Many GPs spend a lot more than 8hrs a day working (perhaps doing all the admin in their unpaid overtime). This dataset is not recording that work or adjusting the available hours to account for it. This doesn't impact the results but it may make the perception of the results much less clear than it should be.


The data clearly is missing a lot of the work we do
Part of this is a perception problem as discussed above. But it could clearly be true that a lot of activity is not correctly recorded. 


But the blame here isn't on NHS Digital. The data is directly collected from the major clinical systems used in practices to record activity. If the numbers are a big understatement of the appointments being booked, then practices have a big problem not just NHS Digital. I suspect we are not understating the true number of appointments by a large margin.


But we could be understating the amount of activity. Many patient contacts don't result in a booked face-to-face appointment. GPs do some work online or by telephone. Clinical systems are not uniformly good at recording that activity as I know from comparing incoming requests via askmyGP (an online tool that manages all incoming patient requests and helps GPs to completely change their workflow when responding to those requests in ways that improve the speed of responses to patients and lower the degree of overwork for GPs). Successful online tools (and some phone-triage approaches) often result in only 1 in 3 requests leading to a face-to-face appointment. This reduces the number of appointments recorded in the clinical system while providing satisfactory responses to the other patients. But much of the activity for those other responses may never appear in the clinical systems. As GPs adopt new ways of working their clinical systems may become a lot less good at measuring their activity.


On the other hand, the NHS Digital data does record a lot of phone activity. Between 10 and 15% of all appointments are recorded as taking place online, by phone or by video. We know that recording of this is patchy, but so is uptake of alternative ways of responding to patient demand. Some GPs still think that face-to-face appointments are the only response despite mounting evidence that offering a wider range of alternatives can reduce GP workload and make patients happier because they get a faster response.


We have to admit, though, that traditional GP systems have not caught up with this change in practice and won't record the shift in activity reliably.

NHS Digital will draw all the wrong conclusions from these results
I got the impression from some comments on my initial analysis that, not to exaggerate too much, central NHS bodies are just a bunch of malingering bureaucrats who exist to malign hard-working GPs by feeding the NHS leadership false, irrelevant data about what they do. And that any suggestion that GPs should fix this by spending more time providing accurate data would be yet another straw added to the already broken camel's back that is the typical GP's unsustainable workload (or some such mixed metaphor for even more overwork).


But NHS Digital's motivation is not to burden GPs with more work or to undermine the case for more GPs. It is to understand what GPs are doing in order to develop better policy, and that can't be done without better data. Originally the idea was focussed on developing a better understanding of winter pressures, but it should also be useful for bolstering the case for more investment in primary care or for promoting better working processes.


But imagine the conversation the NHS would have with the Treasury when bidding for the money to recruit another 5,000 GPs:


NHS "We need another 5,000 GPs."
Treasury "How much will that cost?"
NHS "A couple of billion pounds a year."
Treasury "That sounds like a lot. Why do you need it?"
NHS "They keep telling us they are overworked."
Treasury "That's what everyone says. How much work do they do? Show me some numbers."
NHS "They say they are doing unsustainable numbers of appointments a day."
Treasury "How many is that?"
NHS "We don't know exactly, but more than 30/day doesn't sound sustainable."
Treasury "So how many are they actually doing?"
NHS "No idea. But I'm sure it is a lot. We really need the money to recruit more GPs."
Treasury "Sure. You can have it when we cash the £350m/week we will save from Brexit."
NHS "Really?"
Treasury "No."
Treasury "We were joking. Sod off and don't come back until you have some actual evidence."


OK, I'm parodying the case for more money for GPs, and there are other hard sources of evidence that say they are overworked. But if the NHS as a whole doesn't understand what, in detail, GPs are actually doing, any case for reform or extra cash is going to look weak.


There are more reasons than this to have reliable data. GPs need it to understand their own work or they won't have any idea how to do a better job. While the NHS Digital appointment data doesn't capture everything they do, it is at least a start, providing a better understanding of a large part of their activity. And the main limitations that prevent the data being more useful are not NHS Digital's fault but stem from the system providers and the choices made by GP practices.


The data isn't consistent or comprehensive
GP clinical systems are built around a model of GP work that assumes everything revolves around a pre-booked face-to-face appointment. GPs can vary how they label the slots for activity in the system, and they don't always assume every appointment is 10 minutes long (though their royal college doesn't seem to have caught up with this realisation). Systems usually allow the exact time taken for each appointment to be recorded (assuming GPs press the right buttons, which they don't always do). Some systems make it easy to record activity in open-access clinics. But systems tend to assume every appointment slot is the same length, and some are completely, irretrievably hopeless at measuring the time taken for consultations done by phone.


One of the biggest problems faced by NHS Digital is the number of different descriptions given as labels for each slot. Practices are, essentially, free to label each slot however they want, and one estimate says that there are tens of thousands of different labels in use across practices in England. So when NHS Digital try to acquire data from everyone, it is a gargantuan task to correctly identify the different types of appointment and group them together in ways that make sense. Was that a pre-booked face-to-face appointment? Or an ad hoc, on-the-day face-to-face appointment as part of a walk-in clinic? Or a slot where a nurse did flu jabs?
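To make the scale of that grouping task concrete, here is a minimal sketch of the sort of rule-based mapping it implies. It is illustrative only: the category names, the keyword patterns and the sample labels are all assumptions invented for this example, not NHS Digital's actual method or real practice data.

```python
import re
from collections import Counter

# Hypothetical rules: categories and patterns are illustrative assumptions,
# not NHS Digital's real mapping. Order matters: the first match wins.
RULES = [
    ("flu clinic", re.compile(r"\bflu\b")),
    ("walk-in", re.compile(r"walk[\s-]?in|on the day|same day")),
    ("telephone", re.compile(r"\b(tel|phone|telephone)\b")),
    ("face-to-face", re.compile(r"f[\s-]?(2|to)[\s-]?f|face[\s-]?to[\s-]?face")),
]


def categorise(label: str) -> str:
    """Map one free-text slot label to a coarse category."""
    text = label.strip().lower()
    for category, pattern in RULES:
        if pattern.search(text):
            return category
    # Anything the rules don't recognise needs manual review.
    return "unmapped"


# Made-up sample labels standing in for the free text practices type.
SAMPLE_LABELS = [
    "F2F pre-booked",
    "On the day f-to-f",
    "Flu jabs - nurse",
    "Tel consult",
    "Admin catch-up",
]

counts = Counter(categorise(label) for label in SAMPLE_LABELS)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

Anything the rules miss lands in "unmapped". With tens of thousands of labels in use, that bucket would need a lot of manual review, which is why more consistent labelling at source matters so much.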


And, as more activity is initiated online and GPs choose to triage patient requests before booking appointments, there are more varied ways for GPs to respond to patients. And the systems, designed as they are around the idea that most activity involves pre-booked slots, don't record all those interactions. In practices using online tools like askmyGP, only around ⅓ of requests result in a face-to-face appointment (e-Consult, another online tool provider, estimates a similar ratio). This means, potentially, that the majority of GP activity won't be recorded in the traditional clinical systems.
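A rough back-of-envelope calculation shows what that ratio could mean. The sketch below assumes, as a worst case, that only requests ending in a face-to-face appointment get recorded; in reality some phone and online activity is recorded too. The figure of 300 recorded appointments is invented purely for illustration.

```python
from fractions import Fraction


def estimate_requests(recorded_appointments: int,
                      appointment_share: Fraction) -> tuple[int, int]:
    """Return (estimated total requests, requests with no recorded appointment).

    Worst-case assumption: only requests that end in a face-to-face
    appointment show up in the clinical system.
    """
    if not 0 < appointment_share <= 1:
        raise ValueError("appointment_share must be in (0, 1]")
    total = round(recorded_appointments / appointment_share)
    return total, total - recorded_appointments


# Hypothetical practice: 300 recorded appointments, 1 in 3 requests booked.
total, unrecorded = estimate_requests(300, Fraction(1, 3))
print(f"Estimated requests: {total}")
print(f"Handled without a recorded appointment: {unrecorded} "
      f"({unrecorded / total:.0%})")
```

On these assumptions, the recorded appointments would be a third of the practice's responses to patients, and the rest would be invisible to anyone looking only at the appointment book.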


So it may well be true that what NHS Digital are collecting is not a reliable guide to what activity GPs are doing. But the fault can't be fixed by them: if we want better data the problem has to be addressed by the practices (who need more consistency in how they use their systems) and the suppliers (who need to design their systems to break the model that everything neatly fits into 10- or 15-minute slots).


Some suggestions about how to do a better job


It is not at all helpful to complain that the NHS Digital GP data is rubbish. If we want to do a better job in primary care (to improve how GPs work and make an unanswerable case for more investment) we need better data. 


An article in the Economist describes some improvements in primary care and how they were done:


When the four practices serving St Austell merged in 2015, it was an opportunity to reconsider how they did things. The GPs kept a diary, noting precisely what they got up to during the day. It turned out that lots could be done by others: administrators could take care of some communication with hospitals, physios could see people with bad backs and psychiatric nurses those with anxiety. So now they do. Only patients with the most complicated or urgent problems make it to a doctor.


The key point here is that only when they had reliable data about their activity could the GPs redesign the work they did in ways that would improve it.


As a recent Nuffield Trust comment on the GP dataset says:


In hospitals, routine datasets such as Hospital Episode Statistics (HES) give us insight into who gets treated, for what, by whom, and where. This basic knowledge of the activities and performance of hospitals is now essential for making evidence-based policy changes, monitoring existing policies and providing fundamental administrative data to run the system and allocate funds. There is a clear need for something similar in general practice…


But, they continue:


NHS Digital makes it clear that what has been released so far still needs improvement. As it stands, there are a number of problems with the data and these limit the extent to which we can use it as a resource for evidence-based policy. 

Having different practice IT systems gives GPs flexibility over the way they organise their work. But the absence of reporting standards in general practice means information is recorded differently in each system, making it difficult to combine the data, and information gets lost.


In the early 2000s, hospital activity data (HES) was in a similar state of disorder. Doctors took little interest in making sure that the amount of activity recorded was correct or the clinical details correctly coded. They too argued that this was just an unnecessary bureaucratic task required purely for "feeding the machine" (even though the most basic metric of hospital success–not killing your patients too often–requires accurate coding).


But HES got a lot better. One way it got better was by showing medics what HES said about the work they did. The Royal College of Physicians ran some interesting experiments starting in 2002 (reported here and here). They took HES data and fed it back to the clinicians whose activity it recorded. In one of the reports on this exercise they argue this:


A vicious circle ensues: routinely collected data is perceived as being of poor quality and unable to support the needs of the individual. Individual clinicians avoid the use of such readily available information... Centrally held datasets remain unchanged through neglect, clinicians failing to engage with the information process in their trusts and remaining ill at ease with the records of activity which result. It is clear that if this cycle is to be broken, steps must be taken to engage clinicians at a level whereby the information is made readily available, accessible in format and of use to clinical practice. By examining routine data from a clinical perspective and feeding issues of quality back to trust information departments the cycle can be reversed.


Hospital clinicians being shown their own data in a digestible form was a big part of the drive that improved the quality and utility of HES data. GPs and NHS Digital need to do something similar.


Perhaps NHS Digital could take the first step by making practice-level data from the central collection available to practices in a useful, comprehensible format so GPs can see what the data says they did. This would start to drive improvement in what is recorded.


Conclusion


In short, here are some simple ways for the different players in primary care to change what they do in ways that would promote real improvement for GPs and their patients.


NHS Digital should make a friendly, easy-to-comprehend version of their dataset available to every individual practice so GPs can compare what they think they did with what NHS Digital's extract has recorded.


GP system providers should make their systems more flexible in recording non-appointment activity and should help GPs to be more consistent in how they record and label their activity.


GPs and practices should not just dismiss the data as being unreliable and useless. They should strive to understand it and seek ways to work with NHS Digital to make the data useful for local purposes and more reliable for policy making.


No one should dismiss data they don't like. Everyone should engage with it and seek to make it both more reliable and more useful.