Do the best health care improvement initiatives generate the worst evidence?

With randomised controlled trials held up as the best way to find out whether a policy is working, Jenny Neuburger looks at a recent hip fracture evaluation that was done differently, and explains that one size does not necessarily fit all.

Blog post

Published: 11/08/2017

In an age of concern about “post-truth” government and unprecedented scientific endeavour, there is keen interest in what counts as proper evidence for public policies.

One argument being promoted within government and academic circles, notably by the Cabinet Office’s Behavioural Insights Team, is that randomised controlled trials (RCTs) are the best way to measure improvement in public services – the ‘gold standard’ of evidence.

Classically, these trials work by randomly allocating similar people to either receive or not receive a particular intervention, and then comparing the difference in results between the two groups. The goal is to be able to isolate the precise impact of the intervention itself from other factors. But are they right for everything?
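The mechanics of random allocation described above can be sketched in a few lines of Python. This is a minimal illustration on simulated data, not an analysis of any real trial: the function name `simulate_trial`, the outcome scale and the assumed effect of 2.0 are all hypothetical.

```python
import random
import statistics


def simulate_trial(n_people=2000, true_effect=2.0, seed=1):
    """Simulate a simple two-arm randomised trial on made-up data.

    Every number here is a hypothetical assumption for illustration.
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n_people):
        # Each person's outcome without the intervention varies for reasons
        # unrelated to it (age, frailty and so on).
        baseline = rng.gauss(50.0, 10.0)
        # A coin toss, not the person's characteristics, decides the group.
        if rng.random() < 0.5:
            treated.append(baseline + true_effect)
        else:
            control.append(baseline)
    # Randomisation balances the other factors on average, so the gap
    # between the group means estimates the effect of the intervention.
    return statistics.mean(treated) - statistics.mean(control)


estimate = simulate_trial()
print(f"Estimated effect: {estimate:.2f} (assumed true effect: 2.00)")
```

Because the coin toss is the only thing that decides who is treated, the estimate should land near the assumed effect, with some chance variation that shrinks as the number of people grows.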

We recently carried out an evaluation of the UK National Hip Fracture Database (NHFD), an NHS initiative to improve the care of people with broken hips – often older people injured after a fall. This was not an RCT, but an observational study describing changes that had already occurred.

The NHFD was designed to measure the quality of care against agreed standards, feed back how teams in different hospitals were doing, and encourage them to improve. The data was collected by local clinical teams and uploaded to a national database.

Using external data, we found that standards of care and mortality improved over the four years after the NHFD’s launch in 2007, and that they improved more over these years than over the four years before. This evidence has been widely used by the international clinical community involved in hip fracture care.
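One simple way to express a before-and-after comparison of this kind is as a change in trend. The sketch below is not necessarily the method used in the evaluation: the function names are hypothetical, and the figures are arbitrary index values invented for illustration, not NHFD results.

```python
def average_annual_change(yearly_values):
    """Average year-on-year change across a series of annual figures."""
    steps = [later - earlier
             for earlier, later in zip(yearly_values, yearly_values[1:])]
    return sum(steps) / len(steps)


def change_in_trend(before, after):
    """How much faster (or slower) an outcome changed after a launch than before it."""
    return average_annual_change(after) - average_annual_change(before)


# Arbitrary index values, invented for illustration only.
# Five annual points give four years of change in each period.
before_launch = [100, 99, 98, 97, 96]
after_launch = [96, 93, 90, 87, 84]

print(change_in_trend(before_launch, after_launch))
```

A negative value means the outcome fell faster after the launch than before it. The calculation describes a change in trend; it says nothing on its own about what caused it.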

But it has also been challenged on the grounds that the evidence is not as strong as it could be. This is true: other policy and demographic changes could account for the improvements observed between 2007 and 2011. It was suggested that the study would have been more credible if hospitals had been chosen randomly, in the form of a ‘cluster’ RCT. In practice, this was not an option since our study was retrospective. But even hypothetically, why would we not have considered such an approach appropriate?

What would we have lost in an RCT?

One issue is that some things simply cannot be randomised. The NHFD programme included national clinical standards published by professional associations representing orthopaedic surgery and geriatric medicine. In these circumstances, it was not possible to prevent doctors from knowing what their professional bodies were doing, or from learning what their colleagues were being told was the best way to treat patients.

In addition, randomisation could have meant losing some important aspects of the programme. The NHFD initiative organised annual meetings in each region for local managers and clinicians. Staff from non-participating hospitals were encouraged to attend. The attendees described improvements to their services and how these had worked. The meetings provided a way to learn from past failings in care, and included talks from clinicians in different specialties so that they could learn from one another. They also prompted an element of competition between clinicians and teams showcasing their improvements.

Programmes that use randomisation can be forced to work very differently. The NHS Blood and Transfusion audit developed an RCT to compare different methods for feeding data back to sites and supporting them to reduce unnecessary transfusions. In the study protocol, the support delivered to each arm was carefully limited. Existing regional transfusion committee meetings were identified as a “potential contamination threat”.

Another issue is that complicated programmes of improvement may not be the same everywhere – and that might be part of their value. One major use of the NHFD was that local teams could try out different ways to meet the national standards. For example, one standard recommended surgery within 48 hours of admission to hospital. A range of local strategies was used to achieve this, such as fast-tracking protocols in A&E, prioritisation of hip fracture patients for surgery on existing lists, and early medical input by doctors specialising in treating older people. Feedback was then provided through online monthly performance trackers and published annual reports comparing all participating units.

By design, there was no single ‘treatment’ to which patients or hospitals could have been randomised. Traditionally, RCTs often involve clinicians themselves being “blind” to which patients are and are not receiving an intervention – for example, they are not made aware whose tablets genuinely contain a new medicine. But in this case, blinding those giving or receiving a specified standard of care – for example, early surgery – to the nature of that care was impossible.

Previous RCTs have met resistance from some of those implementing the changes, because of the need to restrict and standardise initiatives, even when not everything is required to be the same.


In short, core elements of the programme would have been sacrificed in order to randomise the NHFD by hospital. What can we learn from that?

One response might just be to try to randomise at an even larger scale. Evaluators have used different states as the unit of comparison in a US initiative. However, it is unclear how meaningful randomisation is at the state or regional level: there simply aren’t that many states or regions to compare.
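The concern about having few units to randomise can be illustrated with a small simulation. This is a sketch on made-up data: the function name `typical_baseline_gap` and the baseline scores (mean 50, standard deviation 10) are hypothetical assumptions, not drawn from any real initiative.

```python
import random
import statistics


def typical_baseline_gap(n_units, n_draws=500, seed=7):
    """Median gap in baseline outcomes between two randomly formed arms.

    Units (states, regions or hospitals) get made-up baseline scores, are
    split at random into two equal arms, and the gap between the arm means
    is recorded before any intervention. All values are hypothetical.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_draws):
        scores = [rng.gauss(50.0, 10.0) for _ in range(n_units)]
        rng.shuffle(scores)  # random allocation of units to arms
        half = n_units // 2
        arm_a, arm_b = scores[:half], scores[half:2 * half]
        gaps.append(abs(statistics.mean(arm_a) - statistics.mean(arm_b)))
    return statistics.median(gaps)


for n_units in (10, 50, 1000):
    print(n_units, round(typical_baseline_gap(n_units), 2))
```

Under these assumptions, the typical gap between the two arms before any intervention shrinks as the number of units grows. With only a handful of states or regions, randomisation cannot be relied on to produce comparable groups.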

Or we could ask if there are certain types of change that RCTs cannot fully capture.

The ideal subject for randomisation involves specifying a single intervention to be tested – an item or procedure that is defined in advance by experts and is either there or not there, like a medicine. For some interventions, it makes sense to randomly allocate to units larger than individuals, such as hospitals. But large-scale improvement initiatives like the NHFD function as processes of social change. It will often be desirable for the active ingredients to be developed and refined by networks of clinicians as they go along: to emerge from the bottom up as well as from the top down.

There is a risk that focusing on what can be clearly defined leads to “cargo cult” quality improvement. The analogy, originally from Richard Feynman and used by Mary Dixon-Woods, is with a cult in the South Pacific that built imitations of runways after World War Two in the hope that airplanes would arrive with provisions. In other words, there is a risk that people focus on the superficial form of a change to improve standards of care, such as using a safety checklist under prescribed conditions, rather than on the cultural or behavioural changes that actually make the difference.

Programmes that promote an organic, social way of changing services can be a nuisance for evaluators, but RCTs that stifle these elements may fail to effect long-term improvement, and other study designs may be more appropriate.

Suggested citation

Neuburger, J (2017) ‘Do the best health care improvement initiatives generate the worst evidence?’. Nuffield Trust comment.