How to combat data errors in clinical trials


Most people would agree that data lie at the heart of any research, clinical studies included. Accurate data allow researchers to make informed decisions. In clinical trials, clean and trusted data lead to sound judgments and ultimately contribute to better healthcare. Avoiding errors in the process is therefore one of the most important tasks.

However, expecting an error-free study is unrealistic, especially in the later phases of a trial. Unlike Phase I and Phase II, where the patient pool ranges from a few dozen to a few hundred, Phase III and Phase IV usually include around ten times as many participants. The more data there are to handle, the higher the risk of making mistakes.

Plus, “data” at this stage cover more than just figures. In much research, cognition-related studies for example, researchers must collect verbal descriptions of conditions from patients. Factors such as language differences may distort the process and make the data incorrect. In a survey conducted by Nagurney JT at Massachusetts General Hospital, about 30% of subjects gave dissimilar answers to the same questions asked multiple times.

The impossibility of perfectly clean data has given rise to the definition of quality data as “data strong enough to support conclusions and interpretations equivalent to those derived from error-free data.” This definition was proposed by the Institute of Medicine (IOM). In essence, it says that data are acceptable as long as they serve their intended purpose.

To minimize the risk of faulty data, a data-centric process consisting of six steps has been developed.

Identify data to be collected

Before going to the grocery store, you list the things you are going to buy. The same goes for data in clinical trials, but at a higher level of specificity. Investigators write these requirements in detail in a protocol. For example, “taking the health index of diabetes patients” is too general. A protocol should instead say “measuring the height and weight of type 2 diabetes patients.”

Being as specific as possible not only helps sites reach a consensus but also avoids redundancy. If the instructions are too general, some sites may try to prevent missing data by providing more details than needed. This, in turn, creates problems when the data are organized and processed in later stages.

Identifying the data to be collected is the first step in the data collection process, as it determines how the case report form (CRF) will be designed. Vague instructions result in lengthy forms, which add time and cost to data cleaning. What’s more, if a patient is asked to answer too many questions, the quality of the answers cannot be guaranteed.

Define data collection methods

Okay, so now you know exactly what you are going to collect. The question now is how. This step determines the specific design of the CRF, which means deciding which questions to ask to collect which type of data. For example, for a closed question such as a patient’s gender, the input options include fill-in-the-blank, drop-down menus, checklists and radio buttons. Simple as it may seem, researchers need to decide beforehand how specific the data need to be in order to choose the most suitable option.

If a question is too detailed, such as “choose among 20 shades of skin colors,” it may introduce a lot of variability into the data and take a lot of time to answer. On the other hand, a question that is too general, like “choose between either high or low blood cholesterol level,” may be of no help at the analysis stage. The key is to ask appropriate questions based on the type of data needed to support the study.
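As a rough illustration of fixing the level of detail up front, a closed CRF question can be described as a field with a fixed list of options. This is only a sketch: the ChoiceField class, the field name and the category labels are hypothetical and are not clinical cut-offs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChoiceField:
    """A CRF question that accepts exactly one value from a fixed list."""
    name: str
    options: tuple

    def validate(self, answer: str) -> str:
        # Reject anything that is not one of the predefined options.
        if answer not in self.options:
            raise ValueError(f"{self.name}: {answer!r} is not one of {self.options}")
        return answer


# Hypothetical field: the labels are illustrative only.
cholesterol = ChoiceField("cholesterol_level", ("low", "normal", "borderline", "high"))
```

Choosing the options tuple is where the trade-off described above is made: two options may be too coarse for the analysis, twenty may be too many for the person filling in the form.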

One more thing to consider is convenience. Since researchers have to deal with many kinds of data when meeting one patient, it is best to simplify the cognitive task of the person completing the form. For instance, restricting height to meters may confuse those who are more familiar with inches. In this case, the CRF should allow unit-free data entry, that is, entry that is not tied to a single unit.
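One possible way to support this is to let the person choose the unit and convert to a single unit behind the scenes. The sketch below assumes height is stored in meters. The TO_METERS table and the function name are made up for illustration.

```python
# Hypothetical conversion table: factor that turns each accepted unit into meters.
TO_METERS = {"m": 1.0, "cm": 0.01, "in": 0.0254}


def height_in_meters(value: float, unit: str) -> float:
    """Convert a height entered in any accepted unit to meters."""
    if unit not in TO_METERS:
        raise ValueError(f"unsupported height unit: {unit!r}")
    return value * TO_METERS[unit]
```

With this, a site that measures in inches can enter height_in_meters(70, "in") while another enters height_in_meters(1.78, "m"), and both end up on the same scale in the dataset.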

Observe and/or measure data

Data can be generated through observation, such as the alleviation of depression symptoms, or through measurement, such as blood pressure. Errors are unavoidable at this stage, so an error-checking process should be built into the measurement and observation procedures. Methods include taking more than one reading, drawing an extra vial of blood in case further examination is needed, asking for the same information via different questions, and measuring one parameter with two different methods.
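Taking more than one reading only helps if someone compares the readings. A minimal sketch of such a comparison follows. The function name is invented here, and the tolerance is an assumption that a real study would take from its protocol.

```python
def needs_remeasurement(readings, tolerance):
    """Flag a set of repeated readings whose spread exceeds the tolerance."""
    if len(readings) < 2:
        raise ValueError("need at least two readings to compare")
    # Spread = difference between the largest and smallest reading.
    return max(readings) - min(readings) > tolerance
```

For example, three systolic readings of 120, 122 and 121 with a tolerance of 5 would pass, while readings of 120 and 140 would be flagged for another measurement.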

Besides built-in error checks, ongoing assessment increases the chances of correcting faulty values. Comparing the raw data with a range of possible values can help detect abnormal values in time for correction and re-measurement.
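The range comparison described above can be sketched in a few lines. The field names and limits below are placeholders chosen for illustration, not recommended clinical limits.

```python
# Hypothetical plausibility limits; a real study would define its own.
PLAUSIBLE_RANGES = {
    "systolic_bp_mmHg": (60, 260),
    "height_m": (0.4, 2.5),
}


def out_of_range(record: dict) -> list:
    """Return the names of fields whose values fall outside their plausible range."""
    flagged = []
    for field, (low, high) in PLAUSIBLE_RANGES.items():
        value = record.get(field)
        # Missing values are skipped here; they would need a separate check.
        if value is not None and not (low <= value <= high):
            flagged.append(field)
    return flagged
```

Running such a check as data come in, rather than at the end of the study, is what leaves time to re-measure while the patient is still available.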

In terms of measuring methods and equipment, there should be consistency across sites, because different tools can introduce variability in the values captured. Old equipment can show undesirably large deviations. All of these can affect data quality.

Record data

The very first place a piece of data is written down, whether on scratch paper, a paper form or a computer form, is called the source of the data. According to the U.S. Food and Drug Administration, the data source must be clearly identified, protected from any kind of alteration, and compliant with good documentation practices. To date, most clinical trials still keep paper medical report forms as the source of the data so that investigators can always go back and check if any confusion comes up.

Data quality at this stage depends largely on the person who writes down the data. When sites disagree on recording specifics, such as the number of significant digits and the rounding rule, variability can arise among their datasets. Once again, clarity and consistency across all sites in the study are the most important things to keep in mind when dealing with data in clinical trials.
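To see how a rounding rule alone can create differences between sites, consider the sketch below, which uses Python's standard decimal module. The record_value helper and the sample weight are made up for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN


def record_value(raw: str, places: str = "0.1", rounding=ROUND_HALF_UP) -> Decimal:
    """Round a raw reading to a fixed number of decimal places with an explicit rule."""
    return Decimal(raw).quantize(Decimal(places), rounding=rounding)


# The same raw weight, recorded at two hypothetical sites with different rules.
site_a = record_value("72.25", rounding=ROUND_HALF_UP)    # rounds the tie upward
site_b = record_value("72.25", rounding=ROUND_HALF_EVEN)  # rounds the tie to the even digit
```

Here site_a ends up as 72.3 and site_b as 72.2, even though both sites read the same value. Stating the number of decimal places and the rounding rule once for the whole study removes this source of variability.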

Process data

In the old days, before specialized software came along, double data entry was required to ensure the accuracy of data entered into the system. Nowadays, other approved methods exist as well, such as single data entry, optical scanning methods, abstraction, batch data cleaning and no batch data cleaning. Although the complexity of data, the skill level of staff and the number of manual steps have been proposed as factors contributing to inaccurate data at this stage, no regulations have been put forward regarding them.
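The idea behind double data entry is that two people key in the same form independently and any disagreement is reviewed against the source. A minimal sketch of the comparison step, with a hypothetical function name and field names, could look like this:

```python
def entry_discrepancies(first: dict, second: dict) -> dict:
    """Return {field: (first_value, second_value)} for every field that differs."""
    fields = sorted(set(first) | set(second))
    return {
        f: (first.get(f), second.get(f))
        for f in fields
        if first.get(f) != second.get(f)
    }
```

For instance, if one operator types a weight of 81.5 and the other types 18.5, the function reports that field with both values so the source document can be checked. A field missing from one of the entries shows up with None on that side.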

Analyzing data, reporting status and reporting results

These are the very last steps in the long process of handling data in clinical trials, and they normally have no impact on data quality, since clean data should have been locked before being transferred for analysis and reporting.

To sum up, planning for an error-free trial is out of the question. The main focus is to create an efficient process so that data errors are detected and kept to a minimum.