Methods for assessing and improving data quality
The 3 V’s for big data – volume, variety, and velocity – were quickly succeeded by the 4 V’s, adding “veracity”. It was soon recognised that even the most sophisticated big data analytics cannot overcome the limitations of poorly captured data. Increasing electronic capture and storage of information does not, unfortunately, guarantee good data quality.
There is a relative paucity of methodological work to assess and improve the quality of data in big data settings. However, better detection of errors, leading to enhanced chances of correcting erroneous data, is essential for the validity of subsequent analysis.
Various approaches for detecting likely errors in data have been proposed. In the context of longitudinal data within routinely collected primary care data, one promising method developed in collaboration with members of our theme uses an iterative approach of fitting mixed models, identifying likely outliers, and re-fitting the model after removal of outliers (Welch et al, 2012).
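As a simplified illustration of this iterative strategy, the sketch below repeatedly fits a crude location model (the sample mean, standing in for the mixed model of Welch et al, 2012), flags observations whose standardised residual exceeds a threshold, removes them, and refits until no further outliers are found. The function name and threshold are illustrative choices, not taken from the paper.

```python
from statistics import mean, stdev

def iterative_outlier_removal(values, threshold=3.0, max_iter=10):
    """Iteratively fit a simple location model (here just the mean, a
    stand-in for the mixed model in the two-stage method), flag points
    whose standardised residual exceeds `threshold`, drop them, refit."""
    kept = list(values)
    removed = []
    for _ in range(max_iter):
        m, s = mean(kept), stdev(kept)
        outliers = [v for v in kept if abs(v - m) > threshold * s]
        if not outliers:
            break  # model is stable: no further likely errors detected
        removed.extend(outliers)
        kept = [v for v in kept if abs(v - m) <= threshold * s]
    return kept, removed
```

The key point the sketch shares with the two-stage method is that outliers distort the initial fit, so detection and refitting must alternate until the flagged set stabilises.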
Some relevant references:
Welch C, Petersen I, Walters K, Morris RW, Nazareth I, Kalaitzaki E, White IR, Marston L, Carpenter J. Two-stage method to remove population- and individual-level outliers from longitudinal data in a primary care database. Pharmacoepidemiology and Drug Safety, 2012; 21: 725-732.
Missing data
The challenge of missing data is not restricted to the context of large datasets of routinely or semi-automatically collected data. However, missing data in such settings raises complex and often novel challenges; we highlight two below.
The first is referred to as ‘data-dependent sampling’: the process you are trying to collect data on itself controls, to some extent, the data you are able to collect. To give two examples:
- using wearable devices to measure activity can over-estimate usual activity, because participants may choose to leave their device at home on low-activity days; this is a form of measurement error
- in routinely collected primary care data, clinical and therapeutic information is collected only when the patient chooses to visit their general practitioner – and then only for reasons specifically relevant to the consultation.
The second challenge arises from the sheer volume of the data. While multiple imputation and related approaches are flexible and powerful, and offer much potential, they must be adapted to meet these challenges, which can otherwise violate their underpinning assumptions.
Some recent work within the group that has addressed some of these challenges:
- two-fold imputation, an adaptation of multiple imputation which attempts to simplify the problem by conditioning only on measurements that are local in time (Welch et al, 2014); and
- a paper correcting misconceptions about the use of multiple imputation to handle missing data in propensity score analyses (Leyrat et al, 2017)
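To convey the core idea behind the two-fold approach, namely conditioning only on measurements close in time, the toy sketch below fills each missing value from its immediate temporal neighbours. The real method (Welch et al, 2014) performs full multiple imputation within each time window; this deterministic single pass, with an illustrative function name, is purely a teaching device.

```python
def twofold_style_impute(series):
    """Fill missing entries (None) in a time-ordered series using only
    values that are local in time: here the average of the immediate
    neighbours at t-1 and t+1. A single forward pass is used, so an
    already-filled earlier value may feed a later one."""
    out = list(series)
    for t, v in enumerate(out):
        if v is None:
            neighbours = [out[s] for s in (t - 1, t + 1)
                          if 0 <= s < len(out) and out[s] is not None]
            if neighbours:
                out[t] = sum(neighbours) / len(neighbours)
    return out
```

The simplification mirrors the method's motivation: conditioning on a small local window keeps the imputation model tractable when the full record spans many time points.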
Some relevant references:
Welch C, Petersen I, Bartlett J, White IR, Marston L, Morris RW, Nazareth I, Walters K, Carpenter J. Evaluation of two-fold fully conditional specification multiple imputation for longitudinal electronic health record data. Stat Med. 2014; 33: 3725-3737.
Leyrat C, Seaman SR, White IR, Douglas I, Smeeth L, Kim J, Resche-Rigon M, Carpenter JR, Williamson EK. Propensity score analysis with partially observed covariates: How should multiple imputation be used? Stat Methods Med Research, 2017, doi: 10.1177/0962280217713032. [Epub ahead of print]
Causal inference
Assessing causal relationships from non-randomised data poses many methodological challenges, particularly related to confounding and selection bias. These are exacerbated in studies conducted using routinely collected data: data not collected for the primary purpose of research tend to be less regular and less complete than traditional data sources used to address such questions.
In comparative effectiveness studies of medications, there is often a wealth of information available regarding previous diagnoses, medications, referrals and therapies. However, how best to incorporate this information into analyses remains unclear. The high-dimensional propensity score (Schneeweiss et al, 2009) is an empirical algorithm to select potential confounders, prioritise candidates, and incorporate selected variables into a propensity-score based statistical model. This algorithm was developed in the context of US claims data; the validity of its application to different settings, such as routinely collected primary care data in the UK, remains unclear.
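The covariate-prioritisation step of the hdPS algorithm can be illustrated in a few lines: each candidate covariate is ranked by the confounding bias it could plausibly induce, assessed with the Bross formula. The sketch below renders only that step, with illustrative names and data; the full algorithm of Schneeweiss et al (2009) also generates candidate covariates from codes and assesses their recurrence.

```python
import math

def bross_bias(p_c1, p_c0, rr_cd):
    """Multiplicative bias from ignoring a binary covariate (Bross formula).
    p_c1, p_c0: covariate prevalence among exposed / unexposed;
    rr_cd: relative risk of the covariate for the outcome. As in the hdPS
    algorithm, an apparently protective covariate (rr_cd < 1) is recoded
    so its relative risk points away from the null."""
    rr = max(rr_cd, 1.0 / rr_cd)
    return (p_c1 * (rr - 1.0) + 1.0) / (p_c0 * (rr - 1.0) + 1.0)

def rank_candidates(candidates):
    """Rank (name, p_c1, p_c0, rr_cd) tuples by |log bias|, largest first,
    so the covariates most able to confound are selected first."""
    return sorted(candidates,
                  key=lambda c: abs(math.log(bross_bias(c[1], c[2], c[3]))),
                  reverse=True)
```

Ranking on |log bias| treats covariates that inflate and attenuate the treatment effect symmetrically, which is why the prioritisation step works without knowing the direction of confounding in advance.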
An alternative approach to the incorporation of a large number of potential confounders into a causal model is offered by Targeted Maximum Likelihood Estimation (TMLE). This approach has been applied to UK primary care data to investigate the association between statins and all-cause mortality (Pang et al, 2016), with the authors concluding that a deeper understanding of the comparative advantages and disadvantages of this approach was needed within this big-data setting.
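For readers new to TMLE, the sketch below walks through its main steps for a binary treatment, binary outcome and a single binary confounder: an initial outcome model, a propensity model, a ‘clever covariate’, a fluctuation step, and the targeted plug-in estimate of the average treatment effect. It is a bare-bones illustration under strong simplifying assumptions (saturated empirical models, every stratum containing both outcome values), not a substitute for the tutorial or the tmle R package below.

```python
import math
from collections import defaultdict

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def tmle_ate(data, n_newton=20):
    """Minimal TMLE for the average treatment effect E[Y(1)] - E[Y(0)].
    data: list of (w, a, y) with w, a, y all binary. For brevity the
    initial outcome and propensity models are saturated empirical means;
    in practice these are flexible (machine-learned) regressions."""
    # Step 1: initial outcome model Qbar(a, w) = mean of Y in each stratum.
    ysum, n = defaultdict(float), defaultdict(int)
    asum, nw = defaultdict(float), defaultdict(int)
    for w, a, y in data:
        ysum[(a, w)] += y; n[(a, w)] += 1
        asum[w] += a; nw[w] += 1
    qbar = {k: ysum[k] / n[k] for k in n}
    # Step 2: propensity model g(w) = P(A = 1 | W = w).
    g = {w: asum[w] / nw[w] for w in nw}
    # Step 3: clever covariate H(a, w) = a/g(w) - (1 - a)/(1 - g(w)).
    def clever(a, w):
        return a / g[w] - (1 - a) / (1.0 - g[w])
    # Step 4: fluctuation. Solve for epsilon in
    # logit Q_eps = logit Qbar + eps * H by Newton's method on the score.
    eps = 0.0
    for _ in range(n_newton):
        score, hess = 0.0, 0.0
        for w, a, y in data:
            h = clever(a, w)
            q = expit(logit(qbar[(a, w)]) + eps * h)
            score += h * (y - q)
            hess += h * h * q * (1.0 - q)
        if hess == 0.0:
            break
        eps += score / hess
    # Step 5: targeted plug-in estimate of the ATE.
    est = 0.0
    for w, _, _ in data:
        q1 = expit(logit(qbar[(1, w)]) + eps * clever(1, w))
        q0 = expit(logit(qbar[(0, w)]) + eps * clever(0, w))
        est += (q1 - q0) / len(data)
    return est
```

With saturated models the fluctuation step leaves the initial fit unchanged (epsilon is zero), and TMLE reduces to standard g-computation; the targeting step earns its keep when the initial outcome model is fitted flexibly and may be misspecified.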
To begin to address this knowledge gap, members of our theme have developed a free, open source online tutorial introducing TMLE for causal inference (see links below).
They have also created and made available a free open source Stata program to implement double-robust methods for causal inference, including Machine Learning algorithms for prediction (see links below).
A promising approach to poorly measured, or unmeasured, confounding is offered by self-controlled designs, such as the self-controlled risk interval, the case-crossover design and the self-controlled case series. These designs use individuals as their own controls, thus removing the effect of time-invariant confounders.
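As a toy illustration of the self-controlled idea, the sketch below estimates a common relative incidence from each individual's own event counts and person-time in risk versus baseline windows, maximising the conditional (binomial) likelihood that underlies the self-controlled case series. Real analyses use conditional Poisson regression with age and season adjustment; the function name and data layout here are illustrative.

```python
def sccs_relative_incidence(people, lo=1e-6, hi=1e6, tol=1e-9):
    """Maximum-likelihood estimate of a common relative incidence rho.
    people: list of (n_risk, t_risk, n_base, t_base) per individual.
    Conditional on each person's total events, events fall in the risk
    window with probability rho*t_risk / (rho*t_risk + t_base), which
    removes all time-invariant individual effects. The likelihood is
    unimodal in rho, so bisection on the score finds the MLE."""
    def score(rho):
        s = 0.0
        for n_r, t_r, n_b, t_b in people:
            s += n_r / rho - (n_r + n_b) * t_r / (rho * t_r + t_b)
        return s
    while hi - lo > tol * (1.0 + lo):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Because only within-person contrasts enter the likelihood, anything constant over a person's observation period (genetics, deprivation, frailty) cancels out, which is the design's appeal when such confounders are unmeasured.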
Some relevant references:
Franklin JM, Schneeweiss S, Solomon DH. Assessment of Confounders in Comparative Effectiveness Studies From Secondary Databases. Am J Epidemiol. 2017; 185(6): 474-478. doi: 10.1093/aje/kww136.
Franklin JM, Eddings W, Austin PC, Stuart EA, Schneeweiss S. Comparing the performance of propensity score methods in healthcare database studies with rare outcomes. Stat Med. 2017; 36(12): 1946-1963. doi: 10.1002/sim.7250.
Schneeweiss S, Rassen JA, Glynn RJ, Avorn J, Mogun H, Brookhart MA. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology. 2009; 20(4): 512-522.
Pang M, Schuster T, Filion KB, Eberg M, Platt RW. Targeted maximum likelihood estimation for pharmacoepidemiologic research. Epidemiology. 2016; 27(4): 570-577.
Kang JD, Schafer JL. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science. 2007; 22(4): 523-539.
Schuler MS, Rose S. Targeted Maximum Likelihood Estimation for Causal Inference in Observational Studies. American Journal of Epidemiology. 2016. doi: 10.1093/aje/kww165.
Gruber S, van der Laan MJ. tmle: An R Package for Targeted Maximum Likelihood Estimation. Journal of Statistical Software. 2012; 51(13).
Gruber S. Targeted Learning in Healthcare Research. Big Data. 2016; 3(4): 211-218. doi: 10.1089/big.2015.0025.
Software (open source):
Author: Dr. Miguel Angel Luque-Fernandez, LSHTM.
https://github.com/migariane/meltmle
https://github.com/migariane/weltmle
Online tutorial:
Author: Dr. Miguel Angel Luque-Fernandez, LSHTM.
https://migariane.github.io/TMLE.nb.html
Linkage
Linkage has been described as “a merging that brings together information from two or more sources of data with the object of consolidating facts concerning an individual or an event that are not available in any separate record” (Organisation for Economic Co-operation and Development (OECD) Glossary of Statistical Terms).
Linking health-related datasets offers the opportunity to improve data quality, by improving ascertainment of key risk factors and outcomes and allowing inconsistencies to be identified and resolved. It is also a cost-effective means of assembling a dataset, exploiting existing resources. However, challenges associated with data linkage include the lack of unique identifiers, leading to possible linkage errors, and data security considerations.
Small amounts of linkage error can result in substantially biased results. False matches introduce variability and weaken the association between variables, often resulting in bias to the null, and missed matches reduce the sample size and result in a loss of statistical power and potential selection bias. Evaluating the potential impact of linkage error on results is vital (Harron et al, 2014).
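The attenuating effect of false matches is easy to demonstrate in a small simulation: the sketch below links a fraction of records to the wrong individual by reassigning their outcome values, and the correlation between the linked variables weakens accordingly. The setup is entirely synthetic and illustrative.

```python
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def mislink(ys, error_rate, seed=0):
    """Simulate false matches: a random fraction `error_rate` of records
    receives the outcome value of another, wrongly linked, record."""
    rng = random.Random(seed)
    ys = list(ys)
    idx = rng.sample(range(len(ys)), int(error_rate * len(ys)))
    vals = [ys[i] for i in idx]
    rng.shuffle(vals)  # permute outcomes among the mislinked records
    for i, v in zip(idx, vals):
        ys[i] = v
    return ys
```

Even when the true relationship is deterministic, a modest false-match rate visibly attenuates the observed association toward the null, which is why linkage error should be quantified and its impact evaluated rather than assumed negligible.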
Some relevant references:
Harron K, Wade A, Gilbert R, Muller-Pebody B, Goldstein H. Evaluating bias due to data linkage error in electronic healthcare records. BMC Medical Research Methodology, 2014, 14: 36. DOI: 10.1186/1471-2288-14-36.