Completed RESEARCH GRANT UKRI Gateway to Research

Creating longitudinal datasets for linked administrative data research using synthetic data

£1.61M GBP

Funder Economic and Social Research Council
Recipient Organization University College London
Country United Kingdom
Start Date Jan 01, 2021
End Date Jul 10, 2022
Duration 555 days
Number of Grantees 5
Roles Co-Investigator; Principal Investigator
Data Source UKRI Gateway to Research
Grant ID ES/V005448/1
Grant Description

Administrative data hold great potential for informing public policy. However, this potential is not yet being realised due to restrictions around data access, linkage, and privacy protection. Governance procedures and approvals lead to long timescales and tight restrictions on data access, which can jeopardise publicly funded research.

One solution is to generate synthetic data that preserve the statistical properties of the original sources, but do not correspond to any real individuals or pose privacy risks. These data could be widely shared, allowing researchers to understand the data structures, develop analysis plans and algorithms, and test out different models. This could be done in parallel to applying for access to linked administrative datasets, streamlining the research process. Final refinements and analyses would be conducted on the real data.

Our study will test the feasibility of approaches for creating synthetic linked administrative datasets. We will compare two existing methods ('Synthpop', used to create synthetic versions of the Scottish Longitudinal Study, and 'Simulacrum', used to create synthetic versions of the National Cancer Registry) with a new approach, 'Jomo', based on recent methodological developments for the imputation of missing data.

We will evaluate these approaches using an exemplar of linking the third National Survey of Sexual Attitudes and Lifestyles (Natsal-3) to two administrative datasets: Hospital Episode Statistics (HES) and the National Pupil Database (NPD).

Natsal-3 is one of the largest population-based surveys of sexual behaviour in the world and collected data from 15,000 participants during 2010-2012. HES contains information on attendances at all NHS hospitals in England, allowing detailed analysis of procedures and diagnoses. NPD contains information on pupils attending state schools in England, including school achievement, absences, and special educational needs.

Linkage between Natsal-3, HES and NPD will provide a unique opportunity to gain a deeper understanding of the social, behavioural and biological aspects of sexual and reproductive health, and to generate evidence to inform implementation of sexual health interventions.

We will first compare different methods for generating synthetic versions of the three datasets separately (since all have different structures and characteristics), based on how well the data generated by these methods represent the original data. We will also apply for approvals to link the data, to i) explore whether there are any additional considerations needed when synthesising complex, linked data, and ii) generate synthetic versions of the linked data that can be shared with researchers more widely.

The quality and usability of synthetic data are highly dependent on the data generation model and the purpose of analysis. However, identifying all relevant variables and the possible dependencies or interactions between them is highly resource-intensive. One of the challenges for synthetic data generation is therefore understanding whether there are situations in which generic versions of synthetic data may be sufficient for some purposes, or whether bespoke synthetic datasets (tailored to a specific research problem) are always required.

We will explore this balance by engaging with data providers and researchers and determining the nature and practicality of communication between the two that is required to produce acceptable outputs. We will also engage with the public to seek their views on the use of synthetic data.

Based on a set of exemplar research questions, we will generate synthetic data and compare feasibility and outputs from different approaches. To evaluate how well the synthetic data represent the real data, we will compare characteristics and statistical inferences from the synthetic data with those from the real data. Based on our findings, we will generate guidelines on the appropriate use of synthetic data.
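As a rough illustration of the comparison described above, the following Python sketch fits a toy sequential synthesiser to a simulated "real" dataset and then compares a summary characteristic (the mean) and a statistical inference (a regression slope) between the real and synthetic versions. It is a minimal sketch under stated assumptions: the data are simulated, the function names (`ols_slope`, `synthesise`, `compare`) are hypothetical, and the synthesiser is a simple normal and linear model rather than Synthpop, Simulacrum, or Jomo.

```python
import random
import statistics


def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of y on x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def synthesise(xs, ys, n, rng):
    """Toy synthesiser (an assumption, not one of the study's methods):
    draw x from a fitted normal, then y from a fitted linear model of y on x
    with normally distributed residual noise."""
    mx, sx = statistics.fmean(xs), statistics.stdev(xs)
    slope = ols_slope(xs, ys)
    intercept = statistics.fmean(ys) - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    s_res = statistics.stdev(residuals)
    syn_x = [rng.gauss(mx, sx) for _ in range(n)]
    syn_y = [intercept + slope * x + rng.gauss(0, s_res) for x in syn_x]
    return syn_x, syn_y


def compare(real, synthetic):
    """Compare one characteristic and one inference across the two datasets."""
    (rx, ry), (sx, sy) = real, synthetic
    return {
        "mean_x_diff": abs(statistics.fmean(rx) - statistics.fmean(sx)),
        "mean_y_diff": abs(statistics.fmean(ry) - statistics.fmean(sy)),
        "slope_real": ols_slope(rx, ry),
        "slope_synthetic": ols_slope(sx, sy),
    }


# Simulated stand-in for a real dataset (hypothetical values).
rng = random.Random(0)
real_x = [rng.gauss(50, 10) for _ in range(2000)]
real_y = [2.0 + 0.5 * x + rng.gauss(0, 5) for x in real_x]

syn_x, syn_y = synthesise(real_x, real_y, 2000, rng)
report = compare((real_x, real_y), (syn_x, syn_y))
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

In this sketch, small differences in means and similar slopes would suggest that the synthetic data support the same conclusion for this one analysis. That says nothing about relationships the synthesiser did not model, which is the trade-off between generic and bespoke synthetic data discussed above.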

All Grantees

University College London
