Reusing data collected in clinical studies is of strong scientific and economic interest. Anonymisation has emerged as a solution to extend data life, while providing regulatory compliance and privacy protection. Nevertheless, true anonymisation of complex data such as compositional microbiome data is challenging, due to high dimensionality, relatively small sample sizes, high variability, and intrinsic complexity.
We hereby present AnonyMine, a deep learning-based platform for joint anonymisation of microbiome and clinical data with repeated measurements. In addition, its simulation core also allows for informed decisions on study design and hypothesis exploration around the microbiome.
GDPR LIMITATIONS
The European General Data Protection Regulation (GDPR) governs the processing of personal data within the EU, making compliance essential for clinical research. Clinical research inherently involves handling participants' personal data: researchers must define the processing, assess risks, implement security measures, inform individuals, and guarantee their rights. While the valuable data obtained from research should be shared and reused to advance scientific knowledge, such reuse must comply with the GDPR, which can complicate or hinder research. Only necessary and relevant data should be collected, and it should be used solely for the purposes specified in the consent form. These constraints protect volunteers but pose ethical, scientific, and economic challenges, such as complicating data collaboration and reducing the return on research investment. Anonymising data is a solution, as it exempts the data from GDPR restrictions, but the process must be effectively irreversible and prevent re-identification of individuals.
WHAT IS AnonyMine?
AnonyMine is essentially a simulator of realistic microbiome and clinical data: it takes real data as input and produces synthetic samples whose global statistical properties closely resemble those of the real ones. It can handle modern shotgun metagenomics data in combination with repeated clinical measurements.
The goal of AnonyMine is to find a good balance between privacy and quality, meaning the ability to produce a synthetic dataset that still allows for truthful answers in downstream analyses, without leaking sensitive information.
HOW DOES AnonyMine WORK?
AnonyMine leverages the latest developments in Generative Adversarial Networks to produce synthetic data that closely resembles the true data while preserving key statistical and biological information. It uses a deep neural network to learn how to generate simulated data that can confound an expert classifier whose task is to distinguish synthetic samples from true ones. After the model is trained, many synthetic samples can be drawn and compared to the true ones to choose a subset with the best tradeoff between privacy protection and statistical quality.
A key component of the system is a compression step that produces a lower-dimensional representation of microbiome data that is also free of compositional constraints. The taxonomic hierarchy is leveraged to define a transformation of the data that reduces the dimensionality without filtering out individual components. The full approach mimics proven strategies for image compression (JPEG 2000) and enables microbiome information to be properly processed with state-of-the-art deep learning models.
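To make the idea of a hierarchy-guided, constraint-free representation concrete, here is a minimal sketch. It is not AnonyMine's actual transformation, which is not specified here. It assumes a hypothetical two-level taxonomy, a simple additive log-ratio within each internal node, and a pseudocount of 0.5 to handle zero counts; all names (TAXONOMY, hierarchical_log_ratios, the taxa and counts) are illustrative only.

```python
import math

# Hypothetical toy taxonomy: internal node -> ordered list of children.
# Names that never appear as keys are leaves (genera).
TAXONOMY = {
    "root": ["Firmicutes", "Bacteroidetes"],
    "Firmicutes": ["Lactobacillus", "Clostridium", "Faecalibacterium"],
    "Bacteroidetes": ["Bacteroides", "Prevotella"],
}


def node_total(node, counts):
    """Sum of the leaf counts below `node` (a leaf returns its own count)."""
    if node not in TAXONOMY:
        return counts[node]
    return sum(node_total(child, counts) for child in TAXONOMY[node])


def hierarchical_log_ratios(counts, pseudocount=0.5):
    """Map leaf counts to unconstrained log-ratio coordinates.

    For each internal node, every child except the last is compared with
    the last child (the reference) through a log-ratio of subtree totals.
    This yields (number of leaves - 1) real-valued coordinates.
    """
    smoothed = {taxon: value + pseudocount for taxon, value in counts.items()}
    coords = {}
    for node, children in TAXONOMY.items():
        totals = [node_total(child, smoothed) for child in children]
        reference = totals[-1]
        for child, total in zip(children[:-1], totals[:-1]):
            coords[(node, child)] = math.log(total / reference)
    return coords


# Illustrative counts for one sample (made-up numbers).
sample = {
    "Lactobacillus": 120,
    "Clostridium": 30,
    "Faecalibacterium": 0,
    "Bacteroides": 400,
    "Prevotella": 50,
}
coords = hierarchical_log_ratios(sample)

# A crude form of compression: keep only the coordinates attached to the
# upper level of the tree, much as coarse coefficients are kept first in
# multi-resolution image coding.
coarse = {key: value for key, value in coords.items() if key[0] == "root"}
```

Because each coordinate is a log-ratio, the representation does not depend on sequencing depth when no pseudocount is needed, and the values can take any real number, which is what most deep learning models expect as input.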
A KEY PROPERTY: VARIABILITY PRESERVATION
By design, the system does not produce a synthetic surrogate for each individual observation. Instead, it learns from the full sample how to simulate similar ones. This makes it easier to preserve the statistical properties of the original data, particularly its variability. Many alternative anonymisation methods work by producing local averages. Although averaging makes it difficult to connect a real sample with its synthetic counterpart when sufficient data points are included in the computation, it typically reduces the variability of the data. This can have a strong impact on clinical research, e.g. a biased estimate of the power or required sample size for a confirmatory study.
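The effect of local averaging on variability, and its knock-on effect on sample size, can be illustrated with a small sketch. It is an assumption-laden toy example, not a description of any specific anonymisation tool: the data are simulated Gaussian values, the averaging rule is a plain k-nearest-neighbour mean, and the sample size formula is the standard normal approximation for comparing two means. The names knn_average and required_n_per_group are hypothetical.

```python
import math
import random
import statistics

random.seed(1)
N_SUBJECTS, N_FEATURES, K = 200, 5, 10

# Simulated "real" data: independent standard Gaussian features.
real = [[random.gauss(0.0, 1.0) for _ in range(N_FEATURES)]
        for _ in range(N_SUBJECTS)]


def knn_average(data, k):
    """Replace each record by the mean of its k nearest records (itself included)."""
    averaged = []
    for row in data:
        neighbours = sorted(data, key=lambda other: math.dist(row, other))[:k]
        averaged.append([statistics.fmean(column) for column in zip(*neighbours)])
    return averaged


def required_n_per_group(sd, delta, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group to detect a mean difference `delta`."""
    z = statistics.NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return math.ceil(2 * (z_sum * sd / delta) ** 2)


averaged = knn_average(real, K)

# Standard deviation of the first feature before and after averaging.
sd_real = statistics.stdev([row[0] for row in real])
sd_avg = statistics.stdev([row[0] for row in averaged])

# Sample size planned from each dataset for the same target difference.
n_from_real = required_n_per_group(sd_real, delta=0.5)
n_from_avg = required_n_per_group(sd_avg, delta=0.5)
```

Since the required sample size grows with the square of the standard deviation, a study planned from the averaged data would be sized too small and would be underpowered in practice.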
QUALITY EVALUATION OF ANONYMISED DATA
The quality of any anonymised dataset must be evaluated in terms of both statistical quality and privacy protection. From a statistical point of view, the main results and conclusions obtained on the true dataset must be preserved in the proposed synthetic one. Thus, the same analytic pipeline must be applied to both datasets and the results compared. For privacy protection, many proposed metrics make it possible to assess re-identification risk against the three criteria set out in the guidelines of the European Article 29 Data Protection Working Party (G29): singling out (individualisation), linkability (correlation), and inference.
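As one illustration of a privacy metric, the sketch below computes a distance to closest record, a check commonly applied to synthetic data: a synthetic record that sits much closer to a real record than real records sit to each other is a candidate leak. This is only an example of the kind of metric mentioned above, not the set of metrics used by AnonyMine. The data are simulated, one leak is planted on purpose, and the 5th-percentile threshold is an arbitrary assumption.

```python
import math
import random

random.seed(2)
N_RECORDS, N_FEATURES = 100, 4

# Simulated stand-ins for a real and a synthetic dataset.
real = [[random.gauss(0.0, 1.0) for _ in range(N_FEATURES)]
        for _ in range(N_RECORDS)]
synthetic = [[random.gauss(0.0, 1.0) for _ in range(N_FEATURES)]
             for _ in range(N_RECORDS)]
synthetic[0] = list(real[0])  # planted leak: an exact copy of a real record


def distance_to_closest_record(queries, reference, exclude_self=False):
    """Euclidean distance from each query to its nearest reference record.

    With exclude_self=True, record i of `queries` is not compared with
    record i of `reference` (used when both are the same dataset).
    """
    distances = []
    for i, query in enumerate(queries):
        distances.append(min(
            math.dist(query, ref)
            for j, ref in enumerate(reference)
            if not (exclude_self and i == j)
        ))
    return distances


# Baseline: how close real records are to each other.
baseline = sorted(distance_to_closest_record(real, real, exclude_self=True))
threshold = baseline[len(baseline) // 20]  # roughly the 5th percentile

# Synthetic records that are suspiciously close to a real record.
dcr = distance_to_closest_record(synthetic, real)
flagged = [i for i, distance in enumerate(dcr) if distance < threshold]
```

Flagged records would be dropped or inspected before release, which fits the idea of drawing many synthetic samples and keeping a subset with an acceptable privacy and quality tradeoff.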
BEYOND ANONYMISATION
As microbiome-related features become major endpoints in clinical research, they must be included in the experimental design questions: can we estimate a required sample size in a rational way? What is the attainable statistical power related to a prescribed sample size? How does it vary as a function of different types of effects?
Despite their importance in assessing the chance of success of a study, these questions are rarely addressed when the investigation of microbiome features is only exploratory, alongside the main clinical endpoints. Relying on published data is often not possible, due to the lack of comparable settings reported in the literature and to technical mismatches. Simulation-guided decisions are a better alternative, but the quality of the simulation is critical: parametric models typically produce synthetic samples that are easily distinguishable from the true ones, making any decision based on them questionable. More recent simulators seeded with real data often cannot deal with the full richness of modern microbiome data, with repeated measurements, or with the integration of clinical data.
AnonyMine addresses these limitations and makes it possible to base experimental design decisions on extensive simulation of target scenarios. For instance, imagine that in the context of a confirmatory study, you need to decide whether to focus on a subpopulation to study the efficacy of your product. It is possible to train an AnonyMine model on microbiome data together with clinical data and metadata, produce many synthetic samples, and pick those that correspond to the target subpopulation. Efficacy could then be assessed on this subset as a function of sample size, to estimate power or to decide on a sample size.
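The subpopulation scenario can be sketched as a resampling loop. The example below is a hedged illustration, not AnonyMine code: the function make_pool is a hypothetical stand-in for a large pool of synthetic records (here drawn from Gaussians with a made-up effect of 0.6 in participants aged 60 and over), and the test is a simple two-sample z-test.

```python
import math
import random
import statistics

random.seed(3)


def make_pool(size):
    """Hypothetical stand-in for synthetic records drawn from a trained model."""
    pool = []
    for _ in range(size):
        age = random.uniform(20, 80)
        arm = random.choice(["placebo", "product"])
        # Assumed effect: the product only works in participants aged 60+.
        effect = 0.6 if (arm == "product" and age >= 60) else 0.0
        pool.append({"age": age, "arm": arm,
                     "endpoint": random.gauss(effect, 1.0)})
    return pool


def z_test_rejects(x, y, alpha=0.05):
    """Two-sided two-sample z-test using the sample variances."""
    se = math.sqrt(statistics.variance(x) / len(x)
                   + statistics.variance(y) / len(y))
    z = (statistics.fmean(x) - statistics.fmean(y)) / se
    return abs(z) > statistics.NormalDist().inv_cdf(1 - alpha / 2)


def estimated_power(records, n_per_arm, n_rep=500):
    """Share of resampled trials of size n_per_arm per arm that reject the null."""
    placebo = [r["endpoint"] for r in records if r["arm"] == "placebo"]
    product = [r["endpoint"] for r in records if r["arm"] == "product"]
    hits = sum(
        z_test_rejects(random.choices(product, k=n_per_arm),
                       random.choices(placebo, k=n_per_arm))
        for _ in range(n_rep)
    )
    return hits / n_rep


pool = make_pool(5000)
subpopulation = [r for r in pool if r["age"] >= 60]  # target subpopulation
power_curve = {n: estimated_power(subpopulation, n) for n in (10, 20, 40, 80)}
```

Reading the resulting curve against a target such as 80% power gives a sample size per arm for the subpopulation; repeating the exercise on the full pool shows how much the effect is diluted when the study is not restricted.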
CONCLUSION
The AnonyMine platform provides GDPR-compliant anonymisation and simulation for high-dimensional data. It offers a way to share and reuse data, increasing the value of clinical trials and supporting collaborative research. The platform leverages AI simulation approaches to anonymise data while preserving key statistical and biological characteristics for further research and applications, and it supports informed decisions when designing a clinical study around the microbiome.
For more information, contact biofortis-contact@biofortis.fr