Synthetic data generation

Synthetic data serves two purposes: protecting sensitive data and providing more data in data-poor scenarios. Sensitive data is often necessary to develop ML solutions, but using it puts that data at risk of disclosure. In other scenarios, there is too little data to explore modeling approaches, and acquiring more is cost- or time-prohibitive. In both cases, synthetic data can provide a safe and cost-effective resource for model training, evaluation, and testing.

Generating synthetic data can preserve and multiply the utility of the original data without compromising privacy. What gets generated depends on the data type: for images, it means generating new images; for tabular data, it means generating scalar values of multiple types. Ideally, the synthetic data has statistical properties similar to those of the real data in the ways relevant to the model, while excluding sensitive aspects.
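
As a rough illustration of "similar statistical properties" for tabular data, the sketch below fits a multivariate Gaussian to a numeric table and samples new rows from it. This is only a minimal example under stated assumptions: the columns (age, income), the helper names `fit_gaussian` and `sample_synthetic`, and the choice of a Gaussian model are all hypothetical and are not taken from any particular generator. A plain Gaussian fit preserves means and linear correlations, but it carries no formal privacy guarantee on its own.

```python
import numpy as np


def fit_gaussian(real):
    """Estimate column means and the covariance matrix of a numeric table."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)  # columns are variables, rows are records
    return mean, cov


def sample_synthetic(mean, cov, n_rows, seed=0):
    """Draw synthetic rows from the fitted Gaussian (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_rows)


# Stand-in for a "real" dataset: two correlated numeric columns.
# These values are made up for the example; they are not real records.
rng = np.random.default_rng(42)
age = rng.normal(40, 10, size=1000)
income = 1000 * age + rng.normal(0, 5000, size=1000)
real = np.column_stack([age, income])

mean, cov = fit_gaussian(real)
synthetic = sample_synthetic(mean, cov, n_rows=5000)

# Compare the correlation between the two columns in each dataset.
real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"real correlation: {real_corr:.3f}, synthetic: {synth_corr:.3f}")
```

The synthetic table here is five times larger than the original yet contains no original row, which mirrors both purposes described above. Real generators typically also need to handle categorical columns, non-Gaussian distributions, and explicit privacy checks, none of which this sketch attempts.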

Some synthetic data generators