DataCebo launches enterprise version of popular open source synthetic data library

December 7, 2023 ndowd

Long before most of us were thinking about large language models, DataCebo co-founders Kalyan Veeramachaneni and Neha Patki were creating an open source library called Synthetic Data Vault, or SDV for short. The company’s roots go back to 2016 when both were working in the MIT Data to AI Lab. They had a notion that beyond generating text, images and code, you could also create data with generative AI.

For companies that need to use quality business data in large language models (and for other purposes) but can’t necessarily use personally identifiable information (PII) to do it, this is an intriguing idea. Today, the company emerged with $8.5 million in seed funding after taking a couple of years to build an enterprise commercial version of SDV.

This ability to create synthetic data from relational and tabular databases is what sets the company apart from other generative AI creation tools, says CEO Veeramachaneni. “Our software allows our customers to build a custom generative AI model on prem. And then they can use that synthetic data for a variety of use cases,” he told TechCrunch. This could work in healthcare, financial services or anywhere it is imperative to hide sensitive data for testing and model building purposes.

He says that companies have traditionally had to create synthetic data manually, a highly tedious process that’s difficult to scale and prone to error. With generative AI put to work on the problem, you simply describe the kind of data you need. The software then looks at the characteristics of the actual dataset and creates a quality fake set for testing purposes without exposing any sensitive information.
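
To make the idea of learning a dataset's characteristics and then sampling a fake set from them concrete, here is a minimal sketch in plain Python. It is not SDV's algorithm or API. It is a deliberately simplified stand-in that models each column independently (mean and standard deviation for numeric columns, value frequencies for categorical ones) and ignores relationships between columns and between tables. The function names `fit_column_profiles` and `sample_synthetic_rows` and the sample rows are hypothetical, chosen only for illustration.

```python
import random
import statistics
from collections import Counter


def fit_column_profiles(rows):
    """Learn simple per-column statistics from a list of dict rows."""
    profiles = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric column: keep only the mean and standard deviation.
            profiles[column] = (
                "numeric",
                statistics.mean(values),
                statistics.pstdev(values),
            )
        else:
            # Categorical column: keep how often each value occurs.
            counts = Counter(values)
            profiles[column] = (
                "categorical",
                list(counts.keys()),
                list(counts.values()),
            )
    return profiles


def sample_synthetic_rows(profiles, n, seed=0):
    """Draw n fake rows from the learned profiles, one column at a time."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        row = {}
        for column, profile in profiles.items():
            if profile[0] == "numeric":
                _, mean, std = profile
                row[column] = rng.gauss(mean, std)
            else:
                _, categories, weights = profile
                row[column] = rng.choices(categories, weights=weights, k=1)[0]
        synthetic.append(row)
    return synthetic


# Hypothetical "real" data; the columns are deliberately non-identifying.
real_rows = [
    {"age": 34, "plan": "basic"},
    {"age": 51, "plan": "premium"},
    {"age": 29, "plan": "basic"},
    {"age": 45, "plan": "basic"},
]

profiles = fit_column_profiles(real_rows)
synthetic_rows = sample_synthetic_rows(profiles, 5)
```

Because categorical values are resampled from the originals, a sketch like this should never be pointed at identifying columns such as names or account numbers. Handling those safely, and preserving correlations across columns and tables, is the harder part that dedicated tools aim to solve.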

The founders began by creating open source tooling, which proved extremely popular and helped them test the core pieces of the software. “We’ve had over a million downloads and a lot of people who are active in our community,” VP of product Patki said. In fact, they have a Slack channel with over 1,000 people participating.

“And through that, I think first we get a lot of validation of our core algorithms. We have the confidence that it works, and if there’s a bug or anything our public open source users find them immediately and we’re able to address any issues,” she said.

The big difference between the open source version and the commercial enterprise one is scale. The enterprise version can handle up to 100 tables, while the open source version is designed to handle just a few. So far, customers have been building models based on upwards of 20 to 30 tables.

The company currently has 11 employees and plans to hire in the next year to get up to around 20, depending on how the business grows.

The startup’s $8.5 million in seed funding was led by Link Ventures and Zetta Venture Partners, with participation from Uncorrelated Ventures.
