Democratized Data is Democratized Knowledge

How DataFabrica is Bridging the Sensitive Data Gap

Sadrach Pierre, Ph.D.
DataFabrica

--

Image by Lukas on Pexels

Synthetic data usage is growing at a rapid pace. Gartner predicts that 60% of the data used to train machine learning models will be synthetic by 2024. Currently, most synthetic data is used to train deep learning models. This makes sense, as these algorithms typically require much more data than is available to perform well. However, the need for synthetic data is much broader than training deep learning models. High-quality synthetic data can help companies develop product prototypes, help research institutions test novel machine learning algorithms, and give educational institutions and platforms access to a wider variety of industry-specific use cases. With that in mind, industry-specific and use-case-specific data should be democratized, as doing so will advance how we solve problems for businesses, perform cutting-edge research, and teach the next generation of data scientists.

Synthetic data is artificially generated information typically used to represent real-life events and scenarios. Often, the data needed to analyze real-life events is sensitive, limited, or entirely unavailable. These limitations can stall the development lifecycle for data science, machine learning and engineering projects. Further, limited access to data prevents people who are otherwise interested from learning how to gain insight into certain types of business use cases.

Consider the healthcare example of predicting emergency room readmission. This use case requires highly sensitive patient data in the form of electronic health records (EHR), which are mostly inaccessible to the public. This is for good reason, since patient data is protected by the Health Insurance Portability and Accountability Act (HIPAA). Unfortunately, this limits who can access the data as well as who can learn about existing use cases in the healthcare space. Another example is medical diagnostic imaging data, which is used for tasks such as rare disease detection. Even for researchers in the space with access to this data, instances of patients with rare diseases are often too few to train a machine learning algorithm. Further, medical image data is also highly sensitive and inaccessible to the public.

DataFabrica is an online data marketplace that aims to fill the sensitive data gap. DataFabrica provides affordable, ready-to-use, realistic synthetic data in retail, healthcare and finance industry verticals. At DataFabrica we design and curate our data sets to accurately represent real business scenarios. In retail this includes business use cases such as customer segmentation, customer churn analysis, product recommendation and more. Within the healthcare industry we provide synthetic patient payer data and diagnostic imaging data for rare disease detection.

Check out the following articles which explore some of the available datasets on DataFabrica.

Healthcare Analytics

Exploring Healthcare Patient Payer Data in Python

Image by Pixabay on Pexels

Unlike many other types of data, payer claim data is protected by the Health Insurance Portability and Accountability Act (HIPAA). This makes it difficult for many health tech startups to innovate in the healthcare space. Synthetic payer claims data is a good option for small players in the space, as it enables companies to build proofs of concept (PoCs) without the hassle of acquiring sensitive patient data.

This blog tutorial walks through how to perform exploratory data analysis on the Synthetic Healthcare Patient Payer Claims data available on DataFabrica. The free tier is free to download, modify, and share under the Apache 2.0 license.

Using Pareto Analysis to Analyze Patient Readmission

Image Created by Author (This plot is for illustrative purposes and does not reflect real data)

This blog post walks through how to identify the top causes of patient readmissions using Pareto analysis. The Synthetic Healthcare Emergency Room Readmission data is available on DataFabrica. The free tier is free to download, modify, and share under the Apache 2.0 license.
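As a rough illustration of the idea behind Pareto analysis, the sketch below ranks causes by frequency and keeps the "vital few" that together account for a chosen share of cases. It is not the code from the linked tutorial: the cause labels, the `pareto_analysis` helper and the 80% threshold are all assumptions made for this example, and it uses only the Python standard library rather than the actual DataFabrica data.

```python
from collections import Counter

# Hypothetical readmission causes, one entry per readmitted patient.
# These labels are illustrative and do not come from the DataFabrica data.
readmission_causes = (
    ["infection"] * 5 + ["medication error"] * 3 + ["fall"] + ["other"]
)


def pareto_analysis(causes, threshold=0.8):
    """Rank causes by frequency and flag the 'vital few'.

    Returns (rows, vital_few), where rows holds
    (cause, count, cumulative_share) tuples sorted by descending count and
    vital_few lists the causes needed to reach `threshold` of all cases.
    """
    counts = Counter(causes)
    total = sum(counts.values())
    rows, vital_few, running = [], [], 0
    for cause, count in counts.most_common():  # most frequent first
        # A cause is part of the vital few if the threshold was not yet
        # reached before adding it.
        if running / total < threshold:
            vital_few.append(cause)
        running += count
        rows.append((cause, count, running / total))
    return rows, vital_few


rows, vital_few = pareto_analysis(readmission_causes)
for cause, count, cumulative_share in rows:
    print(f"{cause:<18}{count:>3}{cumulative_share:>8.0%}")
print("Vital few:", vital_few)
```

With real data, the same table is usually drawn as a bar chart of counts with a cumulative-percentage line on top, which makes the cutoff between the dominant causes and the long tail easy to see.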

Retail Analytics

Image by Andrea Piacquadio on Pexels

Customer Segmentation with Credit Card Transaction Data

Image Created by Author

This blog post walks through how to use recency, frequency and monetary (RFM) scores to generate customer segments using the Synthetic Credit Card Transaction data available on DataFabrica. The data contains synthetic credit card transaction amounts, credit card information, transaction IDs and more. The free tier is free to download, modify, and share under the Apache 2.0 license.
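To make the RFM idea concrete, here is a minimal sketch, assuming a tiny hand-written list of transactions rather than the actual DataFabrica file. The customer IDs, dates, amounts, the snapshot date and the helper names `rfm_table` and `rank_scores` are all hypothetical, and the three-level rank-based scoring is just one simple choice; the linked post may score customers differently.

```python
from datetime import date

# Hypothetical transactions: (customer_id, transaction_date, amount).
transactions = [
    ("A", date(2024, 1, 30), 50.0),
    ("A", date(2024, 1, 20), 70.0),
    ("A", date(2024, 1, 10), 30.0),
    ("B", date(2024, 1, 15), 20.0),
    ("B", date(2024, 1, 5), 25.0),
    ("C", date(2023, 12, 1), 10.0),
]


def rfm_table(transactions, snapshot_date):
    """Return {customer: (recency_days, frequency, monetary)}."""
    stats = {}
    for customer, day, amount in transactions:
        entry = stats.setdefault(customer, {"last": day, "n": 0, "total": 0.0})
        entry["last"] = max(entry["last"], day)  # most recent purchase
        entry["n"] += 1                          # number of purchases
        entry["total"] += amount                 # total amount spent
    return {
        customer: ((snapshot_date - s["last"]).days, s["n"], s["total"])
        for customer, s in stats.items()
    }


def rank_scores(values, n_bins=3, reverse=False):
    """Score each key from 1 (worst) to n_bins (best) by rank.

    Keys are ordered from worst to best; ties are broken by input order,
    which is a simplification a real analysis may want to refine.
    """
    ordered = sorted(values, key=values.get, reverse=reverse)
    n = len(ordered)
    return {key: rank * n_bins // n + 1 for rank, key in enumerate(ordered)}


rfm = rfm_table(transactions, snapshot_date=date(2024, 1, 31))
# Fewer days since the last purchase is better, so recency is ranked in reverse.
r = rank_scores({c: v[0] for c, v in rfm.items()}, reverse=True)
f = rank_scores({c: v[1] for c, v in rfm.items()})
m = rank_scores({c: v[2] for c, v in rfm.items()})
segments = {c: f"{r[c]}{f[c]}{m[c]}" for c in rfm}
print(segments)
```

In this toy data, customer A buys most recently, most often and spends the most, so A ends up with the top score on all three dimensions. Segment labels such as "loyal" or "at risk" are then typically assigned by grouping score combinations.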

Personalized Marketing with Customer Segmentation and Collaborative Filtering

This blog post discusses how to use customer segmentation and collaborative filtering to generate personalized product recommendations. It also uses the Synthetic Credit Card Transaction data.
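For readers new to collaborative filtering, the sketch below shows one common variant, user-based filtering with cosine similarity: products are suggested to a customer based on what similar customers bought. It is an illustration under stated assumptions, not the method from the linked post. The `purchases` dictionary, the customer and product names, and the `cosine` and `recommend` helpers are all made up for this example.

```python
from math import sqrt

# Hypothetical purchase counts: customer -> {product: number of purchases}.
purchases = {
    "alice": {"laptop": 1, "mouse": 2},
    "bob": {"laptop": 1, "mouse": 1, "keyboard": 1},
    "carol": {"mouse": 1, "keyboard": 2},
}


def cosine(u, v):
    """Cosine similarity between two sparse purchase vectors."""
    dot = sum(u[p] * v[p] for p in set(u) & set(v))
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(target, purchases, top_n=3):
    """Rank products the target has not bought, weighted by how similar
    each other customer is to the target."""
    own = purchases[target]
    scores = {}
    for other, basket in purchases.items():
        if other == target:
            continue
        similarity = cosine(own, basket)
        for product, count in basket.items():
            if product not in own:
                scores[product] = scores.get(product, 0.0) + similarity * count
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend("alice", purchases))
print(recommend("carol", purchases))
```

One way to combine this with segmentation is to restrict the pool of "other" customers to those in the target's own segment, so recommendations come from customers with similar spending behavior.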

Feel free to download and explore the free tier versions of the data!

Coming Soon…

We have many new data sets coming soon, including e-commerce sales data and medical diagnostic imaging data for Niemann-Pick disease detection. Stay tuned!

Conclusions

Anyone interested in learning industry-specific business use cases should, at the very least, be able to access data that realistically represents them. At DataFabrica, we believe that data democratization is also knowledge democratization. Whether you are a health tech startup, a research institution, or a new student of machine learning, realistic synthetic data can help power your use case.

Feel free to leave comments below, specifying which use-cases or types of synthetic data you would like us to design in the future. Thank you for reading!

--

Sadrach Pierre, Ph.D.
DataFabrica

Writer for Built In & Towards Data Science. Cornell University Ph.D. in Chemical Physics.