openclean - An Open-Source Data Cleaning Library for Python
Speaker: Heiko Mueller, New York University
The negative impact of poor data quality on the widespread and profitable use of machine learning makes data cleaning essential in many data science projects [1]. Improving data quality requires data profiling and exploration to gain an understanding of quality issues, and data cleaning to transform the data into a state that is fit for purpose. This process is tedious and costly as it can require a significant amount of time [2].
Over the years, many tools for profiling, preparing, and cleaning data have been developed. These approaches were developed in isolation and in different programming languages with no standardized interfaces. Thus, it is difficult for data scientists to combine existing tools in their data processing pipelines.
In this presentation, I start by motivating the need for data profiling and data cleaning. Based on examples from Open Data sources (e.g., New York City Open Data [3]), I discuss approaches for solving some of the common data quality issues encountered in these data sets. I then demonstrate the use of openclean, an open-source Python library for data profiling and data cleaning [4]. Inspired by the wide adoption of generic machine learning frameworks such as scikit-learn, TensorFlow, and PyTorch, openclean aims to provide a unified framework for practitioners that brings together open-source data profiling and data cleaning tools in an easy-to-use environment. By doing so, openclean will not only make it easier for users to access state-of-the-art algorithms for their data cleaning efforts, but also allow researchers to integrate their work and evaluate its effectiveness in practice. We therefore envision openclean as a first step toward building a community of practitioners and researchers in the field.
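To make the kind of quality issue discussed above concrete, the following is a minimal sketch of profiling and standardizing a single column. It deliberately uses only the Python standard library and does not show the openclean API; the sample values and the helper names (profile_column, find_variants, standardize) are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical sample values, invented for illustration (not taken from a
# real data set). They mimic inconsistent spellings of borough names.
boroughs = ["Brooklyn", "BROOKLYN", " Queens", "Queens", "", "Bronx", "brooklyn", None]


def is_empty(value):
    """Treat None and whitespace-only strings as missing values."""
    return value is None or value.strip() == ""


def profile_column(values):
    """Return basic profiling statistics for a list of cell values."""
    non_empty = [v for v in values if not is_empty(v)]
    return {
        "total": len(values),
        "empty": len(values) - len(non_empty),
        "distinct": len(set(non_empty)),
        "counts": Counter(non_empty),
    }


def find_variants(values):
    """Group distinct values that differ only in case or outer whitespace."""
    groups = defaultdict(set)
    for v in values:
        if not is_empty(v):
            groups[v.strip().lower()].add(v)
    # Keep only keys that have more than one spelling.
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


def standardize(values):
    """Trim whitespace and title-case values; map missing values to None."""
    return [None if is_empty(v) else v.strip().title() for v in values]


profile = profile_column(boroughs)
variants = find_variants(boroughs)
cleaned = standardize(boroughs)

print(profile["total"], profile["empty"], profile["distinct"])
print(variants)
print(cleaned)
```

Profiling first (counting empty and distinct values, then grouping near-duplicate spellings) shows which issues exist before any transformation is applied; the standardization rule used here is only one simple assumption and would need to be adapted to the actual data.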
[1] https://hbr.org/2018/04/if-your-data-is-bad-your-machine-learning-tools-are-useless
[3] https://opendata.cityofnewyork.us/
[4] https://github.com/VIDA-NYU/openclean