Nikolaus Parulian: Collaborative Data Cleaning

Topic: Collaborative Data Cleaning 
Session Lead: Nikolaus Parulian
Time: 2022-10-12, Wednesday, 11 am – 12 pm (CDT)
Location: Zoom

Data cleaning and preparation are essential parts of data curation lifecycles and scientific workflows. It is often estimated that exploratory data mining and data cleaning take up as much as 80% of the effort in a scientific research pipeline. A data cleaning task can be tedious and error-prone for a single user: it involves a lot of exploration and iteration, especially when the curator finds many different problems in the dataset. Single-user data cleaning can also introduce bias, because the cleaning quality can only be as good as that curator’s knowledge. We therefore define collaborative data cleaning as assigning a data cleaning task to multiple data curators who work on the same dataset toward the same purpose. However, involving multiple users introduces new problems, such as planning and dividing tasks, disagreements over data changes, and conflicting process dependencies. Understanding these variations and analyzing the combined workflow are important if data curation is to evolve its data cleaning workflows and improve dataset quality. We hope that the model and framework discussed in this session can help improve the data cleaning pipeline through collaboration.