Moritz Thaddaus Staudinger: Can you cite databases?

Topic: Can you cite databases? A study on reproducible querying in evolving, schema-changing databases
Session Lead: Moritz Thaddaus Staudinger
Time: 2022-10-26, Wednesday, 11 am – 12 pm (CDT)
Location: Zoom
Box-folder: [Link]

Reproducibility is a desirable characteristic of research. Scientific experiments, which are essential for expanding global knowledge and advancing the state of research, start with researchers collecting and analyzing data. As research shifts more and more toward data-intensive approaches, it becomes increasingly important to cite the datasets that were used as well.

Therefore, the scientific community established the FAIR principles (Wilkinson et al., 2016) to make data findable, accessible, interoperable and reusable. For static data, these principles can be followed by depositing a dataset in a file-based data repository such as InvenioRDM. When the underlying data evolves regularly (e.g. sensor data is added hourly), however, this would cause a huge increase in storage load, because a whole database dump needs to be stored for every version. Other approaches therefore need to be evaluated and possibly implemented to minimize the storage overhead of keeping all the different versions. Some valid approaches are using Git to track only the differences between two files, or using tuple/record-based versioning, which tracks changes per entry in a database. To assist with the implementation of custom approaches for the persistent identification of arbitrary subsets of evolving datasets, the Research Data Alliance published 14 recommendations for data citation in dynamic environments (Rauber et al., n.d.).

In this talk we will explore options for solving these problems and present a state-of-the-art solution on an actual research database. We will also discuss whether this approach is generalizable and which aspects you need to account for.

References

Rauber, A., Asmi, A., van Uytvanck, D., & Pröll, S. (n.d.). Data Citation of Evolving Data. 2.

Wilkinson, M. D., Dumontier, M., Aalbersberg, IJ. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., … Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, Article 160018. https://doi.org/10.1038/sdata.2016.18