An approach to extracting thematic views from highly heterogeneous sources of a data lake

Diamantini, C.; Lo Giudice, P.; Musarella, L.; Potena, D.; Storti, E.; Ursino, D.

In recent years, data lakes have emerged as an effective and efficient support for extracting information and knowledge from a huge number of highly heterogeneous and quickly changing data sources. Data lake management requires the definition of new techniques, very different from those adopted for data warehouses in the past. One of the main issues to address in this scenario is the extraction of thematic views from the (very heterogeneous and generally unstructured) data sources of a data lake. In this paper, we propose a new network-based model to uniformly represent the structured, semi-structured and unstructured sources of a data lake. Then, we present a new approach to “structure”, at least partially, unstructured data. Finally, we define a technique to extract thematic views from the sources of a data lake, based on similarity and other semantic relations among the metadata of the data sources.

An approach to extracting thematic views from highly heterogeneous sources of a data lake / Diamantini, C., Lo Giudice, P., Musarella, L., Potena, D., Storti, E., Ursino, D. - 2161:(2018). (The 26th Italian Symposium on Advanced Database Systems (SEBD 2018), Castellaneta Marina (TA), June 2018).