In recent years, data lakes have emerged as an effective and efficient support for extracting information and knowledge from huge numbers of highly heterogeneous and rapidly changing data sources. Data lake management requires new techniques, very different from those adopted for data warehouses in the past. One of the main issues to address in this scenario is the extraction of thematic views from the (very heterogeneous and generally unstructured) data sources of a data lake. In this paper, we propose a new network-based model to uniformly represent the structured, semi-structured and unstructured sources of a data lake. Then, we present a new approach to “structure”, at least partially, unstructured data. Finally, we define a technique to extract thematic views from the sources of a data lake, based on similarity and other semantic relations among the metadata of the data sources.
An approach to extracting thematic views from highly heterogeneous sources of a data lake / Diamantini, C.; Lo Giudice, P.; Musarella, L.; Potena, D.; Storti, E.; Ursino, D. - 2161:(2018). (Paper presented at the 26th Italian Symposium on Advanced Database Systems (SEBD 2018), held in Castellaneta Marina (TA) in June 2018).
An approach to extracting thematic views from highly heterogeneous sources of a data lake
C. Diamantini;D. Potena;E. Storti;D. Ursino
2018-01-01