Sustainable Data Lakes for Extreme-Scale Analytics

Data lakes are raw data ecosystems in which large amounts of diverse data are retained and coexist. They facilitate self-service analytics for flexible, fast, ad hoc decision making.

SmartDataLake enables extreme-scale analytics over sustainable big data lakes. It provides an adaptive, scalable and elastic data lake management system that offers: (a) data virtualization for abstracting and optimizing access and queries over heterogeneous data; (b) data synopses for approximate query answering and analytics, enabling interactive response times; and (c) automated placement of data in different storage tiers, based on data characteristics and access patterns, to reduce costs.

The data lake’s contents are modelled and organised as a heterogeneous information network containing multiple types of entities and relations. Efficient and scalable algorithms are provided for: (a) similarity search and exploration for discovering relevant information; (b) entity resolution and ranking for identifying and selecting important and representative entities across sources; (c) link prediction and clustering for unveiling hidden associations and patterns among entities; and (d) change detection and incremental update of analysis results, enabling faster analysis of new data.

Interactive and scalable visual analytics include and empower the data scientist in the knowledge extraction loop, with functionalities for: (a) visually exploring and tuning the space of features, models and parameters, and (b) generating large-scale visualizations of spatial, temporal and network data.

The results of the project are evaluated in real-world use cases from the business intelligence domain, including scenarios for portfolio recommendation, production planning and pricing, and investment decision making. SmartDataLake will foster innovation and enable European SMEs to capitalize on the value of their own data lakes.
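To make the idea of data synopses concrete, here is a minimal sketch of sampling-based approximate query answering: a small uniform sample serves as a synopsis, and aggregates computed on it are scaled back up to estimate the full-table answer. The function names and the uniform-sampling strategy are illustrative assumptions, not the project's actual implementation.

```python
import random

def build_sample_synopsis(rows, fraction=0.01, seed=42):
    """Illustrative synopsis: keep a uniform random sample of the rows.

    Returns the sample together with the sampling fraction, which the
    estimator below needs to scale its result back up.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return [r for r in rows if rng.random() < fraction], fraction

def approximate_sum(synopsis, fraction, key):
    """Estimate the full-table SUM by scaling the sample aggregate."""
    return sum(key(r) for r in synopsis) / fraction
```

Because the synopsis is orders of magnitude smaller than the raw data, such estimators can answer aggregate queries at interactive speed, trading a small, quantifiable error for latency.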

Challenges

  1. Handling data heterogeneity
    How can I achieve flexibility for handling heterogeneous data with different models and formats, and at the same time offer high-performance queries and analytics?
  2. Reducing storage costs
    How can I take advantage of emerging storage tiering opportunities to reduce storage costs by optimizing data placement under dynamically changing data characteristics, access patterns and business needs?
  3. Making sense of the data
    How can I resolve different types of entities across multiple sources, mine different types of relations and associations, and find patterns in the data?
  4. Monitoring changes
    How can I detect changes resulting from newly collected data, and their impact on my analysis?
  5. Supporting the human in the loop
    How can I visually and interactively explore the data to extract insights, formulate hypotheses, try different analyses, and compare the effects of different parameters?
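Challenge 2 above, optimizing data placement across storage tiers, can be illustrated with a minimal greedy sketch: rank datasets by access heat per byte and fill the fast tier until its capacity budget is exhausted. This is a simplified assumption for illustration; the project's actual tiering engine also reacts to dynamically changing access patterns and business needs.

```python
def assign_tiers(datasets, fast_capacity):
    """Greedy tiering sketch: hottest bytes go to the fast tier.

    datasets: list of dicts with "name", "size" (bytes) and "accesses"
    (recent access count); fast_capacity: byte budget of the fast tier.
    """
    # Rank by access heat per byte, so small, frequently read datasets win.
    ranked = sorted(datasets, key=lambda d: d["accesses"] / d["size"], reverse=True)
    placement, used = {}, 0
    for d in ranked:
        if used + d["size"] <= fast_capacity:
            placement[d["name"]] = "hot"
            used += d["size"]
        else:
            placement[d["name"]] = "cold"
    return placement
```

In practice, placement would be recomputed as access statistics drift, migrating data between tiers to keep storage costs low without hurting query latency on hot data.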

Results

  1. Virtualized, Adaptive and Transparent Data Access and Storage Tiering Engine
    A distributed and elastic data management system for in situ query processing, adaptive indexing, data summarization, approximate query answering, and automatic storage tiering.
  2. Heterogeneous Information Network Mining
    A software library for mining Heterogeneous Information Networks, including algorithms for entity resolution, similarity search, entity ranking, link prediction, community detection, and change detection.
  3. Scalable and Interactive Visual Analytics
    A visual analytics engine for generating different types of scalable and interactive visualizations for geospatial, temporal and graph data.
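As a concrete illustration of similarity search over a heterogeneous information network (result 2 above), the sketch below computes PathSim-style similarity along a symmetric meta-path such as Company-Sector-Company. The input encoding and function name are assumptions for this sketch; the project's library is not reproduced here.

```python
def pathsim(links):
    """PathSim similarity over a symmetric two-hop meta-path.

    links: dict mapping each entity of the start type to the set of
    mid-type entities it is connected to (e.g. company -> sectors).
    With 0/1 links, the number of meta-paths between x and y is the
    size of the intersection of their neighbour sets.
    """
    def paths(x, y):
        return len(links[x] & links[y])

    sims = {}
    for x in links:
        for y in links:
            # PathSim: twice the cross-path count, normalized by the
            # self-path counts of both endpoints.
            denom = paths(x, x) + paths(y, y)
            sims[(x, y)] = 2 * paths(x, y) / denom if denom else 0.0
    return sims
```

The normalization favours entities with genuinely overlapping connectivity rather than merely highly connected hubs, which is why meta-path measures of this kind are popular for entity ranking and exploration in heterogeneous networks.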

Partners

Funding

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 825041.