• The Big Data landscape continues to evolve. Until recently, Big Data focused on processing massive amounts of simple, flat data. Now there is a growing need to fuse complex data that comes from both inside and outside the company, with types and characteristics such as:

Complex Types

    • Databases
    • Documents
    • Spreadsheets
    • CRM/ERP
    • Relational databases
    • Public/industry repositories
    • News feeds
    • Client databases
    • Social networks

Complex Characteristics

    • Multiple databases, sources or silos
    • Unstructured (free form) text and documents
    • Complex schemas
    • 1,000s of tables
    • Data inside and outside corporation
    • Conceptually and contextually rich
    • Events: Geo-locations, temporal & social networks
    • Inconsistent naming conventions
    • Prediction/uncertainty/probability/learned behavior
    • Complex data inter-relationships of arbitrary length & depth
    • Inferred relationships, connections, patterns
    • Significant variety
    • 1,000s of classes/categories

Less expensive & more intelligent analytic frameworks

  • This evolution is driving the need for new, less expensive and more intelligent analytic frameworks to make better business decisions.
  • For many years, IT departments supported business analytics by investing substantial time and money in pre-processing data from various internal sources into data warehouses and data marts. With the addition of big and complex data, this approach is proving too slow and too inflexible, and its Total Cost of Ownership (TCO) has exploded. Additionally, data warehouses struggle to integrate data from outside the enterprise. The warehouse approach is broken for businesses that need better, faster analytics.
  • Data lakes are relatively new to the space and were built largely to address the TCO of data warehouses and the onslaught of Big Data. Unlike data warehouses, data lakes pre-process as little of the data as possible beforehand: all the data is tossed into the lake in its native form, and what is needed is fished out later. Essentially, they wait until the last possible moment to Extract, Transform, Load (ETL) and integrate the data, so-called late binding. That all sounds great, but tossing everything into the lake in native formats raises a number of challenges that need to be addressed:

More is needed for:

  • Semantics for consistent taxonomy
  • Meta-data management
  • Linking or integration of data – otherwise silos stay silos
  • Entity level access control/governance
  • Curation, provenance and known quality of the data

Gartner Group

Meeting the needs of wider audiences requires curated repositories with governance, semantic consistency and access controls.
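
To make the late-binding idea and the list of missing pieces more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any particular data lake product: the `LAKE` contents, the `FIELD_MAP` taxonomy and the function names are all hypothetical. Raw payloads stay in their native form (JSON and CSV), and parsing, field-name reconciliation and integration happen only when a question is asked.

```python
import csv
import io
import json

# Hypothetical "lake": payloads are stored untouched in their native format,
# with a small amount of metadata (source and format) recorded alongside.
LAKE = [
    {"source": "crm_export", "format": "json",
     "raw": '[{"cust_name": "Acme", "revenue": "1200"}]'},
    {"source": "finance_sheet", "format": "csv",
     "raw": "Customer,Rev\nAcme,300\nGlobex,50\n"},
]

# Hypothetical shared taxonomy: maps inconsistent source field names
# onto one consistent vocabulary.
FIELD_MAP = {
    "cust_name": "customer", "Customer": "customer",
    "revenue": "revenue", "Rev": "revenue",
}


def read_records(entry):
    """Parse one raw payload at read time and normalise its field names."""
    if entry["format"] == "json":
        rows = json.loads(entry["raw"])
    elif entry["format"] == "csv":
        rows = list(csv.DictReader(io.StringIO(entry["raw"])))
    else:
        raise ValueError("unsupported format: " + entry["format"])
    for row in rows:
        record = {FIELD_MAP[k]: v for k, v in row.items() if k in FIELD_MAP}
        record["provenance"] = entry["source"]  # keep track of where it came from
        yield record


def revenue_by_customer(lake):
    """Integrate all sources only when the question is asked (late binding)."""
    totals = {}
    for entry in lake:
        for rec in read_records(entry):
            name = rec["customer"]
            totals[name] = totals.get(name, 0.0) + float(rec["revenue"])
    return totals


if __name__ == "__main__":
    print(revenue_by_customer(LAKE))
```

Without the field mapping and the per-entry metadata, the two sources could not be linked and would remain separate silos, which is the gap the list above points to.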