- The Big Data landscape continues to evolve. Until recently, Big Data focused on processing massive amounts of simple, flat data. Now there is a growing need to fuse complex data from sources both inside and outside the company, such as:
- Relational databases
- Public/industry repositories
- News feeds
- Client databases
- Social networks
Complex Characteristics
- Multiple databases, sources or silos
- Unstructured (free form) text and documents
- Complex schemas
- 1,000s of tables
- Data inside and outside the corporation
- Conceptually and contextually rich
- Events: Geo-locations, temporal & social networks
- Inconsistent naming conventions
- Prediction/uncertainty/probability/learned behavior
- Complex data inter-relationships of arbitrary length & depth
- Inferred relationships, connections, patterns
- Significant variety
- Has 1,000s of classes/categories
Less expensive & more intelligent analytic frameworks
- This evolution is driving the need for new, less expensive and more intelligent analytic frameworks to make better business decisions.
- For many years, IT departments invested substantial time and money to pre-process data from various internal sources into data warehouses and data marts to support business analytics. With the addition of big and complex data, this approach is proving too slow and too inflexible, and its Total Cost of Ownership (TCO) has exploded. Additionally, data warehouses struggle to integrate data from outside the enterprise. The warehouse approach is broken for businesses that need better, faster analytics.
- Data lakes are relatively new to the space and were built largely to address the TCO of data warehouses and the onslaught of Big Data. Unlike data warehouses, data lakes pre-process as little of the data as possible: all the data is tossed into the lake in its native form, and what is needed is fished out later. Essentially, the Extract, Transform, Load (ETL) and integration work waits until the last possible moment, so-called late binding. That all sounds great, but tossing everything into the lake in native formats raises a number of challenges that need to be addressed:
More is needed for:
- Semantics for consistent taxonomy
- Meta-data management
- Linking or integration of data – otherwise silos stay silos
- Entity level access control/governance
- Curation, provenance and known quality of the data
Meeting the needs of wider audiences requires curated repositories with governance, semantic consistency and access controls.
- The late-binding approach of data lakes pushes the heavy lifting of integrating the data to later in the application's data-use cycle. It saves money up front but does nothing to reduce total cost or to solve the key business issue: making it easier and less costly to get information from data.
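To make the late-binding trade-off concrete, here is a minimal Python sketch. It is an illustration only, not the Franz system or any Hadoop API: the records, the source names (`crm`, `billing`), the field names and the helper functions are all hypothetical. Raw records stay in their native form, and every reader has to supply a field mapping and do the linking at read time.

```python
import json

# Hypothetical raw records, kept exactly as each source delivered them.
RAW_LAKE = [
    ("crm", '{"cust_name": "Acme Corp", "city": "Oakland"}'),
    ("billing", '{"CustomerName": "Acme Corp", "amount_usd": 1200}'),
    ("billing", '{"CustomerName": "Globex", "amount_usd": 300}'),
]

# Hypothetical read-time mapping from each source's field names
# to one shared vocabulary (the "consistent taxonomy" the lake lacks).
FIELD_MAP = {
    "crm": {"cust_name": "customer", "city": "city"},
    "billing": {"CustomerName": "customer", "amount_usd": "amount_usd"},
}


def read_bound(lake, field_map):
    """Parse and rename fields only when the data is read (late binding)."""
    for source, payload in lake:
        record = json.loads(payload)
        mapping = field_map[source]
        yield {mapping[k]: v for k, v in record.items() if k in mapping}


def link_by_customer(records):
    """Merge records that share a customer name into one linked view."""
    linked = {}
    for rec in records:
        linked.setdefault(rec["customer"], {}).update(rec)
    return linked


linked = link_by_customer(read_bound(RAW_LAKE, FIELD_MAP))
```

Storing the data cost nothing up front, but the mapping and linking logic still had to be written, and in this sketch it would be rewritten or rerun by every application that reads the lake. That is the deferred integration cost described above.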
The question becomes: How do you build a data lake for complex, linked or integrated sets of information without going broke?
The answer is a unique patent-pending approach from Franz that combines the scale of standard Hadoop with the intelligence of the AllegroGraph Semantic Graph to create a Semantic Data Lake system. To learn how the Franz Semantic Data Lake economically and effectively addresses the shortcomings of data lakes and data warehouses, see AllegroGraph and Hadoop – The Semantic Data Lake.