Semantics at scale – Hadoop and AllegroGraph

Los Alamos National Labs

Los Alamos National Labs built a Big Data solution combining Hadoop and the AllegroGraph semantic graph platform to identify people, their social networks and connectedness across cultural and linguistic backgrounds.


Their Goal:

Build a scalable Social Network Analysis application for processing terabytes of social data, using a demonstration dataset of bibliographic metadata to resolve authors, co-authors, their associated publications, and shared affiliations.

Their Challenges:

• A Big Data problem that cannot be solved with Hadoop alone

• Disambiguation of people’s names across spelling variants, nicknames, misspellings, and abbreviations

• Semi-structured data

• Scale to terabytes of content spanning multiple repositories and forms

• Uncover relationships not discoverable by traditional name matching
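To make the name-disambiguation challenge concrete, here is a minimal Python sketch of why exact string comparison fails on spelling variants and how a similarity score with a cutoff can flag candidate matches. It is an illustration only, not the Los Alamos implementation: the function names, the sample names, and the 0.85 threshold are all assumptions, and the standard-library `difflib` stands in for whatever matching logic the real system used.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in no_accents)
    return " ".join(cleaned.lower().split())


def name_similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def same_person_candidate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Flag a pair as a candidate match; the threshold is an assumed value."""
    return name_similarity(a, b) >= threshold


# Exact matching misses the variant, the similarity check flags it.
print("Jon Smith" == "John Smith")                       # exact comparison
print(same_person_candidate("Jon Smith", "John Smith"))  # fuzzy comparison
```

String similarity alone only produces candidates. As the last challenge above notes, some relationships are not discoverable by name matching at all, which is where additional evidence such as co-authors and shared affiliations comes in.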

The Solution:

Los Alamos achieved 99% accuracy in identifying and disambiguating people across terabyte-size data sets.

• Hadoop platform for:
    • Large dataset processing
    • Semi-structured data processing
    • Economical scalability of data storage and processing
    • Map-Reduce framework
    • Creation of semantic triples
• Mahout machine learning platform for:
    • Extraction of blocks of metadata from large XML fields
    • Machine learning for field information
    • Streaming input to Hadoop file system (HDFS)
    • Map-Reduce framework
• AllegroGraph Semantic Graph platform:
    • RDF, triple store and ontology platform
    • Resolution of ambiguous names, abbreviations
    • Identify affiliations/relationships
    • Threshold-based matching of people relationships
    • Analysis of connectedness, clusters of people and centrality
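The flow described above (a map-reduce pass that creates semantic triples, followed by graph analysis of connectedness and centrality) can be sketched in a few lines of plain Python. This is a toy, in-memory illustration under stated assumptions, not the Hadoop or AllegroGraph API: the sample records, the `ex:` predicate names, and every function name are hypothetical, and treating the affiliation as a property of the whole record is a simplification.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical bibliographic records standing in for the real metadata.
RECORDS = [
    {"id": "pub1", "authors": ["A. Jones", "B. Lee"], "affiliation": "Lab X"},
    {"id": "pub2", "authors": ["B. Lee", "C. Diaz"], "affiliation": "Lab Y"},
    {"id": "pub3", "authors": ["C. Diaz", "D. Kim"], "affiliation": "Lab Y"},
]


def map_record(record):
    """Map step: emit (subject, predicate, object) triples for one record."""
    for author in record["authors"]:
        yield (author, "ex:authorOf", record["id"])
        yield (author, "ex:affiliatedWith", record["affiliation"])
    for a, b in combinations(sorted(record["authors"]), 2):
        yield (a, "ex:coAuthorOf", b)


def reduce_triples(triples):
    """Reduce step: drop duplicate triples emitted from different records."""
    return sorted(set(triples))


def degree_centrality(triples):
    """Share of other people each person is directly linked to as co-author."""
    neighbors = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "ex:coAuthorOf":
            neighbors[subject].add(obj)
            neighbors[obj].add(subject)
    n = len(neighbors)
    if n < 2:
        return {}
    return {person: len(adj) / (n - 1) for person, adj in neighbors.items()}


triples = reduce_triples(t for record in RECORDS for t in map_record(record))
centrality = degree_centrality(triples)
for person in sorted(centrality):
    print(person, round(centrality[person], 2))
```

In a deployment like the one described, the map and reduce steps would run as distributed jobs and the resulting triples would be loaded into the triple store for querying. Degree centrality is used here only because it is the simplest centrality measure to show; the source does not say which measures were applied.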

The Benefits:

• Enables sophisticated techniques to connect people and the information about them when names do not match exactly
• System learns over time
• Architecture scales to real world needs

AllegroGraph turns complex data into actionable business insights