Description

Title RDFVault: A Compact String Pool For RDF Data
Abstract Resource Description Framework (RDF) is a data model for representing knowledge and resources on the World Wide Web as a directed graph. Each RDF statement is a triple (s, p, o) of three terms, in which the predicate p represents an edge (a relationship) from node s (the subject) to node o (the object), and each node usually represents a resource on the web. The simplicity of this data model, together with the assignment of universally unique identifiers (called IRIs) to resources across the web, makes it much easier to integrate data from different sources. However, this simplicity comes at the expense of high redundancy and bulkiness in RDF datasets, which can result in wasteful memory usage at processing time. Currently, the best approach to this issue is to use a string pool. A string pool is simply a table that keeps track of all strings currently held in main memory; every new string is checked against this table to avoid creating duplicate strings in memory. Existing hash-table-based string pools can reduce redundancy by eliminating duplicate terms, but they cannot exploit the high degree of similarity and the common prefixes that abound in RDF data to improve memory efficiency. Therefore, we propose RDFVault, a compact trie-based string pool for storing and retrieving RDF terms in main memory that exploits the characteristics of RDF data to improve memory efficiency as much as possible. Our experiments show that RDFVault consumes up to 5 times less memory than hash-table-based string pools, with a small performance overhead.
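
The contrast between the two kinds of string pool described in the abstract can be illustrated with a minimal Python sketch. This is not RDFVault's implementation: the class names `HashStringPool` and `TrieStringPool`, the `intern` method, the `node_count` property, and the example IRIs are all hypothetical, and a plain character-level trie is assumed purely for illustration. A dictionary-based trie in Python is not memory-efficient in itself; the sketch only shows that terms with a common prefix share trie nodes, whereas a hash table stores each distinct term in full.

```python
class HashStringPool:
    """Hash-table pool (illustrative): one stored copy per distinct term."""

    def __init__(self):
        self._table = {}

    def intern(self, term):
        # Return the copy stored earlier if an equal term was already seen;
        # otherwise store this term and return it.
        return self._table.setdefault(term, term)

    def __len__(self):
        return len(self._table)


class TrieStringPool:
    """Character-level trie pool (illustrative): shared prefixes share nodes."""

    def __init__(self):
        self._root = {}      # each node is a dict: character -> child node
        self._nodes = 0      # number of trie nodes created so far
        self._next_id = 0    # next integer identifier to hand out

    def intern(self, term):
        node = self._root
        for ch in term:
            child = node.get(ch)
            if child is None:
                # Only the part of the term not already in the trie
                # causes new nodes to be created.
                child = {}
                node[ch] = child
                self._nodes += 1
            node = child
        # The key None marks the end of a stored term and holds its id.
        if None not in node:
            node[None] = self._next_id
            self._next_id += 1
        return node[None]

    @property
    def node_count(self):
        return self._nodes


# Hypothetical RDF terms with a long common prefix and one duplicate.
terms = [
    "http://example.org/resource/Alice",
    "http://example.org/resource/Alan",
    "http://example.org/resource/Alice",
]

hash_pool = HashStringPool()
trie_pool = TrieStringPool()
ids = [trie_pool.intern(t) for t in terms]
for t in terms:
    hash_pool.intern(t)

print("distinct terms in hash pool:", len(hash_pool))
print("characters stored by hash pool:", sum(len(t) for t in set(terms)))
print("trie nodes:", trie_pool.node_count)
print("trie ids:", ids)
```

In this sketch both pools eliminate the duplicate term, but only the trie avoids storing the shared prefix `http://example.org/resource/Al` twice. A compact trie such as the one the abstract proposes would additionally need a space-efficient node layout, which this sketch does not attempt.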