Context Associations in RDF

Living Document

This version:
https://w3id.org/context-associations
Issue Tracking:
GitHub

Abstract

This specification defines a minimal, generic RDF pattern for associating contextual information with concrete sets of target data in RDF using links between named graphs. It defines a method-specific encoding strategy and an associated uniform SPARQL-based decoding strategy, so that other annotation methods can be encoded into the Context Association format for uniform storage, exchange, discovery and querying.

1. Introduction

Different applications in the RDF domain make use of different vocabularies, annotation models (reification, triple terms, named graphs, singleton properties, resource referencing) and annotation methods (concrete applications of models with added behavior, such as VCs, nanopublications, ...). Integrating contextualized data from these applications into local RDF knowledge graphs therefore produces a heterogeneous ecosystem of annotations.

The aim of Context Associations is to present a minimal annotation method for RDF that other annotation methods can reuse to define concrete associations of contextual information with well-defined target boundaries, expressed as named graphs within the bounds of the local RDF knowledge graph.

The specification defines

  1. A core data model that defines the rules for generating named graphs and the terms used to link them.
  2. An encoding guide that defines how annotation methods in RDF can be converted to, or made compatible with, the Context Association format.
  3. A decoding guide that specifies the SPARQL setup necessary to retrieve all data graphs, and their associated metadata graphs, for which a user query holds.

This way, Context Associations defines a minimal pattern for the exchange, storage and querying of statements and their associated meta-information in local RDF graphs. The integration of remote resources through URI references is out of scope, as no uniform semantics or approach exists for how these should be managed.

2. Terminology

The following terminology is used throughout the specification document:

3. Structural Vocabulary and Predicate Stratification (normative)

Context Associations defines a closed set of structural predicates. These predicates are used exclusively to describe relationships between graphs, encodings, and context units.

Structural predicates MUST NOT contribute to the data content of a graph. They exist solely to describe structure, provenance, or grouping, and MUST be ignored when interpreting the data semantics of RDF graphs.

3.1. Anchor properties

3.2. Preservation properties

4. Minimal semantics (normative)

Let SP be the set of all structural predicates defined above.

For any named graph G, the content triples of G, denoted CT(G), are defined as:

CT(G) = { ⟨s,p,o⟩ ∈ G | p ∉ SP }

Given an RDF dataset D:

A context graph MAY itself be referenced as a target graph. In that case, its interpretation is still based on its content triples CT(C).
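
The definition of CT(G) can be illustrated with a minimal Python sketch. It assumes triples are plain tuples of strings, and it assumes SP contains only ca:aboutGraph and ca:originalName, the two structural predicates that appear elsewhere in this document; the full set is not reproduced here.

```python
CA = "https://w3id.org/context-associations#"

# Assumption of this sketch: SP holds only the two structural predicates
# that are used elsewhere in this document.
STRUCTURAL_PREDICATES = frozenset({CA + "aboutGraph", CA + "originalName"})


def content_triples(graph):
    """Return CT(G): the triples of G whose predicate is not in SP."""
    return {(s, p, o) for (s, p, o) in graph if p not in STRUCTURAL_PREDICATES}


# A context graph holding one anchor triple and one content triple.
ctx_graph = {
    ("_:ctxG", CA + "aboutGraph", "_:dataG"),
    ("_:dataG", "https://example.org/pred/retrievedFrom", "https://source.example/"),
}

for triple in content_triples(ctx_graph):
    print(triple)
```

Only the retrievedFrom triple is kept, since the anchor triple is structural and does not contribute to the data content of the graph.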

5. Examples (informative)

5.1. Simple metadata about a data graph

_:dataG {
  <https://example.org/x> <https://example.org/y> <https://example.org/z> .
}
_:ctxG {
  _:ctxG <https://w3id.org/context-associations#aboutGraph> _:dataG .
  _:dataG <https://example.org/pred/retrievedFrom> <https://source.example/> .
}

5.2. Chaining (context about context) and avoiding aliasing

_:dataG {
  <https://example.org/x> <https://example.org/y> <https://example.org/z>  .
}

_:policyG {
  _:policyG <https://w3id.org/context-associations#aboutGraph> _:dataG  .
  <https://example.org/policy/pol1> <https://example.org/ns#type> <https://example.org/ns#Policy> .
  <https://example.org/policy/pol1> <https://example.org/ns#allowedPurpose> <https://example.org/ns#Analytics> .
}

_:sigG {
  _:sigG <https://w3id.org/context-associations#aboutGraph> _:policyG .
  <https://example.org/sig/sig1> <https://example.org/ns#type> <https://example.org/ns#Signature> .
  <https://example.org/sig/sig1> <https://example.org/ns#signatureValue> "..." .
  <https://example.org/sig/sig1> <https://example.org/ns#verificationKey> <did:example:123#k1> .
}

6. Querying Context Associations

Context Associations provide a uniformly queryable annotation method in RDF. For a given target BGP, the context associated with the graphs matching this BGP can be constructed as follows:

PREFIX ca: <https://w3id.org/context-associations#>
PREFIX ex: <http://example.org/ns#>

CONSTRUCT {
  GRAPH ?source {
    ?s ?p ?o
  }
}
WHERE {

  # 1. Find target graph(s) via BGP
  GRAPH ?target {
    { TARGET BGP }
  }

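  # 2. Recursive closure over aboutGraph using the union graph.
  #    ?target is bound in step 1. A BIND nested in its own group cannot
  #    see that binding, so the fallback to the target is applied below.
  OPTIONAL {
    GRAPH <urn:x-arq:UnionGraph> {

      # backward closure: linked →* target
      {
        ?linked ca:aboutGraph* ?target .
      }

      UNION

      # forward closure: target →* linked
      {
        ?target ca:aboutGraph* ?linked .
      }
    }
  }

  # include target itself, also when no aboutGraph triple mentions it
  BIND(COALESCE(?linked, ?target) AS ?source)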

  # 3. Extract full graph content
  GRAPH ?source {
    ?s ?p ?o
  }
}
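
The traversal performed by this query can also be sketched in memory. The following Python sketch assumes quads are (subject, predicate, object, graph) tuples of strings and, like the query, follows ca:aboutGraph links regardless of the graph in which they are stated. The function name `related_graphs` is hypothetical.

```python
from collections import deque

CA_ABOUT = "https://w3id.org/context-associations#aboutGraph"


def related_graphs(quads, target):
    """Graph names reachable from `target` by following ca:aboutGraph links
    only forwards or only backwards, plus `target` itself."""
    forward, backward = {}, {}
    for s, p, o, _g in quads:
        if p == CA_ABOUT:
            forward.setdefault(s, set()).add(o)
            backward.setdefault(o, set()).add(s)

    def closure(edges):
        # Breadth-first search; `seen` also guards against cycles.
        seen, todo = {target}, deque([target])
        while todo:
            for nxt in edges.get(todo.popleft(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
        return seen

    return closure(forward) | closure(backward)


# The chain of example 5.2, plus one unrelated graph.
quads = [
    ("https://example.org/x", "https://example.org/y", "https://example.org/z", "_:dataG"),
    ("_:policyG", CA_ABOUT, "_:dataG", "_:policyG"),
    ("_:sigG", CA_ABOUT, "_:policyG", "_:sigG"),
    ("https://example.org/a", "https://example.org/b", "https://example.org/c", "_:otherG"),
]
print(sorted(related_graphs(quads, "_:dataG")))
```

Starting from the data graph, the sketch returns the data, policy and signature graphs and leaves out the unrelated graph. The content of each returned graph can then be collected by filtering the quads on their graph term.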

7. Graph identifiers and exchange profiles

Graph names may be blank nodes.

This specification recommends using blank nodes for graph names participating in context associations when the goal is to avoid unintended co-reference/merge across unrelated datasets.

7.2. Skolemized identifiers (optional)

Implementations may replace blank nodes with Skolem IRIs, and systems doing so should mint globally unique IRIs for each blank node.

Skolemization can improve addressability and compatibility with systems/protocols that require graph IRIs (for example, managing graphs via the SPARQL Graph Store Protocol assumes IRI-addressable graphs over HTTP).
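
As an illustration, Skolemization of a dataset could look like the following Python sketch. It assumes quads are tuples of strings in which blank nodes carry the `_:` prefix and literals are written with their surrounding quotes. The `https://example.org` authority is a placeholder, and the `/.well-known/genid/` path follows the convention suggested by RDF 1.1 Concepts for Skolem IRIs.

```python
import uuid


def skolemize(quads, authority="https://example.org"):
    """Replace every blank node label ("_:x") by a freshly minted Skolem IRI.

    Returns the rewritten quads and the blank-node-to-IRI mapping, so that
    the replacement can be reversed by the party that minted the IRIs.
    """
    minted = {}

    def skolem(term):
        if not term.startswith("_:"):
            return term  # IRIs and quoted literals are left untouched
        if term not in minted:
            minted[term] = f"{authority}/.well-known/genid/{uuid.uuid4().hex}"
        return minted[term]

    return [tuple(skolem(term) for term in quad) for quad in quads], minted


example = [
    ("_:ctxG", "https://w3id.org/context-associations#aboutGraph", "_:dataG", "_:ctxG"),
    ("https://example.org/x", "https://example.org/y", "https://example.org/z", "_:dataG"),
]
skolemized, mapping = skolemize(example)
for quad in skolemized:
    print(quad)
```

Each blank node is mapped to exactly one IRI, so the link between the context graph and the data graph is preserved after replacement.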

Import guidance (non-normative):

8. Encoding source annotation methods to Context Associations (normative)

Note: The following algorithm presumes that a reference to an identifier used as a named graph identifier can be interpreted as a reference to the contents of that named graph. Where this is not the desired behavior, or where contextual associations are linked through other means, step 3 of the algorithm should be reviewed manually or replaced by different logic.

For any input RDF Dataset Dsource, we generate the dataset D by replacing every Blank Node identifier B in Dsource by a newly minted skolem identifier SkolemB. From this, we build the resulting encoded dataset D' as follows:

# Step 1. Encode the default graph
If the default graph of `D` is not empty:

    Mint a new skolem identifier `Sdg`

    For all quads `q` of `D` with the default graph as graph term:

        Add the quad `q.subject, q.predicate, q.object, Sdg` to `D'`

    Add the quad `Sdg ca:originalName ca:DefaultGraph Sdg` to `D'`

# Step 2. Encode named graphs
For each named graph term `G` in `D`:

    Mint a new skolem identifier `Sng`

    For all quads `q` of `D` with graph term `G`:

        Add the quad `q.subject, q.predicate, q.object, Sng` to `D'`

    Add the quad `Sng ca:originalName G Sng` to `D'`


# Step 3. Define context associations

    For each graph term `G` in `D'`:

        For each quad `q` with graph term `G`:

            If exists `X ca:originalName N X` in `D'`, where `sameTerm(q.subject, N)` OR `sameTerm(q.object, N)`:

                Add the quad `G ca:aboutGraph X G` to `D'`

    If local scoping (step 4) is not applied, output `D'`

# Step 4. Local scoping (optional)

    Let `D''` be an empty dataset

    For each graph term `G` in `D'`:

        Mint a blank node identifier `B`

        For each quad `q` in `D'`:

            if `sameTerm(q.graph, G)`:

                if `sameTerm(q.predicate, ca:originalName)`:

                    Add quad `B, q.predicate, q.object, B` to `D''`

                else if `sameTerm(q.predicate, ca:aboutGraph)`:

                    Let `B'` be the blank node identifier minted for the graph term `G'` with `sameTerm(q.object, G')`

                    Add quad `B, q.predicate, B', B` to `D''`

                else:

                    Add quad `q.subject, q.predicate, q.object, B` to `D''`

    Output `D''`

Steps 1 to 3 can be achieved with the following ARQ query:

PREFIX ca: <https://w3id.org/context-associations#>

CONSTRUCT {
  GRAPH ?Gid {
    ?s ?p ?o .
    ?Gid ca:originalName ?OriginalName .
    ?Gid ca:aboutGraph ?TargetGid .
  }
}
WHERE {

  ########################################
  # STEP A — global graph mapping (skolem IDs)
  ########################################
  {
    SELECT DISTINCT ?G ?Gid ?OriginalName WHERE {

      # Named graphs
      {
        GRAPH ?G { ?s ?p ?o }

        BIND(IRI(CONCAT("urn:graph:", ENCODE_FOR_URI(STR(?G)))) AS ?Gid)
        BIND(?G AS ?OriginalName)
      }

      UNION

      # Default graph (only if non-empty)
      {
        ?s ?p ?o .
        FILTER NOT EXISTS { GRAPH ?g { ?s ?p ?o } }

        BIND("DEFAULT" AS ?G)
        BIND(IRI("urn:graph:default") AS ?Gid)
        BIND(ca:DefaultGraph AS ?OriginalName)
      }
    }
  }

  ########################################
  # STEP B — assign triples to encoded graphs
  ########################################
  {
    # Named graphs
    {
      GRAPH ?G { ?s ?p ?o }
    }

    UNION

    # Default graph
    {
      ?s ?p ?o .
      FILTER NOT EXISTS { GRAPH ?g { ?s ?p ?o } }
      BIND("DEFAULT" AS ?G)
    }
  }

  ########################################
  # STEP C — context associations
  ########################################
  OPTIONAL {

    # Find graph name N mentioned in triple
    GRAPH ?N { ?nS ?nP ?nO }

    FILTER (
      sameTerm(?s, ?N) ||
      sameTerm(?o, ?N)
    )

    {
      SELECT DISTINCT ?N ?TargetGid WHERE {
        GRAPH ?N { ?x ?y ?z }

        BIND(IRI(CONCAT("urn:graph:", ENCODE_FOR_URI(STR(?N)))) AS ?TargetGid)
      }
    }
  }
}

Note: The use of blank node identifiers with subqueries in SPARQL has undefined outcomes, which leads to problematic behavior when processing. The SPARQL conversion is therefore constrained to steps 1 to 3.
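
Steps 1 to 3 of the encoding algorithm can also be sketched outside SPARQL. The Python sketch below assumes quads are (subject, predicate, object, graph) tuples of strings, with `None` as graph term for the default graph, and assumes blank nodes have already been replaced by skolem identifiers. The `urn:uuid:` minting scheme is a placeholder. Unlike the literal wording of step 3, the sketch skips the ca:originalName quads when looking for mentions of graph names, because each of these quads mentions the original name of its own graph; this is an assumption of the sketch, not a rule stated by the algorithm.

```python
import uuid

CA = "https://w3id.org/context-associations#"
ORIGINAL_NAME = CA + "originalName"
ABOUT_GRAPH = CA + "aboutGraph"
DEFAULT_GRAPH = CA + "DefaultGraph"


def mint():
    # Placeholder minting scheme; any globally unique IRI would do.
    return "urn:uuid:" + str(uuid.uuid4())


def encode(quads):
    """Encode a dataset following steps 1 to 3 (no local scoping)."""
    minted = {}  # original graph name -> newly minted graph identifier
    encoded = set()

    # Steps 1 and 2: move the default graph and every named graph under a
    # fresh identifier and record the original name inside the new graph.
    for s, p, o, g in quads:
        original = DEFAULT_GRAPH if g is None else g
        if original not in minted:
            minted[original] = mint()
            encoded.add((minted[original], ORIGINAL_NAME, original, minted[original]))
        encoded.add((s, p, o, minted[original]))

    # Step 3: a graph that mentions the original name of an encoded graph
    # as subject or object is linked to that encoded graph.
    for s, p, o, g in list(encoded):
        if p == ORIGINAL_NAME:
            continue  # assumption: bookkeeping quads create no links
        for term in (s, o):
            if term in minted:
                encoded.add((g, ABOUT_GRAPH, minted[term], g))
    return encoded


DATA, META = "https://example.org/graphs/data", "https://example.org/graphs/meta"
dataset = [
    ("https://example.org/x", "https://example.org/y", "https://example.org/z", None),
    ("https://example.org/a", "https://example.org/b", "https://example.org/c", DATA),
    (DATA, "https://example.org/pred/retrievedFrom", "https://source.example/", META),
]
for quad in sorted(encode(dataset)):
    print(quad)
```

In this example the metadata graph mentions the name of the data graph, so its encoded graph receives a single ca:aboutGraph link to the encoded data graph.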

9. Decoding Context Associations to their original dataset

A conforming CA processor must implement the following:

1.  Parse the input as an RDF dataset.

2.  For each named graph `C` with graph name `c`, find triples `(c, ca:aboutGraph, t)` **within `C`**.

3.  If there is exactly one such triple, treat `C` as a context graph targeting `t`. Otherwise, `C` is not a valid context graph under this specification.

4.  The association output is a set of pairs `(c -> t)` plus the set of context triples in `C` excluding the anchor triple.

5.  A processor must not assume that `t` is globally meaningful outside the dataset scope unless an application profile explicitly says so (e.g., trusted skolem IDs).

Note (informative): Applications that traverse context-association chains should avoid infinite recursion by tracking already-visited graph references.

This can be achieved by a combination of two queries, a first one to reconstruct the default graph of the original dataset, and a second one to reconstruct the original named graphs:

PREFIX ca: <https://w3id.org/context-associations#> 

# Reconstructing the default graph
CONSTRUCT {
  ?s ?p ?o .
}
WHERE {
  GRAPH ?G {
    ?G ca:originalName ca:DefaultGraph .
    ?s ?p ?o .
  }

  FILTER (
    !sameTerm(?p, ca:originalName) &&
    !sameTerm(?p, ca:aboutGraph)
  )
}

PREFIX ca: <https://w3id.org/context-associations#>

# Reconstructing the named graphs
CONSTRUCT {
  GRAPH ?name {
    ?s ?p ?o .
  }
}
WHERE {
  GRAPH ?G {
    ?G ca:originalName ?name .
    ?s ?p ?o .
  }

  FILTER (
    !sameTerm(?p, ca:originalName) &&
    !sameTerm(?p, ca:aboutGraph) &&
    !sameTerm(?name, ca:DefaultGraph)
  )
}
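
Processor steps 2 to 4 listed at the start of this section, together with the cycle guard from the informative note, can be sketched in Python as follows. The sketch assumes quads are (subject, predicate, object, graph) tuples of strings with `None` as graph term for the default graph; the function names `associations` and `chain` are hypothetical.

```python
CA_ABOUT = "https://w3id.org/context-associations#aboutGraph"


def associations(quads):
    """Map each valid context graph name c to (t, context triples of C)."""
    graphs = {}
    for s, p, o, g in quads:
        if g is not None:  # only named graphs can be context graphs
            graphs.setdefault(g, set()).add((s, p, o))

    result = {}
    for c, triples in graphs.items():
        # Step 2: anchor triples (c, ca:aboutGraph, t) stated within C.
        anchors = [t for t in triples if t[0] == c and t[1] == CA_ABOUT]
        # Step 3: exactly one anchor makes C a valid context graph.
        if len(anchors) == 1:
            # Step 4: the pair c -> t plus the triples of C minus the anchor.
            result[c] = (anchors[0][2], triples - {anchors[0]})
    return result


def chain(assoc, start):
    """Follow c -> t links from `start`, stopping when a graph is revisited."""
    seen, path, current = set(), [], start
    while current in assoc and current not in seen:
        seen.add(current)
        current = assoc[current][0]
        path.append(current)
    return path


NS = "https://example.org/ns#"
quads = [
    ("https://example.org/x", "https://example.org/y", "https://example.org/z", "_:dataG"),
    ("_:policyG", CA_ABOUT, "_:dataG", "_:policyG"),
    ("https://example.org/policy/pol1", NS + "type", NS + "Policy", "_:policyG"),
    ("_:sigG", CA_ABOUT, "_:policyG", "_:sigG"),
    ("https://example.org/sig/sig1", NS + "type", NS + "Signature", "_:sigG"),
]
assoc = associations(quads)
print(chain(assoc, "_:sigG"))
```

For the chained example the signature graph targets the policy graph, which in turn targets the data graph. The data graph has no anchor triple and is therefore not reported as a context graph.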

Appendix A: Considerations

This section discusses some considerations as to why certain decisions were made.

Why named graphs instead of reification / rdf-star / triple terms

Named graphs were chosen for two pragmatic reasons: they are an existing RDF standard that should be supported by all RDF 1.1 compatible tooling, and they inherently support annotating sets of triples instead of individual triples. Reification and triple terms can be modeled as part of a collection or other entity that defines a selection of triples, but these do not provide an inherent boundary within the RDF dataset; they rely fully on the documentation of the approach to indicate the intended boundary of that collection entity. Named graphs do not suffer from this problem: each provides an inherent boundary separating its triples from the default graph and the other named graphs of an RDF dataset.

Additionally, both Evaluation of Metadata Representations in RDF stores and an unpublished paper of our own found that named graphs provide performance competitive with other annotation methods.

Syntactic scope of the named graph

A referenced named graph can be interpreted in two ways in RDF: as the set of quads whose graph term equals the given graph name, or as the set of triples obtained by stripping the graph term from this set of quads.

To process a referenced graph as a set of triples, in a pass-by-value way, the named graph must be unpacked, for example with the GRAPH keyword in SPARQL, which unpacks a named graph into a graph of triples, or with graph terms in Notation3.

However, the core RDF specification does not provide such an unpacking mechanism. Working with a graph by value therefore requires a processing approach such as SPARQL, which allows both the graph identifier and the unpacked triples to be used at the same time.

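# Illustrative namespace for :issuer; not defined by this specification
PREFIX : <https://example.org/ns#>

CONSTRUCT {
    ?s ?p ?o.
    ?g :issuer ?issuer.
} WHERE {
    # Unpack the triples of the named graph ?g
    GRAPH ?g {
        ?s ?p ?o.
    }
    # Use the graph identifier itself as a subject
    ?g :issuer ?issuer.
}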

So unless a processing approach such as SPARQL can be enforced, the syntactic interpretation of a named graph must be constrained to its set of quads to retain consistency and functionality throughout the processing pipeline.

Semantic and syntactic interpretation of RDF named graphs

The semantics of named graphs have been discussed extensively in the semantic web community. Much of this discussion has been collected in On Semantics of RDF Datasets, a document published by the RDF working group.

The evaluation of entailment regimes over RDF graphs is closely tied to the unpacking of said graphs in the RDF dataset. Therefore, it is left out of scope for this specification.

Working with remote references

For practical purposes, we restrict the interpretation of graph references to local graphs. This goes for blank nodes, dereferenceable URIs and non-dereferenceable URIs.

Integrating remote graphs through dereferencing requires a more holistic approach that can resolve inconsistencies in the dataset resulting from a merge operation. A similar approach can be seen with the use of owl:imports in Notation3 reasoning.

Conformance

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

References

Normative References

[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://datatracker.ietf.org/doc/html/rfc2119