# Provenance and SHACL

Note: This post is an adaptation of Section 1.4 of my PhD thesis, and serves as a general intro to our paper on the subject. The purpose of publishing it as a standalone blogpost is to further disseminate my writings. It is part of a short series of posts representing the Introduction of my thesis:

- Section 1.1: SHACL in a nutshell
- Section 1.2: Expressiveness
- Section 1.3: Recursion
- Section 1.4: Provenance (this post)

In the previous posts, we have primarily discussed the fundamental semantics for SHACL. Here, we will explore another useful semantics: data provenance for SHACL. Some elements have already been discussed in a previous post.

Given a SHACL shapes graph and a data graph that conforms to that
shapes graph, we want to know what subset of the data graph is
*relevant* to decide that it conforms. This subset of the data is
called the *data provenance*.

There are many intuitions of what data is relevant. The provenance
could be seen as the subset of the graph "traced out" by the shape by
following the values of `sh:path`

in the property shapes. Another
intuition is that we would like to have a subset of the graph that the
validator "looked at" while validating the data. These notions are
useful but imprecise, as will be demonstrated later in this
section. Our approach is to at least take into consideration the
so-called *sufficiency* property of the provenance. The sufficiency
property states, informally, that the resulting provenance still
conforms to the shapes graph. Or, at the level of shapes (without
targets), all nodes from the input data graph still satisfies the same
shapes in the provenance. Sufficiency tells us that our provenance is
actually relevant in a very precise way — clearly, data that
contributes to conformance must be relevant.

Our discussion will focus on the provenance of a shape without target
declarations. So, given a data graph *G*, a shape *s*, and a node *n* from
\(G\) that conforms to \(s\), we will define the provenance which we call
the *neighborhood* of *n* in *G* according to *s*. This can then later be
generalized for shapes graphs (including target declarations) as
opposed to just shapes.

## Neighborhoods by example

We will use the following data graph for our examples:

:user_a a :Admin; :accesses :resource1 . :user_b a :Admin; :accesses :resource1 ; :accesses :resource2 . :user_c a :User; :accesses :resource1 ; :accesses :resource2 . :resource1 a :Resource . :resource2 a :Resource .

Suppose we have a shape defining a standard resource as one that is accessed by at least one user (or admin):

:standardResource a sh:NodeShape ; sh:class :Resource ; sh:property [ sh:path [ sh:inversePath :accesses ] ; sh:minCount 1 ; ] .

Clearly, both `:resource1`

and `:resource2`

satisfy the shape. We will
look at the neighborhood of `:resource1`

. Looking at the intuition
behind neighborhoods, the triple stating that `:resource1`

is a
`:Resource`

is definitely in the neighborhood, as well as *at least one*
of the triples stating that it is accessed by a user. However, there
is no order on the triples defined in RDF, so picking any of the three
triples is somewhat arbitrary. We could define provenance to be
non-deterministic, and say that there are three
neighborhoods. However, our approach is to have a *deterministic*
provenance definition and thus, we choose to define the neighborhood
as:

:resource1 a :Resource . :user_a :accesses :resource1 . :user_b :accesses :resource1 . :user_c :accesses :resource1 .

Note that this does not follow the intuition that we want to have the
subgraph that the validator looks at. Any reasonable validator would
only need to consider one `:accesses`

triple to know that `:resource1`

conforms to `:standardResource`

.

We will continue with another example. Consider the shape defining unpopular resources as resources that are accessed by at most two users:

:unpopularResource a sh:NodeShape ; sh:class :resource ; sh:property [ sh:path [ sh:inversePath :accesses ] ; sh:maxCount 2 ; ] .

Only `:resource2`

satisfies this shape. There are multiple options again
for what the neighborhood could be. There are two deterministic
options. Either the neighborhood contains all `:accesses`

triples or
none! According to our intuition of wanting to "trace out" the shape
the former is the suiting definition. However, our definition will
choose to not include any `:accesses`

triples in this case. Our design
philosophy here is that neighborhoods should be somehow
*minimal*. Choosing not to include any `:accesses`

triples is minimal, and
still makes `:resource2`

conform to the shape. Although this may be
counterintuitive, this design choice makes sense in light of
sufficiency. We will demonstrate a similar situation with a slightly
more complex shape containing the same ideas:

:specialResource a sh:NodeShape ; sh:class :resource ; sh:property [ sh:path [ sh:inversePath :accesses ] sh:minCount 1 ; sh:qualifiedValueShape [ sh:not [ sh:class :admin ]] ; sh:qualifiedMaxCount 1 ; ]

This shape consists out of three constraints. First, the node must be a resource. Second, the resource must be accessed by at least one user. Finally, at most one user that accesses this resource may not be an admin.

The resources described by this shape are special in the sense that they may only be accessed by admins, with at most one exception.

It is clear from the previous discussion that the neighborhood at least includes the following:

:resource2 a :resource . :user_b :accesses :resource2 . :user_c :accesses :resource2 .

This is due to the first two constraints. However, what do we do with the qualified max count? To include what is mentioned in the shape would result in adding the following triple to the neighborhood:

:user_c a :user .

Indicating that `:user_c`

is *not* an `:admin`

. However, this is exactly the
opposite of the information we need for conformance. What we *do*
need is the triple stating that `:user_b`

is indeed an
`:admin`

. The idea here is that we want to include in our
provenance all data proving that a certain node conforms to a
shape. Furthermore, in a shape where only the third constraint is
specified, returning nothing also seems to be a possibility for the
definition of neighborhood. However, we designed our neighborhood to
contain all necessary information to decide conformance, even when
other parts of the graph are added. Indeed, adding the triples

:user_b :accesses :resource2 . :user_c :accesses :resource2 .

would break this property. Having this property is technically necessary but also useful — it allows for neighborhood engines to be flexible and add more to the provenance, while never breaking sufficiency.

This example illustrates the intricacies of defining provenance that is sufficient. Here, we discussed the (qualified) cardinality constraint components, but defining neighborhoods for all shapes requires thinking about every feature in SHACL and choosing a suitable definition for each of them while keeping in mind sufficiency and our two design principles: determinism and minimality.

## Shape Fragments

This definition of neighborhoods can be viewed as an additional
semantics for SHACL defining a retrieval mechanism. We call these
retrieval semantics *shape fragments*.

Given a shape and a data graph, the shape fragment is simply the neighborhood of all nodes conforming to that shape. When there are targeting declarations, we only consider the nodes that are targeted (and include the information that targets them). It can similarly be defined for sets of shapes, or for shapes graphs.

One can imagine the usefulness of such a retrieval mechanism. When dealing with large RDF graphs, a shapes graph may only describe part of it that is relevant to the intended usage. Retrieving the shape fragment of the data graph then gives us a (possibly) smaller RDF graph which is easier to process, and still satisfies the constraints while containing relevant data.