# Shape Fragments and Subgraphs

In this post I'd like to explain some of the ideas from our recent researchpaper, starting small to illustrate the design choices and considerations. The paper is titled Data Provenance for SHACL (formerly Shape Fragments) and in its essence, it describes how the constraint language SHACL can be used to specify subsets of an RDF graph.

The idea described previously seems simple. We take a data graph, we take a shapes graph, then we "trace out" the shapes from the shapes graph from the data graph and, voila, we're done with it. Of course, this leaves out what it means to "trace out" shapes from a graph. Consider the following example shape:

:SocialShape a sh:PropertyShape ; sh:targetClass foaf:Person ; sh:path foaf:knows ; sh:minCount 1 .

Intuitively, you want only consider nodes that adhere to the shape and
you want to "follow properties" to create a subgraph. In this example,
you take all triples where the subject is the focus node and which
also have `foaf:knows`

as a predicate. Consider the following data
graph:

:maxime a foaf:Person ; foaf:givenName "Maxime" ; foaf:knows :thomas ; foaf:knows :anastasia ; foaf:knows :jan .

Using the ideas formulated earlier, we obtain the subgraph:

:maxime foaf:knows :thomas ; foaf:knows :anastasia ; foaf:knows :jan .

We believe this to be a simple and intuitive definition for the given
shape. You may argue that because `:SocialShape`

mentioned the
`rdf:class`

, we also want it in the subgraph. We would agree and have
defined it as such in the paper.

You can imagine trying to define subgraphs for every possible SHACL
construct, but few definitions are as straightforward as the one
discussed above. The one defined above isn't even that obvious. You
could ask yourself: why does the subgraph contain *all* outgoing
`foaf:knows`

edges? One philosophy could be to let the subgraph only
contain "just enough". After all, `:SocialShape`

only says there needs
to be at least one.

This brings us to one of the principles we followed when defining the
subgraphs: **determinism**. In the `:SocialShape`

case, because triples in
an RDF graph are not sorted, the only choice we have is returning all
the triples.

Let's consider another example:

:AntisocialShape a sh:PropertyShape ; sh:targetClass foaf:Person ; sh:path foaf:knows ; sh:maxCount 2 .

Here, you are antisocial when you know at most two others. Consider a new data graph:

:bob a foaf:Person ; foaf:givenName "Bob" ; foaf:knows :alice ; foaf:knows :tim .

What would be a natural definition for a subgraph given our
`:AntisocialShape`

? Keeping in mind our principle of determinism, we are
left with two choices. Both contain the triple `:bob a foaf:Person`

as
we discussed earlier. The first option is the subgraph constaining all
`foaf:knows`

triples where `:bob`

is the subject. The second option
contains only the above mentioned triple. Both options seem
reasonable, and we opted for the latter.

The reason being that this somehow comes closer to another underlying
principle: **minimality**. We chose the smallest subgraph we can while
somehow still maintaining the essence of the original data graph.

This leads to the following observation: the empty subgraph also is minimal and deterministic. This is the essence of the major open problem within this work: we want formally defined "postulates" which lead us to a definition of Shape Fragments. The principles of determinism and minimality are just design guidelines.

Nevertheless, I believe these design guidelines together with our
proposal for the definition of subgraphs are reasonable. This believe
is strengthened by one of the formal contributions of the paper: the
Sufficiency Lemma (and its corollary). The notion of Sufficiency is
borrowed from work on database Provenance, which is highly
relevant. Informally, the lemma states that for every SHACL construct
(like the ones demonstrated by `:SocialShape`

and `:AntisocialShape`

)
defines a subgraph which contains enough triples such that the shape
used for defining it still holds for the same nodes in the
subgraph. Even when you add more triples from the original graph to
the subgraph. In short, shapes that hold in the original graph, also
hold in every subgraph that contains *at least* the triples provided by
our definition. The "at least" is important here. For example, it
means that the choice of minimality in the subgraph definition of
`:AntisocialShape`

is not necessary for our result to hold.

Finally I would like to note that the Suffiency property of our definition captures some intuition that subgraphs given by shapes need to still adhere to these shapes. This gives us a stronger link to the definition of SHACL as a constraint language.

Hopefully this short post raises your interest in our work. There are interesting problems to solve here, both for more theoretically minded people (I'm thinking of the above mentioned postulates and other properties of the subgraphs) and more practically minded peope (Relating to the implementation and use-case side of things). The paper also discusses some preliminary results on implemenatation which may be of interest.