# Shape Fragments and Subgraphs

In this post I'd like to explain some of the ideas from our recent research paper, starting small to illustrate the design choices and considerations. The paper is titled "Data Provenance for SHACL" (formerly "Shape Fragments"), and in essence it describes how the constraint language SHACL can be used to specify subsets of an RDF graph.

The idea described previously seems simple. We take a data graph and a shapes graph, we "trace out" the shapes of the shapes graph in the data graph and, voilà, we're done. Of course, this leaves open what it means to "trace out" shapes in a graph. Consider the following example shape:

```
:SocialShape a sh:PropertyShape ;
    sh:targetClass foaf:Person ;
    sh:path foaf:knows ;
    sh:minCount 1 .
```

Intuitively, you want to consider only nodes that adhere to the shape, and you want to "follow properties" to create a subgraph. In this example, you take all triples that have the focus node as subject and `foaf:knows` as predicate. Consider the following data graph:

```
:maxime a foaf:Person ;
    foaf:givenName "Maxime" ;
    foaf:knows :thomas ;
    foaf:knows :anastasia ;
    foaf:knows :jan .
```

Using the ideas formulated earlier, we obtain the subgraph:

```
:maxime foaf:knows :thomas ;
    foaf:knows :anastasia ;
    foaf:knows :jan .
```

We believe this to be a simple and intuitive definition for the given shape. You may argue that because `:SocialShape` targets a class through `sh:targetClass`, the corresponding `rdf:type` triple (here `:maxime a foaf:Person`) should also be in the subgraph. We would agree, and have defined it as such in the paper.
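To make the "tracing out" of `:SocialShape` concrete, here is a minimal sketch in Python. It is not the formal definition from the paper: triples are modelled as plain tuples of strings, `trace_min_count` is a hypothetical helper written only for this post, and subclass reasoning for `sh:targetClass` is ignored.

```python
# A graph is a set of (subject, predicate, object) tuples of plain strings.
RDF_TYPE = "rdf:type"

data_graph = {
    (":maxime", RDF_TYPE, "foaf:Person"),
    (":maxime", "foaf:givenName", '"Maxime"'),
    (":maxime", "foaf:knows", ":thomas"),
    (":maxime", "foaf:knows", ":anastasia"),
    (":maxime", "foaf:knows", ":jan"),
}


def trace_min_count(graph, target_class, path, min_count):
    """Hypothetical helper: subgraph for a targetClass + path + minCount shape."""
    fragment = set()
    # Focus nodes are the direct instances of the target class.
    focus_nodes = {s for (s, p, o) in graph if p == RDF_TYPE and o == target_class}
    for node in focus_nodes:
        path_triples = {(s, p, o) for (s, p, o) in graph if s == node and p == path}
        # Only nodes that conform to the shape contribute triples.
        if len(path_triples) >= min_count:
            fragment.add((node, RDF_TYPE, target_class))
            # Determinism: keep every matching triple, not an arbitrary one.
            fragment |= path_triples
    return fragment


social_fragment = trace_min_count(data_graph, "foaf:Person", "foaf:knows", 1)
for triple in sorted(social_fragment):
    print(triple)
```

For the example data, the result holds the type triple and the three `foaf:knows` triples, while the `foaf:givenName` triple is left out.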

You can imagine trying to define subgraphs for every possible SHACL construct, but few definitions are as straightforward as the one discussed above. And even that one isn't entirely obvious. You could ask yourself: why does the subgraph contain all outgoing `foaf:knows` edges? One philosophy could be to let the subgraph contain "just enough". After all, `:SocialShape` only says there needs to be at least one.

This brings us to one of the principles we followed when defining the subgraphs: determinism. In the `:SocialShape` case, an RDF graph is an unordered set of triples, so there is no principled way to prefer one `foaf:knows` triple over another. The only choice we have is returning all of them.

Let's consider another example:

```
:AntisocialShape a sh:PropertyShape ;
    sh:targetClass foaf:Person ;
    sh:path foaf:knows ;
    sh:maxCount 2 .
```

Here, you are antisocial when you know at most two others. Consider a new data graph:

```
:bob a foaf:Person ;
    foaf:givenName "Bob" ;
    foaf:knows :alice ;
    foaf:knows :tim .
```

What would be a natural definition for a subgraph given our `:AntisocialShape`? Keeping in mind our principle of determinism, we are left with two choices. Both contain the triple `:bob a foaf:Person`, as we discussed earlier. The first option is the subgraph that additionally contains all `foaf:knows` triples where `:bob` is the subject. The second option contains only the type triple. Both options seem reasonable, and we opted for the latter.

The reason is that this comes closer to another underlying principle: minimality. We choose the smallest subgraph we can while still maintaining the essence of the original data graph.
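The choice made for `:AntisocialShape` can be sketched in the same style. Again, this is only an illustration under simplifying assumptions: triples are plain tuples, and `trace_max_count` is a hypothetical helper, not code from the paper or from any SHACL library.

```python
# A graph is a set of (subject, predicate, object) tuples of plain strings.
RDF_TYPE = "rdf:type"

bob_graph = {
    (":bob", RDF_TYPE, "foaf:Person"),
    (":bob", "foaf:givenName", '"Bob"'),
    (":bob", "foaf:knows", ":alice"),
    (":bob", "foaf:knows", ":tim"),
}


def trace_max_count(graph, target_class, path, max_count):
    """Hypothetical helper: subgraph for a targetClass + path + maxCount shape."""
    fragment = set()
    for (s, p, o) in graph:
        # Only type triples of the target class identify focus nodes.
        if p != RDF_TYPE or o != target_class:
            continue
        values = {obj for (subj, pred, obj) in graph if subj == s and pred == path}
        if len(values) <= max_count:
            # Minimality: the type triple alone is kept; the path triples
            # are not needed for "at most max_count" to keep holding.
            fragment.add((s, p, o))
    return fragment


antisocial_fragment = trace_max_count(bob_graph, "foaf:Person", "foaf:knows", 2)
print(antisocial_fragment)
```

Here the subgraph for `:bob` consists of the single triple `:bob a foaf:Person`, which corresponds to the second option described above.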

This leads to the following observation: the empty subgraph is also minimal and deterministic. This is the essence of the major open problem within this work: we want formally defined "postulates" that lead us to a definition of Shape Fragments. The principles of determinism and minimality are just design guidelines.

Nevertheless, I believe these design guidelines, together with our proposed definition of subgraphs, are reasonable. This belief is strengthened by one of the formal contributions of the paper: the Sufficiency Lemma (and its corollary). The notion of sufficiency is borrowed from work on database provenance, which is highly relevant here. Informally, the lemma states that for every SHACL construct (like the ones demonstrated by `:SocialShape` and `:AntisocialShape`), our definition yields a subgraph that contains enough triples for the shape used to define it to still hold for the same nodes in the subgraph, even when you add more triples from the original graph to it. In short, shapes that hold in the original graph also hold in every subgraph of the original graph that contains at least the triples provided by our definition. The "at least" is important here. For example, it means that the choice of minimality in the subgraph definition of `:AntisocialShape` is not necessary for our result to hold.
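For the two example shapes, the sufficiency property can be checked by brute force. The sketch below enumerates every graph that lies between a candidate subgraph and the original graph and tests the count constraint for a fixed node. It is only an illustration: `count_values`, `supergraphs`, and `is_sufficient` are hypothetical helpers, the check covers just these two constraints rather than every SHACL construct, and the enumeration is exponential in the number of left-out triples.

```python
from itertools import combinations

RDF_TYPE = "rdf:type"


def count_values(graph, node, path):
    """Number of distinct objects reachable from node via path."""
    return len({o for (s, p, o) in graph if s == node and p == path})


def supergraphs(fragment, original):
    """Yield every graph G with fragment <= G <= original."""
    extra = sorted(original - fragment)
    for k in range(len(extra) + 1):
        for added in combinations(extra, k):
            yield fragment | set(added)


def is_sufficient(fragment, original, node_conforms):
    """True if the node still conforms in every graph between fragment and original."""
    return all(node_conforms(g) for g in supergraphs(fragment, original))


bob_graph = {
    (":bob", RDF_TYPE, "foaf:Person"),
    (":bob", "foaf:givenName", '"Bob"'),
    (":bob", "foaf:knows", ":alice"),
    (":bob", "foaf:knows", ":tim"),
}
maxime_graph = {
    (":maxime", RDF_TYPE, "foaf:Person"),
    (":maxime", "foaf:givenName", '"Maxime"'),
    (":maxime", "foaf:knows", ":thomas"),
    (":maxime", "foaf:knows", ":anastasia"),
    (":maxime", "foaf:knows", ":jan"),
}


def at_most_two(g):
    # Constraint of :AntisocialShape, checked for :bob.
    return count_values(g, ":bob", "foaf:knows") <= 2


def at_least_one(g):
    # Constraint of :SocialShape, checked for :maxime.
    return count_values(g, ":maxime", "foaf:knows") >= 1


bob_fragment = {(":bob", RDF_TYPE, "foaf:Person")}
maxime_fragment = maxime_graph - {(":maxime", "foaf:givenName", '"Maxime"')}
type_only = {(":maxime", RDF_TYPE, "foaf:Person")}

antisocial_ok = is_sufficient(bob_fragment, bob_graph, at_most_two)
social_ok = is_sufficient(maxime_fragment, maxime_graph, at_least_one)
# With only the type triple, :maxime knows nobody, so minCount 1 fails.
too_small_ok = is_sufficient(type_only, maxime_graph, at_least_one)

print(antisocial_ok, social_ok, too_small_ok)
```

The last check shows why the subgraph for `:SocialShape` cannot be reduced to the type triple alone, whereas for `:AntisocialShape` the type triple is already enough.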

Finally, I would like to note that the sufficiency property of our definition captures the intuition that subgraphs given by shapes should still adhere to these shapes. This gives us a stronger link to the definition of SHACL as a constraint language.

Hopefully this short post raises your interest in our work. There are interesting problems to solve here, both for more theoretically minded people (I'm thinking of the above-mentioned postulates and other properties of the subgraphs) and for more practically minded people (relating to the implementation and use-case side of things). The paper also discusses some preliminary results on implementation, which may be of interest.