tl;dr: see the HATEOAS-based solution
Citation trees for <pb/> and other milestone-like markup are
complicated. This document investigates the options we have.
For all approaches there is a test case in
test/john.xml.
| approach | technically sound¹ cit. tree | dinky² cit. tree | levels | $ref returns page correctly | $start+$end returns page correctly |
|---|---|---|---|---|---|
| Simple | ❌ | ✅ | 1 | ❌ | ❌ (3) |
| One-level intersection | ❌ | ❌ | 1 | ❌ | ❌ |
| Element construction ❓ | | | 1 | | |
| Two-level intersection | ❌ | ✅ | 2 | ❌ | ❌ |
| Second-level start end | ❌ | ✅ | 2 | ❌ | ✅ |
| HATEOAS-based | ✅ | | 2 | | ✅ |
Notes
1. A technically sound citation tree has one member per page on the first level and offers a client enough information to query the contents of page N.
2. A dinky citation tree has one member per page on a distinct (the first) level and thus looks nice, but falls short in some respect.
3. The <pb/> of the next page is contained in the output.
The simplest approach is to select the <pb/> elements directly:
<refsDecl n="page-simple">
<citeStructure unit="page" match="//body//pb" use="@n" delim="p."/>
</refsDecl>
This gives us a citation tree that looks nice at first:
target/bin/xslt.sh -config:saxon.he.xml -xsl:xsl/document.xsl -s:test/john.xml tree=page-simple
"member": [
{
"level": 1,
"identifier": "p.1",
"parent": null,
"citeType": "page",
"@type": "CitableUnit"
},
{
"level": 1,
"identifier": "p.2",
"parent": null,
"citeType": "page",
"@type": "CitableUnit"
},
{
"level": 1,
"identifier": "p.3",
"parent": null,
"citeType": "page",
"@type": "CitableUnit"
}
],
What can we get from specifying ref?
target/bin/xslt.sh -config:saxon.he.xml -xsl:xsl/document.xsl -s:test/john.xml tree=page-simple ref=p.1
<?xml version="1.0" encoding="UTF-8"?><TEI xmlns="http://www.tei-c.org/ns/1.0"><dts:wrapper xmlns:dts="https://w3id.org/api/dts#"><pb n="1"/></dts:wrapper></TEI>
That's only the milestone-like <pb/>.
What can we get from specifying start through end? It gives us the
first page's content.
target/bin/xslt.sh -config:saxon.he.xml -xsl:xsl/document.xsl -s:test/john.xml tree=page-milestones start=p.1 end=p.2
<?xml version="1.0" encoding="UTF-8"?><TEI xmlns="http://www.tei-c.org/ns/1.0"><dts:wrapper xmlns:dts="https://w3id.org/api/dts#"><pb n="1"/>
<head>The book of John</head>
<milestone unit="theme" xml:id="creation-start"/>
<l n="1">In the beginning was the Word, and the Word was with God, and the Word was
God.</l>
<l n="2">He was with God in the beginning.</l>
<l n="3">Through him all things were made; without him nothing was made that has been
made.</l>
In him was life, and that life was the light<pb n="2"/></dts:wrapper></TEI>
However, specifying p.1 through p.2 when we only want p.1 is not obvious, and the output already contains the <pb/> of the next page. We shouldn't even think of establishing this as a convention.
Thus, let's have a look at expressions that could get the nodes between two milestone-like elements.
In order to get the nodes between two empty elements, we can use the intersection pattern:
$start-node/following::node() intersect $end-node/preceding::node()
In order to include the <pb/> that demarcates the page beginning,
we can simply prepend it to the node sequence.
$start-node, $start-node/following::node() intersect $end-node/preceding::node()
Since the intersection returns every node at every depth (ancestors
as well as their descendants), we pass the result to
fn:outermost(...), which keeps only the nodes that have no ancestor in the sequence:
outermost(($start-node, $start-node/following::node() intersect $end-node/preceding::node()))
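The intersection pattern can be mimicked outside of XPath to see which nodes it selects. The following is a minimal sketch in Python, assuming a made-up sample document rather than test/john.xml; the helper names (walk, ancestors, page_nodes, label) are hypothetical and not part of the DTS transformations.

```python
from xml.dom import minidom

def walk(node):
    """Yield a node and all of its descendants in document order."""
    yield node
    for child in node.childNodes:
        yield from walk(child)

def ancestors(node):
    """Yield the ancestors of a node, nearest first."""
    node = node.parentNode
    while node is not None:
        yield node
        node = node.parentNode

def page_nodes(doc, n):
    """Mimic outermost(($pb, $pb/following::node() intersect $next/preceding::node()))."""
    order = list(walk(doc.documentElement))
    pbs = [i for i, x in enumerate(order)
           if x.nodeType == x.ELEMENT_NODE and x.tagName == "pb"]
    start = next(i for i in pbs if order[i].getAttribute("n") == n)
    later = [i for i in pbs if i > start]
    between, excluded = [], set()
    if later:  # without a following <pb/> the intersection is empty
        end = later[0]
        between = order[start + 1:end]  # <pb/> is empty, so no descendants to skip
        # preceding:: never contains ancestors of the end node
        excluded = {id(a) for a in ancestors(order[end])}
    selected = [order[start]] + [x for x in between if id(x) not in excluded]
    selected_ids = {id(x) for x in selected}
    # fn:outermost: drop every node that has an ancestor in the selection
    return [x for x in selected
            if not any(id(a) in selected_ids for a in ancestors(x))]

def label(node):
    """Short label for printing: text content or element name."""
    return node.data if node.nodeType == node.TEXT_NODE else node.tagName

# Made-up sample: the second <pb/> sits in the middle of a line.
SAMPLE = ('<body><pb n="1"/><head>H</head><lg><l n="1">a</l>'
          '<l n="2">b<pb n="2"/>c</l></lg></body>')
doc = minidom.parseString(SAMPLE)
page_1 = [label(x) for x in page_nodes(doc, "1")]
print(page_1)
```

In this sample the page break falls inside the second line, so the enclosing lg and l elements are dropped as ancestors of the next <pb/>, and only the text before the break survives from that line.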
BTW, the cut.xsl of DTS Transformations uses a similar
technique to select the node ranges for the document endpoint.
Let's try this approach.
Since the intersection-based node range must be constructed for each <pb/>
(except the last), we also need a for expression like for $pb in //pb return .... Here's our cite structure declaration:
<refsDecl n="page-content-by-intersection">
<citeStructure unit="page"
match="for $pb in //pb return (let $next:=$pb/following::pb[1] return outermost(($pb, $pb/following::node() intersect $next/preceding::node())))"
use="@n" delim="p."/>
</refsDecl>
But the resulting citation tree looks really weird, with tons of members with empty-string identifiers and so on.
What's going wrong here?
Sequence flattening is the party pooper! The result of for ... return ends up in
a single flat sequence that encompasses all nodes following the p.1 mark.
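The effect of flattening can be illustrated with plain lists. This is only a sketch in Python with made-up node names; it does not run the XPath expression itself, it just models that XPath has no nested sequences.

```python
# Per-page node ranges, as the inner expression would compute them for
# each <pb/>. The node names are made up for illustration.
ranges = {
    "p.1": ["pb[1]", "head", "l[1]"],
    "p.2": ["pb[2]", "l[2]", "l[3]"],
}

# XPath concatenates the results of all iterations of "for ... return"
# into one flat sequence; the grouping by page does not survive.
flat = [node for nodes in ranges.values() for node in nodes]

# Every item of the flat sequence becomes a member candidate of its own.
print(len(flat), flat)
```

Every item in the flat list is matched individually, which is why the tree ends up with one member per node instead of one member per page.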
So this could only work, if we could wrap the intersection-based content nodes of each page in wrapper elements.
However, an XQuery-like element construction with element {NAME} {CHILDREN} does not work in XPath:
<refsDecl n="page-content-by-intersection">
<citeStructure unit="page"
match="for $pb in //pb return element {ab} {(let $next:=$pb/following::pb[1] return outermost(($pb, $pb/following::node() intersect $next/preceding::node())))}"
use="@n" delim="p."/>
</refsDecl>
target/bin/xslt.sh -config:saxon.he.xml -xsl:xsl/navigation.xsl -s:test/john.xml tree=page-content-xquery-like
Error at char 7 in expression in xsl:evaluate/@xpath on line 176 column 52 of tree.xsl:
XTDE3160 Static error in XPath expression supplied to xsl:evaluate: Node constructor
expressions are allowed only in XQuery, not in XPath. Expression: {for $pb in //pb return
element {ab} {(let $next:=$pb/following::pb[1] return outermost(($pb,
$pb/following::node() intersect $next/preceding::node())))}}
...
There are two further steps to try with element construction:
- By passing the sequence to an XQuery function, which is dynamically loaded with fn:load-xquery-module(...). However, this would introduce external dependencies, so it won't be a self-contained solution.
- By passing the sequence to an XSLT stylesheet, which is dynamically loaded with fn:transform(...). Since fn:transform(...) allows us to pass an inline stylesheet via the stylesheet-text property of the map that has to be passed into the function, this could be made self-contained!
I haven't tried that yet, because I think other approaches are more promising. Moreover, with element construction, we would lose the single most important feature of the DTS transformation: node identity and node tracking.
<refsDecl n="page-content-by-intersection-2">
<citeStructure unit="page-beginning" match="//body//pb" use="@n" delim="p.">
<citeStructure unit="page-content"
match="let $pb:=self::pb, $next:=$pb/following::pb[1] return (($pb, $pb/following::node() intersect $next/preceding::node()))"
use="''" delim=" content"/>
</citeStructure>
</refsDecl>
This improves the citation tree on level 1, but level 2 still suffers from the same problem as the one-level intersection approach.
At this point, we have to move away from the idea of getting the page
content simply by providing the page identifier in the ref
parameter. This will not work, because <pb/> is a milestone-like
element which does not contain the content, and because providing a
sequence of nodes via citeStructure/@match will result in a sequence
of members, one for each node of this sequence.
Instead of putting all the nodes from the intersection into the second level, we could put only the first and the last node of this sequence there:
<refsDecl n="page-level2-start-end">
<citeStructure unit="page" match="//body//pb" use="@n" delim="p.">
<citeStructure unit="page-start" match="self::pb"
use="''" delim=".start"/>
<citeStructure unit="page-end" match="self::pb/following::pb[1]/preceding::node()[1]"
use="''" delim=".end"/>
</citeStructure>
</refsDecl>
We get a reasonable citation tree from this:
target/bin/xslt.sh -config:saxon.he.xml -xsl:xsl/navigation.xsl -s:test/john.xml tree=page-level2-start-end
"member": [
{
"level": 1,
"identifier": "p.1",
"parent": null,
"citeType": "page",
"@type": "CitableUnit"
},
{
"level": 2,
"identifier": "p.1.start",
"parent": "p.1",
"citeType": "page-start",
"@type": "CitableUnit"
},
{
"level": 2,
"identifier": "p.1.end",
"parent": "p.1",
"citeType": "page-end",
"@type": "CitableUnit"
},
{
"level": 1,
"identifier": "p.2",
"parent": null,
"citeType": "page",
"@type": "CitableUnit"
},
{
"level": 2,
"identifier": "p.2.start",
"parent": "p.2",
"citeType": "page-start",
"@type": "CitableUnit"
},
{
"level": 2,
"identifier": "p.2.end",
"parent": "p.2",
"citeType": "page-end",
"@type": "CitableUnit"
},
{
"level": 1,
"identifier": "p.3",
"parent": null,
"citeType": "page",
"@type": "CitableUnit"
},
{
"level": 2,
"identifier": "p.3.start",
"parent": "p.3",
"citeType": "page-start",
"@type": "CitableUnit"
}
],
We do not expect any good news from querying ref=p.1. But let's see
what we get from p.1 to p.1.end:
target/bin/xslt.sh -config:saxon.he.xml -xsl:xsl/document.xsl -s:test/john.xml tree=page-level2-start-end start=p.1 end=p.1.end
The result is correct:
<?xml version="1.0" encoding="UTF-8"?><TEI xmlns="http://www.tei-c.org/ns/1.0"><dts:wrapper xmlns:dts="https://w3id.org/api/dts#"><pb n="1"/>
<head>The book of John</head>
<milestone unit="theme" xml:id="creation-start"/>
<l n="1">In the beginning was the Word, and the Word was with God, and the Word was
God.</l>
<l n="2">He was with God in the beginning.</l>
<l n="3">Through him all things were made; without him nothing was made that has been
made.</l>
In him was life, and that life was the light</dts:wrapper></TEI>
But there's still a problem: The citation tree is not technically
sound! How does a client know that the content of page N is available
via start=p.N.start&end=p.N.end?
Let's take the perspective of a client. We're kind of dumb. For the default citation tree we get something like
"member": [
{
"level": 1,
"identifier": "John",
"parent": null,
"citeType": "book",
"@type": "CitableUnit"
},
{
"level": 2,
"identifier": "John:1",
"parent": "John",
"citeType": "chapter",
"@type": "CitableUnit"
},
{
"level": 3,
"identifier": "John:1:1",
"parent": "John:1",
"citeType": "verse",
"@type": "CitableUnit"
},
/* ... */
]
And for the page citation tree we get something like
"member": [
{
"level": 1,
"identifier": "p.1",
"parent": null,
"citeType": "page",
"@type": "CitableUnit"
},
{
"level": 2,
"identifier": "p.1.start",
"parent": "p.1",
"citeType": "page-start",
"@type": "CitableUnit"
},
{
"level": 2,
"identifier": "p.1.end",
"parent": "p.1",
"citeType": "page-end",
"@type": "CitableUnit"
},
{
"level": 1,
"identifier": "p.2",
"parent": null,
"citeType": "page",
"@type": "CitableUnit"
},
In both cases, there's nothing but information about levels, about
cite types and the member's identifier. But in the first case, the
contents of a unit are retrieved by ref, and in the other by start
and end. Can we build on the citeType information alone to
distinguish between the two types of follow-up requests for getting
content?
According to the LOD @context, the value of citeType is just a string literal; and
accordingly, the value of citeStructure/@unit
is just a teidata.enumerated, not a teidata.pointer. Thus, the Turtle
serialization of a member of the page citation tree looks like this:
target/bin/xslt.sh -config:saxon.he.xml -xsl:xsl/navigation.xsl -s:test/john.xml tree=page-level2-start-end | target/bin/riot.sh --syntax=jsonld --out=ttl
...
_:b0 a <https://w3id.org/dts/api#CitableUnit> ;
<https://w3id.org/dts/api#citeType> "page" ;
<https://w3id.org/dts/api#identifier> "p.1" ;
<https://w3id.org/dts/api#level> 1 .
_:b1 a <https://w3id.org/dts/api#CitableUnit> ;
<https://w3id.org/dts/api#citeType> "page-start" ;
<https://w3id.org/dts/api#identifier> "p.1.start" ;
<https://w3id.org/dts/api#level> 2 ;
<https://w3id.org/dts/api#parent> "p.1" .
...
Building the follow-up processing of a client on a string literal instead of a LOD term, i.e. a URI, would be a bit weak.
And even if we did this, we would have to rely on conventions, not on explicit information, for getting the right start and end members of a page.
DTS builds on HATEOAS, and this is a strength of the specs.
Let's take the Second-level start end approach a step further and add some explicit information to the citation tree members, which a client can evaluate to get the content of a page:
<refsDecl n="page-hateoas">
<citeStructure unit="page" match="//body//pb" use="@n" delim="p.">
<citeData use="'p.' || @n || '.start'" property="https://w3id.org/dts/api#startMember"/>
<citeData use="'p.' || @n || '.end'" property="https://w3id.org/dts/api#endMember"/>
<citeStructure unit="page-start" match="self::pb"
use="''" delim=".start"/>
<citeStructure unit="page-end" match="self::pb/following::pb[1]/preceding::node()[1]"
use="''" delim=".end"/>
</citeStructure>
</refsDecl>
This shifts from convention to information: It's now explicit that the
range of page N starts at member p.N.start and ends at p.N.end
(inclusive). Here's the page-hateoas citation tree:
target/bin/xslt.sh -config:saxon.he.xml -xsl:xsl/navigation.xsl -s:test/john.xml tree=page-hateoas
"member": [
{
"level": 1,
"dts:startMember": "p.1.start",
"identifier": "p.1",
"parent": null,
"citeType": "page",
"@type": "CitableUnit",
"dts:endMember": "p.1.end"
},
{
"level": 2,
"identifier": "p.1.start",
"parent": "p.1",
"citeType": "page-start",
"@type": "CitableUnit"
},
{
"level": 2,
"identifier": "p.1.end",
"parent": "p.1",
"citeType": "page-end",
"@type": "CitableUnit"
},
/*...*/
]
That's the minimum required information. We could state in an OWL
ontology that a dts:CitableUnit which has dts:startMember and
dts:endMember properties has a kind of virtually constructed sub-forest,
encompassing the subtrees from the one constructed for the member given as
dts:startMember through the one constructed for the member given as
dts:endMember. Alternatively, we could also express this in the
member's properties by a kind of rdf:type or so. But at the very least, a client
needs the member names explicitly, as in this rather minimal, but
technically sound citation tree.
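How a client could evaluate this information can be sketched in a few lines of Python. This is a sketch under assumptions, not a reference client: the function name document_query is hypothetical, the property keys are taken as they appear in the page-hateoas output above, and the parameter names ref, start and end are the ones used with the document endpoint in this document.

```python
from urllib.parse import urlencode

# Keys as they appear in the page-hateoas navigation output.
START = "dts:startMember"
END = "dts:endMember"

def document_query(member):
    """Pick the document endpoint parameters for one citation tree member."""
    if START in member and END in member:
        # Explicit range information: follow it instead of guessing.
        return {"start": member[START], "end": member[END]}
    # No range information: the member's own sub-tree is the content.
    return {"ref": member["identifier"]}

page = {"level": 1, "dts:startMember": "p.1.start", "identifier": "p.1",
        "parent": None, "citeType": "page", "@type": "CitableUnit",
        "dts:endMember": "p.1.end"}
verse = {"level": 3, "identifier": "John:1:1", "parent": "John:1",
         "citeType": "verse", "@type": "CitableUnit"}

page_query = urlencode(document_query(page))
verse_query = document_query(verse)
print(page_query)
print(verse_query)
```

The client never inspects citeType or the shape of the identifier; it only follows the two properties when they are present.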
TODO: Suggest this solution to the DTS team!
TODO: problem with last page
How can we further improve this approach? How can we get the page out
of the endpoint by specifying ref=p.1?
Breaking the barrier seems possible only by moving the specification forward again. There are several ways:
1. make the algorithmic part of the specs suitable for milestone-like markup (but keep the set of elements and attributes as is)
2. add information about the 'situation' to the output of the navigation endpoint which a conforming client can evaluate (HATEOAS-based solution)
3. specify new attributes (and/or elements) on <citeStructure> and how they are to be processed (TEI-based solution)
The following stays within the bounds of option 1, because telling the client what to do next is the guiding principle of HATEOAS and it's very simple.
It introduces a new term: virtual sub-forest.
The algorithm for processing citeStructure is defined by example in SACRCS
and in the reference for citeStructure/@match.
More importantly, let's focus on the spec for ref on the document endpoint:
The string identifier of a single node in the CitationTree for the Resource, used as the root for the sub-tree to be reconstructed.
Let's introduce a new term, which is opposed to sub-tree:
Def: A node at level N in the CitationTree has virtual sub-trees, as opposed to its (real) sub-tree, when a level N+1 child reconstructs a document node outside of the (real) sub-tree reconstructed from the level N node.
Def: The virtual sub-forest of a level N node in the citation tree encompasses all document nodes, from the first through the last node in document order, which are reconstructed from its first and last level N+1 children in the citation tree.
Non-normative comment:
The notion of virtual sub-forest decouples the hierarchy introduced by a citation tree from the document hierarchy. This is useful for citation trees constructed from milestone-like markup, e.g. TEI's <pb n="42"/>.
Let's modify this sentence from the specs of the document endpoint:
The string identifier of a single node in the CitationTree for the Resource, used as the root for the sub-tree to be reconstructed.
We could extend it with the phrase "or the virtual sub-forest":
The string identifier of a single node in the CitationTree for the Resource, used as the root for the sub-tree or the virtual sub-forest to be reconstructed.
This would turn the Second-level start end approach into $ref-✅!
<refsDecl n="page-level2-start-end">
<citeStructure unit="page" match="//body//pb" use="@n" delim="p.">
<citeStructure unit="page-start" match="self::pb"
use="''" delim=".start"/>
<citeStructure unit="page-end" match="self::pb/following::pb[1]/preceding::node()[1]"
use="''" delim=".end"/>
</citeStructure>
</refsDecl>
The document nodes reconstructed from page-end are virtual sub-trees
of the sub-tree constructed from the page member. Thus, the virtual
sub-forest of a page member is all the nodes from page start to page
end.
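What a document endpoint would have to do for such a ref request can be sketched as follows. This is a simplified model in Python, assuming a flat list of made-up node labels in document order and a hypothetical member structure with node and children keys; it ignores nesting and is not how the DTS transformations are implemented.

```python
def resolve_ref(member, order):
    """Return the document nodes a ref request for this member would yield."""
    children = member.get("children", [])
    if not children:
        # No child members: only the node reconstructed from the member.
        return [member["node"]]
    # Virtual sub-forest: from the node of the first child through the
    # node of the last child, both ends included.
    first = order.index(children[0]["node"])
    last = order.index(children[-1]["node"])
    return order[first:last + 1]

# Made-up document order; "text()" stands for the text node that
# immediately precedes the second <pb/>.
ORDER = ["pb[1]", "head", "l[1]", "l[2]", "text()", "pb[2]", "l[3]"]

p1 = {"identifier": "p.1", "node": "pb[1]", "children": [
    {"identifier": "p.1.start", "node": "pb[1]"},
    {"identifier": "p.1.end", "node": "text()"},
]}

page_1 = resolve_ref(p1, ORDER)
print(page_1)
```

The <pb/> of the next page is not part of the result, because the page-end member points at the node right before it.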
- OLD and REWRITTEN text
If the sub-tree identified by $ref on level N encompasses a single node that has no children (no non-attribute children) and if there are exactly two child members of $ref on level N+1, then and only then, on a request with ref=X and X on level N, the document endpoint may return the virtual sub-forest. The virtual sub-forest is the sequence of document nodes between the left-most and the right-most nodes (including these two) identified by the N+1 level child members.
This would turn the Second level start end
approach into $ref-✅!
Isn't this what we want regarding milestone-like markup? We get a
reasonable cite structure and a reasonable schema for using it in the
document endpoint using ref. If child members demarcate a bigger
portion of the document than the member itself, why not return that
bigger portion on ref requests?
virtual sub-forest: This term reflects the two reference systems
involved: The hierarchy of the members in the citation tree
is reflected in sub-. The hierarchy of the XML document is reflected
in forest; the resulting node set is a forest, not a sub-tree (at
least in general). There is no sub in that forest with regard to the
node referenced by $ref; that's why it's virtual.