Named Entities
Linking named entities is a common aspect of digital scholarly
editions. It is very common to encode persons, places, organizations,
or events mentioned or circumscribed in a text by means of
<persName>, <placeName> etc. or by <rs>, and to link them either
directly to authority files like LOC NAF or GND,
or indirectly via a local register file (personography, place-ography,
etc.).
With XTriples, adding this kind of entity linking to the extracted knowledge graph is straightforward.
Imagine a TEI document in which places and periphrases of places are linked to places from the project's registry of places.
bla bla <rs type="place" ref="plc:Darayya">...</rs> and <placeName ref="plc:gillaq-Damascus">...</placeName> ...
Here's what we want from this in our knowledge graph:
@prefix crm: <http://www.cidoc-crm.org/cidoc-crm/> .
@prefix edplc: <https://scdh.zivgitlabpages.uni-muenster.de/hees-alea/edition-ibn-nubatah/place#> .
<document> crm:P67_refers_to
edplc:gillaq-Damascus .
We get it with the configuration below, which is run on every TEI-encoded text in the project.
<?xml-model uri="https://xtriples.lod.academy/xtriples.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<!-- XTriples configuration for extracting RDF triples from the poems in Diwan -->
<xtriples>
<configuration>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="common.xml" xpointer="vocabs"/>
<triples>
<!-- other triples about the document
...
-->
<!-- all placeName elements -->
<statement>
<subject>/base-uri(.)</subject>
<predicate prefix="crm">P67_refers_to</predicate>
<object prefix="edent" type="uri">/TEI/text//(placeName | rs[@type = 'place'])/@ref ! tokenize(.) !
replace(., '^[^:]+:', 'place#')</object>
</statement>
<!-- similar for persons and events -->
</triples>
</configuration>
<collection uri="../..?select=*.tei.xml">
</collection>
</xtriples>
Note that the vocabularies (prefixes) are
included from common.xml.
The <statement> part of the configuration is straightforward. It's
a nice example of generating multiple statements with the same
subject and predicate, but with varying objects. These objects are
made up from the IDREFs in the ref attributes of <placeName> and
<rs type="place"> elements.
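The object expression can be read as a small pipeline: split each ref value on whitespace, then swap the private prefix for the registry's path segment. The following is a minimal Python sketch of that string transformation only; XTriples itself evaluates the XPath. The expansion of the `edent` prefix is an assumption inferred from the `edplc` prefix (the real value lives in common.xml), and the helper name `ref_to_uris` is hypothetical.

```python
import re

# Assumed expansion of the `edent` prefix; the real value is defined in common.xml.
EDENT = "https://scdh.zivgitlabpages.uni-muenster.de/hees-alea/edition-ibn-nubatah/"


def ref_to_uris(ref: str) -> list[str]:
    """Mimic `tokenize(.) ! replace(., '^[^:]+:', 'place#')` for one @ref value."""
    # One-argument tokenize() splits on whitespace, like str.split() without arguments.
    tokens = ref.split()
    # Replace everything up to and including the first colon with 'place#'.
    return [EDENT + re.sub(r"^[^:]+:", "place#", token) for token in tokens]


uris = ref_to_uris("plc:Darayya plc:gillaq-Damascus")
```

A ref attribute with several whitespace-separated pointers thus yields one object URI per pointer, which is why a single <statement> can produce several triples.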
You probably want to replace the fabrication of the subject URI
<document>, which currently contains the absolute path on the local
machine. The subject URI involves conceptual questions: Is the
reference to the place a feature of the work, the expression, or the
manifestation? It's up to you to decide this according to your
project's needs and according to open ontologies. (Cf. FRBRized
Metadata.) Any of the three is possible: The domain of
crm:P67_refers_to is crm:E89_Propositional_Object,
which in turn is a superclass of LRMoo F1 (Work), F2 (Expression), and F3
(Manifestation) (via crm:E73_Information_Object).
Note that this configuration may result in duplicate triples, because
the same entity may be linked several times in the same document. Once
read with an RDF tool like Jena RIOT,
the duplicates will disappear. So this extraction is not suitable for
keeping track of the frequency or density of a reference. If you want
to track frequency, you could restructure the subject: Instead of the
whole document (work, expression, or manifestation), let a passage
of the document be the subject. The LINCS project suggests the
formalization of a text passage as annotation source,
which is also a crm:E73_Information_Object and which is suggested to
be aligned with an oa:SpecificResource from WADM. For
extracting such passages with XTriples, one would first unnest all
passages (<placeName> and <rs type="place"> elements) by means of
xtriples/collection/resource/@uri and thus a separate configuration
file. A similar technique is explained for the central registries in
the next section.
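Why the duplicates disappear can be sketched in plain Python: an RDF graph is a set of triples, so loading the same triple twice leaves one copy. The snippet below uses tuples as stand-ins for triples and is only an illustration, not how Jena RIOT is implemented; the list of extracted triples is made up for the example.

```python
from collections import Counter

DOC = "<document>"          # placeholder subject, as in the Turtle example above
P67 = "crm:P67_refers_to"

# One entry per linked passage; here gillaq-Damascus is assumed to be linked twice.
extracted = [
    (DOC, P67, "edplc:gillaq-Damascus"),
    (DOC, P67, "edplc:Darayya"),
    (DOC, P67, "edplc:gillaq-Damascus"),
]

graph = set(extracted)         # set semantics: identical triples collapse into one
mentions = Counter(extracted)  # frequency is only visible before the collapse
```

With a passage as subject instead of the document, every triple would be distinct and the count would survive loading into the graph.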
Your registry file with places may look like this:
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:scdh="http://scdh.wwu.de/oxygen#ALEA" xml:lang="en">
<teiHeader>
<!-- ... -->
</teiHeader>
<text>
<body xml:lang="en">
<div>
<head>Places</head>
<listPlace>
<!-- ... -->
<place xml:id="Darayya">
<placeName>Dārayyā</placeName>
<desc>a village near Damascus</desc>
<idno type="locnaf" xml:base="http://id.loc.gov/authorities/names/">nr93005092</idno>
<listBibl>
<bibl corresp="bibl:YaqutBuldan"><biblScope>2:431-432</biblScope></bibl>
</listBibl>
</place>
<!-- ... -->
<place xml:id="gillaq-Damascus">
<placeName>Ǧillaq</placeName>
<placeName xml:lang="ar">جِلَّق</placeName>
<desc>Ǧillaq or Ǧilliq is a suburb of Damascus. In poetry it serves also as
a toponym of Damascus.</desc>
<listBibl>
<bibl corresp="bibl:YaqutBuldan"><biblScope>2:154-155</biblScope></bibl>
</listBibl>
</place>
<!-- ... -->
</listPlace>
</div>
</body>
</text>
</TEI>
Let's first have a look at the desired output:
@prefix crm: <http://www.cidoc-crm.org/cidoc-crm/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix edplc: <https://scdh.zivgitlabpages.uni-muenster.de/hees-alea/edition-ibn-nubatah/place#> .
edplc:Darayya
a crm:E53_Place ;
owl:sameAs <http://id.loc.gov/authorities/names/nr93005092> ;
crm:P87_is_identified_by [
rdf:value "Dārayyā"@en ;
a crm:E44_Place_Appellation
] .
edplc:gillaq-Damascus
a crm:E53_Place ;
crm:P87_is_identified_by [
rdf:value "Ǧillaq"@en ;
a crm:E44_Place_Appellation
] ,
[
rdf:value "جِلَّق"@ar ;
a crm:E44_Place_Appellation
] .
You get that from the following XTriples configuration, where
vocabularies are again included from elsewhere.
<?xml-model uri="https://xtriples.lod.academy/xtriples.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<!-- XTriples configuration for extracting RDF from TEI place registry -->
<xtriples>
<configuration>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="common.xml" xpointer="vocabs"/>
<triples>
<statement>
<subject prefix="edplc">/@xml:id</subject>
<predicate prefix="rdf">type</predicate>
<object prefix="crm" type="uri">E53_Place</object>
</statement>
<!-- linking authority files -->
<!-- <idno xml:base="URI">IDENTIFIER</idno> -->
<statement repeat="/count(idno[normalize-space(.) ne '???' and @xml:base])">
<subject prefix="edplc">/@xml:id</subject>
<predicate prefix="owl">sameAs</predicate>
<object type="uri">/(let $idno := idno[normalize-space(.) ne '???' and @xml:base][$repeatIndex]
return concat($idno/@xml:base, $idno/text()))</object>
</statement>
<!-- English name -->
<statement>
<condition>/exists(placeName/ancestor-or-self::*[@xml:lang = 'en'])</condition>
<subject prefix="edplc">/@xml:id</subject>
<predicate prefix="crm">P87_is_identified_by</predicate>
<object type="bnode">/concat('placename-en-', @xml:id)</object>
</statement>
<statement>
<condition>/exists(placeName/ancestor-or-self::*[@xml:lang = 'en'])</condition>
<subject type="bnode">/concat('placename-en-', @xml:id)</subject>
<predicate prefix="rdf">value</predicate>
<object type="literal" lang="en">/(placeName[ancestor-or-self::*/@xml:lang =
'en'])[1] => normalize-space()</object>
</statement>
<statement>
<condition>/exists(placeName/ancestor-or-self::*[@xml:lang = 'en'])</condition>
<subject type="bnode">/concat('placename-en-', @xml:id)</subject>
<predicate prefix="rdf">type</predicate>
<object type="uri" prefix="crm">E44_Place_Appellation</object>
</statement>
<!-- Arabic name -->
<statement>
<condition>/exists(placeName/ancestor-or-self::*[@xml:lang = 'ar'])</condition>
<subject prefix="edplc">/@xml:id</subject>
<predicate prefix="crm">P87_is_identified_by</predicate>
<object type="bnode">/concat('placename-ar-', @xml:id)</object>
</statement>
<statement>
<condition>/exists(placeName/ancestor-or-self::*[@xml:lang = 'ar'])</condition>
<subject type="bnode">/concat('placename-ar-', @xml:id)</subject>
<predicate prefix="rdf">value</predicate>
<object type="literal" lang="ar">/(placeName[ancestor-or-self::*/@xml:lang =
'ar'])[1] => normalize-space()</object>
</statement>
<statement>
<condition>/exists(placeName/ancestor-or-self::*[@xml:lang = 'ar'])</condition>
<subject type="bnode">/concat('placename-ar-', @xml:id)</subject>
<predicate prefix="rdf">type</predicate>
<object type="uri" prefix="crm">E44_Place_Appellation</object>
</statement>
</triples>
</configuration>
<collection uri="../..?select=[Pp]lace*.xml">
<resource uri="{/TEI/text//listPlace/place[@xml:id and not(ancestor::place)]}"/>
</collection>
</xtriples>
Note that we include the vocabularies from common.xml.
Let's discuss the configuration in detail.
First, let's look at the collection. The configuration is
located in
resources/graph/places.xml while the project's registry of places is
in places.xml. So we say <collection uri="../..?select=[Pp]lace*.xml"> in order to extract from
places.xml two parent directories above our XTriples configuration
file. The character class [Pp] is just there for catching upper
case file names as well as lower case file names; the star * is
there for catching places.xml as well as place.xml.
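Which file names the pattern catches can be checked quickly. The sketch below uses Python's `fnmatchcase` as a stand-in, assuming that the `select` pattern behaves like a shell-style glob; the exact matching rules of the underlying XML processor may differ, and the candidate file names are made up.

```python
from fnmatch import fnmatchcase

PATTERN = "[Pp]lace*.xml"

# Hypothetical file names in the project's root directory.
candidates = ["places.xml", "Places.xml", "place.xml", "persons.xml", "placeholder.xml"]

# Case-sensitive glob matching: [Pp] is a character class, * matches any run of characters.
matched = [name for name in candidates if fnmatchcase(name, PATTERN)]
```

Note that the star also lets through names like placeholder.xml, so the pattern is only safe as long as no unrelated file in that directory starts with "place".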
Obviously, a registry of places contains multiple places and we want
to extract information for all of them. So we do not take the registry
document as one single resource, but extract the same set of triples
for every place in the registry. Thus, we say <resource uri="{/TEI/text//listPlace/place}"/>. To be more precise, we only
want those places that have an identifier; we thus say: <resource uri="{/TEI/text//listPlace/place[@xml:id]}"/>. Furthermore, we
exclude nested places, which may be rare, but may be present; thus we
add another constraint to the XPath predicate: <resource uri="{/TEI/text//listPlace/place[@xml:id and not(ancestor::place)]}"/>.
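The effect of the predicate `[@xml:id and not(ancestor::place)]` can be tried out on a tiny registry. This is a simplified Python sketch with the standard library's ElementTree, which lacks the ancestor axis, so a parent map is built by hand. It does not check the full `/TEI/text//listPlace` path, and the sample document and the function name `select_places` are made up for illustration.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Made-up registry: one place with a nested place, one without xml:id, one regular place.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><listPlace>
  <place xml:id="outer"><listPlace><place xml:id="nested"/></listPlace></place>
  <place/>
  <place xml:id="Darayya"/>
</listPlace></body></text></TEI>"""


def select_places(root):
    """Return the xml:id of every place that has one and is not nested in another place."""
    parent = {child: elem for elem in root.iter() for child in elem}
    selected = []
    for place in root.iter(TEI + "place"):
        if XML_ID not in place.attrib:
            continue  # corresponds to [@xml:id]
        ancestor = parent.get(place)
        while ancestor is not None and ancestor.tag != TEI + "place":
            ancestor = parent.get(ancestor)
        if ancestor is None:  # corresponds to not(ancestor::place)
            selected.append(place.get(XML_ID))
    return selected


place_ids = select_places(ET.fromstring(SAMPLE))
```

Each selected place then becomes one resource, and the whole set of statements is evaluated once per resource.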
Now, let's look at the extracted triples:
All statements about the place itself have the same subject: /@xml:id. That is, the XML
identifier makes up the local part of the place resource's URI, which
takes the edplc prefix, i.e.,
https://scdh.zivgitlabpages.uni-muenster.de/hees-alea/edition-ibn-nubatah/place#.
The first statement says that such a resource is a crm:E53_Place.
The second <statement> element is responsible for linking to
resources from authority files. The element may result in multiple RDF
statements, since it has a repeat attribute: Thus, we get a
statement for every <idno> nested in a <place> resource. The
[normalize-space(.) ne '???' and @xml:base] predicate is only there
for ruling out bad data, where the xml:base attribute is missing
from the <idno> or its content is the ??? placeholder.
The subject of all the resulting statements is the URI of the place
again, the predicate is owl:sameAs and the object resource URI is
made up from the <idno> element being iterated over. In the object,
we again rule out the same bad <idno> elements as in the repeat
attribute of the statement, otherwise the index ($repeatIndex) might
point to a bad <idno>.
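Why the count and the lookup must use the same filter can be shown with a short Python sketch. The pairs below stand in for the <idno> children of one place; the first two are made-up bad entries, and the helper names are hypothetical. Note that Python indexes from 0, whereas `$repeatIndex` is used as an XPath position and thus starts at 1.

```python
BASE = "http://id.loc.gov/authorities/names/"

# (xml:base, text content) per <idno>; None marks a missing xml:base attribute.
idnos = [
    (None, "X1"),          # made-up entry without xml:base
    (BASE, "???"),         # placeholder content
    (BASE, "nr93005092"),  # the only usable identifier
]


def is_good(idno):
    """Same test as [normalize-space(.) ne '???' and @xml:base]."""
    base, text = idno
    return text.strip() != "???" and base is not None


def same_as_objects(idnos):
    good = [idno for idno in idnos if is_good(idno)]
    repeat = len(good)  # plays the role of the repeat attribute
    # The lookup indexes into the *same* filtered list that was counted.
    return [good[index][0] + good[index][1] for index in range(repeat)]


objects = same_as_objects(idnos)

# A lookup that forgets the xml:base check selects a different first item.
loose = [idno for idno in idnos if idno[1].strip() != "???"]
```

Here `loose[0]` is the entry without xml:base, so a lookup over the loosely filtered list would build a broken URI for the first repetition.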
The subsequent statements add a name (label/title) to the place
resource, namely a name in English and a name in Arabic,
if there is one. Things only get complicated here because the
name literal is not directly attributed to the place resource, but
indirectly through a crm:E44_Place_Appellation, which we represent
as a blank node (bnode) here.
Now it's up to you to further translate the descriptions in the TEI registries to RDF triples. The example also makes clear that the free-text description often contains knowledge that would be desirable to represent in RDF.