Named Entities

Christian Lück edited this page Sep 24, 2025 · 19 revisions

Linking named entities is a common aspect of digital scholarly editions. Persons, places, organizations or events mentioned or circumscribed in a text are commonly encoded by means of <persName>, <placeName> etc. or by <rs>, and linked either directly to authority files like LOC NAF or GND or indirectly via a local register file (personography, placeography, etc.).

With XTriples, adding this kind of entity linking to the extracted knowledge graph is straightforward.

Linked Entities

Imagine a TEI document in which place names and periphrases of places are linked to entries in the project's registry of places.

  bla bla <rs type="place" ref="plc:Darayya">...</rs> and <placeName ref="plc:gillaq-Damascus">...</placeName> ...

Here's what we want from this in our knowledge graph:

@prefix crm: <http://www.cidoc-crm.org/cidoc-crm/> .
@prefix edplc: <https://scdh.zivgitlabpages.uni-muenster.de/hees-alea/edition-ibn-nubatah/place#> .

<document> crm:P67_refers_to
    edplc:Darayya ,
    edplc:gillaq-Damascus .

We get it with the configuration below, which is run on every TEI-encoded text in the project.

<?xml-model uri="https://xtriples.lod.academy/xtriples.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<!-- XTriples configuration for extracting RDF triples from the poems in Diwan -->
<xtriples>
    <configuration>
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="common.xml" xpointer="vocabs"/>
        <triples>
            <!-- other triples about the document
               ...
            -->

            <!-- all placeName elements and rs elements of type place -->
            <statement>
                <subject>/base-uri(.)</subject>
                <predicate prefix="crm">P67_refers_to</predicate>
                <object prefix="edent" type="uri">/TEI/text//(placeName | rs[@type = 'place'])/@ref ! tokenize(.) !
                    replace(., '^[^:]+:', 'place#')</object>
            </statement>
            <!-- similar for persons and events  -->

        </triples>
    </configuration>
    <collection uri="../..?select=*.tei.xml">
    </collection>
</xtriples>

Note that the vocabularies (prefixes) are included from common.xml.
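The content of common.xml is not shown on this page. A minimal sketch of what it might look like follows. The root element name, the xml:id value matching xpointer="vocabs", and the namespace URI bound to edent are assumptions; the edent URI is inferred so that edent plus 'place#Darayya' yields the same URI as edplc:Darayya. It also assumes that the XInclude processor recognizes xml:id as an ID for shorthand pointers.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- hypothetical common.xml; the actual file in the project may differ -->
<common>
    <!-- the element selected by xpointer="vocabs" -->
    <vocabularies xml:id="vocabs">
        <vocabulary prefix="rdf" uri="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>
        <vocabulary prefix="owl" uri="http://www.w3.org/2002/07/owl#"/>
        <vocabulary prefix="crm" uri="http://www.cidoc-crm.org/cidoc-crm/"/>
        <!-- assumed: base namespace for all entities of the edition -->
        <vocabulary prefix="edent" uri="https://scdh.zivgitlabpages.uni-muenster.de/hees-alea/edition-ibn-nubatah/"/>
        <vocabulary prefix="edplc" uri="https://scdh.zivgitlabpages.uni-muenster.de/hees-alea/edition-ibn-nubatah/place#"/>
    </vocabularies>
</common>
```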

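The <statement> part of the configuration is straightforward. It's a nice example of generating multiple statements with the same subject and predicate, but with varying objects. The XPath expression in <object> returns a sequence, and each item becomes the object of its own triple. These objects are built from the pointers in the ref attributes of <placeName> and <rs type="place"> elements: each attribute value is split on whitespace with tokenize(), and the prefix before the colon (e.g. plc:) is replaced by place#.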
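The comment "similar for persons and events" in the configuration can be spelled out along the same lines. The following statement is only a sketch: it assumes that persons are referenced in the same prefixed-pointer style and that their URIs live under a person# fragment analogous to place#; neither is confirmed by the project's sources shown here.

```xml
<!-- sketch: all persName elements and rs elements of type person -->
<statement>
    <subject>/base-uri(.)</subject>
    <predicate prefix="crm">P67_refers_to</predicate>
    <!-- 'person#' is an assumed path segment, analogous to 'place#' -->
    <object prefix="edent" type="uri">/TEI/text//(persName | rs[@type = 'person'])/@ref ! tokenize(.) !
        replace(., '^[^:]+:', 'person#')</object>
</statement>
```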

You will probably want to replace the way the subject URI <document> is built, since it currently contains the absolute path on the local machine. The subject URI involves conceptual questions: Is the reference to the place a feature of the work, the expression, or the manifestation? It's up to you to decide this according to your project's needs and according to open ontologies. (Cf. FRBRized Metadata.) Any of these is possible: The domain of crm:P67_refers_to is crm:E89_Propositional_Object, which is a superclass of LRMoo F1 (Work), F2 (Expression) and F3 (Manifestation) (via crm:E73_Information_Object).

Note that this configuration may result in duplicate triples, because the same entity may be linked several times in the same document. Once the output is read with an RDF tool like Jena RIOT, the duplicates disappear, since an RDF graph is a set of triples. So this extraction is not suitable for keeping track of the frequency or density of a reference. If you want to track frequency, you could restructure the subject: Instead of the whole document (work, expression, or manifestation), let a passage of the document be the subject. The LINCS project suggests formalizing a text passage as an annotation source, which is also a crm:E73_Information_Object and which is suggested to be aligned with an oa:SpecificResource from WADM. For extracting such passages with XTriples, one would first unnest all passages (<placeName> and <rs type="place"> elements) by means of xtriples/collection/resource/@uri, and thus in a separate configuration file. A similar technique is explained for the central registries in the next section.
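A rough sketch of such a passage-level configuration could look as follows. It is not taken from the project. The subject URI built with generate-id() is a placeholder assumption: such identifiers are only stable within one run, so a real project would substitute its own passage identifiers. Typing the passage and linking it back to its document are left out.

```xml
<xtriples>
    <configuration>
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="common.xml" xpointer="vocabs"/>
        <triples>
            <statement>
                <!-- placeholder passage URI: document URI plus a generated fragment -->
                <subject>/concat(base-uri(.), '#', generate-id(.))</subject>
                <predicate prefix="crm">P67_refers_to</predicate>
                <!-- evaluated relative to each passage, so only its own ref is read -->
                <object prefix="edent" type="uri">/@ref ! tokenize(.) !
                    replace(., '^[^:]+:', 'place#')</object>
            </statement>
        </triples>
    </configuration>
    <collection uri="../..?select=*.tei.xml">
        <!-- every linked place passage becomes a resource of its own -->
        <resource uri="{/TEI/text//(placeName | rs[@type = 'place'])[@ref]}"/>
    </collection>
</xtriples>
```

Because every passage gets its own subject, repeated references to the same place no longer collapse into one triple, and their frequency can be counted in the graph.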

Central Registries

Your registry file with places may look like this:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:scdh="http://scdh.wwu.de/oxygen#ALEA" xml:lang="en">
    <teiHeader>
        <!-- ... -->
    </teiHeader>
    <text>
        <body xml:lang="en">
            <div>
                <head>Places</head>
                <listPlace>
                    <!-- ... -->
                    <place xml:id="Darayya">
                        <placeName>Dārayyā</placeName>
                        <desc>a village near Damascus</desc>
                        <idno type="locnaf" xml:base="http://id.loc.gov/authorities/names/">nr93005092</idno>
                        <listBibl>
                            <bibl corresp="bibl:YaqutBuldan"><biblScope>2:431-432</biblScope></bibl>
                        </listBibl>
                    </place>
                    <!-- ... -->
                    <place xml:id="gillaq-Damascus">
                        <placeName>Ǧillaq</placeName>
                        <placeName xml:lang="ar">جِلَّق</placeName>
                        <desc>Ǧillaq or Ǧilliq is a suburb of Damascus. In poetry it serves also as
                            a toponym of Damascus.</desc>
                        <listBibl>
                            <bibl corresp="bibl:YaqutBuldan"><biblScope>2:154-155</biblScope></bibl>
                        </listBibl>
                    </place>
                    <!-- ... -->
                </listPlace>
            </div>
        </body>
    </text>
</TEI>

Let's first have a look at the desired output:

@prefix crm: <http://www.cidoc-crm.org/cidoc-crm/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix edplc: <https://scdh.zivgitlabpages.uni-muenster.de/hees-alea/edition-ibn-nubatah/place#> .

edplc:Darayya
    a crm:E53_Place ;
    owl:sameAs <http://id.loc.gov/authorities/names/nr93005092> ;
    crm:P87_is_identified_by [
        rdf:value "Dārayyā"@en ;
        a crm:E44_Place_Appellation
        ] .

edplc:gillaq-Damascus
    a crm:E53_Place ;
    crm:P87_is_identified_by [
        rdf:value "Ǧillaq"@en ;
        a crm:E44_Place_Appellation
        ] ,
        [
        rdf:value "جِلَّق"@ar ;
        a crm:E44_Place_Appellation
        ] .

You get that from the following XTriples configuration, where vocabularies are again included from elsewhere.

<?xml-model uri="https://xtriples.lod.academy/xtriples.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<!-- XTriples configuration for extracting RDF from TEI place registry -->
<xtriples>
    <configuration>
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="common.xml" xpointer="vocabs"/>
        <triples>
            <statement>
                <subject prefix="edplc">/@xml:id</subject>
                <predicate prefix="rdf">type</predicate>
                <object prefix="crm" type="uri">E53_Place</object>
            </statement>
            <!-- linking authority files -->
            <!-- <idno xml:base="URI">IDENTIFIER</idno> -->
            <statement repeat="/count(idno[normalize-space(.) ne '???' and @xml:base])">
                <subject prefix="edplc">/@xml:id</subject>
                <predicate prefix="owl">sameAs</predicate>
                <object type="uri">/(let $idno := idno[normalize-space(.) ne '???' and @xml:base][$repeatIndex]
                    return concat($idno/@xml:base, $idno/text()))</object>
            </statement>
            <!-- English name -->
            <statement>
                <condition>/exists(placeName/ancestor-or-self::*[@xml:lang = 'en'])</condition>
                <subject prefix="edplc">/@xml:id</subject>
                <predicate prefix="crm">P87_is_identified_by</predicate>
                <object type="bnode">/concat('placename-en-', @xml:id)</object>
            </statement>
            <statement>
                <condition>/exists(placeName/ancestor-or-self::*[@xml:lang = 'en'])</condition>
                <subject type="bnode">/concat('placename-en-', @xml:id)</subject>
                <predicate prefix="rdf">value</predicate>
                <object type="literal" lang="en">/(placeName[ancestor-or-self::*/@xml:lang =
                    'en'])[1] => normalize-space()</object>
            </statement>
            <statement>
                <condition>/exists(placeName/ancestor-or-self::*[@xml:lang = 'en'])</condition>
                <subject type="bnode">/concat('placename-en-', @xml:id)</subject>
                <predicate prefix="rdf">type</predicate>
                <object type="uri" prefix="crm">E44_Place_Appellation</object>
            </statement>
            <!-- Arabic name -->
            <statement>
                <condition>/exists(placeName/ancestor-or-self::*[@xml:lang = 'ar'])</condition>
                <subject prefix="edplc">/@xml:id</subject>
                <predicate prefix="crm">P87_is_identified_by</predicate>
                <object type="bnode">/concat('placename-ar-', @xml:id)</object>
            </statement>
            <statement>
                <condition>/exists(placeName/ancestor-or-self::*[@xml:lang = 'ar'])</condition>
                <subject type="bnode">/concat('placename-ar-', @xml:id)</subject>
                <predicate prefix="rdf">value</predicate>
                <object type="literal" lang="ar">/(placeName[ancestor-or-self::*/@xml:lang =
                    'ar'])[1] => normalize-space()</object>
            </statement>
            <statement>
                <condition>/exists(placeName/ancestor-or-self::*[@xml:lang = 'ar'])</condition>
                <subject type="bnode">/concat('placename-ar-', @xml:id)</subject>
                <predicate prefix="rdf">type</predicate>
                <object type="uri" prefix="crm">E44_Place_Appellation</object>
            </statement>
        </triples>
    </configuration>
    <collection uri="../..?select=[Pp]lace*.xml">
        <resource uri="{/TEI/text//listPlace/place[@xml:id and not(ancestor::place)]}"/>
    </collection>
</xtriples>

Note that we include the vocabularies from common.xml.

Let's discuss the configuration in detail.

First, let's look at the collection. The configuration is located in resources/graph/places.xml while the project's registry of places is in places.xml. So we say <collection uri="../..?select=[Pp]lace*.xml"> in order to extract from places.xml two parent directories above our XTriples configuration file. The bracket expression [Pp] is just there for catching upper case file names as well as lower case file names; the star * is there for catching places.xml as well as place.xml.

Obviously, a registry of places contains multiple places and we want to extract information for all of them. So we do not take the registry document as one single resource, but extract the same set of triples for every place in the registry. Thus, we say <resource uri="{/TEI/text//listPlace/place}"/>. To be more precise, we only want those places that have an identifier; we thus say: <resource uri="{/TEI/text//listPlace/place[@xml:id]}"/>. Furthermore, we exclude nested places, which may be rare, but may be present; thus we add another constraint to the XPath predicate: <resource uri="{/TEI/text//listPlace/place[@xml:id and not(ancestor::place)]}"/>.

Now, let's look at the extracted triples:

All statements about the place itself have the same subject: /@xml:id. That is, the XML identifier forms the local part of the place resource's URI, which takes the edplc prefix, i.e., https://scdh.zivgitlabpages.uni-muenster.de/hees-alea/edition-ibn-nubatah/place#.

The first statement says that such a resource is a crm:E53_Place.

The second <statement> element is responsible for linking to resources from authority files. The element may result in multiple RDF statements, since it has a repeat attribute: we get one statement for every valid <idno> nested in a <place> resource. The [normalize-space(.) ne '???' and @xml:base] predicate is only there for ruling out bad data, where the xml:base attribute is missing from the <idno> or its content is the ??? placeholder.

The subject of all the resulting statements is again the URI of the place, the predicate is owl:sameAs, and the object URI is built by concatenating the xml:base attribute and the text of the <idno> element being iterated over. In the object, we again rule out the same bad <idno> elements as in the repeat attribute of the statement; otherwise the index ($repeatIndex) might point to a bad <idno>.
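To see the filter at work, consider the Darayya entry extended with two bad <idno> elements. The second and third <idno> are hypothetical and made up for illustration only; they are not in the project's registry.

```xml
<place xml:id="Darayya">
    <placeName>Dārayyā</placeName>
    <!-- counted: has xml:base and real content -->
    <idno type="locnaf" xml:base="http://id.loc.gov/authorities/names/">nr93005092</idno>
    <!-- hypothetical, skipped: content is the ??? placeholder -->
    <idno type="locnaf" xml:base="http://id.loc.gov/authorities/names/">???</idno>
    <!-- hypothetical, not counted by the repeat attribute: xml:base is missing -->
    <idno type="other">example-id</idno>
</place>
```

Here the repeat expression counts one <idno>, so the statement is evaluated once, and only the first <idno> yields an owl:sameAs statement, pointing to http://id.loc.gov/authorities/names/nr93005092.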

The subsequent statements add a name (label/title) to the place resource: a name in English and a name in Arabic, if there is one. Things only get complicated here because the name literal is not attributed directly to the place resource, but indirectly through a crm:E44_Place_Appellation, which we represent as a bnode.

Now it's up to you to translate further parts of the descriptions in the TEI registries into RDF triples. The example also makes clear that the free-text description often contains knowledge that would be desirable to represent in RDF.
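As one possible next step, the <desc> of a place could at least be carried over as a plain note. The following statement is a minimal sketch and not part of the project's configuration. It assumes that crm:P3_has_note fits your model and that the description is written in English, as in the registry excerpt above.

```xml
<!-- sketch: free-text description as a note on the place -->
<statement>
    <condition>/exists(desc)</condition>
    <subject prefix="edplc">/@xml:id</subject>
    <predicate prefix="crm">P3_has_note</predicate>
    <object type="literal" lang="en">/(desc)[1] => normalize-space()</object>
</statement>
```

This keeps the description as an opaque string; the knowledge it contains (for example, that Ǧillaq is a suburb of Damascus) would still need dedicated statements to become queryable.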
