nverwer/SPEAT

SPEAT - Simple Pipelines of Event API Transformers

Introduction

SPEAT arose from the need to process the text content of (structured) XML documents in a linear (unstructured) way. The standard XML processing APIs, such as DOM and SAX (and LINQ, if you use .NET), view an XML document, including its text content, as a tree. The same is true for XSLT, XPath and XQuery. An XML document tree consists of nodes, and the text content is distributed across different nodes at different levels in the tree. With the tree representation it is possible to work on the complete text content of a document (or element), but doing so loses the structure of the document.

To process the document text in a linear way while retaining (and modifying) the structure and markup, SMAX (Separated Markup API for XML) was designed. As the name indicates, SMAX treats the markup (element structure) and the text of an XML document separately. A typical use case for SMAX is text analysis followed by the insertion of new markup around recognized text fragments. Examples are named entity recognition and parsing.

SMAX is an event API, like SAX. Event APIs implement 'push parsing'. In this approach, a processing component accepts events generated by another component, which may be a parser. By contrast, in 'pull parsing', the application decides when to ask for more data (from another component like a parser). Unlike SAX, which provides methods for fine-grained events such as startElement or characters, SMAX has just one event, process, which operates on the markup (structure) and the content (unstructured, linear text) of a document as separate properties.
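The difference between push and pull parsing can be illustrated with the two streaming XML APIs that ship with the JDK: SAX (push) and StAX (pull). The sketch below does not use SPEAT or SMAX; the class name `PushVsPull` and its methods are made up for this illustration. Both methods collect element names from the same input.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustration only: plain JDK APIs, no SPEAT classes involved.
class PushVsPull {

    /** Push parsing (SAX): the parser calls our handler whenever it has an event. */
    static List<String> elementNamesPush(String xml) {
        List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                names.add(qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return names;
    }

    /** Pull parsing (StAX): the application asks for the next event when it wants one. */
    static List<String> elementNamesPull(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
            reader.close();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<p>Hello <b>world</b></p>";
        System.out.println(elementNamesPush(xml)); // [p, b]
        System.out.println(elementNamesPull(xml)); // [p, b]
    }
}
```

In the push variant the control flow belongs to the parser, which is what makes it natural to chain handlers into a pipeline.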

XML document processing is often done using pipelines, facilitated by technologies like XProc and Apache Cocoon. SPEAT is a Java library for building pipelines from event APIs like SAX and SMAX. Each step in a SPEAT pipeline is a transformer from one event API type to another. Examples are SaxToSmaxAdapter, which transforms from SAX to SMAX, and NamedEntityRecognizer which transforms from SMAX to SMAX. A transformer is a (one-step) pipeline, and a pipeline can be used as a transformer in another pipeline.

SMAX

The main purpose of SMAX is to separate the markup of an XML document from its text content, hence the name Separated Markup API for XML. SMAX is also an event API, with just one event: process(SmaxDocument document). In this way it combines aspects of event-driven APIs like SAX with aspects of the DOM (document object model), which provides access to arbitrary parts of the XML document.

The class SmaxDocument is the SMAX representation of an XML document or document fragment. It contains a SmaxElement, which is the root element of the markup of the document (or fragment), and a SmaxContent, which is the text of the document.

A SmaxElement has methods to access its name, namespace, attributes, parent and children. It does not have methods to access its text content, because that is separated from the markup. Instead, it has a start position and an end position in the text content. These positions are points between characters: position n is just before the nth character in the StringBuffer of a SmaxContent.

A SmaxDocument may be a sub-document of a larger SmaxDocument, and use (a subset of) the same SmaxElements. To avoid the use of position offsets in a sub-document, the root element of a SmaxDocument does not necessarily have start-position 0.

The SmaxContent class implements CharSequence, like String and other character sequences in Java. Its main purpose is to provide sub-document views with zero-based indexes, without copying any content. It also provides all methods from StringBuffer to manipulate the underlying character sequence efficiently. The start-position and end-position of every SmaxElement in a SmaxDocument are relative to the underlying SmaxContent. This makes it easier to create sub-documents of a SmaxDocument without changing start and end positions.
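The idea of a zero-based view on shared text, without copying, can be sketched in a few lines. The class `ContentView` below is hypothetical and is not the SmaxContent implementation; it only assumes a shared StringBuffer and a fixed window into it.

```java
// Hypothetical sketch of a zero-based, non-copying view; not SPEAT's SmaxContent.
class ContentView implements CharSequence {
    private final StringBuffer shared; // underlying text, shared by all views
    private final int start;           // first character of the view, in underlying coordinates
    private final int end;             // position just after the last character of the view

    ContentView(StringBuffer shared, int start, int end) {
        if (start < 0 || end < start || end > shared.length()) {
            throw new IndexOutOfBoundsException("invalid range " + start + ".." + end);
        }
        this.shared = shared;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return shared.charAt(start + index); // translate to underlying coordinates
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to < from || to > length()) {
            throw new IndexOutOfBoundsException("invalid range " + from + ".." + to);
        }
        return new ContentView(shared, start + from, start + to); // still no copy
    }

    @Override
    public String toString() {
        return shared.substring(start, end);
    }

    public static void main(String[] args) {
        StringBuffer text = new StringBuffer("Hello brave new world");
        ContentView view = new ContentView(text, 6, 15);
        System.out.println(view);           // brave new
        System.out.println(view.charAt(0)); // b
        text.setCharAt(6, 'B');             // a change in the shared text...
        System.out.println(view);           // ...is visible through the view: Brave new
    }
}
```

This sketch ignores insertions and deletions, which would shift the window; a real implementation has to account for those.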

The SmaxDocument class has a method insertMarkup(SmaxElement newNode, Balancing balancing, int startPos, int endPos), which inserts a new XML element into the document with a character span from startPos to endPos. This method modifies the document structure (markup), but not the content. It can be used to mark ranges of text, and is the basis of text recognition transformers. Surrounding a character span with a new XML element at arbitrary start and end positions might lead to unbalanced (non-well-formed) XML markup. Therefore, a Balancing strategy must be specified, which tells the insertMarkup method how to deal with potentially unbalanced markup. See the javadoc for the available balancing strategies.
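To see why balancing is needed, it helps to treat elements as spans over the linear text. The sketch below is not SMAX code: `Span`, `crosses` and `isBalanced` are hypothetical names, and the actual Balancing strategies are not reproduced. It only detects the situation that a balancing strategy has to resolve, namely a new span that partially overlaps an existing element.

```java
import java.util.List;

// Hypothetical sketch: elements as character spans, positions are points between characters.
class SpanBalance {

    /** Stand-in for an element: a name plus start and end positions in the text. */
    static final class Span {
        final String name;
        final int start;
        final int end;

        Span(String name, int start, int end) {
            this.name = name;
            this.start = start;
            this.end = end;
        }
    }

    /** True if the spans share characters but neither one contains the other. */
    static boolean crosses(Span a, Span b) {
        boolean overlap = a.start < b.end && b.start < a.end;
        boolean aInsideB = b.start <= a.start && a.end <= b.end;
        boolean bInsideA = a.start <= b.start && b.end <= a.end;
        return overlap && !aInsideB && !bInsideA;
    }

    /** True if wrapping the candidate span in a new element keeps the markup well-formed. */
    static boolean isBalanced(List<Span> existing, Span candidate) {
        return existing.stream().noneMatch(s -> crosses(s, candidate));
    }

    public static void main(String[] args) {
        // Markup: <p>Send it <b>to John</b> Smith today</p>
        String text = "Send it to John Smith today";
        List<Span> markup = List.of(new Span("p", 0, text.length()), new Span("b", 8, 15));

        Span fullName = new Span("name", 11, 21); // "John Smith" starts inside <b>, ends outside
        Span surname = new Span("name", 16, 21);  // "Smith" lies entirely outside <b>

        System.out.println(isBalanced(markup, fullName)); // false
        System.out.println(isBalanced(markup, surname));  // true
    }
}
```

When the check fails, something has to give: the new element can be widened, narrowed, split, or rejected. Choosing among such options is the kind of decision a balancing strategy encodes.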

SPEAT

SPEAT is built around three interfaces: EventHandler, EventSupplier, and Pipeline. A Pipeline<S, T> transforms events of type S into events of type T. It extends both EventHandler<S> and EventSupplier<T>, which are the input and output side of the pipeline.

classDiagram
  Pipeline~S‚T~ --|> EventHandler~S~
  Pipeline~S‚T~ --|> EventSupplier~T~

The interface EventHandler<Api> is only a wrapper for Api, where Api is an event API, such as Sax or Smax. This interface is necessary because Java does not allow Pipeline<S, T> extends S, EventSupplier<T> (cannot refer to type parameter as supertype). There is one method Api getHandler() in EventHandler<Api>, which returns the underlying API.

The interface EventSupplier<Api> supplies Api events. It needs a handler for these events, which is set by setHandler(Api handler).

Pipelines can be chained together using their EventHandler and EventSupplier interfaces. The Pipeline<S, T> interface contains a default method append(Pipeline<T, U> next) that returns a Pipeline<S, U>.
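The following is a minimal reconstruction of how these three interfaces could fit together, based only on the description above. It is not the SPEAT source code: the body of append, the one-event API `TextEvents` and the transformer `MapText` are assumptions made for this illustration.

```java
import java.util.function.UnaryOperator;

// Sketch under stated assumptions; names other than EventHandler, EventSupplier,
// Pipeline, getHandler, setHandler and append are invented for this example.
class PipelineSketch {

    /** Input side: exposes the event API object that upstream components call. */
    interface EventHandler<A> {
        A getHandler();
    }

    /** Output side: is told where to send the events it produces. */
    interface EventSupplier<A> {
        void setHandler(A handler);
    }

    interface Pipeline<S, T> extends EventHandler<S>, EventSupplier<T> {
        /** Connects this pipeline's output to next's input (assumed implementation). */
        default <U> Pipeline<S, U> append(Pipeline<T, U> next) {
            Pipeline<S, T> first = this;
            first.setHandler(next.getHandler());
            return new Pipeline<S, U>() {
                @Override
                public S getHandler() {
                    return first.getHandler();
                }

                @Override
                public void setHandler(U handler) {
                    next.setHandler(handler);
                }
            };
        }
    }

    /** Hypothetical event API with a single event. */
    interface TextEvents {
        void process(CharSequence text);
    }

    /** Hypothetical transformer: applies a function to the text and passes it on. */
    static final class MapText implements Pipeline<TextEvents, TextEvents>, TextEvents {
        private final UnaryOperator<String> function;
        private TextEvents next;

        MapText(UnaryOperator<String> function) {
            this.function = function;
        }

        @Override
        public TextEvents getHandler() {
            return this;
        }

        @Override
        public void setHandler(TextEvents handler) {
            this.next = handler;
        }

        @Override
        public void process(CharSequence text) {
            if (next == null) {
                throw new IllegalStateException("no handler set");
            }
            next.process(function.apply(text.toString()));
        }
    }

    /** Builds a two-step pipeline, sends one event through it, returns what the sink received. */
    static String run(String input) {
        Pipeline<TextEvents, TextEvents> pipeline =
                new MapText(String::toUpperCase).append(new MapText(s -> s + "!"));
        StringBuilder sink = new StringBuilder();
        pipeline.setHandler(text -> sink.append(text));
        pipeline.getHandler().process(input); // returns only after the sink has been called
        return sink.toString();
    }

    public static void main(String[] args) {
        System.out.println(run("hello")); // HELLO!
    }
}
```

Because each step calls the next one directly, the call to process returns only after the whole chain has handled the event.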

flowchart LR
  StoT[S -> T] --> TtoU[T -> U]

Typically, a pipeline is built by starting with a Pipeline<InputSource, ApiIn> for some ApiIn, and then appending transformers. The output of a pipeline is usually captured by appending a Pipeline<ApiOut, OutputSource>, which handles the ApiOut events and serializes them to some destination.

Pipeline components run synchronously in a single thread. This means that a call to an event API method returns only after that event, and any events it causes further down the pipeline, have been processed.

Event APIs

The two main event APIs for XML processing in SPEAT are Sax and Smax. These are interfaces that specify all the event handling methods in these event APIs.

The Sax event API extends all of the org.xml.sax interfaces ContentHandler, DTDHandler, EntityResolver, ErrorHandler, DeclHandler, EntityResolver2, and LexicalHandler. To make it easier to write classes that implement EventHandler<Sax>, the SaxEventHandler class provides default implementations for the methods in these interfaces.

The Smax event API was described earlier. Currently, no helper classes are needed for Smax.

There are two event APIs for plain text documents. TextDocumentApi has one event process(CharSequence text), which processes a complete text document. To process a text document line by line, use TextLineStreamApi, which defines events for startDocument(), endDocument(), and line(String line). It also provides events to process streams of strings sequentially or in parallel.

Pipeline input and output

The InputSource interface specifies an event API for input sources. It has no events, because it is always the beginning of a pipeline. It specifies methods to get properties such as the last-modified date-time and the encoding, and to get a Reader or InputStream. This interface is similar to org.xml.sax.InputSource, but is not specific to XML sources.

The InputSourceBase class provides an abstract implementation of InputSource, and serves as a base class that actual input sources extend. There are several InputSource implementations, such as StringInputSource, FileInputSource, and others.

The abstract class InputSourceReader<Api> implements Pipeline<InputSource, Api>, turning an InputSource into Api events. In other words, it is a parser for Api. It has an abstract readInputAndSendEvents() method that sub-classes must implement to read the input source and send events to the handler. This method is called by the read() method in InputSourceReader<Api>.
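The division of labour between read() and readInputAndSendEvents() is a template method. The sketch below shows that shape with a line-oriented reader. It is not SPEAT code: `ReaderSketch`, `LineReaderSketch` and the `LineEvents` interface are hypothetical, and a plain java.io.Reader stands in for the InputSource.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the template-method shape; not SPEAT's InputSourceReader.
class LineReaderDemo {

    /** Hypothetical line-oriented event API with the three events named in the text. */
    interface LineEvents {
        void startDocument();
        void line(String line);
        void endDocument();
    }

    /** read() does the shared bookkeeping; subclasses only produce events. */
    abstract static class ReaderSketch<A> {
        protected final Reader input;
        protected A handler;

        ReaderSketch(Reader input) {
            this.input = input;
        }

        void setHandler(A handler) {
            this.handler = handler;
        }

        final void read() {
            if (handler == null) {
                throw new IllegalStateException("no handler set");
            }
            try (Reader closing = input) { // close the input when done
                readInputAndSendEvents();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        protected abstract void readInputAndSendEvents() throws IOException;
    }

    /** Sends one line event per input line, framed by start and end events. */
    static final class LineReaderSketch extends ReaderSketch<LineEvents> {
        LineReaderSketch(Reader input) {
            super(input);
        }

        @Override
        protected void readInputAndSendEvents() throws IOException {
            BufferedReader lines = new BufferedReader(input);
            handler.startDocument();
            for (String l = lines.readLine(); l != null; l = lines.readLine()) {
                handler.line(l);
            }
            handler.endDocument();
        }
    }

    /** Reads the text and records every event the handler receives. */
    static List<String> trace(String text) {
        List<String> events = new ArrayList<>();
        LineReaderSketch reader = new LineReaderSketch(new StringReader(text));
        reader.setHandler(new LineEvents() {
            @Override public void startDocument() { events.add("start"); }
            @Override public void line(String line) { events.add("line:" + line); }
            @Override public void endDocument() { events.add("end"); }
        });
        reader.read();
        return events;
    }

    public static void main(String[] args) {
        System.out.println(trace("first\nsecond")); // [start, line:first, line:second, end]
    }
}
```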

Implementations of InputSourceReader<Api> include:

  • SaxReader, which extends InputSourceReader<Sax> and parses XML from the input source.
  • HtmlSaxReader, which extends InputSourceReader<Sax> and parses HTML (using the TagSoup parser).
  • TextDocumentReader, which extends InputSourceReader<TextDocumentApi> and provides the whole document as one event.
  • TextLineStreamReader, which extends InputSourceReader<TextLineStreamApi> and generates one event for each line in the input document.
classDiagram
  InputSource <|-- InputSourceBase
  InputSourceBase <|-- FileInputSource
  InputSourceBase <|-- StringInputSource
  InputSourceBase <|-- xyzInputSource
  InputSourceReader~Api~ o-- InputSource
  InputSourceReader~Api~ o-- Api
  InputSourceReader~Api~ <|-- SaxReader
  SaxReader o-- InputSource
  SaxReader o-- Sax
  Api .. Sax

At the end of a pipeline, events of some Api can be serialized by a Pipeline<Api, OutputSource>. The OutputSource class has a base implementation OutputSourceBase, which has further implementations such as StringOutputSource, FileOutputSource, and others.

Implementations of Pipeline<Api, OutputSource> include:

  • OutputSourceWriter<Api>, a basic implementation. It has no write() method or anything similar, because writing is driven by the incoming Api events. Subclasses must implement Api to handle these events.
  • SaxWriter, which serializes Sax events to an OutputSource. This supports different serializations and other properties, set by the setOutputProperty(name, value) method. See also javax.xml.transform.OutputKeys.
  • TextDocumentWriter, which serializes TextDocumentApi events.
classDiagram
  OutputSource <|-- OutputSourceBase
  OutputSourceBase <|-- xyzOutputSource
  OutputSourceBase <|-- StringOutputSource
  OutputSourceBase <|-- FileOutputSource
  Pipeline~Api‚OutputSource~ <|-- OutputSourceWriter~Api~
  OutputSourceWriter~Api~ o-- Api
  OutputSourceWriter~Api~ o-- OutputSource
classDiagram
  OutputSource <|-- OutputSourceBase
  OutputSourceBase <|-- xyzOutputSource
  OutputSourceBase <|-- StringOutputSource
  OutputSourceBase <|-- FileOutputSource
  Pipeline~Sax‚OutputSource~ <|-- SaxWriter
  SaxWriter o-- OutputSource
  SaxWriter o-- Sax


License

MIT, MPL-2.0 licenses found
