SPEAT arose from the need to process the text content of (structured) XML documents in a linear (unstructured) way. The standard XML processing APIs, such as DOM and SAX (and LINQ, if you use .NET), view an XML document, including its text content, as a tree. The same is true for XSLT, XPath and XQuery. An XML document tree consists of nodes, and the text content is distributed across different nodes at different levels in the tree. With the tree representation, it is possible to work on the complete text content of a document (or element), but doing so loses the structure of the document.
In order to process the document text in a linear way while retaining (and modifying) the structure and markup, SMAX (Separated Markup API for XML) was designed. As the name indicates, SMAX treats the markup (element structure) and text of an XML document separately. A typical use case for SMAX is text analysis followed by the insertion of new markup around recognized text fragments. Examples are named entity recognition and parsing.
SMAX is an event API, like SAX.
Event APIs implement 'push parsing'.
In this approach, a processing component accepts events generated by another component, which may be a parser.
By contrast, in 'pull parsing', the application decides when to ask for more data (from another component like a parser).
Unlike SAX, which provides methods for fine-grained events such as startElement or characters,
SMAX has just one event, process, which operates on the markup (structure) and the content
(unstructured, linear text) of a document as separate properties.
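The difference between push and pull parsing can be illustrated with the JDK alone. The sketch below is not SPEAT code: it collects the text content of a small document once with SAX (push, the parser calls the handler) and once with StAX (pull, the application asks for the next event). The class name `PushVsPull` and its helper methods are made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PushVsPull {

    // Push parsing: the SAX parser drives the loop and calls our handler.
    static String pushText(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    // Pull parsing: the application drives the loop and asks for each event.
    static String pullText(String xml) {
        StringBuilder text = new StringBuilder();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
            reader.close();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String xml = "<p>Hello <b>world</b>!</p>";
        System.out.println(pushText(xml));
        System.out.println(pullText(xml));
    }
}
```

Both methods return the same linear text; they differ only in which side controls the flow of events.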
XML document processing is often done using pipelines, facilitated by technologies like
XProc and Apache Cocoon.
SPEAT is a Java library for building pipelines from event APIs like SAX and SMAX.
Each step in a SPEAT pipeline is a transformer from one event API type to another.
Examples are SaxToSmaxAdapter, which transforms from SAX to SMAX, and
NamedEntityRecognizer which transforms from SMAX to SMAX.
A transformer is a (one-step) pipeline, and a pipeline can be used as a transformer in another pipeline.
The main purpose of SMAX is to separate the markup of an XML document from its text content,
hence the name Separated Markup API for XML.
SMAX is also an event API, with just one event process(SmaxDocument document).
In this way, it combines aspects of event-driven APIs like SAX with aspects of the DOM (Document Object Model),
which provides random access to parts of the XML document.
The class SmaxDocument is the SMAX representation of an XML document or document-fragment.
It contains a SmaxElement, which is the root element of the markup of the document (or fragment), and
a SmaxContent, which is the text of the document.
A SmaxElement has methods to access its name, namespace, attributes, parent and children.
It does not have methods to access its text content, because that is separated from the markup.
Instead, it has a start- and end-position in the text content.
The start- and end-positions are points between characters.
Position n is just before the n-th character in the StringBuffer of a SmaxContent.
A SmaxDocument may be a sub-document of a larger SmaxDocument, and use (a subset of) the same SmaxElements.
To avoid the use of position offsets in a sub-document,
the root element of a SmaxDocument does not necessarily have start-position 0.
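To make the idea of separated markup concrete, here is a minimal sketch of a markup node that stores only a character span and looks up its text in shared content. The `Span` class is a hypothetical stand-in written for this illustration; it is not the actual SmaxElement class, and it assumes zero-based positions.

```java
import java.util.ArrayList;
import java.util.List;

class SeparatedMarkup {

    // Hypothetical stand-in for a markup node: a name plus a character span.
    static final class Span {
        final String name;
        final int start; // position just before the first character of the span
        final int end;   // position just after the last character of the span
        final List<Span> children = new ArrayList<>();

        Span(String name, int start, int end) {
            this.name = name;
            this.start = start;
            this.end = end;
        }

        // The node holds no text itself; it reads from the shared content.
        String text(CharSequence content) {
            return content.subSequence(start, end).toString();
        }
    }

    public static void main(String[] args) {
        // Markup and text of: <p>Hello <b>world</b>!</p>
        StringBuffer content = new StringBuffer("Hello world!");
        Span p = new Span("p", 0, 12);
        Span b = new Span("b", 6, 11);
        p.children.add(b);

        System.out.println(p.text(content)); // Hello world!
        // A sub-document rooted at <b> can reuse the same node: its root
        // simply starts at position 6 instead of 0.
        System.out.println(b.text(content)); // world
    }
}
```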
The SmaxContent class implements CharSequence, like String and other character sequences in Java.
Its main purpose is to provide sub-document views with zero-based indexes, without copying any content.
It also provides all methods from StringBuffer to manipulate the underlying character sequence efficiently.
The start-position and end-position of every SmaxElement in a SmaxDocument
are relative to the underlying SmaxContent.
This makes it easier to create sub-documents of a SmaxDocument without changing start and end positions.
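A zero-based view over shared content, without copying, could look roughly like the following. This is a sketch under assumptions: `ContentView` is an invented name, and the real SmaxContent class may be organized quite differently.

```java
class ContentView implements CharSequence {
    private final StringBuffer buffer; // shared with other views, never copied
    private final int offset;          // where this view starts in the buffer
    private final int length;          // number of characters in this view

    ContentView(StringBuffer buffer, int offset, int length) {
        if (offset < 0 || length < 0 || offset + length > buffer.length()) {
            throw new IndexOutOfBoundsException("view outside buffer");
        }
        this.buffer = buffer;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return buffer.charAt(offset + index); // translate to buffer position
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException("range " + start + ".." + end);
        }
        return new ContentView(buffer, offset + start, end - start);
    }

    @Override
    public String toString() {
        return buffer.substring(offset, offset + length);
    }

    public static void main(String[] args) {
        StringBuffer shared = new StringBuffer("Hello world!");
        ContentView view = new ContentView(shared, 6, 5);
        System.out.println(view.charAt(0)); // index 0 of the view is buffer position 6
        shared.setCharAt(6, 'W');           // a change in the buffer is visible in the view
        System.out.println(view);
    }
}
```

Because the view only stores an offset and a length, creating a sub-document view costs nothing in terms of copied characters.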
The SmaxDocument class has a method
insertMarkup(SmaxElement newNode, Balancing balancing, int startPos, int endPos)
which inserts a new XML element into the document, with a character span from startPos to endPos.
This method modifies the document structure (markup), but not the content.
It can be used to mark ranges of text, and forms the basis of text recognition transformers.
Surrounding a character span by a new XML element at arbitrary start- and end-positions might lead to
unbalanced, or non-well-formed XML markup.
Therefore, a Balancing strategy must be specified.
This tells the insertMarkup method how to deal with potentially unbalanced markup.
See the javadoc for available balancing strategies.
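The balancing problem can be shown with a small, self-contained sketch. It does not use the SPEAT classes and does not implement any of the actual Balancing strategies; the `Node` class and the `insert` method are hypothetical. The only policy shown here is the simplest one: refuse a span that would cross an existing element boundary.

```java
import java.util.ArrayList;
import java.util.List;

class InsertMarkupSketch {

    // Hypothetical markup node with a character span (not SmaxElement).
    static final class Node {
        final String name;
        final int start;
        final int end;
        final List<Node> children = new ArrayList<>();

        Node(String name, int start, int end) {
            this.name = name;
            this.start = start;
            this.end = end;
        }

        boolean contains(int s, int e) { return start <= s && e <= end; }
        boolean inside(int s, int e)   { return s <= start && end <= e; }
        boolean disjoint(int s, int e) { return end <= s || e <= start; }
    }

    // Inserts newNode around its span; returns false if the span is empty,
    // lies outside the root, or would cross an element boundary (unbalanced).
    static boolean insert(Node root, Node newNode) {
        int s = newNode.start, e = newNode.end;
        if (s >= e || !root.contains(s, e)) return false;
        // Descend to the deepest element that strictly contains the span.
        Node parent = root;
        boolean descended = true;
        while (descended) {
            descended = false;
            for (Node child : parent.children) {
                if (child.contains(s, e) && !child.inside(s, e)) {
                    parent = child;
                    descended = true;
                    break;
                }
            }
        }
        List<Node> kept = new ArrayList<>();
        List<Node> wrapped = new ArrayList<>();
        int index = 0; // number of kept siblings that end before the new span
        for (Node child : parent.children) {
            if (child.inside(s, e)) {
                wrapped.add(child);
            } else if (child.disjoint(s, e)) {
                if (child.end <= s) index++;
                kept.add(child);
            } else {
                return false; // partial overlap: markup would be unbalanced
            }
        }
        newNode.children.addAll(wrapped);
        parent.children.clear();
        parent.children.addAll(kept);
        parent.children.add(index, newNode);
        return true;
    }

    // Renders markup and text back to XML (no escaping; illustration only).
    static String toXml(Node n, CharSequence text) {
        StringBuilder sb = new StringBuilder("<" + n.name + ">");
        int pos = n.start;
        for (Node c : n.children) {
            sb.append(text, pos, c.start).append(toXml(c, text));
            pos = c.end;
        }
        return sb.append(text, pos, n.end).append("</" + n.name + ">").toString();
    }

    public static void main(String[] args) {
        String text = "Hello world!";
        Node p = new Node("p", 0, 12);
        p.children.add(new Node("b", 6, 11));
        System.out.println(insert(p, new Node("greeting", 0, 5))); // fits beside <b>
        System.out.println(insert(p, new Node("x", 3, 8)));        // crosses a boundary
        System.out.println(toXml(p, text));
    }
}
```

A real balancing strategy would go further than rejecting the span, for example by adjusting it or splitting it so that the result stays well-formed.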
SPEAT is built around three interfaces: EventHandler, EventSupplier, and Pipeline.
A Pipeline<S, T> transforms events of type S into events of type T.
It extends both EventHandler<S> and EventSupplier<T>, which are the input and output side of the pipeline.
classDiagram
EventHandler~S~ <|-- Pipeline~S‚T~
EventSupplier~T~ <|-- Pipeline~S‚T~
The interface EventHandler<Api> is only a wrapper for Api, where Api is an event API, such as Sax or Smax.
This interface is necessary because Java does not allow Pipeline<S, T> extends S, EventSupplier<T>
(cannot refer to type parameter as supertype).
There is one method Api getHandler() in EventHandler<Api>, which returns the underlying API.
The interface EventSupplier<Api> supplies Api events.
It needs a handler for these events, which is set by setHandler(Api handler).
Pipelines can be chained together using their EventHandler and EventSupplier interfaces.
The Pipeline<S, T> interface contains a default method append(Pipeline<T, U> next)
that returns a Pipeline<S, U>.
flowchart LR
StoT[S -> T] --> TtoU[T -> U]
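The three interfaces and the chaining behaviour described above can be reconstructed in a few lines. The following is a sketch based only on the description in this section, not a copy of the SPEAT sources: the interface names follow the text, while `TextApi`, `LineApi`, `LineSplitter` and `UpperCaser` are toy types invented for the demonstration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class PipelineSketch {

    interface EventHandler<A> { A getHandler(); }

    interface EventSupplier<A> { void setHandler(A handler); }

    interface Pipeline<S, T> extends EventHandler<S>, EventSupplier<T> {
        // Wires this pipeline's output to the input of the next one.
        default <U> Pipeline<S, U> append(Pipeline<T, U> next) {
            Pipeline<S, T> first = this;
            first.setHandler(next.getHandler());
            return new Pipeline<S, U>() {
                @Override public S getHandler() { return first.getHandler(); }
                @Override public void setHandler(U handler) { next.setHandler(handler); }
            };
        }
    }

    // Two toy event APIs (assumptions, not part of SPEAT).
    interface TextApi { void process(CharSequence text); }
    interface LineApi { void line(String line); }

    // Transforms one TextApi event into one LineApi event per line.
    static final class LineSplitter implements Pipeline<TextApi, LineApi> {
        private LineApi out;
        @Override public TextApi getHandler() {
            return text -> {
                for (String line : text.toString().split("\n")) out.line(line);
            };
        }
        @Override public void setHandler(LineApi handler) { out = handler; }
    }

    // Transforms LineApi events into upper-cased LineApi events.
    static final class UpperCaser implements Pipeline<LineApi, LineApi> {
        private LineApi out;
        @Override public LineApi getHandler() {
            return line -> out.line(line.toUpperCase(Locale.ROOT));
        }
        @Override public void setHandler(LineApi handler) { out = handler; }
    }

    static List<String> run(String text) {
        List<String> result = new ArrayList<>();
        Pipeline<TextApi, LineApi> pipeline = new LineSplitter().append(new UpperCaser());
        pipeline.setHandler(result::add);    // capture the output events
        pipeline.getHandler().process(text); // returns after all events are handled
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run("first line\nsecond line"));
    }
}
```

The appended pipeline exposes the input side of the first step and the output side of the last step, so it can itself be appended to or used as a step in another pipeline.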
Typically, a pipeline is built by starting with a Pipeline<InputSource, ApiIn>
for some ApiIn, and then appending transformers.
The output of a pipeline is usually captured by appending a Pipeline<ApiOut, OutputSource>,
which handles the ApiOut events and serializes them to some destination.
Pipeline components run synchronously in a single thread. This means that a call to an event API method returns only after that event, and any events it produced downstream, have been processed.
The two main event APIs for XML processing in SPEAT are Sax and Smax.
These are interfaces that specify all the event handling methods in these event APIs.
The Sax event API extends all of the org.xml.sax interfaces
ContentHandler, DTDHandler, EntityResolver, ErrorHandler, DeclHandler,
EntityResolver2, and LexicalHandler.
To make it easier to write classes that implement EventHandler<Sax>, the SaxEventHandler class
provides default implementations for the methods in these interfaces.
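The following sketch shows how a combined SAX interface and a base class with default implementations might fit together. The `Sax` interface here is redeclared locally from the list above, and `SaxBase` is a hypothetical stand-in for SaxEventHandler that simply reuses the JDK's `org.xml.sax.ext.DefaultHandler2`; whether SPEAT does it this way is an assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.DTDHandler;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DeclHandler;
import org.xml.sax.ext.DefaultHandler2;
import org.xml.sax.ext.EntityResolver2;
import org.xml.sax.ext.LexicalHandler;

class SaxApiSketch {

    // One interface that gathers all SAX event-handling methods.
    interface Sax extends ContentHandler, DTDHandler, EntityResolver, ErrorHandler,
            DeclHandler, EntityResolver2, LexicalHandler { }

    // Hypothetical base class: DefaultHandler2 already supplies no-op
    // implementations for every method in the interfaces above.
    static class SaxBase extends DefaultHandler2 implements Sax { }

    // A handler only needs to override the events it cares about.
    static final class ElementCounter extends SaxBase {
        int count;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            count++;
        }
    }

    static int countElements(String xml) {
        ElementCounter counter = new ElementCounter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), counter);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return counter.count;
    }

    public static void main(String[] args) {
        System.out.println(countElements("<p>Hello <b>world</b>!</p>"));
    }
}
```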
The Smax event API was described earlier. Currently, no helper classes are needed for Smax.
There are two event APIs for plain text documents.
TextDocumentApi has one event process(CharSequence text), which processes a complete text document.
To process a text document line by line, use TextLineStreamApi, which defines events for
startDocument(), endDocument(), and line(String line).
It also provides events to process streams of strings sequentially or in parallel.
The InputSource interface specifies an event API for input sources.
It has no events, because it is always the beginning of a pipeline.
It specifies methods for getting properties such as the last-modified date-time and the encoding, and for obtaining a Reader or InputStream.
This interface is similar to org.xml.sax.InputSource, but not specific for XML sources.
The InputSourceBase class provides an abstract implementation of InputSource,
and serves as a base class that actual input sources extend.
There are several InputSource implementations, such as StringInputSource, FileInputSource, and others.
The abstract class InputSourceReader<Api> implements Pipeline<InputSource, Api>,
transforming InputSource events into Api events.
Therefore, it is a parser for Api.
It has an abstract readInputAndSendEvents() method that subclasses must implement to
read the input source and send events to the handler.
This method is called by the read() method in InputSourceReader<Api>.
Implementations of InputSourceReader<Api> include:
- SaxReader, which extends InputSourceReader<Sax> and parses XML from the input source.
- HtmlSaxReader, which extends InputSourceReader<Sax> and parses HTML (using the TagSoup parser).
- TextDocumentReader, which extends InputSourceReader<TextDocumentApi> and provides the whole document as one event.
- TextLineStreamReader, which extends InputSourceReader<TextLineStreamApi> and generates one event for each line in the input document.
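The relationship between read() and readInputAndSendEvents() is a template-method pattern, which the sketch below illustrates with a line-by-line reader. All types here are simplified stand-ins written for this example: `SourceReader` takes a plain `java.io.Reader` instead of an InputSource, and `LineEvents` only mirrors the three events named for TextLineStreamApi.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class ReaderSketch {

    // Hypothetical line-oriented event API.
    interface LineEvents {
        void startDocument();
        void line(String line);
        void endDocument();
    }

    // Simplified stand-in for an input source reader.
    abstract static class SourceReader<A> {
        protected Reader input;
        protected A handler;

        void setInput(Reader input) { this.input = input; }
        void setHandler(A handler) { this.handler = handler; }

        // Checks the wiring, then delegates the actual reading to the subclass.
        final void read() {
            if (input == null || handler == null) {
                throw new IllegalStateException("input and handler must be set");
            }
            try {
                readInputAndSendEvents();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        protected abstract void readInputAndSendEvents() throws IOException;
    }

    // Generates one event for each line of the input.
    static final class LineReader extends SourceReader<LineEvents> {
        @Override
        protected void readInputAndSendEvents() throws IOException {
            handler.startDocument();
            try (BufferedReader reader = new BufferedReader(input)) {
                for (String l = reader.readLine(); l != null; l = reader.readLine()) {
                    handler.line(l);
                }
            }
            handler.endDocument();
        }
    }

    // Runs the reader on a string and records the events it sends.
    static List<String> trace(String text) {
        List<String> events = new ArrayList<>();
        LineReader reader = new LineReader();
        reader.setInput(new StringReader(text));
        reader.setHandler(new LineEvents() {
            @Override public void startDocument() { events.add("start"); }
            @Override public void line(String line) { events.add("line:" + line); }
            @Override public void endDocument() { events.add("end"); }
        });
        reader.read();
        return events;
    }

    public static void main(String[] args) {
        System.out.println(trace("a\nb"));
    }
}
```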
classDiagram
InputSource <|-- InputSourceBase
InputSourceBase <|-- FileInputSource
InputSourceBase <|-- StringInputSource
InputSourceBase <|-- xyzInputSource
InputSourceReader~Api~ o-- InputSource
InputSourceReader~Api~ o-- Api
InputSourceReader~Api~ <|-- SaxReader
SaxReader o-- InputSource
SaxReader o-- Sax
Api .. Sax
At the end of a pipeline, events of some Api can be serialized by a Pipeline<Api, OutputSource>.
The OutputSource class has a base implementation OutputSourceBase, which has further implementations such as
StringOutputSource, FileOutputSource, and others.
Implementations of Pipeline<Api, OutputSource> include:
- OutputSourceWriter<Api>, a basic implementation. It has no write() method or anything similar, because it receives incoming Api events. Subclasses must implement Api to handle incoming events.
- SaxWriter, which serializes Sax events to an OutputSource. It supports different serializations and other properties, set by the setOutputProperty(name, value) method. See also javax.xml.transform.OutputKeys.
- TextDocumentWriter, which serializes TextDocumentApi events.
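For readers unfamiliar with serializing SAX events, the JDK offers one way to do it through a `TransformerHandler`, whose transformer accepts the output properties defined in `javax.xml.transform.OutputKeys`. The sketch below uses only JDK classes; whether SaxWriter is implemented along these lines is an assumption, and `SaxSerializeSketch` is an invented name.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

class SaxSerializeSketch {

    // Sends a handful of SAX events to a serializer and returns the result.
    static String serialize(String elementName, String text) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            // Output properties control the serialization.
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            // The handler is a ContentHandler, so it accepts SAX events directly.
            handler.startDocument();
            handler.startElement("", elementName, elementName, new AttributesImpl());
            handler.characters(text.toCharArray(), 0, text.length());
            handler.endElement("", elementName, elementName);
            handler.endDocument();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize("p", "a < b"));
    }
}
```

Special characters in the text are escaped by the serializer, so the handler producing the events does not need to do any escaping itself.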
classDiagram
OutputSource <|-- OutputSourceBase
OutputSourceBase <|-- xyzOutputSource
OutputSourceBase <|-- StringOutputSource
OutputSourceBase <|-- FileOutputSource
Pipeline~Api‚OutputSource~ <|-- OutputSourceWriter~Api~
OutputSourceWriter~Api~ o-- Api
OutputSourceWriter~Api~ o-- OutputSource
classDiagram
OutputSource <|-- OutputSourceBase
OutputSourceBase <|-- xyzOutputSource
OutputSourceBase <|-- StringOutputSource
OutputSourceBase <|-- FileOutputSource
Pipeline~Sax‚OutputSource~ <|-- SaxWriter
SaxWriter o-- OutputSource
SaxWriter o-- Sax
