org.openxml.parser
Class HTMLParser

java.lang.Object
  |
  +--org.openxml.parser.BaseParser
        |
        +--org.openxml.parser.ContentParser
              |
              +--org.openxml.parser.HTMLParser

public final class HTMLParser
extends org.openxml.parser.ContentParser

Implements a parser for HTML documents and nodes. The parser creates the HTML document with DOMFactory, loads the specified DTD document, and ensures that the HTML, HEAD and BODY elements exist in the document structure.
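
The guarantee that HTML, HEAD and BODY exist can be pictured with a small sketch built only on the JDK's org.w3c.dom classes. HtmlSkeleton, ensureSkeleton and childElement are hypothetical names introduced for illustration. They are not part of org.openxml, and the real parser may build the skeleton differently.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of org.openxml: guarantees an HTML/HEAD/BODY skeleton.
final class HtmlSkeleton {

    /** Creates an empty DOM document with the JDK's built-in factory. */
    static Document newEmptyDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the direct child element of parent with the given name, or null. */
    static Element childElement(Node parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && name.equalsIgnoreCase(n.getNodeName())) {
                return (Element) n;
            }
        }
        return null;
    }

    /**
     * Adds HTML, HEAD and BODY where missing and leaves existing elements alone.
     * Assumption: if a root element already exists, it is the HTML element.
     */
    static void ensureSkeleton(Document doc) {
        Element html = doc.getDocumentElement();
        if (html == null) {
            html = doc.createElement("HTML");
            doc.appendChild(html);
        }
        if (childElement(html, "HEAD") == null) {
            // Insert HEAD first so that it precedes BODY and any other content.
            html.insertBefore(doc.createElement("HEAD"), html.getFirstChild());
        }
        if (childElement(html, "BODY") == null) {
            html.appendChild(doc.createElement("BODY"));
        }
    }

    public static void main(String[] args) {
        Document doc = newEmptyDocument();
        ensureSkeleton(doc);
        Element html = doc.getDocumentElement();
        System.out.println(html.getNodeName() + " children: " + html.getChildNodes().getLength());
    }
}
```

Calling ensureSkeleton a second time on the same document adds nothing, which is the property a caller relies on when reading the parsed result.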

Version:
$Revision: 1.9 $ $Date: 1999/04/18 01:53:32 $
Author:
Assaf Arkin
See Also:
ContentParser, SAXException

Fields inherited from class org.openxml.parser.ContentParser
_currentNode, _docType
 
Fields inherited from class org.openxml.parser.BaseParser
_curChar, _document, _tokenText, CR, EOF, LF, SPACE, TOKEN_CDATA, TOKEN_CLOSE_TAG, TOKEN_COMMENT, TOKEN_DTD, TOKEN_ENTITY_REF, TOKEN_EOF, TOKEN_OPEN_TAG, TOKEN_PE_REF, TOKEN_PI, TOKEN_SECTION, TOKEN_SECTION_END, TOKEN_TEXT
 
Constructor Summary
HTMLParser(java.io.Reader reader, java.lang.String sourceURI)
          Parser constructor.
HTMLParser(java.io.Reader reader, java.lang.String sourceURI, short mode, short stopAtSeverity)
          Parser constructor.
 
Method Summary
 Document parseDocument()
           
protected  void parseDTDSubset()
          Parses the external DTD subset.
protected  boolean parseNextNode(int token)
          Parses the next node based on the supplied token.
 Node parseNode(Node node)
          Parses a document fragment.
 
Methods inherited from class org.openxml.parser.ContentParser
getEntityContents, parseAttrEntity, parseAttributes, parseContentEntity, readTokenContent
 
Methods inherited from class org.openxml.parser.BaseParser
advanceLineNumber, canReadName, close, error, fatalError, getColumnNumber, getErrorHandler, getErrorReport, getLastException, getLineNumber, getLocator, getMode, getPublicId, getReader, getSourcePosition, getSourceURI, getSystemId, isClosed, isMode, isNamePart, isSpace, isTokenAllSpace, parseDocumentDecl, parseGeneralEntity, pushBack, pushBack, readChar, readTokenEntity, readTokenMarkup, readTokenName, readTokenPERef, readTokenQuoted, setEncoding, setErrorHandler, setErrorSink, slicePITokenText, warning
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HTMLParser

public HTMLParser(java.io.Reader reader,
                  java.lang.String sourceURI,
                  short mode,
                  short stopAtSeverity)
Parser constructor. Requires the source text as a Reader object and a URI that identifies the source. The parsing mode is a combination of MODE_.. flags. The stopAtSeverity argument sets the error severity level at which parsing stops: Parser.STOP_SEVERITY_FATAL, Parser.STOP_SEVERITY_VALIDITY or Parser.STOP_SEVERITY_WELL_FORMED.
Parameters:
reader - Any Reader from which entity text can be read
sourceURI - URI of entity source
mode - The parsing mode in effect
stopAtSeverity - Severity level at which to stop parsing
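
Because the mode is a short built from flags, combining and testing flags is plain bit arithmetic. The sketch below illustrates that idea only. The constant names and values are hypothetical stand-ins, not the real Parser.MODE_ constants, and isSet merely mimics the kind of check that the inherited isMode method presumably performs.

```java
// Stand-in constants: names and values are hypothetical, not the real Parser.MODE_ flags.
final class ModeFlags {
    static final short MODE_EXAMPLE_A = 0x0001;
    static final short MODE_EXAMPLE_B = 0x0002;
    static final short MODE_EXAMPLE_C = 0x0004;

    /** Combines flags; the cast is needed because | promotes short operands to int. */
    static short combine(short... flags) {
        int mode = 0;
        for (short flag : flags) {
            mode |= flag;
        }
        return (short) mode;
    }

    /** True if every bit of flag is set in mode. */
    static boolean isSet(short mode, short flag) {
        return (mode & flag) == flag;
    }

    public static void main(String[] args) {
        short mode = combine(MODE_EXAMPLE_A, MODE_EXAMPLE_C);
        System.out.println("A set: " + isSet(mode, MODE_EXAMPLE_A));
        System.out.println("B set: " + isSet(mode, MODE_EXAMPLE_B));
    }
}
```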

HTMLParser

public HTMLParser(java.io.Reader reader,
                  java.lang.String sourceURI)
Parser constructor. The constructor operates in the default mode, Parser.MODE_HTML_PARSER, with Parser.STOP_SEVERITY_FATAL.
Parameters:
reader - Any Reader from which entity text can be read
sourceURI - URI of entity source
Method Detail

parseDocument

public Document parseDocument()
                       throws SAXException

parseNode

public final Node parseNode(Node node)
                     throws SAXException
Parses a document fragment. A document fragment by definition does not contain a header or DTD and is not subject to validation. An empty document fragment (created from an existing document) must be supplied, and the same fragment is returned once it has been filled.
Parameters:
node - A DocumentFragment that is empty and compatible
Returns:
The same DocumentFragment object
Throws:
SAXException - A parsing error has been encountered, and based on its severity, an exception is thrown to terminate parsing
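
The calling contract (pass in an empty fragment created from an existing document, get the same object back filled) can be sketched with the JDK's DOM classes alone. FragmentContract and its fill method are hypothetical stand-ins that build P elements from strings. They do not call org.openxml and only imitate the shape of parseNode.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;

// Hypothetical stand-in for a fragment parser; it does not call org.openxml.
final class FragmentContract {

    /** Creates an empty DOM document with the JDK's built-in factory. */
    static Document newEmptyDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Fills an empty fragment with one P element per string and returns the same fragment. */
    static Node fill(Node node, String... paragraphs) {
        if (!(node instanceof DocumentFragment) || node.hasChildNodes()) {
            throw new IllegalArgumentException("An empty DocumentFragment is required");
        }
        // New nodes come from the fragment's owner document so they are compatible with it.
        Document owner = node.getOwnerDocument();
        for (String text : paragraphs) {
            Node p = owner.createElement("P");
            p.appendChild(owner.createTextNode(text));
            node.appendChild(p);
        }
        return node;
    }

    public static void main(String[] args) {
        Document doc = newEmptyDocument();
        DocumentFragment fragment = doc.createDocumentFragment();
        Node result = fill(fragment, "first", "second");
        System.out.println("same object: " + (result == fragment));
    }
}
```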

parseNextNode

protected boolean parseNextNode(int token)
                         throws SAXException,
                                java.io.IOException
Parses the next node based on the supplied token. This method is called with a token that has already been read, parses a node and appends it to ContentParser._currentNode. If plain text is read, it is accumulated and later converted into a Text node. If the node is an element, the element is created and its full contents are read (recursively).

The return value indicates whether the current element (in ContentParser._currentNode) has been closed with a closing tag (false) or whether parsing should continue at the same level (true). False is also returned if the end of file has been reached.

The following rules govern how tokens are translated into nodes:

The proper way to use this method is:
 _currentNode = ...;
 token = readTokenContent();
 while ( parseNextNode( token ) )
     token = readTokenContent();
 
Parameters:
token - The last token read with ContentParser.readTokenContent()
Returns:
True if parsing should continue, false if the current element has been closed or the end of file has been reached
Throws:
SAXException - A parsing error has been encountered, and based on its severity, an exception is thrown to terminate parsing
java.io.IOException - An I/O exception has been encountered when reading from the input stream
See Also:
ContentParser.parseAttributes(org.w3c.dom.Element, boolean), ContentParser.readTokenContent(), ContentParser._currentNode, #_orphanClosingTag
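
The read-then-parse loop and its true/false contract can be tried out on a toy model. Everything in the sketch below is a hypothetical stand-in: ToyContentParser, ToyNode and the token values are invented for illustration, tokens are pre-split strings rather than characters from a Reader, and text is appended immediately instead of being accumulated. Only the control flow mirrors the description above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Toy model of the loop; every name here is a hypothetical stand-in, not org.openxml code.
final class ToyContentParser {
    static final int TOKEN_EOF = 0, TOKEN_OPEN_TAG = 1, TOKEN_CLOSE_TAG = 2, TOKEN_TEXT = 3;

    /** Minimal tree node: a name plus children; text nodes are named "#text:...". */
    static final class ToyNode {
        final String name;
        final List<ToyNode> children = new ArrayList<>();
        ToyNode(String name) { this.name = name; }
        @Override public String toString() {
            return children.isEmpty() ? name : name + children;
        }
    }

    private final Deque<String> input;
    private String tokenText;
    private ToyNode currentNode;

    ToyContentParser(String... tokens) {
        input = new ArrayDeque<>(Arrays.asList(tokens));
    }

    /** Reads the next token, remembers its text and classifies it. */
    int readTokenContent() {
        tokenText = input.poll();
        if (tokenText == null) return TOKEN_EOF;
        if (tokenText.startsWith("</")) return TOKEN_CLOSE_TAG;
        if (tokenText.startsWith("<")) return TOKEN_OPEN_TAG;
        return TOKEN_TEXT;
    }

    /** Returns true to keep parsing at this level, false on a closing tag or end of input. */
    boolean parseNextNode(int token) {
        switch (token) {
            case TOKEN_TEXT:
                currentNode.children.add(new ToyNode("#text:" + tokenText));
                return true;
            case TOKEN_OPEN_TAG: {
                ToyNode parent = currentNode;
                ToyNode element = new ToyNode(tokenText.substring(1, tokenText.length() - 1));
                parent.children.add(element);
                // Read the element's full contents recursively, then restore the parent.
                currentNode = element;
                int next = readTokenContent();
                while (parseNextNode(next)) {
                    next = readTokenContent();
                }
                currentNode = parent;
                return true;
            }
            default:
                // TOKEN_CLOSE_TAG or TOKEN_EOF; the toy does not detect orphan closing tags.
                return false;
        }
    }

    /** Runs the documented loop against a fresh root node. */
    ToyNode parse() {
        ToyNode root = new ToyNode("#root");
        currentNode = root;
        int token = readTokenContent();
        while (parseNextNode(token)) {
            token = readTokenContent();
        }
        return root;
    }

    public static void main(String[] args) {
        System.out.println(new ToyContentParser("<b>", "hi", "</b>", "bye").parse());
    }
}
```

The recursive call inside the open-tag case uses the same loop as the top level, which is why a closing tag only ends the innermost open element.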

parseDTDSubset

protected final void parseDTDSubset()
                             throws SAXException,
                                    java.io.IOException
Parses the external DTD subset. A new DTDDocument is created, the external subset is optionally cached in memory, and public identifiers are possibly converted to URIs, as per the installed HolderFinder.

This method is called after '<!DOCTYPE' has been consumed and returns after the terminating '>' has been read.
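
The two ideas in this description, mapping a public identifier to a URI and caching the loaded subset in memory, can be sketched independently of the library. DtdCatalog below is a hypothetical stand-in and not the HolderFinder API. The loader function, the registered mapping and the loads counter are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for the lookup-and-cache step; it is not the HolderFinder API.
final class DtdCatalog {
    private final Map<String, String> publicIdToUri = new HashMap<>();
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> loader;
    int loads = 0; // Counts how often the loader actually ran.

    DtdCatalog(Function<String, String> loader) {
        this.loader = loader;
    }

    void register(String publicId, String uri) {
        publicIdToUri.put(publicId, uri);
    }

    /** Prefers the URI registered for the public identifier, else the system identifier. */
    String resolve(String publicId, String systemId) {
        String mapped = publicId == null ? null : publicIdToUri.get(publicId);
        return mapped != null ? mapped : systemId;
    }

    /** Loads the subset text once per URI and serves later requests from memory. */
    String load(String publicId, String systemId) {
        String uri = resolve(publicId, systemId);
        String cached = cache.get(uri);
        if (cached == null) {
            loads++;
            cached = loader.apply(uri);
            cache.put(uri, cached);
        }
        return cached;
    }

    public static void main(String[] args) {
        DtdCatalog catalog = new DtdCatalog(uri -> "subset loaded from " + uri);
        catalog.register("-//W3C//DTD HTML 4.0//EN", "http://www.w3.org/TR/REC-html40/strict.dtd");
        System.out.println(catalog.load("-//W3C//DTD HTML 4.0//EN", null));
        System.out.println(catalog.load("-//W3C//DTD HTML 4.0//EN", null));
        System.out.println("loader calls: " + catalog.loads);
    }
}
```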