Name

SGML::DTD - SGML DTD parser


Synopsis

    use SGML::DTD;

    $dtd = new SGML::DTD;
    $dtd->read_dtd(\*FILEHANDLE);

    $dtd = new SGML::DTD \*FILEHANDLE;

    SGML::DTD->set_ent_manager($entity_manager);
    $dtd = new SGML::DTD;
    $dtd->read_dtd(\*FILEHANDLE);

    $dtd = new SGML::DTD \*FILEHANDLE, $entity_manager;

Description

SGML::DTD is an SGML DTD parser. You pass it a filehandle containing the DTD to parse, either at object construction or through the read_dtd method. Pass a reference to the filehandle to avoid package scoping problems. If a filehandle is passed at object construction, undef is returned if a parsing error occurs. The read_dtd method returns 1 if no errors occurred and 0 on an error.
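
The return values described above can be checked as in the following minimal sketch. It uses only new and read_dtd; the helper name load_dtd and its $path argument are hypothetical and not part of the module.

```perl
use strict;
use warnings;

# Parse the DTD stored in $path and return the SGML::DTD object, or
# undef if the file cannot be opened or a parsing error occurs.
# load_dtd and $path are hypothetical names used for illustration.
sub load_dtd {
    my ($path) = @_;

    local *DTDFILE;
    open(DTDFILE, '<', $path) or return;

    # Loaded at run time so that a missing file is reported before
    # a missing module.
    require SGML::DTD;

    my $dtd = SGML::DTD->new();
    my $ok  = $dtd->read_dtd(\*DTDFILE);   # 1 on success, 0 on error
    close(DTDFILE);

    return $ok ? $dtd : undef;
}
```

A caller can then write something like `my $dtd = load_dtd('my.dtd') or die "cannot parse DTD\n";`, where my.dtd is a placeholder file name.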

When parsing the DTD, SGML::DTD builds up data structures that represent the information contained in the DTD. Various methods are provided to access DTD information. See Object Methods for the methods available.

SGML::DTD uses an SGML::EntMan object to resolve external entity references. If no entity manager is passed to SGML::DTD, it creates one using the default construction rule of SGML::EntMan. Normally, this will not be sufficient. Therefore, an SGML::EntMan object should be created first and loaded with DTD-specific catalogs. Then instantiate an SGML::DTD object and pass the SGML::EntMan object to it. The SGML::EntMan object can be specified during SGML::DTD construction, or by the set_ent_manager class method.

The following describes the current limitations of SGML::DTD:


Class Methods

Class methods are methods that apply at the class level. Therefore, they may affect all instances of the SGML::DTD class. Class methods can be invoked like the following:

    SGML::DTD->set_ent_manager($entman);

or,

    set_ent_manager SGML::DTD $entman;

The following class methods are defined:


new

new SGML::DTD
new SGML::DTD \*FILEHANDLE
new SGML::DTD \*FILEHANDLE, $entman

new creates a new SGML::DTD object. An optional filehandle argument can be specified to cause new to automatically parse the DTD represented by the filehandle. If a filehandle is specified, an optional SGML::EntMan object may also be specified for resolving any external entity references.


is_attr_keyword

is_attr_keyword SGML::DTD $word

is_attr_keyword returns 1 if $word is an attribute content reserved value, otherwise, it returns 0. In the reference concrete syntax, the following values of $word will return 1:

Character case is ignored.


is_elem_keyword

is_elem_keyword SGML::DTD $word

is_elem_keyword returns 1 if $word is an element content reserved value, otherwise, it returns 0. In the reference concrete syntax, the following values of $word will return 1:

Character case is ignored.


is_group_connector

is_group_connector SGML::DTD $char

is_group_connector returns 1 if $char is a group connector, otherwise, it returns 0. The following values of $char will return 1:


is_occur_indicator

is_occur_indicator SGML::DTD $char

is_occur_indicator returns 1 if $char is an occurrence indicator, otherwise, it returns 0. The following values of $char will return 1:


is_tag_name

is_tag_name SGML::DTD $string

is_tag_name returns 1 if $string is a legal tag name, otherwise, it returns 0. Legal characters in a tag name are defined by the SGML::Syntax::$namechars variable. By default, a tag name may only contain the characters "A-Za-z_.-".


set_comment_callback

set_comment_callback SGML::DTD $coderef

Set a function to be called during parsing when a comment declaration is encountered. The comment callback function is invoked as follows:

    &$coderef(\$comment_txt);
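
Because the callback is handed a reference to the comment text, it has to dereference its argument. The following is a small sketch of such a callback; the name collect_comment and the sample text are made up, and whether the text includes the comment delimiters is not stated here.

```perl
use strict;
use warnings;

my @comments;   # collected comment text, in the order encountered

# The callback receives a reference to the comment text, so it must
# dereference its argument to read the string.
sub collect_comment {
    my ($comment_ref) = @_;
    push @comments, $$comment_ref;
}

# Registration, as documented above (requires SGML::DTD to be installed):
#   SGML::DTD->set_comment_callback(\&collect_comment);

# Direct call that mimics the documented invocation, with made-up text.
my $sample = ' Revision 1.2 ';
collect_comment(\$sample);
print scalar(@comments), " comment(s) collected\n";
```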

set_debug_callback

set_debug_callback SGML::DTD $coderef

Set a function to be called when a debugging message is generated. The debug callback function is invoked as follows:

    &$coderef(@string_list);

Debugging messages are only generated if verbosity is set to true.


set_debug_handle

set_debug_handle SGML::DTD \*FILEHANDLE

Set the filehandle to which debugging messages are sent. Messages are not sent to the filehandle if a debug callback function is registered. The default filehandle is STDERR.


set_ent_manager

set_ent_manager SGML::DTD $entman

Set the entity manager. The entity manager will be used to resolve any external identifiers during parsing. The entity manager should be of type SGML::EntMan.


set_err_callback

set_err_callback SGML::DTD $coderef

Set a function to be called during parsing when an error occurs. The error callback function is invoked as follows:

    &$coderef(@string_list);
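
An error callback can be used to collect messages instead of printing them. The sketch below assumes that the strings in the list are fragments of one message that can simply be concatenated; the name record_error and the sample text are hypothetical.

```perl
use strict;
use warnings;

my @errors;   # one entry per reported error

# The callback receives the message as a list of strings. Joining them
# with no separator is an assumption about how the fragments fit together.
sub record_error {
    my (@string_list) = @_;
    push @errors, join('', @string_list);
}

# Registration, as documented above (requires SGML::DTD to be installed):
#   SGML::DTD->set_err_callback(\&record_error);

# Direct call that mimics the documented invocation, with made-up text.
record_error('Unexpected token ', '"FOO"', "\n");
print scalar(@errors), " error(s) recorded\n";
```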

set_err_handle

set_err_handle SGML::DTD \*FILEHANDLE

Set the filehandle to which error messages are sent. Messages are not sent to the filehandle if an error callback function is registered. The default filehandle is STDERR.


set_pi_callback

set_pi_callback SGML::DTD $coderef

Set a function to be called during parsing when a processing instruction is encountered. The pi callback function is invoked as follows:

    &$coderef(\$pi_txt);

set_tree_callback

set_tree_callback SGML::DTD $coderef

Set callback for printing a tree entry when the print_tree object method is invoked. The tree entry callback function is invoked as follows:

    &$coderef($iselem_flag, $string);

This method allows you to modify the text output of the print_tree method. However, it requires some understanding of the string passed to the callback to do anything interesting with it. The method exists mainly for the use of a specific application, so its use is discouraged.


set_verbosity

set_verbosity SGML::DTD $boolean

Set whether SGML::DTD outputs debugging messages as it parses a DTD.


Object Methods


read_dtd

$dtd->read_dtd(\*FILEHANDLE)

Parse a DTD from FILEHANDLE.

The following methods are applicable after a DTD has been parsed:


get_base_children

@list = $dtd->get_base_children($element, $andcon)

get_base_children returns an array of the elements in the base model group of $element. $andcon is a flag that controls whether the connector characters are included in the returned array: 0 => no connectors, 1 (non-zero) => connectors.

Example:

    <!ELEMENT foo (x | y | z) +(a | b) -(m | n)>

The call

    $dtd->get_base_children('foo')

will return

    ('x', 'y', 'z')

The call

    $dtd->get_base_children('foo', 1)

will return

    ('(','x', '|', 'y', '|', 'z', ')')

get_elem_attr

%attributes = $dtd->get_elem_attr($element)

Retrieve the attributes defined for $element. The return value is a hash where the keys are the attribute names and the values are the definitions of the attributes. Each definition is stored as a list. The first list value is the default value for the attribute (which may be an SGML reserved word). If the default value equals "#FIXED", then the next list value is the #FIXED value. The remaining list values are all possible values for the attribute.
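
The layout of a definition list can be unpacked as in the sketch below. It assumes each hash value is an array reference, and it runs on hand-written sample data, not on the output of a real parse; describe_attr is a hypothetical helper.

```perl
use strict;
use warnings;

# Split one attribute definition into its parts, following the layout
# described above. The definition is assumed to be an array reference.
sub describe_attr {
    my ($def) = @_;
    my @values  = @$def;
    my $default = shift @values;
    my $fixed;
    $fixed = shift @values if defined $default && uc($default) eq '#FIXED';
    return { default => $default, fixed => $fixed, values => [@values] };
}

# Hand-written data in the assumed shape of %attributes.
my %attributes = (
    status  => ['draft', 'draft', 'final'],
    version => ['#FIXED', '1.0', 'CDATA'],
);

for my $name (sort keys %attributes) {
    my $info  = describe_attr($attributes{$name});
    my $fixed = defined $info->{fixed} ? " fixed=$info->{fixed}" : '';
    print "$name: default=$info->{default}$fixed values=@{ $info->{values} }\n";
}
```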


get_elements

@elements = $dtd->get_elements($nosort)

Retrieve all elements defined in the DTD. If $nosort is true, the elements are returned in the order they were defined in the DTD. Otherwise, they are in sorted order.
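
As an illustration, get_elements can be combined with get_elem_attr (described earlier) to list the attribute names of every element. This is only a sketch: attribute_report is a hypothetical helper, and $dtd is assumed to be an SGML::DTD object that has already parsed a DTD.

```perl
use strict;
use warnings;

# Return one line per element, in sorted element order, naming the
# attributes defined for that element.
sub attribute_report {
    my ($dtd) = @_;
    my @lines;
    for my $element ($dtd->get_elements()) {
        my %attrs = $dtd->get_elem_attr($element);
        push @lines, "$element: " . join(', ', sort keys %attrs);
    }
    return @lines;
}
```

Calling `print "$_\n" for attribute_report($dtd);` would then print the report.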


get_elements_of_attr

@elements = $dtd->get_elements_of_attr($attr_name)

Retrieve all elements that have an attribute $attr_name defined in the DTD.


get_exc_children

@list = $dtd->get_exc_children($element, $andcon)

get_exc_children returns an array of the elements in the exclusion model group of $element. $andcon is a flag that controls whether the connector characters are included in the returned array: 0 => no connectors, 1 (non-zero) => connectors.

Example:

    <!ELEMENT foo (x | y | z) +(a | b) -(m | n)>

The call

    $dtd->get_exc_children('foo')

will return

    ('m', 'n')

get_gen_ents

@entity_names = $dtd->get_gen_ents($nosort)

get_gen_ents returns an array of general entities. An optional flag argument can be passed to control whether the entities returned are sorted: 0 => sorted, 1 => not sorted.


get_gen_data_ents

@entity_names = $dtd->get_gen_data_ents()

get_gen_data_ents returns an array of general data entities defined in the DTD. Data entities cover the following:


get_inc_children

@list = $dtd->get_inc_children($element, $andcon)

get_inc_children returns an array of the elements in the inclusion model group of $element. $andcon is a flag that controls whether the connector characters are included in the returned array: 0 => no connectors, 1 (non-zero) => connectors.

Example:

    <!ELEMENT foo (x | y | z) +(a | b) -(m | n)>

The call

    $dtd->get_inc_children('foo')

will return

    ('a', 'b')

get_parents

$dtd->get_parents($element)

Get all elements that may be a parent of $element.


get_top_elements

@elements = $dtd->get_top_elements()

Get the top-most elements defined in the DTD. Top-most elements are those elements that cannot be contained within another element or can only be contained within themselves.


is_child

$dtd->is_child($element, $child)

is_child returns 1 if $child can be a legal child of $element. Otherwise, 0 is returned.


is_element

$dtd->is_element($element)

is_element returns 1 if $element is defined in the DTD. Otherwise, 0 is returned.


print_tree

$dtd->print_tree($element, $depth, \*FILEHANDLE)

print_tree outputs an ASCII tree structure of $element's content hierarchy to a depth of $depth to FILEHANDLE. See Element Trees for information on output created by print_tree.


reset

$dtd->reset()

Clear object data structures. Use this method if you want to use the same object to parse another DTD.


Element Trees

Once a DTD is parsed, the print_tree method can be used to output ASCII formatted trees of content hierarchies of elements. The print_tree method is invoked as follows:

$dtd->print_tree($element, $depth, \*FILEHANDLE)

$element is the element to print the tree for. $depth specifies the maximum depth of the tree. The root of the tree has a depth of 1. FILEHANDLE specifies where the output goes.

The tree shows the overall content hierarchy for an element. Content hierarchies of descendants will also be shown. Elements that already exist at a higher (or equal) level, or that occur where the maximum depth has been reached, are pruned. The string "..." is appended to an element if it has been pruned due to pre-existence at a higher (or equal) level. The content of the pruned element can be determined by searching for the complete tree of the element (i.e. the entry without "..."). Elements pruned because the maximum depth has been reached will not have "..." appended.

Example:

     |__section+)
         |_(effect?, ...
         |__title, ...
         |__toc?, ...
         |__epc-fig*,
         |   |_(effect?, ...
         |   |__figure,
         |   |   |_(effect?, ...
         |   |   |__title, ...
         |   |   |__graphic+, ...
         |   |   |__assoc-text?)

Note

Pruning must be done to avoid a combinatorial explosion. It is common for DTDs to define content hierarchies of infinite depth. Even with a predefined maximum depth, the generated tree can become very large.
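
The pruning rule can be illustrated with a simplified sketch. It is not the module's implementation and does not reproduce the print_tree output format: the content model is a hand-written hash, occurrence indicators and connectors are ignored, and all names are hypothetical. An element is expanded only the first time it is met at its shallowest level. Every other occurrence gets "...", while elements cut off by the depth limit get no marker.

```perl
use strict;
use warnings;

# Hypothetical content model: element => list of child elements.
my %children = (
    doc     => ['title', 'section'],
    section => ['title', 'para', 'section'],
    title   => [],
    para    => ['emph'],
    emph    => [],
);

# Breadth-first pass: record the shallowest level at which each element occurs.
sub min_levels {
    my ($root) = @_;
    my %level = ($root => 1);
    my @queue = ($root);
    while (defined(my $elem = shift @queue)) {
        for my $child (@{ $children{$elem} || [] }) {
            next if exists $level{$child};
            $level{$child} = $level{$elem} + 1;
            push @queue, $child;
        }
    }
    return \%level;
}

# Depth-first walk that returns the indented tree as a list of lines.
sub tree_lines {
    my ($root, $max_depth) = @_;
    my $level = min_levels($root);
    my (%expanded, @lines);
    my $walk;
    $walk = sub {
        my ($elem, $depth) = @_;
        my $indent = '  ' x ($depth - 1);
        # Already shown at a higher or equal level: prune and mark.
        if ($depth > $level->{$elem} || $expanded{$elem}++) {
            push @lines, "$indent$elem ...";
            return;
        }
        push @lines, "$indent$elem";
        return if $depth >= $max_depth;   # depth limit: no "..." marker
        $walk->($_, $depth + 1) for @{ $children{$elem} || [] };
    };
    $walk->($root, 1);
    return @lines;
}

print "$_\n" for tree_lines('doc', 3);
```

With this sample model, section is expanded once at level 2 and marked "..." where it recurs inside itself, which is what keeps a recursive content model from expanding forever.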

Since the tree output is static, the inclusion and exclusion sets of elements are treated specially. Inclusion and exclusion elements inherited from ancestors are not propagated down to determine what elements are printed, but special markup is presented at a given element if there exist inclusion or exclusion elements from ancestors. Inclusions and exclusions are not propagated down because of the pruning done. Since an element may occur in multiple contexts -- and have different ancestral inclusions and exclusions in effect -- an element without "..." may be the only place of reference to see the content hierarchy of the element.

Example:

    D1
     |  {+} idx needbegin needend newline
     | 
     |_(head,
     |   | {A+} idx needbegin needend newline
     |   |  {-} needbegin needend
     |   | 
     |   |_(((#PCDATA |
     |   |____((acro |
     |   |       | {A+} idx needbegin needend newline
     |   |       | {A-} needbegin needend
     |   |       | 
     |   |       |_(((#PCDATA |
     |   |       |____((super | ...
     |   |       |______sub)))*)) ...

Ignoring the lines starting with {}'s, one gets the content hierarchy of an element as defined by the DTD without concern for where it may occur in the overall structure. The {} lines give additional information regarding the element with respect to its existence within a specific context. For example, when an ACRO element occurs within D1,HEAD -- along with its normal content -- it can contain IDX and NEWLINE elements due to inclusions from ancestors. However, it cannot contain NEEDBEGIN and NEEDEND regardless of its defined content, since an ancestor excludes them.

Note
Exclusions override inclusions. If an element occurs in an inclusion set and an exclusion set, the exclusion takes precedence. Therefore, in the above example, NEEDBEGIN and NEEDEND are excluded from ACRO.
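
The precedence rule amounts to simple set arithmetic, sketched below. The helper effective_children is hypothetical and not part of the module. The sample lists are read off the example tree above: the base content of ACRO plus its {A+} and {A-} lines.

```perl
use strict;
use warnings;

# Return the sorted names allowed in a context: base content plus
# inclusions, minus exclusions. Exclusions are applied last, so they win.
sub effective_children {
    my ($base, $inclusions, $exclusions) = @_;
    my %allowed = map { $_ => 1 } @$base, @$inclusions;
    delete @allowed{@$exclusions};
    return sort keys %allowed;
}

my @acro = effective_children(
    ['#PCDATA', 'super', 'sub'],                    # base content of ACRO
    ['idx', 'needbegin', 'needend', 'newline'],     # {A+} line
    ['needbegin', 'needend'],                       # {A-} line
);
print "@acro\n";
```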

Explanation of {}'s keys:

{+}
The list of inclusion elements defined by the current element. Since this is part of the content model of the element, the inclusion subelements are printed as part of the content hierarchy of the current element after the base content model. Subelements that are inclusions will have {+} appended to the subelement entry.
{A+}
The list of inclusion elements due to ancestors. This is listed as reference to determine the content of an element within a given context. None of the ancestral inclusion elements are printed as part of the content hierarchy of the element.
{-}
The list of exclusion elements defined by the current element. Since this is part of the content model of the element, any subelement in the content model that would be excluded will have {-} appended to the subelement listing.
{A-}
The list of exclusion elements due to ancestors. This is listed as reference to determine the content of an element within a given context. None of the ancestral exclusion elements have any effect on the printing of the content hierarchy of the current element.

See Also

SGML::EntMan

perl(1)


Availability

This software is part of the perlSGML package; see (http://www.oac.uci.edu/indiv/ehood/perlSGML.html)


Author

Earl Hood
ehood@medusa.acs.uci.edu
Copyright © 1997

97/09/18 14:32:42