The following are whitespace: #x09 (tab), #x0C (form feed), #x20 (space) and #x200B (zero-width space)
The following are linebreaks: #x0A (LF), #x0D (CR) and the CR-LF pair
Whitespace or a line feed immediately after the opening tag, or immediately before the closing tag, is ignored (for all elements)
Line breaks are converted to whitespace and runs of whitespace are collapsed, except in the SCRIPT, STYLE and PRE elements
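
The whitespace rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the name collapse_whitespace is hypothetical, a collapsed run becomes a single #x20 space, and the exceptions for SCRIPT, STYLE and PRE are not handled.

```python
import re

# One character class covering the whitespace characters (tab, form feed,
# space, zero-width space) and the linebreak characters (CR, LF).
_WS_RUN = re.compile("[\t\x0c \u200b\r\n]+")

def collapse_whitespace(text: str) -> str:
    """Treat linebreaks as whitespace and collapse each run to one space.

    Assumption: the caller has already excluded SCRIPT, STYLE and PRE
    content, where this normalization must not be applied.
    """
    return _WS_RUN.sub(" ", text)
```

Because a CR-LF pair is just two characters inside one run, it needs no special case here.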
The closing ; (semicolon) of an entity reference may be omitted at the end of a line or immediately before a tag
Character entity references are case sensitive
CDATA is a sequence of characters in which entities are replaced, line feeds are ignored, each tab and CR is replaced with a space, and leading and trailing spaces are ignored.
The entities &nbsp; (non-breaking space, #xA0) and &shy; (soft hyphen, #xAD) are printed on output
ID and NAME tokens begin with [a-zA-Z] (any letter), followed by any number of [a-zA-Z0-9_:.-] (letters, digits, hyphen, underscore, colon, period)
IDREF is a single token, IDREFS is a space-separated list of tokens
NUMBER tokens must contain at least one digit [0-9]
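
As a rough sketch, the token rules above translate into regular expressions like the following. The function names are hypothetical, and two points are assumptions: a NUMBER is read as one or more digits and nothing else, and IDREFS is split on any run of whitespace.

```python
import re

# ID / NAME: a letter, then any number of letters, digits, "_", ":", "." or "-".
# The hyphen is placed last in the class so it is not read as a range.
ID_OR_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_:.-]*")

# NUMBER: assumed here to mean one or more digits and nothing else.
NUMBER = re.compile(r"[0-9]+")

def is_name_token(token: str) -> bool:
    """True if the whole string is a valid ID or NAME token."""
    return ID_OR_NAME.fullmatch(token) is not None

def is_number_token(token: str) -> bool:
    """True if the whole string is a valid NUMBER token."""
    return NUMBER.fullmatch(token) is not None

def split_idrefs(value: str) -> list[str]:
    """Split an IDREFS value into its individual IDREF tokens."""
    return value.split()
```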
Tag names are converted to upper case and attribute names are converted to lower case
Unquoted attribute values consist of [a-zA-Z0-9.-] (letters, digits, hyphen, period)
Attribute values may contain whitespace if enclosed in single or double quotes
Single quotes can be used inside a double-quoted attribute value and vice versa
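
The attribute rules above might be exercised with a small parser like this one. It is a sketch, not a complete tokenizer: parse_attributes and ATTR are hypothetical names, the input is assumed to be the attribute portion of a single opening tag, and attributes without a value are not handled.

```python
import re

# name = "double quoted" | 'single quoted' | unquoted
# A quoted value may contain anything except its own quote character;
# an unquoted value is limited to letters, digits, period and hyphen.
ATTR = re.compile(
    r"""([A-Za-z][A-Za-z0-9_:.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([A-Za-z0-9.-]+))"""
)

def parse_attributes(text: str) -> dict[str, str]:
    """Return a dict of attributes with names folded to lower case."""
    attrs = {}
    for match in ATTR.finditer(text):
        name = match.group(1).lower()
        # Exactly one of the three value groups takes part in each match.
        value = next(g for g in match.groups()[1:] if g is not None)
        attrs[name] = value
    return attrs
```

For instance, parse_attributes('CLASS="a b" width=100') yields a dict with the keys "class" and "width".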
All ID attribute values in a document must be unique
Some elements have an optional end tag; for these, a repetition of the element's opening tag or the closing tag of the enclosing element implies the end tag
The body is implemented by either the BODY or FRAMESET element
The closing tag for the HTML, HEAD and BODY elements is optional, as is the starting tag
An HTML document is identified as such by the "<!DOCTYPE HTML" at its beginning
In comments, whitespace is permitted between the -- (double hyphen) and the closing > (greater-than)
In SCRIPT and STYLE, all markup and entities are treated as raw text, and the contents terminate at the first </ (less-than, slash)
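
To illustrate the SCRIPT and STYLE rule, the sketch below finds where raw text ends. The name raw_text_end is hypothetical, and running to the end of the input when no </ is found is an assumption, since the notes do not say what happens in that case.

```python
def raw_text_end(source: str, start: int) -> int:
    """Return the index of the first "</" at or after start.

    Assumption: if no "</" is found, the raw text runs to the end of
    the input.
    """
    end = source.find("</", start)
    return len(source) if end == -1 else end

# Usage: extract the raw contents of a SCRIPT element.
src = '<SCRIPT>if (a < b) document.write("<b>x");</SCRIPT>'
start = len("<SCRIPT>")
content = src[start:raw_text_end(src, start)]
```

Note that a </ inside a string literal, as in document.write("</b>"), would also end the contents early, because nothing inside the element is parsed.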