ATagParser's Unicode Support

ATagParser can read Delphi WideStrings as well as UTF-8 and UTF-16LE files and streams.

Unicode is important to authors in many parts of the world. However, a parser only disassembles the content of a document and knows nothing about rendering it, so there is no large-scale need to embed Unicode encodings in the parser itself.

The parser never renders tags; rendering is a job for a higher-level descendant component or rendering engine.

ATagParser automatically determines the type of stream or file about to be parsed and prepares it for the parsing process accordingly. The following shows what ATagParser returns for ANSI, UTF-8 and UTF-16 input.

·If ContentType = ctANSI, then ANSI is returned  
·If ContentType = ctUTF8, then UTF-8 is returned  
·If ContentType = ctUTF16LE or ctUTF16LE_NO_BOM, then UTF-8 is returned  
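
The detection step described above can be illustrated by checking for a byte order mark (BOM) at the start of the data. The following is only a minimal sketch of that general technique, not ATagParser's actual implementation: the type TSketchContentType and the function DetectByBOM are hypothetical names made up for this illustration. The sketch also cannot recognise UTF-16LE data without a BOM (the ctUTF16LE_NO_BOM case), and it treats UTF-8 data without a BOM as ANSI.

```pascal
program DetectByBomSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  { Hypothetical stand-in for the component's content types }
  TSketchContentType = (sctANSI, sctUTF8, sctUTF16LE);

{ Classifies a buffer by its leading byte order mark (BOM), if any }
function DetectByBOM(const Buf: array of Byte): TSketchContentType;
begin
  Result := sctANSI; { no recognised BOM: treat as ANSI }
  if (Length(Buf) >= 3) and (Buf[0] = $EF) and (Buf[1] = $BB) and (Buf[2] = $BF) then
    Result := sctUTF8 { EF BB BF is the UTF-8 BOM }
  else if (Length(Buf) >= 2) and (Buf[0] = $FF) and (Buf[1] = $FE) then
    Result := sctUTF16LE; { FF FE is the UTF-16LE BOM }
end;

begin
  if DetectByBOM([$EF, $BB, $BF, $3C]) = sctUTF8 then
    WriteLn('UTF-8 BOM found');
  if DetectByBOM([$FF, $FE, $3C, $00]) = sctUTF16LE then
    WriteLn('UTF-16LE BOM found');
  if DetectByBOM([$3C, $68, $74]) = sctANSI then
    WriteLn('no BOM, treated as ANSI');
end.
```

A real detector would need an extra heuristic for BOM-less UTF-16LE, for example looking for zero bytes in every second position of markup characters. How ATagParser handles that case is not described here.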

Many parsers will promote ANSI text to Unicode text before parsing. Oddly enough, the vast majority of web pages are written in ANSI and UTF-8 -- not in the different variations of UTF-16.

Important!
If you're parsing one of the 16-bit Unicode types and rebuilding the file from the UTF-8 output, you'll need to change the charset in the "Content-Type" HTTP-EQUIV META tag, or the browser may not render the page correctly.

<meta http-equiv="Content-Type" content="text/html;charset=unicode"> 

..becomes..

<meta http-equiv="Content-Type" content="text/html;charset=utf-8">

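The charset change shown above can be sketched as a small string helper. This is an illustration only: SetCharsetToUTF8 is a hypothetical function and not part of ATagParser. It assumes that charset is the last parameter in the content value, so anything following "charset=" is replaced.

```pascal
program CharsetRewriteSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

{ Hypothetical helper: returns the META content value with its charset
  set to utf-8. Assumes charset is the last parameter in the value. }
function SetCharsetToUTF8(const Content: string): string;
var
  P: Integer;
begin
  { case-insensitive search; LowerCase keeps character positions intact }
  P := Pos('charset=', LowerCase(Content));
  if P = 0 then
    Result := Content { no charset parameter: leave the value unchanged }
  else
    { keep everything before "charset=" and substitute the new value }
    Result := Copy(Content, 1, P - 1) + 'charset=utf-8';
end;

begin
  WriteLn(SetCharsetToUTF8('text/html;charset=unicode'));
  { prints: text/html;charset=utf-8 }
end.
```

The result would then be written back as the content attribute of the rebuilt META tag.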
Example:

procedure TForm1.ATagParserTag(Sender: TObject; Tag: TTagElement; var Abort: Boolean);
begin
  If Tag.ElementType = etComplexTag Then
    If FindTagID(Tag.Hash) = TID_META Then
    Begin
      { only the META tag whose http-equiv attribute is "Content-Type" is of interest; }
      { SameText (SysUtils) compares without regard to case }
      If SameText(Tag.Attributes.Values['http-equiv'], 'Content-Type') Then
        If Tag.Attributes.IndexOfName('content') > -1 Then
        Begin
          { output the new META tag with the updated content type }
        End;
    End;
end;


Information on Unicode and web pages
http://www.unicode.org/faq/unicode_web.html

http://www.utf-8.com/


UTF-16LE stands for 16-bit Unicode, Little Endian
BOM - Byte Order Mark




Copyright 2000 - 2006 John E McTaggart - All rights reserved worldwide