ATagParser's Unicode Support
ATagParser can read Delphi WideStrings as well as UTF-8 and UTF-16LE files and streams.
While Unicode is important to authors working in some regions of the world, a parser only disassembles the content of a document and knows nothing about rendering it. That removes any large-scale need to embed Unicode encodings in the parser itself.
Because the parser never renders tags, rendering is a job for a higher-level descendant component or rendering engine.
ATagParser automatically determines the type of stream or file about to be parsed and prepares it accordingly for the parsing process. ATagParser handles the ANSI, UTF-8 and UTF-16 variants as follows:
· If ContentType = ctANSI, then ANSI is returned
· If ContentType = ctUTF8, then UTF-8 is returned
· If ContentType = ctUTF16LE or ctUTF16LE_NO_BOM, then UTF-8 is returned
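The help text does not say how ATagParser tells these types apart. As a rough illustration only, the sketch below guesses a content type from the first bytes of a buffer, using the standard UTF-8 and UTF-16LE byte order marks. The type name TContentTypeSketch, the function GuessContentType and the zero-byte heuristic for BOM-less UTF-16LE are all assumptions made for this example; they are not ATagParser's actual code.

```pascal
program DetectEncodingSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

type
  { Hypothetical type name; only the four value names come from the text above. }
  TContentTypeSketch = (ctANSI, ctUTF8, ctUTF16LE, ctUTF16LE_NO_BOM);

const
  ContentTypeNames: array[TContentTypeSketch] of string =
    ('ctANSI', 'ctUTF8', 'ctUTF16LE', 'ctUTF16LE_NO_BOM');

{ Guesses the content type from the leading bytes of a buffer. }
function GuessContentType(const Buf: array of Byte): TContentTypeSketch;
begin
  { Fallback when no other rule matches. }
  Result := ctANSI;
  if (Length(Buf) >= 3) and (Buf[0] = $EF) and (Buf[1] = $BB) and (Buf[2] = $BF) then
    { UTF-8 byte order mark: EF BB BF }
    Result := ctUTF8
  else if (Length(Buf) >= 2) and (Buf[0] = $FF) and (Buf[1] = $FE) then
    { UTF-16LE byte order mark: FF FE }
    Result := ctUTF16LE
  else if (Length(Buf) >= 4) and (Buf[0] <> 0) and (Buf[1] = 0) and
          (Buf[2] <> 0) and (Buf[3] = 0) then
    { Assumed heuristic: ASCII markup stored as UTF-16LE has a zero
      high byte after every character, for example '<' 00 'h' 00. }
    Result := ctUTF16LE_NO_BOM;
end;

begin
  WriteLn(ContentTypeNames[GuessContentType([$EF, $BB, $BF, $3C])]); { ctUTF8 }
  WriteLn(ContentTypeNames[GuessContentType([$FF, $FE, $3C, $00])]); { ctUTF16LE }
  WriteLn(ContentTypeNames[GuessContentType([$3C, $00, $68, $00])]); { ctUTF16LE_NO_BOM }
  WriteLn(ContentTypeNames[GuessContentType([$3C, $68, $74, $6D])]); { ctANSI }
end.
```

A heuristic like the last branch can misfire on binary data or on text that does not start with ASCII characters, so treat it as a starting point rather than a reliable detector.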
Many parsers promote ANSI text to Unicode text before parsing. Oddly enough, the vast majority of web pages are written in ANSI or UTF-8 -- not in one of the UTF-16 variations.
Important!
If you're parsing one of the 16-bit Unicode types and rebuilding the file from the UTF-8 output, you'll need to change the character set declared in the HTTP-EQUIV "Content-Type" META tag, or the browser may not render the page correctly.
procedure TForm1.ATagParserTag(Sender: TObject; Tag: TTagElement; var Abort: Boolean);
begin
  if Tag.ElementType = etComplexTag then
    if FindTagID(Tag.Hash) = TID_META then
    begin
      { Look for <META HTTP-EQUIV="Content-Type" CONTENT="..."> }
      if Tag.Attributes.IndexOfName('http-equiv') > -1 then
        { SameText (SysUtils) compares the attribute value without regard to case }
        if SameText(Tag.Attributes.Values['http-equiv'], 'content-type') then
        begin
          { output the new META tag with the updated content type }
        end;
    end;
end;
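The handler above leaves the actual output step as a comment. One possible way to build the updated CONTENT value is sketched below. It assumes the value has the usual form "text/html; charset=...". WithCharset is a hypothetical helper written for this example and is not part of ATagParser.

```pascal
program CharsetRewriteSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

{ Hypothetical helper: returns Content with its charset parameter set to
  NewCharset. If no charset parameter is present, one is appended. }
function WithCharset(const Content, NewCharset: string): string;
const
  Key = 'charset=';
var
  StartPos, EndPos: Integer;
begin
  { Case-insensitive search; LowerCase keeps the string length unchanged. }
  StartPos := Pos(Key, LowerCase(Content));
  if StartPos = 0 then
  begin
    Result := Content + '; ' + Key + NewCharset;
    Exit;
  end;
  { EndPos starts at the first character of the old charset value... }
  EndPos := StartPos + Length(Key);
  { ...and moves to the next ';' or just past the end of the string. }
  while (EndPos <= Length(Content)) and (Content[EndPos] <> ';') do
    Inc(EndPos);
  { Keep everything up to and including 'charset=', insert the new value,
    then keep whatever followed the old value. }
  Result := Copy(Content, 1, StartPos + Length(Key) - 1) + NewCharset +
            Copy(Content, EndPos, MaxInt);
end;

begin
  WriteLn(WithCharset('text/html; charset=utf-16', 'utf-8'));
  { prints: text/html; charset=utf-8 }
  WriteLn(WithCharset('text/html', 'utf-8'));
  { prints: text/html; charset=utf-8 }
end.
```

Inside the event handler, the result of a call such as WithCharset(Tag.Attributes.Values['content'], 'utf-8') could then be written out as the CONTENT attribute of the rebuilt META tag.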