
15.2. Effects of Character Semantics

The upshot of all this is that a typical built-in operator will operate on characters unless it is in the scope of a use bytes pragma. However, even outside the scope of use bytes, if all of the operands of the operator are stored as 8-bit characters (that is, none of the operands are stored in utf8), then character semantics are indistinguishable from byte semantics, and the result of the operator will be stored in 8-bit form internally. This preserves backward compatibility as long as you don't feed your program any characters wider than Latin-1.
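The difference between the two semantics can be sketched with length, which counts characters by default and bytes inside the scope of use bytes. This is a minimal illustration, assuming a reasonably recent Perl; the variable names ($wide, $narrow, and so on) are made up for the example.

```perl
use strict;
use warnings;

# One character above Latin-1 (U+263A) forces utf8 storage internally.
my $wide   = "caf\x{e9} \x{263A}";   # c, a, f, e-acute, space, smiley
# Every character here fits in 8 bits, so the string stays in 8-bit form.
my $narrow = "caf\x{e9}";

# Character semantics (the default outside use bytes).
my $wide_chars   = length($wide);
my $narrow_chars = length($narrow);

my ($wide_bytes, $narrow_bytes);
{
    use bytes;                       # byte semantics in this block only
    $wide_bytes   = length($wide);   # bytes of the internal utf8 form
    $narrow_bytes = length($narrow); # same as the character count
}

printf "wide:   %d characters, %d bytes\n", $wide_chars,   $wide_bytes;
printf "narrow: %d characters, %d bytes\n", $narrow_chars, $narrow_bytes;
```

For the narrow string the two counts agree, which is the backward-compatibility case described above; they diverge only once a character wider than Latin-1 is involved.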

The utf8 pragma is primarily a compatibility device that enables recognition of UTF-8 in literals and identifiers encountered by the parser. It may also be used for enabling some of the more experimental Unicode support features. Our long-term goal is to turn the utf8 pragma into a no-op.
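As a rough sketch of what "recognition of UTF-8 in literals" means, the same string literal can be measured with and without the pragma in effect. This assumes the source file itself is saved as UTF-8 and that the literal is a single precomposed character; behavior on Perl 5.6 itself may differ in detail, and the variable names are illustrative only.

```perl
use strict;
use warnings;

my ($with_pragma, $without_pragma);

{
    use utf8;   # the parser decodes the literal as UTF-8 text
    $with_pragma = length("é");
}
{
    no utf8;    # the parser sees the literal as raw bytes
    $without_pragma = length("é");
}

print "with use utf8:    $with_pragma\n";
print "without use utf8: $without_pragma\n";
```

Under those assumptions the first length counts characters in the literal and the second counts the bytes that encode it in the source file.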

The use bytes pragma will never turn into a no-op. Not only is it necessary for byte-oriented code, but it also has the side effect of defining byte-oriented wrappers around certain functions for use outside the scope of use bytes. As of this writing, the only defined wrapper is for length, but there are likely to be more as time goes by. To use such a wrapper, say:

use bytes ();   # Load wrappers without importing byte semantics.
...
$charlen =        length("\x{ffff_ffff}");   # Returns 1.
$bytelen = bytes::length("\x{ffff_ffff}");   # Returns 7.

Outside the scope of a use bytes declaration, Perl version 5.6 works (or at least, is intended to work) like this:

If you look in directory PATH_TO_PERLLIB/unicode, you'll find a number of files that have to do with defining the semantics above. The Unicode properties database from the Unicode Consortium is in a file called Unicode.300 (for Unicode 3.0). This file has already been processed by mktables.PL into lots of little .pl files in the same directory (and in subdirectories Is/, In/, and To/), some of which are automatically slurped in by Perl to implement things like \p (see the Is/ and In/ directories) and uc (see the To/ directory). Other files are slurped in by modules like the use charnames pragma (see Name.pl). But as of this writing, there are still a number of files that are just sitting there waiting for you to write an access module for them:

ArabLink.pl
ArabLnkGrp.pl
Bidirectional.pl
Block.pl
Category.pl
CombiningClass.pl
Decomposition.pl
JamoShort.pl
Number.pl
To/Digit.pl

A much more readable summary of Unicode, with many hyperlinks, is in PATH_TO_PERLLIB/unicode/Unicode3.html.
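The features that those table files back (\p properties, case mapping with uc, and named characters via use charnames) can be exercised from ordinary code without touching the files directly. The following is only a sketch, assuming a Perl recent enough to provide charnames::viacode; the variable names are invented for the example.

```perl
use strict;
use warnings;
use charnames ':full';    # enables \N{...} and loads charnames::viacode

my $alpha = "\N{GREEK SMALL LETTER ALPHA}";    # U+03B1

# Property lookups of the kind the Is/ tables support.
my $is_alpha = $alpha =~ /\p{IsAlpha}/ ? 1 : 0;
my $is_lower = $alpha =~ /\p{IsLower}/ ? 1 : 0;

# Case mapping of the kind the To/ tables support.
my $upper      = uc $alpha;
my $upper_code = ord $upper;

# Reverse lookup from code point to name.
my $upper_name = charnames::viacode($upper_code);

printf "alpha: IsAlpha=%d IsLower=%d\n", $is_alpha, $is_lower;
printf "uc gives U+%04X (%s)\n", $upper_code, $upper_name;
```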

Note that when the Unicode Consortium comes out with a new version, some of these filenames are likely to change, so you'll have to poke around. You can find PATH_TO_PERLLIB with the following incantation:

% perl -MConfig -le 'print $Config{privlib}'

To find out just about everything there is to find out about Unicode, you should check out The Unicode Standard, Version 3.0 (ISBN 0-201-61633-5).
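The privlib lookup shown earlier can also be done from inside a script, which makes it easier to check where the Unicode tables live on a given installation. This is a sketch only: unicode is the directory name given in the text for Perl 5.6, and checking unicore as well is an assumption about later releases, which may use a different name.

```perl
use strict;
use warnings;
use Config;
use File::Spec;

# Same value the one-liner prints.
my $privlib = $Config{privlib};

# Candidate directory names: 'unicode' comes from the text; 'unicore' is an
# assumption about newer Perl releases.
my @candidates = map { File::Spec->catdir($privlib, $_) } qw(unicode unicore);
my @found      = grep { -d $_ } @candidates;

print "privlib: $privlib\n";
if (@found) {
    print "Unicode tables found in: $_\n" for @found;
}
else {
    print "No Unicode table directory found under privlib; poke around.\n";
}
```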




Copyright © 2001 O'Reilly & Associates. All rights reserved.