1. Strings

Contents:
Introduction
Accessing Substrings
Establishing a Default Value
Exchanging Values Without Using Temporary Variables
Converting Between ASCII Characters and Values
Processing a String One Character at a Time
Reversing a String by Word or Character
Expanding and Compressing Tabs
Expanding Variables in User Input
Controlling Case
Interpolating Functions and Expressions Within Strings
Indenting Here Documents
Reformatting Paragraphs
Escaping Characters
Trimming Blanks from the Ends of a String
Parsing Comma-Separated Data
Soundex Matching
Program: fixstyle
Program: psgrep

He multiplieth words without knowledge.

- Job 35:16

1.0. Introduction

Many programming languages force you to work at an uncomfortably low level. You think in lines, but your language wants you to deal with pointers. You think in strings, but it wants you to deal with bytes. Such a language can drive you to distraction. Don't despair, though - Perl isn't a low-level language; lines and strings are easy to handle.

Perl was designed for text manipulation. In fact, Perl can manipulate text in so many ways that they can't all be described in one chapter. Check out other chapters for recipes on text processing. In particular, see Chapter 6, Pattern Matching, and Chapter 8, File Contents, which discuss interesting techniques not covered here.

Perl's fundamental unit for working with data is the scalar, that is, single values stored in single (scalar) variables. Scalar variables hold strings, numbers, and references. Array and hash variables hold lists or associations of scalars, respectively. References are used for referring to other values indirectly, not unlike pointers in low-level languages. Numbers are usually stored in your machine's double-precision floating-point notation. Strings in Perl may be of any length (within the limits of your machine's virtual memory) and contain any data you care to put there - even binary data containing null bytes.
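As a minimal sketch of the data types just described (the variable names are ours, chosen only for illustration), here are scalars holding a string, a number, and a reference, next to an array and a hash of scalars:

```perl
use strict;
use warnings;

my $name  = "Larry";            # a scalar holding a string
my $count = 42;                 # a scalar holding a number
my $ref   = \$name;             # a scalar holding a reference to $name

my @list   = ($name, $count);   # an array: an ordered list of scalars
my %age_of = (Larry => 42);     # a hash: an association of scalars

my $binary = "ab\0cd";          # strings may contain null bytes

print "via reference: $$ref\n";                   # via reference: Larry
print "second element: $list[1]\n";               # second element: 42
print "hash lookup: $age_of{Larry}\n";            # hash lookup: 42
print "binary length: ", length($binary), "\n";   # binary length: 5
```

The null byte counts as one ordinary character, so the length is 5 rather than 2.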

A string is not an array of bytes: You cannot use array subscripting on a string to address one of its characters; use substr for that. Like all data types in Perl, strings grow and shrink on demand. They get reclaimed by Perl's garbage collection system when they're no longer used, typically when the variables holding them go out of scope or when the expression they were used in has been evaluated. In other words, memory management is already taken care of for you, so you don't have to worry about it.
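A short illustration of these points, using made-up variable names: substr picks out individual characters, and appending to a string needs no allocation on your part.

```perl
use strict;
use warnings;

my $string = "Perl";

# There is no $string[0]; use substr to address characters.
my $first = substr($string, 0, 1);    # first character
my $last  = substr($string, -1);      # last character (negative offset)

# Strings grow on demand; no buffer sizing is needed.
$string .= " Cookbook";

print "$first $last\n";               # P l
print length($string), "\n";          # 13
```

Recipe 1.1 treats substr in much more detail.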

A scalar value is either defined or undefined. If defined, it may hold a string, number, or reference. The only undefined value is undef. All other values are defined, even 0 and the empty string. Definedness is not the same as Boolean truth, though; to check whether a value is defined, use the defined function. Boolean truth has a specialized meaning, tested with operators like && and || or in an if or while block's test condition.

Two defined strings are false: the empty string ("") and a string of length one containing the digit zero ("0"). This second one may surprise you, but Perl does this because of its on-demand conversion between strings and numbers. The numbers 0., 0.00, and 0.0000000 are all false when unquoted but are not false in strings (the string "0.00" is true, not false). All other defined values (e.g., "false", 15, and \$x ) are true.
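The difference between definedness and truth is easy to check for yourself. The sketch below uses a small helper, truth, which is our own invention and not a Perl built-in:

```perl
use strict;
use warnings;

# Hypothetical helper: report definedness and Boolean truth of one value.
sub truth {
    my ($value) = @_;
    return (defined $value ? "defined" : "undefined")
         . ", "
         . ($value ? "true" : "false");
}

my $x = 0;

print truth(""),      "\n";   # defined, false
print truth("0"),     "\n";   # defined, false
print truth(0.00),    "\n";   # defined, false  (the number zero)
print truth("0.00"),  "\n";   # defined, true   (a non-empty string other than "0")
print truth("false"), "\n";   # defined, true
print truth(15),      "\n";   # defined, true
print truth(\$x),     "\n";   # defined, true   (references are always true)
print truth(undef),   "\n";   # undefined, false
```

Note that testing an undefined value for truth, as the helper does, produces no warning.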

The undef value behaves like the empty string ("") when used as a string, 0 when used as a number, and the null reference when used as a reference. But in all these cases, it's false. Using an undefined value where Perl expects a defined value will trigger a run-time warning message on STDERR if you've used the -w flag. Merely asking whether something is true or false does not demand a particular value, so this is exempt from a warning. Some operations do not trigger warnings when used on variables holding undefined values. These include the autoincrement and autodecrement operators, ++ and --, and the addition and concatenation assignment operators, += and .=.
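To see which operations warn, you can count the warnings as they arrive. This sketch assumes a Perl recent enough to have the warnings pragma, which we use here in place of the -w flag; the handler and variable names are ours.

```perl
use strict;
use warnings;    # assumption: stands in for the -w flag in this file

# Collect warnings in an array instead of letting them reach STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my ($count, $text, $total, $unset);   # all start out undefined

$count++;             # undef acts as 0; no warning
$text  .= "abc";      # undef acts as ""; no warning
$total += 5;          # undef acts as 0; no warning
my $n_quiet = @warnings;

my $sum = $unset + 1; # undef acts as 0, but this one warns
my $n_after = @warnings;

if ($unset) { }       # a plain truth test; no warning
my $n_final = @warnings;

print "$n_quiet $n_after $n_final\n";   # 0 1 1
```

The single warning collected is Perl's usual complaint about the use of an uninitialized value in addition.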

Specify strings in your program with single quotes, double quotes, the quote-like operators q// and qq//, or "here documents." Single quotes are the simplest form of quoting - the only special characters are ' to terminate the string, \' to quote a single quote in the string, and \\ to quote a backslash in the string:

$string = '\n';                     # two characters, \ and an n
$string = 'Jon \'Maddog\' Orwant';  # literal single quotes

Double quotes interpolate variables (but not function calls - see Recipe 1.10 to find out how to do this) and expand a lot of backslashed shortcuts: "\n" becomes a newline, "\033" becomes the character with octal value 33, "\cJ" becomes a Ctrl-J, and so on. The full list of these is given in the perlop(1) manpage.

$string = "\n";                     # a "newline" character
$string = "Jon \"Maddog\" Orwant";  # literal double quotes
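A few more double-quoted strings, as a sketch with our own variable names. The character codes shown assume an ASCII-based platform.

```perl
use strict;
use warnings;

my $escape = "\033";     # octal 33, the ESC character
my $ctrl_j = "\cJ";      # Ctrl-J
my $name   = "Maddog";

my $greeting = "Hi, $name";               # the variable is interpolated
my $no_call  = "Length: length($name)";   # the function is NOT called

print ord($escape), "\n";   # 27
print ord($ctrl_j), "\n";   # 10
print "$greeting\n";        # Hi, Maddog
print "$no_call\n";         # Length: length(Maddog)
```

In the last string only $name is expanded; the text length( ... ) is left alone as literal characters.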

The q// and qq// regexp-like quoting operators let you use alternate delimiters for single- and double-quoted strings. For instance, if you want a literal string that contains single quotes, it's easier to write this than to escape the single quotes with backslashes:

$string = q/Jon 'Maddog' Orwant/;   # literal single quotes

You can use the same character as delimiter, as we do with / here, or you can balance the delimiters if you use parentheses or paren-like characters:

$string = q[Jon 'Maddog' Orwant];   # literal single quotes
$string = q{Jon 'Maddog' Orwant};   # literal single quotes
$string = q(Jon 'Maddog' Orwant);   # literal single quotes
$string = q<Jon 'Maddog' Orwant>;   # literal single quotes
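The qq// operator works the same way but interpolates like double quotes. A brief sketch, with a variable name of our own choosing:

```perl
use strict;
use warnings;

my $nick = "Maddog";

my $single = q{Jon '$nick' Orwant};     # q: no interpolation
my $double = qq{Jon "$nick" Orwant};    # qq: $nick is interpolated
my $nested = q{balanced {braces} nest}; # paired delimiters may nest

print "$single\n";   # Jon '$nick' Orwant
print "$double\n";   # Jon "Maddog" Orwant
print "$nested\n";   # balanced {braces} nest
```

With qq{} you can include double quotes in an interpolated string without backslashing them.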

"Here documents" are borrowed from the shell. They are a way to quote a large chunk of text. The text can be interpreted as single-quoted, double-quoted, or even as commands to be executed, depending on how you quote the terminating identifier. Here we double-quote two lines with a here document:

$a = <<"EOF";
This is a multiline here document
terminated by EOF on a line by itself
EOF

Note there's no semicolon after the terminating EOF. Here documents are covered in more detail in Recipe 1.11.
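The quoting of the terminating identifier decides whether variables are expanded. Here is a small sketch contrasting the double-quoted and single-quoted forms; the variable names are ours.

```perl
use strict;
use warnings;

my $user = "world";

# Double-quoted terminator: the text is interpolated.
my $interpolated = <<"EOF";
Hello, $user
EOF

# Single-quoted terminator: the text is taken literally.
my $literal = <<'EOF';
Hello, $user
EOF

print $interpolated;   # Hello, world
print $literal;        # Hello, $user
```

Both strings end with a newline, since the line break before the terminator is part of the text.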

A warning for non-Western programmers: Perl doesn't currently directly support multibyte characters (expect Unicode support in 5.006), so we'll be using the terms byte and character interchangeably.

