I know two ways to do this. The deroff command was designed to strip out troff constructs and punctuation from files. The command deroff -w will give you a list of just the words in a document; pipe to sort -u if you want only one of each.
deroff has one major failing, though. It only considers a word to be a string of characters beginning with a letter of the alphabet. A single character won't do, which leaves out one-letter words like the indefinite article "A."
A substitute is tr, which can perform various kinds of character-by-character conversions. Run as tr -cs A-Za-z '\012', it lists the individual words in its input, one per line.
The -c option "complements" the first string passed to tr; -s squeezes out repeated characters. This has the effect of saying: "Take any non-alphabetic characters you find (one or more) and convert them to newlines (\012)."
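As a minimal sketch of how those two options combine, the helper below wraps the tr invocation and runs it on sample input. The function name list_words and the sample sentences are illustrative assumptions, not part of the original text. Unlike the deroff -w behavior described above, one-letter words such as "A" come through.

```shell
# list_words: hypothetical helper name; reads text on standard input
# and prints one word per line.
list_words() {
    # -c complements the set A-Za-z (i.e., selects every non-letter);
    # -s squeezes each run of resulting newlines (\012) into one.
    tr -cs A-Za-z '\012'
}

# One-letter words are kept; punctuation and spaces become line breaks.
printf 'A cat sat; a dog ran.\n' | list_words

# Pipe to sort -u to keep only one copy of each distinct word.
printf 'the cat and the dog\n' | list_words | sort -u
```

If the input starts with a non-letter, the first line of output will be empty, since that leading run is also converted to a newline.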
(Wouldn't it be nice if tr just recognized standard UNIX regular expression syntax? Then, instead of -c A-Za-z, you'd say '[^A-Za-z]'. It's not any less obscure, but at least it's used by other programs, so there's one less thing to learn.)
The System V version of tr has slightly different syntax. You'd get the same effect with:
tr -cs '[A-Z][a-z]' '[\012*]' < file