

If the -s (strip) option is specified, words that are in the specified hash-file are removed from the word list. This can be useful with personal dictionaries.

The -l option can be used to specify an alternate affix-file for munching dictionaries in languages other than English.

The -c option can be used to convert dictionaries that were built with an older affix file, without risk of accidentally introducing unintended affix combinations into the dictionary.

The -T option allows dictionaries to be converted to a canonical string-character format. The suffix specified is looked up in the affix file (-l switch) to determine the string-character format used for the input file; the output always uses the canonical string-character format. For example, a dictionary collected from TeX source files might be converted to canonical format by specifying -T tex.

The -w option is passed on to ispell.

findaffix

The findaffix shell script is an aid to writers of new language descriptions in choosing affixes. The given dictionary files (standard input if none are given) are examined for possible prefixes (-p switch) or suffixes (-s switch, the default). Each commonly occurring affix is presented along with a count of the number of times it appears and an estimate of the number of bytes that would be saved in a dictionary hash file if it were added to the language table. Only affixes that generate legal roots (found in the original input) are listed.

If the -c option is not given, the output lines are in the following format:


strip/add/count/bytes

where strip is the string that should be stripped from a root word before adding the affix, add is the affix to be added, count is a count of the number of times that this strip/add combination appears, and bytes is an estimate of the number of bytes that might be saved in the raw dictionary file if this combination is added to the affix file. The field separator in the output is the character specified by the -t switch; the default is a slash (/).
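As an illustration of post-processing this format, the following Python sketch splits each line into its four fields and re-sorts the records by count. The names AffixLine and parse_affix_line are hypothetical, the sample lines are made up for the example rather than taken from a real findaffix run, and the sketch assumes the separator never occurs inside a field.

```python
from typing import NamedTuple


class AffixLine(NamedTuple):
    """One record of the default strip/add/count/bytes output."""
    strip: str
    add: str
    count: int
    bytes_saved: int


def parse_affix_line(line: str, sep: str = "/") -> AffixLine:
    # sep must match the separator that was given to findaffix (-t switch).
    strip, add, count, bytes_saved = line.rstrip("\n").split(sep)
    return AffixLine(strip, add, int(count), int(bytes_saved))


# Made-up sample lines, for illustration only.
sample = ["/er/120/240", "e/ing/85/255"]
records = [parse_affix_line(line) for line in sample]

# Re-sort by frequency of appearance, most frequent first.
for rec in sorted(records, key=lambda r: r.count, reverse=True):
    print(rec)
```

An empty strip field, as in the first sample line, means that nothing is removed from the root before the affix is added.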

If the -c (clean output) option is given, the appearance of the output is made visually cleaner (but harder to post-process) by changing it to


-strip+add<tab>count<tab>bytes

where strip, add, count, and bytes are as before, and <tab> represents the ASCII tab character.
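The clean format can still be parsed with a little more work. The Python sketch below is one way to do it; the function name parse_clean_line is hypothetical. It assumes that neither strip nor add contains a plus sign or a tab, and it also accepts a line whose affix field starts directly with +, on the unverified assumption that an empty strip string may be written that way.

```python
import re

# "-strip+add", where the "-strip" part is treated as optional (assumption).
_AFFIX = re.compile(r"^(?:-(?P<strip>[^+]*))?\+(?P<add>.*)$")


def parse_clean_line(line: str) -> tuple[str, str, int, int]:
    """Return (strip, add, count, bytes) from one line of -c output."""
    affix, count, bytes_saved = line.rstrip("\n").split("\t")
    match = _AFFIX.match(affix)
    if match is None:
        raise ValueError(f"unrecognized affix field: {affix!r}")
    strip = match.group("strip") or ""
    return strip, match.group("add"), int(count), int(bytes_saved)


# Made-up sample line, for illustration only.
print(parse_clean_line("-h+her\t12\t36"))
```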

The method used to generate possible affixes will also generate longer affixes which have common headers or trailers. For example, the two words moth and mother will generate not only the obvious substitution +er but also -h+her and -th+ther (and possibly even longer ones, depending on the value of min). To prevent cluttering the output with such affixes, any affix pair that shares a common header (or, for prefixes, trailer) string longer than elim characters (default 1) will be suppressed. You may want to set elim to a value greater than 1 if your language has string characters; usually, the need for this parameter will become obvious when you examine the output of your findaffix run.
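The moth/mother example can be reproduced with a short Python sketch. This is not the method the findaffix script itself uses; it only enumerates the strip/add pairs that turn one word into another, and then applies a literal reading of the elim rule for suffixes (a pair is dropped when strip and add share a header longer than elim characters). The function names and the min_stem parameter are hypothetical, and the exact boundary behavior of the real script should be checked against its output.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common leading string of a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def suffix_candidates(root: str, word: str, min_stem: int = 3):
    """List (strip, add) pairs that rewrite root into word.

    The stem is what remains of root after strip is removed; stems
    shorter than min_stem characters are not considered.
    """
    shared = common_prefix_len(root, word)
    lowest = max(min_stem, 1)
    return [(root[n:], word[n:]) for n in range(shared, lowest - 1, -1)]


def survives_elim(strip: str, add: str, elim: int = 1) -> bool:
    # Literal reading: suppress when the shared header is longer than elim.
    return common_prefix_len(strip, add) <= elim


for strip, add in suffix_candidates("moth", "mother", min_stem=2):
    status = "kept" if survives_elim(strip, add) else "suppressed"
    print(f"-{strip}+{add}: {status}")
```

With min_stem=2 the sketch lists the pairs for +er, -h+her, and -th+ther in that order; lowering min_stem produces the longer pairs mentioned above.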

Normally, the affixes are sorted according to the estimate of bytes saved. The -f switch may be used to cause the affixes to be sorted by frequency of appearance.

To save output file space, affixes which occur fewer than 10 times are eliminated; this limit may be changed with the -l switch. The -M switch specifies a maximum affix length (default 8). Affixes longer than this will not be reported. (This saves on temporary disk space and makes the script run faster.)

Affixes which generate stems shorter than three characters are suppressed. (A stem is the word after the strip string has been removed, and before the add string has been added.) This reduces both the running time and the size of the output file. This limit may be changed with the -m switch. The minimum stem length should only be set to 1 if you have a lot of free time and disk space (in the range of many days and hundreds of megabytes).

The findaffix script requires a nonblank field-separator character for internal use. Normally, this character is a slash (/), but if the slash appears as a character in the input word list, a different character can be specified with the -t switch.

ispell dictionaries should be expanded before being fed to findaffix; in addition, characters that are not in the English alphabet (if any) should be translated to lowercase.
