

The split Command

The split command is probably one of the handiest commands for transporting large files around. One of its most common uses is to split up compressed source files (to upload in pieces or fit on a floppy). The basic syntax is


split [options] filename [output prefix]

where the options and output prefix are optional. If no output prefix is given, split uses the prefix x, and the output files are named xaa, xab, xac, and so on. By default, split puts 1000 lines in each output file (the last file can have fewer than 1000 lines). Because 1000 lines can add up to very different file sizes, the -b or --bytes option is usually used instead to split by size. Its syntax is


-b bytes[bkm]



or


--bytes=bytes[bkm]

where bytes is a number, optionally followed by one of these size multipliers:


b   512 bytes
k   1KB (1024 bytes)
m   1MB (1,048,576 bytes)

Thus,


split -b1000k JDK.tar.gz

will split the file JDK.tar.gz into 1000KB pieces named xaa, xab, and so on. To have the output files named JDK.tar.gz.aa, JDK.tar.gz.ab, and so on instead, give JDK.tar.gz. (including the trailing period) as the output prefix:


split -b1000k JDK.tar.gz JDK.tar.gz.

This would create 1000KB files that could be copied to a floppy or uploaded one at a time over a slow modem link.
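To see the naming and sizing on a small scale, here is a minimal sketch. The file name sample.bin, its 2500-byte size, and the 1k piece size are all made up for the demonstration; nothing here comes from a real JDK archive.

```shell
# Scratch directory; the file name sample.bin and its size are made up for this demo.
workdir=$(mktemp -d)

# Create a 2500-byte file (250 copies of a 10-character string, no newlines).
awk 'BEGIN { for (i = 0; i < 250; i++) printf "0123456789" }' > "$workdir/sample.bin"

# Split into 1KB (1024-byte) pieces whose names start with "sample.bin."
split -b 1k "$workdir/sample.bin" "$workdir/sample.bin."

# Two full 1024-byte pieces (.aa, .ab) and a 452-byte remainder (.ac).
wc -c "$workdir"/sample.bin.??
```

Only the last piece is short, just as the last file of a line-based split can have fewer than 1000 lines.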

When the files reach their destination, they can be joined by using cat:


cat JDK.tar.gz.* > JDK.tar.gz

A useful command for confirming that a split file has been joined correctly is cksum. Historically, it has been used to confirm that files were transferred properly over noisy phone lines.

cksum computes a cyclic redundancy check (CRC) for each filename argument and prints out the CRC along with the number of bytes in the file and the filename. The easiest way to compare the CRC for the two files is to get the CRC for the original file:


cksum JDK.tar.gz > JDK.crc

and then compare it to the cksum output for the joined file.
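The whole round trip (split, join, compare checksums) can be sketched as follows. The names original.dat, joined.dat, and the piece. prefix are hypothetical. Feeding each file to cksum on standard input is one way to leave the filename out of the output, so the two results can be compared as plain strings.

```shell
# Scratch directory; original.dat is an invented stand-in for the real file.
workdir=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 500; i++) print "record", i }' > "$workdir/original.dat"

# CRC and byte count of the original (no filename is printed when reading stdin).
orig_sum=$(cksum < "$workdir/original.dat")

# Split into 1KB pieces, then rejoin them under a different name.
# The shell expands piece.* in sorted order: piece.aa, piece.ab, ...
split -b 1k "$workdir/original.dat" "$workdir/piece."
cat "$workdir"/piece.* > "$workdir/joined.dat"

joined_sum=$(cksum < "$workdir/joined.dat")

if [ "$orig_sum" = "$joined_sum" ]; then
    echo "checksums match"
else
    echo "checksums differ"
fi
```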


Counting Words

Counting words is a handy thing to be able to do, and there are many ways to do it. Probably the easiest is the wc command, which stands for word count, but wc only prints the total number of characters, words, or lines. What if you need a breakdown by word? It's a good problem, and one that serves to introduce the next set of GNU text utilities.

Here are the commands you need:

tr Transliterate; changes the first set of characters it is given into the second set of characters it is given; also deletes characters
sort Sorts the file (or its standard input)
uniq Prints out all the unique lines in a file (collapses duplicates into one line and optionally gives a count)

I used this chapter as the text for this example. First, this line gets rid of all the punctuation and braces, and so on, in the input file:


tr '!?":;[]{}(),.' ' ' < ~/docs/ch16.doc

This demonstrates the basic usage of tr:


tr 'set1' 'set2'

This takes all the characters in set1 and transliterates them to the characters in set2. Usually, the characters themselves are used, but the standard C escape sequences work also (as you will see).
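A quick way to watch this happen is to feed tr a single line. The sentence below is invented for the demo. Note that set2 here is shorter than set1; GNU tr handles that by repeating the last character of set2, which is why one space is enough.

```shell
# Sample sentence invented for this demo.
sample='Wait: is this (really) the end? Yes!'

# Every character listed in set1 is replaced. GNU tr pads the one-character
# set2 by repeating its last character, so each of them becomes a space.
cleaned=$(printf '%s\n' "$sample" | tr '!?":;[]{}(),.' ' ')

printf '%s\n' "$cleaned"
```

The words survive untouched; only the listed punctuation turns into spaces, which leaves some doubled spaces behind.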

I specified set2 as ' ' (the space character) because words separated by those characters need to remain separate. The next step is to fold capitalized words into their lowercase forms because the words To and to, the and The, and Files and files are really the same word. To do this, tell tr to change all the capital characters 'A-Z' into lowercase characters 'a-z':


tr '!?":;[]{}(),.' ' ' < ~/docs/ch16.doc |
tr 'A-Z' 'a-z'

I broke the command into two lines, with the pipe character as the last character in the first line so that the shell (sh, bash, ksh) will do the right thing and use the next line as the command to pipe to. It's easier to read and cut and paste from an xterm this way, also. This won't work under csh or tcsh unless you start one of the preceding shells.

Multiple spaces in the output can be squeezed into single spaces with


tr '!?":;[]{}(),.' ' ' < ~/docs/ch16.doc |
tr 'A-Z' 'a-z' | tr -s ' '
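Here are the three steps so far run on one made-up line, as a minimal sketch. Each stage ends with the pipe character so the shell continues onto the next line.

```shell
# Sample line invented for this demo.
sample='The Files:   the files, (THE FILE)'

folded=$(printf '%s\n' "$sample" |
    tr '!?":;[]{}(),.' ' ' |   # punctuation becomes spaces
    tr 'A-Z' 'a-z' |           # fold uppercase into lowercase
    tr -s ' ')                 # squeeze runs of spaces into one

printf '%s\n' "$folded"
```

The result still ends in a single space, left over from the closing parenthesis; squeezing shortens runs of spaces but does not remove them.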

To get a count of how many times each word is used, you need to sort the file. In the simplest form, the sort command sorts each line, so you need to have one word per line to get a good sort. This code deletes all of the tabs (\t) and the newlines (\n) and then changes all the spaces into newlines:



tr '!?":;[]{}(),.' ' ' < ~/docs/ch16.doc |
tr 'A-Z' 'a-z' | tr -s ' ' | tr -d '\t\n' | tr ' ' '\n'
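One thing to watch with this step: deleting a newline outright glues the last word of a line to the first word of the next line whenever no space or punctuation sits between them. The sketch below uses an invented two-line sample to compare the pipeline above with a variation that maps tabs and newlines to spaces before splitting. The variation is my assumption about a safer ordering, not the command used for the counts in this chapter.

```shell
# Invented sample; the first line ends mid-sentence with no trailing space.
sample='The file is split
into pieces. The pieces are joined.'

# The pipeline as given above: squeeze, delete tabs/newlines, spaces to newlines.
chapter_words=$(printf '%s\n' "$sample" |
    tr '!?":;[]{}(),.' ' ' | tr 'A-Z' 'a-z' |
    tr -s ' ' | tr -d '\t\n' | tr ' ' '\n')

# Variation (an assumption, not the chapter's command): turn tabs and newlines
# into spaces, then translate spaces to newlines and squeeze the repeats.
variant_words=$(printf '%s\n' "$sample" |
    tr '!?":;[]{}(),.' ' ' | tr 'A-Z' 'a-z' |
    tr '\t\n' '  ' | tr -s ' ' '\n')

echo "chapter pipeline:"; printf '%s\n' "$chapter_words"
echo "variation:";        printf '%s\n' "$variant_words"
```

With this sample, the first version produces the single word splitinto, while the variation keeps split and into on separate lines.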

Now you can sort the output, so simply tack on the sort command:


tr '!?":;[]{}(),.' ' ' < ~/docs/ch16.doc |
tr 'A-Z' 'a-z' | tr -s ' ' | tr -d '\t\n' | tr ' ' '\n' | sort

You could eliminate all the repeats at this point by giving the sort the -u (unique) option, but you need a count of the repeats, so use the uniq command. By default, the uniq command prints out "the unique lines in a sorted file, discarding all but one of a run of matching lines" (man page uniq). uniq requires sorted files because it only compares consecutive lines. To get uniq to print out how many times a word occurs, give it the -c (count) option:


tr '!?":;[]{}(),.' ' ' < ~/docs/ch16.doc |
tr 'A-Z' 'a-z' | tr -s ' ' | tr -d '\t\n' |
tr ' ' '\n' | sort | uniq -c
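To see why the sort has to come before uniq, try both orders on a short word list. This is a sketch with an invented five-word input.

```shell
# Invented word list, one word per line, deliberately not sorted.
words='to
the
to
the
the'

# Without sorting, no two identical lines are adjacent, so nothing collapses.
unsorted=$(printf '%s\n' "$words" | uniq -c)

# After sorting, identical words sit next to each other and are counted together.
sorted=$(printf '%s\n' "$words" | sort | uniq -c)

printf 'unsorted:\n%s\nsorted:\n%s\n' "$unsorted" "$sorted"
```

The unsorted run reports five separate lines with a count of 1 each; the sorted run reports the as 3 and to as 2.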

Next, you need to sort the output again because the order in which the output is printed out is not sorted by number. This time, to get sort to sort by numeric value instead of string compare and have the largest number printed out first, give sort the -n (numeric) and -r (reverse) options:


tr '!?":;[]{}(),.' ' ' < ~/docs/ch16.doc |
tr 'A-Z' 'a-z' | tr -s ' ' | tr -d '\t\n' |
tr ' ' '\n' | sort | uniq -c | sort -rn

The first few lines (ten actually, I piped the output to head) look like this:


389 the
164 to
127 of
115 is
115 and
111 a
 80 files
 70 file
 69 in
 65 `

Note that the tenth most common word is the single quote character, but I said we took care of the punctuation with the very first tr. Well, I lied (sort of); we took care of all the characters that would fit between quotes, and a single quote won't fit. So why not just backslash escape that sucker? Well, not all shells will handle that properly.

So what's the solution?

The solution is to use the predefined character sets in tr. The tr command knows several character classes, and the punctuation class is one of them. Here is a complete list (names and definitions) of class names, from the man page for tr:

alnum Letters and digits
alpha Letters
