HTML 4.0 Sourcebook: Uniform Resource Locators (URLs)

HTML 4.0 Sourcebook
(Publisher: John Wiley & Sons, Inc.)
Author(s): Ian S. Graham
ISBN: 0471257249
Publication Date: 04/01/98

Note that the URL syntax for Gopher queries uses a plus (+) sign to separate different search strings. Therefore, if you want to include a literal plus sign within a string, it must be encoded (the encoding for a plus sign is %2B).

Client Construction of Query Strings

The Web browser inserts the plus sign separators and converts literal plus signs in query strings into encoded values. When a user accesses a Gopher search from a Web browser, he or she is prompted for search strings. These are generally entered in a text box, using space characters to separate the different strings. When the search information is submitted, the search strings are appended, with appropriate encodings, to the URL. The client software is responsible for replacing space characters with plus signs and for encoding characters in the user’s search string that might be incorrectly interpreted.
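
The client-side step described above can be sketched in Python. This is only an illustration: it assumes the standard library's urllib.parse.quote_plus is a reasonable stand-in for a browser's encoder, and the function name build_search_string is hypothetical.

```python
from urllib.parse import quote_plus

def build_search_string(user_input: str) -> str:
    """Turn space-separated search terms into a plus-separated, encoded string."""
    # Split the text-box input on whitespace to recover the individual terms.
    terms = user_input.split()
    # Encode each term (a literal "+" becomes %2B), then join the terms with "+".
    return "+".join(quote_plus(term) for term in terms)

print(build_search_string("C++ compilers"))
```

Note that a literal plus sign inside a term is encoded as %2B, so the only unencoded plus signs in the result are the separators.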

The Gopher protocol supports additional features not discussed here. Please see the references at the end of this chapter for additional information.

HTTP URLs

Http URLs designate files, directories, or server-side programs accessible using the HTTP protocol. An http URL typically points to a file (text or program) or a directory. The general form is:

<http://int.dom.nam:port/resource>

where the port number is optional (the default value is 80) and where resource specifies the resource. Resources are usually (but not always) files or directories. A directory is indicated by terminating the directory name with a forward slash, as in:

<http://www.utoronto.ca/webdocs/HTMLdocs/>

The following reference to this directory is an error, since it implies a reference to a file and not a directory:

<http://www.utoronto.ca/webdocs/HTMLdocs>

Most HTTP servers can detect this type of error and realize that the user wants to view the directory listing. In these cases, the server returns a server redirect HTTP response header, which contains the correct URL (with the trailing slash) and instructs the browser to try this URL instead. Server redirects are discussed in Chapter 9.
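
The redirect decision can be sketched as follows. This is a hypothetical illustration, not the logic of any particular server: the function name, the set of known directories, and the choice of status code 301 are all assumptions made for the example.

```python
from typing import Optional, Set, Tuple

def directory_redirect(path: str, directories: Set[str]) -> Optional[Tuple[int, str]]:
    """Return (status, Location) when path names a directory but lacks its slash."""
    # A request for "/a/b" is redirected only if "/a/b/" is a known directory.
    if not path.endswith("/") and path + "/" in directories:
        return 301, path + "/"   # 301 is an assumed status code for this sketch
    return None                  # no redirect needed

known = {"/webdocs/HTMLdocs/"}
print(directory_redirect("/webdocs/HTMLdocs", known))
```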

Note, however, that you can omit the trailing slash when referencing the root of a Web site. Thus, the following two URLs are both equivalent and correct:

<http://www.utoronto.ca/>
<http://www.utoronto.ca>
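
As a rough check of the two rules above (the default port of 80 and the optional trailing slash on the site root), the following Python sketch reduces an http URL to a host, port, and path. The function name normalize is hypothetical, and the port 8080 in the last line is an invented value used only to show a non-default port.

```python
from urllib.parse import urlsplit

def normalize(url: str):
    """Return (host, port, path), filling in the defaults for an http URL."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else 80  # default HTTP port
    path = parts.path or "/"                             # empty path means the root
    return parts.hostname, port, path

print(normalize("http://www.utoronto.ca"))
print(normalize("http://www.utoronto.ca/"))
print(normalize("http://www.utoronto.ca:8080/webdocs/HTMLdocs/"))
```

The first two calls yield the same tuple, which is the sense in which the two root URLs are equivalent.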

Special Characters in HTTP URLs

The forward slash (/), semicolon (;), question mark (?), and hash (#) are special characters in the path and query string portions of an http URL. The slash denotes a change in hierarchy (such as a directory), while the question mark ends the resource location path and indicates the start of a query string. The hash denotes the start of a fragment identifier. The semicolon is reserved for future use and should therefore be encoded in all cases where you intend a literal semicolon.
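
To see these special characters at work, the sketch below splits a URL with Python's urllib.parse.urlsplit and encodes a literal semicolon with quote. The fragment name "top" and the file name "notes;v2.txt" are invented for illustration.

```python
from urllib.parse import quote, urlsplit

# "?" ends the path and starts the query string; "#" starts the fragment identifier.
parts = urlsplit("http://some.site.edu/cgi-bin/foo?arg1+arg2#top")
print(parts.path)      # the hierarchical path, with "/" separating levels
print(parts.query)     # everything between "?" and "#"
print(parts.fragment)  # everything after "#"

# A literal semicolon in a path segment should be sent in encoded form.
encoded_name = quote("notes;v2.txt", safe="")
print(encoded_name)
```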

URL Encoding of Query Strings

Http URLs can contain query data to be passed to the server—these data are appended to the URL, separated from it by a question mark. Besides the character encodings required within URLs, query strings undergo additional levels of encoding to preserve information about the structure of the query data. This is necessary because certain characters in a query string are assigned special encoded meanings as part of the query—for example, the plus character (+) used to encode spaces, as noted earlier. There are several different ways these encodings are done, depending both on the mechanism by which the data are input by the user and on the mechanism by which the data are sent to the server.

Document authors do not usually have to worry about the encoding phase; browsers take ISINDEX or FORM data and do the encoding automatically. However, a gateway program author must explicitly decode these data to recover the original information; thus, he or she must understand the encoding in order to reverse the procedure. The following is a brief review of the encoding steps; you are referred to Chapters 6 (discussion of FORM elements) and 10 for more details.

URL Encoding for ISINDEX and FORM Data

The following steps outline the query string encoding process, elaborated to illustrate the important points. If the data are from an ISINDEX query, these steps apply to the encoding of the entire query input string; if the data are from FORM-based input, the encoding steps apply to each name and value string from the form’s user-input elements.

1. Percent characters (%) are converted into their URL encodings (%25).

2. Plus signs (+) are converted into their URL encodings (%2b).

3. Ampersands (&) are converted into their URL encodings (%26).

4. Equals signs (=) are converted into their URL encodings (%3d).

5. The possibly special characters—namely # ; / ? : $ ! , ' ( )—are converted into their URL encodings.

6. Space characters are encoded as plus signs (+).

7. All non-ASCII characters (hex codes greater than 7f), all ASCII control characters (hex codes 00-1f, and 7f), and the unsafe ASCII characters listed in Table 6.1 are converted into their URL encodings (note that spaces have already been converted into plus signs).

At this point, all ASCII punctuation characters are encoded, except for the five characters:

_ - . * @
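
The net effect of steps 1 through 7 can be sketched as a single pass over the input in Python. This is an illustrative sketch rather than any browser's actual implementation: the function name encode_component is hypothetical, the choice of UTF-8 as the default byte encoding for non-ASCII characters is an assumption, and the output uses uppercase hexadecimal digits (percent encodings are case-insensitive).

```python
import string

# Characters left unencoded by the steps above: letters, digits, and _ - . * @
UNENCODED = set(string.ascii_letters + string.digits + "_-.*@")

def encode_component(text: str, charset: str = "utf-8") -> str:
    """Encode one ISINDEX query string, or one FORM name or value string."""
    out = []
    for ch in text:
        if ch in UNENCODED:
            out.append(ch)                 # passes through untouched
        elif ch == " ":
            out.append("+")                # step 6: spaces become plus signs
        else:
            # Everything else, including % + & =, becomes %XX for each byte.
            out.extend(f"%{byte:02X}" for byte in ch.encode(charset))
    return "".join(out)

print(encode_component("50% + tax = a&b"))
```

Because each character is examined exactly once, the percent signs introduced by the encoding are never encoded a second time, which is the reason the step-by-step description handles the percent character first.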

If the data are from an ISINDEX query, the encoding is complete. If they are from a FORM, only the individual name and value strings from each FORM input element have been encoded, as described in steps 1 through 7. These strings are then combined according to the following rules:

1. Each name and value pair is combined into a composite string of the form name=value. Note that the first encoding phase (steps 1–7) encoded all equals signs in the name and value strings, so that the only unencoded equals signs in the string are those used to separate a name from its associated value.

2. The name=value strings from all the FORM elements are combined into a single string, separated by ampersand (&) characters. For example:

 name1=value1&name2=value2

Note that the first encoding phase (steps 1–7) encoded all ampersands in the name and value strings, so that the only unencoded ampersands in the query string are those that separate name/value pairs.
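
The combination rules can be tried out with Python's urllib.parse.urlencode, which encodes each name and value and then joins them with equals signs and ampersands. This is an approximation of the procedure described here, not an exact match: the Python encoder leaves a slightly different set of punctuation unencoded (for example, it encodes * and @ but leaves ~ alone). The function name build_query and the field names are hypothetical.

```python
from urllib.parse import urlencode

def build_query(fields):
    """Join (name, value) pairs into a name1=value1&name2=value2 string."""
    # Each name and value is encoded first, so any "=" or "&" inside them
    # appears as %3D or %26 and cannot be confused with a separator.
    return urlencode(fields)

query = build_query([("name", "A & B"), ("formula", "x=y+1")])
print(query)
```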

Query-String Encoding MIME Type

Query string data encoded according to this algorithm are said to be URL-encoded. In fact, this encoding mechanism is assigned its own MIME type, namely:

Content-type: application/x-www-form-urlencoded

Note that you can easily tell if the data are from a FORM or ISINDEX query just by checking for unencoded equals signs. For example, the first of the following two URLs is from an ISINDEX query, the second from a FORM (the query string is the portion following the question mark):

<http://some.site.edu/cgi-bin/foo?arg1+arg2+arg3>
<http://some.site.edu/cgi-bin/program?name1=value1&name2=value2>
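
A gateway program could apply this check before decoding. The Python sketch below is one possible approach under stated assumptions: the function name classify_query and the "isindex"/"form" labels are hypothetical, and the decoding relies on the standard library's urllib.parse functions.

```python
from urllib.parse import parse_qsl, unquote, urlsplit

def classify_query(url: str):
    """Return ("form", [(name, value), ...]) or ("isindex", [arg, ...])."""
    query = urlsplit(url).query
    if "=" in query:
        # An unencoded equals sign can only be a name/value separator.
        return "form", parse_qsl(query, keep_blank_values=True)
    # Otherwise split on the unencoded plus signs, then decode each argument.
    return "isindex", [unquote(arg) for arg in query.split("+")]

print(classify_query("http://some.site.edu/cgi-bin/foo?arg1+arg2+arg3"))
print(classify_query("http://some.site.edu/cgi-bin/program?name1=value1&name2=value2"))
```

In the ISINDEX branch the split happens before decoding, so an encoded plus sign (%2B) inside an argument is restored as a literal character and is not mistaken for a separator.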
