HTML 4.0 Sourcebook
(Publisher: John Wiley & Sons, Inc.)
Author(s): Ian S. Graham
ISBN: 0471257249
Publication Date: 04/01/98
Chapter 8 Uniform Resource Locators (URLs)
Uniform Resource Locators, or URLs, are a set of schemes for specifying Internet resources using a single line of typed ASCII characters: A URL simply indicates where a resource is and how to access it. The syntax for URL schemes is very flexible, and schemes exist for all the major Internet communications protocols, including FTP, Gopher, e-mail, HTTP, and WAIS. Within HTML documents, URLs reference the targets of a hypertext link. However, URLs are not restricted to the World Wide Web and can be used to communicate information about Internet resources in e-mail letters, handwritten notes, or even books.
This chapter begins with a general overview of URL properties and the rules for constructing valid URLs. This is followed by a detailed specification of the currently supported URL schemes. The chapter concludes with a discussion of some proposed, but not widely implemented, schemes, along with more general addressing issues relevant to the World Wide Web.
URL Overview and Syntax Rules
As mentioned, a URL is simply a scheme for referencing a particular Internet resource. In general, a URL contains the following four pieces of information, some of which are optional depending on the protocol:
- The protocol to use when accessing the server (e.g., HTTP, Gopher, WAIS). This is always required.
- The Internet domain name of the site on which the server is running, along with any required username and password information. This is not required for some protocols.
- The port number of the server. This is optional, and can be present only if the URL requires a domain name. If absent, the browser assumes a default value dependent on the protocol. For example, the default value for HTTP is 80.
- The location of the resource on the server, often a file or directory specification. This is sometimes optional, depending on the protocol.
Here is a typical example, in this case for the HTTP protocol:
<http://www.w3.org/pub/WWW/People/W3Cpeople.html>
This references the file W3Cpeople.html, in the directory /pub/WWW/People, accessible at the server www.w3.org using the HTTP protocol.
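The four pieces of information listed above can be pulled out of a URL string programmatically. The following is a minimal sketch using Python's standard urllib.parse module; the helper name describe_url and the DEFAULT_PORTS table are illustrative assumptions, not part of any specification, and only the HTTP default of 80 comes from the text above (the FTP and Gopher values are the commonly used defaults).

```python
from urllib.parse import urlsplit

# Illustrative subset of default ports (an assumption for this sketch).
# Only the HTTP value is stated in the chapter text.
DEFAULT_PORTS = {"http": 80, "ftp": 21, "gopher": 70}

def describe_url(url):
    """Split a URL into protocol, host, port, and resource location."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        # Fall back to the protocol's default when no port is given.
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": parts.path,
    }

info = describe_url("http://www.w3.org/pub/WWW/People/W3Cpeople.html")
print(info)
```

Because the example URL gives no port number, the sketch falls back to the default for the http protocol, mirroring what a browser would do.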
NOTE: Book Notation for http URLs
In this book, http URLs that reference Internet-accessible resources are given without the http:// portion of the string. For example:
www.w3.org/pub/WWW/People/W3Cpeople.html
This takes advantage of the fact that most current browsers interpret strings typed into the Location text field at the top of the browser window or into the Open File... or Open Page... pop-up text input windows, as http URLs, if no other protocol is specified.
However, this is not true for http URLs embedded within HTML documents, and authors must not leave out the http:// portion when specifying a full HTTP address. The reasons for this are discussed in the section on Relative URLs later in this chapter.
Allowed Characters in URLs
A URL can, in principle, contain any ISO Latin-1 character, but must be written using only the printable ASCII characters from the bottom half of the ISO Latin-1 character set, excluding control characters (as discussed in Appendix A on the companion Web site). This restriction ensures that URLs can be sent by electronic mail, as many electronic mail programs cannot properly transmit messages containing characters from the upper half of the ISO Latin-1 character set. In a URL, non-ASCII characters (or, indeed, any ISO Latin-1 character) can be represented via a character encoding scheme. This is analogous to the character entities used with HTML. However, the schemes are distinctly different: the URL encoding scheme is understood as one of the rules for writing URLs, whereas character entities are only understood inside an HTML document.
The encoding is simple: any character can be represented by the encoding
%xx
where the percent sign is the special character indicating the start of the encoding and where xx is the hexadecimal code for the desired ISO Latin-1 character (the x represents a hexadecimal digit in the range [0-9,A-F]). Table A.1 (on the companion Web site in Appendix A) lists all the ISO Latin-1 characters alongside their hexadecimal codes. As an example, the encoding for the character é (the letter e with an acute accent) is %E9.
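As a rough illustration of the %xx rule, the sketch below builds the encoding for a single ISO Latin-1 character by hand and then decodes it again with the standard library. The function names encode_char and decode_text are hypothetical, chosen only for this example.

```python
from urllib.parse import unquote

def encode_char(ch):
    """Return the %xx encoding of one ISO Latin-1 character."""
    # encode() raises UnicodeEncodeError for characters outside Latin-1.
    code = ch.encode("latin-1")[0]
    return "%{:02X}".format(code)

def decode_text(text):
    """Replace every %xx sequence with its ISO Latin-1 character."""
    return unquote(text, encoding="latin-1")

e_acute = "\u00e9"  # the letter e with an acute accent
print(encode_char(e_acute))
print(decode_text("caf%E9") == "caf" + e_acute)
```

Note that unquote decodes as UTF-8 unless told otherwise, so the encoding argument is needed here to match the ISO Latin-1 convention this chapter describes.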
Disallowed ASCII Characters
Several ASCII characters are disallowed in URLs and can be present only in encoded form. This is because these characters often have special meanings in a non-URL context. For example, HTML documents use the double quotation mark (") to delimit a URL in a hypertext anchor, so that a quotation mark inside the URL would cause the browser to end the URL prematurely. Therefore, the double quote is disallowed. The space character is also disallowed, since many programs will consider the space as a break between two separate strings. For example, space characters often appear in Macintosh or Windows 95 file or folder names, as in the filename Network Info (there is a single space between the words Network and Info). In a URL, this name must be encoded as:
Network%20Info
Finally, all 33 control characters (hex codes 00 to 1F, and 7F) are disallowed.
Table 8.1 summarizes the disallowed printable characters, including TAB (although TAB, formally, is a control character). You will sometimes see disallowed characters (e.g., the tilde, ~) in a URL, without any special encoding. Such URLs will often work correctly, but to avoid possible problems you should use their encoded forms.
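The rules in this section can be sketched as a small routine that replaces each disallowed character with its %xx form and leaves everything else alone. This is only an illustration under stated assumptions: the names DISALLOWED and encode_disallowed are hypothetical, the character set is copied from Table 8.1, and the sketch deliberately does not touch special characters such as % or /, whose handling depends on context.

```python
# Characters listed as disallowed in Table 8.1 (TAB and SPACE included).
DISALLOWED = set('\t "<>[\\]^`{|}~')

def needs_encoding(ch):
    """True for Table 8.1 characters, control characters, and non-ASCII."""
    code = ord(ch)
    is_control = code <= 0x1F or code == 0x7F
    return ch in DISALLOWED or is_control or code > 0x7F

def encode_disallowed(text):
    """Percent-encode only the characters that may not appear literally."""
    out = []
    for ch in text:
        if needs_encoding(ch):
            # Assumes ISO Latin-1 input; other characters raise an error.
            out.append("%{:02X}".format(ch.encode("latin-1")[0]))
        else:
            out.append(ch)
    return "".join(out)

print(encode_disallowed("Network Info"))
print(encode_disallowed("~user/index.html"))
```

Applied to the filename from the text, the sketch yields the same Network%20Info form shown earlier, and it rewrites a bare tilde as %7E.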
Table 8.1 ASCII Characters That Are Disallowed in URLs
| Character | Hex | Character | Hex |
| TAB | 09 | SPACE | 20 |
| " | 22 | < | 3C |
| > | 3E | [ | 5B |
| \ | 5C | ] | 5D |
| ^ | 5E | ` | 60 |
| { | 7B | | | 7C |
| } | 7D | ~ | 7E |
Special ASCII Characters
In a URL, several ASCII characters have special meanings. In particular, the percent character (%) is special, since it starts a URL character encoding sequence, while the forward slash character (/) is also special, denoting a change in hierarchy, such as a directory change. These special characters must be encoded if you want them to appear as regular characters and not be interpreted as special. Thus, to include the string
ian%euler