Internet Security Professional Reference: CGI Security

How CGI Works

The CGI specification details only the means by which data is passed between the server and the CGI program. Figure 14.1 shows the basic model.

Figure 14.1 Data passing between browser, server, and CGI.

A CGI is designated “nph” (non-parsed headers) if the program name begins with nph-. The program can then bypass the server and output directly to the browser, which is necessary if the program needs to decide its own HTTP response code or ensure that the server does not perform any buffering.
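As a rough illustration of what “deciding its own response code” means in practice, the sketch below builds a complete HTTP response by hand. The file name nph-hello.pl and the helper nph_response are hypothetical names chosen for this example; the only server-provided input assumed is the standard SERVER_PROTOCOL environment variable, with HTTP/1.0 as a fallback.

```perl
#!/usr/bin/perl
# Hypothetical nph-hello.pl: a sketch of a non-parsed-headers CGI.
use strict;
use warnings;

# Build the complete HTTP response, status line included, because the
# server passes nph output through without adding headers of its own.
sub nph_response {
    my ($status, $body) = @_;
    my $protocol = $ENV{'SERVER_PROTOCOL'} || 'HTTP/1.0';
    return join("\r\n",
        "$protocol $status",
        "Content-Type: text/plain",
        "Content-Length: " . length($body),
        "",                      # blank line ends the header block
        $body);
}

$| = 1;    # turn off Perl's own output buffering on STDOUT
print nph_response('200 OK', "Hello from an nph script\n");
```

An ordinary (parsed-header) CGI would print only the Content-Type line and the body, leaving the status line to the server.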

The CGI program can receive information in the following three ways, any of which can potentially be abused by a cracker attempting to subvert security:

• Command-line arguments. This is an older method, used only in ISINDEX queries, one of the earliest mechanisms for passing user-supplied data to the Web server. It has been made obsolete by the much more complete HTML forms specification and the GET and POST methods of passing data to CGI programs. Do not use this unless you must cater to extremely old clients; current versions of all clients in common usage can use a more recent mechanism.

• Environment variables. A number of environment variables are set by the server before executing a CGI. Of particular interest is the QUERY_STRING variable, which contains any data following a ? character in the URL (for example, http://machine/cgi-bin/CGIname?datatopass). This is the only means for passing data to a CGI when using the GET method in forms; the entire contents of the form are encoded, concatenated, and placed into QUERY_STRING.
Because there are usually built-in limits to the length of environment variables, the POST method is superior for most purposes. The advantage of the GET method is that CGIs can be called without an HTML form involved; the CGI program URL and any QUERY_STRING data can be embedded directly into a hyperlink.

• The standard input stream. To pass an arbitrary amount of data to a CGI, use the POST method. The form’s data is encoded as in GET, but it is sent to the server as the request body. The HTTP server receives the input and sends it to the CGI on the standard input stream.

For historical reasons, the server is not guaranteed to send EOF when all available data has been sent. The number of bytes available for reading is stored in the CONTENT_LENGTH environment variable, and CGI programs must read only this many bytes. This is a potential security issue because some servers do send EOF at the end of data, so an incorrectly written CGI might work as expected when first tested, but when moved to another server its behavior might change in an exploitable way.

In the following example, the behavior is undefined because the code tries to read until it receives an EOF.

    if ($ENV{'REQUEST_METHOD'} eq "POST") {     # Wrong, may never terminate
          while (<STDIN>) { […] }               # or read bogus data
    }

The second example correctly reads only CONTENT_LENGTH bytes.

    if ($ENV{'REQUEST_METHOD'} eq "POST") {
          read(STDIN, $input, $ENV{'CONTENT_LENGTH'});     # Right
    }
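Because CONTENT_LENGTH is derived from what the client sent, a more defensive reader might also validate it before trusting it. The following is a minimal sketch, not a definitive implementation: the function name read_form_data and the 64 KB limit in $MAX_BYTES are assumptions made for illustration. It returns the raw, still-encoded data for either GET or POST, and never reads past CONTENT_LENGTH bytes.

```perl
use strict;
use warnings;

# Hypothetical upper bound on request size; pick one that suits the form.
my $MAX_BYTES = 64 * 1024;

# Return the raw, still-encoded form data for a GET or POST request.
# Dies on anything unexpected rather than guessing.
sub read_form_data {
    my ($in) = @_;                      # input handle, normally \*STDIN
    my $method = $ENV{'REQUEST_METHOD'} || '';

    if ($method eq 'GET') {
        my $query = defined $ENV{'QUERY_STRING'} ? $ENV{'QUERY_STRING'} : '';
        die "query string too long\n" if length($query) > $MAX_BYTES;
        return $query;
    }
    die "unsupported method\n" unless $method eq 'POST';

    # CONTENT_LENGTH must be a plain non-negative integer within bounds.
    my $length = defined $ENV{'CONTENT_LENGTH'} ? $ENV{'CONTENT_LENGTH'} : '';
    die "bad CONTENT_LENGTH\n" unless $length =~ /\A[0-9]+\z/;
    die "request body too large\n" if $length > $MAX_BYTES;

    # read() may return fewer bytes than requested, so loop until done.
    my $data = '';
    while (length($data) < $length) {
        my $got = read($in, $data, $length - length($data), length($data));
        die "read error: $!\n" unless defined $got;
        die "body shorter than CONTENT_LENGTH\n" if $got == 0;
    }
    return $data;
}
```

A CGI would typically call this as read_form_data(\*STDIN) inside an eval block, and report a generic error to the browser if it dies.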

CGI Data: Encoding and Decoding

The data passed to a CGI is a series of key/value pairs representing the form’s contents. It is encoded according to a simple scheme in which all unsafe characters are replaced by their percent-encoding, which is the % character followed by the hexadecimal value of the character. For example, the ~ character is replaced by %7E. For historical reasons, the space character is usually not percent-encoded but is instead replaced by the + character.
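The encoding rule can be sketched in a few lines of Perl. The function name form_encode is hypothetical, and the set of characters left unencoded here (letters, digits, underscore, period, and hyphen) is a deliberately conservative assumption rather than the exact set any particular browser uses. The input is assumed to be a byte string.

```perl
use strict;
use warnings;

# Encode one key or value roughly the way a browser encodes form data.
# Letters, digits and a few punctuation marks pass through; every
# other byte except the space becomes %XX, then spaces become +.
sub form_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9_.\- ])/sprintf("%%%02X", ord($1))/ge;
    $text =~ tr/ /+/;
    return $text;
}
```

For example, form_encode("home dir: ~user") returns home+dir%3A+%7Euser. A literal + in the input is percent-encoded as %2B in the first step, so it cannot be confused with an encoded space.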

Note: A complete list of unsafe characters is available in RFC 1738, Uniform Resource Locators (URL), http://ds.internic.net/rfc/rfc1738.txt.

Despite what the unsafe designation seems to imply, the characters are not encoded for security reasons. They are encoded because they are subject to accidental modification at gateways or are used for other purposes in URLs. Because the encoding is expected to be performed by the client, there is no guarantee that unsafe characters have actually been encoded according to the specification. A CGI must not assume that any encoding has been performed.

Before submission to the server, the browser joins each key/value pair with the = character and concatenates them all, separated by the & character. Again, although this is the expected and desired behavior, data of any kind can potentially be submitted.
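Decoding reverses these steps: split on &, split each pair on the first =, turn + back into a space, and then expand the percent-encodings. The sketch below uses the hypothetical names form_decode and parse_form_data, and it is tolerant rather than strict: malformed input is passed through unchanged, so every decoded value still has to be validated before use.

```perl
use strict;
use warnings;

# Undo the form encoding of one key or value.
sub form_decode {
    my ($text) = @_;
    $text =~ tr/+/ /;                                  # + stands for a space
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;      # %XX to one byte
    return $text;
}

# Split raw form data into a hash of key => [values].
# A stray % or a pair without = still comes through, so callers
# must treat every key and value as untrusted.
sub parse_form_data {
    my ($raw) = @_;
    my %form;
    for my $pair (split /&/, $raw) {
        next if $pair eq '';
        my ($key, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        push @{ $form{ form_decode($key) } }, form_decode($value);
    }
    return \%form;
}
```

The + translation has to happen before the percent-decoding; otherwise an encoded %2B would first become + and then be wrongly turned into a space. Values are stored in array references because the same key can legitimately appear more than once.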

CGI Libraries

This data format does not lend itself to easy access by the CGI programmer. Several libraries have already done the difficult work. Some of the features available in these libraries are as follows:

• Parsing input data, including canonicalizing line breaks

• Routines to sanitize input data

• Routines to enable easy logging of errors

• Handling GET and POST methods identically

• Debugging aids

Re-inventing these facilities can easily introduce avoidable security problems. Learning to use one or more of these libraries is a wise investment of time. They can be found at the following addresses:

Perl:

• cgi-lib.pl: http://www.bio.cam.ac.uk/web/cgi-lib.pl.txt

• CGI.pm: http://www-genome.wi.mit.edu/ftp/pub/software/WWW/cgi_docs.html (requires Perl 5)

Tcl:

• tcl-cgi: http://ruulst.let.ruu.nl:2000/tcl-cgi.html

C:

• cgic: http://sunsite.unc.edu/boutell/cgic/

• libcgi: http://raps.eit.com/wsk/dist/doc/libcgi/libcgi.html

Understanding Vulnerabilities

There are several points of attack possible when attempting to compromise a CGI program. The HTTP server and protocol should not be trusted blindly, but environment variables and CGI input data are the most likely avenues of attack. Each of these should be considered before writing or using a new CGI.
