# HTTP 1.1 Web Server

This is an implementation overview of the very basic HTTP 1.1 server provided.

## Configuration

Configuration is done in a configuration file. `webhttp.config` is a wrapper for Python's `SafeConfigParser`. The default configuration file is `~/.webpy.ini`. An example is given in `webpy.ini.example` and should be self-explanatory. Two settings may require explanation:

* `index` is similar to Apache's `DirectoryIndex`;
* the `error`*`xxx`* settings point to files, relative to the root content directory, that should be served in case of errors.

Most settings can be overridden by command line flags and have sensible defaults. It is not necessary to use a configuration file. If none is provided, the default settings are:

* Hostname: localhost
* Port: 8001
* Timeout: 15s
* Maximum number of connections: 1000

## Logging

Logging is done using `logging`. `webhttp.weblogging` is a wrapper for this library that inserts the right name of the logger (currently `webhttp`).

## Parsing

[RFC 2616][rfc2616] gives a context-free grammar specification of HTTP requests and responses. To parse these elements appropriately, the CFG has been translated directly to regular expressions in `webhttp.regexes`. This leads to bulky (and possibly inefficient) regexes, but on the other hand produces maintainable code.

## Connection handling

A `webhttp.server.Server` object opens a listening socket on the specified port and hostname. Whenever a connection is requested, a `ConnectionHandler` object is created. Because this class extends `threading.Thread`, each handler runs in a separate thread, allowing simultaneous connections.

The `ConnectionHandler` will read data and feed it, 4096 bytes at a time, to a `parser.RequestParser`. This class uses a buffer to hold unfinished requests, because a request's size may exceed 4096 bytes.
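
The buffering described above can be illustrated with a minimal sketch. It is not the code of `parser.RequestParser`: the class name `BufferedRequestParser` and its `feed` method are hypothetical, and the sketch assumes requests without a body (such as GET), so that a blank line marks the end of each request.

```python
class BufferedRequestParser:
    """Hypothetical stand-in for a request parser: splits a byte stream
    into complete requests and keeps the unfinished remainder."""

    # Assumption: requests carry no body, so the blank line after the
    # headers terminates the request.
    TERMINATOR = b"\r\n\r\n"

    def __init__(self):
        self.buffer = b""  # bytes of a request that is not complete yet

    def feed(self, data):
        """Add one chunk of received bytes; return the list of requests
        that are complete after this chunk (possibly empty)."""
        self.buffer += data
        requests = []
        while self.TERMINATOR in self.buffer:
            head, self.buffer = self.buffer.split(self.TERMINATOR, 1)
            requests.append(head + self.TERMINATOR)
        return requests
```

A connection handler could call `feed` once per chunk of at most 4096 bytes read from the socket. Whatever follows the last complete request stays in `buffer` until the next chunk arrives, so one chunk may yield zero, one or several requests.
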
Since the `RequestParser` keeps reading after yielding the first request, persistent connections are implicitly supported (more below).

For every request that the `RequestParser` yields, a `composer.ResponseComposer` is used to compose the appropriate response. This response is then sent back to the client by the `ConnectionHandler`.

### Persistent connections

As mentioned above, the *handling* of requests in persistent connections is supported implicitly. However, this does not hold for *closing* persistent connections. In accordance with [RFC 2616][rfc2616], persistent connections are closed in one of the following situations:

* When the configured timeout is exceeded while waiting for a new request, the connection will be closed and a debug level log message is produced.
* When the composed response has a `Connection: close` header, the connection will be closed directly after sending that response.

When a client requests closure of a persistent connection by sending the `Connection: close` header, the response will always include this header as well. This situation is therefore covered by the second bullet point.

## Serving GET requests

As explained above, the `ResponseComposer` is responsible for building responses for requests. It does this through three key methods.

* `compose_response` tackles directory traversal attacks by directly refusing to handle URIs that contain `..`. If this check is passed, a new `Process` is created in which `serve` (see below) is run. This is necessary because `serve` uses methods that could theoretically time out, for example when handling excessively large files.
* `serve` implements most of the logic. It creates a `Resource` object for the URI requested, handles ETag-related headers (see below), sets the `Content-Type` header as needed, handles encoding (see below), sets the `Connection` header as needed and, perhaps most importantly, does error handling.
  If all goes well, it returns a `Response`, which is then returned from `compose_response`.
* `serve_error` is a `serve` wrapper that serves the error page for a particular HTTP status code (see Configuration above).

### Status codes

* 200 for successful requests;
* 404 if the requested resource could not be found;
* 403 if the user running the server doesn't have permission to read the resource, or if a directory without an index file has been requested.

## Caching using ETags

ETags are properties of resources and are therefore generated by the `generate_etag` method of the `Resource` class. This uses the MD5 hash of the result of `os.stat()` on the requested file. We don't need a cryptographically secure hash for this: only collision resistance is relevant to some degree, and clients that are worried about collisions can simply not send conditional requests.

`Resource` also has an `etag_match` method that checks whether a given ETag list matches the ETag of the resource.

### Status codes

304 if the ETag matched, so that the client's cached copy is used.

## Encoding

Encoding is done through the `Resource` class. Its `get_content` method takes an optional argument `encoding` (default: `identity`). Internally, another module, `encodings`, is used. This module has functions to convert the internal representation of an encoding to a string and vice versa, and to encode and decode strings using some encoding. Currently, only `gzip` and `identity` are supported.

### Status codes

406 if only unknown encodings have been requested.
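
How an encoding could be chosen and applied is sketched below. This is not the `encodings` module itself: the function names `choose_encoding` and `encode` are hypothetical, and the sketch simplifies content negotiation by ignoring `q` values and the `*` wildcard in the `Accept-Encoding` header. It also assumes that a missing or empty header means `identity`.

```python
import gzip

# The two encodings the document names as supported.
SUPPORTED = ("gzip", "identity")


def choose_encoding(accept_encoding):
    """Return the first supported encoding listed in the header value,
    or None when only unknown encodings were requested (the 406 case).

    Simplification: q values and the "*" wildcard are ignored.
    """
    if not accept_encoding.strip():
        return "identity"  # assumption: no preference means identity
    for item in accept_encoding.split(","):
        # Drop parameters such as ";q=0.5" and normalise the name.
        name = item.split(";", 1)[0].strip().lower()
        if name in SUPPORTED:
            return name
    return None


def encode(content, encoding):
    """Encode a bytes payload with one of the supported encodings."""
    if encoding == "gzip":
        return gzip.compress(content)
    return content  # identity leaves the payload untouched
```

For example, `choose_encoding("gzip, deflate")` selects `gzip`, while a header that lists only unsupported encodings gives `None`, which a caller could map to the 406 status described above.
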

## Acknowledgements

The following Python modules have been used, in alphabetical order:

* `configparser`, in `webhttp.config`, for parsing the ini configuration file
* `gzip`, in `webhttp.encodings`, for gzip encoding
* `hashlib`, in `webhttp.resource`, for ETag generation
* `mimetypes`, in `webhttp.resource`, for guessing content type and encoding
* `multiprocessing`, in `webhttp.composer`, for timing out the `serve` call
* `StringIO`, in `webhttp.encodings`, for gzip encoding
* `urlparse`, in `webhttp.resource`, for parsing URIs

In addition to these, the following fairly standard libraries were used: `argparse`, `binascii`, `itertools`, `logging`, `os`, `re`, `socket`, `sys`, `threading`, `time`, `unittest`.

[rfc2616]: http://tools.ietf.org/html/rfc2616