wwwstat: Process a sequence of NCSA httpd 1.2 access_log files and output
         a summary of the access statistics in a nice HTML format.
         The program oldwwwstat handles NCSA httpd 1.1 and earlier.

Copyright (c) 1994 Regents of the University of California.

==========================================================================
Licensing and Distribution Information:

This software has been developed by Roy Fielding <fielding@ics.uci.edu> as
part of the Arcadia project at the University of California, Irvine.
Wwwstat was originally based on a multi-server statistics program called
fwgstat-0.035 by Jonathan Magid (jem@sunsite.unc.edu) which, in turn,
was heavily based on xferstats (packaged with version 17 of the
Wuarchive FTP daemon) by Chris Myers (chris@wugate.wustl.edu).
Those parts of wwwstat derived from fwgstat and xferstats are in the
public domain.  As such, this software and all derivations will always be
free to the general public.

The latest version of wwwstat can always be obtained at

       http://www.ics.uci.edu/WebSoft/wwwstat/

or by anonymous ftp from

       ftp://liege.ics.uci.edu/pub/arcadia/wwwstat/

The wwwstat package and those portions developed exclusively at the
University of California are covered by the above copyright notice.
Redistribution and use in source and binary forms are permitted,
subject to the restriction noted below, provided that the above
copyright notice and this paragraph and the following paragraphs are
duplicated in all such forms and that any documentation, advertising
materials, and other materials related to such distribution and use
acknowledge that the software was developed in part by the University of
California, Irvine.  The name of the University may not be used to
endorse or promote products derived from this software without
specific prior written permission.  THIS SOFTWARE IS PROVIDED ``AS
IS'' AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT
LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE.
  
Use of this software in any way or in any form, source or binary,
is not allowed in any country which prohibits disclaimers of any
implied warranties of merchantability or fitness for a particular
purpose or any disclaimers of a similar nature.
  
IN NO EVENT SHALL THE UNIVERSITY OF CALIFORNIA BE LIABLE TO ANY PARTY
FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES
ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION
(INCLUDING, BUT NOT LIMITED TO, LOST PROFITS) EVEN IF THE UNIVERSITY
OF CALIFORNIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

==========================================================================
Installation instructions:

1. Get the wwwstat package from the distribution site (above).  Normally,
   it will be in the form of a compressed unix tar file.  If it has not
   already been decompressed by your WWW client, then do one of:

        % uncompress wwwstat-1.0.tar.Z
        % gunzip wwwstat-1.0.tar.gz

   depending on which compressed version you downloaded.


2. Move the resulting wwwstat-1.0.tar file to the directory above
   where you want to install wwwstat, cd to that directory, and do

        % tar xvf wwwstat-1.0.tar

   to create the directory ./wwwstat-1.0 containing the following:

        Changes       -- the list of known problems and version information
        README        -- this file
        country-codes -- a table of Internet domains and their country names
        example.html  -- an example of what wwwstat output should look like
        old2newlog    -- A tool for converting httpd 1.1 logs to 1.2 format
        oldwwwstat    -- wwwstat for old NCSA httpd 1.0 or 1.1 servers
        wwwstat       -- the main perl script

   If you are already using NCSA httpd 1.2, delete the oldwwwstat script.


3. Configure the wwwstat script(s) to match the server configuration
   and default options desired for your site.  You will probably need
   to change the following (with any text editor):

   The first line (it should point to your perl executable)
      #!/usr/public/bin/perl

   The following variables set in the first section of code:
      $OutputTitle       # The output document's HTML Title.
      $LastSummary       # The URL of the previous summary period
      $ServerHome        # The server's default home page.
      $countrycodefile   # The location of the country-codes file.
      $access_log        # The location of your default server access log
      $srm_conf          # The location of your server configuration file
      $zcat              # The name of your "uncompress to stdout" program
      $zhandle           # The file extensions that indicate compressed files
      $AppendToLocalhost # If address in log entry is one word (a local host),
                         #     what should be appended? (e.g. ".sub.dom.ain")
      $mydom1            # Identify the last two components of your local
      $mydom2            #     hostname addresses for special treatment
      $HeadEstimate      # Estimated size of the response headers

   The defaults for options that can be overridden by the command line:
      $LocalFullAddress  # Show full address for hosts in my domain?
      $OthersFullAddress # Show full address for hosts outside my domain?
      $ShowUnresolved    # Show all unresolved addresses?


4. Make sure the script is executable:

        % chmod 755 wwwstat


5. That's it.  You should now be able to run wwwstat, e.g.

        % wwwstat > results.html

6. If you have some old (prior to 1.2) logfiles that you want converted
   to the new format, you will also need to customize the old2newlog
   script (most variables are the same as those above).  Usage information
   can be obtained via the -h option.

==========================================================================
Usage:                      (NOTE - oldwwwstat has a different set of options)

       wwwstat [-helLoOuUrvx]  [-s srmfile] [-i pathname]
               [-a IP_address] [-c code] [-d date] [-t hour] [-n archive_name] 
               [-A IP_address] [-C code] [-D date] [-T hour] [-N archive_name] 
               [logfile ...]   [logfile.gz ...]    [logfile.Z ...]

Display Options:
     -h  Help -- just display this message and quit.
     -e  Display all invalid log entries on STDERR.
     -l  Do display full IP address of clients in my domain.
     -L  Don't (i.e. strip the machine name from local addresses).
     -o  Do display full IP address of clients from other domains.
     -O  Don't (i.e. strip the machine name from non-local addresses).
     -u  Do display IP address from unresolved domain names.
     -U  Don't (i.e. group all "unresolved" addresses under that name).
     -r  Display table of requests by each remote ident or authuser.
     -v  Verbose display (to STDERR) of each log entry processed.
     -x  Display all requests of nonexistent files to STDERR.
Input Options:
     -s  Get the server directives from the following srm.conf file.
     -i  Include the following file (assumed to be a prior wwwstat output).
    ...  Process the sequence of logfiles (compressed if extension (gz|Z|z)).
Search Options (include in summary only those log entries):
     -a  Containing a  hostname/IP address  matching the given perl regexp.
     -A  Not containing   "      "     "       "      "      "   "    "
     -c  Containing a  server response code matching the given perl regexp.
     -C  Not containing   "      "     "       "      "      "   "    "
     -d  Containing a  date ("Feb  2 1994") matching the given perl regexp.
     -D  Not containing   "      "     "       "      "      "   "    "
     -t  Containing an hour ("00" -- "23")  matching the given perl regexp.
     -T  Not containing   "      "     "       "      "      "   "    "
     -n  Containing an archive (URL) name   matching perl regexp (except +.).
     -N  Not containing   "      "     "       "      "      "   "    "


Note that the Search Options allow for full use of perl regular expressions
(with the exception that the -a, -A, -n and -N options treat '+' and '.'
characters as normal alphabetic characters).  The following description of
perl regular expressions is mostly from the Perl Reference by Johan Vromans:

    Each character matches itself, unless it is one of the special
    characters ^$+?.*()[]{}|\

    ^     at start of pattern, anchors pattern to the beginning of the
          string being matched.
    $     at end of pattern, anchors pattern to the end of the string
          being matched.
    .     matches any arbitrary character, but not a newline.
    (...) groups a series of pattern elements to a single element.
    +     matches the preceding pattern element one or more times.
    ?     matches zero or one times.
    *     matches zero or more times.
    {N,M} denotes the minimum N and maximum M match count. {N} means
          exactly N times; {N,} means at least N times.
    [...] denotes a class of characters to match. [^...] negates the class.
          Inside a class, '-' indicates a range of characters.
    (...|...|...) matches one of the alternatives.

    Non-alphanumerics can be escaped from their special meaning using a
    backslash (\).  Backslash is also used to form more special characters:

    \w    matches alphanumeric, including `_',
    \W    matches non-alphanumeric.
    \b    matches word boundaries, 
    \B    matches non-boundaries.
    \s    matches whitespace, 
    \S    matches non-whitespace.
    \d    matches numeric, 
    \D    matches non-numeric.

Examples:                              # Summarize:

    wwwstat -a '.com$'                 # only reqs from US commercial orgs.
    wwwstat -a '^simplon.ics.uci.edu$' # only reqs from that hostname
    wwwstat -A '^simplon.ics.uci.edu$' # no reqs from that hostname

    wwwstat -c '302'                   # only redirected requests
    wwwstat -c '^5'                    # only reqs resulting in server errors
    wwwstat -C '200'                   # only unsuccessful requests

    wwwstat -d ' [1-7] '               # only the first  week of the month
    wwwstat -d ' ([89]|1[0-4]) '       # only the second week of the month
    wwwstat -d ' (1[5-9]|2[01]) '      # only the third  week of the month
    wwwstat -d ' 2[2-8] '              # only the fourth week of the month
    wwwstat -d ' (29|30|31) '          # only the leftover days of the month

    wwwstat -d 'Feb'                   # only February  log entries
    wwwstat -d '1994'                  # only year 1994 log entries
    wwwstat -D 'Apr'                   # no entries from April

    wwwstat -t '00'                    # only reqs between midnight and 1am
    wwwstat -T '12'                    # no reqs between noon and 1pm

    wwwstat -n '.gif$'                 # only those reqs with a gif extension
    wwwstat -n '^/~user/'              # only those reqs under user's directory
    wwwstat -N '/hidden/'              # no reqs for files under "hidden" dirs

Depending on your unix shell, some special characters may need to be
escaped on the command line to avoid shell interpretation.
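
For example, the date patterns above contain spaces and brackets, which most
shells will split or glob-expand unless the pattern is quoted.  A quick
illustration of the shell's behavior, using a stand-in function rather than
wwwstat itself:

```shell
# count_args stands in for wwwstat; it just reports how many command-line
# arguments actually reached the program.
count_args() { echo $#; }

count_args  Feb  2 1994      # unquoted: the shell splits this into 3 words
count_args 'Feb  2 1994'     # single-quoted: the whole date arrives as 1 word
```

Single quotes are the safest choice since they suppress all shell
interpretation, including of '$' in anchored patterns like '.com$'.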


The intention is that wwwstat be run by a wrapper program as a crontab
entry at midnight, with its output redirected to a temporary file
which can then be moved to the site's summary file.  The temporary file is
necessary to avoid erasing your published file during wwwstat's processing
(which would look very odd if someone tried to GET it from your web).
See below for example crontab entries.

One of the nicest things about wwwstat is that it does not make any
changes to or write any files in the server directories.  Thus, this
program can be safely run by any user with read access to the httpd
server's access_log and srm.conf files.  This allows people to do
specialized summaries of just the things they are interested in.

This program could easily be modified to run as a CGI script, but that
is not recommended for slow processors or heavily utilized servers
unless some effort is made to keep the active log file very small
(e.g. by using the -i option to bootstrap prior output of wwwstat).


==========================================================================
Frequently Asked Questions

1. Why is all that legalese necessary?  Isn't wwwstat free?

The above legalese exists because others have abused the privilege
of using free software.  Because this software was developed by an
employee of the University of California, we must protect ourselves
from lawsuits by those who would abuse our legal system for personal
gain, regardless of any actual damages.  To our knowledge, no damage
has ever been caused by this program.

Wwwstat is distributed free of charge and will remain so as long as
it is legally possible.  If you are not distributing the program
to others, there is no need for you to include mention of the University
of California in its output.  However, I would prefer that you leave in
the reference to wwwstat's distribution site (at the bottom of the output)
so that others can know where to get the original program.

Wwwstat is in use around the world.  If you have translated the output
to another language (e.g. German, French, or Maori), I encourage you
to share those translations with others or mail them to me (Roy Fielding)
so that I can provide special patch files for each language.


2. Will you be developing a version for other httpds, e.g. CERN, Plexus, ...

Obviously, versions of this program would also be nice for the Plexus
and CERN servers.  However, I found that much of the logic for finding
file names was just too specific to the NCSA server to justify all the
other work of making this general.  Although this should now be easier
given the common logfile format, I don't have the time to install all 
those servers just to see how to do it.  Feel free to do so yourself.


3. Why use a separate program (oldwwwstat) for prior log formats?
   Why not just use a command-line option or examine the log content?

Because prior versions of wwwstat required a great deal of special file-
handling capability to find file size information.  Since that is no longer
needed, it would be a waste to leave it in.  Eventually, all systems will
migrate to the new format (or something like it) and having to maintain
the old code without being able to test it is just plain silly.


4. How do I set up a crontab script to run wwwstat nightly?

Well, that depends on how your system's crontab works, but on mine
(a Sun 4 running SunOS 4.1.2) I can edit the crontab with the command

   % crontab -e

I have the following entry for my nightly script:

   0 0 * * * /dc/ud/www/etc/update-stats

and the following is my update-stats script (thanks to Hal Varian):
   ----------------------------------------
   #!/bin/sh

   /dc/ud/www/bin/wwwstat > /tmp/wwwstats.html
   mv -f /tmp/wwwstats.html /dc/ud/www/documentroot/Admin/wwwstats.html
   exit
   ----------------------------------------

Here is another script submitted by LMD/T/AD Roar Smith:
(NOTE: I have not tried this myself, but it looks good.)
   ---BEGIN wwwstat.cron-------------------
   #!/bin/sh -fh
   #
   # wwwstat.cron
   #
   # Created:      1994-03-11 by LMD/T/AD Roar Smith (lmdrsm@lmd.ericsson.se)
   # Modified:     1994-03-22 by LMD/T/AD Roar Smith (lmdrsm@lmd.ericsson.se)
   #               Wrote comments.
   # Modified:     1994-04-05 by LMD/T/AD Roar Smith (lmdrsm@lmd.ericsson.se)
   #               Bug fix for first day of month.
   #
   # Copyright:    This program is in the Public Domain.
   #
   #
   # Run this script just after midnight on every day of the month.
   #
   # Example crontab entries:
   # --------------------------------------------------
   # 1 0 * * * /library/WWW/wwwstat/wwwstat.cron
   # --------------------------------------------------
   #
   {
        program="/library/WWW/wwwstat/wwwstat-0.3/wwwstat"
        httpd="/usr/local/etc/httpd/httpd"
        statdir="/library/WWW/stats"
        statfile="wwwstats.html"
        tmpfile="/tmp/wwwstats.$$"
        accessfile="/var/adm/httpd_access_log"
        errorfile="/var/adm/httpd_error_log"
        pidfile="/var/adm/httpd.pid"
   
        umask 022
        day="`/bin/date +'%d'`"
        month="`/bin/date +'%m'`"
        set -- Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
        shift $month
        lmonth="$1"
        if [ "$day" -eq 1 ]; then
                #
                # First kill HTTP daemon to avoid interference
                #
                httpdpid=`/bin/cat "$pidfile"`
                if [ -n "$httpdpid" ]; then
                        /bin/kill -TERM "$httpdpid"
                fi
                #
                # Copy Access and Error logfiles
                #
                cp -p "$accessfile" "$accessfile.$lmonth"
                cp -p "$errorfile" "$errorfile.$lmonth"
                /usr/etc/chown root.daemon "$accessfile.$lmonth"
                /usr/etc/chown root.daemon "$errorfile.$lmonth"
                #
                # Empty Access and Error logfiles
                #
                : >"$accessfile"
                : >"$errorfile"
                #
                # Restart HTTP daemon
                #
                (cd / ; "$httpd")
                #
                # Run stats program
                #
                $program -d "$lmonth" "$accessfile.$lmonth" >"$tmpfile" &&
                /bin/mv "$tmpfile" "$statdir/$lmonth.$statfile" &&
                /usr/etc/chown root.daemon "$statdir/$lmonth.$statfile"
                #
                # Copy this as current stats file
                /bin/cp -p "$statdir/$lmonth.$statfile" "$statdir/$statfile"
        else
                #
                # Run stats program
                #
                $program >"$tmpfile" &&
                /bin/mv "$tmpfile" "$statdir/$statfile" &&
                /usr/etc/chown root.daemon "$statdir/$statfile"
        fi
   } 2>&1 |
   mail webmaster 2>&1 1>/dev/null
   exit
   ---END wwwstat.cron---------------------
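
One subtle piece of the script above is how it derives the previous month's
abbreviation: the month list begins with Nov so that shifting by the current
month number (01 through 12) leaves last month's name in $1.  The trick in
isolation:

```shell
# Derive the previous month's abbreviation from the numeric month.
# The list starts at Nov so that `shift 01` (January) lands on Dec,
# `shift 02` (February) lands on Jan, and so on; the trailing Nov
# handles month 12 (December).
month=04                     # April, as /bin/date +'%m' would print it
set -- Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
shift $month
echo "$1"                    # prints Mar
```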


5. What is the general procedure for monthly resetting of the access_log?

Again, that depends a great deal on how your site is set up and how
frequently your server is accessed.  My site gets about 15000
requests a month, so I just do the following at the beginning of each
month (the example is for April):

   % cp httpd/logs/access_log oldlogs/Mar_access_log
   % vi oldlogs/Mar_access_log
         -- then delete all entries that are not from March
            or that are obviously corrupted.
   % wwwstat -e oldlogs/Mar_access_log > /tmp/Mar.wwwstats.html
         -- this creates the full monthly summary for March and at the
            same time (the -e option) lists out any other corrupted entries
            that I may want to delete from the log.
   % mv /tmp/Mar.wwwstats.html documentroot/Admin/Mar.wwwstats.html
         -- to publish the summary on my web.
   % gzip -9 oldlogs/Mar_access_log
         -- use compress if you don't have gzip.
   % cd httpd/logs
   % mv access_log access_log.tmp
         -- if using a standalone type server, send a kill -1 to the 
            httpd process so that it creates a new access_log.  This is
            not necessary for inetd servers.
   % vi access_log.tmp
         -- then delete all entries from March (should now be left with
            only April entries, since this is repeated monthly).
   % cat access_log >> access_log.tmp
   % mv -f access_log.tmp access_log
         -- the above two commands should be done in quick succession
            to avoid missing a new entry, and then followed by a kill -1
            to the httpd process if running in standalone.


6. My server load is HUGE and wwwstat runs out of memory; what can I do?

The only solution I can recommend is to use the -i option and bootstrap
wwwstat's output every day -- setup a process which purges the logfile
every night and creates a wwwstat output file which can be included the
next day, and so on.  The process would do something like:

   % mv -f httpd/logs/access_log /tmp/access_log
     if server is standalone, restart it with
        kill -1 `cat httpd/logs/httpd.pid`
   % wwwstat -i docroot/stats/current.html /tmp/access_log > /tmp/wwwout
   % mv -f docroot/stats/current.html docroot/stats/previous.html
   % mv -f /tmp/wwwout docroot/stats/current.html
   % cat /tmp/access_log >> archived_log
   % rm -f /tmp/access_log 
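
The steps above can be gathered into one shell function.  This is only a
sketch: the stats/current.html and previous.html names, the archived_log
file, and the standalone-server restart are all taken from the example and
should be adapted to your installation.

```shell
# Nightly bootstrap, as described above: rotate the log, summarize it
# on top of yesterday's output (-i), publish, and archive.
rotate_and_summarize() {
    logdir="$1"                # e.g. httpd/logs
    docroot="$2"               # e.g. docroot
    tmplog="/tmp/access_log.$$"

    mv -f "$logdir/access_log" "$tmplog"
    # If the server runs standalone, restart it here so it reopens its log:
    #    kill -1 `cat "$logdir/httpd.pid"`

    wwwstat -i "$docroot/stats/current.html" "$tmplog" >/tmp/wwwout.$$ &&
    mv -f "$docroot/stats/current.html" "$docroot/stats/previous.html" &&
    mv -f /tmp/wwwout.$$ "$docroot/stats/current.html"

    cat "$tmplog" >> "$logdir/archived_log"
    rm -f "$tmplog"
}
```

Run it from cron just after midnight, e.g.
`rotate_and_summarize httpd/logs docroot`.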


==========================================================================

If you have any suggestions, bug reports, fixes, or enhancements,
send them to the author Roy Fielding at <fielding@ics.uci.edu>.
See the file Changes for known problems and complete version information.

This work has been sponsored in part by the Advanced Research Projects
Agency under Grant Number MDA972-91-J-1010.  This software does not
necessarily reflect the position or policy of the U.S. Government and no
official endorsement should be inferred.  Their support is appreciated.

