





                          Essence: A Resource Discovery System

                             Based on Semantic File Indexing

            Darren R. Hardy, Michael F. Schwartz - University of Colorado, Boulder

                                        ABSTRACT

                    Discovering different types of file resources (such
                 as  documentation,  programs,  and images) in the vast
                 amount of data contained within network  file  systems
                 is  useful  for  both users and system administrators.
                 In  this  paper  we  discuss  the   Essence   resource
                 discovery  system,  which  exploits  file semantics to
                 index both textual and binary  files.   By  exploiting
                 semantics,  Essence extracts keywords that summarize a
                 file,  and  generates  a  compact  yet  representative
                 index.   Essence  understands  nested  file structures
                 (such as uuencoded, compressed,  ``tar''  files),  and
                 recursively  unravels such files to generate summaries
                 for them.  These features allow Essence to be used  in
                 a  number  of  useful  settings, such as anonymous FTP
                 archives.  We present measurements  of  our  prototype
                 and compare them to related projects, such as the Wide
                 Area Information Servers (WAIS)  system  and  the  MIT
                 Semantic  File  System  (SFS).   We  demonstrate  that
                 Essence can index more data  types,  generate  smaller
                 indexes,  and  in  some  cases  index data faster than
                 these  systems.    Our   prototype   generates   WAIS-
                 compatible   indexes,  allowing  WAIS  users  to  take
                 advantage of the Essence indexing methods.

                     Introduction                First, information in  gen-
               In the past  two  years,  a    eral file systems is typically
            number  of  resource discovery    very  irregularly   organized.
            tools have been introduced  to    Most  Internet  data is expli-
            help  users locate and use the    citly  intended  for  sharing,
            massive amount of  information    and  hence  people  often  put
            available   in   the  Internet    some  effort  into  organizing
            [Schwartz et al.  1992b].   As    the    information    into   a
            disks   have   become  larger,    coherent whole (e.g.,  placing
            cheaper, and  more  plentiful,    an  entire  file system into a
            resource  discovery  has  also    meaningful hierarchical direc-
            become a  problem  in  general    tory   in   an  anonymous  FTP
            purpose  file systems, such as    site).  In contrast, most gen-
            the Sun  Network  File  System    eral   file  system  data  are
            (NFS).   Yet,  the current set    organized  according  to   the
            of Internet discovery tools do    individual  whims of many peo-
            not  apply  well  to  such  an    ple.    Therefore,    resource
            environment,  for  three  rea-    discovery  systems that depend
            sons.                             heavily on users  to  organize
                                              and  browse through data (such
                                              as  Prospero   [Neuman  1992],


            1993 Winter USENIX - January 25-29, 1993 - San Diego, CA        1








       WorldWideWeb  [Berners-Lee  et    to an entire general file sys-
       al. 1992], or Gopher [McCahill    tem, since  keywords  will  be
       1992])  do  not  work well for    generated  from files that are
       general  purpose  file  system    of interest to few users.
       data.     Instead,   automated       In this paper we present  a
       search procedures are  needed.    system for supporting resource
       This  typically means generat-    discovery in  general  purpose
       ing some type of index of  the    file   systems.    The  system
       available  information [Salton    addresses the  above  problems
       & McGill 1983].                   by generating indexes based on
          Second, general  file  sys-    an understanding of the seman-
       tems  contain  a range of dif-    tics  of the files it indexes.
       ferent  types  of  data,  from    This technique  supports  com-
       unstructured  text  to  struc-    pact  yet  representative sum-
       tured data.  Systems that  use    maries for general collections
       a  generic  indexing procedure    of  data.  In addition to sup-
       (such  as  archie  [Emtage   &    porting  file  indexes,   sum-
       Deutsch 1992] or WAIS [Kahle &    maries  can be browsed to help
       Medlar 1991])  produce  larger    decide whether to  retrieve  a
       or  less  useful indexes under    file  across  a  slow network.
       these   circumstances.     For    We  call  our  system  Essence
       example,  WAIS  is most effec-    because of its ability to sum-
       tive when used on  ASCII  text    marize large amounts  of  data
       files.   Using  WAIS  to index    with relatively small indexes.
       executables  and  other  files
       found  in general file systems       We begin with a  discussion
       is not  very  effective.   The    of  indexing  techniques.   We
       indexes  tend to do a poor job    then  survey   previous   work
       of locating  information,  and    related  to semantic indexing.
       tend to be quite large.           We discuss how Essence  accom-
                                         plishes  semantic indexing and
          Third,  Internet  discovery    uses  it  as   a   basis   for
       tools   typically   focus   on    resource  discovery.  Finally,
       information  known  to  be  of    we discuss the details of  our
       reasonably   broad   interest.    prototype,  and  present  some
       For  example,  anonymous   FTP    measurements   that    compare
       archives   typically   contain    Essence with WAIS and SFS.
       popular documents and software    Full  Text  vs.  Filename  vs.
       packages,  which exhibit heavy       Semantic Indexing
       sharing   [Schwartz   et   al.
       1992a].       In     contrast,       WAIS supports  fine-grained
       general-purpose  file  systems    information access by building
       typically    contain    mostly    full-text  indexes,  in  which
       user-specific data that  exhi-    every  keyword  from a textual
       bit  relatively little sharing    document appears in the index.
       [Muntz   &   Honeyman    1992,    As   indicated   above,   this
       Ousterhout   et   al.   1985].    approach is  primarily  useful
       Current   Internet    resource    for   purely  textual,  widely
       discovery tools have difficul-    popular data.  Moreover,  WAIS
       ties with  such  low  sharing-    has  large space requirements:
       value   data.    For  example,    its indexes are comparable  in
       WAIS's   full-text    indexing    size  to  the  data files they
       mechanism   may   locate  many    represent.  Because  of  these
       uninteresting files if applied    space    requirements,    WAIS










            distributes the indexes  among    By generating information  for
            the hosts that provide data.      different  types  of  files in
               A  less   space   intensive    different  manners,   semantic
            indexing  approach  is used by    indexing      can     generate
            archie, in which anonymous FTP    representative        keywords
            files  are  summarized by name    without  including  every word
            only  (i.e.,  archie   indexes    from a file.  In  addition  to
            contain no information derived    saving  space,  this technique
            from  file   content).    This    can avoid  including  keywords
            approach produces indexes that    that  might muddle the quality
            are roughly one thousandth the    of an index.  For example,  it
            size  of  the  data  that they    makes  little sense to include
            represent.  In turn, this com-    C  language  constructs   like
            pact  representation  allows a    ``struct''   when  indexing  C
            great deal of  index  informa-    source code, since these  key-
            tion  to  be  collected onto a    words  do  not distinguish the
            single   machine,   supporting    conceptual  content  of   dif-
            far-reaching          searches    ferent C programs.
            (currently reaching over  1000       Semantic indexing  involves
            archive  sites).  Yet, because    two  stages.   The classifica-
            archie  indexes  contain  only    tion stage identifies  promis-
            filenames,  they  support only    ing  files  to  index within a
            name-based searches.  Searches    file system,1 as well as  type
            based   on   more   conceptual    information  for  each identi-
            descriptions of resources  are    fied  file.   The  summarizing
            not  possible, except when the    stage  applies  an appropriate
            filenames  happen  to  reflect    indexing procedure  (called  a
            some   of   these   conceptual    summarizer,  to  emphasize the
            descriptions.                     space  reduction  characteris-
                                              tic)  to each identified file,
               The range of structure  and    based on the type  information
            the  low overall sharing value    uncovered  in  the classifica-
            in general purpose  file  sys-    tion stage.
            tems   (as  discussed  in  the
            introduction),  coupled   with       Since  summarizers   under-
            the    need   for   conceptual    stand  file  types,  they  can
            descriptions and the need  for    extract  keyword   information
            compact   indexes   (motivated    for  both  textual  and binary
            above), all suggest the use of    files.   For   example,   many
            a  different means of indexing    binary     executables    have
            data.  That means is  semantic    related   textual    documents
            indexing.                         describing  their  usage, from
               Semantic indexing  involves
            analyzing   the  structure  of      1For example, this procedure
            file data in  different  ways,    might   embody   site-specific
            depending  on  file type.  For    knowledge that  certain  parts
            example,  UNIX   manual   page    of the file tree contain unin-
            files  are  broken into struc-    teresting  administrative  in-
            tured sections from  which  it    formation,  and  hence needn't
            is  possible to extract infor-    be indexed.  Our current  pro-
            mation about a program's  name    totype  does  not  select file
            and   description,   a   usage    system  subsets  -  it  simply
            synopsis, related programs  or    indexes  whatever  file  trees
            files, and author information.    are specified.


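The manual-page case described above (pulling a program's name, synopsis,
related programs, and author out of structured sections) can be illustrated
with a short program. This is a minimal sketch in Python, not the Essence
summarizer itself. It assumes a troff man page whose sections are introduced
by the .SH macro; the function name summarize_manpage, the WANTED_SECTIONS
list, and the sample page are all hypothetical.

```python
import re

# Sections a manual-page summarizer might keep (an assumption for this sketch).
WANTED_SECTIONS = ("NAME", "SYNOPSIS", "SEE ALSO", "AUTHOR", "AUTHORS")


def summarize_manpage(troff_text):
    """Return {section: text} for selected .SH sections of a troff man page."""
    summary = {}
    current = None
    for line in troff_text.splitlines():
        if line.startswith(".SH"):
            # The section title follows the macro, possibly quoted: .SH "SEE ALSO"
            title = line[3:].strip().strip('"').upper()
            current = title if title in WANTED_SECTIONS else None
            if current:
                summary.setdefault(current, [])
            continue
        if current is None or line.startswith('.\\"'):
            continue  # outside a wanted section, or a troff comment line
        # Drop a leading macro request (.B, .I, .BR, ...) and font escapes.
        text = re.sub(r"^\.[A-Za-z]+\s*", "", line)
        text = re.sub(r"\\f[A-Z]", "", text).replace("\\-", "-")
        if text.strip():
            summary[current].append(text.strip())
    return {name: " ".join(parts) for name, parts in summary.items()}


# A made-up manual page used only to exercise the sketch.
SAMPLE = """.TH WHAT 1
.SH NAME
what \\- identify SCCS files
.SH SYNOPSIS
.B what
file ...
.SH DESCRIPTION
Searches each file for SCCS identification strings.
.SH "SEE ALSO"
sccs(1)
"""

if __name__ == "__main__":
    for section, text in summarize_manpage(SAMPLE).items():
        print(section + ": " + text)
```

Only the selected sections contribute keywords, so the body of the
DESCRIPTION section is left out of the summary. That is the space saving
that semantic indexing aims for, at the cost of writing one such routine
per file type.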







       which keyword information  can    a  one  line  script  (perhaps
       be extracted.                     containing   a  ``grep''  com-
                                         mand).
          Since  keyword  information
       is    extracted    based    on       Essence indexes  can  allow
       knowledge   of   where   high-    users  to  locate needed data.
       quality  information  might be    Moreover,   Essence   produces
       located,   semantic   indexing    summaries  of file data, which
       extracts  fewer  keywords than    allow quick perusal of  poten-
       full-text indexing,  and  thus    tially large files.
       generates   smaller   indexes.       Essence has many  practical
       Yet,  it  retains  the   fine-    resource   discovery  applica-
       grained,   associative  access    tions:
       capability    of     full-text
       indexes.                          o  Systems administrators  and
       The Essence System                   users  can  use  Essence to
          Essence     provides     an       locate  and   learn   about
       integrated  system for classi-       resources  contained within
       fying files, defining  summar-       their file systems  without
       izer    mechanisms,   applying       understanding  the  details
       appropriate   summarizers   to       of their local environment.
       each  file,  and  traversing a       This  is particularly help-
       portion of a  file  system  to       ful in  environments  where
       produce  an  index of its con-       mount points are ``hidden''
       tents.                               by the amd auto-mount  sys-
          Essence   determines   file       tem.
       types  by exploiting file nam-    o  Public archive  administra-
       ing conventions (such as  com-       tors  can  use  Essence  to
       mon  filename  extensions like       index   archive   contents,
       ``.c''), and locating  identi-       providing    compact    yet
       fying  data  or  common struc-       representative descriptions
       tures within  files  (such  as       of     files,     including
       UNIX ``magic numbers'').  Once       compressed archives.  These
       Essence  determines  a  file's       indexes   allow   users  to
       type,    it    executes    the       search for information more
       appropriate   summarizer    to       effectively,   and  examine
       extract   keywords   from  the       summaries    about    files
       file.   Among   other   types,       before retrieving them.
       Essence   understands   nested
       file   structures,   such   as    o  People who  wish  to  index
       compressed,  uuencoded ``tar''       data  and  search  it using
       files.      It     recursively       WAIS  can  use  Essence  to
       extracts files hidden within a       index  more file types than
       nested file, and indexes them.       WAIS itself currently  sup-
                                            ports,  and to produce more
          As a design goal, we wanted       space efficient indexes.
       to  allow  summarizers  to  be       Once Essence  generates  an
       constructed    quickly     and    index  for a portion of a file
       easily,  so that Essence could    system, it exports its indexes
       be  made  to  understand  many    via    WAIS's    search    and
       different  file  types, and so    retrieval   interface.    This
       individual sites could custom-    allows  our indexes to be used
       ize   their  summarizers.   To    within the context of  a  well
       accomplish this goal, we allow    established,   easy   to   use
       summarizers to be as simple as









            information system.                  files [USENIX 1986].

                     Related Work             o  The   UNIX   find   command
            Identifying and Locating  File       locates   files   using  an
               Resources                         exhaustive search of a por-
               Semantic  indexing  depends       tion  of a file system.  It
            on   successfully  determining       allows  predicates  to   be
            file   types.     Furthermore,       specified  concerning which
            Essence uses semantic indexing       files  to  locate.    Among
            to  locate   file   resources.       other  things, these predi-
            Many systems can either deter-       cates can specify  location
            mine file types or locate file       based  on  the  file  types
            resources,     but     Essence       understood by the UNIX file
            integrates both aspects into a       system  (such  as  ordinary
            single system.                       file,  directory,  or  sym-
                                                 bolic link) [Leffler et al.
            o  The  Modules  system  is  a       1989,     USENIX     1986].
               sophisticated   administra-       Higher-level types (such as
               tive approach  to  locating       image, script, or C  source
               file  resources  associated       code) cannot be specified.
               with specific  applications    o  Many programs use file nam-
               [Furlani  1991].   Applica-       ing  conventions  to  infer
               tions are associated with a       file types.   C  compilers,
               particular   module,  which       for  example, assume that a
               can be easily  incorporated       filename ending  in  ``.c''
               or  removed  from  a user's       is a C source file, while a
               environment.    Both    the       file ending in ``.o'' is  a
               location and identification       relocatable   object  file.
               of  the  applications   and       Similarly, make has various
               their  file  resources  are       implicit   rules  based  on
               explicitly supplied  by  an       file naming conventions.
               administrator, and are hid-    Exploiting File Semantics
               den from the user.                Semantic   indexing    also
            o  The   NeXT   file    system    depends   on  the  ability  to
               browser  determines  common    extract good keyword  informa-
               file  types  by  exploiting    tion from files based on their
               filename  extensions  [NeXT    file types.  A number of  UNIX
               1991].  It then displays an    commands  can extract informa-
               icon  representative of the    tion with varying  degrees  of
               file's  type.   Users   can    quality  from  files  based on
               launch  a specific applica-    their   file   types   [USENIX
               tion by  supplying  only  a    1986].
               filename,  as  the applica-
               tion that  is  launched  is    o  ctags  extracts  procedure,
               determined  by  the  file's       macro,  and  variable names
               type.     Locating     file       from C  source  and  header
               resources  is  accomplished       files.   Some  versions  of
               by  browsing  a  UNIX  file       ctags understand other pro-
               system hierarchy.                 gramming languages, such as
                                                 Lisp, Pascal, and C++.
            o  The   UNIX   file   command    o  strings  extracts  embedded
               attempts to determine vari-       ASCII   text  strings  from
               ous  file  types  based  on       binary files.
               file contents, but provides
               no mechanism  for  locating    o  deroff, detex,  and  ps2txt









          extract   ASCII  text  from    both systems support  flexible
          troff, TeX, and  PostScript    associative   access  to  file
          files, respectively.           data,  they  export  the  data
                                         differently.   Essence exports
       o  what   extracts    embedded    the data through a search  and
          Source  Code Control System    retrieval interface, while SFS
          (SCCS)   information   from    exports  the  data  through  a
          files.                         file  system  interface.   The
          Essence provides  a  single    advantage of the SFS  approach
       cohesive      system      that    is  that it reuses an existing
       integrates  determining   file    and familiar storage  abstrac-
       types,      locating      file    tion.    The  disadvantage  is
       resources, and exploiting file    that doing so leads  to  unde-
       semantics to extract good key-    fined semantics.  For example,
       word information from files.      if a user tries to  copy  data
       MIT Semantic File System          into   a   virtual   directory
          The MIT Semantic File  Sys-    (created as a result of an SFS
       tem  (SFS)  uses semantic file    query),   the   semantics  are
       indexing  to  provide  a  more    undefined.
       effective  storage abstraction    Summarizer Breadth
       than  traditional hierarchical       Essence   summarizers   are
       file  systems  [Gifford et al.    autonomous    UNIX   programs,
       1991].  SFS exploits  filename    which are easy  to  implement,
       extensions  to  determine file    integrate,  and  maintain. The
       types, and then runs transduc-    Essence  prototype  implements
       ers  on  files to extract key-    summarizers for many more file
       word information for  building    types than SFS does.   Essence
       an index.  SFS provides a vir-    can  index  a  wide variety of
       tual  directory  interface  to    textual and binary data common
       search the resulting index and    in network file systems.
       to  access   files.    Virtual    Space Efficiency
       directory   names  are  inter-       The Essence prototype  pro-
       preted as queries against  the    vides better index compression
       index,  and  the  contents  of    than the SFS prototype.   Com-
       virtual  directories  are  the    parative  measurements  appear
       results  from  these  queries.    later in this paper.
       Therefore,  users  perceive  a
       search-based    interface   to        The Design of Essence
       explore file  systems,  rather       Figure 1 shows how  Essence
       than   the   more  traditional    is organized.
       hierarchical    file    system
       interface.                        Figure 1: Organization of the Essence System
                                         Essence operates as follows:
          Although  Essence  and  SFS
       use  similar semantic indexing    o  Users supply  Essence  with
       techniques,  they  differ   in       the filenames from a select
       orientation,        summarizer       portion of  a  file  system
       breadth, and space efficiency.       that they wish to index.
       Orientation                       o  The  Feeder  module  itera-
          SFS   emphasizes   semantic       tively passes each of these
       indexing as a storage abstrac-       filenames to the  Classifi-
       tion.   In  contrast,  Essence       cation module, which deter-
       emphasizes  semantic  indexing       mines the file's type.
       as  a   basis   for   resource
       discovery.   Concretely, while








            o  The    Summarize     module    typically   PostScript   image
               chooses an appropriate sum-    files;  and  filenames  with a
               marizer based on the file's    ``.txt'' extension  are  typi-
               type.   It  then  runs this    cally  ASCII text files.  File
               summarizer on the  file  to    naming    conventions     also
               extract  keywords  for  the    include  using  specific words
               Summary Files.                 within a filename.  For  exam-
                                              ple,   information   about  an
               The  three  modules,   Core    entire source distribution  or
            Filename,  Nested File Proces-    application  is often found in
            sor, and Nested  File  Feeder,    files whose name contains  the
            allow   Essence   to   support    string    ``README''.    Files
            nested files.                     named ``Makefile''  are  typi-
            o  Essence saves the initially    cally associated with the UNIX
               supplied  filename  as  the    make command [USENIX 1986].
               Core Filename.
                                                 In  Essence,  file   naming
            o  If    the    Classification    conventions are represented as
               module  determines that the    regular   expressions.     For
               file has a nested file type    example, *.ps or *[Mm]akefile*
               (such   as   a   compressed    represent the  PostScript  and
               file), it passes  the  file    Makefile  file  types, respec-
               to  the Nested File Proces-    tively.  Expressing file  nam-
               sor.                           ing   conventions  as  regular
            o  The Nested  File  Processor    expressions  allows  sites  to
               extracts  the  hidden files    easily  integrate  their local
               from the nested file struc-    semantics into Essence.
               ture    and    passes   the    Locating Identifying Data  and
               extracted  files   to   the       Common Structures
               Nested File Feeder.               In addition to using naming
                                              conventions,  Essence examines
            o  The  Nested   File   Feeder    file contents to try to deter-
               module  performs  the  same    mine  file types.  In particu-
               function  as   the   Feeder    lar, many files have an  iden-
               but   bypasses   the   Core    tifying  magic  number associ-
               Filename module.               ated with them.  For  example,
            Determining File Types            NeXT  binary executables start
               Essence   determines   file    with  the  hexadecimal  number
            types  using  a combination of    0xfeedface,  and  Sun  Pixrect
            exploiting file naming conven-    images start with the  hexade-
            tions and heuristically locat-    cimal    number    0x59a66a95.
            ing identifying data and common   Furthermore, common structures
            structures within files.          within  a  file  may determine
            Exploiting File Naming Conven-    its file type.   For  example,
               tions                          PostScript  images  start with
               Observing even simple  con-    the string ``%!''; UNIX  shell
            ventions  in  file  naming can    programs start with the string
            determine  file   types   with    ``#!''; C  source  code  files
            fairly  high  certainty.   The    typically     have    comments
            most basic file naming conven-    denoted  with  the  ``/*  */''
            tion  is  filename extensions.    delimiters;   electronic  mail
            For example, filenames with  a    files have distinctive  header
            ``.c'' extension are typically    tags,  such as From:, Received
            C source code files; filenames    by:, and Sender:;  and  USENET
            with  a  ``.ps'' extension are    news    articles   also   have


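The two classification techniques described above (file naming conventions,
then identifying data within the file) can be combined in a few lines. The
following is a minimal sketch in Python under stated assumptions, not the
Essence Classification module. The type names, the rule tables, and the
function classify are hypothetical. Trying names before contents is a choice
made for this sketch, and the magic numbers from the text are assumed to
appear in the file in the byte order in which they are written.

```python
import re

# Hypothetical rule tables for this sketch; the name patterns echo the
# examples in the text, rewritten as Python regular expressions.
NAME_RULES = [
    ("PostScript", re.compile(r".*\.ps$")),
    ("Makefile", re.compile(r".*[Mm]akefile.*")),
    ("README", re.compile(r".*README.*")),
    ("CSource", re.compile(r".*\.c$")),
    ("Text", re.compile(r".*\.txt$")),
]

# Leading bytes that identify a type: magic numbers and marker strings.
CONTENT_RULES = [
    ("NeXTExecutable", bytes.fromhex("feedface")),
    ("SunPixrect", bytes.fromhex("59a66a95")),
    ("PostScript", b"%!"),
    ("ShellScript", b"#!"),
]


def classify(filename, head=b""):
    """Return a type name from the filename, else from the first bytes."""
    base = filename.rsplit("/", 1)[-1]
    for file_type, pattern in NAME_RULES:
        if pattern.fullmatch(base):
            return file_type
    for file_type, prefix in CONTENT_RULES:
        if head.startswith(prefix):
            return file_type
    return "Unknown"


if __name__ == "__main__":
    print(classify("paper.ps"))
    print(classify("run", b"#!/bin/sh\n"))
```

Because both tables are plain data, a site could add a local convention by
appending one entry, which is the kind of customization the rule-based
design is meant to allow.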







       distinctive header tags,  such    This design provides a  power-
       as Newsgroups:, Distribution:,    ful  paradigm  for  exploiting
       and Path:.                        file semantics.  Each  summar-
                                         izer   is  associated  with  a
          As  with  exploiting   file    specific file type, and under-
       naming  conventions,  locating    stands  the file's format well
       identifying  data  and  common    enough  to   extract   summary
       structures  within a file is a    information   from  the  file.
       rule-based technique expressed    For  example,  the  summarizer
       with    regular   expressions.    for  a UNIX troff-based manual
       Sites  can  easily   integrate    page  understands  the   troff
       their local semantics into the    syntax   and  the  conventions
       discovery process by modifying    used  to  describe  UNIX  pro-
       these rules.                      grams.   It  uses  this under-
       Nested File Structure             standing  to  extract  summary
          Nested files contain hidden    information, such as the title
       files.     Examples    include    of a program, related programs
       compressed files,  tar  files,    and  files,  the  author(s) of
       uuencoded  files,  ZIP  files,    the  program,  and   a   brief
       and   shell   archive   files.    description  of  the  program.
       Furthermore,   files   can  be    Similar techniques can be used
       arbitrarily   nested    within    on   many   other   moderately
       these  file  types.  For exam-    structured file types, such as
       ple, compressed tar  files  or    source  code.   However,  some
       uuencoded compressed files are    file types do not easily  lend
       common.  Understanding  nested    themselves     to    automated
       file  structures  is useful in    interpretation.  For  example,
       file system environments (such    plain  ASCII  text files typi-
       as anonymous FTP file systems)    cally   contain   unstructured
       in which the vast majority  of    data   that  is  difficult  to
       files have nested structure.      exploit  effectively.    Simi-
          When   Essence   determines    larly, the UNIX ps2txt program
       that  a file has nested struc-    can extract  ASCII  text  from
       ture, it extracts  the  hidden    PostScript   images,  but  the
       files,  determines the result-    resulting    information    is
       ing files' types, and  summar-    unstructured text.
       izes  them.  This process con-
       tinues recursively,  until  no          Essence Prototype
       more    nesting    is   found.       In   this    section,    we
       Extracting hidden files from a    describe  the  techniques used
       nested file is accomplished by    by the  Essence  prototype  to
       running    a     corresponding    determine   file   types   and
       extraction  program,  such  as    exploit  file  semantics  with
       the  UNIX  uncompress  command    summarizers.   We also discuss
       for compressed files, the UNIX    how we integrated Essence with
       tar command with the 'x'  flag    WAIS.
       for  tar  files,  or  the UNIX    Determining File Types
       uudecode command for uuencoded       As described earlier,  file
       files.                            types are determined by under-
       Summarizers                       standing  naming   conventions
          Essence's  summarizers  are    and  locating identifying data
       simple  stand-alone  UNIX pro-    and common structures within a
       grams that are easy  to  write    file.   In the prototype, nam-
       and integrate into the system.    ing conventions are  expressed



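The recursive handling of nested files described above can be sketched as
follows. This is an illustration in Python, not the Essence Nested File
Processor. It substitutes gzip and tar from the Python standard library for
the compress, tar, and uudecode commands named in the text, and it works on
in-memory bytes instead of running external extraction programs. The names
unravel and make_sample are hypothetical. The core_name argument plays the
role of the Core Filename: every extracted file remains associated with the
file that was originally supplied.

```python
import gzip
import io
import tarfile


def unravel(core_name, data, path=""):
    """Yield (core_name, inner_path, bytes) for every non-nested file."""
    path = path or core_name
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        yield from unravel(core_name, gzip.decompress(data), path)
    elif data[257:262] == b"ustar":  # marker in a tar header block
        with tarfile.open(fileobj=io.BytesIO(data)) as archive:
            for member in archive.getmembers():
                if member.isfile():
                    inner = archive.extractfile(member).read()
                    yield from unravel(core_name, inner, path + "/" + member.name)
    else:
        # No further nesting found: this file is ready to be summarized.
        yield core_name, path, data


def make_sample():
    """Build a gzip-compressed tar archive holding two small files."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, body in (("README", b"demo package\n"),
                           ("main.c", b"int main(void) { return 0; }\n")):
            info = tarfile.TarInfo(name)
            info.size = len(body)
            archive.addfile(info, io.BytesIO(body))
    return gzip.compress(buffer.getvalue())


if __name__ == "__main__":
    for core, inner_path, body in unravel("demo.tar.gz", make_sample()):
        print(core, inner_path, len(body))
```

Each level of nesting is detected from the data itself, so the recursion
stops as soon as a file matches none of the nested formats, mirroring the
"until no more nesting is found" rule in the text.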





            Hardy & Schwartz         Essence: A Resource Discovery System ...


            with  case-insensitive regular    identifying   data  or  common
            expressions.   The   following    structures   is   common   for
            example   shows  some  entries    binary  formats, which usually
            from  the  configuration  file    depend  on  a   single   magic
            that  holds  the  expressions.    number.   Although distinctive
            In this file, the first  field    magic entries are difficult to
            is  the  file  type,  and  the    formulate,  careful  selection
            second  field   is   a   case-    of a magic file allows file to
            insensitive regular expression    accurately    identify    file
            for  the  corresponding   file    types.  In  Essence,  building
            naming convention.                the   magic  file  was  accom-
                                              plished  through  experimenta-
             Compressed   .*\.Z               tion with various entries.
             ManPage      .*\.[12345678]      Summarizers
             PostScript   .*\.(ps|eps)           In the prototype, summariz-
             README       .*readme.*          ers  are  simple UNIX programs
             SCCS         s\..*               that extract keyword  informa-
                                              tion through understanding the
               The prototype also uses the    syntax  and  semantics  of   a
            UNIX file command to determine    specific       file      type.
            file types, based on identify-    Currently, the prototype  sup-
            ing data and common structures    ports  summarizers for twenty-
            within a file  [USENIX  1986].    one file types and four nested
            file  uses the /etc/magic con-    file types.  Table 1 describes
            figuration  file  to   specify    these file types,  their  fre-
            recognizable  file types.  The    quencies   of   occurrence  by
            following list shows some sam-    number of files, average  file
            ple  entries  from /etc/magic,    size,  and  which systems sup-
            where the first field  is  the    port them in two  file  system
            offset of the identifying data    environments: an NFS file sys-
            or   common   structure,   the    tem  that  contains   commonly
            second  field  is  the type of    shared  data  and  programs in
            this data, the third field  is    our local environment,  and  a
            the identifying data or common    fairly  popular  anonymous FTP
            structure itself, and the last    file                    system
            field   is  the  corresponding    (ftp.cs.colorado.edu).     The
            file type.                        most frequent  file  types  in
                                              the NFS file system were Text,
                                              CHeader, and ManPage.  In  the
                                              anonymous  FTP file system the
                                              most frequent file types  were
                                              CHeader, C, and Text.

             0  string  /*           C program text
             0  string  \037\235     Compressed data
             0  long    0xfeedface   NeXT binary pgm
             0  string  #!/bin/perl  Perl program
             0  string  %!           PostScript image
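
As a rough illustration of the two identification steps described above, the following Python sketch applies naming conventions and then identifying data. It is not the prototype's code. The rule tables copy a few of the sample entries shown above; the function names, the mapping from magic descriptions to Essence type names, the choice to try naming conventions before identifying data, and the requirement that an expression match the entire filename are all assumptions made for the example.

```python
import re

# Naming conventions: (file type, case-insensitive regular expression),
# copied from the sample configuration entries.
NAME_RULES = [
    ("Compressed", r".*\.Z"),
    ("ManPage", r".*\.[12345678]"),
    ("PostScript", r".*\.(ps|eps)"),
    ("README", r".*readme.*"),
    ("SCCS", r"s\..*"),
]

# Identifying data: (offset, bytes expected at that offset, file type).
# Only "string" style magic entries are modeled; the type names on the
# right are an assumed mapping onto Essence's file types.
MAGIC_RULES = [
    (0, b"\037\235", "Compressed"),
    (0, b"#!/bin/perl", "Perl"),
    (0, b"%!", "PostScript"),
]


def type_by_name(filename):
    """Return the first file type whose naming convention matches."""
    for file_type, pattern in NAME_RULES:
        if re.fullmatch(pattern, filename, re.IGNORECASE):
            return file_type
    return None


def type_by_magic(data):
    """Return the first file type whose identifying data is present."""
    for offset, magic, file_type in MAGIC_RULES:
        if data[offset:offset + len(magic)] == magic:
            return file_type
    return None


def determine_type(filename, data):
    """Try naming conventions first, then identifying data (assumed order)."""
    return type_by_name(filename) or type_by_magic(data) or "Unknown"
```

For instance, determine_type("notes", b"%!PS-Adobe-2.0") falls through the naming conventions and is classified by its leading ``%!''. A short identifying string such as ``/*'' would misfire on many inputs, which is the difficulty discussed in the text.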

               Creating a  suitable  magic       Essence  supports  more  of
            file  is  not trivial, because    the file types found in common
            the identifying data or common    NFS  and  anonymous  FTP  file
            structures  must  be  distinc-    systems  than  either  WAIS or
            tive.  For example, the ``/*''    SFS,  as  shown  in  Table  1.
            delimiter  for  C  programming    Although  WAIS and SFS support
            language comments is not  suf-    most of the frequently  occur-
            ficiently   distinctive,   and    ring file types (such as Text,
            will  likely   appear   in   a    C, and  CHeader),  Essence  is
            variety  of types of files.  A    the  only system that supports
            lack      of       distinctive    the file types that contribute
                                              most   to  overall  data  size



       (such as Binary, Tar, and Archive).  Occurrence frequencies will be
       used in our measurements, later in this paper.  Note that Table 1
       does not list specialized file types supported by WAIS or SFS that
       are not supported by Essence, because those types do not occur in
       common NFS and anonymous FTP file systems (and hence we have no
       measurements for them).  Examples include MedLine and New York
       Times formats.  There are 12 such formats understood by WAIS, and
       4 understood by SFS.  Also, as indicated in the table, SFS indexes
       Unknown file types.  It does so by including the standard UNIX
       attributes in the index, such as owner, directory, and group.

          Table 2 briefly describes the techniques used by the Essence
       summarizers for the supported file types, other than nested types
       (the techniques for which were already discussed, in the "Nested
       File Structure" section).  Many other potential summarizers are
       possible.  For example, writing summarizers for other types of
       source code (such as Lisp or Pascal) would be an easy extension of
       the prototype's source code summarizers.  However, writing
       summarizers for audio or image formats would be difficult.2

          The following sections describe some of the techniques used in
       various summarizers, representative of Essence's supported file
       types.

       File Type     Description                       File Types        Frequency by          Average
                                                      Supported By     Number of Files        File Size
                                                   Essence  WAIS  SFS       NFS    AFTP       NFS      AFTP
       ----------------------------------------------------------------------------------------------------
       Archive       Library archives                 x            x       0.36    0.12    626.31     47.52
       Binary        Binary Executable                x            x       5.06    0.02    145.74    112.00
       C             C source code                    x      x     x       1.27   19.33      3.87     28.36
       CHeader       C header files                   x      x     x      14.73   22.42      4.29      2.40
       Command       UNIX shell scripts               x      x             1.78    3.06      2.75      1.55
       Compressed    Compressed file                  x                    0.69   11.81    114.98     73.29
       Directory     Directory                        x            x       4.87    5.05      0.81      0.50
       Dvi           Device-indep. TeX output         x      x             0.03    0.87     33.32     59.18
       Mail          Electronic mail                  x      x     x       0.02    0.17      1.79     35.30
       Makefile      UNIX makefiles                   x      x     x       0.26    3.87      0.85      3.04
       ManPage       UNIX manual pages                x      x     x      13.78    0.70      6.76    295.92
       News          USENET news articles             x      x     x       0.01    0.04     21.96      1.75
       Object        Relocatable object file          x            x       0.00    1.12      0.00     28.11
       Patch         File difference listing          x      x             0.02    0.00      1.88      0.00
       Perl          Perl script                      x      x             0.00    0.02      0.00      3.62
       PostScript    PostScript images                x      x             1.42    3.31     64.64    194.45
       RCS           RCS version control files        x      x             0.00    1.41      0.00      8.41
       README        High-quality information         x      x     x       0.38    1.32      1.95      2.88
       SCCS          SCCS version control files       x      x             0.00    0.00      0.00      0.00
       ShellArchive  Bourne shell archive             x                    0.00    0.10      0.00    486.75
       Tar           Tar archive                      x                    0.00    0.81      0.00   1734.21
       Tex           TeX source docs                  x      x     x       0.67    0.23     17.79     34.17
       Text          Unstructured ASCII text          x      x     x      21.64   19.73      7.87     31.11
       Troff         Troff source docs                x      x             0.03    0.25      9.21      9.08
       Unknown       Unknown file type                             x      32.96    4.26     44.02     16.10

       Table 1: Supported Common File Types and their Frequency and Average
       Size in Measured File Systems.

         2One possibility would be to sample a bitmap file down to an icon.
       While this does not easily support indexing, it could be used to
       support quick browsing before retrieving an entire image across a
       slow network.
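
The recursive handling of nested types mentioned above (and discussed in the "Nested File Structure" section) can be sketched roughly as follows. This is an illustration under stated assumptions, not the prototype's implementation: the prototype runs external programs such as uncompress, tar, and uudecode, whereas this sketch works in memory with Python's standard library and uses gzip as a stand-in for UNIX compress, since the standard library has no decoder for compress output. The function names and the summarize callback are hypothetical.

```python
import gzip
import io
import tarfile


def detect_nested_type(data):
    """Return 'Gzip', 'Tar', or None, based on identifying data."""
    if data[:2] == b"\x1f\x8b":        # gzip magic number
        return "Gzip"
    if data[257:262] == b"ustar":      # tar magic string at offset 257
        return "Tar"
    return None


def extract_hidden_files(name, data, nested_type):
    """Yield (name, bytes) pairs for the files hidden in a nested file."""
    if nested_type == "Gzip":
        yield name, gzip.decompress(data)
    elif nested_type == "Tar":
        with tarfile.open(fileobj=io.BytesIO(data)) as archive:
            for member in archive.getmembers():
                if member.isfile():
                    yield member.name, archive.extractfile(member).read()


def unravel(name, data, summarize):
    """Recursively extract hidden files until no more nesting is found,
    then summarize each plain file."""
    nested_type = detect_nested_type(data)
    if nested_type is None:
        return [summarize(name, data)]
    summaries = []
    for hidden_name, hidden_data in extract_hidden_files(name, data, nested_type):
        summaries.extend(unravel(hidden_name, hidden_data, summarize))
    return summaries
```

Calling unravel on a compressed tar file first strips the compression layer, then walks the archive members, and finally hands each plain file to the summarizer. Supporting another nested type in this sketch means adding one detection rule and one extraction branch.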



            Directory
               Obtaining a listing of the files in a directory is an obvious
            method for a directory summarizer.  However, Essence strives to
            obtain a higher-level understanding of a directory's contents.
            Therefore, the prototype attempts to extract copyright
            information from files, in addition to the directory listing.
            Copyright information typically contains project, application,
            or author names.  Keyword information from README files is also
            included in the directory summarizer, since these files contain
            high-quality information about the directory's contents.
            Binary
               An obvious method for a binary summarizer is to extract ASCII
            strings from the binary file, using the UNIX strings command.
            However, Essence filters these extracted ASCII strings using
            heuristics that only keep strings that convey the binary's
            purpose, such as usage, version, or copyright information.
            Essence also uses cross references to obtain high-quality
            summary information from binary executables.  For example, the
            binary summarizer looks for associated manual pages for the
            given binary executable, and generates keywords using the
            manual page summarizer on it.

            File Type    Summarizer Description
            ------------------------------------------------------------
            Archive      Extract symbol table
            Binary       Extract meaningful strings, and manual page
                         summary
            C            Extract procedure names, #include'd filenames,
                         and comments
            CHeader      Extract procedure names, included filenames,
                         and comments
            Command      Extract comments
            Directory    Extract directory listings, copyright
                         information, and README files
            Dvi          Convert to ASCII text
            Mail         Extract select header fields
            Makefile     Extract comments
            ManPage      Extract author, title, etc., based on ``-man''
                         macros
            News         Extract select header fields
            Object       Extract symbol table
            Patch        Extract filenames
            Perl         Extract procedure names and comments
            PostScript   Convert to ASCII text
            RCS          Extract RCS-supplied summary
            README       Use entire file
            SCCS         Extract SCCS-supplied summary
            Tex          Convert to ASCII text
            Text         Extract first 100 lines
            Troff        Extract author, title, etc., based on ``-man'',
                         ``-ms'', ``-me'' macro packages, or extract
                         section headers and topic sentences.

            Table 2: Essence Summarizer Techniques
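
The string-filtering step of the binary summarizer described above can be sketched as follows. This is a minimal illustration, not the prototype's code: the paper does not list the actual heuristics, so the keyword pattern, the minimum string length, and the function names are assumptions. The manual page cross reference is omitted.

```python
import re

# Hypothetical heuristic: keep strings that mention usage, version, or
# copyright information.
PURPOSE_PATTERN = re.compile(rb"usage|version|copyright", re.IGNORECASE)


def extract_strings(data, min_length=4):
    """Return runs of printable ASCII at least min_length bytes long,
    similar in spirit to the UNIX strings command."""
    return re.findall(rb"[\x20-\x7e]{%d,}" % min_length, data)


def summarize_binary(data):
    """Keep only the extracted strings that hint at the binary's purpose."""
    return [s.decode("ascii") for s in extract_strings(data)
            if PURPOSE_PATTERN.search(s)]
```

Given a binary that embeds the strings "usage: netfind [-h] key", "Version 3.10", and "xyzzy plugh", summarize_binary keeps the first two and drops the third, which carries no hint of the program's purpose.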
                                              Formatted Text
                                                 Although   formatted   text
                                              (such  as  TeX, Troff, or Word
                                              Perfect) has  structured  syn-
                                              tax,  effectively  summarizing




       these   files   is   difficult    summarizer   uses  the  entire
       unless semantic information is    file to generate keywords.
       also  available  [Knuth  1984,
       Lamport  1986,  USENIX  1986].       The  Dvi,  PostScript,  and
       For example, plain Troff files    Tex  summarizers  extract key-
       or  Troff  files using the ``-    words from all  of  the  ASCII
       me'' macros are  difficult  to    text   extracted   from  these
       exploit   semantically,  since    files.  Essence  assumes  that
       their  syntax  is   associated    these  file types contain gen-
       with formatting commands (such    erally useful information, and
       as font size or line spacing),    hence   generates   full  text
       rather  than  more  conceptual    indexes for them.
       commands (such as commands  to    Source and Object Code
       indicate  an  author's name or       Both source and object code
       paper  title).   Troff   files    are   highly  structured,  and
       using  the ``-ms'' or ``-man''    contain    easily    exploited
       macros are much easier to sum-    semantic  information.   The C
       marize,   since  they  contain    summarizer extracts  procedure
       conceptual commands  (such  as    names,  header  filenames, and
       delimiting     an    abstract,    comments from a C source  code
       author, and title).               file.   Similarly,  the object
                                         summarizer extracts the symbol
          Essence supports a  sophis-    table from an object file.
       ticated  summarizer  for Troff    WAIS Interface
       and the ``-me'', ``-ms'',  and       Essence exports its indexes
       ``-man''  Troff  macros.   The    through   WAIS's   search  and
       TeX summarizer  only  extracts    retrieval interface,  allowing
       ASCII   text  from  TeX  files    users  to  use  tools  such as
       using  detex,  but  exploiting    waissearch and the X  Windows-
       TeX   semantics   would  be  a    based graphical user interface
       trivial   extension   of   the    xwais.  In order  to  generate
       methods used in our Troff sum-    WAIS-compatible       indexes,
       marizer.                          Essence uses  WAIS's  indexing
       Simple Text                       software  to index the Essence
          Simple text is difficult to    summary files.  This mechanism
       summarize    because   it   is    generates    full-text    WAIS
       unstructured.  Essence assumes    indexes from the Essence  sum-
       that   the   highest   quality    mary files.
       information in  most  unstruc-       We modified the WAIS index-
       tured  text  files is near the    ing  mechanism  to  understand
       beginning of the file,  as  is    the format of the Essence sum-
       common with paper abstracts or    mary  files,  so  that it gen-
       tables  of  contents.   There-    erates meaningful  WAIS  head-
       fore, the text file summarizer    lines.   These  headlines pro-
       extracts  keywords  from   the    vide  users   with   a   short
       first one hundred lines of the    description  of a single file,
       text  file.   However,  README    usually  a   filename.    With
       files  typically  contain cru-    Essence, headlines represent a
       cial,   concise    information    file's  core   filename,   its
       about a distribution or appli-    actual  filename, and its file
       cation.   Using  a   full-text    type.
       index of README files provides
       high-quality keywords  without       To support additional  file
       occupying   too   much  space.    types, WAIS must be recompiled
       Therefore,     the      README    with   new   procedures   that



                                              interactive user interface  to
            understand  these  file types.    Netfind.   The fourth match is
            With Essence,  one  need  only    the README file found  in  the
            write  a  new  summarizer, add    Netfind   distribution  direc-
            its name  to  a  configuration    tory; the fifth match  is  the
            file,  and  add new heuristics    same  file,  but  found in the
            for identifying the file type;    compressed  tar   distribution
            no recompilation is necessary.    netfind.3.10.tar.Z.  The sixth
            In this sense,  Essence  modu-    match is the UNIX manual  page
            larizes  the typed-file index-    for  Netfind.   The  remaining
            ing extensions that  WAIS  can    matches are PostScript  papers
            use,  because  it  removes the    in which Netfind is discussed.
            keyword   extraction   process
            from   WAIS   and   places  it       In WAIS, a  user  retrieves
            instead in  Essence.   Essence    files  by selecting a matching
            is  better  suited  to  incor-    headline.   With  Essence,  if
            porating new file  types,  and    the headline represents a file
            can   be  quickly  adapted  to    hidden within  a  nested  file
            become a comprehensive  index-    (such as the first headline in
            ing system.                       Figure 2), the summary file is
                                              retrieved, instead of retriev-
               Figure 2 shows  an  example    ing the  hidden  file  itself.
            search  of  an index generated    If  the  headline represents a
            by     Essence     of      the    plain file (such as the fourth
            ftp.cs.colorado.edu  anonymous    headline  in  Figure  2),  the
            FTP file system.  It shows  an    file is retrieved.  This func-
            ordered  list of the ten files    tionality  requires allocating
            that best  match  the  keyword    storage for both the  required
            netfind.3 The  headlines  have    summary  files  and the index.
            up  to three fields represent-    However, it  allows  users  to
            ing  the  matching  file:  the    browse   through  remote  file
            core  filename,  the  filename    systems  by   retrieving   and
            (if different  from  the  core    viewing  small  summary  files
            filename), and the file type.     without  having  to   retrieve
            Figure 2: Example WAIS Search Using Essence-based Index
                                              complete files. This is useful
                                              [...] decide
               Consider the  effectiveness    whether   to   transfer  large
            of  the example search in Fig-    files across a slow network.
            ure 2.  The best  match  is  a     Evaluation and Measurements
            PostScript      paper     that
            discusses a  number  of  tech-       In this section, we present
            niques  for distributed infor-    measurements of indexing speed
            mation systems, with  particu-    and  space   efficiency,   for
            lar   emphasis  on  techniques    Essence,  WAIS,  and  SFS.  We
            demonstrated by  Netfind;  the    also  discuss  the  usefulness
            second match is the same file,    and   overhead   of   indexing
            but found  in  the  compressed    nested  files.   Finally,   we
            tar distribution ALL.PS.tar.Z.    discuss  the  difficulties  in
            The  third  match  is  the   C    evaluating keyword quality.
            source     code     for    the       Before presenting  measure-
                                              ments  of the various systems,
                                              we note that it  is  difficult
              3Netfind is an Internet user    to  interpret  time  and space
            directory  service [Schwartz &    efficiency measurements of the
            Tsirigotis 1991].




       systems being compared, for two reasons.  First, indexing speed and
       compression are highly dependent on indexing techniques.  For
       example, an indexer that skips most of the data (such as our Text
       summarizer) will achieve much higher indexing speeds and
       compression factors than one that uses all of these data (such as
       the Text indexers used by SFS and WAIS).  In this case, the salient
       issue becomes recall/precision effectiveness of the generated
       indices (which is difficult to quantify).  For example, a small,
       quickly generated index would not be a reasonable tradeoff if one
       could not use this index to locate desired data.  Second, aggregate
       measurements (as given in Table 4) are affected by the distribution
       of different file types in the sample file systems.  Ideally, we
       would have measured each indexing system against the same file
       system data.  We did this for WAIS and Essence, but the SFS code
       was not available at the time we made these measurements.  Instead,
       we attempted to interpret the measurements given in [Gifford et al.
       1991].  Notwithstanding these difficulties in interpreting the
       measurements, we feel it is worthwhile to present quantitative
       comparisons of these systems.

          Table 3 presents the space and time measurements for Essence and
       WAIS, based on file type.  We do not show measurements for nested
       file types here.  Those measurements are discussed in Table 6.  Nor
       do we show measurements for SFS in this table, because
       transducer-specific information was not available.  Also, note that
       the indexing costs shown for Essence include the time and space
       needed to generate indices, not just the summaries that are
       produced as an intermediate step.  As indicated in Table 1 and with
       a '-' in Table 3, WAIS and SFS cannot index all of the file types
       that Essence can.  Table 3 shows that because there is a high
       amount of overhead associated with interpreting the semantics of a
       file type, Essence indexes slower than WAIS for some file types.
       Essence indexing is faster than WAIS for file types that have a low
       amount of semantic interpretation overhead.

                        Indexing Rate       Compression Factor      Semantic
       File Type          (KB/min)              vs. Index           Exploitation
                      Essence      WAIS      Essence      WAIS      Overhead
       -------------------------------------------------------------------------
       Archive        3289.18         -        10.89         -      low
       Binary          563.40         -        21.15         -      high
       C               357.84    593.15         2.46      1.45      high
       CHeader         164.87    342.93         1.33      0.65      high
       Command         168.20    277.30         1.23      0.43      high
       Dvi             278.71   2160.00         0.79      1.76      high
       Mail           3718.12   1071.89         9.44      0.74      low
       Makefile        421.05    648.65         2.33      0.86      high
       ManPage         165.67    661.59         3.34      1.18      high
       News           1913.29    329.24        20.82      0.63      low
       Object         2588.75         -        15.93         -      low
       Patch          7218.00    993.30        80.20      2.00      low
       Perl            282.50    713.68         2.05      0.88      high
       PostScript     1151.19    765.60         4.56      1.67      low
       RCS            1293.16    614.25         5.91      1.27      low
       README          390.00    400.83         0.68      0.65      high
       SCCS            315.79   1500.00         3.12      3.52      high
       Tex             598.93    385.52         2.09      0.92      low
       Text           4699.59   1346.67         8.13      1.37      low
       Troff           703.01   2036.92         7.04      1.97      [...]

       Table 3: Essence and WAIS Time and Space Measurements Based on File
       Type.

          Table 4 presents weighted averages of the space and time
       measurements in Table 3, based on the file type frequencies and
       average file sizes (as measured in Table 1).  The weighted averages
       were computed using the formula:

                 \sum_{i=0}^{n} (f_i a_i) v_i ,

       where f_i is the frequency associated with file type i, a_i is the
       average file size associated with file type i, v_i is the indexing
       rate or the compression factor value from Table 3 associated with
       file type i, and n is the number of file types supported by the
       system.  f_i a_i is used to normalize the measurements, to reflect
       only the system's supported file types.  In particular, only
       non-nested file types are included in the aggregate measurements
       for WAIS and SFS (since those systems do not support nested files),
       while all types (including nested files) are included in the
       Essence measurements.  We discuss the ``unraveling'' costs of
       dealing with nested file structures in Table 6.

                          Indexing Rate (KB/min)     Compression Factor vs. Index
       File System       Essence     WAIS      SFS      Essence     WAIS      SFS
       --------------------------------------------------------------------------
       NFS               1489.58   891.?0   712.00        14.40     1.27     6.82
       Anonymous FTP     1826.17   897.?1   712.00         4.93     1.44     6.82
       Average           1657.88   894.?1   712.00         9.67     1.35     6.82

       Table 4: Weighted Time and Space Averages Based on File Type
       Frequencies

          The Essence and WAIS measurements were performed on a Sun
       Microsystems 4/280 server running SunOS 4.1.1, with a [...].  The
       SFS measurements were performed on a Microvax-3 running UNIX
       version 4.3BSD [Gifford et al. 1991].  This machine is
       approximately [...] as fast as the Sun 4/280.

          Table 4 shows that Essence can index data faster than WAIS.
       Taking into account the slower machine on which it was measured,
       SFS appears to index data somewhat faster [...]

          Essence obtains about a 10:1 index compression factor on the
       file types that it supports, compared to WAIS (1:1), SFS (7:1), and
       archie (765:1) [Emtage & Deutsch 1992].  These measurements are not
       perfect, because detailed SFS measurements were not available.

          Table 5 shows the percentage of data in the measured file
       systems that Essence, WAIS, and SFS were successfully able to
       interpret and index.  The NFS file system contained many custom
       file formats that the indexing systems were unable to interpret.
       However, the anonymous FTP file system contained many more common
       file formats.  Even though Essence only supports a relatively small
       number of common file types (21), it can index 75% of the data
       found in an average file system, far greater than WAIS (33%) or SFS
       (18%).

          We found that seventy-eight percent of the files in our
       anonymous FTP file system had nested structure.  These measurements
       indicate that supporting nested file structures is essential for
       such file


            1993 Winter USENIX - January 25-29, 1993 - San Diego, CA       15
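
The weighted-average computation used for Table 4 can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the code used for the measurements: the function names and the sample inputs are hypothetical, and dividing by the total weight is one possible reading of the remark that f_i a_i is used to normalize the measurements.

```python
from typing import Sequence


def weighted_sum(freqs: Sequence[float],
                 avg_sizes: Sequence[float],
                 values: Sequence[float]) -> float:
    """Sum of (f_i * a_i) * v_i over all supported file types."""
    if not (len(freqs) == len(avg_sizes) == len(values)):
        raise ValueError("inputs must cover the same file types")
    return sum(f * a * v for f, a, v in zip(freqs, avg_sizes, values))


def weighted_average(freqs: Sequence[float],
                     avg_sizes: Sequence[float],
                     values: Sequence[float]) -> float:
    """weighted_sum divided by the total weight sum(f_i * a_i).

    The division is an assumption about how the normalization works.
    """
    total_weight = sum(f * a for f, a in zip(freqs, avg_sizes))
    if total_weight == 0:
        raise ValueError("total weight is zero")
    return weighted_sum(freqs, avg_sizes, values) / total_weight


# Hypothetical inputs for three file types (not taken from the paper):
freqs = [0.5, 0.3, 0.2]            # f_i: fraction of files of type i
avg_sizes = [10.0, 40.0, 5.0]      # a_i: average file size in KB
rates = [1200.0, 1800.0, 900.0]    # v_i: indexing rate in KB/min

print(weighted_average(freqs, avg_sizes, rates))
```

With these inputs the large files of the second type dominate the result, which is the point of weighting by both frequency and average size.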






       Essence: A Resource Discovery System ...         Hardy & Schwartz

                                         Makefile, and README  are  all
       systems.   In  contrast,  only    included   in  the  Summarized
       one percent of  the  files  in    Data.  The Summary Output  row
       the NFS file system had nested    concerns the resulting summary
       structure.   In   the   future    files.  The resulting index of
       nested   file  structures  may    the   summary  files  consumed
       become less  common,  as  they    12.94 megabytes.
       mostly  represent inadequacies       Note that this  compression
       of current  file  systems  and    ratio   (60.22/12.94)   under-
       remote  access protocols.  For    states the actual compression,
       example, tar files are popular    because the indexed data actu-
       in  FTP  file  systems because    ally consumed 262.03  MB.   In
       they   make   it   easier   to    particular,  indexing  systems
       retrieve  an  entire directory    (like WAIS) that do  not  sup-
       tree, and FTP does not provide    port  nested  structure  would
       a  recursive retrieval mechan-    have   to   leave   the   data
       ism.                              uncompressed.  Hence, we actu-
                                         ally achieve a twofold space
                   Essence  WAIS   SFS   reduction.  WAIS would need to
       _______________________________   keep the uncompressed data
       Anonymous FTP 98.51 48.47 27.56   around, and then would gen-
       NFS           50.70 17.88  8.11   erate an index whose size was
       Average       74.61 33.18 17.84   comparable to the uncompressed
       _______________________________   data.  Essence generates a
                                         smaller  index,  and can func-
       Table 5: Percentage of            tion with compressed data.
       Interpretable Data                Putting  the numbers together,
                                         WAIS  would  require  approxi-
          Table  6  shows  how   much    mately 264 MB of space for the
       overhead  the prototype incurs    uncompressed data  plus  index
       when indexing nested files  in    (basically,  twice the size of
       the   measured  anonymous  FTP    the  Summarized  data),  while
       file system.  In  this  table,    Essence  requires  only  73 MB
       the Original Data row concerns    total - a  72%  space  savings
       the data which reside  in  the    over WAIS.
       anonymous   FTP  file  system.    Analysis of Keyword Quality
       The Processed  Data  row  con-       Qualitative   analysis   of
       cerns  the  data  that Essence    information  retrieval systems
       processes while  indexing  the    is                  difficult.
       Original   Data.   These  data    Recall/precision  measurements
       include all  of  the  original    are difficult to obtain, since
       files  and  each file within a    they   rely   on   hand-chosen
       nested  file  structure.   For    reference   sets   [Salton   &
       example,    given   the   file    McGill 1983], and hence do not
       foo.tar.Z from the example  in    scale well to measuring  large
       the   previous   Nested   File    information collections.  More
       Structure section,  foo.tar.Z,    effective  measurements  might
       foo.tar,     foo.c,     foo.h,    be  obtained by evaluating the
       Makefile, and README  are  all    effectiveness of a system from
       included   in   the  Processed    experience with an active user
       Data.  The Summarized Data row    community.  We have  made  the
       concerns  the  data  on  which    Essence   prototype   publicly
       summarizers  are   run.    For    available   to   allow   users
       example,     foo.c,     foo.h,    to  make  their own subjective


       16       1993 Winter USENIX - January 25-29, 1993 - San Diego, CA
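
The space comparison between Essence and WAIS can be checked with a few lines of arithmetic. The sketch below is a hedged reading of the figures quoted in the text and in Table 6: it assumes that the 73 MB Essence total is the original (still compressed) data plus the 12.94 MB index, and that the WAIS total is twice the Summarized Data size. The constant and function names are hypothetical.

```python
# Sizes in MB, taken from Table 6 and the surrounding text.
ORIGINAL_DATA = 60.22      # nested files as stored in the FTP archive
SUMMARIZED_DATA = 132.36   # unraveled data on which summarizers are run
ESSENCE_INDEX = 12.94      # index built from the Essence summary files


def space_report():
    """Return (compression ratio, Essence total, WAIS total, savings)."""
    compression_ratio = ORIGINAL_DATA / ESSENCE_INDEX
    # Assumption: Essence keeps the compressed originals plus its index.
    essence_total = ORIGINAL_DATA + ESSENCE_INDEX
    # Assumption: WAIS keeps uncompressed data plus an index of similar size.
    wais_total = 2 * SUMMARIZED_DATA
    savings = 1 - essence_total / wais_total
    return compression_ratio, essence_total, wais_total, savings


if __name__ == "__main__":
    ratio, essence_total, wais_total, savings = space_report()
    print(f"compression ratio: {ratio:.2f}")
    print(f"Essence total:     {essence_total:.2f} MB")
    print(f"WAIS total:        {wais_total:.2f} MB")
    print(f"space savings:     {savings:.0%}")
```

Under these assumptions the totals come out close to the 73 MB and 264 MB quoted in the text, and the savings round to 72%.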






            Hardy & Schwartz         Essence: A Resource Discovery System ...

                                              through WAIS,  using  Essence-
            judgements.                       based  indexes.  Using Essence
                                              to   index   public   archives
                             Total   Total    allows remote users to search
                            Number    Size    information based on concep-
                          of Files (in MB)    tual descriptions and to view
            ______________________________    summaries before retrieving
            Original Data     1213   60.22    files. This would help
            Processed Data    6409  262.03    decrease network traffic
            Summarized Data   5334  132.36    caused by retrieving unwanted
            Summary Output    5334   15.87    files.
            ______________________________    Record-Level Indexing Support
                                                 WAIS supports indexing and
            Table 6: Nested File Structure    retrieving information with
            Overhead                          record-level granularity
                  Future Directions           (e.g.,  allowing  a  file con-
                                              taining many  electronic  mail
            On-the-Fly Nested File Summar-    messages  to  be  treated as a
               izers                          sequence  of  mail   records).
               The    Essence    prototype    Essence only supports indexing
            relies  heavily  on  the  file    and   retrieving   information
            system  to  implement   nested    with  file-level  granularity.
            file structure interpretation.    A future improvement would  be
            This  implementation  degrades    to  modify  Essence to support
            performance    when   indexing    record structured files.
            files with nested file  struc-    File Tree Pruning
            tures  (as  shown in Table 6),       The  design  of   Essence's
            because  it  causes  a   large    file    classification   stage
            amount  of  disk  I/O.  An in-    includes the ability to  iden-
            memory  implementation   would    tify  promising files to index
            significantly  improve perfor-    within a file system, in addi-
            mance, by drastically  cutting    tion  to  type information for
            file   system   I/O.   We  are    each file.  Our current proto-
            currently considering such  an    type does not select file sys-
            implementation,  based  on the    tem  subsets   -   it   simply
            GNU  ``tar''  program,   which    indexes  whatever  file  trees
            supports  an  option to output    are   specified.    A   future
            extracted files to stdout.        improvement  would  be  to add
            Summarizers                       selection criteria to the pro-
               The   prototype   currently    totype  (e.g.,  pruning  files
            supports  twenty-one summariz-    from  consideration  based  on
            ers.  Expanding Essence's sum-    their  location  in  the  name
            marizer  suite to support more    tree, names/types, or  sharing
            file   types   would   further    history).   This would further
            increase its effectiveness.       refine the quality of indexes,
            Anonymous FTP Indexing            and  reduce the space required
               Currently,   the    Essence    for indexing  an  entire  file
            index  for  the  anonymous FTP    system.
            site at ftp.cs.colorado.edu is               Summary
            available  through  WAIS using
            the   aftp-cs-colorado-edu.src       The increasing abundance of
            WAIS   source.    However,  we    inexpensive     local    disks
            would  like   to   make   more    creates   resource   discovery
            anonymous  FTP sites available    problems   even   in   locally


            1993 Winter USENIX - January 25-29, 1993 - San Diego, CA       17
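
The in-memory approach discussed under On-the-Fly Nested File Summarizers can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype's implementation and not the GNU tar based design under consideration: it uses Python's standard tarfile and gzip modules, treats gzip as a stand-in for compress (.Z) because the standard library has no decoder for that format, and the names unravel and make_example are hypothetical.

```python
import gzip
import io
import tarfile


def unravel(name, data, out=None):
    """Recursively unravel a nested file held entirely in memory.

    Returns a list of (name, bytes) pairs for the innermost files.
    """
    if out is None:
        out = []
    if name.endswith(".gz"):
        # Strip one compression layer and look at what is inside.
        unravel(name[:-3], gzip.decompress(data), out)
    elif name.endswith(".tar"):
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tar:
            for member in tar:
                if member.isfile():
                    extracted = tar.extractfile(member)
                    unravel(member.name, extracted.read(), out)
    else:
        # A leaf file: this is what a summarizer would be run on.
        out.append((name, data))
    return out


def make_example():
    """Build a small foo.tar.gz in memory (hypothetical contents)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for fname, body in [("foo.c", b"int main(void) { return 0; }\n"),
                            ("README", b"Example archive.\n")]:
            info = tarfile.TarInfo(fname)
            info.size = len(body)
            tar.addfile(info, io.BytesIO(body))
    return "foo.tar.gz", gzip.compress(buf.getvalue())


if __name__ == "__main__":
    for leaf_name, leaf_data in unravel(*make_example()):
        print(leaf_name, len(leaf_data))
```

No intermediate file is written to disk here, which is the property that would reduce the file system I/O shown in Table 6.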






       Essence: A Resource Discovery System ...         Hardy & Schwartz
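
The selection criteria proposed under File Tree Pruning might look like the sketch below. It is a hypothetical illustration, since the prototype does not yet select file system subsets: it prunes only by directory name and file name suffix, does not model sharing history, and the names select_files and demo are assumptions made for this example.

```python
import os
import tempfile


def select_files(root, skip_dirs=(), skip_suffixes=()):
    """Yield paths under root, pruning by directory name and file suffix."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending further.
        dirnames[:] = sorted(d for d in dirnames if d not in skip_dirs)
        for fname in sorted(filenames):
            if not fname.endswith(tuple(skip_suffixes)):
                yield os.path.join(dirpath, fname)


def demo():
    """Build a throwaway tree and return the selected relative paths."""
    with tempfile.TemporaryDirectory() as root:
        for rel in ("src/foo.c", "src/foo.o", "tmp/scratch.txt", "README"):
            path = os.path.join(root, rel)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w") as handle:
                handle.write("x\n")
        found = select_files(root, skip_dirs={"tmp"}, skip_suffixes=(".o",))
        return sorted(os.path.relpath(p, root).replace(os.sep, "/")
                      for p in found)


if __name__ == "__main__":
    print(demo())
```

Pruning whole directories before descending into them avoids classifying files that would never be indexed.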


       distributed file systems.  The    of useful  settings,  such  as
       Internet   resource  discovery    anonymous FTP archives.
       tools that have achieved popu-       Essence can index more data
       lar  acceptance  over the past    types,  index data faster, and
       two years are not well  suited    generate smaller indexes  than
       to  general  purpose file sys-    WAIS  or the MIT Semantic File
       tems, because of the irregular    System.   Our  prototype  gen-
       organization,   the  range  of    erates         WAIS-compatible
       different degrees of  informa-    indexes, allowing  WAIS  users
       tion  structure,  and the gen-    to   take   advantage  of  the
       erally low  sharing  value  of    Essence indexing methods.
       information  in such file sys-    Prototype Availability
       tems.                                The   Essence    prototype,
          In this paper we  presented    including  its source code and
       the Essence system, which gen-    WAIS  modifications,  is  pub-
       erates file summaries based on    licly  available  by anonymous
       an understanding of the seman-    FTP  from  ftp.cs.colorado.edu
       tics of the various  types  of    in   /pub/cs/distribs/essence.
       files  it  indexes.   The sum-    The prototype  is  written  in
       maries  are  useful  both  for    the  C  and  Perl  programming
       producing  searchable indexes,    languages [Kernighan & Ritchie
       and  for  allowing  users   to    1988, Wall & Schwartz 1991].
       retrieve and browse small sum-
       maries before deciding whether           Acknowledgements
       to   retrieve   a  large  file       This material is based upon
       across a slow network.  Simple    work  supported in part by the
       techniques   to  exploit  file    National  Science   Foundation
       semantics  yield  compact  yet    under grant NCR-9105372, and a
       representative   indexes   for    grant from  Sun  Microsystems'
       both textual and binary files.    Collaborative   Research  Pro-
       The  indexes generated in this    gram.
       fashion are more  content-rich
       than  archie's index, yet more       We thank Sean Coleman,  Jim
       space  efficient   than   WAIS    O'Toole,  David  Wood, and the
       indexes.                          USENIX program  committee  for
                                         their helpful comments on this
          Essence     provides     an    paper.
       integrated  system for classi-              References
       fying files, defining  summar-
       izer    mechanisms,   applying    [Berners-Lee et al. 1992]
       appropriate   summarizers   to      T. Berners-Lee, R. Cailliau,
       each  file,  and  traversing a      J.  Groff and B. Pollermann,
       portion of a  file  system  to      World-Wide Web: The Informa-
       produce  an  index of its con-      tion   Universe,  Electronic
       tents.   Importantly,  Essence      Networking: Research, Appli-
       understands nested file struc-      cations  and  Policy,  1(2),
       tures  (such   as   uuencoded,      Meckler Publications,  West-
       compressed,   ``tar''  files),      port, CT, Spring 1992.
       and recursively unravels  such    [Budd & Levin 1982]
       files  to  generate  summaries      T. A. Budd and G. M.  Levin,
       for  them.   The  ability   to      A  UNIX  Bibliographic Data-
       index  nested  files  and many      base  Facility,  Tech.  Rep.
       other   file   types    allows      82-1, Department of Computer
       Essence to be used in a number      Science,    University    of


       18       1993 Winter USENIX - January 25-29, 1993 - San Diego, CA






            Hardy & Schwartz         Essence: A Resource Discovery System ...


              Arizona, Tucson, AZ, 1982.        J. S. Quarterman, The Design
            [Emtage & Deutsch 1992]             and  Implementation  of  the
              A. Emtage  and  P.  Deutsch,      4.3BSD UNIX  Operating  Sys-
              Archie   -   An   Electronic      tem,  Addison-Wesley,  Read-
              Directory  Service  for  the      ing, MA, 1989.
              Internet,    Proc.    USENIX    [McCahill 1992]
              Winter  Conf.,  pp.  93-110,      M.  McCahill,  The  Internet
              January 1992.                     Gopher: A Distributed Server
                                                Information System,  ConneX-
            [Furlani 1991]                      ions  - The Interoperability
              J. L. Furlani, Modules: Pro-      Report,  6(7),  pp.   10-14,
              viding   a   Flexible   User      Interop, Inc., July 1992.
              Environment,  Proc.   USENIX
              Large  Installation  Systems    [Muntz & Honeyman 1992]
              Administration   V    Conf.,      D. Muntz  and  P.  Honeyman,
              October 1991.                     Multi-Level  Caching in Dis-
            [Gifford et al. 1991]               tributed File Systems - or -
              D. K. Gifford, P.  Jouvelot,      Your Cache Ain't Nuthin' But
              M.  A.  Sheldon,  and  J. W.      Trash, Proc. of  the  USENIX
              O'Toole, Jr., Semantic  File      Winter  Conf.,  pp. 305-313,
              Systems,   Proc.   13th  ACM      San Francisco,  CA,  January
              Symp.    Operating     Syst.      1992.
              Prin.,  pp.  16-25,  October    [NeXT 1991]
              1991.                             NeXT  Computer,  Inc.,  NeXT
                                                User's  Reference, NeXT Com-
            [Kahle & Medlar 1991]               puter, Inc.,  Redwood  City,
              B. Kahle and A.  Medlar,  An      CA, 1991.
              Information  System for Cor-
              porate  Users:   Wide   Area    [Neuman 1992]
              Information Servers, ConneX-      B. C.  Neuman,  Prospero:  A
              ions - The  Interoperability      Tool for Organizing Internet
              Report,   5(11),   pp.  2-9,      Resources,  Electronic  Net-
              Interop,   Inc.,    November      working:  Research, Applica-
              1991.                             tions, and Policy, 2(1), pp.
            [Kernighan & Ritchie 1988]          30-37, Meckler Publications,
              B. W. Kernighan  and  D.  M.      Westport, CT, Spring 1992.
              Ritchie,  The  C Programming    [Ousterhout et al. 1985]
              Language, 2nd Edition, Pren-      J. Ousterhout, H. Da  Costa,
              tice Hall, Englewood Cliffs,      D.  Harrison,  J.  Kunze, M.
              NJ, 1988.                         Kupfer, and J.  Thompson,  A
                                                Trace-Driven Analysis of the
            [Knuth 1984]                        UNIX 4.2  BSD  File  System,
              D. E.  Knuth,  The  TeXbook,      Proc. 10th ACM Symp. Operat-
              Addison-Wesley, Reading, MA,      ing Syst. Prin., pp.  15-24,
              1984.                             December 1985.
            [Lamport 1986]
              L. Lamport, LaTeX:  A  Docu-    [Salton & McGill 1983]
              ment   Preparation   System,      G. Salton and M. J.  McGill,
              Addison-Wesley, Reading, MA,      Introduction    to    Modern
              1986.                             Information       Retrieval,
                                                McGraw-Hill,  New  York, NY,
            [Leffler et al. 1989]               1983.
              S.   J.   Leffler,   M.   K.    [Schwartz et al. 1992a]
              McKusick,  M. J. Karels, and      M. F. Schwartz, D. J. Ewing,


            1993 Winter USENIX - January 25-29, 1993 - San Diego, CA       19






       Essence: A Resource Discovery System ...         Hardy & Schwartz


         and  R.  S. Hall, A Measure-    develops    network    support
         ment Study of Internet  File    software and  Internet  utili-
         Transfer Traffic, Tech. Rep.    ties.  Hardy can be reached by
         CU-CS-571-92, University  of    US Mail at the  Computer  Sci-
         Colorado, Boulder, CO, Janu-    ence Department, University of
         ary 1992.                       Colorado, Boulder,  CO  80309-
       [Schwartz et al. 1992b]           0430; or by electronic mail at
         M. F. Schwartz,  A.  Emtage,    hardy@cs.colorado.edu.
         B.  Kahle, and B. C. Neuman,       Michael     F.     Schwartz
         A  Comparison  of   Internet    received  his  PhD in Computer
         Resource           Discovery    Science from the University of
         Approaches,  Computing  Sys-    Washington.   He  is currently
         tems,  5(4),  University  of    an Assistant Professor of Com-
         California Press,  Berkeley,    puter  Science  at the Univer-
         CA, Fall 1992.                  sity  of  Colorado,   Boulder.
                                         His research focuses on issues
       [Schwartz & Tsirigotis 1991]      raised by  international  net-
         M. F.  Schwartz  and  P.  G.    works and distributed systems,
         Tsirigotis,  Experience with    with   particular   focus   on
         a   Semantically   Cognizant    resource discovery and network
         Internet  White Pages Direc-    measurement.  Schwartz  chairs
         tory Tool, J.  Internetwork-    an   Internet   Research  Task
         ing:  Research  and  Experi-    Force   Research   Group    on
         ence, 2(1), pp. 23-50, March    Resource  Discovery and Direc-
         1991.                           tory Service, and  is  on  the
       [USENIX 1986]                     editorial  boards for IEEE/ACM
         USENIX   Association,   UNIX    Transactions on Networking and
         Supplementary Documents, 4.3    for   Internet  Society  News.
         Berkeley Software  Distribu-    Schwartz can be reached by  US
         tion, November 1986.            Mail  at  the Computer Science
                                         Department,   University    of
       [WAIS 1992]                       Colorado,  Boulder,  CO 80309-
         WAIS server sources,  avail-    0430; or by electronic mail at
         able  by  anonymous FTP from    schwartz@cs.colorado.edu.
         think.com:/wais/wais-
         sources.tar.Z.
       [Wall & Schwartz 1991]
         L. Wall and R. L.  Schwartz,
         Programming  Perl,  O'Reilly
         and Associates, Inc., Sebas-
         topol, CA, 1991.

             Author Information
          Darren R.  Hardy  earned  a
       B.S.  in Computer Science from
       the  University  of  Colorado,
       Boulder, and is currently com-
       pleting an  M.S.  in  Computer
       Science.   He  specializes  in
       network  resource   discovery,
       distributed    systems,    and
       information retrieval.   As  a
       systems  engineer  at XOR Net-
       work  Engineering,  Inc.,   he


       20       1993 Winter USENIX - January 25-29, 1993 - San Diego, CA




