Table of Contents

NAME

index.noun, data.noun, index.verb, data.verb, index.adj, data.adj,
index.adv, data.adv - WordNet database files (default file names)

noun.idx, noun.dat, verb.idx, verb.dat, adj.idx, adj.dat, adv.idx, adv.dat -
WordNet database files (PC)

noun.exc, verb.exc. adj.exc adv.exc - morphology exception lists

cousin.tops, cousin.exc - files used by search code to group similar senses

cousin.tps, cousin.exc - files used by search code to group similar senses
(PC)

sentidx.vrb, sents.vrb - files used by search code to display sentences
illustrating the use of some specific verbs

(The remainder of this manual page refers to database files by their default
file names.)

DESCRIPTION

For each syntactic category, two files are needed to represent the contents
of the WordNet database - index. pos and data. pos , where pos is noun ,
verb , adj and adv . The other auxiliary files are used by the WordNet
library's searching functions and are needed to run the various WordNet
browsers.

Each index file is an alphabetized list of all the words found in WordNet in
the corresponding part of speech. On each line, following the word, is a
list of byte offsets (synset_offset s) in the corresponding data file, one
for each synset containing the word. Words in the index file are in lower
case only, regardless of how they were entered in the lexicographer files.
This folds various orthographic representations of the word into one line
enabling database searches to be case insensitive. See wninput(5WN) for a
detailed description of the lexicographer files

A data file for a syntactic category contains information corresponding to
the synsets that were specified in the lexicographer files, with relational
pointers resolved to synset_offset s. Each line corresponds to a synset.
Pointers are followed and hierarchies traversed by moving from one synset to
another via the synset_offset s.

The exception list files, pos .exc , are used to help the morphological
processor find base forms from irregular inflections.

The files cousin.tops and cousin.exc are used by the searching software to
group similar senses of a word together when displayed.

The files sentidx.vrb and sents.vrb contain sentences illustrating the use
of specific senses of some verbs. These files are used by the searching
software in response to a request for verb sentence frames. Generic sentence
frames are displayed when an illustrative sentence is not present.

The various database files are in ASCII formats that are easily read by both
humans and machines. All fields, unless otherwise noted, are separated by
one space character, and all lines are terminated by a newline character.
Fields enclosed in italicized square brackets may not be present.

See wngloss(7WN) for a glossary of WordNet terminology and a discussion of
the database's content and logical organization.

Index File Format

Each index file begins with several lines containing a copyright notice,
version number and license agreement. These lines all begin with two spaces
and the line number so they do not interfere with the binary search
algorithm that is used to look up entries in the index files. All other
lines are in the following format. In the field descriptions, number always
refers to a decimal integer unless otherwise defined.

lemma pos poly_cnt p_cnt [ptr_symbol...] sense_cnt tagsense_cnt
synset_offset [synset_offset...]

lemma
     lower case ASCII text of word or collocation. Collocations are formed
     by joining individual words with an underscore (_ ) character.
pos
     Syntactic category: n for noun files, v for verb files, a for adjective
     files, r for adverb files.

All remaining fields are with respect to senses of lemma in pos .

poly_cnt
     Number of different senses (polysemy) lemma has in WordNet. Note that
     this is the same value as sense_cnt , but is retained for historical
     reasons.
p_cnt
     Number of different types of pointers lemma has in all synsets
     containing it.
ptr_symbol
     A space separated list of p_cnt different types of pointers that lemma
     has in all synsets containing it. See wninput(5WN) for a list of
     pointer_symbol s. If all senses of lemma have no pointers, this field
     is omitted and p_cnt is 0 .
sense_cnt
     Number of synsets that lemma is in. This is the number of senses of the
     word in WordNet. See Sense Numbers below for a discussion of how sense
     numbers are assigned and the order of synset_offset s in the index
     files.
tagsense_cnt
     Number of senses of lemma that are ranked according to their frequency
     of occurrence in semantic concordance texts.
synset_offset
     Byte offset in data.pos file of a synset containing lemma . Each
     synset_offset in the list corresponds to a different sense of lemma in
     WordNet. synset_offset is an 8 digit, zero-filled decimal integer that
     can be used with fseek(3) to read a synset from the data file. When
     passed to read_synset(3WN) along with the syntactic category, a data
     structure containing the parsed synset is returned.

Data File Format

Each data file begins with several lines containing a copyright notice,
version number and license agreement. These lines all begin with two spaces
and the line number. All other lines are in the following format. Integer
fields are of fixed length, and are zero-filled.

synset_offset lex_filenum ss_type w_cnt word lex_id [word lex_id...] p_cnt
[ptr...] [frames...] | gloss

synset_offset
     Current byte offset in the file represented as an 8 digit decimal
     integer.
lex_filenum
     Two digit decimal integer corresponding to the lexicographer file name
     containing the synset. See lexnames(5WN) for the list of filenames and
     their corresponding numbers.
ss_type
     One character code indicating the synset type:

     n    NOUN
     v    VERB
     a    ADJECTIVE
     s    ADJECTIVE SATELLITE
     r    ADVERB

w_cnt
     Two digit hexadecimal integer indicating the number of words in the
     synset.
word
     ASCII form of a word as entered in the synset by the lexicographer,
     with spaces replaced by underscore characters (_ ). The text of the
     word is case sensitive, in contrast to its form in the corresponding
     index. pos file, that contains only lower-case forms. In data.adj , a
     word is followed by a syntactic marker if one was specified in the
     lexicographer file. A syntactic marker is appended, in parentheses,
     onto word without any intervening spaces. See wninput(5WN) for a list
     of the syntactic markers for adjectives.
lex_id
     One digit hexadecimal integer that, when appended onto lemma , uniquely
     identifies a sense within a lexicographer file. lex_id numbers usually
     start with 0 , and are incremented as additional senses of the word are
     added to the same file, although there is no requirement that the
     numbers be consecutive or begin with 0 . Note that a value of 0 is the
     default, and therefore is not present in lexicographer files.
p_cnt
     Three digit decimal integer indicating the number of pointers from this
     synset to other synsets. If p_cnt is 000 the synset has no pointers.
ptr
     A pointer from this synset to another. ptr is of the form:

          pointer_symbol synset_offset pos source/target

     where synset_offset is the byte offset of the target synset in the data
     file corresponding to pos .

     The source/target field distinguishes lexical and semantic pointers. It
     is a four byte hexadecimal field: the first two bytes indicate the word
     number in the current (source) synset, the last two bytes indicate the
     word number in the target synset. A value of 0000 means that
     pointer_symbol represents a semantic relation between the current
     (source) synset and the target synset indicated by synset_offset .

     A lexical relation between two words in different synsets is
     represented by non-zero values in the source and target word numbers.
     The first and last two bytes of this field indicate the word numbers in
     the source and target synsets, respectively, between which the relation
     holds. Word numbers are assigned to the word fields in a synset, from
     left to right, beginning with 1 .

     See wninput(5WN) for a list of pointer_symbol s, and semantic and
     lexical pointer classifications.
frames
     In data.verb only, a list of numbers corresponding to the generic verb
     sentence frames for word s in the synset. frames is of the form:

     f_cnt + f_num w_num [+ f_num w_num...]

     where f_cnt a two digit decimal integer indicating the number of
     generic frames listed, f_num is a two digit decimal integer frame
     number, and w_num is a two digit hexadecimal number indicating the word
     in the synset that the frame applies to. As with pointers, if this
     number is 00 , f_num applies to all word s in the synset. If non-zero,
     it is applicable only to the word indicated. Word numbers are assigned
     as described for pointers. Each f_num w_num pair is preceded by a + .
     See wninput(5WN) for the text of the generic sentence frames.
gloss
     Each synset contains a gloss. A gloss is represented as a vertical bar
     (| ), followed by a text string that continues until the end of the
     line. The gloss may contain a definition, one or more example
     sentences, or both.

Sense Numbers

Senses in WordNet are generally ordered from most to least frequently used,
with the most common sense numbered 1 . Frequency of use is determined by
the number of times a sense is tagged in the various semantic concordance
texts. Senses that are not semantically tagged follow the ordered senses.
The tagsense_cnt field for each entry in the index.pos files indicates how
many of the senses in the list have been tagged.

The cntlist(5WN) file provided with the database lists the number of times
each sense is tagged in the semantic concordances. The data from cntlist is
used by grind(1WN) to order the senses of each word. When the index .pos
files are generated, the synset_offset s are output in sense number order,
with sense 1 first in the list. Senses with the same number of semantic tags
are assigned unique but consecutive sense numbers. The WordNet OVERVIEW
search displays all senses of the specified word, in all syntactic
categories, and indicates which of the senses are represented in the
semantically tagged texts.

Exception List File Format

Exception lists are alphabetized lists of inflected forms of words and their
base forms. The first field of each line is an inflected form, followed by a
space separated list of one or more base forms of the word. There is one
exception list file for each syntactic category.

Note that the noun and verb exception lists were automatically generated
from a machine-readable dictionary, and contain many words that are not in
WordNet. Also, for many of the inflected forms, base forms could be easily
derived using the standard rules of detachment programmed into Morphy (See
morph(7WN) ). These anomalies are allowed to remain in the exception list
files, as they do no harm.

Format of Sense Grouping Files

In all syntactic categories the synsets containing a search word can be
displayed in sense number order. For nouns and verbs, the RELATIVES search
displays similar senses of a word together.

Two files are used by the grouping algorithm of the WordNet search code.
cousin.tops is a list of pairs of synset_offset s, one pair per line, each
representing a set of top nodes for the noun cousin grouping relation.

Candidates for cousin grouping are checked by hand and those that would
normally be grouped by the grouping heuristics, but are really not similar
in meaning, are listed in cousin.exc . Each line contains at least two space
separated synset_offset s, with the one of lower numerical value coming
first. If additional offsets are present, they are all of greater numerical
value than the first field. Offsets occurring on the same line represent
synsets containing different senses of a word that should not be grouped
together.

See wngroups(7WN) for more information on how noun and verb senses are
grouped.

Verb Example Sentences

For some verb senses, example sentences illustrating the use of the verb
sense can be displayed. Each line of the file sentidx.vrb contains a
sense_key followed by a space and a comma separated list of example sentence
template numbers, in decimal. The file sents.vrb lists all of the example
sentence templates. Each line begins with the template number followed by a
space. The rest of the line is the text of a template example sentence, with
%s used as a placeholder in the text for the verb. Both files are sorted
alphabetically so that the sense_key and template sentence number can be
used as indices, via binsrch(3WN) , into the appropriate file.

When a request for FRAMES is made, the WordNet search code looks for the
sense in sentidx.vrb . If found, the sentence template(s) listed is
retrieved from sents.vrb , and the %s is replaced with the verb. If the
sense is not found, the applicable generic sentence frame(s) listed in
frames is displayed.

NOTES

Information in the data.pos and index.pos files represents all of the word
senses and synsets in the WordNet database. The word , lex_id , and
lex_filenum fields together uniquely identify each word sense in WordNet.
These can be encoded in a sense_key as described in senseidx(5WN) . Each
synset in the database can be uniquely identified by combining the
synset_offset for the synset with a code for the syntactic category (since
it is possible for synsets in different data.pos files to have the same
synset_offset ).

The WordNet system provide both command line and window-based browser
interfaces to the database. Both interfaces utilize a common library of
search and morphology code. The source code for the library and interfaces
is included in the WordNet package. See wnintro(3WN) for an overview of the
WordNet source code.

ENVIRONMENT VARIABLES

WNHOME
     Base directory for WordNet. Unix default is /usr/local/wordnet1.6 , PC
     default is C:\wn16 , Macintosh default is : .
WNSEARCHDIR
     Directory in which the WordNet database has been installed. Unix
     default is WNHOME/dict , PC default is WNHOME\dict , Macintosh default
     is :Database .
WNDBVERSION
     Indicate which format the WordNet database files in WNSEARCHDIR are in.
     The default is 1.6 . Setting WNDBVERION to 1.5 allows the 1.6 library
     code to work with the 1.5 database files.

FILES

All files are in the directory WNSEARCHDIR .

index.pos
     database index files (Unix and Macintosh)
pos.idx
     database index files (PC)
data.pos
     database data files (Unix and Macintosh)
pos.dat
     database data files (PC)
cousin.*
     files used to group similar senses
*.vrb
     files of sentences illustrating the use of verbs
pos.exc
     morphology exception lists

SEE ALSO

grind(1WN) , wn(1WN) , wnb(1WN) , wnintro(3WN) , binsrch(3WN) , cntlist(5WN)
, glossidx(5WN) , lexnames(5WN) , senseidx(5WN) , wninput(5WN) , morphy(7WN)
, semcor(7WN) , wngloss(7WN) , wngroups(7WN) , wnpkgs(7WN) , wnstats(7WN) .

----------------------------------------------------------------------------

Table of Contents

   * NAME
   * DESCRIPTION
        o Index File Format
        o Data File Format
        o Sense Numbers
        o Exception List File Format
        o Format of Sense Grouping Files
        o Verb Example Sentences
   * NOTES
   * ENVIRONMENT VARIABLES
   * FILES
   * SEE ALSO
