========================
=  COW frequency lists =
========================


http://corporafromtheweb.org/
http://webcorpora.org/
http://hpsg.fu-berlin.de/cow/


INTRODUCTION ==========================================================

These files are word and lemma frequency lists for the COW corpora.
They have been derived from the published unigram lists and contain
some convenient additional counts and ranks for the items, cf. FORMAT.
Because files which contain all (even low-frequency) items tend to be
large, alternative versions with a frequency threshold applied are
also provided.

The file name is (once unzipped):

    CORPUS.freqN.AGGREGATION.tsv

CORPUS is the name of the published COW sentence shuffle corpus.
N is the frequency threshold: only words/lemmas with raw frequency N
or greater are included in the list. AGGREGATION is one of

    w	just words (= tokens)
    wp	word plus POS tag combinations
    l	lemmas

For example, the archive

    encow14ax.freq10+.wp.tsv.zip

contains the frequency list for word plus POS tag combinations which
have frequency 10 or greater in the ENCOW14AX corpus.
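
As an illustration only, the naming scheme can be taken apart with a
short Python sketch. The function parse_list_name and the class
ListName are hypothetical and not part of the COW distribution. The
pattern tolerates the "+" after the threshold and the trailing ".zip"
seen in the example above. Whether every published file follows
exactly this shape is an assumption.

```python
import re
from typing import NamedTuple


class ListName(NamedTuple):
    """Parts of a frequency list file name (hypothetical helper type)."""
    corpus: str       # e.g. "encow14ax"
    threshold: int    # N, the minimum raw frequency
    aggregation: str  # "w", "wp" or "l"


# Assumption: an optional "+" may follow N and ".zip" may follow ".tsv",
# as in the example name given in this README.
_NAME_RE = re.compile(
    r"^(?P<corpus>[^.]+)\.freq(?P<n>\d+)\+?\.(?P<agg>wp|w|l)\.tsv(?:\.zip)?$"
)


def parse_list_name(filename: str) -> ListName:
    """Split a name like CORPUS.freqN.AGGREGATION.tsv into its parts."""
    match = _NAME_RE.match(filename.strip())
    if match is None:
        raise ValueError(f"not a recognised frequency list name: {filename!r}")
    return ListName(match["corpus"], int(match["n"]), match["agg"])
```

Calling parse_list_name("encow14ax.freq10+.wp.tsv.zip") yields the
corpus name "encow14ax", the threshold 10 and the aggregation "wp".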


AUTHORS AND CITATION ==================================================

The COW corpora were created by Felix Bildhauer and Roland Schäfer of
the German Grammar Group, Freie Universität Berlin:

http://hpsg.fu-berlin.de/~fbildhau/ (COW)
http://hpsg.fu-berlin.de/~rsling/ (COW, COW ngrams, frequencies)

If you use the data, always cite at least the most recent publication
mentioned here:

http://corporafromtheweb.org/category/cow-citation/


LICENSE ===============================================================

This work is licensed under the Creative Commons Attribution 4.0
International License. To view a copy of this license,
visit http://creativecommons.org/licenses/by/4.0/ or send a letter to
Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.


FORMAT ================================================================

The files are sorted by decreasing frequency of the item and are
formatted as tab-separated value files. The first line is a header
line. Each subsequent line contains the following tab-separated
fields for one item:

    f_raw
	raw frequency of the item in the corpus

    rank_abs
	absolute frequency rank in the corpus (equal f_raw means
	equal rank)

    f_permil
	frequency per million items

    f_logpermil+REALNUMBER
	log10 of the frequency per million shifted into the positive
	range by adding REALNUMBER (such that this value is 0 for
	items with f_raw=1)

    f_logpermil+10
	log10 of the frequency per million shifted into the positive
	range by adding 10

    band
	frequency band calculated by round(log2(f_max/f_raw)+0.5),
	where f_max is the raw frequency of the most frequent item

    token...
	the word or lemma with potential annotations (possibly several
	tab-separated fields)
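
The field layout above can be read with a few lines of Python. The
following is a minimal sketch, not an official reader: the names Row,
read_rows, per_million, zero_shift and log_per_million are
hypothetical, fields are accessed by position because the exact header
strings are not assumed, and the normalising count n_items (the total
number of items in the corpus) is an assumption about how f_permil was
computed.

```python
import csv
import math
from typing import Iterable, Iterator, NamedTuple, Tuple


class Row(NamedTuple):
    f_raw: int
    rank_abs: int
    f_permil: float
    f_logpermil_zero: float  # column f_logpermil+REALNUMBER
    f_logpermil_10: float    # column f_logpermil+10
    band: int
    token: Tuple[str, ...]   # word or lemma plus any annotation fields


def read_rows(lines: Iterable[str]) -> Iterator[Row]:
    """Yield one Row per data line of an unzipped frequency list."""
    # QUOTE_NONE: quotation marks are ordinary tokens, not CSV quoting.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header line
    for fields in reader:
        if len(fields) < 7:
            continue  # assumption: incomplete lines are skipped
        # Assumption: f_raw, rank_abs and band are written as integers.
        yield Row(
            int(fields[0]),
            int(fields[1]),
            float(fields[2]),
            float(fields[3]),
            float(fields[4]),
            int(fields[5]),
            tuple(fields[6:]),
        )


def per_million(f_raw: int, n_items: int) -> float:
    """Frequency per million items, given the assumed total n_items."""
    return f_raw * 1_000_000 / n_items


def zero_shift(n_items: int) -> float:
    """The REALNUMBER that makes the shifted log10 value 0 for f_raw=1."""
    return -math.log10(per_million(1, n_items))


def log_per_million(f_raw: int, n_items: int, shift: float = 10.0) -> float:
    """log10 of the frequency per million, shifted by adding `shift`."""
    return math.log10(per_million(f_raw, n_items)) + shift
```

For a real file, read_rows can be given an open text handle, for
instance open(path, encoding="utf-8", newline=""). The UTF-8 encoding
is an assumption here.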


=======================================================================
This document: May 28, 2015. Roland Schäfer.
=======================================================================
