Metadata-Version: 2.1
Name: wikivector
Version: 1.2.1
Summary: WikiVector: Tools for encoding Wikipedia articles as vectors
Home-page: https://github.com/mortonne/wikivector
Author: Neal Morton
Author-email: mortonne@gmail.com
License: GPL-3.0-or-later
Keywords: NLP,Wikipedia
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
Requires-Dist: numpy
Requires-Dist: selectolax
Requires-Dist: h5py (>=3)
Requires-Dist: pandas
Requires-Dist: tensorflow-hub

# wikivector

[![PyPI version](https://badge.fury.io/py/wikivector.svg)](https://badge.fury.io/py/wikivector)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4453878.svg)](https://doi.org/10.5281/zenodo.4453878)

Tools for encoding Wikipedia articles as vectors.

## Installation

To get the latest stable version:

```bash
pip install wikivector
```

To get the development version:

```bash
pip install git+https://github.com/mortonne/wikivector
```

## Exporting Wikipedia text

First, run [WikiExtractor](https://github.com/attardi/wikiextractor)
on a Wikipedia dump. This generates a directory of subdirectories,
each containing extracted text files. Next, build a header file
listing all articles in the extracted text data:

```bash
wiki_header wiki_dir
```

where `wiki_dir` is the path to the output from `WikiExtractor`. 
This will create a CSV file called `header.csv` with the title of each 
article and the file in which it can be found.

To extract specific articles, write a CSV file with two columns: "item"
and "title". The "title" for each item must exactly match an article
title in the Wikipedia dump. We refer to this file as the `map_file`.
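A minimal map file can be written with pandas; the item names and titles below are purely illustrative:

```python
import pandas as pd

# hypothetical item pool; each "title" must exactly match an
# article title in the Wikipedia dump
mapping = pd.DataFrame(
    {"item": ["einstein", "paris"],
     "title": ["Albert Einstein", "Paris"]}
)
mapping.to_csv("map.csv", index=False)
```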

If you are working with an older Wikipedia dump, it can be difficult to 
find the correct titles for article pages, as page titles may have changed
between the archive and the current online version of Wikipedia. To help 
identify mismatches between the map file and the Wikipedia dump, you can 
run:

```bash
wiki_check_map header_file map_file
```

to display any items whose article is not found in the header file. You 
can then use the Bash utility `grep` to search the header file for correct 
titles for each missing item.
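As an alternative to `grep`, the header file can be searched with pandas. This sketch assumes `header.csv` has a "title" column (check the columns of your generated file); the toy frame stands in for a real header:

```python
import pandas as pd

# toy stand-in for header.csv (assumed "title" and "file" columns)
header = pd.DataFrame(
    {"title": ["Albert Einstein", "Einstein (crater)", "Paris"],
     "file": ["AA/wiki_00", "AA/wiki_01", "AB/wiki_03"]}
)

# case-insensitive substring search for a missing item's likely title
matches = header[header["title"].str.contains("einstein", case=False)]
print(matches["title"].tolist())
```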

When your map file is ready, extract the text for each item:

```bash
export_articles header_file map_file output_dir
```

where `map_file` is the CSV file with your items, and `output_dir` is
where you want to save text files with each item's article. Check the
output carefully to ensure that you have the correct text for each item
and that XML tags have been stripped out.
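One rough way to spot unstripped markup is to scan the exported files for angle brackets. This is only a heuristic sketch, and the directory and file names here are hypothetical:

```python
from pathlib import Path

# hypothetical output directory with one exported article
out = Path("articles")
out.mkdir(exist_ok=True)
(out / "einstein.txt").write_text("Albert Einstein was a physicist.")

# flag any article text that still contains angle brackets,
# a crude sign of leftover XML/HTML tags
flagged = [p.name for p in out.glob("*.txt") if "<" in p.read_text()]
print(flagged)  # [] when no markup remains
```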

## Universal Sentence Encoder

Once articles have been exported, you can calculate a vector embedding
for each item using the Universal Sentence Encoder.

```bash
embed_articles map_file text_dir h5_file
```

This reads a map file specifying an item pool (only the "item" field is 
used) and outputs vectors in an hdf5 file. To read the vectors, in 
Python:

```python
from wikivector import vector

h5_file = "vectors.hdf5"  # the path passed to embed_articles above
vectors, items = vector.load_vectors(h5_file)
```
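Once loaded, the vectors can be compared directly, for example with cosine similarity. This sketch uses random stand-ins for the array that `load_vectors` returns; the `(n_items, 512)` shape is an assumption based on the Universal Sentence Encoder's 512-dimensional output:

```python
import numpy as np

# toy stand-in for the (n_items, 512) array from load_vectors
rng = np.random.default_rng(0)
vectors = rng.normal(size=(3, 512))
items = ["einstein", "paris", "tokyo"]

# cosine similarity between all pairs of item vectors
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
sim = unit @ unit.T
print(sim.shape)  # (3, 3)
```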

## Citation

If you use wikivector, please cite the following paper:

Morton, NW*, Zippi, EL*, Noh, S, Preston, AR. In press.
Semantic knowledge of famous people and places is represented in hippocampus and distinct cortical networks.
Journal of Neuroscience. *authors contributed equally


