Metadata-Version: 2.1
Name: ddhi-encoder
Version: 1.2.3
Summary: Encoding tools for DDHI
Home-page: https://github.com/pyscaffold/pyscaffold/
Author: Clifford Wulfman
Author-email: cwulfman@princeton.edu
License: mit
Project-URL: Documentation, https://pyscaffold.org/
Platform: any
Classifier: Development Status :: 4 - Beta
Classifier: Programming Language :: Python
Description-Content-Type: text/x-rst; charset=UTF-8
Requires-Dist: docx2python
Requires-Dist: lxml
Requires-Dist: spacy
Provides-Extra: testing
Requires-Dist: pytest ; extra == 'testing'
Requires-Dist: pytest-cov ; extra == 'testing'

A collection of command-line utilities to assist in the creation of
TEI-encoded oral history interviews. Part of the Dartmouth Digital
History Initiative.

.. _ddhi-encoder-1:

DDHI Encoder
============

The ddhi-encoder package is being developed to assist encoders in the
DDHI project in encoding oral history interview transcripts in TEI. At
present, it contains the following command-line utilities:

#. ``ddhi_convert``: convert a Dartmouth DVP transcript from docx to
   tei.xml.
#. ``ddhi_tag``: perform named-entity tagging on a DDHI TEI
   transcription.
#. ``ddhi_mentioned_places``: extract places from stand-off markup
   for processing with OpenRefine.
#. ``ddhi_update_places``: update places in stand-off markup.

Installation
------------

You can use pip to install this package:

.. code:: bash

   pip install ddhi-encoder

To perform named-entity tagging with ``ddhi_tag``, you will need a spaCy
model. Before running ``ddhi_tag``, install spaCy's small English model:

.. code:: bash

   python -m spacy download en_core_web_sm

See `the spaCy documentation <https://spacy.io/models>`__ for more
information.
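
As a quick sanity check before running ``ddhi_tag``, the sketch below
tests whether the model package can be imported. It uses only the
standard library; the helper name ``model_available`` is hypothetical
and is not part of ddhi-encoder.

.. code:: python

   import importlib.util


   def model_available(name="en_core_web_sm"):
       """Return True if a top-level package called `name` is importable."""
       # spaCy models are installed as ordinary Python packages, so
       # looking up the import spec is enough to detect them.
       return importlib.util.find_spec(name) is not None


   if __name__ == "__main__":
       if not model_available():
           print("Model missing. Run: python -m spacy download en_core_web_sm")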

Use
---

Use ``ddhi_convert`` to transform a DOCX-encoded transcription into a
simply structured TEI document:

.. code:: bash

   ddhi_convert ~/Desktop/transcripts/zien_jimmy_transcript_final.docx -o tmp.tei.xml

Use ``ddhi_tag`` to add named-entity tags to a TEI-encoded
transcription:

.. code:: bash

   ddhi_tag -o zien.tei.xml tmp.tei.xml
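
The two steps above can also be chained from a script. The following is
a minimal sketch, assuming both utilities are on your ``PATH`` and
accept exactly the arguments shown above; ``build_commands`` and
``run_pipeline`` are hypothetical helper names, not part of
ddhi-encoder.

.. code:: python

   import subprocess
   from pathlib import Path


   def build_commands(docx_path, tei_path, tmp_path="tmp.tei.xml"):
       """Return the convert and tag command lines as argument lists."""
       # expanduser() resolves a leading "~", which the shell would
       # normally do for us.
       docx = str(Path(docx_path).expanduser())
       convert = ["ddhi_convert", docx, "-o", str(tmp_path)]
       tag = ["ddhi_tag", "-o", str(tei_path), str(tmp_path)]
       return [convert, tag]


   def run_pipeline(docx_path, tei_path):
       """Run conversion, then tagging, stopping at the first failure."""
       for command in build_commands(docx_path, tei_path):
           subprocess.run(command, check=True)

Calling ``run_pipeline("interview.docx", "interview.tei.xml")`` would
run ``ddhi_convert`` followed by ``ddhi_tag``; the file names here are
placeholders.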

Encoders are then expected to edit the text of the interview,
correcting automatically generated named-entity tags and adding new
ones. When this phase of editing is complete, use
``ddhi_generate_standoff`` to create a ``<standOff>`` element in the
interview and link the entities to names in the text.
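
To confirm that a transcript already contains stand-off markup before
moving on, a check along these lines may help. It is a sketch only: it
assumes the ``<standOff>`` element is in the standard TEI namespace,
and ``has_standoff`` is a hypothetical helper, not part of
ddhi-encoder.

.. code:: python

   import xml.etree.ElementTree as ET

   # The standard TEI namespace; assumed to apply to <standOff> here.
   TEI_NS = "http://www.tei-c.org/ns/1.0"


   def has_standoff(tei_xml):
       """Return True if the TEI document text contains a standOff element."""
       root = ET.fromstring(tei_xml)
       return root.find(f".//{{{TEI_NS}}}standOff") is not None

To check a file, pass its contents as bytes, for example
``has_standoff(open("zien.tei.xml", "rb").read())``.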

Use ``ddhi_mentioned_places`` to extract the places in a TEI file's
stand-off markup and print them as tab-separated values:

.. code:: bash

   ddhi_mentioned_places lovely.tei.xml > lovely.tsv

Then use OpenRefine or another tool to refine this list with
identifiers and other metadata.
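
If you prefer to inspect the list in Python, the standard ``csv``
module reads tab-separated values directly. The sketch below makes no
assumption about which columns ``ddhi_mentioned_places`` emits; the
two-column sample rows are invented purely for illustration, and
``read_tsv`` is a hypothetical helper.

.. code:: python

   import csv
   import io


   def read_tsv(stream):
       """Return every non-empty row of a tab-separated stream."""
       return [row for row in csv.reader(stream, delimiter="\t") if row]


   # Invented sample data; real output may have different columns.
   sample = io.StringIO("place_1\tHanover\nplace_2\tSaigon\n")
   rows = read_tsv(sample)

For a real file, open it with ``newline=""`` and pass the file object
to ``read_tsv``.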

Use ``ddhi_update_places`` to update the places in a TEI file's
stand-off markup with identifiers and geo-coordinates obtained via
OpenRefine or another procedure:

.. code:: bash

   ddhi_update_places lovely.tei.xml lovely_updates.tsv > updated_lovely.tei.xml

Similarly, use ``ddhi_mentioned_events`` and ``ddhi_update_events`` to
perform the same operations for events.


