Metadata-Version: 2.1
Name: docria
Version: 0.1.0
Summary: Semi-structured Document Model
Home-page: https://github.com/marcusklang/docria
Author: Marcus Klang
Author-email: marcus.klang@cs.lth.se
License: Apache 2.0
Project-URL: Source, https://github.com/marcusklang/docria
Project-URL: Tracker, https://github.com/marcusklang/docria/issues
Description: Docria (Python)
        ===============
        
        Semi-structured strongly typed document storage model for Python 3+
        
        ---------------
        
        Overview
        --------
        
        The document model consists of the following concepts:
        
         * **Document**: The overall container for everything (all nodes, layers and texts must be contained within it).
         * **Document fields**: A single dictionary per document for storing metadata.
         * **Text**: The basic text representation, a wrapped string that tracks spans.
         * **Text Spans**: A subsequence of a text; can always be converted into a plain string using ``str(span)``.
         * **Layer**: A collection of nodes.
         * **Layer Schema**: The definition of field names and types used when the document is serialized.
         * **Node**: A single node with zero or more fields with values.
         * **Node fields**: Key-value pairs.
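        
        The span concept can be illustrated without docria. The following is a minimal sketch, not docria's actual implementation: ``ToySpan`` is a hypothetical class that keeps a reference to the full string plus start and stop offsets, and only materializes a plain string when ``str()`` is called on it.
        
        .. code-block:: python
        
            from dataclasses import dataclass
        
        
            @dataclass(frozen=True)
            class ToySpan:
                """Hypothetical stand-in for a text span (not part of docria)."""
                text: str   # the full underlying string
                start: int  # inclusive start offset
                stop: int   # exclusive stop offset
        
                def __str__(self):
                    # Converting to a plain string slices the underlying text.
                    return self.text[self.start:self.stop]
        
        
            source = "This code was written in Lund, Sweden."
            span = ToySpan(source, 25, 29)
            lund = str(span)
            print(lund)
        
        Because the span stores offsets rather than a copy of the characters, it stays tied to its position in the original text.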
        
        All parts of the document are accessible through three properties:
        
        .. code-block:: python
        
            from docria.model import Document
        
            doc = Document()
            doc.props  # The Document metadata dictionary
            doc.layers # The layer dictionary, mapping layer name to node collection
            doc.texts  # The texts dictionary.
        
        
        Example of usage
        ----------------
        
        .. code-block:: python
            :name: How to create a document and insert nodes
        
            from docria.model import Document, Node, DataTypes as T
            import re
            # Stupid tokenizer
            tokenizer = re.compile(r"[a-zA-Z]+|[0-9]+|[^\s]")
        
            doc = Document()
        
            # Create a new text context called 'main' with the text 'This code was written in Lund, Sweden.'
            main_text = doc.add_text("main", "This code was written in Lund, Sweden.")
            #                                 01234567890123456789012345678901234567
            #                                 0         1         2         3
        
            # Create a new layer with fields: id, text and head.
            #
            # Fields:
            #   id is an int32
            #   text is a span from context 'main'
            #   head is a node reference into the token layer (the layer we are creating)
            #
            tokens = doc.add_layer("token", id=T.int32, text=main_text.spantype, head=T.noderef("token"))
        
            # Adding nodes: Solution 1
            # Token ids start at 1; the enumerate index starts at 0.
            token_zero = None
            token_two = None
            for index, m in enumerate(tokenizer.finditer(str(main_text))):
                token_node = tokens.add(id=index + 1, text=main_text[m.start():m.end()])
                if index == 0:
                    token_zero = token_node
                elif index == 2:
                    token_two = token_node
        
            # Point the third token's head at the first token
            token_two["head"] = token_zero
        
            # Solution 2: If adding many nodes
            token_list = []
        
            i = 1
            for m in tokenizer.finditer(str(main_text)):
                # This token is dangling, and is not attached until add_many
                token = Node({"id": i, "text": main_text[m.start():m.end()]})
                token_list.append(token)
                i += 1
        
            token_list[2]["head"] = token_list[0]
            tokens.add_many(token_list)
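        
        To check which character offsets end up in the ``text`` spans, the tokenizer can be run on its own. This sketch uses only the standard library and the same regular expression and sentence as the example above; the ``offsets`` name is introduced here for illustration.
        
        .. code-block:: python
        
            import re
        
            # Same pattern as above: runs of letters, runs of digits,
            # or any single non-whitespace character.
            tokenizer = re.compile(r"[a-zA-Z]+|[0-9]+|[^\s]")
            text = "This code was written in Lund, Sweden."
        
            # Collect (start, end, token) for every match; end is exclusive.
            offsets = [(m.start(), m.end(), m.group()) for m in tokenizer.finditer(text)]
        
            for start, end, token in offsets:
                print(start, end, repr(token))
        
        The sentence yields nine tokens. Punctuation becomes its own token, so ``Lund`` covers offsets 25 to 29 and the comma that follows covers 29 to 30.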
        
        Document I/O
        ------------
        
        The ``DocumentIO`` class in ``docria.storage`` provides factory methods for creating readers and writers.
        
        .. code-block:: python
            :name: How to create file writer and reader
        
            from docria.storage import DocumentIO
        
            # ``documents`` is assumed to be an iterable of Document objects
            with DocumentIO.write("output-file.docria") as docria_writer:
                for doc in documents:
                    docria_writer.write(doc)
        
        
            with DocumentIO.read("output-file.docria") as docria_reader:
                for doc in docria_reader:
                    # Do something with doc, which is a document
                    pass
        
        Raw reading and writing of documents:
        
        .. code-block:: python
            :name: Using the Msgpack Codec
        
            from docria.codec import MsgpackCodec
        
            binarydata = bytes()  # Placeholder: the encoded bytes of a document, from any source
        
            # To decode into a document
            doc = MsgpackCodec.decode(binarydata)
        
            # To encode a document into binary data
            binarydata = MsgpackCodec.encode(doc)
        
        Notes
        -----
        
        Use regular object references when referring to a node.
        
        The settings used for pretty printing are controlled by ``docria.printout.options``.
        
        By convention, pretty printing outputs [layer name]#[internal id], where the internal id can be used to get the node.
        However, this id is only guaranteed to stay the same while the layer is unchanged; once the layer changes, the id is invalid.
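        
        The sketch below shows why an object reference is safer than a stored id. It does not use docria: it models a layer as a plain Python list with positional ids, which is an assumption made purely for illustration and says nothing about how docria assigns ids internally.
        
        .. code-block:: python
        
            # Toy "layer": a plain list, with the list index standing in for an id.
            layer = ["This", "code", "was"]
        
            saved_id = 1            # remembered id of the node "code"
            node = layer[saved_id]  # regular object reference to the same node
        
            # Changing the layer shifts every later position.
            del layer[0]
        
            by_id_after = layer[saved_id]
            print(node)         # the reference still refers to "code"
            print(by_id_after)  # the saved id now resolves to a different node
        
        After the deletion the saved id silently resolves to ``"was"``, while the object reference is unaffected.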
        
        
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Topic :: Utilities
Description-Content-Type: text/x-rst
