Metadata-Version: 2.0
Name: lieparse
Version: 1.0.4
Summary: HTML parser and text retriever using a user-defined rule set
Home-page: https://pypi.org/project/lieparse/
Author: Vidmantas Balčytis
Author-email: vidma@lema.lt
License: LICENSE.txt
Keywords: html parser
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Web Environment
Classifier: Intended Audience :: Developers
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3 :: Only
Requires-Python: >=3.5
Requires-Dist: regex

lieparse Python project
=======================

**lieparse** is an HTML parser and text retriever driven by a user-defined rule set.

**HISTORY**

The library was initially written for the Vilnius University **Liepa-2** project.
Although *LIEPA* is an abbreviation of the project name, in Lithuanian *liepa* means "linden tree".
A tree also appears in the project's logo.

**QUICK USAGE**

Let's say you have HTML markup read into the string ``HtmlText``.
To retrieve all text from the division with ``id="main"``, write::

    from lieparse import lieParser
    rules = '<div id="main">$Data[]</div> ::$Data[];'
    parser = lieParser(rules)
    parser.feed(HtmlText)
    parser.close()

*A more sophisticated example can be found after the rules syntax definitions.*

**RULES SYNTAX**

Rules consist of statements, optionally separated by white space.

Spaces, tabs, newlines and comments all count as white space.

A comment begins with a # sign and runs to the end of the line.
(Concretely, white space is whatever matches the regex ``r'(?:\s*(?:#.*?\n)?)*\s*'``.)
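
As a rough illustration of what that pattern accepts, the sketch below uses Python's standard ``re`` module (the package metadata lists ``regex`` as a dependency, which is not needed here) to skip leading white space and comments in a rules string. The helper name ``skip_white_space`` is made up for this example and is not part of lieparse::

    import re

    # The white-space pattern quoted in the text above.
    WHITE_SPACE = re.compile(r'(?:\s*(?:#.*?\n)?)*\s*')

    def skip_white_space(rules, pos=0):
        """Return the index of the first character after white space and comments."""
        return WHITE_SPACE.match(rules, pos).end()

    rules = "  # leading comment\n   <div></div>"
    print(rules[skip_white_space(rules):])   # the statement without its leading comment

Because the pattern can match an empty string, the match always succeeds; when there is nothing to skip, the returned index is simply ``pos``.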

Statements can contain nested statements and data definitions.

*Statements are*:

*Rep statement* - repeatedly matches all nested statements::

   #(<<other statements>>)
   #+(<<other statements>>)
   *(<<other statements>>)

where::

   # is an optional numeric value giving the repeat count
   + is the one-or-more modifier. Standing alone, it is the same as 1+
   * is the zero-or-more modifier. It cannot be preceded by a number

If a number or a '+' or '*' modifier precedes another statement (Tag or Any), a repeat block
is generated automatically. So '2<div></div>' is automatically converted to
'2(<div></div>)'.
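
One possible reading of the repeat prefixes can be sketched as a small Python function that turns a prefix into a ``(minimum, maximum)`` pair. This is only an interpretation of the description above, not lieparse code: it assumes that a bare number means exactly that many repetitions and that a number followed by '+' means at least that many. The name ``parse_rep_prefix`` is hypothetical::

    import re

    # Optional number followed by an optional '+' or '*' modifier.
    _REP_PREFIX = re.compile(r'(\d+)?([+*])?')

    def parse_rep_prefix(prefix):
        """Return (min, max) repeat counts; max is None when unbounded."""
        m = _REP_PREFIX.fullmatch(prefix)
        if not prefix or m is None:
            raise ValueError("not a repeat prefix: {!r}".format(prefix))
        count, modifier = m.group(1), m.group(2)
        if modifier == '*':
            if count is not None:
                raise ValueError("'*' cannot be preceded by a number")
            return (0, None)                      # zero or more
        if modifier == '+':
            return (int(count) if count else 1, None)   # '+' alone is 1+
        return (int(count), int(count))           # assumed: exactly this many

    print(parse_rep_prefix("2"), parse_rep_prefix("+"), parse_rep_prefix("*"))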

*Any statement* - matches any one of the nested statements::

      {<<other statements>>}

The nested statements are tried in definition order until the first one matches.
An Any statement can contain Any, Tag or Rep statements. A Print statement is not allowed here.

*Tag statement* - the main matching statement, against which the HTML text is checked::

    <name attr="string" $aData[<<aData attrs>>] >
        <<filterStr>> $Data[<<Data attrs>>] <<other statements>>
    </name>

where:

   *name* is the tag name, such as 'div', 'li' or 'span'.

   *attr* is an optional attribute, which may be given multiple times, that must be
   present in the HTML tag for it to match. The real tag must contain all of the specified
   attributes, but may contain others as well. If an attribute in the HTML tag has no
   value, it must be specified in the rule with an empty string as its value.

   The class="" attribute is split on white space into a set, both when the rule is parsed
   and when the HTML is parsed. The rule's set must be a subset of the HTML tag's set to match.

   The style="" attribute is handled similarly, but it is split on ';', and each part has
   runs of white space collapsed to a single space and surrounding spaces stripped before
   it is added to the set.

   *$aData* is an optional attribute data definition, which may be given multiple times.
   It can also be indirect data ($*data[]). The definition follows below. A data variable
   name must match the regular expression ``'[a-z]+[a-z0-9_]*'``, ignoring case.

   *filterStr* is an optional string that filters on tag data. If it is enclosed in '/'
   marks, a regular expression match is performed against the tag data. If it is a plain
   string, a full match is performed (i.e. "My data" is equivalent to "/^My data$/").
   If the tag data does not match, the tag is considered not to match.

   *$Data* is an optional data collection definition, which may be given multiple times.
   It can be indirect data ($*data[]), to-first-tag data ($data[!]), or both.

A Tag statement can contain other statements (Rep, Any, Tag, Print) mixed with $Data definitions.
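
The attribute and filter rules described for the Tag statement can be pictured with a short Python sketch. It is an illustration of the stated rules only, not the library's implementation. The function names are invented for this example, and two points are assumptions: attributes other than class and style are compared for exact equality, and a '/'-enclosed filter is applied with search semantics::

    import re

    def class_set(value):
        # class values are split on white space
        return set(value.split())

    def style_set(value):
        # style values are split on ';', white space is collapsed and stripped
        parts = (re.sub(r'\s+', ' ', part).strip() for part in value.split(';'))
        return {part for part in parts if part}

    def attrs_match(rule_attrs, html_attrs):
        """True if the HTML tag carries every attribute the rule asks for."""
        for name, rule_value in rule_attrs.items():
            if name not in html_attrs:
                return False
            html_value = html_attrs[name] or ""   # valueless attribute -> ""
            if name == "class":
                if not class_set(rule_value) <= class_set(html_value):
                    return False
            elif name == "style":
                if not style_set(rule_value) <= style_set(html_value):
                    return False
            elif rule_value != html_value:        # assumed: exact comparison
                return False
        return True

    def filter_matches(filter_str, tag_data):
        """Plain strings need a full match; /.../ is treated as a regex."""
        if len(filter_str) >= 2 and filter_str[0] == filter_str[-1] == '/':
            return re.search(filter_str[1:-1], tag_data) is not None
        return tag_data == filter_str

    print(attrs_match({"class": "xref"}, {"class": "xref py py-mod"}))
    print(filter_matches("/^My/", "My data"))

The subset comparison is what lets a rule such as ``<code class="xref">`` match a tag that carries additional classes.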

*Print statement* - the only facility for outputting gathered data::

    :<<flags>> <<loopDef>>:<<"string">> $pData[<<pData attrs>>] <<Other print statements>> ;

where:

   *flags* is an optional set of print behavior modifiers: a string (no quotes) containing
   one or more flag letters. The following flags are defined::

      n - print new line after full print statement
      N - print new line after each individual loop of print statement
      s - separate each print value with space

   *loopDef* is an expression defining how many times the print body is executed. If not
   specified, it defaults to 1. If specified, it is evaluated at run time, depending
   on the real data. The loop counter runs from 0 up to loopDef. At run time the current
   loop counter can be accessed in index expressions as $0. Counters of outer loop statements
   are accessible as $1 for the first surrounding print statement, $2 for the second and so on,
   the last being the current statement itself (so the same as $0).

   *loopDef* can be one of the following:
         *indexExpr* - an arithmetic expression (see below) built from surrounding
         loop counters ($#), numbers, parentheses and the arithmetic operations '+', '-', '*'.

         *$Data* - the length of the Data array (note: no []).

         *$\*Data* - the length of the array whose name is stored in $Data.

   *string* is an optional string that will be printed.

   *pData* is the name of the data variable (it can be indirect: $*pData) from which data will be printed.
   The full definition is given below.

The string, pData and other print statements can be freely mixed inside the print statement body.

*indexExpr* - an arithmetic expression that can be used in a print statement loop definition and
   in a pData (print statement data) definition.
   It is built from surrounding loop counters ($#), numbers, parentheses and the
   arithmetic operations '+', '-', '*'.

   Valid indexExpr examples::

      3
      $2 + 1
      ($1 + 1) * 2
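
To make the evaluation of such expressions concrete, here is a minimal Python sketch that computes an indexExpr given the current loop counters. It is not how lieparse evaluates expressions internally. The function name ``eval_index_expr`` and the ``counters`` mapping (loop number to current value) are assumptions made for this example::

    import ast
    import operator
    import re

    _OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

    def eval_index_expr(expr, counters):
        """Evaluate an indexExpr; counters maps n to the value of $n."""
        # Substitute every $n with the (parenthesized) counter value.
        text = re.sub(r'\$(\d+)',
                      lambda m: "({})".format(counters[int(m.group(1))]),
                      expr).strip()

        def walk(node):
            if isinstance(node, ast.Expression):
                return walk(node.body)
            if (isinstance(node, ast.Constant) and isinstance(node.value, int)
                    and not isinstance(node.value, bool)):
                return node.value
            if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
                return _OPS[type(node.op)](walk(node.left), walk(node.right))
            if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
                return -walk(node.operand)
            raise ValueError("unsupported element in index expression")

        return walk(ast.parse(text, mode="eval"))

    print(eval_index_expr("($1 + 1) * 2", {1: 3}))   # (3 + 1) * 2

Walking the syntax tree instead of calling ``eval`` keeps the sketch limited to numbers, parentheses and the three operators named above.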

*Data* statements can appear inside a Tag definition (aData), inside a Tag body (dData and xData)
   and inside a print statement (pData). A Data reference (without []) can appear in a print loopDef.
   pData cannot be modified: information is only retrieved from the named variable.
   The other types of Data are used to collect data from the HTML text.
   All data variables are arrays. After definition (even if it occurs with a '+' sign) the array
   pointer is 0. The pointer can be incremented by a '+' sign in the variable attributes. The pointer
   can never be decremented. A '-' sign in the attributes clears the variable data at the current
   index, leaving the index unchanged.
   A '!' in the attributes defines xData instead of dData.
   Variables can be direct::

      $<<name>>[<<attr>>] - defines variable named <<name>>

   and indirect::

      $*<<name>>[<<attr>>] - here the name of the variable is kept in the last element of array $<<name>>[]

   Only one level of indirection is allowed.
   The form of <<attr>> and its behavior differ depending on the variable scope (aData, dData, xData or pData).
   However, the same named variable refers to the same data in all scopes.
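
The pointer behavior and the indirection described above can be modelled with a few lines of Python. This is a simplified mental model rather than the library's data structure: the class ``DataVar``, the ``store`` dictionary and the function ``resolve`` are hypothetical names, and the sketch assumes that data collected at one index is simply concatenated::

    class DataVar:
        """Toy model of a data variable: an array plus a forward-only pointer."""

        def __init__(self):
            self.items = [""]      # element 0 exists from the start
            self.index = 0         # the pointer starts at 0
            self.defined = False

        def collect(self, data, flag=""):
            if flag == "+" and self.defined:
                # '+' advances the pointer, except on the first definition
                self.index += 1
                self.items.append("")
            elif flag == "-":
                # '-' clears the data at the current index, index unchanged
                self.items[self.index] = ""
            self.defined = True
            self.items[self.index] += data   # assumed: plain concatenation

    def resolve(store, name, indirect=False):
        """Look up a variable; for $*name the real name is the last element of $name."""
        if indirect:
            name = store[name].items[-1]
        return store[name]

    names = DataVar()
    names.collect("os", "+")       # first definition: index stays 0
    names.collect("sys", "+")      # index moves to 1
    pointer = DataVar()
    pointer.collect("names")
    store = {"names": names, "pointer": pointer}
    print(resolve(store, "pointer", indirect=True).items)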

*For aData, dData and xData:*

   *<<attr>>* consists of an optional flag with the value '!', '+' or '-', followed by optional
   space-separated strings.

   If the flag is:

      '!' - an xData type variable is defined. Valid only for variables inside a Tag body.

      '+' - the index value is incremented before other operations. The exception is when the
      variable is defined for the first time: in that case the index is left at 0.

      '-' - all data accumulated in the variable at the current index is cleared before other operations.

      When no flag is present, data is appended to the variable at the current index.

   A string can be enclosed in double quotation marks, which allows strings containing spaces.
   If no strings are defined, the passed data is simply added to the variable.

   A string can be:

      */<<match>>/*         - if the passed data does not match the regular expression, it is
      ignored and the other strings are not processed.

      */<<find>>/<<repl>>/* - if the *<<find>>* regular expression matches the passed data, the match
      is replaced with *<<repl>>* and the resulting data is added to the variable. On no match
      the data is ignored. Other strings are still processed, with all data passed to them.

      *@<<attrName>>*       - the value of the specified Tag attribute is added to the variable.

      *<<otherString>>*     - the specified string is added to the variable.

   *Data passed to variables is:*

      *aData* - all Tag attributes with their names, in the form name="value". If there are
      class values, they are passed as separate class="value" pairs.

      *dData* - all accumulated data in this and above Tag levels.

      *xData* - all accumulated data up to first sub-tag match.

   *For pData*:

      *<<attr>>* can take one of the following forms:
         <<indexExpr0>>;<<indexExpr>> <<regexps>>  - for indirect variables only, or

         <<indexExpr>> <<regexps>>                - for all variables

            *<<indexExpr>>* - an optional array index of the value to be printed. If not
            specified, it defaults to $0.

            *<<indexExpr0>>* - an optional index into the parent array, from which the variable
            name is taken. Defaults to $0.

            *<<regexps>>* - optional regular expressions in the form /<<find>>/<<repl>>/.
            All expressions are applied to the data value before printing, in order of appearance.
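
The effect of chaining several pData regular expressions can be sketched in Python as follows. The helper ``apply_regexps`` is a hypothetical name, not a lieparse function, and the sketch makes two simplifying assumptions: the find and replacement parts contain no '/' characters, and every occurrence of a match is replaced (the behavior of ``re.sub``)::

    import re

    def apply_regexps(value, regexps):
        """Apply /find/repl/ expressions to value, in order of appearance."""
        for spec in regexps:
            if not (spec.startswith('/') and spec.endswith('/')):
                raise ValueError("expected /find/repl/, got {!r}".format(spec))
            # Assumes neither part contains a '/' character.
            find, repl = spec[1:-1].split('/')
            value = re.sub(find, repl, value)
        return value

    # The second expression sees the output of the first one.
    print(apply_regexps("hello world", ["/o/0/", "/l+/L/"]))

Since each expression receives the result of the previous one, the order in which the expressions are written matters.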

**ADVANCED EXAMPLE**

We will retrieve the Python library module names from https://docs.python.org/3.6/py-modindex.html::

    import sys
    from lieparse import lieParser
    from pycurl import Curl, global_init, global_cleanup, GLOBAL_ALL
    usragent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0"
    url = "https://docs.python.org/3.6/py-modindex.html"
    rules = r'''
    <table class="indextable modindextable">
        *<code class="xref">
            $Data[+]
        </code>
    </table>
    :N $Data:$Data[];     # if flags are ns we will have space separated list
    '''

    global_init(GLOBAL_ALL)            # GLOBAL_ALL is imported by name above
    c = Curl()
    c.setopt(c.USERAGENT, usragent)
    c.setopt(c.SSL_VERIFYPEER, 0)      # have problems verifying certificate under Windows
    c.setopt(c.URL, url)
    s = c.perform_rs()                 # fetch the page and return its body as a string
    c.close()
    global_cleanup()

    parser = lieParser(rules)
    parser.feed(s)
    v = parser.close()
    if v != 0:
        print("Unmatched {} items".format(v), file=sys.stderr)

:Author: Vidmantas Balčytis <vidma@lema.lt>
:Version: 1.0.4 (2020.01.09)



