Metadata-Version: 2.1
Name: haruka_parser
Version: 0.4.9
Summary: A simple HTML Parser
Home-page: https://github.com/prnake/haruka-parser
Author: papersnake
Author-email: prnake@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: Markup :: HTML
Classifier: Topic :: Utilities
Description-Content-Type: text/markdown
Requires-Dist: py_asciimath
Requires-Dist: inscriptis
Requires-Dist: tabulate
Requires-Dist: numpy
Requires-Dist: resiliparse
Requires-Dist: ftfy
Requires-Dist: faust-cchardet
Requires-Dist: lxml[html_clean]
Requires-Dist: courlan
Requires-Dist: html2text
Requires-Dist: charset_normalizer

# Haruka Parser

A simple HTML parser that extracts text from HTML pages, including math written with MathJax, MathML, or AsciiMath.

## Install

```bash
pip install haruka-parser
```

## Usage

```python3
from haruka_parser.extract import extract_text

# Raw string, so the backslash in \pi is kept literally instead of being read as an escape.
html = r"""<!DOCTYPE html>
<html>
<body>
<!-- Using MathJax -->
<p>Using MathJax:</p>
<script type="math/tex; mode=display" id="MathJax-Element-1">{e}^{i\pi }=-1</script>
<!-- Using MathML -->
<p>Using MathML:</p>
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <msup>
    <mi>e</mi>
    <mrow>
      <mi>i</mi>
      <mi>&#x03C0;</mi>
    </mrow>
  </msup>
  <mo>=</mo>
  <mn>-1</mn>
</math>

<!-- Using AsciiMath -->
<p>Using AsciiMath:</p>
<script type="math/asciimath">
e^(i*pi) = -1
</script>

</body>
</html>"""

# extract_text returns the extracted text together with an accompanying info value.
text, info = extract_text(html)
print(text)
print(info)
```

## Configurations

```python3
from haruka_parser.extract import DEFAULT_CONFIG

# The default values are listed here for reference.
DEFAULT_CONFIG = {
    "readability": False,
    "skip_large_links": False,
    "extract_latex": True,
    "extract_cnki_latex": False,
    "escape_dollars": True,
    "remove_buttons": True,
    "remove_edit_buttons": True,
    "remove_image_figures": True,
    "markdown_code": True,
    "markdown_headings": True,
    "remove_chinese": False,
    "boilerplate_config": {
        "enable": False,
        "ratio_threshold": 0.18,
        "absolute_threshold": 10,
        "end_threshold": 15,
    },
}
```
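
Because `boilerplate_config` is a nested dict, a shallow copy of the defaults would share that inner dict with the original. The sketch below shows one way to derive a custom configuration without mutating the defaults. It is illustrative only: `with_overrides` is a hypothetical helper, not part of haruka_parser, and `defaults` repeats a subset of the values listed above so that the example runs on its own. How a configuration is passed to `extract_text` is not covered here, so check the function's signature before relying on it.

```python
import copy

# Subset of the defaults listed above, repeated so this sketch is self-contained.
defaults = {
    "readability": False,
    "extract_latex": True,
    "boilerplate_config": {
        "enable": False,
        "ratio_threshold": 0.18,
        "absolute_threshold": 10,
        "end_threshold": 15,
    },
}


def with_overrides(base, overrides):
    """Hypothetical helper: return a deep copy of `base` with `overrides` merged in.

    Nested dicts are merged key by key, so overriding one nested key keeps the others.
    """
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = with_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


custom = with_overrides(
    defaults,
    {"readability": True, "boilerplate_config": {"enable": True}},
)
print(custom)
```

The original `defaults` dict is left unchanged, so several configurations can be derived from it side by side.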
