Metadata-Version: 2.1
Name: selectolax
Version: 0.3.32
Summary: Fast HTML5 parser with CSS selectors.
Home-page: https://github.com/rushter/selectolax
Author: Artem Golubin
Author-email: Artem Golubin <me@rushter.com>
License: MIT
Project-URL: Repository, https://github.com/rushter/selectolax
Project-URL: Documentation, https://selectolax.readthedocs.io/en/latest/parser.html
Project-URL: Changelog, https://github.com/rushter/selectolax/blob/main/CHANGES.rst
Keywords: selectolax,html,parser,css,fast
Classifier: Development Status :: 5 - Production/Stable
Classifier: Topic :: Text Processing :: Markup :: HTML
Classifier: Topic :: Internet
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.9
Description-Content-Type: text/x-rst
Provides-Extra: cython
License-File: LICENSE

.. image:: docs/logo.png
  :alt: selectolax logo

-------------------------

.. image:: https://img.shields.io/pypi/v/selectolax.svg
        :target: https://pypi.python.org/pypi/selectolax

A fast HTML5 parser with CSS selectors, built on the `Modest <https://github.com/lexborisov/Modest/>`_ and
`Lexbor <https://github.com/lexbor/lexbor>`_ engines.


Installation
------------
From PyPI using pip:

.. code-block:: bash

        pip install selectolax

If installation fails due to compilation errors, you may need to install `Cython <https://github.com/cython/cython>`_:

.. code-block:: bash

        pip install selectolax[cython]

This usually happens when you try to install an outdated version of selectolax on a newer version of Python.


Development version from GitHub:

.. code-block:: bash

        git clone --recursive https://github.com/rushter/selectolax
        cd selectolax
        pip install -r requirements_dev.txt
        python setup.py install

How to compile selectolax while developing:

.. code-block:: bash

    make clean
    make dev

Basic examples
--------------

Here are some basic examples to get you started with selectolax:

Parsing HTML and extracting text:

.. code:: python

    In [1]: from selectolax.parser import HTMLParser
       ...:
       ...: html = """
       ...: <h1 id="title" data-updated="20201101">Hi there</h1>
       ...: <div class="post">Lorem Ipsum is simply dummy text of the printing and typesetting industry. </div>
       ...: <div class="post">Lorem ipsum dolor sit amet, consectetur adipiscing elit.</div>
       ...: """
       ...: tree = HTMLParser(html)

    In [2]: tree.css_first('h1#title').text()
    Out[2]: 'Hi there'

    In [3]: tree.css_first('h1#title').attributes
    Out[3]: {'id': 'title', 'data-updated': '20201101'}

    In [4]: [node.text() for node in tree.css('.post')]
    Out[4]:
    ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. ',
     'Lorem ipsum dolor sit amet, consectetur adipiscing elit.']

Using advanced CSS selectors:

.. code:: python

    In [1]: html = "<div><p id=p1><p id=p2><p id=p3><a>link</a><p id=p4><p id=p5>text<p id=p6></div>"
       ...: selector = "div > :nth-child(2n+1):not(:has(a))"

    In [2]: for node in HTMLParser(html).css(selector):
       ...:     print(node.attributes, node.text(), node.tag)
       ...:     print(node.parent.tag)
       ...:     print(node.html)
       ...:
    {'id': 'p1'}  p
    div
    <p id="p1"></p>
    {'id': 'p5'} text p
    div
    <p id="p5">text</p>


* `Detailed overview <https://github.com/rushter/selectolax/blob/master/examples/walkthrough.ipynb>`_

Available backends
------------------

Selectolax supports two backends: ``Modest`` and ``Lexbor``. Unless stated otherwise, the examples here use the Modest backend.
The two backends offer nearly identical features, but some differences remain.

As of 2024, the preferred backend is ``Lexbor``. The ``Modest`` backend remains available for compatibility reasons,
but the underlying C library it relies on is no longer maintained.


To use ``Lexbor``, import ``LexborHTMLParser`` and use it in much the same way as ``HTMLParser``:

.. code:: python

    In [1]: from selectolax.lexbor import LexborHTMLParser

    In [2]: html = """
       ...: <title>Hi there</title>
       ...: <div id="updated">2021-08-15</div>
       ...: """

    In [3]: parser = LexborHTMLParser(html)

    In [4]: parser.root.css_first("#updated").text()
    Out[4]: '2021-08-15'


Simple Benchmark
----------------

* Extract the title, links, scripts and a meta tag from the main pages of the top 754 domains. See ``examples/benchmark.py`` for more information.

============================ ===========
Package                       Time
============================ ===========
Beautiful Soup (html.parser)  61.02 sec.
lxml / Beautiful Soup (lxml)  9.09 sec.
html5_parser                  16.10 sec.
selectolax (Modest)           2.94 sec.
selectolax (Lexbor)           2.39 sec.
============================ ===========

Links
-----

*  `selectolax API reference <https://selectolax.readthedocs.io/en/latest/index.html>`_
*  `Video introduction to web scraping using selectolax <https://youtu.be/HpRsfpPuUzE>`_
*  `How to Scrape 7k Products with Python using selectolax and httpx <https://www.youtube.com/watch?v=XpGvq755J2U>`_
*  `Detailed overview <https://github.com/rushter/selectolax/blob/master/examples/walkthrough.ipynb>`_
*  `Modest introduction <https://lexborisov.github.io/Modest/>`_
*  `Modest benchmark <https://lexborisov.github.io/benchmark-html-parsers/>`_
*  `Python benchmark <https://rushter.com/blog/python-fast-html-parser/>`_
*  `Another Python benchmark <https://www.peterbe.com/plog/selectolax-or-pyquery>`_

License
-------

* Modest engine - `LGPL2.1 <https://github.com/lexborisov/Modest/blob/master/LICENSE>`_
* selectolax - `MIT <https://github.com/rushter/selectolax/blob/master/LICENSE>`_
