Metadata-Version: 2.4
Name: memorious4
Version: 4.0.0
Summary: A minimalistic, recursive web crawling library for Python.
License-Expression: MIT
License-File: LICENSE
License-File: NOTICE
Author: Organized Crime and Corruption Reporting Project
Author-email: data@occrp.org
Requires-Python: >=3.11,<3.14
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Provides-Extra: ftp
Provides-Extra: postgres
Provides-Extra: redis
Provides-Extra: sql
Requires-Dist: PySocks (==1.7.1) ; extra == "ftp"
Requires-Dist: alephclient (>=2.6.0,<3.0.0)
Requires-Dist: anystore (>=1.0.1,<2.0.0)
Requires-Dist: banal (>=1.0.6,<2.0.0)
Requires-Dist: dateparser (>=1.2.1,<2.0.0)
Requires-Dist: fakeredis (>=2.26.2,<3.0.0) ; extra == "redis"
Requires-Dist: followthemoney (>=4.5.1,<5.0.0)
Requires-Dist: ftm-lakehouse (>=0.2.0,<0.3.0)
Requires-Dist: ftmq (>=4.5.2,<5.0.0)
Requires-Dist: furl (>=2.1.0,<3.0.0)
Requires-Dist: httpx (>=0.28.1,<0.29.0)
Requires-Dist: jinja2 (>=3.0.0,<4.0.0)
Requires-Dist: jq (>=1.6.0,<2.0.0)
Requires-Dist: legacy-cgi (>=2.6.3,<3.0.0)
Requires-Dist: lxml[html-clean] (>=6.0.2,<7.0.0)
Requires-Dist: normality (>=3.0.2,<4.0.0)
Requires-Dist: openaleph-procrastinate (>=5.2.0,<6.0.0)
Requires-Dist: psycopg2 (>=2.9.10,<3.0.0) ; extra == "postgres"
Requires-Dist: python-dateutil (>=2.9.0.post0,<3.0.0)
Requires-Dist: redis (>=4.0.0,<6.0.0) ; extra == "redis"
Requires-Dist: requests-ftp (>=0.3.1,<0.4.0) ; extra == "ftp"
Requires-Dist: requests[security] (>=2.32.3,<3.0.0) ; extra == "ftp"
Requires-Dist: rigour (>=1.6.2,<2.0.0)
Requires-Dist: sqlalchemy (>=2.0.36,<3.0.0) ; extra == "postgres"
Requires-Dist: sqlalchemy (>=2.0.36,<3.0.0) ; extra == "sql"
Requires-Dist: stringcase (>=1.2.0,<2.0.0)
Project-URL: Documentation, https://docs.investigraph.dev/lib/memorious
Project-URL: Homepage, https://docs.investigraph.dev/lib/memorious
Project-URL: Issues, https://github.com/dataresearchcenter/memorious/issues
Project-URL: Repository, https://github.com/dataresearchcenter/memorious
Description-Content-Type: text/x-rst

=========
Memorious
=========

    The solitary and lucid spectator of a multiform, instantaneous and almost intolerably precise world.

    -- `Funes the Memorious <http://users.clas.ufl.edu/burt/spaceshotsairheads/borges-funes.pdf>`_,
    Jorge Luis Borges

.. image:: https://github.com/alephdata/memorious/workflows/memorious/badge.svg

``memorious`` is a lightweight web scraping toolkit. It supports scrapers that
collect structured or unstructured data. Its goals are to:

* Make crawlers modular and simple tasks reusable
* Provide utility functions for common tasks such as data storage and HTTP session management
* Integrate crawlers with the Aleph and FollowTheMoney ecosystem
* Get out of your way as much as possible

Design
------

When writing a scraper, you often need to paginate through an index
page, then download an HTML page for each result and finally parse that page
and insert or update a record in a database.

``memorious`` handles this by managing a set of ``crawlers``, each of which
can be composed of multiple ``stages``. Each ``stage`` is implemented using a
Python function, which can be reused across different ``crawlers``.
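
The idea of composing a crawler from reusable stage functions can be sketched
in plain Python. This is a minimal illustration only and does not use the
``memorious`` API: the stage signature, the ``run_crawler`` helper, the stage
names and the ``example.org`` URL are all assumptions made for this example.

```python
from collections import deque


# Hypothetical stage functions (NOT the memorious API): each takes a data
# dict and yields (next_stage_name, data) pairs for the stages that follow.
def index(data):
    # Paginate through an index and hand each page to the "fetch" stage.
    for page in range(1, data["pages"] + 1):
        yield "fetch", {"url": f"https://example.org/results?page={page}"}


def fetch(data):
    # A real stage would perform an HTTP request; this one fakes the body.
    yield "store", {**data, "html": f"<html>{data['url']}</html>"}


def make_store(records):
    # Build a terminal stage that appends each item to the given list.
    def store(data):
        records.append(data)
        return ()  # nothing to pass on to a later stage

    return store


def run_crawler(stages, first_stage, data):
    # Process queued (stage, data) pairs until no stage emits more work.
    queue = deque([(first_stage, data)])
    while queue:
        name, payload = queue.popleft()
        for next_name, next_payload in stages[name](payload):
            queue.append((next_name, next_payload))


records = []
stages = {"index": index, "fetch": fetch, "store": make_store(records)}
run_crawler(stages, "index", {"pages": 2})
```

Because each stage only depends on the data it receives, a function such as
``fetch`` in this sketch could be wired into a second crawler with a different
index stage without modification.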

The basic steps of writing a Memorious crawler:

1. Create a YAML crawler configuration file
2. Add different stages
3. Write code for stage operations (optional)
4. Test, rinse, repeat

Documentation
-------------

The documentation for Memorious is available at
`docs.investigraph.dev/lib/memorious <https://docs.investigraph.dev/lib/memorious>`_.
Feel free to edit the source files in the ``docs`` folder and send pull requests for improvements.

To serve the documentation locally, run ``mkdocs serve``.

