Metadata-Version: 2.1
Name: bhfutils
Version: 0.1.15
Summary: Utilities that are used by any spider of Behoof project
Home-page: https://behoof.app/
Author: Teplygin Vladimir
Author-email: vvteplygin@gmail.com
License: MIT
Keywords: behoof,scrapy-cluster,utilities
Description-Content-Type: text/x-rst
Requires-Dist: python-json-logger ==0.1.8
Requires-Dist: redis >=4.0.2
Requires-Dist: kazoo >=2.8.0
Requires-Dist: mock >=4.0.3
Requires-Dist: playwright >=1.17.2
Requires-Dist: testfixtures >=6.18.3
Requires-Dist: ujson >=4.3.0
Requires-Dist: future >=0.18.2
Provides-Extra: all
Requires-Dist: python-json-logger ==0.1.8 ; extra == 'all'
Requires-Dist: redis >=4.0.2 ; extra == 'all'
Requires-Dist: kazoo >=2.8.0 ; extra == 'all'
Requires-Dist: mock >=4.0.3 ; extra == 'all'
Requires-Dist: playwright >=1.17.2 ; extra == 'all'
Requires-Dist: testfixtures >=6.18.3 ; extra == 'all'
Requires-Dist: ujson >=4.3.0 ; extra == 'all'
Requires-Dist: future >=0.18.2 ; extra == 'all'
Requires-Dist: mock >=2.0.0 ; extra == 'all'
Requires-Dist: testfixtures >=4.13.5 ; extra == 'all'
Provides-Extra: docs
Requires-Dist: sphinx ; extra == 'docs'
Requires-Dist: mock >=2.0.0 ; extra == 'docs'
Requires-Dist: testfixtures >=4.13.5 ; extra == 'docs'
Provides-Extra: lint
Requires-Dist: pep8 ; extra == 'lint'
Requires-Dist: pyflakes ; extra == 'lint'
Provides-Extra: test
Requires-Dist: mock >=2.0.0 ; extra == 'test'
Requires-Dist: testfixtures >=4.13.5 ; extra == 'test'

******************************
Behoof Scrapy Cluster Template
******************************

Overview
--------

The ``bhfutils`` package is a collection of utilities shared by every spider in the Behoof project.

Requirements
------------

- Unix-based machine (Linux or OS X)
- Python 2.7 or 3.6

Installation
------------

Inside a virtualenv, run ``pip install -U bhfutils``. This installs the latest version of the Behoof Scrapy Cluster spider utilities. After that, you can use a special ``settings.py`` that is compatible with Scrapy Cluster (a template is located at ``crawler/setting_template.py``).

Documentation
-------------

There is no full documentation for the ``bhfutils`` package. The modules it provides are summarized below.

custom_cookies.py
==================

The ``custom_cookies`` module is a custom cookies middleware that passes the required cookies along with each request but does not persist them between calls.
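
The "pass along but do not persist" behaviour can be pictured with a small sketch. This is not the middleware's code: ``apply_cookies`` is a hypothetical helper, and plain dictionaries stand in for Scrapy request headers. The point is only that each call builds its ``Cookie`` header from the cookies supplied with that call, and nothing is stored in a shared cookie jar.

```python
def apply_cookies(headers, request_cookies):
    """Return a copy of `headers` with a Cookie header built for this call only.

    Hypothetical helper: no state is kept, so cookies from one call
    never leak into the next.
    """
    result = dict(headers)
    if request_cookies:
        result["Cookie"] = "; ".join(
            f"{name}={value}" for name, value in request_cookies.items()
        )
    return result


# The first call carries its own cookies; the second carries none.
first = apply_cookies({}, {"session": "abc", "lang": "en"})
second = apply_cookies({}, {})
```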

distributed_scheduler.py
========================

The ``distributed_scheduler`` module is a Scrapy request scheduler that uses Redis throttled priority queues to moderate scrape requests for different domains within a distributed Scrapy cluster.
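
The idea of a throttled priority queue can be illustrated with a minimal in-memory sketch. It is not the package's implementation: ``ThrottledPriorityQueue``, its ``hits`` and ``window`` parameters, and the injectable ``clock`` are hypothetical names, and a local heap stands in for the Redis-backed queue.

```python
import heapq
import itertools
import time
from collections import deque


class ThrottledPriorityQueue:
    """Release at most `hits` items per `window` seconds, highest priority first."""

    def __init__(self, hits, window, clock=time.monotonic):
        self.hits = hits
        self.window = window
        self.clock = clock
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order
        self._released = deque()  # timestamps of recent pops

    def push(self, item, priority):
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._order), item))

    def pop(self):
        """Return the next item, or None if the queue is empty or throttled."""
        now = self.clock()
        # Forget pops that have fallen out of the sliding window.
        while self._released and now - self._released[0] >= self.window:
            self._released.popleft()
        if not self._heap or len(self._released) >= self.hits:
            return None
        self._released.append(now)
        return heapq.heappop(self._heap)[2]
```

In this sketch a scheduler would keep one such queue per domain, so that a busy domain cannot starve the others or exceed its own request rate.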

redis_domain_max_page_filter.py
===============================

The ``redis_domain_max_page_filter`` module is a Redis-based max page filter. The filter is applied per domain and bounds the maximum number of pages crawled for any particular domain.
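
A per-domain page budget can be sketched as follows. This is an illustration under stated assumptions, not the module's code: ``DomainMaxPageFilter`` and ``max_pages`` are hypothetical names, and an in-memory ``Counter`` stands in for the counters that would be kept in Redis.

```python
from collections import Counter
from urllib.parse import urlparse


class DomainMaxPageFilter:
    """Reject requests once a domain has used up its page budget."""

    def __init__(self, max_pages):
        self.max_pages = max_pages
        self._pages = Counter()  # stand-in for per-domain counters in Redis

    def request_seen(self, url):
        """Return True if the request should be dropped."""
        domain = urlparse(url).hostname or ""
        if self._pages[domain] >= self.max_pages:
            return True
        self._pages[domain] += 1
        return False
```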

redis_dupefilter.py
===================

The ``redis_dupefilter`` module is a Redis-based request duplication filter.
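
Request de-duplication usually comes down to fingerprinting each request and remembering the fingerprints already seen. The sketch below assumes a fingerprint built from the HTTP method and URL only. ``SetDupeFilter`` is a hypothetical name, and a Python ``set`` stands in for a Redis set.

```python
import hashlib


class SetDupeFilter:
    """Drop any request whose fingerprint has been seen before."""

    def __init__(self):
        self._seen = set()  # stand-in for a Redis set

    @staticmethod
    def fingerprint(method, url):
        # Assumption: method + URL identify a request; real filters may hash more.
        data = f"{method.upper()} {url}".encode("utf-8")
        return hashlib.sha1(data).hexdigest()

    def request_seen(self, method, url):
        """Return True if an identical request was already recorded."""
        fp = self.fingerprint(method, url)
        if fp in self._seen:
            return True
        self._seen.add(fp)
        return False
```

With Redis, the membership check and the insert could be collapsed into a single ``SADD`` call, which returns the number of members actually added.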

redis_global_page_per_domain_filter.py
======================================

The ``redis_global_page_per_domain_filter`` module is a Redis-based filter on the number of requests. When this filter is enabled, ``GLOBAL_PAGE_PER_DOMAIN_LIMIT`` acts as a hard limit on the maximum number of pages that any crawl job may crawl for each individual spiderid+domain+crawlid combination.
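
What sets this filter apart is the key the limit is attached to. A minimal sketch, assuming an in-memory ``Counter`` in place of Redis and an illustrative limit value: ``allow_page`` is a hypothetical helper, and only the ``GLOBAL_PAGE_PER_DOMAIN_LIMIT`` name comes from the description above.

```python
from collections import Counter

GLOBAL_PAGE_PER_DOMAIN_LIMIT = 2  # illustrative value, not a package default

_counts = Counter()  # stand-in for counters stored in Redis


def allow_page(spiderid, domain, crawlid, limit=GLOBAL_PAGE_PER_DOMAIN_LIMIT):
    """Return True while this spiderid+domain+crawlid is under the hard limit."""
    key = (spiderid, domain, crawlid)
    if _counts[key] >= limit:
        return False
    _counts[key] += 1
    return True
```

Because the crawl ID is part of the key, two crawl jobs that target the same domain with the same spider each get their own budget.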
