Metadata-Version: 2.0
Name: floscraper
Version: 0.2.0
Summary: Simple webscraper built on top of requests and beautifulsoup
Home-page: https://github.com/the01/python-floscraper
Author: the01
Author-email: jungflor@gmail.com
License: MIT License
Description-Content-Type: UNKNOWN
Keywords: floscraper scraping web cache requests beautifulsoup
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Requires-Dist: wheel (>=0.30.0)
Requires-Dist: flotils (<0.4.0,>=0.3.5a0)
Requires-Dist: python-dateutil (<2.7,>=2.6.1)
Requires-Dist: beautifulsoup4 (<4.7,>=4.6.0)
Requires-Dist: chardet (<4.0,>=3.0.4)
Requires-Dist: portalocker (<1.1,>=0.5.5)
Requires-Dist: requests (<3.0,>=2.18.4)
Requires-Dist: requests-toolbelt (<0.9,>=0.8)
Requires-Dist: html2text (>=2017.10.4)

FLOSCRAPER
##########

A basic webscraper I use in many projects.

.. image:: https://img.shields.io/pypi/v/floscraper.svg
    :target: https://pypi.python.org/pypi/floscraper

.. image:: https://img.shields.io/pypi/l/floscraper.svg
    :target: https://pypi.python.org/pypi/floscraper

.. image:: https://img.shields.io/pypi/dm/floscraper.svg
    :target: https://pypi.python.org/pypi/floscraper


webscraper
==========
Module to simplify web requests and scraping.

**Supports**

* Cached web requests (Wrapper around requests)
* Built-in parsing/scraping (Wrapper around beautifulsoup)


**Constructor parameters**

* url: Default url, used if nothing else is specified
* scheme: Default scheme for scraping
* timeout: Request timeout
* cache_directory: Where to save cache files
* cache_time: How long a cached resource stays valid - in seconds (default: 7 minutes)
* cache_use_advanced
* auth_method: Authentication method (default: HTTPBasicAuth)
* auth_username: Authentication username. If set, enables authentication
* auth_password: Authentication password
* handle_redirect: Allow redirects (default: True)
* user_agent: User agent to use
* default_user_agents_browser: Browser to set in user agent (from ``default_user_agents`` dict)
* default_user_agents_os: Operating system to set in user agent (from ``default_user_agents`` dict)
* user_agents_browser: Browser to set in user agent (overrides ``default_user_agents_browser``)
* user_agents_os: Operating system to set in user agent (overrides ``default_user_agents_os``)
* html2text: HTML2text settings
* html_parser: What html parser to use (default: html.parser - built in)
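For instance, caching and authentication can be combined through these settings. This is a minimal sketch based on the parameter list above; the values are illustrative, and the import path is an assumption that may differ between versions:

.. code-block:: python

    # Settings sketch mirroring the documented constructor parameters.
    settings = {
        "url": "https://example.com/",  # default url if none is given later
        "timeout": 10,                  # request timeout
        "auth_username": "user",        # setting a username enables authentication
        "auth_password": "secret",      # used with the default HTTPBasicAuth
        "handle_redirect": True,
        "cache_directory": "cache",     # where to save cache files
        "cache_time": 7 * 60,           # the documented default: 7 minutes
    }

    try:
        # Assumed import path; adjust to the installed floscraper version
        from floscraper.webscraper import WebScraper
        web = WebScraper(settings)
    except ImportError:
        web = None  # floscraper not installed in this environment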


**Example**

.. code-block:: python

    # Import path may differ between versions
    from floscraper.webscraper import WebScraper

    # Setup WebScraper with caching
    web = WebScraper({
        'cache_directory': "cache",
        'cache_time': 5*60
    })

    # First call to github.com -> hit internet
    web.get("https://github.com/")

    # Second call to github.com (within 5 minutes of the first) -> hit cache
    web.get("https://github.com/")

Which results in the following output:

::

    2016-01-07 19:22:00 DEBUG   [WebScraper._getCached] From inet https://github.com
    2016-01-07 19:22:00 INFO    [requests.packages.urllib3.connectionpool] Starting new HTTPS connection (1): github.com
    2016-01-07 19:22:01 DEBUG   [requests.packages.urllib3.connectionpool] "GET / HTTP/1.1" 200 None
    2016-01-07 19:22:01 DEBUG   [WebScraper._getCached] From cache https://github.com


.. :changelog:

History
=======

0.2.0 (2017-10-12)
------------------

* Rework api names
* Redesign caching


0.1.15a0 (2016-03-08)
---------------------

* First release on PyPI.


