Metadata-Version: 2.0
Name: scrape
Version: 0.3.0
Summary: a command-line web scraping and crawling tool
Home-page: https://github.com/huntrar/scrape
Author: Hunter Hammond
Author-email: huntrar@gmail.com
License: MIT
Keywords: scrape entire webpage website pdf text keyword crawl save page filter regex lxml html download downloader
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Environment :: Web Environment
Classifier: Intended Audience :: End Users/Desktop
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
Requires-Dist: lxml
Requires-Dist: pdfkit
Requires-Dist: requests

# scrape

## a command-line web scraping and crawling tool
scrape is a command-line tool for extracting webpage content as text, pdf, or plain html. A crawling mechanism allows scrape to follow linked pages, either freely or according to a set of keyword regexps, making it quick and easy to scrape an entire website. scrape can extract the content of any tag attribute, such as href for links or text for plain text. Text can be filtered in a grep-like manner, saving you an extra step!

## Installation
* `pip install scrape`
* [Installing wkhtmltopdf](https://github.com/pdfkit/pdfkit/wiki/Installing-WKHTMLTOPDF)

## Usage
    usage: scrape.py [-h] [-r [READ [READ ...]]]
                     [-a [ATTRIBUTES [ATTRIBUTES ...]]] [-c [CRAWL [CRAWL ...]]]
                     [-ca] [-f [FILTER [FILTER ...]]] [-ht] [-l LIMIT] [-n] [-p]
                     [-q] [-t] [-v]
                     [urls [urls ...]]

    a command-line web scraping and crawling tool

    positional arguments:
      urls                  url(s) to scrape

    optional arguments:
      -h, --help            show this help message and exit
      -r [READ [READ ...]], --read [READ [READ ...]]
                            read in local html file(s)
      -a [ATTRIBUTES [ATTRIBUTES ...]], --attributes [ATTRIBUTES [ATTRIBUTES ...]]
                            tag attribute(s) for extracting lines of text, default
                            is text
      -c [CRAWL [CRAWL ...]], --crawl [CRAWL [CRAWL ...]]
                            regexp(s) to match links to crawl
      -ca, --crawl-all      crawl all links
      -f [FILTER [FILTER ...]], --filter [FILTER [FILTER ...]]
                            regexp(s) to filter lines of text
      -ht, --html           save output as html
      -l LIMIT, --limit LIMIT
                            set page crawling limit
      -n, --nonstrict       set crawler to visit other websites
      -p, --pdf             save output as pdf
      -q, --quiet           suppress output
      -t, --text            save output as text, default
      -v, --version         display current version
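
For example, assuming the `scrape` command installed by pip and using example.com purely as a placeholder URL, a page can be saved as text (the default) or as a pdf:

    scrape http://www.example.com
    scrape http://www.example.com --pdf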

## Author
* Hunter Hammond (huntrar@gmail.com)

## Notes
* Pages are converted to text by default; you can specify --html or --pdf to save in a different format.

* If saving to text, lines may be filtered for keywords by passing one or more regexps to --filter.

* Also when saving to text, you may choose which tag attributes to extract from the page using --attributes. The default is to extract only text attributes, but you can specify one or more other attributes (such as href, src, title, or any other available attribute).

* Pages are saved temporarily as PART%d.html files during processing. These files are removed automatically if saving to text or pdf.

* Entire websites can be downloaded using the --crawl-all flag or by passing one or more regexps to --crawl, which filter the list of crawlable URLs (see the examples following these notes).

* If you want the crawler to follow links outside of the given URL's domain, use --nonstrict.

* Crawling can be stopped by Ctrl-C or by setting the number of pages to be crawled using --limit.
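
The invocations below are illustrative sketches using example.com as a placeholder URL; they combine the documented flags for filtering, attribute extraction, and crawling:

    # keep only lines of text matching the regexp 'python'
    scrape http://www.example.com --filter python

    # extract href attributes (links) instead of plain text
    scrape http://www.example.com --attributes href

    # crawl links matching 'blog', up to 10 pages, and save the result as pdf
    scrape http://www.example.com --crawl blog --limit 10 --pdf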



News
====

0.3.0
------

 - added read option for user-supplied html files; currently writes files individually rather than grouped, with a grouping option to be added next
 - added html/ directory containing test html files
 - made relative imports explicit using absolute_import
 - added proxies to utils.py

0.2.10
------

 - moved OrderedSet class to orderedset.py rather than utils.py

0.2.9
------

 - updated program description and keywords in setup.py

0.2.8
------

 - crawling is now restricted to the seed domain by default; changed --strict to --nonstrict for crawling outside the given website

0.2.5
------

 - added requests to install_requires in setup.py

0.2.4
------

 - added attributes flag which specifies which tag attributes to extract from a given page, such as text, href, etc.

0.2.3
------

 - updated flags and flag help messages
 - verbose output is now the default, with fewer messages printed; use --quiet to silence messages
 - changed name of --files flag to --html for saving output as html
 - added --text flag; text remains the default output format

0.2.2
------

 - fixed character encoding issue, all unicode now

0.2.1
------

 - improvements to exception handling for proper PART file removal

0.2.0
------

 - pages are now saved as they are crawled to PART.html files and processed/removed as necessary; this greatly reduces program memory use
 - added a page cache with a limit of 10 for greater duplicate protection
 - added --files option for keeping webpages as PART.html instead of saving as text or pdf; this also organizes them into a subdirectory named after the seed url's domain
 - changed --restrict flag to --strict for restricting the domain to the seed domain while crawling
 - more --verbose messages being printed

0.1.10
------

 - now compares urls without their scheme before updating links to prevent http:// and https:// duplicates; replaced set_scheme with remove_scheme in utils.py
 - renamed write_pages to write_links

0.1.9
------

 - added behavior for --crawl keywords in crawl method
 - added a domain check before outputting crawled message or adding to crawled links
 - domain key in args is now set to base domain for proper --restrict behavior
 - clean_url now rstrips / character for proper link crawling
 - resolve_url now rstrips / character for proper out_file writing
 - updated description of --crawl flag

0.1.8
------

 - removed url fragments
 - replaced set_base with urlparse method urljoin
 - out_file name construction now uses urlparse 'path' member
 - raw_links is now an OrderedSet to try to eliminate as much processing as possible
 - added clear method to OrderedSet in utils.py

0.1.7
------

 - removed validate_domain and replaced it with a lambda instead
 - replaced domain with base_url in set_base as should have been done before
 - crawled message no longer prints if url was a duplicate

0.1.6
------

 - uncommented import __version__

0.1.5
------

 - set_domain was replaced by set_base, proper solution for links that are relative
 - fixed verbose behavior
 - updated description in README

0.1.4
------

 - fixed output file generation, was using domain instead of base_url
 - minor code cleanup

0.1.3
------

 - blank lines are no longer written to text unless as a page separator
 - style tags now ignored alongside script tags when getting text

0.1.2
------

 - added shebang

0.1.1
------

 - uncommented import __version__

0.1.0
------

 - reformatting to conform with PEP 8
 - added regexp support for matching crawl keywords and filter text keywords
 - improved url resolution by correcting domains and schemes
 - added --restrict option to restrict crawler links to only those with seed domain
 - made text the default write option rather than pdf, can now use --pdf to change that
 - removed page number being written to text, separator is now just a single blank line
 - improved construction of output file name

0.0.11
------

 - fixed missing comma in install_requires in setup.py
 - also now labeled as beta, as there are still some kinks with crawling

0.0.10
------

 - now ignoring pdfkit load errors only when there is more than one link, to prevent an empty pdf being created in case of error

0.0.9
------

 - pdfkit now ignores load errors and writes as many pages as possible

0.0.8
------

 - better implementation of crawler, can now scrape entire websites
 - added OrderedSet class to utils.py

0.0.7
------

 - changed --keywords to --filter and positional arg url to urls

0.0.6
------

 - use --keywords flag for filtering text
 - can pass multiple links now
 - will not write empty files anymore

0.0.5
------

 - added --verbose argument for use with pdfkit
 - improved output file name processing

0.0.4
------

 - accepts 0 or 1 urls, allowing a call with just --version

0.0.3
------

 - Moved utils.py to scrape/

0.0.2
------

 - First entry




