Metadata-Version: 1.1
Name: scrapy-sqlitem
Version: 0.1.2
Summary: Scrapy extension to save items to a sql database
Home-page: https://github.com/ryancerf/scrapy-sqlitem
Author: Ryan Cerf
Author-email: ryancerf@yahoo.com
License: BSD
Description: scrapy-sqlitem
        ==============
        
        scrapy-sqlitem allows you to define Scrapy items using SQLAlchemy models
        or tables. It also provides an easy way to save items to the database in
        chunks.
        
        This project is in beta. Pull requests and feedback are welcome. The
        usual caveats of using a SQL database backend for a write-heavy
        application still apply.
        
        Inspired by
        `scrapy-redis <https://github.com/darkrho/scrapy-redis>`__ and
        `scrapy-djangoitem <https://github.com/scrapy-plugins/scrapy-djangoitem>`__.
        
        Quickstart
        ==========
        
        ::
        
            pip install scrapy_sqlitem
        
        `Define items using SQLAlchemy ORM <http://docs.sqlalchemy.org/en/rel_1_0/orm/tutorial.html#declare-a-mapping>`__
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        
        .. code:: python
        
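            from sqlalchemy import Column, Integer, String
            from sqlalchemy.ext.declarative import declarative_base
        
            from scrapy_sqlitem import SqlItem
        
            Base = declarative_base()
        
            class MyModel(Base):
                __tablename__ = 'mytable'
                id = Column(Integer, primary_key=True)
                name = Column(String)
        
            class MyItem(SqlItem):
                sqlmodel = MyModel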
        
        `Or define items using SQLAlchemy Core <http://docs.sqlalchemy.org/en/rel_1_0/core/metadata.html>`__
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        
        .. code:: python
        
            from sqlalchemy import Column, Integer, MetaData, String, Table
        
            from scrapy_sqlitem import SqlItem
        
            metadata = MetaData()
        
            class MyItem(SqlItem):
                sqlmodel = Table('mytable', metadata,
                    Column('id', Integer, primary_key=True),
                    Column('name', String, nullable=False))
        
        If the tables have not been created yet, make sure to create them. See
        the SQLAlchemy docs and the example spider.
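        
        A minimal sketch of creating the tables up front, assuming the Core
        table shown above and a throwaway in-memory SQLite engine. The
        ``metadata`` and ``engine`` names are illustrative and not part of
        scrapy-sqlitem:
        
        .. code:: python
        
            from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine
        
            metadata = MetaData()
            mytable = Table('mytable', metadata,
                            Column('id', Integer, primary_key=True),
                            Column('name', String, nullable=False))
        
            # In-memory SQLite database; use your real DATABASE_URI here.
            engine = create_engine("sqlite:///")
        
            # Emits CREATE TABLE for every table registered on this metadata
            # that does not already exist in the database.
            metadata.create_all(engine)
        
        With ORM models, the equivalent call is ``Base.metadata.create_all(engine)``.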
        
        Use SqlSpider to easily save scraped items to the database
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        
        settings.py
        
        .. code:: python
        
            DATABASE_URI = "sqlite:///"
        
        Define your spider
        ~~~~~~~~~~~~~~~~~~
        
        .. code:: python
        
            from scrapy.selector import Selector
        
            from scrapy_sqlitem import SqlSpider
        
            # MyItem is the SqlItem subclass defined above.
        
            class MySpider(SqlSpider):
                name = 'myspider'
        
                start_urls = ('http://dmoz.org',)
        
                def parse(self, response):
                    selector = Selector(response)
                    item = MyItem()
                    item['name'] = selector.xpath('//title[1]/text()').extract_first()
                    yield item
        
        Run the spider
        ~~~~~~~~~~~~~~
        
        .. code:: sh
        
            scrapy crawl myspider
        
        Query the database
        ~~~~~~~~~~~~~~~~~~
        
        .. code:: sql
        
            Select * from mytable;
        
             id |               name                |
            ----+-----------------------------------+
              1 | DMOZ - the Open Directory Project |
        
        Other Information
        =================
        
        Do not want to use SqlSpider? Write a pipeline instead.
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        
        .. code:: python
        
        
            from sqlalchemy import create_engine
        
            class CommitSqlPipeline(object):
        
                def __init__(self):
                    self.engine = create_engine("sqlite:///")
        
                def process_item(self, item, spider):
                    item.commit_item(engine=self.engine)
                    # Return the item so later pipelines still receive it.
                    return item
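        
        As with any Scrapy item pipeline, the class has to be enabled in
        settings.py before it runs. A sketch, assuming the pipeline lives in a
        hypothetical ``myproject/pipelines.py`` module:
        
        .. code:: python
        
            # 'myproject.pipelines' is a placeholder path; use your own module.
            ITEM_PIPELINES = {
                'myproject.pipelines.CommitSqlPipeline': 300,
            }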
        
        Drop items missing required primary key data before saving to the db
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        
        .. code:: python
        
        
            from scrapy.exceptions import DropItem
        
            class DropMissingDataPipeline(object):
        
                def process_item(self, item, spider):
                    # Watch out for serial (auto-increment) primary keys:
                    # they are unset before the insert and so count as null.
                    if item.null_required_fields:
                        raise DropItem("Missing required fields: %s"
                                       % sorted(item.null_required_fields))
                    return item
        
        Save to the database in chunks rather than item by item
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        
        Inherit from SqlSpider and set the chunk sizes in settings.py:
        
        .. code:: python
        
            DEFAULT_CHUNKSIZE = 500
        
            CHUNKSIZE_BY_TABLE = {'mytable': 1000, 'othertable': 250}
        
        If an error occurs while saving a chunk to the database, the items in
        that chunk are saved one at a time instead.
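        
        The fallback idea can be sketched with the standard library's
        ``sqlite3`` module. This is an illustration of the pattern, not
        scrapy-sqlitem's actual implementation; ``save_chunk`` is a
        hypothetical helper:
        
        .. code:: python
        
            import sqlite3
        
        
            def save_chunk(conn, rows):
                """Insert rows in one batch; on failure, retry them one at a time.
        
                Returns the rows that could not be saved.
                """
                sql = "INSERT INTO mytable (id, name) VALUES (?, ?)"
                try:
                    with conn:  # commits on success, rolls back on error
                        conn.executemany(sql, rows)
                    return []
                except sqlite3.Error:
                    failed = []
                    for row in rows:
                        try:
                            with conn:
                                conn.execute(sql, row)
                        except sqlite3.Error:
                            failed.append(row)
                    return failed
        
        
            conn = sqlite3.connect(":memory:")
            conn.execute(
                "CREATE TABLE mytable (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
        
            # The second row violates NOT NULL, so the batch insert fails and
            # the rows are retried individually.
            failed = save_chunk(conn, [(1, 'ryan'), (2, None), (3, 'cerf')])
            saved = conn.execute("SELECT id, name FROM mytable ORDER BY id").fetchall()
        
        Only the offending row is lost; the other rows in the chunk still reach
        the table.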
        
        Access the underlying sqlalchemy table to query the database
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        
        .. code:: sql
        
             INSERT INTO mytable (id, name) VALUES ('1','ryan')
        
        .. code:: python
        
            myitem = MyItem()
            # Bind the table to an engine (this could also be done when the table is created).
            myitem.table.metadata.bind = self.engine
            myitem.table.select().where(myitem.table.c.id == 1).execute().fetchone()
            # (1, 'ryan')
        
        What row in the database matches the data in my item?
        
        .. code:: python
        
            myitem = MyItem()
            myitem['id'] = 1
            myitem.get_matching_dbrow(bind=self.engine)
            # (1, 'ryan')
        
        This is the same query as the one above!
        
        Gotchas
        =======
        
        If you override either item\_scraped or spider\_closed make sure to call super!
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        
        .. code:: python
        
        
            class MySpider(SqlSpider):
        
                def parse(self, response):
                    pass
        
                def spider_closed(self, spider, reason):
                    super(MySpider, self).spider_closed(spider, reason)
                    self.log("Log some really important custom stats")
        
        Be careful with other mixins. The inheritance structure can get a little
        messy. If a class earlier in the MRO overrides item\_scraped and does not
        call super, the item\_scraped method of SqlSpider will never be called.
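        
        The effect can be reproduced in plain Python. In this sketch
        ``SqlSpiderSketch``, ``BadMixin`` and ``GoodMixin`` are hypothetical
        stand-ins, and ``item_scraped`` takes a simplified signature:
        
        .. code:: python
        
            class SqlSpiderSketch(object):
                """Stand-in for SqlSpider; it only records that it was called."""
        
                def __init__(self):
                    self.calls = []
        
                def item_scraped(self, item):
                    self.calls.append('SqlSpiderSketch')
        
        
            class BadMixin(object):
                def item_scraped(self, item):
                    # No super() call: the chain stops here.
                    self.calls.append('BadMixin')
        
        
            class GoodMixin(object):
                def item_scraped(self, item):
                    self.calls.append('GoodMixin')
                    super(GoodMixin, self).item_scraped(item)
        
        
            class BrokenSpider(BadMixin, SqlSpiderSketch):
                pass
        
        
            class WorkingSpider(GoodMixin, SqlSpiderSketch):
                pass
        
        
            broken = BrokenSpider()
            broken.item_scraped({'name': 'ryan'})
        
            working = WorkingSpider()
            working.item_scraped({'name': 'ryan'})
        
        ``broken.calls`` contains only the mixin, while ``working.calls`` shows
        the call reaching the stand-in spider as well.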
        
        Other Methods of sqlitem
        ========================
        
        sqlitem.table
        ~~~~~~~~~~~~~
        
        -  returns the sqlalchemy core table that corresponds to that item.
        
        sqlitem.null\_required\_fields
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        
        -  returns the set of database column names that are marked not
           nullable and whose corresponding data in the item is null.
        
        sqlitem.null\_primary\_key\_fields
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        
        -  returns a set of the primary key names where the corresponding data
           in the item is null.
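        
        To make the difference between the two sets concrete, here is a sketch
        that computes them by hand from a SQLAlchemy Core table and a plain
        dict standing in for an item. It shows the idea only and is not the
        library's own code; the ``nickname`` column is made up for the example:
        
        .. code:: python
        
            from sqlalchemy import Column, Integer, MetaData, String, Table
        
            table = Table('mytable', MetaData(),
                          Column('id', Integer, primary_key=True),
                          Column('name', String, nullable=False),
                          Column('nickname', String))
        
            item = {'nickname': 'rc'}  # 'id' and 'name' were never set
        
            # Columns declared NOT NULL whose value is missing from the item.
            # Primary key columns are not nullable by default.
            null_required = {c.name for c in table.columns
                             if not c.nullable and item.get(c.name) is None}
        
            # Primary key columns whose value is missing from the item.
            null_primary = {c.name for c in table.primary_key.columns
                            if item.get(c.name) is None}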
        
        sqlitem.primary\_keys
        ~~~~~~~~~~~~~~~~~~~~~
        
        sqlitem.required\_keys
        ~~~~~~~~~~~~~~~~~~~~~~
        
        sqlitem.get\_matching\_dbrow(bind=None, use\_cache=True)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        
        -  Find the data in the database that matches the primary key data in
           the item
        
        ToDo
        ====
        
        -  Continuous integration Tests
        
        
Platform: UNKNOWN
Classifier: Framework :: Scrapy
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Topic :: Utilities
Requires: scrapy (>=0.24.5)
Requires: sqlalchemy
