cmoncrawl.processor.pipeline.extractor.DomainRecordExtractor
Contents
cmoncrawl.processor.pipeline.extractor.DomainRecordExtractor#
- class cmoncrawl.processor.pipeline.extractor.DomainRecordExtractor(filter_non_ok: bool = True)#
Dummy Extractor which simply extracts the html
- __init__(filter_non_ok: bool = True)#
Methods
__init__([filter_non_ok])extract(response, metadata)extract_soup(soup, metadata)filter_raw(response, metadata)filter_soup(soup, metadata)preprocess(response, metadata)