cmoncrawl.processor.pipeline.streamer.StreamerFileHTML
Contents
cmoncrawl.processor.pipeline.streamer.StreamerFileHTML#
- class cmoncrawl.processor.pipeline.streamer.StreamerFileHTML(root: Path, max_directory_size: int)#
- __init__(root: Path, max_directory_size: int)#
Methods
__init__(root, max_directory_size)clean_up()get_file_name(metadata)metadata_to_string(extracted_data)stream(extracted_data, metadata)