Scrapper

flowtask.components.CompanyScraper.scrapper

CompanyScraper

CompanyScraper(loop=None, job=None, stat=None, **kwargs)

Bases: FlowComponent, SeleniumService, HTTPService

Company Scraper Component

Overview:

This component scrapes company information from different sources using HTTPService. It can receive URLs from a previous component (like GoogleSearch) and extract specific company information.

.. table:: Properties
   :widths: auto

   +------------------+----------+------------------------------------------------------------------------------+
   | Name             | Required | Description                                                                  |
   +==================+==========+==============================================================================+
   | url_column (str) | Yes      | Name of the column containing URLs to scrape (default: 'search_url')         |
   +------------------+----------+------------------------------------------------------------------------------+
   | wait_for (tuple) | No       | Element to wait for before scraping (default: ('class', 'company-overview')) |
   +------------------+----------+------------------------------------------------------------------------------+
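The ``wait_for`` tuple pairs a locator kind with a value. How the component translates it internally is not documented here. The sketch below assumes it maps onto Selenium's locator strategy strings (the values behind ``By.CLASS_NAME``, ``By.ID``, and so on). The helper name ``to_locator`` and the set of accepted kinds are hypothetical.

```python
# Hypothetical helper, not part of CompanyScraper: translate a (kind, value)
# tuple such as ('class', 'company-overview') into a Selenium-style locator.
# The strings on the right are the values of Selenium's By constants.
LOCATOR_KINDS = {
    "class": "class name",   # By.CLASS_NAME
    "id": "id",              # By.ID
    "css": "css selector",   # By.CSS_SELECTOR
    "xpath": "xpath",        # By.XPATH
}


def to_locator(wait_for=("class", "company-overview")):
    """Return a (strategy, value) pair, assuming the mapping above."""
    kind, value = wait_for
    if kind not in LOCATOR_KINDS:
        raise ValueError(f"unsupported wait_for kind: {kind!r}")
    return (LOCATOR_KINDS[kind], value)
```

With the documented default, ``to_locator()`` yields ``('class name', 'company-overview')``, a pair in the form Selenium's expected-condition helpers accept as a locator.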

Return:

The component adds new columns to the DataFrame with company information:

- headquarters
- phone_number
- website
- stock_symbol
- naics_code
- employee_count

close async

close()

Clean up resources.

extract_company_info

extract_company_info(soup, search_term, search_url)

Extract company information from the page.

run async

run()

Execute scraping for each URL in the DataFrame.
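One way to picture "scraping for each URL" is a bounded set of concurrent tasks whose results are keyed by row index. The sketch below is an assumption about the general pattern, not the component's actual implementation: ``run_all``, ``max_concurrent``, and the stand-in ``scrape_url`` body are all hypothetical, and no network access takes place.

```python
import asyncio


async def scrape_url(idx, url):
    # Stand-in for a real scrape: yields control once and echoes its input.
    await asyncio.sleep(0)
    return idx, {"website": url}


async def run_all(urls, max_concurrent=5):
    # A semaphore caps how many scrapes are in flight at the same time.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(idx, url):
        async with semaphore:
            return await scrape_url(idx, url)

    pairs = await asyncio.gather(
        *(guarded(idx, url) for idx, url in enumerate(urls))
    )
    # Key results by row index so each one can be written back to its row.
    return dict(pairs)


if __name__ == "__main__":
    print(asyncio.run(run_all(["https://example.com/a", "https://example.com/b"])))
```

Passing the row index through with each URL, as the ``scrape_url(idx, url)`` signature suggests, lets results be matched to rows regardless of the order in which tasks finish.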

scrape_url async

scrape_url(idx, url)

Scrape company information from the given URL.

search_in_ddg async

search_in_ddg(search_term, company_name, scrapper, backend='html', region='wt-wt')

Search for a term in DuckDuckGo.

split_parts

split_parts(task_list, num_parts=5)

Split task list into parts for concurrent processing.
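The exact splitting rule is not documented here. A minimal sketch, assuming contiguous chunks whose sizes differ by at most one and assuming empty chunks are dropped, might look like this (``split_evenly`` is a hypothetical stand-in, not the method itself):

```python
def split_evenly(task_list, num_parts=5):
    """Split task_list into up to num_parts contiguous, near-equal chunks."""
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    size, extra = divmod(len(task_list), num_parts)
    parts = []
    start = 0
    for i in range(num_parts):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        if end > start:  # assumption: skip chunks that would be empty
            parts.append(task_list[start:end])
        start = end
    return parts
```

Under these assumptions, twelve tasks split into five parts give chunk sizes 3, 3, 2, 2, 2, and a list shorter than ``num_parts`` yields one single-item chunk per task.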

start async

start(**kwargs)

Initialize the component and validate required parameters.
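Which checks ``start`` performs is not spelled out above. As an illustration only, validation of the ``url_column`` property could resemble the following sketch. The function name ``resolve_url_column`` and the use of ``ValueError`` are assumptions, and the component may raise its own exception types.

```python
def resolve_url_column(available_columns, url_column="search_url"):
    """Check that url_column is usable against the incoming column names."""
    # Assumption: the property must be a non-empty string.
    if not isinstance(url_column, str) or not url_column.strip():
        raise ValueError("url_column must be a non-empty string")
    # Assumption: the column must already exist in the input DataFrame.
    if url_column not in available_columns:
        raise ValueError(
            f"column {url_column!r} not found; available: {list(available_columns)}"
        )
    return url_column
```

Failing early at this stage, before any browser or HTTP session is opened, keeps a misconfigured task from doing partial work.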