graphai.core.scraping.scraping module
- graphai.core.scraping.scraping.compare_strings(string1, string2)
Compares two input strings and returns an array of 0s and 1s.
- Parameters:
string1: First string
string2: Second string
- Returns:
Array of 0s and 1s, with 1 indicating equality of characters at that index
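A minimal sketch of the described behavior (illustrative only — the real implementation may differ, e.g. it may return a NumPy array; truncating at the shorter string's length is an assumption):

```python
def compare_strings(string1, string2):
    """Return a list of 0s and 1s; 1 means the characters at that index match.
    Sketch: comparison stops at the shorter string's length (assumption)."""
    return [1 if a == b else 0 for a, b in zip(string1, string2)]
```

For example, `compare_strings("cat", "car")` yields `[1, 1, 0]`.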
- graphai.core.scraping.scraping.find_consecutive_runs(v, min_run=32)
Finds consecutive runs of equal elements in a list (e.g. a sequence of 40 occurrences of the same value).
- Parameters:
v: The list or string
min_run: Minimum length of a consecutive run
- Returns:
List of tuples (k1, k2, Length, Value)
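A hedged sketch of the run detection, reading the tuple layout `(k1, k2, Length, Value)` as (start index, end index, run length, repeated value) — the library's actual indexing conventions may differ:

```python
def find_consecutive_runs(v, min_run=32):
    """Return (start, end, length, value) for each run of identical
    elements in v whose length is at least min_run (sketch)."""
    runs = []
    start = 0
    for i in range(1, len(v) + 1):
        # A run ends at the end of v or where the element changes
        if i == len(v) or v[i] != v[start]:
            length = i - start
            if length >= min_run:
                runs.append((start, i - 1, length, v[start]))
            start = i
    return runs
```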
- graphai.core.scraping.scraping.find_edge_patterns(content_stack, flip_strings=False)
Finds repeated patterns at the edge (beginning or end) of strings in a list of strings.
- Parameters:
content_stack: List of strings (the contents of webpages)
flip_strings: Whether to flip the strings. Finds footer patterns if True, header patterns if False.
- Returns:
List of patterns
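A simplified stand-in for the idea: treat the "edge pattern" as the longest prefix shared by every page, flipping the strings to detect footers. The real function is reportedly more flexible (e.g. it may detect patterns shared by only a subset of pages), so this only illustrates the header/footer symmetry:

```python
import os

def find_edge_patterns(content_stack, flip_strings=False):
    """Sketch: return the longest common prefix of all pages (header),
    or of all reversed pages (footer) when flip_strings is True."""
    if not content_stack:
        return []
    stack = [s[::-1] for s in content_stack] if flip_strings else list(content_stack)
    prefix = os.path.commonprefix(stack)  # works on any sequence of strings
    if not prefix:
        return []
    return [prefix[::-1]] if flip_strings else [prefix]
```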
- graphai.core.scraping.scraping.string_circular_shift(s, shift=1)
Performs a circular shift on a string by the provided value.
- Parameters:
s: String to shift
shift: How many characters to shift the string by
- Returns:
Shifted string
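A short sketch; the shift direction (characters moving toward higher indices, wrapping at the end) is an assumption:

```python
def string_circular_shift(s, shift=1):
    """Circularly shift s by `shift` characters (sketch; direction assumed
    to move characters toward higher indices, wrapping around)."""
    if not s:
        return s
    shift %= len(s)
    return s[-shift:] + s[:-shift] if shift else s
```

For example, `string_circular_shift("abcd", 1)` gives `"dabc"`.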
- graphai.core.scraping.scraping.find_spaces(s)
Finds spaces in a string.
- Parameters:
s: Input string
- Returns:
List of starting points of every space sequence in the string
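Since the return value is the starting point of every space *sequence* (not every space), the logic can be sketched as:

```python
def find_spaces(s):
    """Return the starting index of every maximal run of spaces in s (sketch)."""
    starts = []
    for i, ch in enumerate(s):
        # A run starts where a space is not preceded by another space
        if ch == " " and (i == 0 or s[i - 1] != " "):
            starts.append(i)
    return starts
```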
- graphai.core.scraping.scraping.shift_to_max_correlation(s1, s2)
Shifts two strings to find their maximum correlation (as indicated by the positions of spaces in them) and the largest matching string pattern at that shift.
- Parameters:
s1: First string
s2: Second string
- Returns:
(Optimal shift value, number of intersections with optimal shift, position of intersections, largest matching string pattern with optimal shift)
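A rough sketch combining the helpers above in spirit: circularly shift s2, score each shift by how many space-run starts it shares with s1, then extract the longest matching substring at the best shift. The return layout `(shift, n_intersections, positions, pattern)` and the use of circular shifts are assumptions based on the descriptions above:

```python
def shift_to_max_correlation(s1, s2):
    """Sketch: find the circular shift of s2 maximizing shared space
    positions with s1, then the longest exact match at that shift."""
    def space_starts(s):
        return {i for i, ch in enumerate(s)
                if ch == " " and (i == 0 or s[i - 1] != " ")}

    s1_spaces = space_starts(s1)
    best_shift, best_n, best_positions = 0, -1, set()
    for shift in range(max(len(s2), 1)):
        shifted = s2[-shift:] + s2[:-shift] if shift else s2
        common = s1_spaces & space_starts(shifted)
        if len(common) > best_n:
            best_shift, best_n, best_positions = shift, len(common), common

    # Longest run of equal characters under the best shift
    shifted = s2[-best_shift:] + s2[:-best_shift] if best_shift else s2
    longest, run_start = "", None
    for i in range(min(len(s1), len(shifted))):
        if s1[i] == shifted[i]:
            if run_start is None:
                run_start = i
            if i - run_start + 1 > len(longest):
                longest = s1[run_start:i + 1]
        else:
            run_start = None
    return best_shift, best_n, sorted(best_positions), longest
```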
- graphai.core.scraping.scraping.find_repeated_patterns(content_stack, min_length=1024)
Finds repeated patterns anywhere within the strings of a list of strings.
- Parameters:
content_stack: List of strings
min_length: Minimum length of the matching substrings
- Returns:
List of matching patterns
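One way to sketch this, using `difflib` rather than whatever matching strategy graphai uses internally: for every pair of pages, take the longest common substring and keep it when it reaches `min_length`:

```python
from difflib import SequenceMatcher
from itertools import combinations

def find_repeated_patterns(content_stack, min_length=1024):
    """Sketch: longest common substring per page pair, kept if long enough."""
    patterns = []
    for a, b in combinations(content_stack, 2):
        # autojunk=False disables the popular-element heuristic, which
        # would otherwise skew matching on long repetitive pages
        m = SequenceMatcher(None, a, b, autojunk=False)
        match = m.find_longest_match(0, len(a), 0, len(b))
        if match.size >= min_length:
            pattern = a[match.a:match.a + match.size]
            if pattern not in patterns:
                patterns.append(pattern)
    return patterns
```

Note this sketch is quadratic in the number of pages; a production version would likely deduplicate and prune more aggressively.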
- graphai.core.scraping.scraping.extract_text_from_url(url, request_headers=None, max_length=None, tag_search_sequence=None)
Extracts text from a webpage given its URL.
- Parameters:
url: The URL
request_headers: Request headers for the headless browser
max_length: Maximum length of the page contents
tag_search_sequence: Sequence of tags in which to search for the contents
- Returns:
Contents of the page
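The tag-search part of this can be sketched offline with a hypothetical helper, `extract_text_from_html`, that walks `tag_search_sequence` in order and returns the text of the first tag that yields content, truncated to `max_length`. Fetching the URL (and the `request_headers` handling) is deliberately omitted, and the default tag sequence here is an assumption:

```python
from html.parser import HTMLParser

class _TagTextExtractor(HTMLParser):
    """Collects text inside a single named tag (hypothetical helper;
    does not handle nested tags of the same name)."""
    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)

def extract_text_from_html(html, max_length=None,
                           tag_search_sequence=("main", "article", "body")):
    """Return the text of the first tag in tag_search_sequence that has
    any content, optionally truncated to max_length (sketch)."""
    for tag in tag_search_sequence:
        parser = _TagTextExtractor(tag)
        parser.feed(html)
        text = "".join(parser.chunks).strip()
        if text:
            return text[:max_length] if max_length else text
    return ""
```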
- graphai.core.scraping.scraping.check_url(test_url, request_headers=None)
Checks if a URL is accessible and returns the fully resolved URL if so.
- Parameters:
test_url: Starting URL
request_headers: Headers of the request, uses defaults if None
- Returns:
The validated URL, status message, and status code
- graphai.core.scraping.scraping.create_base_url_token(url)
Creates a standard token for a given base URL.
- Parameters:
url: Base URL
- Returns:
Token
- graphai.core.scraping.scraping.initialize_url(url, base_url=None)
Initializes the provided URL by determining its protocol (http or https) and validating it.
- Parameters:
url: The URL to initialize
base_url: The token of the base URL, extracted from url if None
- Returns:
The validated base URL, the original (corrected) base URL, status message, and status code
- graphai.core.scraping.scraping.generate_sublink_token(base_token, validated_url, sublink)
Generates the token for a sublink based on the base token.
- Parameters:
base_token: Token created for the base URL
validated_url: Validated base URL
sublink: Sublink to generate the token for
- Returns:
Token for the sublink
- graphai.core.scraping.scraping.reconstruct_data_dict(sublinks, tokens, contents=None, page_types=None)
Reconstructs the data dict used for processing the content of sublinks from precalculated inputs.
- Parameters:
sublinks: List of sublinks
tokens: List of tokens
contents: List of contents, optional
page_types: List of page types, optional
- Returns:
Reconstructed data dictionary
- graphai.core.scraping.scraping.initialize_data_dict(base_token, validated_url, sublinks)
Initializes the data dictionary used for processing the content of sublinks.
- Parameters:
base_token: Base URL token
validated_url: Validated base URL
sublinks: List of sublinks
- Returns:
Data dictionary with IDs filled in and content/page-type fields left empty
- graphai.core.scraping.scraping.get_sublinks(base_url, validated_url, request_headers=None)
Retrieves all the sublinks of a URL.
- Parameters:
base_url: Base URL
validated_url: Base validated URL
request_headers: Headers of the request, uses defaults if None
- Returns:
List of sublinks, a data dictionary mapping each sublink to a dict to be filled later, and the validated URL
- graphai.core.scraping.scraping.parse_page_type(url, validated_url)
Parses the type of a page according to predefined types.
- Parameters:
url: Given URL
validated_url: Base validated URL
- Returns:
Page type
- graphai.core.scraping.scraping.process_all_sublinks(data, base_url, validated_url)
Processes all the sublinks and extracts their contents.
- Parameters:
data: Data dict (which will be modified)
base_url: Corrected (but not validated) base URL, used for the id of the sublink
validated_url: Validated base URL
- Returns:
Modified data dict
- graphai.core.scraping.scraping.remove_headers(data)
Removes all headers and footers from the data dict, which contains the contents of all the sublinks of a base URL.
- Parameters:
data: Data dict
- Returns:
Modified data dict with all headers and footers eliminated
- graphai.core.scraping.scraping.remove_long_patterns(data, min_length=1024)
Removes all long patterns from the data dict, which contains the contents of all the sublinks of a base URL.
- Parameters:
data: Data dict
min_length: Minimum length of a long pattern
- Returns:
Modified data dict with all long patterns eliminated
- graphai.core.scraping.scraping.cache_lookup_get_sublinks(token)
- graphai.core.scraping.scraping.initialize_url_and_get_sublinks(token, url)
- graphai.core.scraping.scraping.scraping_sublinks_callback(results)
- graphai.core.scraping.scraping.cache_lookup_process_all_sublinks(token, headers, long_patterns)
- graphai.core.scraping.scraping.process_all_scraping_sublinks_parallel(results, i, n_total)
- graphai.core.scraping.scraping.sublink_parallel_processing_merge_callback(results)
- graphai.core.scraping.scraping.remove_junk_scraping_parallel(results, i, n_total, headers, long_patterns)
- graphai.core.scraping.scraping.extract_scraping_content_callback(results, headers, long_patterns)