graphai.core.scraping.scraping module

graphai.core.scraping.scraping.compare_strings(string1, string2)

Compares two input strings and returns an array of 0s and 1s.

Parameters:

- string1: First string
- string2: Second string

Returns:

Array of 0s and 1s, with 1 indicating equality of characters at that index
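The character-wise comparison can be sketched as follows. This is a minimal reimplementation based on the description above, not the module's actual code; it assumes the comparison runs up to the length of the shorter string.

```python
def compare_strings(string1, string2):
    """Compare two strings character by character.

    Returns a list of 0s and 1s, with 1 at index i when the characters
    at that index are equal.  zip() stops at the shorter string, an
    assumption the original docstring does not spell out.
    """
    return [1 if a == b else 0 for a, b in zip(string1, string2)]
```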

graphai.core.scraping.scraping.find_consecutive_runs(v, min_run=32)

Finds consecutive runs of equal elements in a list (e.g. a sequence of 40 occurrences of the same value).

Parameters:

- v: The list or string
- min_run: Minimum length of a consecutive run

Returns:

List of tuples (k1, k2, Length, Value)
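A run-finding sketch consistent with the description; it assumes the (k1, k2) entries of the returned tuples are the start index and the exclusive end index of each run, which the original docstring does not state.

```python
def find_consecutive_runs(v, min_run=32):
    """Find runs of at least `min_run` consecutive equal elements.

    Returns a list of (start, end, length, value) tuples, with `end`
    exclusive so that length == end - start.  Interpreting (k1, k2)
    as start/end indices is an assumption.
    """
    runs = []
    start = 0
    for i in range(1, len(v) + 1):
        # A run ends at the end of the sequence or when the value changes.
        if i == len(v) or v[i] != v[start]:
            if i - start >= min_run:
                runs.append((start, i, i - start, v[start]))
            start = i
    return runs
```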

graphai.core.scraping.scraping.find_edge_patterns(content_stack, flip_strings=False)

Finds repeated patterns at the edge (beginning or end) of the strings in a list of strings.

Parameters:

- content_stack: List of strings (the contents of webpages)
- flip_strings: Whether to flip the strings; finds footer patterns if True, header patterns if False

Returns:

List of patterns
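A simplified sketch of the idea: take the longest prefix shared by all pages (reversing each string first when hunting for footers). The real function presumably detects multiple patterns shared by subsets of pages; this sketch returns at most one.

```python
import os


def find_edge_patterns(content_stack, flip_strings=False):
    """Simplified edge-pattern finder.

    Returns the longest prefix common to every string in `content_stack`
    (a shared header), or the longest common suffix when `flip_strings`
    is True (a shared footer).
    """
    if not content_stack:
        return []
    stack = [s[::-1] for s in content_stack] if flip_strings else list(content_stack)
    prefix = os.path.commonprefix(stack)  # longest common prefix of all strings
    if not prefix:
        return []
    # Undo the flip so footers come back in reading order.
    return [prefix[::-1]] if flip_strings else [prefix]
```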

graphai.core.scraping.scraping.string_circular_shift(s, shift=1)

Performs a circular shift on a string by the provided value.

Parameters:

- s: String to shift
- shift: How many characters to shift the string by

Returns:

Shifted string
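A circular shift is a rotation where characters pushed off one end reappear at the other. The sketch below rotates to the right for positive values; the direction is an assumption, since the original docstring does not specify it.

```python
def string_circular_shift(s, shift=1):
    """Circularly shift a string; positive values rotate to the right.

    The rotation direction is an assumption not stated in the
    original docstring.
    """
    if not s:
        return s
    shift %= len(s)  # normalise shifts larger than the string length
    return s[-shift:] + s[:-shift] if shift else s
```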

graphai.core.scraping.scraping.find_spaces(s)

Finds spaces in a string.

Parameters:

- s: Input string

Returns:

List of starting points of every space sequence in the string
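Since the return value is described as the starting points of space *sequences* (not of every individual space), a matching sketch only records a space whose predecessor is not a space:

```python
def find_spaces(s):
    """Return the starting index of every run of consecutive spaces."""
    starts = []
    for i, ch in enumerate(s):
        # A space starts a new sequence if it is the first character
        # or the previous character is not a space.
        if ch == " " and (i == 0 or s[i - 1] != " "):
            starts.append(i)
    return starts
```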

graphai.core.scraping.scraping.shift_to_max_correlation(s1, s2)

Shifts two strings to find their maximum correlation (as indicated by the positions of the spaces in them) and the largest matching string pattern at that shift.

Parameters:

- s1: First string
- s2: Second string

Returns:

(Optimal shift value, number of intersections with optimal shift, position of intersections, largest matching string pattern with optimal shift)

graphai.core.scraping.scraping.find_repeated_patterns(content_stack, min_length=1024)

Finds repeated patterns anywhere within the strings of a list of strings.

Parameters:

- content_stack: List of strings
- min_length: Minimum length of the matching substrings

Returns:

List of matching patterns

graphai.core.scraping.scraping.extract_text_from_url(url, request_headers=None, max_length=None, tag_search_sequence=None)

Extracts text from a webpage given its URL.

Parameters:

- url: The URL
- request_headers: Request headers for the headless browser
- max_length: Maximum length of the page contents
- tag_search_sequence: Sequence of tags to search for the contents in

Returns:

Contents of the page

graphai.core.scraping.scraping.check_url(test_url, request_headers=None)

Checks whether a URL is accessible and, if so, returns the fully resolved URL.

Parameters:

- test_url: Starting URL
- request_headers: Headers of the request; uses defaults if None

Returns:

The validated URL, status message, and status code

graphai.core.scraping.scraping.create_base_url_token(url)

Creates a standard token for a given base URL.

Parameters:

- url: Base URL

Returns:

Token
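The exact token format is not documented here. A plausible sketch of such a normalisation, purely illustrative: strip the scheme and trailing slashes, then replace every non-alphanumeric character with an underscore.

```python
from urllib.parse import urlparse


def create_base_url_token(url):
    """Hypothetical token scheme (the real format is undocumented):
    drop the scheme, strip trailing slashes, and replace every
    non-alphanumeric character with '_'.
    """
    # Tolerate scheme-less input by assuming https.
    parsed = urlparse(url if "://" in url else "https://" + url)
    base = (parsed.netloc + parsed.path).rstrip("/")
    return "".join(c if c.isalnum() else "_" for c in base)
```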

graphai.core.scraping.scraping.initialize_url(url, base_url=None)

Initializes the provided URL by determining its protocol (http or https) and validating it.

Parameters:

- url: The URL to initialize
- base_url: The token of the base URL; extracted from url if None

Returns:

The validated base URL, the original (corrected) base URL, status message, and status code

Generates the token for a sublink based on the base token.

Parameters:

- base_token: Token created for the base URL
- validated_url: Validated base URL
- sublink: Sublink to generate the token for

Returns:

Token for the sublink

graphai.core.scraping.scraping.reconstruct_data_dict(sublinks, tokens, contents=None, page_types=None)

Reconstructs the data dict used for processing the content of sublinks, from precalculated inputs.

Parameters:

- sublinks: List of sublinks
- tokens: List of tokens
- contents: List of contents (optional)
- page_types: List of page types (optional)

Returns:

Reconstructed data dictionary
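A sketch of what such a reconstruction could look like, assuming the per-sublink structure suggested by `initialize_data_dict` below (ids plus content/page-type fields); the key names are assumptions, not the module's actual schema.

```python
def reconstruct_data_dict(sublinks, tokens, contents=None, page_types=None):
    """Rebuild a per-sublink data dict from parallel lists.

    The field names ('token', 'content', 'page_type') are assumptions.
    Missing optional lists leave the corresponding fields as None.
    """
    contents = contents or [None] * len(sublinks)
    page_types = page_types or [None] * len(sublinks)
    return {
        sublink: {"token": token, "content": content, "page_type": page_type}
        for sublink, token, content, page_type
        in zip(sublinks, tokens, contents, page_types)
    }
```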

graphai.core.scraping.scraping.initialize_data_dict(base_token, validated_url, sublinks)

Initializes the data dictionary used for processing the content of sublinks.

Parameters:

- base_token: Base URL token
- validated_url: Validated base URL
- sublinks: List of sublinks

Returns:

Data dictionary with the ids filled in and the content/page-type fields left empty

Retrieves all the sublinks of a URL.

Parameters:

- base_url: Base URL
- validated_url: Base validated URL
- request_headers: Headers of the request; uses defaults if None

Returns:

List of sublinks, a data dictionary mapping each sublink to a dict to be filled later, and the validated URL

graphai.core.scraping.scraping.parse_page_type(url, validated_url)

Parses the type of a page according to predefined types.

Parameters:

- url: Given URL
- validated_url: Base validated URL

Returns:

Page type
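The predefined types are not listed in this documentation. A sketch of the classification idea under that caveat, with illustrative categories ('external', 'pdf', 'html') that are not the module's actual type names:

```python
from urllib.parse import urlparse


def parse_page_type(url, validated_url):
    """Classify a page by its URL.

    The categories used here are illustrative assumptions; the
    module's predefined types are not documented above.
    """
    # Anything outside the validated base URL is treated as external.
    if not url.startswith(validated_url):
        return "external"
    path = urlparse(url).path.lower()
    if path.endswith(".pdf"):
        return "pdf"
    return "html"
```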

Processes all the sublinks and extracts their contents.

Parameters:

- data: Data dict (which will be modified)
- base_url: Corrected (but not validated) base URL, used for the id of the sublink
- validated_url: Validated base URL

Returns:

Modified data dict

graphai.core.scraping.scraping.remove_headers(data)

Removes all headers and footers from the data dict, which contains the contents of all the sublinks of a base URL.

Parameters:

- data: Data dict

Returns:

Modified data dict with all headers and footers eliminated

graphai.core.scraping.scraping.remove_long_patterns(data, min_length=1024)

Removes all long patterns from the data dict, which contains the contents of all the sublinks of a base URL.

Parameters:

- data: Data dict
- min_length: Minimum length of a long pattern

Returns:

Modified data dict with all long patterns eliminated

graphai.core.scraping.scraping.remove_junk_scraping_parallel(results, i, n_total, headers, long_patterns)

graphai.core.scraping.scraping.extract_scraping_content_callback(results, headers, long_patterns)