graphai.core.utils.text.clean module

class graphai.core.utils.text.clean.HTMLCleaner

Bases: HTMLParser

Parser that removes HTML tags from raw text.

handle_starttag(tag, attrs)
handle_endtag(tag)
handle_data(d)
get_data()
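
The method names above follow the usual pattern for subclassing the standard library's `html.parser.HTMLParser`: override the tag and data callbacks, accumulate the text, and expose it through a getter. The following is only a minimal sketch of that pattern, not the actual graphai implementation. The class name `TagStripper`, the helper `strip_tags`, and the choice to replace each tag with a space are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical stand-in for HTMLCleaner: keeps text, drops tags."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &amp;
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Assumption: emit a space so words on either side of a tag do not fuse.
        self._chunks.append(" ")

    def handle_endtag(self, tag):
        self._chunks.append(" ")

    def handle_data(self, d):
        # Text between tags is kept as-is.
        self._chunks.append(d)

    def get_data(self):
        # Join the collected pieces and collapse runs of whitespace.
        return " ".join("".join(self._chunks).split())


def strip_tags(raw):
    """Hypothetical helper: feed raw text to the parser and return the result."""
    parser = TagStripper()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_data()
```

Under these assumptions, `strip_tags("<p>Hello <b>world</b></p>")` yields the plain text with the markup removed.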

graphai.core.utils.text.clean.normalize(text)

Normalizes the given text by fixing encoding problems, removing URLs and email addresses, stripping HTML tags, and converting to lowercase.

Parameters:

text (str) – Text to be normalized.

Returns:

Normalized text.

Return type:

str
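
To make the steps in the description concrete, here is a rough, standard-library-only sketch of a pipeline with the same stages. It is not the graphai implementation: the function name `normalize_sketch`, the regular expressions, the order of the steps, and the use of Unicode NFKC normalization as a stand-in for "fixing encoding problems" are all assumptions.

```python
import html
import re
import unicodedata

# Illustrative patterns only; the real module may match URLs and emails differently.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
TAG_RE = re.compile(r"<[^>]+>")


def normalize_sketch(text):
    """Hypothetical approximation of the stages listed for normalize()."""
    # Stand-in for encoding repair: fold compatibility characters (e.g. ligatures).
    text = unicodedata.normalize("NFKC", text)
    # Strip HTML tags, then decode entities such as &amp;.
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    # Remove URLs and email addresses.
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    # Lowercase and collapse the whitespace left behind by the removals.
    return " ".join(text.lower().split())
```

A call such as `normalize_sketch("<p>Visit https://example.com NOW</p>")` would, under these assumptions, return only the lowercase words that remain after the tag and URL are removed.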