graphai.core.text.wikisearch module

graphai.core.text.wikisearch.search_wikipedia_api(text, limit=10)

Perform a search query against the Wikipedia API for a given text.

Parameters:
  • text (str) – Query text for the search.

  • limit (int) – Maximum number of returned results.

Returns:

A list of dictionaries with keys ‘concept_id’ and ‘concept_name’ containing the top matches for the search.

Return type:

list
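The documented return shape can be illustrated with a minimal sketch. It assumes the function is backed by the MediaWiki Action API search endpoint (action=query, list=search), which this page does not confirm; the helper names build_search_url and parse_search_response, the endpoint constant and the sample payload are hypothetical and used only for illustration.

```python
from urllib.parse import urlencode

# Assumption: the English Wikipedia Action API endpoint is the one queried.
WIKIPEDIA_API_URL = "https://en.wikipedia.org/w/api.php"


def build_search_url(text, limit=10):
    """Build a MediaWiki full-text search URL for the given query text."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": text,
        "srlimit": limit,
        "format": "json",
    }
    return WIKIPEDIA_API_URL + "?" + urlencode(params)


def parse_search_response(payload, limit=10):
    """Convert a decoded search response into the documented list of dicts."""
    hits = payload.get("query", {}).get("search", [])
    return [
        {"concept_id": hit["pageid"], "concept_name": hit["title"]}
        for hit in hits[:limit]
    ]


# Made-up payload in the shape returned by list=search (page ids are illustrative).
sample_payload = {
    "query": {
        "search": [
            {"pageid": 101, "title": "Example page A"},
            {"pageid": 102, "title": "Example page B"},
        ]
    }
}

url = build_search_url("graph theory", limit=5)
concepts = parse_search_response(sample_payload, limit=1)
```

Keeping URL construction and response parsing separate means the parsing step can be exercised without network access.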

graphai.core.text.wikisearch.search_elasticsearch(text, es, limit=10)

Perform a search query against the Elasticsearch cluster for a given text.

Parameters:
  • text (str) – Query text for the search.

  • es (ESConceptDetection) – Elasticsearch interface.

  • limit (int) – Maximum number of returned results.

Returns:

A list of dictionaries with keys ‘concept_id’, ‘concept_name’ and ‘score’ containing the top matches for the search.

Return type:

list
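As a rough illustration of the returned structure, the sketch below maps a standard Elasticsearch search response body onto the documented keys. It does not use the ESConceptDetection interface, whose methods are not described here. The helper name parse_es_hits, the use of "_id" as the concept identifier and the "title" field inside "_source" are assumptions.

```python
def parse_es_hits(response, limit=10):
    """Map Elasticsearch hits to dicts with concept_id, concept_name and score."""
    hits = response.get("hits", {}).get("hits", [])
    return [
        {
            "concept_id": hit["_id"],  # assumption: document id is the concept id
            "concept_name": hit["_source"]["title"],  # assumption: field is "title"
            "score": hit["_score"],
        }
        for hit in hits[:limit]
    ]


# Made-up response body following the usual hits.hits layout.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "201", "_score": 7.5, "_source": {"title": "Example page A"}},
            {"_id": "202", "_score": 3.25, "_source": {"title": "Example page B"}},
        ]
    }
}

matches = parse_es_hits(sample_response, limit=10)
```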

graphai.core.text.wikisearch.wikisearch(keywords_list, es, fraction=(0, 1), method='es-base')

Finds 10 relevant concepts (Wikipedia pages) for each set of keywords in a list.

Parameters:
  • keywords_list (list(str)) – List containing the sets of keywords for which to search concepts.

  • es (ESConceptDetection) – Elasticsearch interface.

  • fraction (tuple(int, int)) – Portion of the keywords_list to be processed, e.g. (1/3, 2/3) means only the middle third of the list is considered.

  • method (str) – Method to retrieve the concepts (Wikipedia pages). It can be either “wikipedia-api”, to use the Wikipedia API, or one of {"es-base", "es-score"}, to use elasticsearch.
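One plausible reading of the fraction parameter is a slice over keywords_list whose bounds are the two relative positions scaled by the list length. The sketch below follows that reading; the helper name select_fraction and the rounding rule are assumptions, not the module's confirmed behavior.

```python
def select_fraction(items, fraction=(0, 1)):
    """Return the part of items between two relative positions in [0, 1]."""
    start, stop = fraction
    n = len(items)
    # round() absorbs floating-point products that land just below an integer.
    begin = round(start * n)
    end = round(stop * n)
    return items[begin:end]


# Illustrative keyword sets, one string per set.
keywords_list = ["k1", "k2", "k3", "k4", "k5", "k6"]

middle_third = select_fraction(keywords_list, (1 / 3, 2 / 3))
everything = select_fraction(keywords_list)
```

Splitting the list this way lets several workers each process a disjoint portion, for example (0, 1/2) and (1/2, 1).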

Returns:

A pandas DataFrame with columns [‘keywords’, ‘concept_id’, ‘concept_name’, ‘searchrank’, ‘search_score’], unique by (‘keywords’, ‘concept_id’). The searchrank is the position of the concept in the list of results for that set of keywords, starting with 1. The search_score is the elasticsearch score for method “es-score” or 1 - (searchrank - 1)/n for the other methods. The default method is ‘es-base’; the fallback method is ‘wikipedia-api’.

Return type:

pd.DataFrame
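The rank-based score can be worked through on a small example. The sketch below uses plain dictionaries rather than a DataFrame and assumes that n is the number of results returned for the set of keywords, which the description above does not state. The helper name rank_and_score and the sample results are hypothetical.

```python
def rank_and_score(keywords, results):
    """Attach searchrank (starting at 1) and a rank-based search_score."""
    n = len(results)  # assumption: n is the number of results for this keyword set
    rows = []
    for searchrank, result in enumerate(results, start=1):
        rows.append(
            {
                "keywords": keywords,
                "concept_id": result["concept_id"],
                "concept_name": result["concept_name"],
                "searchrank": searchrank,
                "search_score": 1 - (searchrank - 1) / n,
            }
        )
    return rows


# Made-up results, already ordered from best to worst match.
sample_results = [
    {"concept_id": 101, "concept_name": "Example page A"},
    {"concept_id": 102, "concept_name": "Example page B"},
    {"concept_id": 103, "concept_name": "Example page C"},
    {"concept_id": 104, "concept_name": "Example page D"},
]

rows = rank_and_score("example keywords", sample_results)
scores = [row["search_score"] for row in rows]
```

Under this assumption the top result always scores 1 and the scores decrease in equal steps of 1/n, so with four results they are 1.0, 0.75, 0.5 and 0.25.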