graphai.pipelines.investment.concept_configuration module

graphai.pipelines.investment.concept_configuration.norm(x)

Computes the norm of the different configurations in x.

Parameters: x (pd.DataFrame) – DataFrame whose columns are key + [‘PageID’, ‘Score’]. Each configuration of scores is given by a unique tuple of values for columns in key. For example, if x has columns [‘InvestorID’, ‘PageID’, ‘Score’], then each value of InvestorID is assumed to index a configuration, given by the columns [‘PageID’, ‘Score’].

Returns (pd.DataFrame): DataFrame with columns key + [‘Norm’], containing the norm of each configuration, computed as $\sqrt{\sum_{c \in C} X(c)^2}$, where $C$ is the set of concepts and $X: C \to [0, 1]$ is a given configuration.
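
The norm described above can be sketched with a pandas group-by. This is an illustrative re-implementation under stated assumptions, not the module's actual code: the name norm_sketch is hypothetical, and key is assumed to be every column of x other than ‘PageID’ and ‘Score’.

```python
import numpy as np
import pandas as pd


def norm_sketch(x: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of the documented norm; not the library's code."""
    # Assumption: every column other than 'PageID' and 'Score' belongs to key.
    key = [c for c in x.columns if c not in ("PageID", "Score")]
    squared = x.assign(Norm=x["Score"] ** 2)
    # Sum the squared scores within each configuration, then take the root.
    out = squared.groupby(key, as_index=False)["Norm"].sum()
    out["Norm"] = np.sqrt(out["Norm"])
    return out


x = pd.DataFrame({
    "InvestorID": ["a", "a", "b"],
    "PageID": [1, 2, 1],
    "Score": [0.6, 0.8, 0.5],
})
result = norm_sketch(x)  # one row per InvestorID, with a 'Norm' column
```

In this example, configuration a has scores 0.6 and 0.8, so its norm is 1; configuration b has the single score 0.5, so its norm is 0.5.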

graphai.pipelines.investment.concept_configuration.normalise(x)

Returns the same set of configurations as in x, with each score divided by the norm of its configuration. Hence, all configurations in the returned DataFrame have norm 1.

Parameters: x (pd.DataFrame) – DataFrame whose columns are key + [‘PageID’, ‘Score’]. Each configuration of scores is given by a unique tuple of values for columns in key.

Returns (pd.DataFrame): DataFrame with the same columns as x, containing configurations indexed by the same set as x. The score of a concept in the returned configuration is the score it has in x divided by the norm of the configuration.
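
A minimal sketch of this normalisation, assuming that key is every column other than ‘PageID’ and ‘Score’ and that no configuration has norm 0. The name normalise_sketch is hypothetical and the code is not taken from the module.

```python
import numpy as np
import pandas as pd


def normalise_sketch(x: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of the documented normalisation."""
    # Assumption: every column other than 'PageID' and 'Score' belongs to key.
    key = [c for c in x.columns if c not in ("PageID", "Score")]
    squared = x.assign(_sq=x["Score"] ** 2)
    # Norm of the configuration each row belongs to, aligned row by row.
    norms = np.sqrt(squared.groupby(key)["_sq"].transform("sum"))
    # Assumption: no configuration has norm 0 (division would give NaN/inf).
    return x.assign(Score=x["Score"] / norms)


x = pd.DataFrame({
    "InvestorID": ["a", "a", "b"],
    "PageID": [1, 2, 1],
    "Score": [0.3, 0.4, 0.2],
})
normalised = normalise_sketch(x)
```

Here configuration a has norm 0.5, so its scores become 0.6 and 0.8; configuration b has a single concept, whose score becomes 1.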

graphai.pipelines.investment.concept_configuration.mix(x, edges, min_ratio=0.05)

Mixes the configurations in x according to the edges in edges.

Parameters:

x (pd.DataFrame) – DataFrame whose columns are key + [‘PageID’, ‘Score’]. Each configuration of scores is given by a unique tuple of values for columns in key. For example, if x has columns [‘InvestorID’, ‘PageID’, ‘Score’], then each value of InvestorID is assumed to index a configuration, given by the columns [‘PageID’, ‘Score’].
edges (pd.DataFrame) – DataFrame whose columns are [‘SourcePageID’, ‘TargetPageID’, ‘Score’], which define the weighted edges of the concepts graph.
min_ratio (float) – For every resulting configuration, only concepts whose ratio of score over maximum score is above min_ratio are kept. If set to 0, then all concepts are kept.

Returns (pd.DataFrame): DataFrame with the same columns as x, containing configurations indexed by the same set as x. The score of a concept in the mixed configuration is the arithmetic mean of the products of the configuration score and the edge score in the 1-ball of the concept, assuming every concept has a loop with score 1.
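
One possible reading of the mixing rule is sketched below. Several points are assumptions rather than documented behaviour: the name mix_sketch is hypothetical; edges is assumed to already contain both directions of every edge and a loop of score 1 on every vertex; the mean is taken over every concept in the 1-ball, with concepts absent from a configuration counted as score 0; and the min_ratio filter is applied with >= so that min_ratio=0 keeps everything.

```python
import pandas as pd


def mix_sketch(x, edges, min_ratio=0.05):
    """Hypothetical sketch of the documented mixing; not the library's code."""
    key = [c for c in x.columns if c not in ("PageID", "Score")]
    # Pair every scored concept with its neighbours (loops included).
    merged = x.merge(edges, left_on="PageID", right_on="SourcePageID",
                     suffixes=("", "Edge"))
    merged["Product"] = merged["Score"] * merged["ScoreEdge"]
    # Sum the products per (configuration, target concept).
    sums = merged.groupby(key + ["TargetPageID"], as_index=False)["Product"].sum()
    # Assumption: divide by the size of the target's whole 1-ball,
    # so neighbours missing from the configuration count as 0.
    ball = (edges.groupby("TargetPageID")["SourcePageID"].nunique()
                 .rename("BallSize").reset_index())
    sums = sums.merge(ball, on="TargetPageID")
    sums["Score"] = sums["Product"] / sums["BallSize"]
    mixed = sums.rename(columns={"TargetPageID": "PageID"})[key + ["PageID", "Score"]]
    # Keep concepts whose score relative to the configuration maximum
    # reaches min_ratio (assumption: inclusive comparison).
    max_score = mixed.groupby(key)["Score"].transform("max")
    return mixed[mixed["Score"] / max_score >= min_ratio].reset_index(drop=True)


x = pd.DataFrame({"InvestorID": ["a"], "PageID": [1], "Score": [1.0]})
edges = pd.DataFrame({
    "SourcePageID": [1, 2, 3, 1, 2, 2, 3],
    "TargetPageID": [1, 2, 3, 2, 1, 3, 2],
    "Score": [1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5],
})
mixed = mix_sketch(x, edges)
```

In this example the single concept 1 spreads to its neighbour 2, while concept 3, which is two steps away, receives no score. Raising min_ratio to 0.5 would drop concept 2 again.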

graphai.pipelines.investment.concept_configuration.normalise_graph(edges)

Adds missing reverse edges and averages scores. Adds loops on each vertex with a score of one.

Parameters: edges (pd.DataFrame) – DataFrame whose columns are [‘SourcePageID’, ‘TargetPageID’, ‘Score’], which define the weighted edges of the concepts graph.

Returns (pd.DataFrame): DataFrame whose columns are [‘SourcePageID’, ‘TargetPageID’, ‘Score’], with each pair in both directions and with loops on every vertex with a score of 1.
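
The symmetrisation can be sketched as follows. The name normalise_graph_sketch is hypothetical, and two details are assumptions: "averages scores" is read as taking the mean of the scores available for a pair across both directions, and any loops already present in edges are replaced by loops of score 1.

```python
import pandas as pd


def normalise_graph_sketch(edges: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of the documented graph normalisation."""
    cols = ["SourcePageID", "TargetPageID", "Score"]
    # Assumption: existing loops are discarded and re-added with score 1.
    e = edges.loc[edges["SourcePageID"] != edges["TargetPageID"], cols]
    # Reverse every edge by swapping the source and target columns.
    reverse = e.rename(columns={"SourcePageID": "TargetPageID",
                                "TargetPageID": "SourcePageID"})[cols]
    both = pd.concat([e, reverse], ignore_index=True)
    # Each ordered pair now carries the scores of both directions; average them.
    sym = both.groupby(["SourcePageID", "TargetPageID"], as_index=False)["Score"].mean()
    # One loop of score 1 per vertex that appears anywhere in edges.
    vertices = pd.concat([edges["SourcePageID"], edges["TargetPageID"]]).unique()
    loops = pd.DataFrame({"SourcePageID": vertices,
                          "TargetPageID": vertices,
                          "Score": 1.0})
    return pd.concat([sym, loops], ignore_index=True)


edges = pd.DataFrame({
    "SourcePageID": [1, 2, 2],
    "TargetPageID": [2, 1, 3],
    "Score": [0.4, 0.8, 0.5],
})
graph = normalise_graph_sketch(edges)
```

In this example the pair (1, 2) has scores 0.4 and 0.8, so both directions end up with 0.6; the edge 2 to 3 gains a reverse edge with the same score 0.5; and vertices 1, 2 and 3 each receive a loop.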

graphai.pipelines.investment.concept_configuration.combine(x, y, pairs)

Combines the configurations in x and y based on the associations in pairs.

Parameters:

x (pd.DataFrame) – DataFrame whose columns are key_x + [‘PageID’, ‘Score’]. Each configuration of scores is given by a unique tuple of values for columns in key_x. For example, if x has columns [‘InvestorID’, ‘PageID’, ‘Score’], then each value of InvestorID is assumed to index a configuration, given by the columns [‘PageID’, ‘Score’].
y (pd.DataFrame) – DataFrame whose columns are key_y + [‘PageID’, ‘Score’]. Each configuration of scores is given by a unique tuple of values for columns in key_y. For example, if y has columns [‘InvestorID’, ‘PageID’, ‘Score’], then each value of InvestorID is assumed to index a configuration, given by the columns [‘PageID’, ‘Score’].
pairs (pd.DataFrame) – DataFrame whose columns are key_x + key_y. Configurations in x and y are compared based on the associations in this DataFrame.

Returns (pd.DataFrame): DataFrame with columns key_x + key_y + [‘PageID’, ‘Score’]. For each row in pairs, there is a configuration of scores in the returned DataFrame. The score for each concept is the geometric mean of the scores in x and y.
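
A minimal sketch of the combination, assuming that key_x and key_y share no column names and that concepts missing from either configuration are dropped (their geometric mean would be 0). The name combine_sketch and the column ‘UnitID’ are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd


def combine_sketch(x, y, pairs):
    """Hypothetical sketch of the documented combination."""
    # Assumption: key columns are everything except 'PageID' and 'Score'.
    key_x = [c for c in x.columns if c not in ("PageID", "Score")]
    key_y = [c for c in y.columns if c not in ("PageID", "Score")]
    # Attach x's scores to each pair, then y's scores on the same concept.
    merged = (pairs.merge(x, on=key_x)
                   .merge(y, on=key_y + ["PageID"], suffixes=("X", "Y")))
    # Concept-wise geometric mean of the two scores.
    merged["Score"] = np.sqrt(merged["ScoreX"] * merged["ScoreY"])
    return merged[key_x + key_y + ["PageID", "Score"]]


x = pd.DataFrame({"InvestorID": ["a", "a"], "PageID": [1, 2], "Score": [0.25, 0.5]})
y = pd.DataFrame({"UnitID": ["u", "u"], "PageID": [1, 3], "Score": [1.0, 0.7]})
pairs = pd.DataFrame({"InvestorID": ["a"], "UnitID": ["u"]})
combined = combine_sketch(x, y, pairs)
```

Only concept 1 appears in both configurations, so the combined configuration for the pair (a, u) contains that single concept with score sqrt(0.25 * 1.0) = 0.5.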

graphai.pipelines.investment.concept_configuration.compute_affinities(x, y, pairs, edges=None, mix_x=False, mix_y=False, normalise_before=False, method='cosine', k=1)

Computes affinity scores between the pairs of configurations in x and y indexed in pairs, according to the concepts (edge-weighted) graph specified in edges.

Parameters:

x (pd.DataFrame) – DataFrame whose columns are key_x + [‘PageID’, ‘Score’]. Each configuration of scores is given by a unique tuple of values for columns in key_x. For example, if x has columns [‘InvestorID’, ‘PageID’, ‘Score’], then each value of InvestorID is assumed to index a configuration, given by the columns [‘PageID’, ‘Score’].
y (pd.DataFrame) – DataFrame whose columns are key_y + [‘PageID’, ‘Score’]. Each configuration of scores is given by a unique tuple of values for columns in key_y. For example, if y has columns [‘InvestorID’, ‘PageID’, ‘Score’], then each value of InvestorID is assumed to index a configuration, given by the columns [‘PageID’, ‘Score’].
pairs (pd.DataFrame) – DataFrame whose columns are key_x + key_y. Configurations in x and y are compared based on the associations in this DataFrame.
edges (pd.DataFrame) – DataFrame whose columns are [‘SourcePageID’, ‘TargetPageID’, ‘Score’], which define the weighted edges of the concepts graph. Only required if mix_x or mix_y is True.
mix_x (bool) – Whether to replace x with its mixing before affinity computation. Recommended to set to True if the configurations in x have a low number of concepts. If set to True, then edges is required.
mix_y (bool) – Whether to replace y with its mixing before affinity computation. Recommended to set to True if the configurations in y have a low number of concepts. If set to True, then edges is required.
normalise_before (bool) – Whether to normalise score configurations to have norm 1 before computing affinities.
method (str) – Which method to use to compute affinities. ‘euclidean’ uses a function based on the Euclidean distance between each pair of configurations. ‘cosine’ uses a function based on the cosine similarity of each pair of configurations. Note that ‘cosine’ performs faster than ‘euclidean’.
k (float) – Coefficient that controls the shape of the affinity function for method=‘euclidean’. It takes any value in (0, +inf); typical values range from 0.1 to 10. The higher the value of k, the higher the score the same pair of configurations will be assigned. Unused if method=‘cosine’.

Returns (pd.DataFrame): DataFrame with columns key_x + key_y + [‘Score’], containing the same rows as pairs.

For each pair of configurations X and Y, their score is computed as follows:

If method=‘cosine’, the score is the ratio of the squared norm of U*V (equivalently, the scalar product <U, V>) to the product of the norms of U and V.
If method=‘euclidean’, the score is 1 - tanh(k * ||U - V||), where k > 0 is the coefficient k.

If mix_x is True, then U is defined as the mixing of X with respect to edges, otherwise U is X. If mix_y is True, then V is defined as the mixing of Y with respect to edges, otherwise V is Y. Finally, U*V denotes the combination of U and V.
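
The two formulas above can be checked numerically on dense vectors. The sketch below is an illustration only: the function names are hypothetical, configurations are represented as NumPy arrays over a fixed concept order rather than as DataFrames, and the code follows the formulas exactly as written in this section.

```python
import numpy as np


def cosine_affinity(u, v):
    """Squared norm of U*V over the product of the norms of U and V."""
    combined = np.sqrt(u * v)  # U*V: concept-wise geometric mean
    return np.sum(combined ** 2) / (np.linalg.norm(u) * np.linalg.norm(v))


def euclidean_affinity(u, v, k=1.0):
    """1 - tanh(k * ||U - V||), as written in the text."""
    return 1.0 - np.tanh(k * np.linalg.norm(u - v))


# Two configurations over three concepts; both have norm 1.
u = np.array([0.6, 0.8, 0.0])
v = np.array([0.6, 0.0, 0.8])

cos_score = cosine_affinity(u, v)
# The squared norm of U*V equals the scalar product <U, V>.
dot_score = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
euc_score = euclidean_affinity(u, v, k=1.0)
```

Because the squared concept-wise geometric mean sqrt(u_c * v_c)^2 is just u_c * v_c, cos_score and dot_score agree, which is the equivalence stated for method=‘cosine’. In this sketch, identical configurations score 1 under both methods, and for a fixed non-zero distance the value of 1 - tanh(k * d) decreases as k grows.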