gigl.src.post_process.utils.cosine_similarity#
Attributes#
Functions#
| Function | Summary |
| --- | --- |
| `assert_cosine_similarity_stats(cosine_similarity_stats, expected_cosine_similarity)` | |
| `calculate_cosine_sim_between_embedding_tables(bq_utils, table_1, table_2, n)` | Return: a pd.DataFrame with columns: {DEFAULT_NODE_ID_FIELD, _emb_1, _emb_2, COSINE_SIM_FIELD} |
| `calculate_cosine_similarity_stats(cosine_sim_df)` | Calculates statistics of cosine similarity |
| `get_table_paths_via_timedelta(bq_utils, reference_table, lookback_days)` | |
Module Contents#
- gigl.src.post_process.utils.cosine_similarity.assert_cosine_similarity_stats(cosine_similarity_stats, expected_cosine_similarity)[source]#
- Parameters:
cosine_similarity_stats (pandas.DataFrame)
expected_cosine_similarity (Dict[str, float])
- Return type:
None
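As a rough illustration of what such an assertion helper might look like, here is a minimal sketch. The column name `"cosine_sim"`, the tolerance `atol`, and the stats-frame layout (indexed by statistic name, as produced by `Series.describe()`) are assumptions, not the library's actual conventions; only the parameter types (`pandas.DataFrame`, `Dict[str, float]`) come from the signature above.

```python
from typing import Dict

import pandas as pd


def assert_cosine_similarity_stats_sketch(
    cosine_similarity_stats: pd.DataFrame,
    expected_cosine_similarity: Dict[str, float],
    atol: float = 0.05,  # hypothetical tolerance; the real helper may differ
) -> None:
    # Check each expected statistic (e.g. {"mean": 0.9}) against the stats
    # frame, assuming it is indexed by statistic name with a single
    # "cosine_sim" column.
    for stat_name, expected in expected_cosine_similarity.items():
        actual = float(cosine_similarity_stats.loc[stat_name, "cosine_sim"])
        assert abs(actual - expected) <= atol, (
            f"{stat_name}: got {actual:.4f}, expected {expected:.4f}"
        )


stats = pd.DataFrame({"cosine_sim": [0.91, 0.04]}, index=["mean", "std"])
assert_cosine_similarity_stats_sketch(stats, {"mean": 0.9, "std": 0.05})
```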
- gigl.src.post_process.utils.cosine_similarity.calculate_cosine_sim_between_embedding_tables(bq_utils, table_1, table_2, n)[source]#
Return: a pd.DataFrame with columns: {DEFAULT_NODE_ID_FIELD, _emb_1, _emb_2, COSINE_SIM_FIELD}
NOTE: Currently, the query takes ~17 min for n=100M. If we later wish to increase n (to avoid results that exceed the BQ query limit), we can comment out the last lines. For now we don't, since we don't need to evaluate cosine similarity for more than 100M embeddings, and hence there is no need to store an extra table in BQ.
- Parameters:
bq_utils (gigl.src.common.utils.bq.BqUtils)
table_1 (str)
table_2 (str)
n (int)
- Return type:
pandas.DataFrame
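The actual function runs a BigQuery join between the two embedding tables, but the core computation can be sketched locally with pandas and numpy. Column names (`node_id`, `emb`, `cosine_sim`) and the input layout (one embedding vector per row) are assumptions for illustration; the real output uses DEFAULT_NODE_ID_FIELD and COSINE_SIM_FIELD.

```python
import numpy as np
import pandas as pd


def cosine_sim_between_tables_sketch(
    df1: pd.DataFrame, df2: pd.DataFrame, node_id_col: str = "node_id"
) -> pd.DataFrame:
    # Join the two embedding tables on node id, then compute the row-wise
    # cosine similarity between each pair of embeddings.
    left = df1.rename(columns={"emb": "_emb_1"})
    right = df2.rename(columns={"emb": "_emb_2"})
    merged = left.merge(right, on=node_id_col)
    e1 = np.stack(merged["_emb_1"].to_numpy())
    e2 = np.stack(merged["_emb_2"].to_numpy())
    merged["cosine_sim"] = (e1 * e2).sum(axis=1) / (
        np.linalg.norm(e1, axis=1) * np.linalg.norm(e2, axis=1)
    )
    return merged


df1 = pd.DataFrame({"node_id": [1, 2], "emb": [[1.0, 0.0], [0.0, 2.0]]})
df2 = pd.DataFrame({"node_id": [1, 2], "emb": [[1.0, 0.0], [0.0, 1.0]]})
out = cosine_sim_between_tables_sketch(df1, df2)
```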
- gigl.src.post_process.utils.cosine_similarity.calculate_cosine_similarity_stats(cosine_sim_df)[source]#
Calculates statistics of cosine similarity.
Args: cosine_sim_df (pd.DataFrame): with columns: {DEFAULT_NODE_ID_FIELD, _emb_1, _emb_2, COSINE_SIM_FIELD}
Returns: pd.DataFrame with columns: {count, mean, std, min, 1%, 5%, 25%, 50%, 75%, 95%, 99%, max, dtype}
- Parameters:
cosine_sim_df (pandas.DataFrame)
- Return type:
pandas.DataFrame
- gigl.src.post_process.utils.cosine_similarity.get_table_paths_via_timedelta(bq_utils, reference_table, lookback_days)[source]#
- Parameters:
bq_utils (BqUtils)
reference_table (str) – example: project.gbml_embeddings.embeddings_gigl_2024_01_01
lookback_days (int) – search within this many days and get the latest available table
- Return type:
Tuple[str, str]
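Based on the documented parameters, the lookback logic presumably parses the `YYYY_MM_DD` suffix from the reference table name and walks back day by day until an existing table is found. The sketch below is a hypothetical single-table version (the real function returns a `Tuple[str, str]` and queries BQ via `bq_utils` rather than taking a set of existing tables).

```python
from datetime import datetime, timedelta
from typing import Iterable


def find_latest_table_sketch(
    reference_table: str,
    existing_tables: Iterable[str],  # stand-in for listing tables via BqUtils
    lookback_days: int,
) -> str:
    # Split e.g. "project.gbml_embeddings.embeddings_gigl_2024_01_01" into
    # a prefix and a YYYY_MM_DD date suffix.
    prefix, y, m, d = reference_table.rsplit("_", 3)
    ref_date = datetime(int(y), int(m), int(d))
    existing = set(existing_tables)
    # Walk back one day at a time, returning the latest available table.
    for delta in range(lookback_days + 1):
        day = ref_date - timedelta(days=delta)
        candidate = f"{prefix}_{day.strftime('%Y_%m_%d')}"
        if candidate in existing:
            return candidate
    raise ValueError(
        f"No table found within {lookback_days} days of {reference_table}"
    )
```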