gigl.src.common.models.layers.count_min_sketch#

Attributes#

logger

Classes#

CountMinSketch

A probabilistic data structure that can be used to estimate the frequency of an item in a stream.

Functions#

calculate_in_batch_candidate_sampling_probability(...)

Calculate the in-batch negative sampling rate given the frequency tensor, total count, and batch size.

Module Contents#

class gigl.src.common.models.layers.count_min_sketch.CountMinSketch(width=2000, depth=10)[source]#

Bases: object

A probabilistic data structure that can be used to estimate the frequency of an item in a stream. For full details, please refer to the paper: https://dsf.berkeley.edu/cs286/papers/countmin-latin2004.pdf

There is also a good blog post from Redis discussing its applications: https://redis.io/blog/count-min-sketch-the-art-and-science-of-estimating-stuff/

How accurate is the estimation?

Denote the total count by N, the width of the table by w, and the depth by d. For each of the d rows, the expected overcount is at most N/w, so by Markov's inequality the probability that a row's hashed count exceeds the true count by more than 2N/w is at most 1/2. Since the estimate takes the minimum over the d rows, the final result exceeds the actual count by more than 2N/w with probability less than 1/2^d. For example, with the default w = 2000 and d = 10, the estimate exceeds the true count by more than N/1000 with probability less than 1/2^10, i.e. about 0.1%.

Currently, this implementation is single-threaded; we may want to optimize it if we see performance issues.

Parameters:
  • width (int)

  • depth (int)
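
To make the width/depth trade-off concrete, here is a minimal, self-contained sketch of the same technique. This is an illustration only, not gigl's actual implementation; the hash scheme, seeding, and the class name SimpleCountMinSketch are assumptions for the example.

```python
import numpy as np

class SimpleCountMinSketch:
    """Illustrative count-min sketch: a depth x width table of counters."""

    def __init__(self, width: int = 2000, depth: int = 10):
        self.width = width
        self.depth = depth
        self.table = np.zeros((depth, width), dtype=np.int64)
        self.seeds = list(range(depth))  # one hash seed per row

    def _columns(self, item):
        # Pick one column per row by hashing (seed, item) pairs.
        # Note: Python's hash() is salted per process, which is fine
        # for a single-process illustration.
        return [hash((seed, item)) % self.width for seed in self.seeds]

    def add(self, item, delta: int = 1) -> None:
        for row, col in enumerate(self._columns(item)):
            self.table[row, col] += delta

    def estimate(self, item) -> int:
        # Collisions only inflate counters, so the row-wise minimum is
        # the tightest estimate and never under-counts.
        return int(min(self.table[row, col]
                       for row, col in enumerate(self._columns(item))))
```

Widening the table (larger w) tightens the 2N/w error bound, while deepening it (larger d) shrinks the failure probability 1/2^d, at the cost of d hash computations per update and query.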

add(item, delta=1)[source]#

Add an item to the sketch

Parameters:
  • item (Any)

  • delta (int)

Return type:

None
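
A brief usage sketch, assuming gigl is installed and the method behaves as documented:

```python
from gigl.src.common.models.layers.count_min_sketch import CountMinSketch

cms = CountMinSketch(width=2000, depth=10)
cms.add("user_42")           # increment by the default delta of 1
cms.add("user_42", delta=5)  # increment by an explicit delta
```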

add_torch_long_tensor(tensor)[source]#

Add all items in a torch long tensor to the sketch

Parameters:

tensor (torch.LongTensor)

Return type:

None
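
For example, node IDs from a batch can be fed in directly as a long tensor (a sketch under the documented signature; the IDs are illustrative):

```python
import torch

from gigl.src.common.models.layers.count_min_sketch import CountMinSketch

cms = CountMinSketch()
node_ids = torch.tensor([3, 7, 3, 11], dtype=torch.long)
cms.add_torch_long_tensor(node_ids)  # each element is added, so 3 is counted twice
```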

estimate(item)[source]#

Return the estimated count of the item

Parameters:

item (Any)

Return type:

int
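
Because collisions can only inflate counters, the estimate never under-counts. A quick sketch; the exact values assume no collisions, which is overwhelmingly likely at the default table size:

```python
from gigl.src.common.models.layers.count_min_sketch import CountMinSketch

cms = CountMinSketch()
cms.add("item_a", delta=3)
print(cms.estimate("item_a"))  # 3, or slightly more if hashes collide
print(cms.estimate("unseen"))  # 0 in the common case, never negative
```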

estimate_torch_long_tensor(tensor)[source]#

Return the estimated count of all items in a torch long tensor

Parameters:

tensor (torch.LongTensor)

Return type:

torch.LongTensor
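
A vectorized counterpart to estimate(); the example output assumes no hash collisions:

```python
import torch

from gigl.src.common.models.layers.count_min_sketch import CountMinSketch

cms = CountMinSketch()
cms.add_torch_long_tensor(torch.tensor([5, 5, 9], dtype=torch.long))
counts = cms.estimate_torch_long_tensor(torch.tensor([5, 9, 13], dtype=torch.long))
# counts is a LongTensor of per-item estimates, e.g. tensor([2, 1, 0])
```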

get_table()[source]#

Return the internal state of the table, for testing purposes

Return type:

numpy.ndarray
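
A sketch of how a test might inspect the table; the depth x width shape and one-increment-per-row behavior are assumptions based on the standard count-min sketch design, not a documented contract:

```python
import numpy as np

from gigl.src.common.models.layers.count_min_sketch import CountMinSketch

cms = CountMinSketch(width=100, depth=4)
cms.add("x")
table = cms.get_table()
assert isinstance(table, np.ndarray)
assert table.sum() == 4  # one counter incremented in each of the 4 rows
```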

total()[source]#

Return the total number of items seen so far

Return type:

int
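
A short sketch, assuming total() returns the sum of all deltas added so far (the N used in the error bound above):

```python
from gigl.src.common.models.layers.count_min_sketch import CountMinSketch

cms = CountMinSketch()
cms.add("a")
cms.add("b", delta=4)
print(cms.total())  # 5: one default increment plus a delta of 4
```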

gigl.src.common.models.layers.count_min_sketch.calculate_in_batch_candidate_sampling_probability(frequency_tensor, total_cnt, batch_size)[source]#

Calculate the in-batch negative sampling rate given the frequency tensor, total count, and batch size. Please see https://www.tensorflow.org/extras/candidate_sampling.pdf for more details.

Here we estimate the negative sampling probability Q(y|x):

P(candidate in batch | x) ≈ P(candidate in batch)
= 1 - P(candidate not in batch)
= 1 - P(candidate not in any position in batch)
≈ 1 - (1 - frequency / total_cnt) ^ batch_size
≈ 1 - (1 - batch_size * frequency / total_cnt)
= batch_size * frequency / total_cnt

The first-order approximation (1 - x)^n ≈ 1 - nx only holds when frequency / total_cnt << 1, which may not be true at the very beginning of training; thus, we cap the probability at 1.0. Note that the estimation for positives and hard negatives may be less accurate than for random negatives, because the error in P(candidate in batch | x) ≈ P(candidate in batch) is larger for them.

Parameters:
  • frequency_tensor (torch.LongTensor)

  • total_cnt (int)

  • batch_size (int)

Return type:

torch.Tensor
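
A sketch of the intended use for logQ-corrected in-batch negative sampling; the variable names are illustrative, and the cap-at-1.0 behavior is as documented above:

```python
import torch

from gigl.src.common.models.layers.count_min_sketch import (
    CountMinSketch,
    calculate_in_batch_candidate_sampling_probability,
)

cms = CountMinSketch()
batch_ids = torch.tensor([1, 2, 2, 3], dtype=torch.long)
cms.add_torch_long_tensor(batch_ids)  # accumulate frequencies across batches

freq = cms.estimate_torch_long_tensor(batch_ids)
q = calculate_in_batch_candidate_sampling_probability(
    frequency_tensor=freq,
    total_cnt=cms.total(),
    batch_size=batch_ids.numel(),
)
# q ≈ batch_size * freq / total_cnt, capped at 1.0. Subtracting log(q) from the
# candidate logits (logQ correction) compensates for frequent items appearing
# more often as in-batch negatives.
```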

gigl.src.common.models.layers.count_min_sketch.logger[source]#