gigl.src.common.models.layers.count_min_sketch#
Classes#
CountMinSketch: A probabilistic data structure that can be used to estimate the frequency of an item in a stream.
Functions#
calculate_in_batch_candidate_sampling_probability: Calculate the in-batch negative sampling rate given the frequency tensor, total count, and batch size.
Module Contents#
- class gigl.src.common.models.layers.count_min_sketch.CountMinSketch(width=2000, depth=10)[source]#
Bases:
object
A probabilistic data structure that can be used to estimate the frequency of an item in a stream. For full details, please refer to the paper: https://dsf.berkeley.edu/cs286/papers/countmin-latin2004.pdf
There is also a good blog post from Redis discussing its applications: https://redis.io/blog/count-min-sketch-the-art-and-science-of-estimating-stuff/
- How accurate is the estimation?
Let N be the total count, w the width of the table, and d its depth. For each of the d rows, the hashed count overestimates the true count by less than 2N/w with probability at least 1/2. With d rows, the final estimate therefore exceeds the actual count by more than 2N/w with probability less than 1/2^d. With the defaults (w=2000, d=10), the estimate is within N/1000 of the true count with probability greater than 1 - 1/1024. An illustrative sketch of the underlying update and query rules follows the parameter list below.
Currently, this implementation is single-threaded; we may want to optimize it if performance becomes an issue.
- Parameters:
width (int)
depth (int)
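To make the mechanics above concrete, here is a minimal, self-contained illustration of the count-min update and query rules. This is a sketch of the general technique, not the implementation in this module; the class name TinyCountMinSketch and the MD5-based row hashing are assumptions for illustration.

```python
import hashlib


class TinyCountMinSketch:
    """Illustrative count-min sketch: d rows of w counters; query takes the min."""

    def __init__(self, width: int = 2000, depth: int = 10) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row: int) -> int:
        # Derive one hash function per row by salting with the row id.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, delta: int = 1) -> None:
        # Increment one counter in every row; collisions can only inflate counts.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += delta

    def estimate(self, item) -> int:
        # The min over rows never undercounts, and it overshoots by more than
        # 2N/width with probability below 1/2**depth, as described above.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )
```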
- add(item, delta=1)[source]#
Add an item to the sketch
- Parameters:
item (Any)
delta (int)
- Return type:
None
- add_torch_long_tensor(tensor)[source]#
Add all items in a torch long tensor to the sketch
- Parameters:
tensor (torch.LongTensor)
- Return type:
None
- estimate(item)[source]#
Return the estimated count of the item
- Parameters:
item (Any)
- Return type:
int
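A minimal usage sketch of the scalar add/estimate API (the import path is taken from this page; printed values are expectations, not verified output):

```python
from gigl.src.common.models.layers.count_min_sketch import CountMinSketch

cms = CountMinSketch(width=2000, depth=10)
cms.add("item_a")            # count "item_a" once
cms.add("item_a", delta=4)   # bump "item_a" by 4 more
cms.add("item_b")

# estimate() never undercounts; with this few items it should be exact.
print(cms.estimate("item_a"))  # expected: 5
print(cms.estimate("item_b"))  # expected: 1
```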
- estimate_torch_long_tensor(tensor)[source]#
Return the estimated count of all items in a torch long tensor
- Parameters:
tensor (torch.LongTensor)
- Return type:
torch.LongTensor
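The two tensor helpers can be combined to count node IDs in bulk. A hedged sketch, assuming the import path from this page and that per-element estimates align with the input tensor:

```python
import torch

from gigl.src.common.models.layers.count_min_sketch import CountMinSketch

cms = CountMinSketch()

# Add every ID in a batch of (possibly repeated) node IDs.
node_ids = torch.LongTensor([7, 7, 7, 42, 42, 99])
cms.add_torch_long_tensor(node_ids)

# Query estimated counts element-wise.
counts = cms.estimate_torch_long_tensor(torch.LongTensor([7, 42, 99, 1]))
print(counts)  # expected: tensor([3, 2, 1, 0])
```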
- gigl.src.common.models.layers.count_min_sketch.calculate_in_batch_candidate_sampling_probability(frequency_tensor, total_cnt, batch_size)[source]#
Calculate the in-batch negative sampling rate given the frequency tensor, total count, and batch size. Please see https://www.tensorflow.org/extras/candidate_sampling.pdf for more details. Here we estimate the negative sampling probability Q(y|x):

P(candidate in batch | x) ≈ P(candidate in batch)
= 1 - P(candidate not in batch)
= 1 - P(candidate not in any position in batch)
≈ 1 - (1 - frequency / total_cnt) ^ batch_size
≈ 1 - (1 - batch_size * frequency / total_cnt)
= batch_size * frequency / total_cnt

The final approximation only holds when frequency / total_cnt << 1, which may not be true at the very beginning of training; thus, we cap the probability at 1.0. Note that the estimate for positives and hard negatives may be less accurate than for random negatives, because the error in the approximation P(candidate in batch | x) ≈ P(candidate in batch) is larger for them.
- Parameters:
frequency_tensor (torch.LongTensor)
total_cnt (int)
batch_size (int)
- Return type:
torch.Tensor
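Putting the pieces together, the sampling probability can be computed from sketch-estimated frequencies. A sketch of the intended flow; the stream contents and counts here are illustrative assumptions:

```python
import torch

from gigl.src.common.models.layers.count_min_sketch import (
    CountMinSketch,
    calculate_in_batch_candidate_sampling_probability,
)

cms = CountMinSketch()

# Accumulate frequencies over the stream (illustrative counts).
seen = torch.LongTensor([3] * 50 + [5] * 10 + [8] * 2)
cms.add_torch_long_tensor(seen)
total_cnt = seen.numel()  # 62 items added so far

batch = torch.LongTensor([3, 5, 8])  # candidate IDs in the current batch
frequency = cms.estimate_torch_long_tensor(batch)  # expected: [50, 10, 2]

# Q(y|x) ≈ batch_size * frequency / total_cnt, capped at 1.0.
probs = calculate_in_batch_candidate_sampling_probability(
    frequency_tensor=frequency,
    total_cnt=total_cnt,
    batch_size=batch.numel(),
)
print(probs)  # expected: ~tensor([1.0000, 0.4839, 0.0968]); the first value hit the 1.0 cap
```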