gigl.distributed.graph_store.sharding#
Graph-store-specific sharding helpers.
Provides ServerSlice and compute_server_assignments() which
implement contiguous server-to-compute-node assignment for graph-store
fetch operations.
Storage servers are assigned to compute nodes in contiguous blocks. Each compute node fetches all data from its assigned server(s) and receives empty tensors for unassigned ones.
If there are more servers than compute nodes, the extra servers are divided among compute nodes and each server’s data is sliced proportionally (e.g. with 3 servers and 2 compute nodes, one compute node receives the first half of the middle server’s data and the other receives the second half). If there are more compute nodes than servers, multiple compute nodes share a single server, each receiving a proportional slice of that server’s data.
When rank and world_size are both None, all data is returned
unsharded from every storage server.
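As a concrete illustration of the scheme above, the following sketch hand-codes the 3-server / 2-compute-node example from the description (hypothetical 4-element servers and plain Python lists, not library code):

```python
# Hypothetical per-server data: 3 storage servers, 4 elements each.
server_data = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7], 2: [8, 9, 10, 11]}

# With 2 compute nodes: node 0 owns all of server 0 and the first half of
# server 1; it receives an empty result for server 2. Node 1 mirrors this.
node0 = {0: server_data[0], 1: server_data[1][:2], 2: []}
node1 = {0: [], 1: server_data[1][2:], 2: server_data[2]}

print(node0)  # {0: [0, 1, 2, 3], 1: [4, 5], 2: []}
print(node1)  # {0: [], 1: [6, 7], 2: [8, 9, 10, 11]}
```

Together the two nodes cover every element exactly once, which is the invariant the contiguous-block assignment is designed to preserve.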
Classes#
ServerSlice – The fraction of a storage server's data owned by one compute node.
Functions#
compute_server_assignments() – Compute which servers, and which fractions of them, belong to one compute node.
Module Contents#
- class gigl.distributed.graph_store.sharding.ServerSlice[source]#
The fraction of a storage server’s data owned by one compute node.
Fractions are stored as numerator/denominator pairs rather than concrete indices because the actual tensor length is unknown at assignment time (assignments are computed before data is fetched). The concrete start/end indices are resolved lazily in slice_tensor().
The fraction [start_numerator/denominator, end_numerator/denominator) describes the half-open interval of the server's data that this compute node owns.
- Parameters:
server_rank – The rank of the storage server this slice refers to.
start_numerator – Numerator of the start fraction.
end_numerator – Numerator of the end fraction.
denominator – Shared denominator for both fractions (always equal to num_compute_nodes).
Examples
A slice covering the full server (fraction 0/1 to 1/1):
>>> ServerSlice(server_rank=0, start_numerator=0, end_numerator=1,
...             denominator=1)
A slice covering the first half of a server (fraction 0/2 to 1/2):
>>> ServerSlice(server_rank=1, start_numerator=0, end_numerator=1,
...             denominator=2)
- slice_tensor(tensor)[source]#
Slice a tensor according to this server assignment.
Converts the stored fractions to concrete indices using the tensor's length, then returns tensor[start_idx:end_idx].
- Parameters:
tensor (torch.Tensor) – The full data tensor from the server identified by server_rank.
- Returns:
The sub-tensor belonging to this compute node.
- Return type:
torch.Tensor
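A minimal sketch of the fraction-to-index resolution described above, assuming floor division when converting fractions to indices, and using plain Python lists in place of torch.Tensor (the slicing semantics are identical). This is an illustrative re-implementation, not the library's actual code:

```python
def slice_by_fraction(data, start_numerator, end_numerator, denominator):
    """Return the half-open slice of `data` described by the two fractions.

    Hypothetical stand-in for ServerSlice.slice_tensor(): the stored
    fractions are resolved against the actual data length at call time.
    """
    n = len(data)
    start_idx = n * start_numerator // denominator
    end_idx = n * end_numerator // denominator
    return data[start_idx:end_idx]

# A slice covering the first half of 10 elements (fraction 0/2 to 1/2):
print(slice_by_fraction(list(range(10)), 0, 1, 2))  # [0, 1, 2, 3, 4]
```

Because resolution happens lazily, the same fractional assignment works regardless of how many elements each server actually holds.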
- gigl.distributed.graph_store.sharding.compute_server_assignments(num_servers, num_compute_nodes, compute_rank)[source]#
Compute which servers, and which fractions of them, belong to one compute node.
Maps the range of servers onto the range of compute nodes using a segment-overlap algorithm. Each compute node "owns" a contiguous segment of length num_servers on a number line of total length num_servers * num_compute_nodes. Each server occupies a segment of length num_compute_nodes on that same line. The overlap between a compute node's segment and a server's segment determines the fraction of that server assigned to the compute node.
- Parameters:
num_servers (int) – Total number of storage servers.
num_compute_nodes (int) – Total number of compute nodes.
compute_rank (int) – The rank of the compute node to compute assignments for, in [0, num_compute_nodes).
- Returns:
A dict mapping server rank to a ServerSlice describing the fraction of that server's data owned by compute_rank. Servers with no overlap are omitted from the dict.
- Raises:
ValueError – If any argument is out of its valid range.
- Return type:
dict[int, ServerSlice]
Examples
With 2 servers and 2 compute nodes, each compute node gets one full server:
>>> compute_server_assignments(num_servers=2, num_compute_nodes=2, compute_rank=0)
{0: ServerSlice(server_rank=0, start_numerator=0, end_numerator=2, denominator=2)}
With 3 servers and 2 compute nodes, compute rank 0 gets all of server 0 and the first half of server 1:
>>> compute_server_assignments(num_servers=3, num_compute_nodes=2, compute_rank=0)
{0: ServerSlice(..., 0..2/2), 1: ServerSlice(..., 0..1/2)}
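The segment-overlap algorithm described above can be sketched as follows. This is a hypothetical re-implementation returning plain (start_numerator, end_numerator, denominator) tuples in place of ServerSlice objects, not the library's actual code:

```python
def server_fractions(num_servers, num_compute_nodes, compute_rank):
    """Map servers to one compute node via segment overlap.

    The number line has total length num_servers * num_compute_nodes.
    The compute node owns the segment [compute_rank * num_servers,
    (compute_rank + 1) * num_servers); server s occupies
    [s * num_compute_nodes, (s + 1) * num_compute_nodes).
    """
    node_start = compute_rank * num_servers
    node_end = node_start + num_servers
    out = {}
    for server_rank in range(num_servers):
        srv_start = server_rank * num_compute_nodes
        srv_end = srv_start + num_compute_nodes
        lo = max(node_start, srv_start)
        hi = min(node_end, srv_end)
        if lo < hi:  # servers with no overlap are omitted
            # Express the overlap as a fraction of this server's segment.
            out[server_rank] = (lo - srv_start, hi - srv_start, num_compute_nodes)
    return out

# 3 servers, 2 compute nodes: rank 0 gets all of server 0 (0/2..2/2)
# and the first half of server 1 (0/2..1/2).
print(server_fractions(3, 2, 0))  # {0: (0, 2, 2), 1: (0, 1, 2)}
```

Note that the denominator of every emitted fraction is num_compute_nodes, matching the invariant documented for ServerSlice.denominator.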