gigl.src.mocking.lib.pyg_datasets_forks

Our mocking logic uses public PyG datasets such as Cora and DBLP. PyG downloads these datasets from public sources, which may be unavailable or may rate-limit requests. To avoid these failures, this module overrides the dataset classes so that they download the datasets from GCS buckets instead.
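The override pattern described above can be sketched with plain Python stand-ins. This is a minimal illustration under stated assumptions, not this module's implementation: `PublicDataset`, `MirroredDataset`, and `mirror_dir` are hypothetical names, and a local directory stands in for the GCS bucket. A real subclass would extend a PyG dataset class and fetch from GCS rather than copy local files.

```python
import shutil
import tempfile
from pathlib import Path


class PublicDataset:
    """Hypothetical stand-in for a dataset class that downloads from a public host."""

    raw_file_names = ["graph.txt"]

    def __init__(self, root: str) -> None:
        self.raw_dir = Path(root) / "raw"
        self.raw_dir.mkdir(parents=True, exist_ok=True)
        # Download only if the raw files are not already present.
        if not all((self.raw_dir / name).exists() for name in self.raw_file_names):
            self.download()

    def download(self) -> None:
        # In this sketch the public host is always unavailable.
        raise ConnectionError("public source unavailable or rate-limited")


class MirroredDataset(PublicDataset):
    """Same dataset, but download() reads from a mirror we control (assumption)."""

    def __init__(self, root: str, mirror_dir: str) -> None:
        # Must be set before the base constructor triggers download().
        self.mirror_dir = Path(mirror_dir)
        super().__init__(root)

    def download(self) -> None:
        # Copy every raw file from the mirror into self.raw_dir.
        for name in self.raw_file_names:
            shutil.copy(self.mirror_dir / name, self.raw_dir / name)


with tempfile.TemporaryDirectory() as tmp:
    # Populate the fake mirror with one raw file.
    mirror = Path(tmp) / "mirror"
    mirror.mkdir()
    (mirror / "graph.txt").write_text("0 1\n1 2\n")

    dataset = MirroredDataset(root=str(Path(tmp) / "data"), mirror_dir=str(mirror))
    downloaded = (dataset.raw_dir / "graph.txt").read_text()
```

Only `download()` changes in the subclass, so everything the base class does after the raw files are in place is inherited unchanged.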

Attributes

unprocessed_datasets_gcs_uri

Classes

`CoraFromGCS`	The citation network datasets `"Cora"`, `"CiteSeer"` and `"PubMed"`.
`DBLPFromGCS`	A subset of the DBLP computer science bibliography website, as collected in the MAGNN paper.

Module Contents

class gigl.src.mocking.lib.pyg_datasets_forks.CoraFromGCS(root, name, split='public', num_train_per_class=20, num_val=500, num_test=1000, transform=None, pre_transform=None, force_reload=False)

Bases: torch_geometric.datasets.Planetoid

The citation network datasets "Cora", "CiteSeer" and "PubMed" from the “Revisiting Semi-Supervised Learning with Graph Embeddings” paper. Nodes represent documents and edges represent citation links. Training, validation and test splits are given by binary masks.

Parameters:

root (str) – Root directory where the dataset should be saved.
name (str) – The name of the dataset ("Cora", "CiteSeer", "PubMed").
split (str, optional) – The type of dataset split ("public", "full", "geom-gcn", "random"). If set to "public", the split will be the public fixed split from the “Revisiting Semi-Supervised Learning with Graph Embeddings” paper. If set to "full", all nodes except those in the validation and test sets will be used for training (as in the “FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling” paper). If set to "geom-gcn", the 10 public fixed splits from the “Geom-GCN: Geometric Graph Convolutional Networks” paper are given. If set to "random", train, validation, and test sets will be randomly generated, according to num_train_per_class, num_val and num_test. (default: "public")
num_train_per_class (int, optional) – The number of training samples per class in case of "random" split. (default: 20)
num_val (int, optional) – The number of validation samples in case of "random" split. (default: 500)
num_test (int, optional) – The number of test samples in case of "random" split. (default: 1000)
transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
force_reload (bool, optional) – Whether to re-process the dataset. (default: False)
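How num_train_per_class, num_val and num_test interact under the "random" split can be sketched in plain Python. This is an illustrative reading of the parameter descriptions above, not PyG's code: the helper `random_split`, its `seed` argument, and the synthetic labels are assumptions, and the real implementation works on tensors and may differ in detail.

```python
import random


def random_split(labels, num_train_per_class=20, num_val=500, num_test=1000, seed=0):
    """Return boolean train/val/test masks for a list of integer class labels."""
    rng = random.Random(seed)
    num_nodes = len(labels)

    # Training set: a fixed number of randomly chosen nodes from every class.
    train = set()
    for c in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == c]
        rng.shuffle(idx)
        train.update(idx[:num_train_per_class])

    # Validation and test sets: disjoint slices of the shuffled remaining nodes.
    remaining = [i for i in range(num_nodes) if i not in train]
    rng.shuffle(remaining)
    val = set(remaining[:num_val])
    test = set(remaining[num_val:num_val + num_test])

    train_mask = [i in train for i in range(num_nodes)]
    val_mask = [i in val for i in range(num_nodes)]
    test_mask = [i in test for i in range(num_nodes)]
    return train_mask, val_mask, test_mask


# Synthetic labels with Cora's node and class counts (2,708 nodes, 7 classes).
labels = [i % 7 for i in range(2708)]
train_mask, val_mask, test_mask = random_split(labels)
```

With the defaults and seven classes, the training mask selects 7 × 20 = 140 nodes, and the validation and test masks select 500 and 1,000 of the remaining nodes.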

STATS:

Name	#nodes	#edges	#features	#classes
Cora	2,708	10,556	1,433	7
CiteSeer	3,327	9,104	3,703	6
PubMed	19,717	88,648	500	3

download(): Downloads the dataset to the self.raw_dir folder.

url

class gigl.src.mocking.lib.pyg_datasets_forks.DBLPFromGCS(root, transform=None, pre_transform=None, force_reload=False)

Bases: torch_geometric.datasets.DBLP

A subset of the DBLP computer science bibliography website, as collected in the “MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding” paper. DBLP is a heterogeneous graph containing four types of entities - authors (4,057 nodes), papers (14,328 nodes), terms (7,723 nodes), and conferences (20 nodes). The authors are divided into four research areas (database, data mining, artificial intelligence, information retrieval). Each author is described by a bag-of-words representation of their paper keywords.

Parameters:

root (str) – Root directory where the dataset should be saved.
transform (callable, optional) – A function/transform that takes in a torch_geometric.data.HeteroData object and returns a transformed version. The data object will be transformed before every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.HeteroData object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
force_reload (bool, optional) – Whether to re-process the dataset. (default: False)

STATS:

Node/Edge Type	#nodes/#edges	#features	#classes
Author	4,057	334	4
Paper	14,328	4,231
Term	7,723	50
Conference	20	0
Author-Paper	196,425
Paper-Term	85,810
Conference-Paper	14,328
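The counts in the table above can serve as a sanity check on a mocked copy of the dataset. The sketch below is a hypothetical check, not part of this module: `count_mismatches`, the dictionary keys, and the deliberately wrong `observed_nodes` value are illustrative assumptions, and the expected numbers are copied from the table.

```python
# Expected counts, copied from the STATS table above.
EXPECTED_NODES = {"author": 4057, "paper": 14328, "term": 7723, "conference": 20}
EXPECTED_EDGES = {
    ("author", "paper"): 196425,
    ("paper", "term"): 85810,
    ("conference", "paper"): 14328,
}


def count_mismatches(observed, expected):
    """Return {key: (observed, expected)} for every key whose count differs."""
    return {
        key: (observed.get(key), value)
        for key, value in expected.items()
        if observed.get(key) != value
    }


# Totals implied by the table.
total_nodes = sum(EXPECTED_NODES.values())
total_edges = sum(EXPECTED_EDGES.values())

# A made-up observation with one wrong count, to show what a failure looks like.
observed_nodes = dict(EXPECTED_NODES, term=7000)
mismatches = count_mismatches(observed_nodes, EXPECTED_NODES)
```

An empty result from `count_mismatches` means the observed counts agree with the table.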

download(): Downloads the dataset to the self.raw_dir folder.

url

gigl.src.mocking.lib.pyg_datasets_forks.unprocessed_datasets_gcs_uri