gigl.src.mocking.lib.pyg_datasets_forks

Our mocking logic uses public PyG datasets such as Cora and DBLP. PyG downloads these datasets from public sources that may be unavailable or may rate-limit requests. To avoid such failures, this module overrides the dataset classes so that they download the data from GCS buckets instead.
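The override pattern described above can be illustrated with a minimal sketch. The classes below are hypothetical stand-ins, not the PyG or GiGL implementations: `PublicDataset` imitates an upstream dataset whose public source is unreachable, and `MirroredDataset` overrides `url` and `download()` so raw files come from a location we control. A local directory plays the role of the GCS bucket so the example runs without network access.

```python
import shutil
import tempfile
from pathlib import Path


class PublicDataset:
    """Hypothetical stand-in for an upstream dataset class."""

    url = "https://example.com/public-datasets"  # hypothetical public source
    raw_file_names = ["data.txt"]

    def __init__(self, root: str) -> None:
        self.raw_dir = Path(root) / "raw"
        self.raw_dir.mkdir(parents=True, exist_ok=True)
        # Download only when a raw file is missing.
        if not all((self.raw_dir / name).exists() for name in self.raw_file_names):
            self.download()

    def download(self) -> None:
        # Simulates an unavailable or rate-limited public source.
        raise ConnectionError(f"public source unavailable: {self.url}")


class MirroredDataset(PublicDataset):
    """Same dataset, but raw files are fetched from a mirror (assumed layout)."""

    url = ""  # location of the mirror; a local directory in this sketch

    def download(self) -> None:
        # Copy each raw file from the mirror into self.raw_dir.
        for name in self.raw_file_names:
            shutil.copy(Path(self.url) / name, self.raw_dir / name)


def demo() -> str:
    """Build a throwaway mirror, load from it, and return the raw file content."""
    with tempfile.TemporaryDirectory() as tmp:
        mirror = Path(tmp) / "mirror"
        mirror.mkdir()
        (mirror / "data.txt").write_text("node_a node_b\n")
        MirroredDataset.url = str(mirror)
        dataset = MirroredDataset(root=str(Path(tmp) / "root"))
        return (dataset.raw_dir / "data.txt").read_text()
```

Only the download source changes in the subclass; everything downstream of the raw files is inherited unchanged, which is the appeal of this approach.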

Attributes

unprocessed_datasets_gcs_uri

Classes

CoraFromGCS

The citation network datasets "Cora", "CiteSeer" and "PubMed" from the “Revisiting Semi-Supervised Learning with Graph Embeddings” paper.

DBLPFromGCS

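A subset of the DBLP computer science bibliography website, as collected in the “MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding” paper.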

Module Contents

class gigl.src.mocking.lib.pyg_datasets_forks.CoraFromGCS(root, name, split='public', num_train_per_class=20, num_val=500, num_test=1000, transform=None, pre_transform=None, force_reload=False)

Bases: torch_geometric.datasets.Planetoid

The citation network datasets "Cora", "CiteSeer" and "PubMed" from the “Revisiting Semi-Supervised Learning with Graph Embeddings” paper. Nodes represent documents and edges represent citation links. Training, validation and test splits are given by binary masks.

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – The name of the dataset ("Cora", "CiteSeer", "PubMed").

  • split (str, optional) – The type of dataset split ("public", "full", "geom-gcn", "random"). If set to "public", the split will be the public fixed split from the “Revisiting Semi-Supervised Learning with Graph Embeddings” paper. If set to "full", all nodes except those in the validation and test sets will be used for training (as in the “FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling” paper). If set to "geom-gcn", the 10 public fixed splits from the “Geom-GCN: Geometric Graph Convolutional Networks” paper are given. If set to "random", train, validation, and test sets will be randomly generated, according to num_train_per_class, num_val and num_test. (default: "public")

  • num_train_per_class (int, optional) – The number of training samples per class in case of "random" split. (default: 20)

  • num_val (int, optional) – The number of validation samples in case of "random" split. (default: 500)

  • num_test (int, optional) – The number of test samples in case of "random" split. (default: 1000)

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • force_reload (bool, optional) – Whether to re-process the dataset. (default: False)
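To make the "random" split semantics concrete, here is a minimal sketch in plain Python of how such masks could be generated from num_train_per_class, num_val and num_test. It is an illustration under stated assumptions, not the library's source: the function name `random_split`, the `seed` argument, and the use of Python lists instead of tensors are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def random_split(
    labels: Sequence[int],
    num_train_per_class: int = 20,
    num_val: int = 500,
    num_test: int = 1000,
    seed: int = 0,  # hypothetical argument, added for reproducibility
) -> Tuple[List[bool], List[bool], List[bool]]:
    """Return (train_mask, val_mask, test_mask) as lists of booleans."""
    rng = random.Random(seed)
    n = len(labels)
    train_mask = [False] * n
    val_mask = [False] * n
    test_mask = [False] * n

    # Training set: a fixed number of randomly chosen nodes per class.
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for i in members[:num_train_per_class]:
            train_mask[i] = True

    # Validation and test sets: disjoint random samples of the remaining nodes.
    remaining = [i for i in range(n) if not train_mask[i]]
    rng.shuffle(remaining)
    for i in remaining[:num_val]:
        val_mask[i] = True
    for i in remaining[num_val:num_val + num_test]:
        test_mask[i] = True

    return train_mask, val_mask, test_mask
```

With the default arguments and a 7-class label vector the size of Cora, this yields 7 × 20 = 140 training nodes, 500 validation nodes and 1000 test nodes, with no node in more than one mask.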

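STATS:

| Name     | #nodes | #edges | #features | #classes |
|----------|--------|--------|-----------|----------|
| Cora     | 2,708  | 10,556 | 1,433     | 7        |
| CiteSeer | 3,327  | 9,104  | 3,703     | 6        |
| PubMed   | 19,717 | 88,648 | 500       | 3        |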
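The statistics above can double as a sanity check after a download. The sketch below encodes the table as a dictionary and reports any mismatch against observed values. The helper names `stats_mismatches` and `observed_stats` are hypothetical; `observed_stats` additionally assumes that GiGL and PyG are installed and that the environment can read the GCS bucket, so its imports are deferred until it is called.

```python
from typing import Dict, List

# Values copied from the STATS table for the three Planetoid datasets.
EXPECTED_STATS: Dict[str, Dict[str, int]] = {
    "Cora": {"nodes": 2708, "edges": 10556, "features": 1433, "classes": 7},
    "CiteSeer": {"nodes": 3327, "edges": 9104, "features": 3703, "classes": 6},
    "PubMed": {"nodes": 19717, "edges": 88648, "features": 500, "classes": 3},
}


def stats_mismatches(name: str, observed: Dict[str, int]) -> List[str]:
    """Return one message per statistic that differs from the table."""
    expected = EXPECTED_STATS[name]
    return [
        f"{key}: expected {expected[key]}, got {observed.get(key)}"
        for key in expected
        if observed.get(key) != expected[key]
    ]


def observed_stats(root: str, name: str) -> Dict[str, int]:
    """Load a dataset through CoraFromGCS and collect its statistics.

    Assumption: requires gigl, torch_geometric and GCS access at call time.
    """
    from gigl.src.mocking.lib.pyg_datasets_forks import CoraFromGCS

    dataset = CoraFromGCS(root=root, name=name)
    data = dataset[0]  # a Planetoid dataset holds a single graph
    return {
        "nodes": data.num_nodes,
        "edges": data.num_edges,
        "features": dataset.num_features,
        "classes": dataset.num_classes,
    }
```

In an environment with the required access, `stats_mismatches("Cora", observed_stats("/tmp/cora", "Cora"))` would be expected to return an empty list; the root path here is only a placeholder.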

download()

Downloads the dataset to the self.raw_dir folder.

url

class gigl.src.mocking.lib.pyg_datasets_forks.DBLPFromGCS(root, transform=None, pre_transform=None, force_reload=False)

Bases: torch_geometric.datasets.DBLP

A subset of the DBLP computer science bibliography website, as collected in the “MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding” paper. DBLP is a heterogeneous graph containing four types of entities - authors (4,057 nodes), papers (14,328 nodes), terms (7,723 nodes), and conferences (20 nodes). The authors are divided into four research areas (database, data mining, artificial intelligence, information retrieval). Each author is described by a bag-of-words representation of their paper keywords.

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.HeteroData object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.HeteroData object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • force_reload (bool, optional) – Whether to re-process the dataset. (default: False)

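STATS:

| Node/Edge Type   | #nodes/#edges | #features | #classes |
|------------------|---------------|-----------|----------|
| Author           | 4,057         | 334       | 4        |
| Paper            | 14,328        | 4,231     |          |
| Term             | 7,723         | 50        |          |
| Conference       | 20            | 0         |          |
| Author-Paper     | 196,425       |           |          |
| Paper-Term       | 85,810        |           |          |
| Conference-Paper | 14,328        |           |          |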
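Because DBLP is heterogeneous, its statistics are kept per node type and per edge type rather than as a single count. The sketch below stores the table in plain dictionaries and totals it. The lowercase type keys and the helper names are assumptions made for illustration; `load_dblp` further assumes that GiGL and PyG are installed and that GCS is reachable, so it imports lazily and is not called here.

```python
from typing import Dict, Tuple

# node type -> (number of nodes, number of features), from the STATS table.
NODE_STATS: Dict[str, Tuple[int, int]] = {
    "author": (4057, 334),
    "paper": (14328, 4231),
    "term": (7723, 50),
    "conference": (20, 0),
}

# (source type, destination type) -> number of edges, from the STATS table.
EDGE_STATS: Dict[Tuple[str, str], int] = {
    ("author", "paper"): 196425,
    ("paper", "term"): 85810,
    ("conference", "paper"): 14328,
}


def summarize() -> Dict[str, int]:
    """Total the per-type counts listed in the table."""
    return {
        "total_nodes": sum(count for count, _ in NODE_STATS.values()),
        "total_edges": sum(EDGE_STATS.values()),
    }


def load_dblp(root: str):
    """Load the heterogeneous graph through DBLPFromGCS.

    Assumption: requires gigl, torch_geometric and GCS access at call time.
    """
    from gigl.src.mocking.lib.pyg_datasets_forks import DBLPFromGCS

    dataset = DBLPFromGCS(root=root)
    return dataset[0]  # a torch_geometric.data.HeteroData object
```

The totals cover only the three edge types listed in the table; a loaded graph may also store reverse edge types, so its overall edge count need not match `total_edges`.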

download()

Downloads the dataset to the self.raw_dir folder.

url

gigl.src.mocking.lib.pyg_datasets_forks.unprocessed_datasets_gcs_uri