gigl.src.mocking.lib.pyg_datasets_forks#
Our mocking logic uses public PyG datasets such as Cora and DBLP. PyG downloads these datasets from public sources, which may be unavailable or may rate-limit requests. To avoid such failures, this module overrides the dataset classes so that they download the data from GCS buckets instead.
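The override pattern described above can be sketched without PyG. This is a minimal illustration, not the module's actual implementation: `PublicDataset`, `MirroredDataset`, the `Fetcher` type, and both URLs are hypothetical placeholders, and the real classes may override the download method rather than a URL attribute.

```python
from pathlib import Path
from typing import Callable

# Hypothetical: a fetcher receives (url, destination path) and writes the file.
Fetcher = Callable[[str, Path], None]


class PublicDataset:
    """Stand-in for an upstream dataset class that downloads from a public host."""

    url = "https://public.example.com/data"  # placeholder, not a real endpoint
    raw_file_names = ["graph.bin"]

    def __init__(self, root: str, fetch: Fetcher) -> None:
        self.raw_dir = Path(root) / "raw"
        self._fetch = fetch
        self.download()

    def download(self) -> None:
        self.raw_dir.mkdir(parents=True, exist_ok=True)
        for name in self.raw_file_names:
            target = self.raw_dir / name
            if not target.exists():  # skip files that are already cached
                self._fetch(f"{self.url}/{name}", target)


class MirroredDataset(PublicDataset):
    """Same loading logic as the parent; only the download location changes."""

    url = "gs://example-bucket/data"  # placeholder bucket path
```

Because only the download location changes, code that consumes the parent class should work unchanged with the subclass.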
Attributes#
Classes#
CoraFromGCS | The citation network datasets "Cora", "CiteSeer" and "PubMed".
DBLPFromGCS | A subset of the DBLP computer science bibliography website, as collected in the MAGNN paper.
Module Contents#
- class gigl.src.mocking.lib.pyg_datasets_forks.CoraFromGCS(root, name, split='public', num_train_per_class=20, num_val=500, num_test=1000, transform=None, pre_transform=None, force_reload=False)[source]#
Bases: torch_geometric.datasets.Planetoid
The citation network datasets "Cora", "CiteSeer" and "PubMed" from the “Revisiting Semi-Supervised Learning with Graph Embeddings” paper. Nodes represent documents and edges represent citation links. Training, validation and test splits are given by binary masks.
- Parameters:
root (str) – Root directory where the dataset should be saved.
name (str) – The name of the dataset ("Cora", "CiteSeer", "PubMed").
split (str, optional) – The type of dataset split ("public", "full", "geom-gcn", "random"). If set to "public", the split will be the public fixed split from the “Revisiting Semi-Supervised Learning with Graph Embeddings” paper. If set to "full", all nodes except those in the validation and test sets will be used for training (as in the “FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling” paper). If set to "geom-gcn", the 10 public fixed splits from the “Geom-GCN: Geometric Graph Convolutional Networks” paper are given. If set to "random", train, validation, and test sets will be randomly generated, according to num_train_per_class, num_val and num_test. (default: "public")
num_train_per_class (int, optional) – The number of training samples per class in case of "random" split. (default: 20)
num_val (int, optional) – The number of validation samples in case of "random" split. (default: 500)
num_test (int, optional) – The number of test samples in case of "random" split. (default: 1000)
transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
force_reload (bool, optional) – Whether to re-process the dataset. (default: False)
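The "random" split controlled by num_train_per_class, num_val and num_test can be illustrated in plain Python. This is a sketch of the procedure as the parameter descriptions state it; the function name `random_split` and the `seed` argument are assumptions for illustration, and the upstream PyG implementation works on tensors and may differ in detail.

```python
import random
from typing import List, Optional, Sequence, Tuple


def random_split(
    labels: Sequence[int],
    num_train_per_class: int = 20,
    num_val: int = 500,
    num_test: int = 1000,
    seed: Optional[int] = None,
) -> Tuple[List[bool], List[bool], List[bool]]:
    """Return (train_mask, val_mask, test_mask) as lists of booleans."""
    rng = random.Random(seed)
    n = len(labels)
    train = [False] * n
    val = [False] * n
    test = [False] * n

    # Draw a fixed number of training nodes from every class.
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for i in members[:num_train_per_class]:
            train[i] = True

    # Validation and test nodes come from whatever is left, without overlap.
    remaining = [i for i in range(n) if not train[i]]
    rng.shuffle(remaining)
    for i in remaining[:num_val]:
        val[i] = True
    for i in remaining[num_val:num_val + num_test]:
        test[i] = True
    return train, val, test
```

With the default values and a 7-class dataset in which every class has at least 20 nodes, this selects 7 × 20 = 140 training nodes, 500 validation nodes and 1,000 test nodes.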
STATS:
| Name | #nodes | #edges | #features | #classes |
|---|---|---|---|---|
| Cora | 2,708 | 10,556 | 1,433 | 7 |
| CiteSeer | 3,327 | 9,104 | 3,703 | 6 |
| PubMed | 19,717 | 88,648 | 500 | 3 |
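A minimal usage sketch for the class follows. The helper `load_planetoid_from_gcs` and its up-front argument checks are hypothetical additions for illustration, not part of the module. The import is deferred so the helper can be defined without gigl installed. Whether the GCS mirror hosts datasets other than Cora is not stated here, so the default stays on "Cora".

```python
from typing import Any

# Accepted values, copied from the parameter descriptions above.
VALID_NAMES = ("Cora", "CiteSeer", "PubMed")
VALID_SPLITS = ("public", "full", "geom-gcn", "random")


def load_planetoid_from_gcs(root: str, name: str = "Cora", split: str = "public") -> Any:
    """Validate the arguments, then load the dataset and return its single graph."""
    if name not in VALID_NAMES:
        raise ValueError(f"name must be one of {VALID_NAMES}, got {name!r}")
    if split not in VALID_SPLITS:
        raise ValueError(f"split must be one of {VALID_SPLITS}, got {split!r}")

    # Deferred import: requires gigl (and PyG) only when actually loading.
    from gigl.src.mocking.lib.pyg_datasets_forks import CoraFromGCS

    dataset = CoraFromGCS(root=root, name=name, split=split)
    return dataset[0]  # a torch_geometric.data.Data object
```

Calling `data = load_planetoid_from_gcs("/tmp/cora")` should then yield a graph whose `data.num_nodes` matches the STATS table, with the split exposed through `data.train_mask`, `data.val_mask` and `data.test_mask`.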
- class gigl.src.mocking.lib.pyg_datasets_forks.DBLPFromGCS(root, transform=None, pre_transform=None, force_reload=False)[source]#
Bases: torch_geometric.datasets.DBLP
A subset of the DBLP computer science bibliography website, as collected in the “MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding” paper. DBLP is a heterogeneous graph containing four types of entities - authors (4,057 nodes), papers (14,328 nodes), terms (7,723 nodes), and conferences (20 nodes). The authors are divided into four research areas (database, data mining, artificial intelligence, information retrieval). Each author is described by a bag-of-words representation of their paper keywords.
- Parameters:
root (str) – Root directory where the dataset should be saved.
transform (callable, optional) – A function/transform that takes in a torch_geometric.data.HeteroData object and returns a transformed version. The data object will be transformed before every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.HeteroData object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
force_reload (bool, optional) – Whether to re-process the dataset. (default: False)
STATS:
| Node/Edge Type | #nodes/#edges | #features | #classes |
|---|---|---|---|
| Author | 4,057 | 334 | 4 |
| Paper | 14,328 | 4,231 | |
| Term | 7,723 | 50 | |
| Conference | 20 | 0 | |
| Author-Paper | 196,425 | | |
| Paper-Term | 85,810 | | |
| Conference-Paper | 14,328 | | |
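As a sanity check after loading, the node counts in the table can be compared against the heterogeneous graph that the class returns. The helpers `load_dblp` and `node_count_mismatches` are hypothetical names for this sketch. The lower-case node type keys are assumed to match those used by the upstream PyG DBLP dataset.

```python
from typing import Any, Dict, Tuple

# Node counts copied from the STATS table above.
EXPECTED_DBLP_NODES: Dict[str, int] = {
    "author": 4057,
    "paper": 14328,
    "term": 7723,
    "conference": 20,
}


def load_dblp(root: str) -> Any:
    """Load the dataset and return its single heterogeneous graph."""
    # Deferred import: requires gigl (and PyG) only when actually loading.
    from gigl.src.mocking.lib.pyg_datasets_forks import DBLPFromGCS

    dataset = DBLPFromGCS(root=root)
    return dataset[0]  # a torch_geometric.data.HeteroData object


def node_count_mismatches(actual: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Map each node type whose count differs to (expected, actual)."""
    return {
        node_type: (expected, actual.get(node_type, 0))
        for node_type, expected in EXPECTED_DBLP_NODES.items()
        if actual.get(node_type) != expected
    }
```

Given `data = load_dblp("/tmp/dblp")`, the counts can be collected with `{t: data[t].num_nodes for t in data.node_types}` and passed to `node_count_mismatches`; an empty result means the download matches the table.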