Running the MAG240M experiments on your own GCP project#

1. (Optional) Pull MAG240M data into your own project#

GiGL assumes your data is available in BigQuery (BQ) tables. BQ is ubiquitous in large enterprises, as it provides a serverless, highly scalable, and cost-effective platform for storing and analyzing massive datasets.

We provide a script, fetch_data.ipynb, which you can use to load the MAG240M data into BQ tables in your own project. Alternatively, you can skip this step altogether, since we host a copy of this dataset in BQ that can be used right away.
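fetch_data.ipynb handles the actual download and load; as a rough illustration of the layout you end up with, the MAG240M heterogeneous graph maps onto a handful of BQ tables, one per node/edge type. The project, dataset, and table names below are placeholders, not the names the notebook actually produces:

```python
# Illustrative only: the real dataset/table names are defined by fetch_data.ipynb.
PROJECT = "my-gcp-project"  # placeholder for your GCP project ID
DATASET = "mag240m"         # placeholder dataset name

# MAG240M has paper nodes plus citation, authorship, and affiliation edges.
TABLES = [
    "paper",
    "paper_cites_paper",
    "author_writes_paper",
    "author_affiliated_with_institution",
]

def table_id(table: str, project: str = PROJECT, dataset: str = DATASET) -> str:
    """Build a fully-qualified BigQuery table ID of the form project.dataset.table."""
    return f"{project}.{dataset}.{table}"

for t in TABLES:
    print(table_id(t))
```

Fully-qualified IDs in this `project.dataset.table` form are what you would later reference from the preprocessor config.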

2. Run e2e pipeline#

Prerequisite: Ensure you have access to your own GCP project and have a service account set up. You should also have the gcloud CLI set up locally, or run the notebook from a GCP VM. Some basic knowledge of GCP may be necessary here.
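A minimal local gcloud setup might look like the following; `YOUR_PROJECT_ID` is a placeholder for your own project ID:

```shell
# Authenticate your user account and application-default credentials.
gcloud auth login
gcloud auth application-default login

# Point gcloud at the project you will run the pipeline in.
gcloud config set project YOUR_PROJECT_ID

# Sanity-check that the service account you plan to use exists.
gcloud iam service-accounts list --project YOUR_PROJECT_ID
```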

Note: If you followed step 1, you may need to subsequently modify the paths in examples.MAG240M.preprocessor_config.Mag240DataPreprocessorConfig.
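The change involved is repointing the fully-qualified BQ table IDs from the hosted copy to your own project. A minimal sketch of that kind of edit, with hypothetical project and table names (the actual field names live in Mag240DataPreprocessorConfig):

```python
# Hypothetical illustration: swap the default hosted tables for your own copies.
HOSTED_PROJECT = "hosted-gigl-project"  # placeholder for the hosted-copy project
MY_PROJECT = "my-gcp-project"           # placeholder for your project ID

def repoint(table_id: str, new_project: str) -> str:
    """Replace the project component of a `project.dataset.table` BQ table ID."""
    _, dataset, table = table_id.split(".")
    return f"{new_project}.{dataset}.{table}"

print(repoint("hosted-gigl-project.mag240m.paper_cites_paper", MY_PROJECT))
```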

Follow along with mag240m.ipynb to run an end-to-end (e2e) GiGL pipeline on the MAG240M dataset. It will guide you through running each component: config_populator -> data_preprocessor -> subgraph_sampler -> split_generator -> trainer -> inferencer
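The ordering above matters, since each component consumes the previous component's output. A schematic of that sequence (component names are from the notebook; the runner and the one-line role descriptions are illustrative, paraphrased from the component names):

```python
# The six GiGL components, in dependency order (names from mag240m.ipynb).
PIPELINE = [
    "config_populator",   # populates the task config used by later stages
    "data_preprocessor",  # preprocesses the BQ tables into model-ready data
    "subgraph_sampler",   # samples subgraphs around nodes
    "split_generator",    # generates train/val/test splits
    "trainer",            # trains the model
    "inferencer",         # runs inference with the trained model
]

def run_pipeline(run_component):
    """Run each component in order; `run_component` is a hypothetical callable."""
    for name in PIPELINE:
        run_component(name)

ran = []
run_pipeline(ran.append)
print(ran)
```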