Leverage Google Cloud's powerful ecosystem for end-to-end data science and machine learning workflows.
- 📚 Table of Contents
- ✨ Overview / Introduction
- 🚀 Feature Highlights
- 🏗️ Architecture & Core Components
- 🚦 Getting Started
- 🛠️ Usage & Workflows
- ⚠️ Limitations, Known Issues & Future Roadmap
- 🤝 Contributing & Development Guidelines
- 📄 License, Credits & Contact
- 🗃️ Appendix
Welcome to the `data-science-on-gcp` repository! This project serves as the official companion code for the book "Data Science on the Google Cloud Platform, 2nd Edition" by Valliappa Lakshmanan. It provides a comprehensive collection of practical examples, scripts, and Jupyter notebooks to guide you through building end-to-end data science and machine learning solutions using Google Cloud Platform (GCP).
Purpose & Goals: The primary goal of this repository is to bridge the gap between theoretical data science concepts and their practical implementation on a robust, scalable cloud platform. It aims to empower data scientists, machine learning engineers, and developers to effectively leverage GCP services for various stages of the data science lifecycle, from data ingestion and processing to model training, deployment, and real-time predictions.
Why it Matters / Problem it Solves: Building production-ready data science pipelines often requires navigating complex cloud infrastructure and services. This repository provides a structured, hands-on approach, demonstrating best practices and concrete code examples to:
- Simplify the adoption of GCP for data science workloads.
- Showcase scalable solutions for large datasets.
- Enable efficient experimentation and deployment of machine learning models.
- Provide a reusable codebase for common data science tasks on GCP.
Target Audience: This project is ideal for:
- Data Scientists & ML Engineers: Looking to operationalize their models and scale their workflows on GCP.
- Software Developers: Interested in integrating data science capabilities into their applications using cloud services.
- Students & Learners: Seeking practical, hands-on experience with real-world data science problems on a leading cloud platform.
- Anyone: Who wants to understand the end-to-end journey of data from raw input to actionable insights and predictions using Google Cloud.
This repository covers a wide array of data science and machine learning capabilities on GCP, organized into modular chapters.
- 🔍 Data Ingestion & Storage (`02_ingest`)
  - ✅ Robust scripts for downloading and uploading large datasets to Google Cloud Storage (GCS).
  - 💡 Automated monthly updates via Cloud Run and Docker for continuous data refresh.
  - 🚀 Efficient loading of data into BigQuery for analytics.
- 📊 SQL-based Data Exploration & Preparation (`03_sqlstudio`)
  - ✅ Examples of complex SQL queries for data cleansing, transformation, and feature engineering within BigQuery.
  - 💡 Creation of views for simplifying data access and ensuring data consistency.
  - 🚀 Contingency scripts for managing data schema evolution.
- ⚡ Real-time Streaming & Processing (`04_streaming`)
  - ✅ Simulating real-time data streams using Pub/Sub.
  - 💡 Data transformation and aggregation with Apache Beam on Dataflow.
  - 🚀 Real-time analytics and dashboarding patterns.
- 🧪 Interactive Data Exploration (`05_bqnotebook`)
  - ✅ Jupyter notebooks for interactive data exploration directly on BigQuery datasets.
  - 💡 Scripts for generating training data subsets (e.g., `trainday`).
- ⚙️ Big Data Processing with Dataproc (`06_dataproc`)
  - ✅ Creating and managing Apache Spark/Hadoop clusters on Dataproc.
  - 💡 Running Spark jobs for large-scale data processing and machine learning (e.g., Naive Bayes).
  - 🚀 Serverless Spark execution using Dataproc Serverless.
- 🧠 Scalable Machine Learning with Spark MLlib (`07_sparkml`)
  - ✅ Training machine learning models (e.g., Logistic Regression) using Spark MLlib on Dataproc.
  - 💡 Notebooks for visualizing model performance and experimentation.
- 🌟 Serverless ML with BigQuery ML (`08_bqml`)
  - ✅ Training various ML models (logistic regression, nonlinear, time-series) directly within BigQuery using SQL.
  - 💡 Demonstrations of feature engineering and evaluation with BigQuery ML.
- 📈 Model Deployment with Vertex AI (`09_vertexai`)
  - ✅ End-to-end workflow for training and deploying TensorFlow 2.x models to Vertex AI Endpoints.
  - 💡 Scripts for calling deployed models for predictions.
- 🔄 MLOps Principles & Pipelines (`10_mlops`)
  - ✅ Implementing MLOps best practices, including continuous training and deployment on Vertex AI.
  - 💡 Custom model training with Vertex AI Training.
- 📡 Real-time Prediction Systems (`11_realtime`)
  - ✅ Building real-time prediction services using deployed models.
  - 💡 Data transformation pipelines for preparing input data for predictions.
- 📦 Full Dataset Operations (`12_fulldataset`)
  - ✅ Guidance for working with complete, larger datasets as opposed to samples.
This project demonstrates a typical data science architecture on GCP, focusing on modularity, scalability, and leveraging managed services.
The core architecture follows a common pattern: data ingestion, processing, analysis, model training, and deployment.
```mermaid
graph TD
    subgraph DS["Data Sources"]
        A["Raw Data (CSV, APIs)"]
    end
    subgraph IS["Ingestion & Storage"]
        B[Cloud Storage] --> C[BigQuery]
        A --> B
        A -- Real-time --> D[Pub/Sub]
    end
    subgraph DP["Data Processing & Transformation"]
        D --> E["Dataflow (Apache Beam)"]
        E --> C
        C --> F["Dataproc (Apache Spark)"]
        F --> C
    end
    subgraph MT["Model Training & Development"]
        C -- SQL --> H[BigQuery ML]
        C -- Python/Notebooks --> G[Vertex AI Workbench]
        F -- Spark MLlib --> G
        G -- TensorFlow/Scikit-learn --> I[Vertex AI Training]
    end
    subgraph MD["Model Deployment & Prediction"]
        H --> J[Vertex AI Endpoints]
        I --> J
        J -- Batch/Online --> K[Applications/Services]
    end
    subgraph MO["Monitoring & Orchestration"]
        L[Cloud Monitoring]
        M[Cloud Scheduler/Cloud Functions/Cloud Run]
        M -- Automate Ingestion/Training --> A & I
    end
    K -- Feedback Loop --> A
```
Component Responsibilities:
- Cloud Storage (GCS): Acts as the primary data lake for raw and semi-processed data. Used for staging files before ingestion into BigQuery or for large file storage for Dataproc.
- Pub/Sub: A real-time messaging service used for ingesting streaming data, enabling decoupled and scalable event-driven architectures (a minimal publish sketch follows this list).
- BigQuery: A fully-managed, serverless data warehouse. Central for analytical queries, data exploration, feature engineering, and direct machine learning (BigQuery ML).
- Dataflow (Apache Beam): A fully-managed service for executing Apache Beam pipelines, enabling scalable batch and streaming data processing and transformation.
- Dataproc (Apache Spark/Hadoop): A managed service for running Apache Spark, Hadoop, Flink, and other open-source data tools. Ideal for large-scale data processing, complex transformations, and Spark MLlib training.
- Vertex AI (Workbench, Training, Endpoints): Google Cloud's unified ML platform.
- Vertex AI Workbench: Managed Jupyter notebooks for interactive development and experimentation.
- Vertex AI Training: For custom model training, leveraging managed infrastructure.
- Vertex AI Endpoints: For deploying trained models as scalable, low-latency prediction services.
- Cloud Run / Cloud Functions: Serverless compute platforms used for scheduled tasks, event-driven triggers, and deploying small microservices (e.g., monthly data updates).
- Cloud Monitoring: For observing the performance, health, and availability of GCP services and applications.
- Cloud Scheduler: For defining and executing cron jobs, often used to trigger Cloud Run or Cloud Functions.
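As a concrete illustration of the Pub/Sub ingestion path described above, here is a minimal, hedged sketch of publishing one event with the Python client. The topic name and payload fields are illustrative assumptions, not the repository's actual schema.

```python
# Publish a single JSON event to a Pub/Sub topic (illustrative topic/payload).
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("YOUR_PROJECT_ID", "flights_realtime")

event = {"airport": "JFK", "dep_delay": 12.0}  # hypothetical event payload
future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
print("Published message id:", future.result())  # blocks until the server acknowledges
```

Downstream consumers (a Dataflow pipeline, for example) can then read these events without the publisher knowing anything about them.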
- Programming Languages: Python (primary), SQL, Shell Scripting (Bash)
- Core Libraries:
- Data Manipulation: Pandas, NumPy
- Machine Learning: TensorFlow (v2.x), Scikit-learn, Spark MLlib
- Big Data: Apache Beam, Apache Spark
- GCP Interaction: Google Cloud Client Libraries (a minimal query sketch follows this list)
- Data Storage & Warehousing: Google Cloud Storage, BigQuery
- Compute: Cloud Run, Dataflow, Dataproc, Vertex AI (managed services)
- Orchestration: Cloud Scheduler, `gcloud` CLI
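To give a flavour of how these pieces are combined from Python, here is a minimal sketch that queries BigQuery with the Google Cloud client library and pulls the result into a Pandas DataFrame. The table and column names are illustrative assumptions; substitute the tables you create in the ingest module.

```python
# Query BigQuery from Python and load the result into a Pandas DataFrame.
from google.cloud import bigquery

client = bigquery.Client(project="YOUR_PROJECT_ID")

query = """
    SELECT origin, AVG(dep_delay) AS avg_dep_delay
    FROM `YOUR_PROJECT_ID.flights.tzcorr`   -- hypothetical table from the ingest steps
    GROUP BY origin
    ORDER BY avg_dep_delay DESC
    LIMIT 10
"""

df = client.query(query).to_dataframe()  # requires the pandas extras for the BigQuery client
print(df.head())
```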
Follow these steps to set up your environment and start exploring the code.
Before you begin, ensure you have the following installed and configured:
- Google Cloud Platform Account: You'll need an active GCP account.
- `gcloud` CLI: Google Cloud SDK installed and authenticated.
  - Install the Google Cloud SDK.
  - Authenticate: `gcloud auth login`
  - Set up application default credentials: `gcloud auth application-default login` (a quick Python check follows this list)
- Python 3.8+: Recommended version.
- Docker: Required for running Dockerized services (e.g., `02_ingest/monthlyupdate`).
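If you want to confirm from Python that application default credentials are in place before running any notebooks, a minimal check looks like this (assuming the `google-auth` package, which the Google Cloud client libraries depend on, is installed):

```python
# Minimal ADC sanity check: google.auth.default() finds the credentials created by
# `gcloud auth application-default login` and the default project, if one is set.
import google.auth

credentials, project_id = google.auth.default()
print("Credentials loaded; default project:", project_id)  # project_id may be None
```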
Start by cloning the project repository to your local machine.
```bash
git clone https://github.com/GDSC-FSC/data-science-on-gcp.git
cd data-science-on-gcp
```
All operations require a GCP project.
If you don't have a project, create one:
- Go to the Google Cloud Console.
- In the project selector, click "New Project".
- Enter a Project name (e.g., `ds-on-gcp-project`).
- Note down your Project ID (e.g., `ds-on-gcp-project-12345`). This will be used frequently.
- Click "Create".
Set your default project and region for the `gcloud` CLI. Replace `YOUR_PROJECT_ID` with your actual GCP Project ID and `YOUR_GCP_REGION` with a suitable region (e.g., `us-central1`).

```bash
gcloud config set project YOUR_PROJECT_ID
gcloud config set compute/region YOUR_GCP_REGION
gcloud config set compute/zone YOUR_GCP_REGION-a  # Or a specific zone within your region
```
Enable Required APIs: Many services require explicit API enablement. You can enable them all at once:
```bash
gcloud services enable \
  compute.googleapis.com \
  storage.googleapis.com \
  bigquery.googleapis.com \
  pubsub.googleapis.com \
  dataflow.googleapis.com \
  dataproc.googleapis.com \
  aiplatform.googleapis.com \
  cloudbuild.googleapis.com \
  cloudfunctions.googleapis.com \
  run.googleapis.com \
  cloudscheduler.googleapis.com \
  cloudresourcemanager.googleapis.com
```
For a fully pre-configured environment, click the "Open in Cloud Shell" button at the top of this README. This will clone the repository and open it in a browser-based terminal with the Google Cloud SDK already installed and authenticated.
Each directory in this repository corresponds to a chapter/module in the book and focuses on a specific aspect of data science on GCP. Below are examples for key modules.
This module covers downloading raw data, uploading it to Google Cloud Storage, and loading it into BigQuery. It also includes setting up automated monthly updates.
- Navigate to the module directory: `cd 02_ingest`
- Download and Upload Raw Data: The `ingest.sh` script orchestrates downloading flight data, uploading it to GCS, and creating a raw table in BigQuery. The script:
  - Calls `download.sh` to download the flight data from a public source.
  - Calls `upload.sh` to upload the downloaded data to a GCS bucket you specify.
  - Loads the data from GCS into a BigQuery table.

  You might need to adjust bucket names and dataset IDs in the scripts. Ensure your `PROJECT_ID` and `BUCKET` environment variables are set. (A hedged Python sketch of the BigQuery load step appears after these steps.)
- Explore Data in BigQuery: Once loaded, you can query the data in the BigQuery console or with the `bq` command-line tool:

  ```bash
  bq query --use_legacy_sql=false "SELECT * FROM flights.rawflights LIMIT 10"
  ```
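If you prefer driving the BigQuery load step from Python rather than the shell scripts, a hedged sketch with the client library looks like the following. The bucket path and table name are illustrative assumptions; use the bucket and dataset created by `ingest.sh`.

```python
# Load a CSV file from GCS into a BigQuery table with the Python client.
from google.cloud import bigquery

client = bigquery.Client(project="YOUR_PROJECT_ID")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,                          # skip the header row
    autodetect=True,                              # or provide an explicit schema
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://YOUR_BUCKET/flights/raw/201501.csv",    # hypothetical GCS path
    "YOUR_PROJECT_ID.flights.rawflights",
    job_config=job_config,
)
load_job.result()                                 # wait for the load job to finish

table = client.get_table("YOUR_PROJECT_ID.flights.rawflights")
print(f"Loaded {table.num_rows} rows into flights.rawflights")
```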
This module demonstrates building a real-time analytics pipeline using Pub/Sub, Dataflow, and BigQuery.
- Navigate to the module directory: `cd 04_streaming`
- Set up a Pub/Sub Topic and BigQuery Table: Create a Pub/Sub topic and a BigQuery table to store the streaming results.

  ```bash
  # Create Pub/Sub topic
  gcloud pubsub topics create flights_realtime

  # Create BigQuery dataset and table (schema defined in design/airport_schema.json)
  bq mk --dataset YOUR_PROJECT_ID:flights
  bq mk --table YOUR_PROJECT_ID:flights.streaming_airports design/airport_schema.json
  ```
- Run a Dataflow Job: The `df01.py` script transforms streaming airport data. Execute it as a Dataflow job. The script uses Apache Beam to read messages from a Pub/Sub topic, apply transformations, and write the results to a BigQuery table. It typically involves:
  - Reading from the `flights_realtime` Pub/Sub topic.
  - Parsing JSON messages.
  - Applying windowing and aggregations (e.g., calculating averages).
  - Writing to a BigQuery table.

  (A simplified Beam sketch in the same spirit appears after these steps.)
- Simulate Real-time Data: In a separate terminal, run the simulator to publish data to the Pub/Sub topic.

  ```bash
  cd ../simulate
  python simulate.py --project YOUR_PROJECT_ID --topic flights_realtime --speedup 600
  ```

  You will see data flowing into your BigQuery table.
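For orientation, here is a heavily simplified streaming pipeline in the same spirit as `df01.py`. It is a sketch under assumed message fields (`airport`, `dep_delay`) and an assumed output table; the actual script in `04_streaming` differs in detail.

```python
# Simplified streaming pipeline: Pub/Sub -> parse JSON -> window -> average -> BigQuery.
# Field names and the output table are assumptions for illustration only.
import argparse
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window
from apache_beam.transforms.combiners import Mean


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--project", required=True)
    parser.add_argument("--topic", default="flights_realtime")
    parser.add_argument("--output_table", default="flights.streaming_delays")  # hypothetical table
    args, beam_args = parser.parse_known_args(argv)

    options = PipelineOptions(beam_args, streaming=True, project=args.project)

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromPubSub(
                topic=f"projects/{args.project}/topics/{args.topic}")
            | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "Window" >> beam.WindowInto(window.FixedWindows(60))   # 1-minute windows
            | "KeyByAirport" >> beam.Map(lambda e: (e["airport"], e["dep_delay"]))
            | "MeanDelay" >> Mean.PerKey()
            | "ToRow" >> beam.Map(lambda kv: {"airport": kv[0], "avg_dep_delay": kv[1]})
            | "Write" >> beam.io.WriteToBigQuery(
                args.output_table,
                schema="airport:STRING,avg_dep_delay:FLOAT",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )


if __name__ == "__main__":
    run()
```

To run it on Dataflow rather than locally, append the usual Beam flags (`--runner=DataflowRunner --region=YOUR_GCP_REGION --temp_location=gs://YOUR_BUCKET/tmp`) after the script's own arguments.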
This module showcases how to build and evaluate machine learning models directly within BigQuery using SQL.
- Navigate to the module directory: `cd 08_bqml`
- Open the Jupyter Notebook: Launch a Jupyter Notebook server (e.g., via Vertex AI Workbench or locally) and open `bqml_logistic.ipynb`. If using local Jupyter, start it with `jupyter notebook`.
- Execute Notebook Cells: Follow the steps in `bqml_logistic.ipynb` to:
  - Create a BigQuery dataset if not already present.
  - Load sample data into a BigQuery table (or use data from `02_ingest`).
  - Define and train a logistic regression model using `CREATE MODEL` SQL statements.
  - Evaluate the model's performance using `ML.EVALUATE` statements.
  - Make predictions using `ML.PREDICT` statements.

  (A hedged Python sketch of this flow appears after these steps.)
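If you would rather drive the same flow from a plain Python script than the notebook, a hedged sketch using the BigQuery client is shown below. The table and column names are assumptions; the notebook's actual queries differ.

```python
# Train, evaluate, and predict with BigQuery ML from Python (illustrative schema).
from google.cloud import bigquery

client = bigquery.Client(project="YOUR_PROJECT_ID")

# Train a logistic regression model inside BigQuery.
client.query("""
    CREATE OR REPLACE MODEL `flights.delay_model`
    OPTIONS(model_type='logistic_reg', input_label_cols=['on_time']) AS
    SELECT
      dep_delay,
      taxi_out,
      distance,
      IF(arr_delay < 15, 'ontime', 'late') AS on_time
    FROM `flights.tzcorr`            -- hypothetical training table
    WHERE arr_delay IS NOT NULL
""").result()

# Evaluate the trained model.
for row in client.query("SELECT * FROM ML.EVALUATE(MODEL `flights.delay_model`)").result():
    print(dict(row))

# Make a prediction for a single hand-written example.
prediction_sql = """
    SELECT * FROM ML.PREDICT(MODEL `flights.delay_model`,
      (SELECT 10.0 AS dep_delay, 15.0 AS taxi_out, 800.0 AS distance))
"""
for row in client.query(prediction_sql).result():
    print(dict(row))
```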
This module focuses on taking a trained TensorFlow model and deploying it to Vertex AI for online predictions.
- Navigate to the module directory: `cd 09_vertexai`
- Open the Jupyter Notebook: Open `flights_model_tf2.ipynb` in your Jupyter environment.
- Train and Export Model: Execute the notebook cells to:
  - Preprocess data.
  - Train a TensorFlow 2.x model for flight delay prediction.
  - Export the trained model in the TensorFlow SavedModel format.
- Deploy to Vertex AI: The notebook will guide you through uploading the model to Vertex AI Model Registry and deploying it to an Endpoint. Alternatively, after the model is exported, you can use the `gcloud` CLI:

  ```bash
  # First, upload the model (--artifact-uri points to the GCS path of your exported SavedModel)
  MODEL_DISPLAY_NAME="flights-delay-model"
  MODEL_ID=$(gcloud ai models upload \
    --project=YOUR_PROJECT_ID \
    --region=YOUR_GCP_REGION \
    --display-name=${MODEL_DISPLAY_NAME} \
    --container-image-uri="us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-8:latest" \
    --artifact-uri="gs://YOUR_PROJECT_ID-bucket/flights_model_dir" \
    --format="value(model_id)")

  # Then, create an endpoint
  ENDPOINT_DISPLAY_NAME="flights-delay-endpoint"
  ENDPOINT_ID=$(gcloud ai endpoints create \
    --project=YOUR_PROJECT_ID \
    --region=YOUR_GCP_REGION \
    --display-name=${ENDPOINT_DISPLAY_NAME} \
    --format="value(endpoint_id)")

  # Finally, deploy the model to the endpoint
  gcloud ai endpoints deploy-model ${ENDPOINT_ID} \
    --project=YOUR_PROJECT_ID \
    --region=YOUR_GCP_REGION \
    --model=${MODEL_ID} \
    --display-name=${MODEL_DISPLAY_NAME}-deployed \
    --machine-type=n1-standard-2 \
    --min-replica-count=1 \
    --max-replica-count=1
  ```
- Make Predictions: Use `call_predict.sh` (or `call_predict.py` from `10_mlops`) to send prediction requests to the deployed endpoint:

  ```bash
  ./call_predict.sh
  ```

  (A hedged Python alternative appears after these steps.)
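As referenced above, a hedged Python alternative to `call_predict.sh` uses the Vertex AI SDK. The endpoint ID and the feature names in the instance are assumptions; they must match your deployment and the model's serving signature.

```python
# Call a deployed Vertex AI endpoint for online prediction (illustrative instance).
from google.cloud import aiplatform

aiplatform.init(project="YOUR_PROJECT_ID", location="YOUR_GCP_REGION")

endpoint = aiplatform.Endpoint("ENDPOINT_ID")  # numeric ID from `gcloud ai endpoints list`

instance = {              # keys must match the exported model's serving signature
    "dep_delay": 14.0,
    "taxi_out": 13.0,
    "distance": 319.0,
}

response = endpoint.predict(instances=[instance])
print(response.predictions)
```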
- Dataset Size: While designed for scalability, certain examples might use smaller datasets or samples for quick execution. Adapting to extremely large, multi-TB datasets may require further optimization.
- Regional Restrictions: Some GCP services or features might not be available in all regions. Ensure your chosen `YOUR_GCP_REGION` supports all services used in the modules you run.
- Cost Management: Running all examples, especially those involving Dataproc or continuous Dataflow jobs, can incur significant GCP costs. Monitor your billing regularly.
- Dependency Conflicts: Python library versions can sometimes cause conflicts. It's highly recommended to use `venv` or `conda` for isolated environments.
- GCP API Enablement: Forgetting to enable a required GCP API is a common issue. If you encounter permissions errors, double-check that all relevant APIs (as listed in Getting Started) are enabled.
- Qwiklabs Verification: The code on Qwiklabs is continually tested and kept up-to-date. If you encounter issues, we recommend trying the corresponding Qwiklab lab to verify if any setup steps were missed. Please leave feedback in Qwiklabs or file an issue in this repository if problems persist.
We are continuously working to improve and expand this repository. Planned enhancements include:
- Expanded Vertex AI Coverage: More examples for Vertex AI Pipelines, Feature Store, and Monitoring.
- Cost Optimization Guides: Adding specific tips and configurations to manage GCP costs effectively across different modules.
- Terraform/Pulumi Integration: Providing Infrastructure as Code (IaC) examples for setting up GCP resources.
- Newer GCP Services: Incorporating examples for new or updated GCP data and ML services as they become available.
- Generalize for Other Datasets: Providing guidelines or modified scripts to easily adapt the workflows to different datasets.
We welcome contributions from the community to improve this repository! Please follow these guidelines to ensure a smooth collaboration process.
- Fork the Repository: Start by forking the `GDSC-FSC/data-science-on-gcp` repository to your GitHub account.
- Create a Feature Branch: Create a new branch for your changes (e.g., `feature/my-new-example` or `fix/bug-description`): `git checkout -b feature/your-feature-name`
- Implement Your Changes: Make your modifications, additions, or bug fixes.
- Test Your Changes: Ensure your code works as expected and doesn't introduce regressions.
- Commit Your Changes: Write clear and concise commit messages, e.g., `git commit -m "feat: Add new example for X service"`
- Push to Your Fork: `git push origin feature/your-feature-name`
- Open a Pull Request (PR):
- Go to the original repository on GitHub.
- You should see a prompt to create a new Pull Request from your recently pushed branch.
- Provide a descriptive title and detailed explanation of your changes in the PR description.
- Reference any related issues.
- Branch Naming: Use descriptive names for your branches (e.g., `feature/new-bqml-example`, `fix/streaming-bug`).
- Small, Focused PRs: Keep your PRs small and focused on a single change or feature to facilitate easier review.
- Clear Descriptions: Explain the "what" and "why" of your changes in the PR description. Include screenshots or terminal outputs if helpful.
- Review Process: Your PR will be reviewed by maintainers. Be responsive to feedback and be prepared to make iterative changes.
- Python: Adhere to PEP 8 style guidelines. Use `black` for formatting and `flake8` for linting.
- Shell Scripts: Ensure scripts are clear, well-commented, and robust.
- SQL: Format SQL queries for readability, using consistent capitalization for keywords.
- Jupyter Notebooks: Keep notebooks clean, with clear explanations for each step and cell output.
- Local Testing: Before submitting a PR, thoroughly test your changes in your local environment and, if applicable, on a GCP project.
- Qwiklabs Verification: If your changes relate to a specific module covered by Qwiklabs, try to verify compatibility there.
This project is licensed under the Apache License, Version 2.0. See the LICENSE file for full details.
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
- Author: Valliappa Lakshmanan (for the original book and much of the codebase).
- Book: "Data Science on the Google Cloud Platform, 2nd Edition" (O'Reilly, Apr 2022).
- Google Cloud Platform: For providing the comprehensive suite of services that make these solutions possible.
- Contributors: To all individuals who have contributed to the evolution and improvement of this codebase.
If you have questions, suggestions, or encounter issues, please:
- File an Issue: Use the GitHub Issues page for bug reports, feature requests, or general questions.
- Qwiklabs Feedback: If you are following the Qwiklabs labs, please use their feedback mechanisms for specific lab-related queries.
- April 2022: Initial release of the 2nd Edition companion code.
- November 2019: Major update to support TensorFlow 2.0, Cloud Functions, and BigQuery ML.
- Earlier Versions: Continuous updates and improvements aligned with the evolution of Google Cloud services and the book's content.
Q: How can I manage costs when running these examples?

A: Many GCP services incur costs. To manage them:
1. **Monitor Billing:** Regularly check your GCP billing dashboard.
2. **Delete Resources:** After completing a module, delete resources like Dataproc clusters (`gcloud dataproc clusters delete`), Dataflow jobs (cancel from the Dataflow UI), and Vertex AI Endpoints (`gcloud ai endpoints delete`). A hedged Python cleanup sketch appears at the end of this appendix.
3. **Use Free Tier:** Leverage GCP's free tier for services where applicable.
4. **Set Quotas:** Consider setting custom quotas for resource usage to prevent unexpectedly high costs.

Q: My script/notebook failed with a `permission denied` error. What should I do?

A: This usually indicates that the service account or user running the operation lacks the necessary IAM permissions.
1. **Check IAM Roles:** Ensure your user account or the service account assigned to your VM/Cloud Run instance has the appropriate roles (e.g., `BigQuery Data Editor`, `Storage Object Admin`, `Dataflow Developer`).
2. **Enable APIs:** Verify that all required APIs are enabled for your project (see [Getting Started](#2-google-cloud-project-setup)).
3. **`gcloud auth`:** Ensure your `gcloud` CLI is correctly authenticated (`gcloud auth login`) and has application default credentials (`gcloud auth application-default login`).

Q: Can I use a different dataset with these examples?

A: Yes, many examples are adaptable.
1. **Schema Alignment:** The most critical step is to adapt the data ingestion and transformation scripts (`02_ingest`, `04_streaming`) to your dataset's schema.
2. **Feature Engineering:** Update SQL queries (`03_sqlstudio`, `08_bqml`) and Python notebooks (`05_bqnotebook`, `09_vertexai`) to use your dataset's features.
3. **Model Training:** Retrain models with your data, adjusting model architectures if necessary.

- Check `gcloud` Configuration:
  - Verify your project ID and region: `gcloud config list`
  - Ensure you are authenticated: `gcloud auth list` and `gcloud auth application-default print-access-token`
- API Enablement: Review the list of APIs to enable in Getting Started and run the `gcloud services enable` command again to be safe.
- Resource Existence: Ensure that GCS buckets, Pub/Sub topics, and BigQuery datasets/tables you're trying to use actually exist and are correctly named.
- Logs: Always check the logs in the Google Cloud Console for specific error messages.
- For Dataflow jobs: Dataflow UI -> Job Details -> Logs
- For Cloud Run/Functions: Cloud Logging
- For Vertex AI: Vertex AI Console -> Model/Endpoint -> Logs
- Environment Variables: Confirm that all necessary environment variables (e.g., `PROJECT_ID`, `BUCKET`) are set correctly before running scripts:

  ```bash
  echo $PROJECT_ID
  echo $BUCKET
  ```
- Python Virtual Environments: Use `venv` or `conda` to avoid dependency conflicts:

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  pip install -r requirements.txt  # if a requirements.txt is provided in a module
  ```
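Finally, as mentioned in the cost-management FAQ above, remember to tear down resources when you finish a module. Below is a hedged cleanup sketch for Vertex AI resources; note that it deletes every endpoint and model in the configured project and region, so use it only on a throwaway project.

```python
# Delete all Vertex AI endpoints and models in the project/region -- destructive!
from google.cloud import aiplatform

aiplatform.init(project="YOUR_PROJECT_ID", location="YOUR_GCP_REGION")

for endpoint in aiplatform.Endpoint.list():
    print("Deleting endpoint:", endpoint.display_name)
    endpoint.undeploy_all()   # models must be undeployed before the endpoint can be deleted
    endpoint.delete()

for model in aiplatform.Model.list():
    print("Deleting model:", model.display_name)
    model.delete()
```

Dataproc clusters and Dataflow jobs are easiest to remove with the `gcloud` commands quoted in the FAQ above.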