Leverage Google Cloud's powerful ecosystem for end-to-end data science and machine learning workflows.
- 📚 Table of Contents
- ✨ Overview / Introduction
- 🚀 Feature Highlights
- 🏗️ Architecture & Core Components
- 🚦 Getting Started
- 🛠️ Usage & Workflows
- ⚠️ Limitations, Known Issues & Future Roadmap
- 🤝 Contributing & Development Guidelines
- 📄 License, Credits & Contact
- 🗃️ Appendix
Welcome to the `data-science-on-gcp` repository! This project serves as the official companion code for the book "Data Science on the Google Cloud Platform, 2nd Edition" by Valliappa Lakshmanan. It provides a comprehensive collection of practical examples, scripts, and Jupyter notebooks to guide you through building end-to-end data science and machine learning solutions using Google Cloud Platform (GCP).
Purpose & Goals: The primary goal of this repository is to bridge the gap between theoretical data science concepts and their practical implementation on a robust, scalable cloud platform. It aims to empower data scientists, machine learning engineers, and developers to effectively leverage GCP services for various stages of the data science lifecycle, from data ingestion and processing to model training, deployment, and real-time predictions.
Why it Matters / Problem it Solves: Building production-ready data science pipelines often requires navigating complex cloud infrastructure and services. This repository provides a structured, hands-on approach, demonstrating best practices and concrete code examples to:
- Simplify the adoption of GCP for data science workloads.
- Showcase scalable solutions for large datasets.
- Enable efficient experimentation and deployment of machine learning models.
- Provide a reusable codebase for common data science tasks on GCP.
Target Audience: This project is ideal for:
- Data Scientists & ML Engineers: Looking to operationalize their models and scale their workflows on GCP.
- Software Developers: Interested in integrating data science capabilities into their applications using cloud services.
- Students & Learners: Seeking practical, hands-on experience with real-world data science problems on a leading cloud platform.
- Anyone: Who wants to understand the end-to-end journey of data from raw input to actionable insights and predictions using Google Cloud.
This repository covers a wide array of data science and machine learning capabilities on GCP, organized into modular chapters.
- 🔍 Data Ingestion & Storage (`02_ingest`)
  - ✅ Robust scripts for downloading and uploading large datasets to Google Cloud Storage (GCS).
  - 💡 Automated monthly updates via Cloud Run and Docker for continuous data refresh.
  - 🚀 Efficient loading of data into BigQuery for analytics.
- 📊 SQL-based Data Exploration & Preparation (`03_sqlstudio`)
  - ✅ Examples of complex SQL queries for data cleansing, transformation, and feature engineering within BigQuery.
  - 💡 Creation of views for simplifying data access and ensuring data consistency.
  - 🚀 Contingency scripts for managing data schema evolution.
- ⚡ Real-time Streaming & Processing (`04_streaming`)
  - ✅ Simulating real-time data streams using Pub/Sub.
  - 💡 Data transformation and aggregation with Apache Beam on Dataflow.
  - 🚀 Real-time analytics and dashboarding patterns.
- 🧪 Interactive Data Exploration (`05_bqnotebook`)
  - ✅ Jupyter notebooks for interactive data exploration directly on BigQuery datasets.
  - 💡 Scripts for generating training data subsets (e.g., `trainday`).
- ⚙️ Big Data Processing with Dataproc (`06_dataproc`)
  - ✅ Creating and managing Apache Spark/Hadoop clusters on Dataproc.
  - 💡 Running Spark jobs for large-scale data processing and machine learning (e.g., Naive Bayes).
  - 🚀 Serverless Spark execution using Dataproc Serverless.
- 🧠 Scalable Machine Learning with Spark MLlib (`07_sparkml`)
  - ✅ Training machine learning models (e.g., Logistic Regression) using Spark MLlib on Dataproc.
  - 💡 Notebooks for visualizing model performance and experimentation.
- 🌟 Serverless ML with BigQuery ML (`08_bqml`)
  - ✅ Training various ML models (logistic regression, nonlinear, time-series) directly within BigQuery using SQL.
  - 💡 Demonstrations of feature engineering and evaluation with BigQuery ML.
- 📈 Model Deployment with Vertex AI (`09_vertexai`)
  - ✅ End-to-end workflow for training and deploying TensorFlow 2.x models to Vertex AI Endpoints.
  - 💡 Scripts for calling deployed models for predictions.
- 🔄 MLOps Principles & Pipelines (`10_mlops`)
  - ✅ Implementing MLOps best practices, including continuous training and deployment on Vertex AI.
  - 💡 Custom model training with Vertex AI Training.
- 📡 Real-time Prediction Systems (`11_realtime`)
  - ✅ Building real-time prediction services using deployed models.
  - 💡 Data transformation pipelines for preparing input data for predictions.
- 📦 Full Dataset Operations (`12_fulldataset`)
  - ✅ Guidance for working with complete, larger datasets as opposed to samples.
This project demonstrates a typical data science architecture on GCP, focusing on modularity, scalability, and leveraging managed services.
The core architecture follows a common pattern: data ingestion, processing, analysis, model training, and deployment.
```mermaid
graph TD
    subgraph DS["Data Sources"]
        A["Raw Data (CSV, APIs)"]
    end
    subgraph IS["Ingestion & Storage"]
        B[Cloud Storage] --> C[BigQuery]
        A --> B
        A -- Real-time --> D[Pub/Sub]
    end
    subgraph DP["Data Processing & Transformation"]
        D --> E["Dataflow (Apache Beam)"]
        E --> C
        C --> F["Dataproc (Apache Spark)"]
        F --> C
    end
    subgraph MT["Model Training & Development"]
        C -- SQL --> H[BigQuery ML]
        C -- Python/Notebooks --> G[Vertex AI Workbench]
        F -- Spark MLlib --> G
        G -- TensorFlow/Scikit-learn --> I[Vertex AI Training]
    end
    subgraph MD["Model Deployment & Prediction"]
        H --> J[Vertex AI Endpoints]
        I --> J
        J -- Batch/Online --> K[Applications/Services]
    end
    subgraph MO["Monitoring & Orchestration"]
        L[Cloud Monitoring]
        M[Cloud Scheduler/Cloud Functions/Cloud Run]
        M -- Automate Ingestion/Training --> A & I
    end
    K -- Feedback Loop --> A
```
Component Responsibilities:
- Cloud Storage (GCS): Acts as the primary data lake for raw and semi-processed data. Used for staging files before ingestion into BigQuery or for large file storage for Dataproc.
- Pub/Sub: A real-time messaging service used for ingesting streaming data, enabling decoupled and scalable event-driven architectures (a minimal publish sketch follows this list).
- BigQuery: A fully-managed, serverless data warehouse. Central for analytical queries, data exploration, feature engineering, and direct machine learning (BigQuery ML).
- Dataflow (Apache Beam): A fully-managed service for executing Apache Beam pipelines, enabling scalable batch and streaming data processing and transformation.
- Dataproc (Apache Spark/Hadoop): A managed service for running Apache Spark, Hadoop, Flink, and other open-source data tools. Ideal for large-scale data processing, complex transformations, and Spark MLlib training.
- Vertex AI (Workbench, Training, Endpoints): Google Cloud's unified ML platform.
- Vertex AI Workbench: Managed Jupyter notebooks for interactive development and experimentation.
- Vertex AI Training: For custom model training, leveraging managed infrastructure.
- Vertex AI Endpoints: For deploying trained models as scalable, low-latency prediction services.
- Cloud Run / Cloud Functions: Serverless compute platforms used for scheduled tasks, event-driven triggers, and deploying small microservices (e.g., monthly data updates).
- Cloud Monitoring: For observing the performance, health, and availability of GCP services and applications.
- Cloud Scheduler: For defining and executing cron jobs, often used to trigger Cloud Run or Cloud Functions.
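As a concrete illustration of the Pub/Sub ingestion path described above, here is a minimal, hedged sketch of publishing one event with the Python client. The topic name and payload fields are illustrative assumptions, not the repository's actual schema.

```python
# Publish a single JSON event to a Pub/Sub topic (illustrative topic/payload).
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("YOUR_PROJECT_ID", "flights_realtime")

event = {"airport": "JFK", "dep_delay": 12.0}  # hypothetical event payload
future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
print("Published message id:", future.result())  # blocks until the server acknowledges
```

Downstream consumers (a Dataflow pipeline, for example) can then read these events without the publisher knowing anything about them.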
- Programming Languages: Python (primary), SQL, Shell Scripting (Bash)
- Core Libraries:
- Data Manipulation: Pandas, NumPy
- Machine Learning: TensorFlow (v2.x), Scikit-learn, Spark MLlib
- Big Data: Apache Beam, Apache Spark
- GCP Interaction: Google Cloud Client Libraries (a minimal query sketch follows this list)
- Data Storage & Warehousing: Google Cloud Storage, BigQuery
- Compute: Cloud Run, Dataflow, Dataproc, Vertex AI (managed services)
- Orchestration: Cloud Scheduler, `gcloud` CLI
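To give a flavour of how these pieces are combined from Python, here is a minimal sketch that queries BigQuery with the Google Cloud client library and pulls the result into a Pandas DataFrame. The table and column names are illustrative assumptions; substitute the tables you create in the ingest module.

```python
# Query BigQuery from Python and load the result into a Pandas DataFrame.
from google.cloud import bigquery

client = bigquery.Client(project="YOUR_PROJECT_ID")

query = """
    SELECT origin, AVG(dep_delay) AS avg_dep_delay
    FROM `YOUR_PROJECT_ID.flights.tzcorr`   -- hypothetical table from the ingest steps
    GROUP BY origin
    ORDER BY avg_dep_delay DESC
    LIMIT 10
"""

df = client.query(query).to_dataframe()  # requires the pandas extras for the BigQuery client
print(df.head())
```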
Follow these steps to set up your environment and start exploring the code.
Before you begin, ensure you have the following installed and configured:
- Google Cloud Platform Account: You'll need an active GCP account.
- `gcloud` CLI: Google Cloud SDK installed and authenticated.
  - Install the Google Cloud SDK.
  - Authenticate: `gcloud auth login`
  - Set up application default credentials: `gcloud auth application-default login` (a quick Python check follows this list)
- Python 3.8+: Recommended version.
- Docker: Required for running Dockerized services (e.g., `02_ingest/monthlyupdate`).
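If you want to confirm from Python that application default credentials are in place before running any notebooks, a minimal check looks like this (assuming the `google-auth` package, which the Google Cloud client libraries depend on, is installed):

```python
# Minimal ADC sanity check: google.auth.default() finds the credentials created by
# `gcloud auth application-default login` and the default project, if one is set.
import google.auth

credentials, project_id = google.auth.default()
print("Credentials loaded; default project:", project_id)  # project_id may be None
```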
Start by cloning the project repository to your local machine.
```bash
git clone https://github.com/GDSC-FSC/data-science-on-gcp.git
cd data-science-on-gcp
```
All operations require a GCP project.
If you don't have a project, create one:
- Go to the Google Cloud Console.
- In the project selector, click "New Project".
- Enter a Project name (e.g., `ds-on-gcp-project`).
- Note down your Project ID (e.g., `ds-on-gcp-project-12345`). This will be used frequently.
- Click "Create".
Set your default project and region for the `gcloud` CLI. Replace `YOUR_PROJECT_ID` with your actual GCP Project ID and `YOUR_GCP_REGION` with a suitable region (e.g., `us-central1`).

```bash
gcloud config set project YOUR_PROJECT_ID
gcloud config set compute/region YOUR_GCP_REGION
gcloud config set compute/zone YOUR_GCP_REGION-a  # Or a specific zone within your region
```
Enable Required APIs: Many services require explicit API enablement. You can enable them all at once:
```bash
gcloud services enable \
  compute.googleapis.com \
  storage.googleapis.com \
  bigquery.googleapis.com \
  pubsub.googleapis.com \
  dataflow.googleapis.com \
  dataproc.googleapis.com \
  aiplatform.googleapis.com \
  cloudbuild.googleapis.com \
  cloudfunctions.googleapis.com \
  run.googleapis.com \
  cloudscheduler.googleapis.com \
  cloudresourcemanager.googleapis.com
```
For a fully pre-configured environment, click the "Open in Cloud Shell" button at the top of this README. This will clone the repository and open it in a browser-based terminal with the Google Cloud SDK already installed and authenticated.
Each directory in this repository corresponds to a chapter/module in the book and focuses on a specific aspect of data science on GCP. Below are examples for key modules.
This module covers downloading raw data, uploading it to Google Cloud Storage, and loading it into BigQuery. It also includes setting up automated monthly updates.
- Navigate to the module directory: `cd 02_ingest`
- Download and Upload Raw Data: The `ingest.sh` script orchestrates downloading flight data, uploading it to GCS, and creating a raw table in BigQuery. The script:
  - Calls `download.sh` to download the flight data from a public source.
  - Calls `upload.sh` to upload the downloaded data to a GCS bucket you specify.
  - Loads the data from GCS into a BigQuery table.

  You might need to adjust bucket names and dataset IDs in the scripts. Ensure your `PROJECT_ID` and `BUCKET` environment variables are set. (A hedged Python sketch of the BigQuery load step appears after these steps.)
- Explore Data in BigQuery: Once loaded, you can query the data in the BigQuery console or with the `bq` command-line tool:

  ```bash
  bq query --use_legacy_sql=false "SELECT * FROM flights.rawflights LIMIT 10"
  ```
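If you prefer driving the BigQuery load step from Python rather than the shell scripts, a hedged sketch with the client library looks like the following. The bucket path and table name are illustrative assumptions; use the bucket and dataset created by `ingest.sh`.

```python
# Load a CSV file from GCS into a BigQuery table with the Python client.
from google.cloud import bigquery

client = bigquery.Client(project="YOUR_PROJECT_ID")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,                          # skip the header row
    autodetect=True,                              # or provide an explicit schema
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://YOUR_BUCKET/flights/raw/201501.csv",    # hypothetical GCS path
    "YOUR_PROJECT_ID.flights.rawflights",
    job_config=job_config,
)
load_job.result()                                 # wait for the load job to finish

table = client.get_table("YOUR_PROJECT_ID.flights.rawflights")
print(f"Loaded {table.num_rows} rows into flights.rawflights")
```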
This module demonstrates building a real-time analytics pipeline using Pub/Sub, Dataflow, and BigQuery.
- Navigate to the module directory: `cd 04_streaming`
- Set up a Pub/Sub Topic and BigQuery Table: Create a Pub/Sub topic and a BigQuery table to store the streaming results.

  ```bash
  # Create Pub/Sub topic
  gcloud pubsub topics create flights_realtime

  # Create BigQuery dataset and table (schema defined in design/airport_schema.json)
  bq mk --dataset YOUR_PROJECT_ID:flights
  bq mk --table YOUR_PROJECT_ID:flights.streaming_airports design/airport_schema.json
  ```
- Run a Dataflow Job: The `df01.py` script transforms streaming airport data. Execute it as a Dataflow job. The script uses Apache Beam to read messages from a Pub/Sub topic, apply transformations, and write the results to a BigQuery table. It typically involves:
  - Reading from the `flights_realtime` Pub/Sub topic.
  - Parsing JSON messages.
  - Applying windowing and aggregations (e.g., calculating averages).
  - Writing to a BigQuery table.

  (A simplified Beam sketch in the same spirit appears after these steps.)
- Simulate Real-time Data: In a separate terminal, run the simulator to publish data to the Pub/Sub topic.

  ```bash
  cd ../simulate
  python simulate.py --project YOUR_PROJECT_ID --topic flights_realtime --speedup 600
  ```

  You will see data flowing into your BigQuery table.
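For orientation, here is a heavily simplified streaming pipeline in the same spirit as `df01.py`. It is a sketch under assumed message fields (`airport`, `dep_delay`) and an assumed output table; the actual script in `04_streaming` differs in detail.

```python
# Simplified streaming pipeline: Pub/Sub -> parse JSON -> window -> average -> BigQuery.
# Field names and the output table are assumptions for illustration only.
import argparse
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window
from apache_beam.transforms.combiners import Mean


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--project", required=True)
    parser.add_argument("--topic", default="flights_realtime")
    parser.add_argument("--output_table", default="flights.streaming_delays")  # hypothetical table
    args, beam_args = parser.parse_known_args(argv)

    options = PipelineOptions(beam_args, streaming=True, project=args.project)

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromPubSub(
                topic=f"projects/{args.project}/topics/{args.topic}")
            | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "Window" >> beam.WindowInto(window.FixedWindows(60))   # 1-minute windows
            | "KeyByAirport" >> beam.Map(lambda e: (e["airport"], e["dep_delay"]))
            | "MeanDelay" >> Mean.PerKey()
            | "ToRow" >> beam.Map(lambda kv: {"airport": kv[0], "avg_dep_delay": kv[1]})
            | "Write" >> beam.io.WriteToBigQuery(
                args.output_table,
                schema="airport:STRING,avg_dep_delay:FLOAT",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )


if __name__ == "__main__":
    run()
```

To run it on Dataflow rather than locally, append the usual Beam flags (`--runner=DataflowRunner --region=YOUR_GCP_REGION --temp_location=gs://YOUR_BUCKET/tmp`) after the script's own arguments.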
This module showcases how to build and evaluate machine learning models directly within BigQuery using SQL.
- Navigate to the module directory: `cd 08_bqml`
- Open the Jupyter Notebook: Launch a Jupyter Notebook server (e.g., via Vertex AI Workbench or locally) and open `bqml_logistic.ipynb`. If using local Jupyter, start it with `jupyter notebook`.
- Execute Notebook Cells: Follow the steps in `bqml_logistic.ipynb` to:
  - Create a BigQuery dataset if not already present.
  - Load sample data into a BigQuery table (or use data from `02_ingest`).
  - Define and train a logistic regression model using `CREATE MODEL` SQL statements.
  - Evaluate the model's performance using `ML.EVALUATE` statements.
  - Make predictions using `ML.PREDICT` statements.

  (A hedged Python sketch of this flow appears after these steps.)
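If you would rather drive the same flow from a plain Python script than the notebook, a hedged sketch using the BigQuery client is shown below. The table and column names are assumptions; the notebook's actual queries differ.

```python
# Train, evaluate, and predict with BigQuery ML from Python (illustrative schema).
from google.cloud import bigquery

client = bigquery.Client(project="YOUR_PROJECT_ID")

# Train a logistic regression model inside BigQuery.
client.query("""
    CREATE OR REPLACE MODEL `flights.delay_model`
    OPTIONS(model_type='logistic_reg', input_label_cols=['on_time']) AS
    SELECT
      dep_delay,
      taxi_out,
      distance,
      IF(arr_delay < 15, 'ontime', 'late') AS on_time
    FROM `flights.tzcorr`            -- hypothetical training table
    WHERE arr_delay IS NOT NULL
""").result()

# Evaluate the trained model.
for row in client.query("SELECT * FROM ML.EVALUATE(MODEL `flights.delay_model`)").result():
    print(dict(row))

# Make a prediction for a single hand-written example.
prediction_sql = """
    SELECT * FROM ML.PREDICT(MODEL `flights.delay_model`,
      (SELECT 10.0 AS dep_delay, 15.0 AS taxi_out, 800.0 AS distance))
"""
for row in client.query(prediction_sql).result():
    print(dict(row))
```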
This module focuses on taking a trained TensorFlow model and deploying it to Vertex AI for online predictions.
- Navigate to the module directory: `cd 09_vertexai`
- Open the Jupyter Notebook: Open `flights_model_tf2.ipynb` in your Jupyter environment.
- Train and Export Model: Execute the notebook cells to:
  - Preprocess data.
  - Train a TensorFlow 2.x model for flight delay prediction.
  - Export the trained model in the TensorFlow SavedModel format.
- Deploy to Vertex AI: The notebook will guide you through uploading the model to Vertex AI Model Registry and deploying it to an Endpoint. Alternatively, after the model is exported, you can use the `gcloud` CLI:

  ```bash
  # First, upload the model (--artifact-uri points to the GCS path of your exported SavedModel)
  MODEL_DISPLAY_NAME="flights-delay-model"
  MODEL_ID=$(gcloud ai models upload \
    --project=YOUR_PROJECT_ID \
    --region=YOUR_GCP_REGION \
    --display-name=${MODEL_DISPLAY_NAME} \
    --container-image-uri="us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-8:latest" \
    --artifact-uri="gs://YOUR_PROJECT_ID-bucket/flights_model_dir" \
    --format="value(model_id)")

  # Then, create an endpoint
  ENDPOINT_DISPLAY_NAME="flights-delay-endpoint"
  ENDPOINT_ID=$(gcloud ai endpoints create \
    --project=YOUR_PROJECT_ID \
    --region=YOUR_GCP_REGION \
    --display-name=${ENDPOINT_DISPLAY_NAME} \
    --format="value(endpoint_id)")

  # Finally, deploy the model to the endpoint
  gcloud ai endpoints deploy-model ${ENDPOINT_ID} \
    --project=YOUR_PROJECT_ID \
    --region=YOUR_GCP_REGION \
    --model=${MODEL_ID} \
    --display-name=${MODEL_DISPLAY_NAME}-deployed \
    --machine-type=n1-standard-2 \
    --min-replica-count=1 \
    --max-replica-count=1
  ```
- Make Predictions: Use `call_predict.sh` (or `call_predict.py` from `10_mlops`) to send prediction requests to the deployed endpoint:

  ```bash
  ./call_predict.sh
  ```

  (A hedged Python alternative appears after these steps.)
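As referenced above, a hedged Python alternative to `call_predict.sh` uses the Vertex AI SDK. The endpoint ID and the feature names in the instance are assumptions; they must match your deployment and the model's serving signature.

```python
# Call a deployed Vertex AI endpoint for online prediction (illustrative instance).
from google.cloud import aiplatform

aiplatform.init(project="YOUR_PROJECT_ID", location="YOUR_GCP_REGION")

endpoint = aiplatform.Endpoint("ENDPOINT_ID")  # numeric ID from `gcloud ai endpoints list`

instance = {              # keys must match the exported model's serving signature
    "dep_delay": 14.0,
    "taxi_out": 13.0,
    "distance": 319.0,
}

response = endpoint.predict(instances=[instance])
print(response.predictions)
```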
- Dataset Size: While designed for scalability, certain examples might use smaller datasets or samples for quick execution. Adapting to extremely large, multi-TB datasets may require further optimization.
- Regional Restrictions: Some GCP services or features might not be available in all regions. Ensure your chosen `YOUR_GCP_REGION` supports all services used in the modules you run.
- Cost Management: Running all examples, especially those involving Dataproc or continuous Dataflow jobs, can incur significant GCP costs. Monitor your billing regularly.
- Dependency Conflicts: Python library versions can sometimes cause conflicts. It's highly recommended to use `venv` or `conda` for isolated environments.
- GCP API Enablement: Forgetting to enable a required GCP API is a common issue. If you encounter permissions errors, double-check that all relevant APIs (as listed in Getting Started) are enabled.
- Qwiklabs Verification: The code on Qwiklabs is continually tested and kept up-to-date. If you encounter issues, we recommend trying the corresponding Qwiklab lab to verify if any setup steps were missed. Please leave feedback in Qwiklabs or file an issue in this repository if problems persist.
We are continuously working to improve and expand this repository. Planned enhancements include:
- Expanded Vertex AI Coverage: More examples for Vertex AI Pipelines, Feature Store, and Monitoring.
- Cost Optimization Guides: Adding specific tips and configurations to manage GCP costs effectively across different modules.
- Terraform/Pulumi Integration: Providing Infrastructure as Code (IaC) examples for setting up GCP resources.
- Newer GCP Services: Incorporating examples for new or updated GCP data and ML services as they become available.
- Generalize for Other Datasets: Providing guidelines or modified scripts to easily adapt the workflows to different datasets.
We welcome contributions from the community to improve this repository! Please follow these guidelines to ensure a smooth collaboration process.
- Fork the Repository: Start by forking the `GDSC-FSC/data-science-on-gcp` repository to your GitHub account.
- Create a Feature Branch: Create a new branch for your changes (e.g., `feature/my-new-example` or `fix/bug-description`): `git checkout -b feature/your-feature-name`
- Implement Your Changes: Make your modifications, additions, or bug fixes.
- Test Your Changes: Ensure your code works as expected and doesn't introduce regressions.
- Commit Your Changes: Write clear and concise commit messages, e.g., `git commit -m "feat: Add new example for X service"`
- Push to Your Fork: `git push origin feature/your-feature-name`
- Open a Pull Request (PR):
- Go to the original repository on GitHub.
- You should see a prompt to create a new Pull Request from your recently pushed branch.
- Provide a descriptive title and detailed explanation of your changes in the PR description.
- Reference any related issues.
- Branch Naming: Use descriptive names for your branches (e.g., `feature/new-bqml-example`, `fix/streaming-bug`).
- Small, Focused PRs: Keep your PRs small and focused on a single change or feature to facilitate easier review.
- Clear Descriptions: Explain the "what" and "why" of your changes in the PR description. Include screenshots or terminal outputs if helpful.
- Review Process: Your PR will be reviewed by maintainers. Be responsive to feedback and be prepared to make iterative changes.
- Python: Adhere to PEP 8 style guidelines. Use `black` for formatting and `flake8` for linting.
- Shell Scripts: Ensure scripts are clear, well-commented, and robust.
- SQL: Format SQL queries for readability, using consistent capitalization for keywords.
- Jupyter Notebooks: Keep notebooks clean, with clear explanations for each step and cell output.
- Local Testing: Before submitting a PR, thoroughly test your changes in your local environment and, if applicable, on a GCP project.
- Qwiklabs Verification: If your changes relate to a specific module covered by Qwiklabs, try to verify compatibility there.
This project is licensed under the Apache License, Version 2.0. See the LICENSE file for full details.
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
- Author: Valliappa Lakshmanan (for the original book and much of the codebase).
- Book: "Data Science on the Google Cloud Platform, 2nd Edition" (O'Reilly, Apr 2022).
- Google Cloud Platform: For providing the comprehensive suite of services that make these solutions possible.
- Contributors: To all individuals who have contributed to the evolution and improvement of this codebase.
If you have questions, suggestions, or encounter issues, please:
- File an Issue: Use the GitHub Issues page for bug reports, feature requests, or general questions.
- Qwiklabs Feedback: If you are following the Qwiklabs labs, please use their feedback mechanisms for specific lab-related queries.
- April 2022: Initial release of the 2nd Edition companion code.
- November 2019: Major update to support TensorFlow 2.0, Cloud Functions, and BigQuery ML.
- Earlier Versions: Continuous updates and improvements aligned with the evolution of Google Cloud services and the book's content.
Q: How can I manage costs when running these examples?

A: Many GCP services incur costs. To manage them:
1. **Monitor Billing:** Regularly check your GCP billing dashboard.
2. **Delete Resources:** After completing a module, delete resources like Dataproc clusters (`gcloud dataproc clusters delete`), Dataflow jobs (cancel from the Dataflow UI), and Vertex AI Endpoints (`gcloud ai endpoints delete`). A hedged Python cleanup sketch appears at the end of this appendix.
3. **Use Free Tier:** Leverage GCP's free tier for services where applicable.
4. **Set Quotas:** Consider setting custom quotas for resource usage to prevent unexpectedly high costs.

Q: My script/notebook failed with a `permission denied` error. What should I do?

A: This usually indicates that the service account or user running the operation lacks the necessary IAM permissions.
1. **Check IAM Roles:** Ensure your user account or the service account assigned to your VM/Cloud Run instance has the appropriate roles (e.g., `BigQuery Data Editor`, `Storage Object Admin`, `Dataflow Developer`).
2. **Enable APIs:** Verify that all required APIs are enabled for your project (see [Getting Started](#2-google-cloud-project-setup)).
3. **`gcloud auth`:** Ensure your `gcloud` CLI is correctly authenticated (`gcloud auth login`) and has application default credentials (`gcloud auth application-default login`).

Q: Can I use a different dataset with these examples?

A: Yes, many examples are adaptable.
1. **Schema Alignment:** The most critical step is to adapt the data ingestion and transformation scripts (`02_ingest`, `04_streaming`) to your dataset's schema.
2. **Feature Engineering:** Update SQL queries (`03_sqlstudio`, `08_bqml`) and Python notebooks (`05_bqnotebook`, `09_vertexai`) to use your dataset's features.
3. **Model Training:** Retrain models with your data, adjusting model architectures if necessary.

- Check `gcloud` Configuration:
  - Verify your project ID and region: `gcloud config list`
  - Ensure you are authenticated: `gcloud auth list` and `gcloud auth application-default print-access-token`
- API Enablement: Review the list of APIs to enable in Getting Started and run the `gcloud services enable` command again to be safe.
- Resource Existence: Ensure that GCS buckets, Pub/Sub topics, and BigQuery datasets/tables you're trying to use actually exist and are correctly named.
- Logs: Always check the logs in the Google Cloud Console for specific error messages.
- For Dataflow jobs: Dataflow UI -> Job Details -> Logs
- For Cloud Run/Functions: Cloud Logging
- For Vertex AI: Vertex AI Console -> Model/Endpoint -> Logs
- Environment Variables: Confirm that all necessary environment variables (e.g., `PROJECT_ID`, `BUCKET`) are set correctly before running scripts:

  ```bash
  echo $PROJECT_ID
  echo $BUCKET
  ```
- Python Virtual Environments: Use `venv` or `conda` to avoid dependency conflicts:

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  pip install -r requirements.txt  # if a requirements.txt is provided in a module
  ```
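Finally, as mentioned in the cost-management FAQ above, remember to tear down resources when you finish a module. Below is a hedged cleanup sketch for Vertex AI resources; note that it deletes every endpoint and model in the configured project and region, so use it only on a throwaway project.

```python
# Delete all Vertex AI endpoints and models in the project/region -- destructive!
from google.cloud import aiplatform

aiplatform.init(project="YOUR_PROJECT_ID", location="YOUR_GCP_REGION")

for endpoint in aiplatform.Endpoint.list():
    print("Deleting endpoint:", endpoint.display_name)
    endpoint.undeploy_all()   # models must be undeployed before the endpoint can be deleted
    endpoint.delete()

for model in aiplatform.Model.list():
    print("Deleting model:", model.display_name)
    model.delete()
```

Dataproc clusters and Dataflow jobs are easiest to remove with the `gcloud` commands quoted in the FAQ above.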