A Spark library for deduplicating and grouping spatial features, using spatial joins and graph-based clustering.

avinoamn/geolink

GeoLink

GeoLink is a scalable, Spark-based library for deduplicating and grouping spatial entities such as building footprints or other geographic features. It links entities with a spatial join, clusters the linked entities with graph analysis, and selects a representative geometry for each cluster with a geometric scoring function. GeoLink is designed for large geospatial datasets and supports incremental (batch) updates on data lake storage (e.g., Apache Iceberg).

Features

  • Spatial Deduplication: Group overlapping or similar geometries using spatial predicates (e.g., intersection, IoU).
  • Flexible Grouping Logic: Use custom geometric scoring to select representative shapes within each group.
  • Incremental Batch Updates: Efficiently update entity groups with new data using Iceberg-backed storage.
  • Plug-and-Play with Spark: Built on Apache Spark, leveraging Sedona for spatial operations and GraphX for graph-based grouping.
  • Extensible Data Model: Type-safe, generic models for entities, groups, and storage integration.

Architecture Overview

  1. Entity Model: Represents spatial features (e.g., a building footprint).
  2. Group Model: Represents a deduplicated cluster of entities sharing similar geometry.
  3. Spatial Join: Uses Sedona to find overlapping or similar entities.
  4. Graph Building: Entities are connected in a graph based on spatial relationships.
  5. Connected Components: GraphX identifies clusters (groups) of linked entities.
  6. Group Geometry Selection: A custom scoring function (e.g., shape quality) selects the best representative geometry per group.
  7. Batch Updates: Supports efficient read, write, delete, and compaction operations on groups using Apache Iceberg.
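
The graph steps above (4 and 5) can be illustrated without Spark. The following is a minimal in-memory sketch, not GeoLink's implementation: the library delegates this step to GraphX, while the sketch swaps in a small union-find. The object name ConnectedComponentsSketch and the components function are hypothetical and exist only for illustration.

```scala
import scala.collection.mutable

// Hypothetical in-memory stand-in for the GraphX connected-components step.
object ConnectedComponentsSketch {

  /** Groups ids into clusters, given undirected links (e.g. pairs matched by a spatial join). */
  def components[A](ids: Seq[A], links: Seq[(A, A)]): Set[Set[A]] = {
    val parent = mutable.Map.empty[A, A]
    // Register every id, including ids that only appear in a link.
    val nodes = (ids ++ links.flatMap { case (a, b) => Seq(a, b) }).distinct
    nodes.foreach(n => parent(n) = n)

    // Follow parent pointers to the root, compressing the path on the way back.
    def find(x: A): A = {
      val p = parent(x)
      if (p == x) x
      else {
        val root = find(p)
        parent(x) = root
        root
      }
    }

    // Each link merges the two clusters it connects.
    links.foreach { case (a, b) =>
      val ra = find(a)
      val rb = find(b)
      if (ra != rb) parent(ra) = rb
    }

    // Ids that share a root belong to the same cluster.
    nodes.groupBy(find).values.map(_.toSet).toSet
  }
}
```

With ids 1 to 5 and links (1, 2) and (2, 3), the sketch returns three clusters: {1, 2, 3}, {4}, and {5}. Unlinked entities stay in singleton groups, which matches the idea that every entity ends up in exactly one group.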

Installation

Prerequisites

  • Java 8 or 11
  • Apache Spark 3.5.x
  • Scala 2.12.x
  • Apache Sedona 1.6.x
  • Apache Iceberg 1.9.x
  • Maven 3.x

Clone and Build

To install GeoLink, clone the repository and build it using Maven:

git clone https://github.com/avinoamn/geolink.git
cd geolink
mvn clean package

  • This builds both the core library and the example modules.
  • The resulting JARs are written to core/target/ and examples/target/.

Getting Started

See the examples module for ready-to-run demos.

Simple Deduplication Example

import github.avinoamn.geolink.GeoLink
import github.avinoamn.geolink.examples.models.EntityId
import github.avinoamn.geolink.models.Entity
import github.avinoamn.geolink.examples.utils.ShapeScorer.engineeredScore
import github.avinoamn.geolink.examples.utils.SpatialPredicates.iou
import org.apache.sedona.core.spatialOperator.SpatialPredicate

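// Join entities whose geometries intersect, keep only pairs that pass the
// IoU predicate, and pick the highest-scoring geometry within each group.
val geoLink = GeoLink[EntityId](
  SpatialPredicate.INTERSECTS,
  Some(iou(0.5)),
  _.maxBy(engineeredScore)
)

// inputEntities: Dataset[Entity[EntityId]]
val deduplicatedGroups = geoLink.run(inputEntities)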

Batch Incremental Update Example

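import github.avinoamn.geolink.GeoLink
import github.avinoamn.geolink.examples.dataStore.GroupsDataStore
import github.avinoamn.geolink.examples.models.EntityId
import github.avinoamn.geolink.examples.utils.ShapeScorer.engineeredScore
import github.avinoamn.geolink.examples.utils.SpatialPredicates.iou
import org.apache.sedona.core.spatialOperator.SpatialPredicate

// Iceberg-backed store that holds the groups produced by earlier runs.
val groupsStore = GroupsDataStore(
  warehousePath = "...",
  catalog = "...",
  namespace = "...",
  tableName = "groups"
)

// Same configuration as the single-pass example, bound to the groups store.
val geoLinkBatch = GeoLink[EntityId](
  SpatialPredicate.INTERSECTS,
  Some(iou(0.5)),
  _.maxBy(engineeredScore)
).Batch(groupsStore)

// Link the new entities against the stored groups and update them.
val batchResult = geoLinkBatch.run(newEntities)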

Configuration

  • Spatial Predicate: Use Sedona's predicates (INTERSECTS, CONTAINS, etc.) or a custom predicate (e.g., IoU threshold).
  • Scoring Function: Plug in your own geometry scoring for selecting the best group representative.
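
To make the two configuration points concrete, here is a minimal sketch of an IoU threshold predicate and a representative selector. It assumes axis-aligned boxes instead of real geometries so that it runs without Sedona or JTS. Box, iouValue, iouAtLeast, and pickRepresentative are hypothetical names, and whether the library's iou(0.5) compares with >= or > is not stated in this README.

```scala
// Hypothetical sketch: IoU on axis-aligned boxes, not GeoLink's SpatialPredicates.iou.
object IouSketch {

  // Stand-in for a geometry: an axis-aligned box.
  final case class Box(minX: Double, minY: Double, maxX: Double, maxY: Double) {
    def area: Double = math.max(0.0, maxX - minX) * math.max(0.0, maxY - minY)
  }

  // Area of the overlap between two boxes; 0.0 when they are disjoint or only touch.
  def intersectionArea(a: Box, b: Box): Double = {
    val w = math.min(a.maxX, b.maxX) - math.max(a.minX, b.minX)
    val h = math.min(a.maxY, b.maxY) - math.max(a.minY, b.minY)
    if (w > 0 && h > 0) w * h else 0.0
  }

  // Intersection over union: overlap area divided by the area covered by either box.
  def iouValue(a: Box, b: Box): Double = {
    val inter = intersectionArea(a, b)
    val union = a.area + b.area - inter
    if (union > 0) inter / union else 0.0
  }

  // Builds a pair predicate from a threshold (assumption: inclusive comparison).
  def iouAtLeast(threshold: Double): (Box, Box) => Boolean =
    (a, b) => iouValue(a, b) >= threshold

  // Same shape as `_.maxBy(score)`: keep the candidate with the highest score.
  def pickRepresentative(candidates: Seq[Box], score: Box => Double): Box =
    candidates.maxBy(score)
}
```

For example, Box(0, 0, 2, 2) and Box(1, 0, 3, 2) overlap in an area of 2 and cover a combined area of 6, so their IoU is 1/3 and a 0.5 threshold rejects the pair even though the boxes intersect. This is why an IoU predicate on top of INTERSECTS gives a stricter join than INTERSECTS alone.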

File Structure

  • core/ - Main library code (models, logic, extensions)
  • examples/ - Example apps demonstrating usage, including batch and single-pass deduplication
  • core/src/main/scala/github/avinoamn/geolink/models/ - Entity and group model definitions
  • core/src/main/scala/github/avinoamn/geolink/utils/logic/ - Core logic: grouping, graph, join, Z-order utilities

How It Works

  1. Convert entities to candidate groups (toGroups)
  2. Spatial join to find overlapping/similar entities
  3. Build a graph whose nodes are candidate groups and whose edges are spatial links
  4. Find connected components (groups of linked entities)
  5. Aggregate group members and select best geometry per group
  6. Write, delete, and compact groups in efficient columnar storage (Iceberg)
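
The last step is easier to follow with a small example of what an incremental update has to decide. The sketch below is an assumption about how write and delete fit together, not the library's implementation: a new entity that links to existing groups causes those groups to be deleted and rewritten as one merged group, while untouched groups are left alone. Group, Plan, and plan are hypothetical names, no Iceberg is involved, and using the smallest member id as the group id is an arbitrary choice made for the sketch.

```scala
// Hypothetical in-memory sketch of planning an incremental update.
object BatchUpdateSketch {

  final case class Group(id: String, members: Set[String])
  final case class Plan(toDelete: Set[String], toWrite: Seq[Group])

  /**
   * existing:    groups already stored (assumed non-empty, ids distinct from entity ids)
   * newEntities: ids of the entities in the new batch
   * links:       pairs of node ids (new entity ids or existing group ids) matched by the join
   */
  def plan(existing: Seq[Group], newEntities: Seq[String], links: Seq[(String, String)]): Plan = {
    val byId = existing.map(g => g.id -> g).toMap
    val nodes = (newEntities ++ links.flatMap { case (a, b) => Seq(a, b) }).distinct

    // Start with one cluster per node and merge the clusters that each link connects.
    val singletons: List[Set[String]] = nodes.map(n => Set(n)).toList
    val clusters = links.foldLeft(singletons) { case (cs, (a, b)) =>
      val (hit, rest) = cs.partition(c => c.contains(a) || c.contains(b))
      val merged: Set[String] = hit.flatten.toSet
      merged :: rest
    }

    // Every existing group that ended up in a cluster is replaced by the merged group.
    val toDelete: Set[String] = clusters.flatMap(_.filter(byId.contains)).toSet
    val toWrite = clusters.map { cluster =>
      val (groupIds, entityIds) = cluster.partition(byId.contains)
      val members = entityIds ++ groupIds.flatMap(id => byId(id).members)
      Group(members.min, members)
    }
    Plan(toDelete, toWrite)
  }
}
```

Given stored groups g1 = {a}, g2 = {b}, g3 = {c}, new entities n1 and n2, and links (n1, g1) and (n1, g2), the plan deletes g1 and g2, writes one merged group {a, b, n1} and one new group {n2}, and never touches g3. Rewriting only the affected groups is what keeps a batch update cheaper than recomputing every group from scratch.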

Advanced Topics

  • Iceberg Integration: See GroupsDataStore.scala for advanced data lake integration (partitioning, compaction, etc.)
  • Custom Scoring: See ShapeScorer.scala for engineered shape scores (rectangularity, convexity, right angles)
  • Spatial Predicates: See SpatialPredicates.scala for IoU and other custom join logic
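
As a rough illustration of what an engineered shape score can measure, the sketch below computes one of the listed ingredients, rectangularity, as the ratio of a polygon's area to the area of its bounding box. It is a simplification under stated assumptions: it uses the axis-aligned bounding box rather than a minimum rotated rectangle, it works on plain coordinate pairs rather than Sedona geometries, and ShapeScoreSketch, area, and rectangularity are hypothetical names unrelated to the contents of ShapeScorer.scala.

```scala
// Hypothetical sketch of a rectangularity score; not the library's engineeredScore.
object ShapeScoreSketch {

  type Point = (Double, Double)

  // Shoelace formula: area of a simple polygon given as a non-empty ring of vertices.
  def area(ring: Seq[Point]): Double = {
    val closed = ring :+ ring.head
    val twiceSigned = closed.zip(closed.tail).map {
      case ((x1, y1), (x2, y2)) => x1 * y2 - x2 * y1
    }.sum
    math.abs(twiceSigned) / 2.0
  }

  // Polygon area divided by the area of its axis-aligned bounding box, in [0, 1].
  def rectangularity(ring: Seq[Point]): Double = {
    val xs = ring.map(_._1)
    val ys = ring.map(_._2)
    val boxArea = (xs.max - xs.min) * (ys.max - ys.min)
    if (boxArea > 0) area(ring) / boxArea else 0.0
  }
}
```

An axis-aligned square scores 1.0, an L-shaped footprint that fills three quarters of its bounding box scores 0.75, and a right triangle scores 0.5. A selector such as candidates.maxBy(ShapeScoreSketch.rectangularity) would therefore prefer the cleanest rectangular outline among duplicates of the same building.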

License

MIT License

Credits


For questions or contributions, open an issue or pull request!
