GeoLink

GeoLink is a scalable, Spark-based library for spatial entity deduplication and grouping. It efficiently links, clusters, and manages spatial data entities (such as buildings or other geographic features) by leveraging spatial join, graph analysis, and advanced geometric scoring. GeoLink is designed to work with large geospatial datasets and supports incremental (batch) updates using modern data lake technologies (e.g., Apache Iceberg).

Features

Spatial Deduplication: Group overlapping or similar geometries using spatial predicates (e.g., intersection, IoU).
Flexible Grouping Logic: Use custom geometric scoring to select representative shapes within each group.
Incremental Batch Updates: Efficiently update entity groups with new data using Iceberg-backed storage.
Plug-and-Play with Spark: Built on Apache Spark, leveraging Sedona for spatial operations and GraphX for graph-based grouping.
Extensible Data Model: Type-safe, generic models for entities, groups, and storage integration.

Architecture Overview

Entity Model: Represents spatial features (e.g., a building footprint).
Group Model: Represents a deduplicated cluster of entities sharing similar geometry.
Spatial Join: Uses Sedona to find overlapping or similar entities.
Graph Building: Entities are connected in a graph based on spatial relationships.
Connected Components: GraphX identifies clusters (groups) of linked entities.
Group Geometry Selection: A custom scoring function (e.g., shape quality) selects the best representative geometry per group.
Batch Updates: Supports efficient read, write, delete, and compaction operations on groups using Apache Iceberg.
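
The graph-building and connected-components steps above can be illustrated without Spark. The following is a minimal sketch in plain Scala that uses a union-find structure as a stand-in for GraphX; the object and method names are hypothetical and are not part of GeoLink. It labels every component with its smallest node id, which matches the convention GraphX uses for connected components.

```scala
object ComponentsSketch {
  // Returns a map from each node id to the smallest id in its connected component.
  // `edges` are undirected links, e.g. pairs of entities matched by the spatial join.
  def connectedComponents(nodes: Seq[Long], edges: Seq[(Long, Long)]): Map[Long, Long] = {
    val parent = scala.collection.mutable.Map[Long, Long]()
    // Every node starts as its own root; endpoints of edges are registered too.
    (nodes ++ edges.flatMap { case (a, b) => Seq(a, b) }).foreach(n => parent(n) = n)

    // Find the root of x, then compress the path so later lookups are short.
    def find(x: Long): Long = {
      var root = x
      while (parent(root) != root) root = parent(root)
      var cur = x
      while (parent(cur) != root) {
        val next = parent(cur)
        parent(cur) = root
        cur = next
      }
      root
    }

    // Merge the two components of each edge, keeping the smaller root as the label.
    edges.foreach { case (a, b) =>
      val ra = find(a)
      val rb = find(b)
      if (ra != rb) {
        if (ra < rb) parent(rb) = ra else parent(ra) = rb
      }
    }

    parent.keys.toList.map(n => (n, find(n))).toMap
  }
}
```

For example, nodes 1 to 5 with links (1,2), (2,3), and (4,5) fall into two components labeled 1 and 4. A distributed implementation reaches the same labeling by iterative message passing rather than a shared mutable map.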

Installation

Prerequisites

Java 8 or 11
Apache Spark 3.5.x
Scala 2.12.x
Apache Sedona 1.6.x
Apache Iceberg 1.9.x
Maven 3.x

Clone and Build

To install GeoLink, clone the repository and build it using Maven:

git clone https://github.com/avinoamn/geolink.git
cd geolink
mvn clean package

This will build both the core library and example modules.
The resulting JARs can be found in core/target/ and examples/target/.

Getting Started

See the examples module for ready-to-run demos.

Simple Deduplication Example

import github.avinoamn.geolink.GeoLink
import github.avinoamn.geolink.examples.models.EntityId
import github.avinoamn.geolink.models.Entity
import github.avinoamn.geolink.examples.utils.ShapeScorer.engineeredScore
import github.avinoamn.geolink.examples.utils.SpatialPredicates.iou
import org.apache.sedona.core.spatialOperator.SpatialPredicate

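val geoLink = GeoLink[EntityId](
  SpatialPredicate.INTERSECTS, // Sedona predicate used for the spatial join
  Some(iou(0.5)),              // optional custom predicate: IoU threshold of 0.5
  _.maxBy(engineeredScore)     // pick the highest-scoring geometry as the group representative
)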

val deduplicatedGroups = geoLink.run(inputEntities) // inputEntities: Dataset[Entity[EntityId]]

Batch Incremental Update Example

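import github.avinoamn.geolink.GeoLink
import github.avinoamn.geolink.examples.dataStore.GroupsDataStore
import github.avinoamn.geolink.examples.models.EntityId
import github.avinoamn.geolink.examples.utils.ShapeScorer.engineeredScore
import github.avinoamn.geolink.examples.utils.SpatialPredicates.iou
import org.apache.sedona.core.spatialOperator.SpatialPredicate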

val groupsStore = GroupsDataStore(
  warehousePath = "...",
  catalog = "...",
  namespace = "...",
  tableName = "groups"
)

val geoLinkBatch = GeoLink[EntityId](
  SpatialPredicate.INTERSECTS,
  Some(iou(0.5)),
  _.maxBy(engineeredScore)
).Batch(groupsStore)

val batchResult = geoLinkBatch.run(newEntities)

Configuration

Spatial Predicate: Use Sedona's predicates (INTERSECTS, CONTAINS, etc.) or a custom predicate (e.g., IoU threshold).
Scoring Function: Plug in your own geometry scoring for selecting the best group representative.
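
To make the idea of an IoU threshold predicate concrete, here is a small sketch in plain Scala. It computes intersection over union for axis-aligned rectangles only, as a simplified stand-in for real geometries; `Box`, `iou`, and `iouAtLeast` are hypothetical names for illustration and do not reflect the signatures in SpatialPredicates.scala.

```scala
object IouSketch {
  // Axis-aligned rectangle; a simplified stand-in for a real geometry.
  final case class Box(minX: Double, minY: Double, maxX: Double, maxY: Double) {
    def area: Double = math.max(0.0, maxX - minX) * math.max(0.0, maxY - minY)
  }

  // Intersection over union: overlap area divided by the area covered by either box.
  def iou(a: Box, b: Box): Double = {
    val w = math.min(a.maxX, b.maxX) - math.max(a.minX, b.minX)
    val h = math.min(a.maxY, b.maxY) - math.max(a.minY, b.minY)
    val inter = if (w > 0 && h > 0) w * h else 0.0
    val union = a.area + b.area - inter
    if (union > 0) inter / union else 0.0
  }

  // Threshold predicate: true when the two boxes overlap by at least `threshold`.
  def iouAtLeast(threshold: Double)(a: Box, b: Box): Boolean =
    iou(a, b) >= threshold
}
```

With a threshold of 0.5, two 2x2 boxes offset by one unit (IoU of 1/3) would not be linked, while a 2x2 box and a 2x1.5 box sharing a corner (IoU of 0.75) would. A plain INTERSECTS predicate would link both pairs, which is why a stricter secondary predicate can reduce over-merging.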

File Structure

core/ - Main library code (models, logic, extensions)
examples/ - Example apps demonstrating usage, including batch and single-pass deduplication
core/src/main/scala/github/avinoamn/geolink/models/ - Entity and group model definitions
core/src/main/scala/github/avinoamn/geolink/utils/logic/ - Core logic: grouping, graph, join, Z-order utilities
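
For readers unfamiliar with Z-ordering, the general technique interleaves the bits of two grid coordinates so that cells close together in space tend to receive nearby keys. The sketch below shows the generic Morton encoding for two 16-bit cell indices; it is an illustration of the technique under that assumption, not the code in the library's Z-order utilities, and `ZOrderSketch` is a hypothetical name.

```scala
object ZOrderSketch {
  // Spread the lower 16 bits of v so that they occupy the even bit positions.
  private def spread(v: Int): Int = {
    var x = v & 0xFFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    x
  }

  // Interleave two 16-bit cell coordinates: x fills the even bits, y the odd bits.
  // The result is returned as a Long so the highest bit never reads as a sign bit.
  def encode(x: Int, y: Int): Long =
    spread(x).toLong | (spread(y).toLong << 1)
}
```

Sorting or partitioning rows by such a key is a common way to keep spatially close records in the same files, which in turn lets a query engine skip files that cannot match a spatial filter.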

How It Works

Convert entities to candidate groups (toGroups)
Spatial join to find overlapping/similar entities
Build a graph where nodes are groups, edges are spatial links
Find connected components (groups of linked entities)
Aggregate group members and select best geometry per group
Write, delete, and compact groups in efficient columnar storage (Iceberg)
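
The aggregation step above (collect the members of each component, then select the best geometry) can be sketched in plain Scala. This assumes each entity already carries a component label and a precomputed score; the `Scored` case class and `representatives` method are hypothetical and only illustrate the shape of the computation.

```scala
object SelectionSketch {
  // Hypothetical minimal record: entity id, assigned component, and a precomputed score.
  final case class Scored(id: String, component: Long, score: Double)

  // For each component, return the highest-scoring member and the ids of all members.
  def representatives(members: Seq[Scored]): Map[Long, (Scored, Seq[String])] =
    members.groupBy(_.component).map { case (component, group) =>
      (component, (group.maxBy(_.score), group.map(_.id)))
    }
}
```

The `maxBy` call plays the same role as the `_.maxBy(engineeredScore)` argument in the examples above: any function that picks one element from a group could be substituted.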

Advanced Topics

Iceberg Integration: See GroupsDataStore.scala for advanced data lake integration (partitioning, compaction, etc.)
Custom Scoring: See ShapeScorer.scala for engineered shape scores (rectangularity, convexity, right angles)
Spatial Predicates: See SpatialPredicates.scala for IoU and other custom join logic
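
As an illustration of what a shape score such as rectangularity might measure, the sketch below divides a polygon's area by the area of its axis-aligned bounding box. This is one simple definition chosen for the example; the formula actually used in ShapeScorer.scala is not shown in this README and may differ (for instance, by using a minimum rotated rectangle). All names here are hypothetical.

```scala
object RectangularitySketch {
  type Point = (Double, Double)

  // Shoelace formula: area of a simple polygon given as an ordered ring of
  // at least three vertices (the ring does not need to repeat its first point).
  def area(ring: Seq[Point]): Double = {
    val closed = ring :+ ring.head
    val twice = closed.sliding(2).map {
      case Seq((x1, y1), (x2, y2)) => x1 * y2 - x2 * y1
    }.sum
    math.abs(twice) / 2.0
  }

  // Ratio of polygon area to bounding-box area: 1.0 for an axis-aligned
  // rectangle, lower for shapes that fill less of their bounding box.
  def rectangularity(ring: Seq[Point]): Double = {
    val xs = ring.map(_._1)
    val ys = ring.map(_._2)
    val boxArea = (xs.max - xs.min) * (ys.max - ys.min)
    if (boxArea > 0) area(ring) / boxArea else 0.0
  }
}
```

Under this definition a 2x2 square scores 1.0 and a right triangle with legs of length 2 scores 0.5. Because the bounding box is axis-aligned, a rotated rectangle would score below 1.0, which is the main limitation of this simplified measure.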

License

MIT License

Credits

For questions or contributions, open an issue or pull request!
