GeoLink is a scalable, Spark-based library for spatial entity deduplication and grouping. It efficiently links, clusters, and manages spatial data entities (such as buildings or other geographic features) by leveraging spatial join, graph analysis, and advanced geometric scoring. GeoLink is designed to work with large geospatial datasets and supports incremental (batch) updates using modern data lake technologies (e.g., Apache Iceberg).
- Spatial Deduplication: Group overlapping or similar geometries using spatial predicates (e.g., intersection, IoU).
- Flexible Grouping Logic: Use custom geometric scoring to select representative shapes within each group.
- Incremental Batch Updates: Efficiently update entity groups with new data using Iceberg-backed storage.
- Plug-and-Play with Spark: Built on Apache Spark, leveraging Sedona for spatial operations and GraphX for graph-based grouping.
- Extensible Data Model: Type-safe, generic models for entities, groups, and storage integration.
- Entity Model: Represents spatial features (e.g., a building footprint).
- Group Model: Represents a deduplicated cluster of entities sharing similar geometry.
- Spatial Join: Uses Sedona to find overlapping or similar entities.
- Graph Building: Entities are connected in a graph based on spatial relationships.
- Connected Components: GraphX identifies clusters (groups) of linked entities.
- Group Geometry Selection: A custom scoring function (e.g., shape quality) selects the best representative geometry per group.
- Batch Updates: Supports efficient read, write, delete, and compaction operations on groups using Apache Iceberg.
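The group geometry selection step can be pictured as a maxBy over the members of a group. The following is a minimal sketch in plain Scala, not GeoLink's own code: Rect, squareness, and selectRepresentative are hypothetical names introduced for illustration, and the score is a toy stand-in for a real shape-quality function.

```scala
// Hypothetical stand-in for a geometry: an axis-aligned rectangle.
final case class Rect(id: String, width: Double, height: Double)

object RepresentativeSelection {
  // Hypothetical score in (0, 1]: 1.0 for a square, lower for elongated shapes.
  // Assumes strictly positive width and height.
  def squareness(r: Rect): Double =
    math.min(r.width, r.height) / math.max(r.width, r.height)

  // Pick the highest-scoring member of a non-empty group,
  // in the same spirit as passing _.maxBy(score) as the selection function.
  def selectRepresentative(group: Seq[Rect], score: Rect => Double): Rect =
    group.maxBy(score)
}
```

Because the score is just a function from a shape to a Double, swapping in a different notion of quality does not change the selection logic.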
- Java 8/11
- Apache Spark 3.5.x
- Scala 2.12.x
- Apache Sedona 1.6.x
- Apache Iceberg 1.9.x
- Maven 3.x
To install GeoLink, clone the repository and build it using Maven:
git clone https://github.com/avinoamn/geolink.git
cd geolink
mvn clean package
- This will build both the core library and example modules.
- The resulting JARs can be found in core/target/ and examples/target/.
See the examples module for ready-to-run demos.
import github.avinoamn.geolink.GeoLink
import github.avinoamn.geolink.examples.models.EntityId
import github.avinoamn.geolink.models.Entity
import github.avinoamn.geolink.examples.utils.ShapeScorer.engineeredScore
import github.avinoamn.geolink.examples.utils.SpatialPredicates.iou
import org.apache.sedona.core.spatialOperator.SpatialPredicate
val geoLink = GeoLink[EntityId](
  SpatialPredicate.INTERSECTS,
  Some(iou(0.5)),
  _.maxBy(engineeredScore)
)
val deduplicatedGroups = geoLink.run(inputEntities) // inputEntities: Dataset[Entity[EntityId]]
import github.avinoamn.geolink.examples.dataStore.GroupsDataStore
val groupsStore = GroupsDataStore(
  warehousePath = "...",
  catalog = "...",
  namespace = "...",
  tableName = "groups"
)
val geoLinkBatch = GeoLink[EntityId](
  SpatialPredicate.INTERSECTS,
  Some(iou(0.5)),
  _.maxBy(engineeredScore)
).Batch(groupsStore)
val batchResult = geoLinkBatch.run(newEntities)
- Spatial Predicate: Use Sedona's predicates (INTERSECTS, CONTAINS, etc.) or a custom predicate (e.g., IoU threshold).
- Scoring Function: Plug in your own geometry scoring for selecting the best group representative.
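To make the idea of a custom IoU predicate concrete, here is a minimal sketch in plain Scala. It assumes axis-aligned boxes rather than real geometries, and Box, IouSketch, and iouAtLeast are hypothetical names; it is not the iou helper from the examples module, only an illustration of what a thresholded intersection-over-union check computes.

```scala
// Hypothetical axis-aligned box; real footprints would be arbitrary geometries.
final case class Box(minX: Double, minY: Double, maxX: Double, maxY: Double) {
  // Area clamps to zero when the box is empty or inverted.
  def area: Double = math.max(0.0, maxX - minX) * math.max(0.0, maxY - minY)
}

object IouSketch {
  // Intersection over union of two boxes, a value in [0, 1].
  def iou(a: Box, b: Box): Double = {
    val inter = Box(
      math.max(a.minX, b.minX), math.max(a.minY, b.minY),
      math.min(a.maxX, b.maxX), math.min(a.maxY, b.maxY)
    ).area
    val union = a.area + b.area - inter
    if (union == 0.0) 0.0 else inter / union
  }

  // Curried threshold predicate: true when the two boxes overlap enough.
  def iouAtLeast(threshold: Double)(a: Box, b: Box): Boolean =
    iou(a, b) >= threshold
}
```

A threshold such as 0.5 keeps pairs that intersect only marginally from being linked, which a bare intersection test would accept.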
- core/: Main library code (models, logic, extensions)
- examples/: Example apps demonstrating usage, including batch and single-pass deduplication
- core/src/main/scala/github/avinoamn/geolink/models/: Entity and group model definitions
- core/src/main/scala/github/avinoamn/geolink/utils/logic/: Core logic (grouping, graph, join, Z-order utilities)
- Convert entities to candidate groups (toGroups)
- Spatial join to find overlapping/similar entities
- Build a graph where nodes are groups, edges are spatial links
- Find connected components (groups of linked entities)
- Aggregate group members and select best geometry per group
- Write, delete, and compact groups in efficient columnar storage (Iceberg)
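The graph and connected-components steps above can be sketched without Spark. GeoLink relies on GraphX for this; the example below swaps in a small union-find in plain Scala purely to show what the step computes. ComponentsSketch and connectedComponents are hypothetical names, and node ids stand in for candidate groups while edges stand in for pairs returned by the spatial join.

```scala
import scala.collection.mutable

object ComponentsSketch {
  // Group node ids into connected components given undirected edges.
  // Every id used in `edges` must also appear in `nodes`.
  def connectedComponents(nodes: Seq[Long], edges: Seq[(Long, Long)]): Map[Long, Set[Long]] = {
    // Each node starts as its own root.
    val parent = mutable.Map[Long, Long](nodes.map(n => n -> n): _*)

    // Find the root of a node, then compress the path to it.
    def find(x: Long): Long = {
      var root = x
      while (parent(root) != root) root = parent(root)
      var cur = x
      while (parent(cur) != root) {
        val next = parent(cur)
        parent(cur) = root
        cur = next
      }
      root
    }

    // Attach the larger root under the smaller one, so each component
    // ends up labelled by its minimum node id.
    edges.foreach { case (a, b) =>
      val (ra, rb) = (find(a), find(b))
      if (ra != rb) parent(math.max(ra, rb)) = math.min(ra, rb)
    }

    nodes.groupBy(find).map { case (root, members) => root -> members.toSet }
  }
}
```

Labelling each component by its smallest id matches the convention GraphX uses for connected components, which makes the output easy to compare against a distributed run on a small sample.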
- Iceberg Integration: See GroupsDataStore.scala for advanced data lake integration (partitioning, compaction, etc.)
- Custom Scoring: See ShapeScorer.scala for engineered shape scores (rectangularity, convexity, right angles)
- Spatial Predicates: See SpatialPredicates.scala for IoU and other custom join logic
MIT License
For questions or contributions, open an issue or pull request!