@lhaibach

This draft builds on #46 and experiments with extending the diagram detection logic in src/identifiers/diagram.py.

The identify_diagram function uses a voting system based on keywords, units, and axis progression checks. The axis detection includes a new axis_checks function, which checks clusters of entries for both monotonicity and numeric progression.

However, it introduces more code without improving the F1 scores compared to the latest version in the other branch. We probably do not want to merge this? But the arithmetic checks might come in handy as additional features for training the tree-based model?

Introduces an axis_checks helper that evaluates clusters for both monotonicity and numeric progression. It updates identify_diagram to use a voting system across:

  • diagram keywords
  • units
  • y-axis and x-axis monotonicity
  • y-axis and x-axis numeric progression (new)
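A minimal sketch of how such a vote could be tallied, for readers who have not opened the diff. The `DiagramSignals` container, the `count_votes` and `is_diagram_by_vote` helpers, and the `min_votes` threshold are all hypothetical names and values chosen for illustration; they are not taken from src/identifiers/diagram.py.

```python
from dataclasses import dataclass, astuple


@dataclass
class DiagramSignals:
    """Hypothetical container: one boolean per signal in the list above."""

    has_keywords: bool
    has_units: bool
    y_monotonic: bool
    x_monotonic: bool
    y_progression: bool
    x_progression: bool


def count_votes(signals: DiagramSignals) -> int:
    """Each signal that fired contributes exactly one vote."""
    return sum(1 for fired in astuple(signals) if fired)


def is_diagram_by_vote(signals: DiagramSignals, min_votes: int = 3) -> bool:
    """Classify as diagram when enough signals agree.

    The threshold of 3 is an assumption for this sketch, not the value
    used in the PR.
    """
    return count_votes(signals) >= min_votes


if __name__ == "__main__":
    page = DiagramSignals(
        has_keywords=True,
        has_units=False,
        y_monotonic=True,
        x_monotonic=False,
        y_progression=True,
        x_progression=False,
    )
    print(count_votes(page), is_diagram_by_vote(page))
```

An equal-weight vote like this is easy to reason about, but it treats a weak signal the same as a strong one, which may be one reason the individual checks are more useful as model features.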

However, F1 scores are not improved compared to the latest implementation in the other branch, and the code increases complexity. We likely do not want to merge this, but the progression logic might still be valuable as features for tree-based models.

For diagram

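| Branch | F1 Score | Precision | Recall |
| --- | --- | --- | --- |
| detect-diagram | 62.22 | 60.87 | 63.64 |
| detect-diagram-draft | 60.47 | 61.90 | 59.09 |

| Metric | text | boreprofile | map | geo_profile | title_page | diagram | table | unknown | Macro Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| F1 Score | 65.00 | 77.77 | 73.07 | 0.00 | 48.78 | 62.22 | 0.00 | 26.47 | 44.16 |
| F1 Score draft | 65.00 | 77.77 | 73.07 | 0.00 | 48.78 | 60.47 | 0.00 | 25.71 | 43.85 |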
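The Macro Avg column appears to be the unweighted mean of the eight per-class F1 scores. The short check below reproduces both values from the numbers reported above; the dictionary and function names are made up for this sketch.

```python
from statistics import mean

# Per-class F1 scores copied from the table above.
F1_BASE = {
    "text": 65.00, "boreprofile": 77.77, "map": 73.07, "geo_profile": 0.00,
    "title_page": 48.78, "diagram": 62.22, "table": 0.00, "unknown": 26.47,
}
F1_DRAFT = {
    "text": 65.00, "boreprofile": 77.77, "map": 73.07, "geo_profile": 0.00,
    "title_page": 48.78, "diagram": 60.47, "table": 0.00, "unknown": 25.71,
}


def macro_average(per_class_f1: dict[str, float]) -> float:
    """Unweighted mean over classes: every class counts equally."""
    return mean(per_class_f1.values())


if __name__ == "__main__":
    print(round(macro_average(F1_BASE), 2))
    print(round(macro_average(F1_DRAFT), 2))
```

Only the diagram and unknown classes differ between the two rows, so the drop in the macro average comes entirely from those two columns.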

@lhaibach lhaibach marked this pull request as draft September 18, 2025 11:35
@lhaibach lhaibach requested a review from Copilot September 18, 2025 11:35
Copilot AI left a comment

Pull Request Overview

This PR experiments with extending diagram detection logic by implementing a voting system that incorporates new axis progression checks alongside existing keyword and unit detection. The changes aim to improve diagram identification by analyzing both monotonicity and numeric progression patterns in clustered data points.

Key changes:

  • Introduces a voting system combining keyword detection, unit detection, and axis analysis
  • Adds new axis_checks function to evaluate monotonicity and numeric progression in data clusters
  • Implements arithmetic and logarithmic progression detection for axis validation
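To make the three checks concrete, here is a rough sketch of what monotonicity, arithmetic progression, and logarithmic progression tests on a cluster of axis values might look like. The function names, the 5% relative tolerance, and the minimum of three values are assumptions for illustration and may differ from the actual axis_checks implementation.

```python
import math
from typing import Sequence


def is_monotonic(values: Sequence[float]) -> bool:
    """True if the values are strictly increasing or strictly decreasing."""
    pairs = list(zip(values, values[1:]))
    if not pairs:
        return False
    increasing = all(a < b for a, b in pairs)
    decreasing = all(a > b for a, b in pairs)
    return increasing or decreasing


def is_arithmetic(values: Sequence[float], rel_tol: float = 0.05) -> bool:
    """True if consecutive differences are (nearly) constant and non-zero."""
    if len(values) < 3:
        return False
    diffs = [b - a for a, b in zip(values, values[1:])]
    step = diffs[0]
    if step == 0:
        return False
    return all(math.isclose(d, step, rel_tol=rel_tol) for d in diffs)


def is_logarithmic(values: Sequence[float], rel_tol: float = 0.05) -> bool:
    """True if the values form a geometric series, i.e. their logs are arithmetic."""
    if len(values) < 3 or any(v <= 0 for v in values):
        return False
    return is_arithmetic([math.log10(v) for v in values], rel_tol=rel_tol)


if __name__ == "__main__":
    print(is_monotonic([30, 20, 10]))
    print(is_arithmetic([0, 10, 20, 30]))
    print(is_logarithmic([1, 10, 100, 1000]))
```

Reducing the logarithmic case to the arithmetic one keeps a single tolerance parameter; each boolean could also be emitted directly as a feature column rather than feeding a vote.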

@lhaibach lhaibach requested review from TicaGit and letao September 23, 2025 12:32
Base automatically changed from detect-tables to develop September 24, 2025 11:41