PathHierarchyTokenizer to support multiple delimiters #15196

@dadoonet

Description

As I reported at elastic/elasticsearch#133989, I'd love to have a way to support multiple delimiters for the Path Hierarchy Tokenizer.

Currently, it only supports a single character for the delimiter parameter (default is /). This makes it difficult to tokenize both Windows (\\) and Linux (/) paths in the same index. Supporting multiple delimiters (such as both / and \\) would make the tokenizer much easier to use for systems that handle cross-platform file paths.

For example, a user may need to index file paths from both Windows and Linux environments and expect the analysis to work the same way regardless of path format. At the moment, the only workaround I have found is to preprocess the data to normalize delimiters, which adds extra complexity.
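
To make the workaround concrete, here is a minimal sketch of the kind of preprocessing step described above. The class and method names are hypothetical (they are not part of Lucene or Elasticsearch), and the sketch assumes that every backslash in the input is a path separator rather than an escape character.

```java
public final class PathNormalizer {
    private PathNormalizer() {}

    /**
     * Hypothetical helper: rewrites Windows-style separators to '/'
     * so a single-delimiter tokenizer can handle both path styles.
     * Assumes backslashes never appear as anything but separators.
     */
    public static String normalize(String path) {
        return path.replace('\\', '/');
    }

    public static void main(String[] args) {
        // prints C:/Users/david/file.txt
        System.out.println(normalize("C:\\Users\\david\\file.txt"));
        // prints /home/david/file.txt (already normalized, unchanged)
        System.out.println(normalize("/home/david/file.txt"));
    }
}
```

The drawback is that the indexed tokens no longer reflect the original separators, and every client that indexes or queries has to apply the same normalization.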

Feature Request: Allow the path_hierarchy tokenizer to accept multiple delimiter patterns (e.g., an array of delimiters) so both / and \\ can be handled simultaneously.
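
As an illustration of the requested behavior, the sketch below emits one token per path prefix and treats any character in a set of delimiters as a separator. It is plain Java and does not use the Lucene API; the class name, the method name, and the choice to keep the original separators in the emitted tokens are all assumptions made for the example. Edge cases such as repeated or trailing delimiters are not handled the way the real tokenizer might handle them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public final class MultiDelimiterPathSketch {
    private MultiDelimiterPathSketch() {}

    /**
     * Hypothetical illustration: returns every prefix of the path that
     * ends just before a delimiter, followed by the full path itself.
     */
    public static List<String> tokenize(String path, Set<Character> delimiters) {
        List<String> tokens = new ArrayList<>();
        // Start at 1 so a leading delimiter does not yield an empty token.
        for (int i = 1; i < path.length(); i++) {
            if (delimiters.contains(path.charAt(i))) {
                tokens.add(path.substring(0, i));
            }
        }
        if (!path.isEmpty()) {
            tokens.add(path);
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<Character> delimiters = Set.of('/', '\\');
        // prints [C:, C:\Users, C:\Users\david]
        System.out.println(tokenize("C:\\Users\\david", delimiters));
        // prints [/home, /home/david]
        System.out.println(tokenize("/home/david", delimiters));
    }
}
```

With a single delimiter in the set, the output for /home/david matches what the existing tokenizer is documented to produce for a simple absolute path, so the multi-delimiter case can be read as a straightforward generalization.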

Another possible implementation would be to create a new PathsHierarchyTokenizer (note the s) which implements this behavior.

Before working on such a PR, I'd like to get your views on this proposal. Maybe I'm just wrong to attempt this.
