Description
As I reported at elastic/elasticsearch#133989, I'd love to have a way to support multiple delimiters for the Path Hierarchy Tokenizer.
Currently, it only supports a single character for the `delimiter` parameter (default is `/`). This makes it difficult to tokenize both Windows (`\\`) and Linux (`/`) paths efficiently in the same index. Supporting multiple delimiters (such as both `/` and `\\`) would greatly improve usability for systems dealing with cross-platform file paths.
For example, a user may need to index file paths from both Windows and Linux environments and expects the analysis to work seamlessly regardless of path format. At the moment, the only workaround I found is to preprocess the data to normalize delimiters, which adds extra complexity.
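To make the workaround concrete, here is a minimal sketch of the client-side normalization I mean, written in Python. The function name `normalize_path` is hypothetical, and the sketch assumes every backslash in the value is a path separator, which is not true for all inputs.

```python
def normalize_path(path: str) -> str:
    """Rewrite Windows-style separators as '/' before indexing.

    Assumption: every backslash in the input is a path separator.
    """
    return path.replace("\\", "/")


if __name__ == "__main__":
    # A Windows path and a Linux path end up with the same separator.
    print(normalize_path("C:\\Users\\me\\notes.txt"))
    print(normalize_path("/home/me/notes.txt"))
```

The drawback is that the indexed value no longer matches the original path, and every client that writes to the index has to apply the same step.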
Feature Request: Allow the `path_hierarchy` tokenizer to accept multiple delimiter patterns (e.g., an array of delimiters) so both `/` and `\\` can be handled simultaneously.
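As a rough illustration of the output I would expect, here is a simplified Python emulation of a path hierarchy tokenizer that accepts several delimiters. It is not the Lucene implementation: the function name and the `delimiters` argument are hypothetical, and edge cases such as trailing or repeated delimiters are not meant to mirror the real tokenizer.

```python
from typing import Iterable, List


def path_hierarchy_tokens(path: str, delimiters: Iterable[str] = ("/", "\\")) -> List[str]:
    """Emit one token per ancestor of `path`, splitting on any delimiter.

    Simplified sketch of the requested behavior, not the real tokenizer.
    """
    delimiter_set = set(delimiters)
    tokens: List[str] = []
    for index, char in enumerate(path):
        # Each delimiter (except a leading one) closes a prefix token.
        if char in delimiter_set and index > 0:
            tokens.append(path[:index])
    # The full path is always the last token.
    if path and (not tokens or tokens[-1] != path):
        tokens.append(path)
    return tokens


if __name__ == "__main__":
    print(path_hierarchy_tokens("/var/log/app.log"))
    print(path_hierarchy_tokens("C:\\Users\\me"))
```

With this behavior, the original separators are preserved in the tokens, so no normalization of the stored value is needed.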
Another possible implementation would be to create a new `PathsHierarchyTokenizer` (note the `s`) which implements this behavior.
Before working on such a PR, I'd like to get your views on this proposal. Maybe I'm simply wrong to attempt this.