Lucene Analyzers


A comprehensive Java library providing extended Lucene analyzers with specialized support for Japanese and English text processing. Built on Apache Lucene 10.2.2 with Java 21 compatibility.

πŸš€ Features

Japanese Text Analysis

  • ProlongedSoundMarkCharFilter: Normalizes prolonged sound marks (γƒΌ) in Japanese text
  • KanjiNumberFilter: Converts kanji numerals to Arabic numerals
  • CharTypeFilter: Filters tokens based on Japanese character types
  • IterationMarkCharFilter: Handles Japanese iteration marks (γ€…, ヽ, γƒΎ)
  • PosConcatenationFilter: Concatenates tokens based on part-of-speech tags
  • PatternConcatenationFilter: Pattern-based token concatenation
  • NumberConcatenationFilter: Specialized number concatenation
  • StopTokenPrefixFilter & StopTokenSuffixFilter: Advanced stop word filtering
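To illustrate what KanjiNumberFilter does conceptually, here is a minimal plain-Java sketch of kanji-to-Arabic numeral conversion. The class and method names are hypothetical; the real filter operates on Lucene token streams and handles more numeral forms:

```java
import java.util.Map;

public class KanjiDigits {
    // Digit values for basic kanji numerals (illustrative only, not the library's API)
    private static final Map<Character, Integer> DIGITS = Map.of(
            '一', 1, '二', 2, '三', 3, 'ε››', 4, 'δΊ”', 5,
            'ε…­', 6, '七', 7, 'ε…«', 8, 'δΉ', 9);
    private static final Map<Character, Integer> MULTIPLIERS = Map.of(
            '十', 10, '百', 100, '千', 1000, 'δΈ‡', 10000);

    // Convert a simple kanji numeral such as "δΊŒη™ΎδΈ‰εε››" to 234.
    public static int toArabic(String kanji) {
        int total = 0;
        int current = 0;
        for (char c : kanji.toCharArray()) {
            if (DIGITS.containsKey(c)) {
                current = DIGITS.get(c);
            } else if (MULTIPLIERS.containsKey(c)) {
                int m = MULTIPLIERS.get(c);
                if (m == 10000) {
                    // δΈ‡ scales everything accumulated so far
                    total = (total + (current == 0 ? 1 : current)) * m;
                } else {
                    // A bare multiplier (e.g. "千") counts as one unit
                    total += (current == 0 ? 1 : current) * m;
                }
                current = 0;
            }
        }
        return total + current;
    }

    public static void main(String[] args) {
        System.out.println(toArabic("δΊŒη™ΎδΈ‰εε››")); // 234
    }
}
```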

English Text Analysis

  • ReloadableStopFilter: Dynamic stop word filtering with reload capability
  • FlexiblePorterStemFilter: Enhanced Porter stemming with flexibility controls
  • ReloadableKeywordMarkerFilter: Dynamic keyword protection during stemming
  • AlphaNumWordFilter: Alphanumeric word processing

Base Components

  • StopTokenFilter: Core stop word filtering functionality
  • ConcatenationFilter: Token concatenation base class
  • BufferedCharFilter: Character-level filtering with buffering
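As a rough illustration of character-level filtering (the idea behind BufferedCharFilter and the char filters above), the sketch below wraps a plain `java.io.Reader` and rewrites characters on the fly. It is a simplified stand-in assuming a 1:1 character mapping; the library's CharFilter subclasses are built on Lucene's CharFilter and additionally track offset corrections:

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Illustrative character-level filter in plain java.io;
// not the library's BufferedCharFilter.
public class FullWidthDigitReader extends FilterReader {
    public FullWidthDigitReader(Reader in) {
        super(in);
    }

    // Map full-width digits οΌο½žοΌ™ (U+FF10..U+FF19) to ASCII 0-9
    private static char normalize(char c) {
        if (c >= 0xFF10 && c <= 0xFF19) {
            return (char) (c - 0xFF10 + '0');
        }
        return c;
    }

    @Override
    public int read() throws IOException {
        int c = in.read();
        return c == -1 ? -1 : normalize((char) c);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = in.read(cbuf, off, len);
        for (int i = off; i < off + n; i++) {
            cbuf[i] = normalize(cbuf[i]);
        }
        return n;
    }
}
```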

πŸ“¦ Installation

Maven

<dependency>
    <groupId>org.codelibs</groupId>
    <artifactId>analyzers</artifactId>
    <version>10.2.2.0</version>
</dependency>

Gradle

implementation 'org.codelibs:analyzers:10.2.2.0'

πŸ—οΈ Requirements

  • Java: 21 or higher
  • Apache Lucene: 10.2.2
  • Maven: 3.6+ (for building from source)

πŸ”§ Usage Examples

Japanese Text Processing

// Create analyzer with Japanese filters
Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();
        
        // Convert kanji numerals to Arabic numerals, then keep only
        // tokens of the desired character type
        TokenStream stream = tokenizer;
        stream = new KanjiNumberFilter(stream);
        stream = new CharTypeFilter(stream, CharType.KATAKANA);
        
        return new TokenStreamComponents(tokenizer, stream);
    }
    
    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // Normalize prolonged sound marks and expand iteration marks
        // before the text reaches the tokenizer
        reader = new ProlongedSoundMarkCharFilter(reader);
        reader = new IterationMarkCharFilter(reader);
        return reader;
    }
};
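The IterationMarkCharFilter applied in initReader above expands iteration marks so that, for example, ζ™‚γ€… is tokenized as ζ™‚ζ™‚. A minimal plain-string sketch of that expansion (handling only γ€…; the real char filter also covers ヽ and γƒΎ) might look like:

```java
// Conceptual sketch of iteration-mark expansion;
// not the library's implementation, which works at the Reader level.
public class IterationMarks {
    public static String expand(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == 'γ€…' && sb.length() > 0) {
                // Replace the iteration mark with the preceding character
                sb.append(sb.charAt(sb.length() - 1));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("ζ™‚γ€…")); // ζ™‚ζ™‚
        System.out.println(expand("δΊΊγ€…γ¨")); // δΊΊδΊΊγ¨
    }
}
```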

English Text Processing

// Create analyzer with English filters
Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();
        
        TokenStream stream = tokenizer;
        // stopWords: the stop-word set loaded from your word list
        stream = new ReloadableStopFilter(stream, stopWords);
        stream = new FlexiblePorterStemFilter(stream);
        stream = new AlphaNumWordFilter(stream);
        
        return new TokenStreamComponents(tokenizer, stream);
    }
};
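ReloadableStopFilter's distinguishing feature is that the stop-word set can be refreshed without rebuilding the analyzer. The sketch below shows one way such reloading can work: re-read the stop-word file when its timestamp changes. This is a simplified plain-Java stand-in, not the library's actual implementation, which operates on Lucene token streams:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;

// Conceptual sketch of the "reloadable" idea; class name is illustrative only.
public class ReloadableStopWords {
    private final Path file;
    private long lastModified = -1;
    private Set<String> words = Set.of();

    public ReloadableStopWords(Path file) {
        this.file = file;
    }

    public boolean isStopWord(String token) throws IOException {
        long mtime = Files.getLastModifiedTime(file).toMillis();
        if (mtime != lastModified) {
            // File changed on disk: reload the stop-word set
            words = Set.copyOf(Files.readAllLines(file));
            lastModified = mtime;
        }
        return words.contains(token);
    }
}
```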

πŸ›οΈ Architecture

The library is organized into language-specific packages:

org.codelibs.analysis/
β”œβ”€β”€ Base classes (StopTokenFilter, ConcatenationFilter, BufferedCharFilter)
β”œβ”€β”€ ja/ - Japanese-specific analyzers
β”‚   β”œβ”€β”€ Character filters (ProlongedSoundMark, IterationMark)
β”‚   β”œβ”€β”€ Token filters (KanjiNumber, CharType, various concatenation)
β”‚   └── Stop word filters (Prefix/Suffix variants)
└── en/ - English-specific analyzers
    β”œβ”€β”€ Stop word filtering (Reloadable variants)
    β”œβ”€β”€ Stemming (FlexiblePorter)
    └── Word processing (AlphaNumWord)

πŸ”¨ Building from Source

# Clone the repository
git clone https://github.com/codelibs/analyzers.git
cd analyzers

# Build with Maven
mvn clean compile

# Run tests
mvn test

# Create package
mvn clean package

Development Commands

# Format code (required before commits)
mvn formatter:format

# Apply license headers
mvn license:format

# Generate coverage report
mvn jacoco:report

# Run specific test class
mvn test -Dtest=KanjiNumberFilterTest

πŸ§ͺ Testing

The library uses Lucene's comprehensive testing framework:

  • All tests extend BaseTokenStreamTestCase
  • JUnit 4 for test structure
  • Lucene test utilities for token stream validation
  • Comprehensive coverage of analyzer behavior

πŸ“„ License

Licensed under the Apache License, Version 2.0. See LICENSE for details.

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Run mvn formatter:format and mvn license:format
  4. Ensure all tests pass with mvn clean package
  5. Submit a pull request

πŸ“š Documentation

See the usage examples above and the project source for API details.

πŸ› Issues & Support

Report bugs and feature requests through the project's GitHub issue tracker.

πŸ“ˆ Version History

The version number follows Lucene's versioning scheme with an additional patch level:

  • Format: {lucene.version}.{patch}
  • Current: 10.2.2.0 (based on Lucene 10.2.2)
