A comprehensive Java library providing extended Lucene analyzers with specialized support for Japanese and English text processing. Built on Apache Lucene 10.2.2 with Java 21 compatibility.
- ProlongedSoundMarkCharFilter: Normalizes prolonged sound marks (ー) in Japanese text
- KanjiNumberFilter: Converts kanji numerals to Arabic numerals
- CharTypeFilter: Filters tokens based on Japanese character types
- IterationMarkCharFilter: Handles Japanese iteration marks (々, ヽ, ヾ)
- PosConcatenationFilter: Concatenates tokens based on part-of-speech tags
- PatternConcatenationFilter: Pattern-based token concatenation
- NumberConcatenationFilter: Specialized number concatenation
- StopTokenPrefixFilter & StopTokenSuffixFilter: Advanced stop word filtering
- ReloadableStopFilter: Dynamic stop word filtering with reload capability
- FlexiblePorterStemFilter: Enhanced Porter stemming with flexibility controls
- ReloadableKeywordMarkerFilter: Dynamic keyword protection during stemming
- AlphaNumWordFilter: Alphanumeric word processing
- StopTokenFilter: Core stop word filtering functionality
- ConcatenationFilter: Token concatenation base class
- BufferedCharFilter: Character-level filtering with buffering
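To illustrate the kind of conversion KanjiNumberFilter performs, here is a minimal, standalone sketch of kanji-numeral parsing. This is not the library's implementation (the real filter operates on Lucene token streams); the class name, method name, and supported range (up to the 万 place) are assumptions for illustration.

```java
public class KanjiNumeralSketch {
    // Parse simple kanji numerals into a long value.
    // Illustrative only; not the KanjiNumberFilter implementation.
    static long kanjiToNumber(String s) {
        final String digits = "〇一二三四五六七八九";
        long total = 0, section = 0, current = 0;
        for (char c : s.toCharArray()) {
            int d = digits.indexOf(c);
            if (d >= 0) {
                current = d;                 // plain digit, e.g. 三 -> 3
            } else if (c == '十') {
                section += (current == 0 ? 1 : current) * 10;
                current = 0;
            } else if (c == '百') {
                section += (current == 0 ? 1 : current) * 100;
                current = 0;
            } else if (c == '千') {
                section += (current == 0 ? 1 : current) * 1000;
                current = 0;
            } else if (c == '万') {
                total += (section + current) * 10000;
                section = 0;
                current = 0;
            }
        }
        return total + section + current;
    }

    public static void main(String[] args) {
        System.out.println(kanjiToNumber("二十三"));   // 23
        System.out.println(kanjiToNumber("一万二千")); // 12000
    }
}
```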
Maven:

```xml
<dependency>
    <groupId>org.codelibs</groupId>
    <artifactId>analyzers</artifactId>
    <version>10.2.2.0</version>
</dependency>
```

Gradle:

```groovy
implementation 'org.codelibs:analyzers:10.2.2.0'
```
- Java: 21 or higher
- Apache Lucene: 10.2.2
- Maven: 3.6+ (for building from source)
```java
// Create an analyzer with Japanese filters
Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();
        TokenStream stream = tokenizer;
        // Convert kanji numerals to Arabic numerals
        stream = new KanjiNumberFilter(stream);
        // Filter tokens by character type (katakana)
        stream = new CharTypeFilter(stream, CharType.KATAKANA);
        return new TokenStreamComponents(tokenizer, stream);
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // Apply character-level normalization before tokenization:
        // prolonged sound marks, then iteration marks
        reader = new ProlongedSoundMarkCharFilter(reader);
        reader = new IterationMarkCharFilter(reader);
        return reader;
    }
};
```
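To show the idea behind the prolonged sound mark normalization applied in `initReader` above, here is a hedged, standalone sketch of the concept: dash-like characters that follow kana are replaced with the canonical prolonged sound mark U+30FC. The class name, the exact set of dash-like characters, and the kana check are assumptions for illustration, not the library's actual mapping.

```java
public class ProlongedSoundMarkSketch {
    // Characters commonly typed in place of the prolonged sound mark.
    // This set is an assumption for illustration.
    private static final String DASH_LIKE = "-\u2010\u2013\u2014\u2212\uFF70";

    // Replace dash-like characters with U+30FC (KATAKANA-HIRAGANA
    // PROLONGED SOUND MARK) when the preceding character is kana.
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (DASH_LIKE.indexOf(c) >= 0 && i > 0 && isKana(text.charAt(i - 1))) {
                sb.append('\u30FC');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Hiragana (U+3040-U+309F) or katakana (U+30A0-U+30FF)
    private static boolean isKana(char c) {
        return c >= '\u3040' && c <= '\u30FF';
    }

    public static void main(String[] args) {
        System.out.println(normalize("ス-パ-")); // スーパー
        System.out.println(normalize("e-mail")); // e-mail (unchanged)
    }
}
```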
```java
// Create an analyzer with English filters
Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();
        TokenStream stream = tokenizer;
        // stopWords is a stop word set defined elsewhere
        stream = new ReloadableStopFilter(stream, stopWords);
        stream = new FlexiblePorterStemFilter(stream);
        stream = new AlphaNumWordFilter(stream);
        return new TokenStreamComponents(tokenizer, stream);
    }
};
```
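The "reloadable" filters follow a common pattern: keep the stop word set in memory and re-read it from a file only when the file's modification time changes. Below is a minimal standalone sketch of that pattern; the class and method names are hypothetical and this is not the ReloadableStopFilter API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class ReloadableStopWordsSketch {
    private final Path file;
    private long lastModified = -1;
    private Set<String> stopWords = new HashSet<>();

    public ReloadableStopWordsSketch(Path file) throws IOException {
        this.file = file;
        reloadIfChanged();
    }

    // Re-read the stop word file only when its timestamp changes.
    public synchronized void reloadIfChanged() throws IOException {
        long modified = Files.getLastModifiedTime(file).toMillis();
        if (modified != lastModified) {
            Set<String> words = new HashSet<>();
            for (String line : Files.readAllLines(file)) {
                line = line.trim();
                if (!line.isEmpty()) {
                    words.add(line.toLowerCase(Locale.ROOT));
                }
            }
            stopWords = words;          // swap in the new set atomically
            lastModified = modified;
        }
    }

    public boolean isStopWord(String token) {
        return stopWords.contains(token.toLowerCase(Locale.ROOT));
    }
}
```

A caller would invoke `reloadIfChanged()` periodically (or before each batch) so that edits to the stop word file take effect without restarting the application.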
The library is organized into language-specific packages:
```
org.codelibs.analysis/
├── Base classes (StopTokenFilter, ConcatenationFilter, BufferedCharFilter)
├── ja/ - Japanese-specific analyzers
│   ├── Character filters (ProlongedSoundMark, IterationMark)
│   ├── Token filters (KanjiNumber, CharType, various concatenation)
│   └── Stop word filters (Prefix/Suffix variants)
└── en/ - English-specific analyzers
    ├── Stop word filtering (Reloadable variants)
    ├── Stemming (FlexiblePorter)
    └── Word processing (AlphaNumWord)
```
```bash
# Clone the repository
git clone https://github.com/codelibs/analyzers.git
cd analyzers

# Build with Maven
mvn clean compile

# Run tests
mvn test

# Create package
mvn clean package

# Format code (required before commits)
mvn formatter:format

# Apply license headers
mvn license:format

# Generate coverage report
mvn jacoco:report

# Run a specific test class
mvn test -Dtest=KanjiNumberFilterTest
```
The library uses Lucene's testing framework:
- All tests extend `BaseTokenStreamTestCase`
- JUnit 4 for test structure
- Lucene test utilities for token stream validation
- Comprehensive coverage of analyzer behavior
Licensed under the Apache License, Version 2.0. See LICENSE for details.
- Fork the repository
- Create a feature branch
- Run `mvn formatter:format` and `mvn license:format`
- Ensure all tests pass with `mvn clean package`
- Submit a pull request
The version number follows Lucene's versioning scheme with an additional patch level:
- Format: `{lucene.version}.{patch}`
- Current: `10.2.2.0` (based on Lucene 10.2.2)
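The scheme above means the underlying Lucene version can be recovered mechanically by dropping the final patch component. A tiny sketch (the helper name is hypothetical):

```java
public class VersionSketch {
    // Drop the trailing patch component of an artifact version like
    // "10.2.2.0" to recover the Lucene version "10.2.2".
    static String luceneVersionOf(String artifactVersion) {
        return artifactVersion.substring(0, artifactVersion.lastIndexOf('.'));
    }

    public static void main(String[] args) {
        System.out.println(luceneVersionOf("10.2.2.0")); // 10.2.2
    }
}
```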