Lucene Analyzers


A comprehensive Java library providing extended Lucene analyzers with specialized support for Japanese and English text processing. Built on Apache Lucene 10.2.2 with Java 21 compatibility.

πŸš€ Features

Japanese Text Analysis

  • ProlongedSoundMarkCharFilter: Normalizes prolonged sound marks (γƒΌ) in Japanese text
  • KanjiNumberFilter: Converts kanji numerals to Arabic numerals
  • CharTypeFilter: Filters tokens based on Japanese character types
  • IterationMarkCharFilter: Handles Japanese iteration marks (γ€…, ヽ, γƒΎ)
  • PosConcatenationFilter: Concatenates tokens based on part-of-speech tags
  • PatternConcatenationFilter: Pattern-based token concatenation
  • NumberConcatenationFilter: Specialized number concatenation
  • StopTokenPrefixFilter & StopTokenSuffixFilter: Advanced stop word filtering
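To illustrate what KanjiNumberFilter does conceptually, here is a minimal plain-Java sketch of kanji-to-Arabic numeral conversion. The class and method names are hypothetical; the real filter operates on Lucene token streams and handles more numeral forms:

```java
import java.util.Map;

public class KanjiDigits {
    // Digit values for basic kanji numerals (illustrative only, not the library's API)
    private static final Map<Character, Integer> DIGITS = Map.of(
            '一', 1, '二', 2, '三', 3, 'ε››', 4, 'δΊ”', 5,
            'ε…­', 6, '七', 7, 'ε…«', 8, 'δΉ', 9);
    private static final Map<Character, Integer> MULTIPLIERS = Map.of(
            '十', 10, '百', 100, '千', 1000, 'δΈ‡', 10000);

    // Convert a simple kanji numeral such as "δΊŒη™ΎδΈ‰εε››" to 234.
    public static int toArabic(String kanji) {
        int total = 0;
        int current = 0;
        for (char c : kanji.toCharArray()) {
            if (DIGITS.containsKey(c)) {
                current = DIGITS.get(c);
            } else if (MULTIPLIERS.containsKey(c)) {
                int m = MULTIPLIERS.get(c);
                if (m == 10000) {
                    // δΈ‡ scales everything accumulated so far
                    total = (total + (current == 0 ? 1 : current)) * m;
                } else {
                    // A bare multiplier (e.g. "千") counts as one unit
                    total += (current == 0 ? 1 : current) * m;
                }
                current = 0;
            }
        }
        return total + current;
    }

    public static void main(String[] args) {
        System.out.println(toArabic("δΊŒη™ΎδΈ‰εε››")); // 234
    }
}
```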

English Text Analysis

  • ReloadableStopFilter: Dynamic stop word filtering with reload capability
  • FlexiblePorterStemFilter: Enhanced Porter stemming with flexibility controls
  • ReloadableKeywordMarkerFilter: Dynamic keyword protection during stemming
  • AlphaNumWordFilter: Alphanumeric word processing

Base Components

  • StopTokenFilter: Core stop word filtering functionality
  • ConcatenationFilter: Token concatenation base class
  • BufferedCharFilter: Character-level filtering with buffering
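As a rough illustration of character-level filtering (the idea behind BufferedCharFilter and the char filters above), the sketch below wraps a plain `java.io.Reader` and rewrites characters on the fly. It is a simplified stand-in assuming a 1:1 character mapping; the library's CharFilter subclasses are built on Lucene's CharFilter and additionally track offset corrections:

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Illustrative character-level filter in plain java.io;
// not the library's BufferedCharFilter.
public class FullWidthDigitReader extends FilterReader {
    public FullWidthDigitReader(Reader in) {
        super(in);
    }

    // Map full-width digits οΌο½žοΌ™ (U+FF10..U+FF19) to ASCII 0-9
    private static char normalize(char c) {
        if (c >= 0xFF10 && c <= 0xFF19) {
            return (char) (c - 0xFF10 + '0');
        }
        return c;
    }

    @Override
    public int read() throws IOException {
        int c = in.read();
        return c == -1 ? -1 : normalize((char) c);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = in.read(cbuf, off, len);
        for (int i = off; i < off + n; i++) {
            cbuf[i] = normalize(cbuf[i]);
        }
        return n;
    }
}
```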

πŸ“¦ Installation

Maven

<dependency>
    <groupId>org.codelibs</groupId>
    <artifactId>analyzers</artifactId>
    <version>10.2.2.0</version>
</dependency>

Gradle

implementation 'org.codelibs:analyzers:10.2.2.0'

πŸ—οΈ Requirements

  • Java: 21 or higher
  • Apache Lucene: 10.2.2
  • Maven: 3.6+ (for building from source)

πŸ”§ Usage Examples

Japanese Text Processing

// Create analyzer with Japanese filters
Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();
        
        // Convert kanji numerals to Arabic numerals, then keep only
        // tokens of the desired character type
        TokenStream stream = tokenizer;
        stream = new KanjiNumberFilter(stream);
        stream = new CharTypeFilter(stream, CharType.KATAKANA);
        
        return new TokenStreamComponents(tokenizer, stream);
    }
    
    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // Normalize prolonged sound marks and expand iteration marks
        // before the text reaches the tokenizer
        reader = new ProlongedSoundMarkCharFilter(reader);
        reader = new IterationMarkCharFilter(reader);
        return reader;
    }
};
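The IterationMarkCharFilter applied in initReader above expands iteration marks so that, for example, ζ™‚γ€… is tokenized as ζ™‚ζ™‚. A minimal plain-string sketch of that expansion (handling only γ€…; the real char filter also covers ヽ and γƒΎ) might look like:

```java
// Conceptual sketch of iteration-mark expansion;
// not the library's implementation, which works at the Reader level.
public class IterationMarks {
    public static String expand(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == 'γ€…' && sb.length() > 0) {
                // Replace the iteration mark with the preceding character
                sb.append(sb.charAt(sb.length() - 1));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("ζ™‚γ€…")); // ζ™‚ζ™‚
        System.out.println(expand("δΊΊγ€…γ¨")); // δΊΊδΊΊγ¨
    }
}
```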

English Text Processing

// Create analyzer with English filters
Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();
        
        TokenStream stream = tokenizer;
        // stopWords: the stop-word set loaded from your word list
        stream = new ReloadableStopFilter(stream, stopWords);
        stream = new FlexiblePorterStemFilter(stream);
        stream = new AlphaNumWordFilter(stream);
        
        return new TokenStreamComponents(tokenizer, stream);
    }
};
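ReloadableStopFilter's distinguishing feature is that the stop-word set can be refreshed without rebuilding the analyzer. The sketch below shows one way such reloading can work: re-read the stop-word file when its timestamp changes. This is a simplified plain-Java stand-in, not the library's actual implementation, which operates on Lucene token streams:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;

// Conceptual sketch of the "reloadable" idea; class name is illustrative only.
public class ReloadableStopWords {
    private final Path file;
    private long lastModified = -1;
    private Set<String> words = Set.of();

    public ReloadableStopWords(Path file) {
        this.file = file;
    }

    public boolean isStopWord(String token) throws IOException {
        long mtime = Files.getLastModifiedTime(file).toMillis();
        if (mtime != lastModified) {
            // File changed on disk: reload the stop-word set
            words = Set.copyOf(Files.readAllLines(file));
            lastModified = mtime;
        }
        return words.contains(token);
    }
}
```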

πŸ›οΈ Architecture

The library is organized into language-specific packages:

org.codelibs.analysis/
β”œβ”€β”€ Base classes (StopTokenFilter, ConcatenationFilter, BufferedCharFilter)
β”œβ”€β”€ ja/ - Japanese-specific analyzers
β”‚   β”œβ”€β”€ Character filters (ProlongedSoundMark, IterationMark)
β”‚   β”œβ”€β”€ Token filters (KanjiNumber, CharType, various concatenation)
β”‚   └── Stop word filters (Prefix/Suffix variants)
└── en/ - English-specific analyzers
    β”œβ”€β”€ Stop word filtering (Reloadable variants)
    β”œβ”€β”€ Stemming (FlexiblePorter)
    └── Word processing (AlphaNumWord)

πŸ”¨ Building from Source

# Clone the repository
git clone https://github.com/codelibs/analyzers.git
cd analyzers

# Build with Maven
mvn clean compile

# Run tests
mvn test

# Create package
mvn clean package

Development Commands

# Format code (required before commits)
mvn formatter:format

# Apply license headers
mvn license:format

# Generate coverage report
mvn jacoco:report

# Run specific test class
mvn test -Dtest=KanjiNumberFilterTest

πŸ§ͺ Testing

The library uses Lucene's comprehensive testing framework:

  • All tests extend BaseTokenStreamTestCase
  • JUnit 4 for test structure
  • Lucene test utilities for token stream validation
  • Comprehensive coverage of analyzer behavior

πŸ“„ License

Licensed under the Apache License, Version 2.0. See LICENSE for details.

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Run mvn formatter:format and mvn license:format
  4. Ensure all tests pass with mvn clean package
  5. Submit a pull request

πŸ“š Documentation

See the usage examples above and the project source for API details.

πŸ› Issues & Support

Report bugs and feature requests through the project's GitHub issue tracker.

πŸ“ˆ Version History

The version number follows Lucene's versioning scheme with an additional patch level:

  • Format: {lucene.version}.{patch}
  • Current: 10.2.2.0 (based on Lucene 10.2.2)
