A Python tool for detecting offensive Persian (Farsi) text using a hybrid of rule-based and machine learning (ML) detection.
- Hybrid Detection: Combines rule-based and ML-based detection so each can catch cases the other misses
- Confidence Scores: Provides confidence levels for predictions
- Persian Language Support: Handles Persian text preprocessing and normalization
- CLI Interface: Simple command-line interface for quick testing
- Model Persistence: Save and load trained models for fast deployment
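As an illustration of what Persian preprocessing and normalization typically involves, here is a minimal sketch. It is not taken from swear_detector.py; the function name `normalize_persian` and the exact set of rules are assumptions. It maps Arabic code points that often stand in for Persian letters, strips diacritics and tatweel, and collapses whitespace.

```python
import re
import unicodedata

# Arabic code points that commonly appear in place of Persian letters.
_CHAR_MAP = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0649": "\u06cc",  # alef maksura -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian kaf
})
# Short-vowel marks (U+064B..U+0652) and tatweel (U+0640).
_DIACRITICS = re.compile(r"[\u064b-\u0652\u0640]")
_WHITESPACE = re.compile(r"\s+")


def normalize_persian(text: str) -> str:
    """Hypothetical normalizer; the project's real rules may differ."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_CHAR_MAP)
    text = _DIACRITICS.sub("", text)
    # The zero-width non-joiner (U+200C) is not whitespace, so it is kept.
    return _WHITESPACE.sub(" ", text).strip()
```

Normalizing before both the rule lookup and the ML vectorizer keeps the two detectors looking at the same spelling of each word.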
├── swear_detector.py # Main detector script
├── requirements.txt # Python dependencies
├── dataset/
│ └── dataset.json # Labeled dataset for training (Offensive/Normal)
├── models/
│ └── model.pkl # Trained ML model
├── Dockerfile # Docker support
├── docker-compose.yml # Docker Compose config
└── README.md # Documentation
- Clone the repository
- Install dependencies:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Run the detector:
python swear_detector.py
Each prediction includes:
- Original text
- Final prediction (Offensive/Normal)
- Confidence score
- ML confidence score (for offensive predictions)
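The fields above could be modeled as a small record. This is only a sketch: the class name `Prediction`, its attribute names, and the printed layout are hypothetical and may not match what the script actually outputs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prediction:
    """Hypothetical container for one prediction."""
    text: str                              # original text
    label: str                             # "Offensive" or "Normal"
    confidence: float                      # confidence in the final label
    ml_confidence: Optional[float] = None  # set for offensive predictions

    def summary(self) -> str:
        line = f"{self.label} ({self.confidence:.2f}): {self.text}"
        if self.ml_confidence is not None:
            line += f" [ML: {self.ml_confidence:.2f}]"
        return line
```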
The project uses a labeled dataset (dataset.json) containing:
- Offensive texts: Inappropriate or offensive content
- Normal texts: Regular, non-offensive content
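The schema of dataset.json is not documented here. Purely as an assumption, the sketch below treats it as a JSON list of objects with `text` and `label` keys and converts the labels to 0/1 for training. The sample records are placeholders, and `load_dataset` is a hypothetical helper.

```python
import json

# Placeholder records in an ASSUMED schema; the real file may differ.
SAMPLE_JSON = """[
  {"text": "sample normal sentence", "label": "Normal"},
  {"text": "sample offensive sentence", "label": "Offensive"}
]"""

VALID_LABELS = {"Offensive", "Normal"}


def load_dataset(raw_json: str):
    """Return (texts, labels) with labels encoded as 1 = Offensive, 0 = Normal."""
    texts, labels = [], []
    for record in json.loads(raw_json):
        if record["label"] not in VALID_LABELS:
            raise ValueError(f"unexpected label: {record['label']!r}")
        texts.append(record["text"])
        labels.append(1 if record["label"] == "Offensive" else 0)
    return texts, labels
```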
The system uses a hybrid approach:
- Machine Learning: TF-IDF + Logistic Regression
- Rule-based detection
- Combined scoring for final prediction
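The three steps above can be sketched with scikit-learn as follows. This is not the code in swear_detector.py: the toy training sentences, the word list `OFFENSIVE_WORDS`, the function names, and the choice to combine scores by taking the maximum are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data invented for this sketch (1 = Offensive, 0 = Normal).
TRAIN_TEXTS = [
    "تو احمق هستی",
    "چه آدم احمقی",
    "خیلی احمق و نادان هستی",
    "امروز هوا خوب است",
    "کتاب جدید خریدم",
    "فردا به مدرسه خواهم رفت",
]
TRAIN_LABELS = [1, 1, 1, 0, 0, 0]

# Hypothetical rule lexicon; the project's real rules are not shown here.
OFFENSIVE_WORDS = {"احمق", "نادان"}

# ML part: TF-IDF features fed into logistic regression.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(TRAIN_TEXTS, TRAIN_LABELS)


def rule_score(text: str) -> float:
    """Return 1.0 if any whitespace-separated token is in the lexicon."""
    return 1.0 if any(tok in OFFENSIVE_WORDS for tok in text.split()) else 0.0


def detect(text: str, threshold: float = 0.5) -> dict:
    # Column 1 of predict_proba is the probability of class 1 (Offensive).
    ml_prob = float(model.predict_proba([text])[0][1])
    # Assumed combination rule: either detector alone can flag the text.
    combined = max(rule_score(text), ml_prob)
    label = "Offensive" if combined >= threshold else "Normal"
    confidence = combined if label == "Offensive" else 1.0 - combined
    return {"text": text, "label": label,
            "confidence": confidence, "ml_confidence": ml_prob}
```

Taking the maximum means a lexicon hit always flags the text, while the classifier covers phrasing the lexicon misses. A weighted average is another reasonable choice if rule hits should be allowed to be overruled.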
Build and run with Docker:
docker-compose up --build