A high-performance CLI tool and Python library for detecting open source components and security threats in binaries through semantic signature matching. Specialized for analyzing mobile apps (APK/IPA), Java archives, ML models, and source code to identify OSS components, their licenses, and potential security risks.
- Fuzzy Matching: Detect modified, recompiled, or patched OSS components using TLSH
- Deterministic Results: Consistent analysis results across multiple runs
- Fast Local Analysis: SQLite-based signature storage with optimized direct matching
- Efficient Matching: MinHash LSH for similarity detection, trigram indexing for substring matching
- Dual Interface: Use as CLI tool or Python library
- Smart Compression: ZSTD-compressed signatures with ~90% size reduction
- Low Memory Footprint: Streaming analysis with <100MB memory usage
- CycloneDX Format: Industry-standard SBOM export for security and compliance toolchains
- File Path Tracking: Evidence includes file paths for component location tracking
- Feature Extraction: Optional feature dump for signature recreation
- Confidence Scores: All detections include confidence levels in SBOM
- Multi-file Support: Aggregate SBOM for entire projects
- Comprehensive File Enumeration: Extract complete file listings from archives
- Rich Metadata: MIME types, compression ratios, file sizes, timestamps
- Hash Calculation: MD5, SHA1, SHA256 for integrity verification
- Fuzzy Hashing: TLSH and ssdeep for similarity analysis
- Component Detection: Run OSS detection on individual files within packages
- Multiple Export Formats: JSON, CSV, tree visualization, summary reports
- Advanced Format Support: ELF, PE, Mach-O analysis with symbol and import extraction via LIEF
- Static Library Support: Parse and analyze .a archives, examining each object file separately
- Android DEX Support: Specialized extractor for DEX bytecode files
- Improved Detection: 25+ components detected in APK files with 152K+ features extracted
- Substring Matching: Detects components even with partial pattern matches
- Progress Indication: Real-time progress bars for long analysis operations
- Mobile Applications: Android APK and iOS IPA with manifest parsing and native library analysis
- Java Archives: JAR/WAR files with MANIFEST.MF parsing and package detection
- Python Packages: Wheels (.whl) and eggs (.egg) with metadata extraction
- Linux Packages: DEB (Debian/Ubuntu) and RPM (Red Hat/Fedora) packages
- Extended Formats: 7z, RAR, Zstandard (.zst, .tar.zst), CPIO
- Nested Archives: Handle archives containing other archives (up to 5 levels deep)
- Intelligent Extraction: Prioritizes binaries, bytecode, and source files for analysis
- CTags Integration: Advanced source code analysis when universal-ctags is available
- Multi-language Support: C/C++, Python, Java, JavaScript, Go, Rust, PHP, Swift, Kotlin
- Semantic Symbol Extraction: Functions, classes, structs, constants, and dependencies
- Graceful Fallback: Regex-based extraction when CTags is unavailable
- Comprehensive Security Module: Deep analysis of ML models for security threats
- MITRE ATT&CK Integration: Maps threats to ATT&CK framework techniques
- Multi-Level Risk Assessment: SAFE, LOW, MEDIUM, HIGH, CRITICAL risk levels
- Pickle File Parser: Safe analysis of Python pickle files without code execution
- ONNX Model Parser: Comprehensive analysis of ONNX format models
- SafeTensors Parser: Validation of secure tensor storage format
- PyTorch/TensorFlow Native: Handles .pt, .pth, .pb, .h5 native formats
- Malicious Detection: 100% detection rate on real-world ML exploits
- Framework Detection: Identifies PyTorch (96%), TensorFlow, sklearn (94%), XGBoost (77%) origins
- Obfuscation Detection: Entropy analysis and pattern matching for hidden threats
- Model Integrity Validation: Hash verification and tampering detection
- Architecture Recognition: Detects ResNet, BERT, YOLO, LLaMA, ViT, etc.
- Format Validation: Detects tampering, injection attempts, and format violations
- Malformed File Detection: Identifies corrupted or invalid model files with clear warnings
- Data Exfiltration Detection: Flags oversized tensors and suspicious patterns
- Supply Chain Security: Verifies model provenance and integrity
- SARIF Output: CI/CD integration with GitHub Actions and security tools
- Security-Enhanced SBOM: CycloneDX format with ML security metadata
- 188 OSS Components: Comprehensive coverage including libraries, frameworks, ML models, and multimedia codecs
- 1,400+ Total Signatures: High-quality patterns with improved accuracy and reduced false positives
- Multimedia Support: H.264/H.265, AAC, Dolby, AV1, GStreamer, GLib, FFmpeg components
- System Libraries: libcap, Expat XML, LZ4, XZ Utils, WebP, cURL, Cairo, Opus
- License Detection: Automatic license identification for detected components
- Security Analysis: Detection of malicious patterns with severity levels (CRITICAL, HIGH, MEDIUM, LOW)
- Rich Metadata: Publisher, version, and ecosystem information for each component
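The "Pickle File Parser" item above mentions analysis without code execution. One way this can be approached, shown here only as a minimal sketch and not as BinarySniffer's actual implementation, is to walk the pickle opcode stream with the standard library's `pickletools.genops`, which decodes opcodes without running them. The `SUSPICIOUS_MODULES` deny-list and the `find_suspicious_imports` helper are hypothetical names introduced for illustration.

```python
import pickle
import pickletools

# Hypothetical deny-list for illustration; a real scanner would be more nuanced.
SUSPICIOUS_MODULES = {"os", "posix", "nt", "subprocess", "builtins", "sys"}


def find_suspicious_imports(data: bytes) -> list[str]:
    """List 'module.name' imports from suspicious modules, without unpickling."""
    found = []
    recent_strings = []  # STACK_GLOBAL takes module and name from earlier string pushes
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name == "GLOBAL":
            # genops reports the GLOBAL argument as "module name"
            module, name = arg.split(" ", 1)
        elif opcode.name == "STACK_GLOBAL" and len(recent_strings) >= 2:
            module, name = recent_strings[-2], recent_strings[-1]
        else:
            if isinstance(arg, str):
                recent_strings.append(arg)
            continue
        if module.split(".")[0] in SUSPICIOUS_MODULES:
            found.append(f"{module}.{name}")
    return found


# Hand-written protocol-0 pickle that would call os.system("echo hi") if loaded.
MALICIOUS = b"cos\nsystem\n(S'echo hi'\ntR."
BENIGN = pickle.dumps({"weights": [1.0, 2.0]})

if __name__ == "__main__":
    print(find_suspicious_imports(MALICIOUS))  # ['os.system']
    print(find_suspicious_imports(BENIGN))     # []
```

Because the bytes are never passed to `pickle.loads`, the payload in `MALICIOUS` is only inspected, never executed.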
# From PyPI
pip install binarysniffer
# From source
git clone https://github.com/SemClone/binarysniffer
cd binarysniffer
pip install -e .
pip install binarysniffer[fast]
# Includes TLSH for detecting modified/recompiled components
pip install binarysniffer[fuzzy]
# Includes support for 7z, RAR, DEB, RPM formats
pip install binarysniffer[archives]
# Includes Androguard for advanced APK analysis
pip install binarysniffer[android]
BinarySniffer can leverage external tools when available to provide enhanced analysis capabilities. These tools are optional: the core functionality works without them, but installing them unlocks additional features.
| Format | Python Package | System Tool (Alternative) | Fallback |
|---|---|---|---|
| 7z | py7zr (included) | 7-Zip | - |
| RAR | rarfile (included) | unrar | 7-Zip |
| DEB | python-debian (included) | ar | 7-Zip |
| RPM | - | rpm2cpio | 7-Zip |
| ZIP/JAR | Built-in | - | - |
| TAR/GZ | Built-in | - | - |
Enables: Extraction and analysis of Windows installers, macOS packages, and additional compressed formats
# macOS
brew install p7zip
# Ubuntu/Debian
sudo apt-get install p7zip-full
# Windows
# Download from https://www.7-zip.org/
Benefits:
- Analyze Windows installers (.exe, .msi) by extracting embedded components
- Analyze macOS installers (.pkg, .dmg) to detect bundled frameworks
- Support for NSIS, InnoSetup, and other installer formats
- Extract and analyze self-extracting archives
- Support for additional archive formats (RAR, CAB, ISO, etc.)
When using the [archives] installation option, these tools enhance format support:
# For DEB packages (Debian/Ubuntu)
# Option 1: Install python-debian (included with [archives])
pip install binarysniffer[archives]
# Option 2: Use system ar command (usually pre-installed)
# Ubuntu/Debian
which ar # Check if available
# macOS
# ar is included with Xcode Command Line Tools
xcode-select --install # If not already installed
# For RPM packages (Red Hat/Fedora/CentOS)
# Option 1: Install rpm2cpio
# Ubuntu/Debian
sudo apt-get install rpm2cpio
# macOS
brew install rpm2cpio
# Fedora/RHEL/CentOS
# rpm2cpio is usually pre-installed
# Option 2: Falls back to 7-Zip if available
The [archives] option includes Python libraries for:
- 7z files: py7zr (pure Python, no external tools needed)
- RAR files: rarfile (requires unrar tool)
# Install unrar for RAR support
# Ubuntu/Debian
sudo apt-get install unrar
# macOS
brew install unrar
# Note: Falls back to 7-Zip if unrar not available
Enables: Enhanced source code analysis with semantic understanding
# macOS
brew install universal-ctags
# Ubuntu/Debian
sudo apt-get install universal-ctags
# Windows
# Download from https://github.com/universal-ctags/ctags-win32/releases
Benefits:
- Better function/class/method detection in source code
- Multi-language semantic analysis
- More accurate symbol extraction
- Improved signature matching for source code components
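The fallback order in the format table above can be expressed as a small lookup. The following is a sketch under stated assumptions, not the tool's own logic: it covers only the system tools (not the Python packages), and the executable names `7z`, `unrar`, `ar`, and `rpm2cpio` are assumed; some distributions ship 7-Zip as `7zz` or `7za`. `TOOL_ORDER` and `pick_tool` are hypothetical names.

```python
import shutil
from typing import Callable, Optional

# System-tool preference per format, transcribed from the table above.
# Executable names are assumptions and may differ between platforms.
TOOL_ORDER = {
    "7z": ["7z"],
    "rar": ["unrar", "7z"],
    "deb": ["ar", "7z"],
    "rpm": ["rpm2cpio", "7z"],
}


def pick_tool(
    fmt: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first system tool found on PATH for the format, else None."""
    for tool in TOOL_ORDER.get(fmt.lower(), []):
        if which(tool):
            return tool
    return None


if __name__ == "__main__":
    for fmt in TOOL_ORDER:
        print(f"{fmt}: {pick_tool(fmt) or 'no system tool found'}")
```

Passing `which` as a parameter keeps the lookup testable on machines where none of the tools are installed.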
Without 7-Zip:
$ binarysniffer analyze installer.exe
# Analyzes as compressed binary - limited detection
With 7-Zip installed:
# Windows installers
$ binarysniffer analyze installer.exe
$ binarysniffer analyze setup.msi
# Automatically extracts and analyzes contents
# Detects: Qt5, OpenSSL, SQLite, ICU, libpng, etc.
# macOS installers
$ binarysniffer analyze app.pkg
$ binarysniffer analyze app.dmg
# Automatically extracts and analyzes contents
# Detects: Qt5, WebKit, OpenCV, React Native, etc.
# Basic analysis
binarysniffer analyze /path/to/binary
binarysniffer analyze app.apk # Android APK
binarysniffer analyze app.ipa # iOS IPA
binarysniffer analyze library.jar # Java JAR
# ML model component detection
binarysniffer analyze model.pkl # Pickle files
binarysniffer analyze model.onnx # ONNX models
binarysniffer analyze model.safetensors # SafeTensors format
binarysniffer analyze suspicious_model.pkl --show-features # Detailed analysis
# ML model security scanning (v1.10.0+)
binarysniffer ml-scan model.pkl # Security analysis of ML models
binarysniffer ml-scan model.pkl --deep # Deep security analysis
binarysniffer ml-scan models/ -r --format sarif # SARIF output for CI/CD
binarysniffer ml-scan model.pkl -o report.md # Markdown security report
binarysniffer ml-scan model.pkl --risk-threshold 0.5 # Custom risk threshold
# Analyze directories recursively
binarysniffer analyze /path/to/project -r
# Output with auto-format detection
binarysniffer analyze app.apk -o report.json # Auto-detects JSON format
binarysniffer analyze app.apk -o report.csv # Auto-detects CSV format
binarysniffer analyze app.apk -o app.sbom # Auto-detects SBOM format
# Performance modes
binarysniffer analyze large.bin --fast # Quick scan (no fuzzy matching)
binarysniffer analyze app.apk --deep # Thorough analysis
# Custom confidence threshold
binarysniffer analyze file.exe -t 0.3 # More sensitive (30% confidence)
binarysniffer analyze file.exe -t 0.8 # More conservative (80% confidence)
# Include file hashes in output
binarysniffer analyze file.exe --with-hashes -o report.json
binarysniffer analyze file.exe --basic-hashes # Only MD5, SHA1, SHA256
# Filter by file patterns
binarysniffer analyze project/ -r -p "*.so" -p "*.dll"
# Export as CycloneDX SBOM
binarysniffer analyze app.apk -f sbom -o app-sbom.json
binarysniffer analyze app.apk --format cyclonedx -o sbom.json
# Save features for signature creation
binarysniffer analyze binary.exe --save-features features.json --show-features
# Filter results
binarysniffer analyze lib.so --min-matches 5 # Show components with 5+ matches
binarysniffer analyze app.apk --show-evidence # Show detailed match evidence
The analysis results display a Classification column that shows either:
- Software licenses (e.g., Apache-2.0, BSD-3-Clause, MIT) for legitimate OSS components
- Security severity levels (CRITICAL, HIGH, MEDIUM, LOW) for detected threats
Example output:
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Component ┃ Confidence ┃ Classification ┃ Type ┃ Evidence ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ PyTorch-Native │ 94.0% │ BSD-3-Clause │ library│ 2 patterns │
│ SafeTensors │ 90.0% │ Apache-2.0 │ library│ 3 patterns │
│ Pickle-Malicious │ 98.5% │ CRITICAL │ threat │ RCE risk detected│
└──────────────────┴────────────┴────────────────┴────────┴──────────────────┘
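Since the Classification column carries either a license or a severity level, downstream tooling usually needs to separate the two. The snippet below is a minimal sketch of that split using the rows from the example table; the `Row` dataclass and `split_rows` helper are hypothetical and are not part of the BinarySniffer API.

```python
from dataclasses import dataclass

# Severity labels listed in the text above; anything else is treated as a license.
SEVERITY_LEVELS = {"CRITICAL", "HIGH", "MEDIUM", "LOW"}


@dataclass
class Row:
    component: str
    confidence: float  # 0.0 to 1.0
    classification: str


def split_rows(rows, threshold=0.5):
    """Drop rows below the confidence threshold, then split licenses from threats."""
    kept = [r for r in rows if r.confidence >= threshold]
    threats = [r for r in kept if r.classification.upper() in SEVERITY_LEVELS]
    licensed = [r for r in kept if r.classification.upper() not in SEVERITY_LEVELS]
    return licensed, threats


# Rows copied from the example output above.
ROWS = [
    Row("PyTorch-Native", 0.94, "BSD-3-Clause"),
    Row("SafeTensors", 0.90, "Apache-2.0"),
    Row("Pickle-Malicious", 0.985, "CRITICAL"),
]

if __name__ == "__main__":
    licensed, threats = split_rows(ROWS)
    print("licenses:", [(r.component, r.classification) for r in licensed])
    print("threats:", [(r.component, r.classification) for r in threats])
```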
from binarysniffer import EnhancedBinarySniffer
# Initialize analyzer (enhanced mode is default)
sniffer = EnhancedBinarySniffer()
# Analyze a single file
result = sniffer.analyze_file("/path/to/binary")
for match in result.matches:
    print(f"{match.component} - {match.confidence:.2%}")
    print(f"Classification: {match.license}") # Shows license or severity level
# Analyze mobile applications
apk_result = sniffer.analyze_file("app.apk")
ipa_result = sniffer.analyze_file("app.ipa")
jar_result = sniffer.analyze_file("library.jar")
# Analyze with custom threshold (default is 0.5)
result = sniffer.analyze_file("file.exe", confidence_threshold=0.3) # More sensitive
result = sniffer.analyze_file("file.exe", confidence_threshold=0.8) # More conservative
# Analyze with file hashes
result = sniffer.analyze_file("file.exe", include_hashes=True, include_fuzzy_hashes=True)
# Directory analysis
results = sniffer.analyze_directory("/path/to/project", recursive=True)
for file_path, result in results.items():
    if result.matches:
        print(f"{file_path}: {len(result.matches)} components detected")
# TLSH fuzzy matching for modified components
result = sniffer.analyze_file(
    "modified_binary.exe",
    use_tlsh=True, # Enable TLSH fuzzy matching (default)
    tlsh_threshold=50 # Lower threshold = more similar required
)
for match in result.matches:
    if match.match_type == 'tlsh_fuzzy':
        print(f"Fuzzy match: {match.component} (similarity: {match.confidence:.0%})")
Generate Software Bill of Materials in CycloneDX format for integration with security and compliance tools:
# Export single file analysis as SBOM
binarysniffer analyze app.apk --format cyclonedx -o app-sbom.json
# Export directory analysis as aggregated SBOM
binarysniffer analyze project/ -r --format cdx -o project-sbom.json
# Include extracted features for signature recreation
binarysniffer analyze binary.exe --format cyclonedx --show-features -o sbom-with-features.json
The SBOM includes:
- Component names, versions, and licenses
- Confidence scores for each detection
- File paths showing where components were found
- Evidence details including matched patterns
- Optional extracted features for signature recreation
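To give a feel for how the fields listed above can map onto a CycloneDX document, here is a minimal sketch that builds one with the standard `json` module. The top-level keys (`bomFormat`, `specVersion`, `version`, `components`) and the `licenses` and `properties` structures follow the CycloneDX JSON format; the property names `example:confidence` and `example:file_path`, the input dictionaries, and the `to_cyclonedx` helper are assumptions made for illustration. The layout BinarySniffer actually emits may differ.

```python
import json


def to_cyclonedx(detections):
    """Build a minimal CycloneDX-style document from detection dictionaries."""
    components = []
    for d in detections:
        component = {
            "type": "library",
            "name": d["name"],
            "licenses": [{"license": {"id": d["license"]}}],
            # Property names are hypothetical; CycloneDX property values are strings.
            "properties": [
                {"name": "example:confidence", "value": f"{d['confidence']:.2f}"},
                {"name": "example:file_path", "value": d["file_path"]},
            ],
        }
        if d.get("version"):
            component["version"] = d["version"]
        components.append(component)
    return {
        "bomFormat": "CycloneDX",
        "specVersion": "1.5",
        "version": 1,
        "components": components,
    }


# Illustrative input, not real tool output.
SAMPLE = [
    {"name": "SafeTensors", "license": "Apache-2.0",
     "confidence": 0.90, "file_path": "models/model.safetensors"},
]

if __name__ == "__main__":
    print(json.dumps(to_cyclonedx(SAMPLE), indent=2))
```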
Extract comprehensive file inventories from packages with metadata, hashes, and component detection:
# Basic inventory summary
binarysniffer inventory app.apk
# Export full inventory with auto-format detection
binarysniffer inventory app.apk -o inventory.json
binarysniffer inventory app.jar -o files.csv
# Include file hashes (MD5, SHA1, SHA256, TLSH, ssdeep)
binarysniffer inventory app.jar --analyze --with-hashes -o files.csv
# Full analysis with component detection
binarysniffer inventory app.ipa \
--analyze \
--with-hashes \
--with-components \
-o full_inventory.json
# Export as directory tree visualization
binarysniffer inventory archive.zip --format tree -o structure.txt
from binarysniffer import EnhancedBinarySniffer
sniffer = EnhancedBinarySniffer()
# Basic inventory extraction
inventory = sniffer.extract_package_inventory("app.apk")
print(f"Total files: {inventory['summary']['total_files']}")
print(f"Package size: {inventory['package_size']:,} bytes")
# Full analysis with all features
inventory = sniffer.extract_package_inventory(
    "app.apk",
    analyze_contents=True, # Extract and analyze file contents
    include_hashes=True, # Calculate MD5, SHA1, SHA256
    include_fuzzy_hashes=True, # Calculate TLSH and ssdeep
    detect_components=True # Run OSS component detection
)
# Access comprehensive file metadata
for file_entry in inventory['files']:
    if not file_entry['is_directory']:
        print(f"File: {file_entry['path']}")
        print(f" MIME: {file_entry['mime_type']}")
        print(f" Size: {file_entry['size']:,} bytes")
        print(f" Compression ratio: {file_entry['compression_ratio']:.1%}")
        if 'hashes' in file_entry:
            print(f" SHA256: {file_entry['hashes']['sha256']}")
        if 'components' in file_entry:
            for comp in file_entry['components']:
                print(f" Component: {comp['name']} ({comp['confidence']:.0%})")
- JSON: Complete structured data with all metadata
- CSV: Tabular format for data analysis (includes hashes, MIME types, components)
- Tree: Visual directory structure representation
- Summary: Quick overview with file type statistics
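For readers curious what a per-file inventory entry involves, the sketch below builds similar records for a ZIP archive using only the standard library. It is not the `extract_package_inventory` implementation: the `zip_inventory` and `build_sample` helpers are hypothetical, the compression ratio is assumed to mean compressed size divided by original size, and fuzzy hashes (TLSH, ssdeep) are omitted because they need third-party packages.

```python
import hashlib
import io
import mimetypes
import zipfile


def zip_inventory(archive_bytes: bytes) -> list[dict]:
    """Return path, size, MIME type, compression ratio, and hashes per file."""
    entries = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            data = zf.read(info)
            mime, _encoding = mimetypes.guess_type(info.filename)
            entries.append({
                "path": info.filename,
                "size": info.file_size,
                "mime_type": mime or "application/octet-stream",
                # Assumed definition: compressed size / original size.
                "compression_ratio": (
                    info.compress_size / info.file_size if info.file_size else 0.0
                ),
                "hashes": {
                    "md5": hashlib.md5(data).hexdigest(),
                    "sha1": hashlib.sha1(data).hexdigest(),
                    "sha256": hashlib.sha256(data).hexdigest(),
                },
            })
    return entries


def build_sample() -> bytes:
    """Create a small in-memory ZIP so the example runs without external files."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("lib/readme.txt", "hello " * 200)
        zf.writestr("lib/data.bin", bytes(range(256)))
    return buf.getvalue()


if __name__ == "__main__":
    for entry in zip_inventory(build_sample()):
        print(entry["path"], entry["size"], entry["hashes"]["sha256"][:16])
```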
Detect and analyze software licenses using pattern matching and SPDX identifier recognition:
# Analyze licenses in a file or directory
binarysniffer license /path/to/project
# Check license compatibility
binarysniffer license . --check-compatibility
# Show which files contain each license
binarysniffer license src/ --show-files
# Export license report
binarysniffer license app.apk -o licenses.json
binarysniffer license project/ -o report.md --format markdown
Combine component and license detection in a single analysis:
# Add license detection to regular analysis
binarysniffer analyze app.jar --license-focus
# Perform only license detection (skip component analysis)
binarysniffer analyze source/ --license-only
from binarysniffer import EnhancedBinarySniffer
sniffer = EnhancedBinarySniffer()
# Analyze licenses in a project
license_result = sniffer.analyze_licenses("/path/to/project")
print(f"Detected licenses: {', '.join(license_result['licenses_detected'])}")
# Check compatibility
compatibility = license_result['compatibility']
if not compatibility['compatible']:
    for warning in compatibility['warnings']:
        print(f"Warning: {warning}")
- Pattern-based detection for common licenses (MIT, Apache-2.0, GPL, BSD, LGPL, ISC)
- SPDX identifier support with 100% confidence
- License compatibility checking to identify conflicts
- Multiple output formats: Table, JSON, CSV, Markdown
- Works on: License files, source code with embedded licenses, archives
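SPDX identifiers are the easiest case in the list above because source files declare them with a fixed `SPDX-License-Identifier:` tag. The following sketch shows one plausible way to pick them out with a regular expression; it is illustrative only, handles single identifiers rather than compound expressions such as `(MIT OR Apache-2.0)`, and the `find_spdx_ids` helper is a hypothetical name.

```python
import re

# Matches a single SPDX short identifier after the standard tag.
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*([A-Za-z0-9.+-]+)")


def find_spdx_ids(text: str) -> list[str]:
    """Return distinct SPDX identifiers in order of first appearance."""
    seen = []
    for match in SPDX_TAG.finditer(text):
        identifier = match.group(1)
        if identifier not in seen:
            seen.append(identifier)
    return seen


SOURCE = """\
// SPDX-License-Identifier: Apache-2.0
# SPDX-License-Identifier: MIT
/* SPDX-License-Identifier: MIT */
"""

if __name__ == "__main__":
    print(find_spdx_ids(SOURCE))  # ['Apache-2.0', 'MIT']
```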
Create custom signatures for components you want to detect:
# From binary files (recommended for compiled components)
binarysniffer signatures create /usr/bin/ffmpeg --name FFmpeg --version 4.4.1
# From source code directories
binarysniffer signatures create /path/to/source --name MyLibrary --license MIT
# With complete metadata for better attribution
binarysniffer signatures create binary.so \
--name "My Component" \
--version 2.0.0 \
--license Apache-2.0 \
--publisher "My Company" \
--description "Component description" \
--output signatures/my-component.json
# Specify minimum signature requirements
binarysniffer signatures create /path/to/library \
--name "LibraryName" \
--min-signatures 10 # Require at least 10 unique patterns
The signature generator includes automatic collision detection to identify patterns that appear in multiple existing components:
# Check for collisions with existing signatures
binarysniffer signatures create /usr/bin/myapp \
--name "MyApp" \
--check-collisions
# Interactive review - decide on each collision
binarysniffer signatures create /usr/bin/myapp \
--name "MyApp" \
--interactive
# Auto-remove patterns with high collision severity
binarysniffer signatures create /usr/bin/myapp \
--name "MyApp" \
--check-collisions \
--collision-threshold high # Remove patterns in 3+ components
Collision Severity Levels:
- Critical: Pattern appears in 5+ unrelated components (likely generic)
- High: Pattern appears in 3-4 components
- Medium: Pattern appears in 2 unrelated components
- Low: Pattern appears in 2 related components (e.g., ffmpeg/libav)
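The severity levels above amount to a small decision rule, sketched below. This is an interpretation of the list, not the tool's code: `collision_severity` is a hypothetical helper, and it assumes that relatedness only matters when exactly two components share a pattern, since the list does not say how related components are treated at higher counts.

```python
from typing import Optional


def collision_severity(component_count: int, related: bool = False) -> Optional[str]:
    """Map the number of components sharing a pattern to a severity label."""
    if component_count >= 5:
        return "critical"  # likely a generic pattern
    if component_count >= 3:
        return "high"
    if component_count == 2:
        # Assumption: relatedness is only considered for pairs (e.g. ffmpeg/libav).
        return "low" if related else "medium"
    return None  # zero or one component: no collision


if __name__ == "__main__":
    for count, related in [(6, False), (3, False), (2, False), (2, True), (1, False)]:
        print(count, related, collision_severity(count, related))
```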
Features:
- Automatic generic word filtering (100+ common programming terms)
- Smart deduplication - all signatures are unique
- Cross-signature collision detection
- Interactive and automatic filtering modes
- Preserves library-specific prefixes (av_, curl_, SSL_, etc.)
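The filtering behaviour described above (generic word removal, deduplication, prefix preservation) could look roughly like the sketch below. Everything here is illustrative: `GENERIC_WORDS` is a tiny hypothetical sample rather than the tool's 100+ term list, the default minimum length of 8 is borrowed from the `--min-length 8` example elsewhere in this document, and `filter_patterns` is not a BinarySniffer function.

```python
# Hypothetical sample; the real generic-word list is much longer.
GENERIC_WORDS = {"error", "init", "buffer", "string", "value", "result"}
# Prefixes named in the feature list above.
PRESERVED_PREFIXES = ("av_", "curl_", "SSL_")


def filter_patterns(candidates, min_length=8):
    """Deduplicate candidates and drop short or generic ones, keeping known prefixes."""
    kept = []
    seen = set()
    for pattern in candidates:
        if pattern in seen:
            continue  # deduplicate while preserving order
        seen.add(pattern)
        if pattern.startswith(PRESERVED_PREFIXES):
            kept.append(pattern)  # library-specific prefix: always keep
            continue
        if len(pattern) < min_length or pattern.lower() in GENERIC_WORDS:
            continue
        kept.append(pattern)
    return kept


CANDIDATES = [
    "av_read_frame", "error", "init", "curl_easy_init",
    "av_read_frame", "SSL_new", "x264_encoder_open", "tmp",
]

if __name__ == "__main__":
    print(filter_patterns(CANDIDATES))
```

Note that `SSL_new` survives even though it is shorter than the minimum length, because the prefix check runs first.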
Help improve detection by contributing your signatures:
1. Generate the signature file:
    binarysniffer signatures create /path/to/component \
    --name "Component Name" \
    --version "1.0.0" \
    --license "MIT" \
    --publisher "Publisher Name" \
    --output signatures/component-name.json
2. Test your signature:
    # Import locally for testing
    binarysniffer signatures import signatures/component-name.json
    # Verify detection works
    binarysniffer analyze /path/to/test/binary
3. Submit via GitHub Pull Request:
    # Fork the repository on GitHub, then:
    git clone https://github.com/YOUR_USERNAME/binarysniffer
    cd binarysniffer
    # Add your signature file
    cp /path/to/component-name.json signatures/
    # Commit and push
    git add signatures/component-name.json
    git commit -m "Add signatures for Component Name v1.0.0"
    git push origin main
    # Create a Pull Request on GitHub
For detailed contribution guidelines, see CONTRIBUTING.md.
The tool uses a multi-tiered approach for efficient matching:
- Pattern Matching: Direct string/symbol matching against signature database
- MinHash LSH: Fast similarity search for near-duplicate detection (milliseconds)
- TLSH Fuzzy Matching: Locality-sensitive hashing to detect modified/recompiled components
- Detailed Verification: Precise signature verification with confidence scoring
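To illustrate the MinHash tier, here is a pure-Python sketch of estimating Jaccard similarity between two feature sets from fixed-size signatures. It shows only the MinHash estimate, not the LSH banding that makes lookups fast, and it is not taken from BinarySniffer's code: the salted `blake2b` hashing, the 256-slot signature size, and the helper names are all assumptions.

```python
import hashlib


def minhash_signature(features: set[str], num_hashes: int = 256) -> list[int]:
    """Compute a MinHash signature; features must be non-empty."""
    signature = []
    for i in range(num_hashes):
        # Each salt value yields an independent hash function.
        salt = i.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for f in features
        ))
    return signature


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching slots approximates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    a = {f"sym_{i}" for i in range(100)}
    b = {f"sym_{i}" for i in range(50, 150)}  # true Jaccard with a is 50/150
    est = estimate_jaccard(minhash_signature(a), minhash_signature(b))
    print(f"estimated similarity: {est:.2f}")
```

The appeal of the approach is that two signatures of fixed length can be compared without touching the original feature sets, which may contain many thousands of entries.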
TLSH (Trend Micro Locality Sensitive Hash) enables detection of:
- Modified Components: Components with patches or custom modifications
- Recompiled Binaries: Same source code compiled with different options
- Version Variants: Different versions of the same library
- Obfuscated Code: Components with mild obfuscation or optimization
The TLSH algorithm generates a compact hash that remains similar even when files are modified, making it ideal for detecting OSS components that have been customized or rebuilt.
- Analysis Speed: ~1 second per binary file (5x faster in v1.6.3)
- Archive Processing: ~100-500ms for APK/IPA files (depends on contents)
- Signature Storage: ~3.5MB database with 5,136 signatures from 131 components
- Memory Usage: <100MB during analysis, <200MB for large archives
- Deterministic Results: Consistent detection across runs (NEW in v1.6.3)
Configuration file location: ~/.binarysniffer/config.json
{
"signature_sources": [
"https://signatures.binarysniffer.io/core.xmdb"
],
"cache_size_mb": 100,
"parallel_workers": 4,
"min_confidence": 0.5,
"auto_update": true,
"update_check_interval_days": 7
}
The tool includes a pre-built signature database with 131 OSS components including:
- Mobile SDKs: Facebook Android SDK, Google Firebase, Google Ads
- Java Libraries: Jackson, Apache Commons, Google Guava, Netty
- Media Libraries: FFmpeg, x264, x265, Vorbis, Opus
- Crypto Libraries: Bouncy Castle, mbedTLS variants
- Development Tools: Lombok, Dagger, RxJava, OkHttp
Maintaining an up-to-date signature database is critical for accurate detection. BinarySniffer provides comprehensive signature management commands:
# Check current signature database status
binarysniffer signatures status
# Shows: total signatures, components, last update, database location
# View detailed statistics
binarysniffer signatures stats
# Shows: signatures per component, database size, index status
# Update signatures from GitHub repository (recommended)
binarysniffer signatures update
# Pulls latest community-contributed signatures
# Alternative update command (backward compatible)
binarysniffer update
# Force update even if current
binarysniffer signatures update --force
# Rebuild database from packaged signatures
binarysniffer signatures rebuild
# Useful when database is corrupted or needs fresh start
# Import specific signature files
binarysniffer signatures import signatures/*.json
# Import from custom directory
binarysniffer signatures import /path/to/signatures --recursive
# Create signature from binary
binarysniffer signatures create /usr/bin/curl \
--name "curl" \
--version 7.81.0 \
--license "MIT" \
--output signatures/curl.json
# Create from source code directory
binarysniffer signatures create /path/to/source \
--name "MyLibrary" \
--version 1.0.0 \
--license "Apache-2.0" \
--min-length 8 # Minimum pattern length
# Create with metadata
binarysniffer signatures create binary.so \
--name "Custom Component" \
--publisher "My Company" \
--description "Custom implementation" \
--url "https://github.com/mycompany/component"
# Validate signature quality before adding
binarysniffer signatures validate signatures/new-component.json
# Checks for: generic patterns, minimum length, uniqueness
# Test signature against known files
binarysniffer signatures test signatures/component.json /path/to/test/files
# Export signatures to JSON (for backup or sharing)
binarysniffer signatures export --output my-signatures/
# Creates one JSON file per component
# Clear database (use with caution)
binarysniffer signatures clear --confirm
# Removes all signatures from database
# Optimize database
binarysniffer signatures optimize
# Rebuilds indexes and vacuums database for better performance
Configure automatic signature updates in ~/.binarysniffer/config.json:
{
"auto_update": true,
"update_check_interval_days": 7,
"signature_sources": [
"https://github.com/oscarvalenzuelab/binarysniffer-signatures"
]
}
- Regular Updates: Run `binarysniffer signatures update` weekly for latest detections
- Custom Signatures: Create signatures for proprietary components you want to track
- Validation: Always validate new signatures to avoid false positives
- Backup: Export signatures before major updates using `signatures export`
- Performance: Run `signatures optimize` monthly for best performance
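The `auto_update` and `update_check_interval_days` settings in the configuration shown earlier suggest a simple "is a check due?" rule. The sketch below is one way such settings might be interpreted; it is an assumption about the semantics, not the tool's documented behaviour, and `DEFAULTS`, `load_config`, and `update_due` are hypothetical names. The default values are copied from the example configuration.

```python
import json
from datetime import datetime, timedelta

# Values copied from the example config; treating them as defaults is an assumption.
DEFAULTS = {"auto_update": True, "update_check_interval_days": 7}


def load_config(text: str) -> dict:
    """Parse a JSON config string and fill in missing keys from DEFAULTS."""
    config = dict(DEFAULTS)
    config.update(json.loads(text))
    return config


def update_due(config: dict, last_check: datetime, now: datetime) -> bool:
    """Return True when auto-update is on and the check interval has elapsed."""
    if not config["auto_update"]:
        return False
    interval = timedelta(days=config["update_check_interval_days"])
    return now - last_check >= interval


if __name__ == "__main__":
    cfg = load_config('{"update_check_interval_days": 3}')
    last = datetime(2024, 1, 1)
    print(update_due(cfg, last, datetime(2024, 1, 5)))
```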
For detailed signature creation and management documentation, see docs/SIGNATURE_MANAGEMENT.md.
Apache License 2.0 - See LICENSE file for details.
Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.