Violence detection in surveillance videos is vital for ensuring public safety. VioNet is a deep hybrid architecture designed for real-time violence detection by leveraging spatial, temporal, and contextual information from video clips.
Manual monitoring of video feeds is inconsistent and infeasible at scale. VioNet addresses this challenge using a deep learning-based approach combining ConvLSTM, BiLSTM, and Multi-Head Attention layers to focus on spatio-temporal and sequence-level cues in video data.
- Variations in lighting, occlusion, crowd density, and camera angles complicate violence recognition.
- Existing models fail to effectively capture both spatial and temporal dynamics.
- VioNet aims to be fast, lightweight, and highly accurate for real-time violence detection.
- Detect violent behavior in short surveillance clips using deep learning.
- Introduce a hybrid model (ConvLSTM + BiLSTM + Attention).
- Design a system ready for real-world deployment, even on low-power edge devices.
VioNet is a hybrid network that processes 20 RGB video frames of size 64×64 using:
-
TimeDistributed Residual CNN Blocks
→ Extract spatial features per frame. -
ConvLSTM2D Layer
→ Capture motion patterns across frames. -
Bidirectional LSTM
→ Learn temporal dynamics from both directions. -
Multi-Head Attention
→ Focus on the most informative time steps. -
GlobalAveragePooling + Dense Head
→ Produce a single probability for classification.
Layer | Purpose |
---|---|
TimeDistributed + ResNet |
Spatial feature extraction per frame |
ConvLSTM2D |
Motion encoding (no optical flow needed) |
BiLSTM |
Bidirectional temporal learning |
Multi-Head Attention |
Contextual relevance across time steps |
GlobalAvgPooling + Dense |
Classification |
Dataset: Hockey Fight Dataset
- 500 violent & 500 non-violent videos
- Converted to sequences of 20 frames each using OpenCV
- Frames resized to 64x64, normalized to
[0,1]
, converted to RGB - Sequence padding applied to handle varying frame lengths
Parameter | Value |
---|---|
Framework | TensorFlow + Keras |
Video Processing | OpenCV |
GPU | Google Colab (Tesla T4) |
Epochs | 50 |
Batch Size | 32 |
Optimizer | Adam |
Loss Function | Binary Crossentropy |
Regularization | Dropout (0.4), L2 (0.001) |
Metrics | Accuracy, F1, AUC-ROC |
Metric | Score |
---|---|
Accuracy | 94% |
F1-Score | 94% |
AUC-ROC | 97% |
VioNet demonstrates robustness even in blurry or occluded scenes.
The violence-detection-vionet
project is organized with the following top-level directories: models
for the VioNet architecture, utils
for data loaders, trainers, evaluators, and visualizers, data
to store input videos or extracted frames, assets
for architecture diagrams and sample outputs, and plots
for evaluation plots.
- Clone the repo:
git clone https://github.com/sarathir-dev/VioNet.git cd violence-detection-vionet
- Install requirements:
pip install -r requirements.txt
- Train the model:
python main.py
- View plots:
Open plots/ folder for training curves, confusion matrix, and ROC.
- Enhance performance in low-light surveillance
- Train on multiple types of violence scenarios
- Optimize for Jetson Nano / Raspberry Pi
- Studd, N.B., Terence, S., Immaculate, J. and Ewards, V., 2024, March. Violence Detection Using Convlstm and LRCN. In 2024 10th International Conference on Advanced Computing and Communication Systems (ICACCS) (Vol. 1, pp. 1635-1639). IEEE.
- Bagga, N., Singh, G., Balusamy, B. and Singh, A.S., 2022, April. Violence detection in real life videos using convolutional neural network. In 2022 2nd International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE) (pp. 872-876). IEEE.
- Traoré, A. and Akhloufi, M.A., 2020, October. Violence detection in videos using deep recurrent and convolutional neural networks. In 2020 IEEE international conference on systems, man, and cybernetics (SMC) (pp. 154-159). IEEE.