The model committed to this repository has not been fully trained; it is a demo model whose detection performance is only about 45% of the fully trained model's accuracy.
A computer vision pipeline that integrates YOLOv9 (object detection with 34 whole-body keypoints) and DepthAnythingV2 (depth estimation) to perform real-time object detection, pose estimation, and depth estimation simultaneously. Optionally supports person segmentation.
- Object Detection: High-precision object detection using YOLOv9
- Pose Estimation: Detection of 34 whole-body keypoints (face, hands, body)
- Depth Estimation: Relative or metric depth estimation (indoor/outdoor) using DepthAnythingV2
- Segmentation: Optional person segmentation support
- Edge Detection: Edge detection from depth maps
- Head Distance Measurement: Distance calculation to the head using the camera FOV (see the sketch after this list)
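The head distance measurement derives a focal length in pixels from the camera's horizontal FOV and applies the pinhole model to the detected head box. A minimal sketch of the geometry, assuming a nominal real head width of about 0.16 m (the helper and constants are illustrative, not the repository's actual code):

```python
import math

def head_distance_m(bbox_width_px: float, image_width_px: int,
                    horizontal_fov_deg: float = 90.0,
                    real_head_width_m: float = 0.16) -> float:
    """Pinhole-model distance estimate from an apparent head width."""
    # Focal length in pixels, derived from the horizontal FOV.
    focal_px = (image_width_px / 2.0) / math.tan(math.radians(horizontal_fov_deg) / 2.0)
    # Similar triangles: real_width / distance = pixel_width / focal_px
    return real_head_width_m * focal_px / bbox_width_px

# A 60 px-wide head box in a 640 px frame with a 90-degree FOV camera:
print(f"{head_distance_m(60, 640):.2f} m")  # -> 0.85 m
```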
- Python 3.10 or higher
- CUDA-capable GPU (recommended)
- ONNXRuntime or TensorRT (for fast inference)
# Basic dependencies
pip install opencv-python
pip install numpy
pip install torch
pip install pyyaml
pip install onnx
pip install onnxruntime-gpu # GPU version (CUDA environment)
# or
pip install onnxruntime # CPU version
# ONNX simplification tool
pip install onnxsim
# PINTO0309's ONNX manipulation tools (required for model merging)
pip install sne4onnx
pip install sor4onnx
pip install snc4onnx
pip install soa4onnx
pip install sio4onnx
pip install sit4onnx # For benchmarking (optional)
# For TensorRT usage (optional)
# TensorRT must be installed separately from NVIDIA official sources

The following ONNX models are required for this repository. You can download all the necessary ONNX files from: https://github.com/PINTO0309/yolo-depthanythingv2-merge/releases/tag/onnx
Required models:
- YOLOv9 Model:
  - yolov9_e_wholebody34_post_0100_1x3x480x640.onnx
- DepthAnythingV2 Models:
  - Relative depth: depth_anything_v2_small.onnx
  - Metric depth (indoor): depth_anything_v2_metric_hypersim_vits_indoor_maxdepth20_1x3x518x518.onnx
  - Metric depth (outdoor): depth_anything_v2_metric_vkitti_vits_outdoor_maxdepth80_1x3x518x518.onnx
- Segmentation Model (optional):
  - peopleseg_1x3x480x640.onnx
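To script the download, here is a minimal sketch, assuming the release assets follow GitHub's standard `releases/download/<tag>/<asset>` URL pattern for the `onnx` tag linked above:

```python
import urllib.request

BASE = "https://github.com/PINTO0309/yolo-depthanythingv2-merge/releases/download/onnx"
MODELS = [
    "yolov9_e_wholebody34_post_0100_1x3x480x640.onnx",
    "depth_anything_v2_small.onnx",
    "peopleseg_1x3x480x640.onnx",
]

for name in MODELS:
    # Fetch each asset into the current directory; skip files you do not need.
    urllib.request.urlretrieve(f"{BASE}/{name}", name)
    print(f"downloaded {name}")
```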
Create an integrated model combining YOLOv9 and DepthAnythingV2.
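The actual merge is performed by `merge_preprocess_onnx.py`, but the general mechanism can be sketched with `snc4onnx` (listed in the dependencies), which joins two graphs by wiring outputs of one to inputs of the other. The tensor names below are hypothetical placeholders; consult `merge_preprocess_onnx.py` for the real junction points:

```python
from snc4onnx import combine

combine(
    input_onnx_file_paths=[
        "yolov9_e_wholebody34_post_0100_1x3x480x640.onnx",
        "depth_anything_v2_small_490x644.onnx",
    ],
    # Hypothetical wiring: route the shared preprocessed image tensor
    # into DepthAnythingV2's input.
    srcop_destop=[["preprocessed_image", "pixel_values"]],
    output_onnx_file_path="yolov9_e_wholebody34_with_depth_post_0100_1x3x480x640.onnx",
)
```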
# Adjust DepthAnythingV2 model resolution (H and W must be multiples of 14)
H=490 # Multiples of 14
W=644 # Multiples of 14
ONNXSIM_FIXED_POINT_ITERS=10000 onnxsim depth_anything_v2_small.onnx depth_anything_v2_small_${H}x${W}.onnx \
--overwrite-input-shape "pixel_values:1,3,${H},${W}"
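The multiple-of-14 constraint comes from DepthAnythingV2's ViT patch size (14x14). A small helper (hypothetical, not part of the repository) to round a YOLOv9 resolution up to the nearest valid depth-model resolution:

```python
def to_multiple_of_14(x: int) -> int:
    """Round up to the nearest multiple of 14 (the ViT patch size)."""
    return ((x + 13) // 14) * 14

# A 480x640 YOLOv9 input maps to 490x644, the values used above.
print(to_multiple_of_14(480), to_multiple_of_14(640))  # 490 644
```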
# Merge models
python merge_preprocess_onnx.py

# Merge with segmentation support
python merge_preprocess_onnx_depth_seg.py

Run inference on an image folder:
python demo_yolov9_onnx_wholebody34_with_edges_with_depth.py \
-i ./images \
-ep cuda \
-dvw \
-dwk \
-kst 0.25 \
-dnm \
-dgm \
-dlr \
-dhm \
-kdm dot \
-edm

# Webcam (device index 0)
python demo_yolov9_onnx_wholebody34_with_edges_with_depth.py \
-v 0 \
-ep cuda
# Video file
python demo_yolov9_onnx_wholebody34_with_edges_with_depth.py \
-v ./video.mp4 \
-ep tensorrt

Run inference with segmentation support:

python demo_yolov9_onnx_wholebody34_with_edges_with_depth_seg.py \
-m yolov9_e_wholebody34_with_depth_seg_post_0100_1x3x480x640.onnx \
-i ./images \
-ep cuda \
-edm \
-ehd

- `-m, --model`: Path to the ONNX model file to use
- `-v, --video`: Video file path or camera index (0, 1, 2...)
- `-i, --images_dir`: Image folder path (supports jpg, png)
- `-ep, --execution_provider`: Provider to use for inference
  - `cpu`: CPU execution
  - `cuda`: CUDA GPU execution (recommended)
  - `tensorrt`: TensorRT execution (fastest)
- `-it, --inference_type`: Inference precision (`fp16` or `int8`)
- `-dvw, --disable_video_writer`: Disable the video writer (improves processing speed)
- `-dwk, --disable_waitKey`: Disable key-input wait (for batch processing)
- `-ost, --object_score_threshold`: Object detection score threshold (default: 0.35)
- `-ast, --attribute_score_threshold`: Attribute score threshold (default: 0.70)
- `-kst, --keypoint_threshold`: Keypoint score threshold (default: 0.25)
- `-kdm, --keypoint_drawing_mode`: Keypoint drawing mode (`dot`, `box`, `both`)
- `-dnm, --disable_generation_identification_mode`: Disable generation identification mode
- `-dgm, --disable_gender_identification_mode`: Disable gender identification mode
- `-dlr, --disable_left_and_right_hand_identification_mode`: Disable left/right hand identification mode
- `-dhm, --disable_headpose_identification_mode`: Disable head pose identification mode
- `-drc, --disable_render_classids`: Disable rendering of specific class IDs (e.g., `-drc 17 18 19`)
- `-efm, --enable_face_mosaic`: Enable face mosaic (toggle with F key)
- `-ebd, --enable_bone_drawing`: Enable bone drawing (toggle with B key)
- `-edm, --enable_depth_map_overlay`: Enable depth map overlay (toggle with D key)
- `-ehd, --enable_head_distance_measurement`: Enable head distance measurement (toggle with M key)
- `-oyt, --output_yolo_format_text`: Output YOLO-format text files and images
- `-bblw, --bounding_box_line_width`: Bounding box line width (default: 2)
- `-chf, --camera_horizontal_fov`: Camera horizontal FOV (default: 90 degrees)

Keyboard shortcuts during inference:

- `N`: Toggle generation identification mode
- `G`: Toggle gender identification mode
- `H`: Toggle left/right hand identification mode
- `P`: Toggle head pose identification mode
- `F`: Toggle face mosaic
- `B`: Toggle bone drawing
- `D`: Toggle depth map overlay
- `M`: Toggle head distance measurement
- `Q` or `ESC`: Exit
sit4onnx \
-if yolov9_e_wholebody34_with_depth_post_0100_1x3x480x640.onnx \
-oep tensorrt \
-fs 1 3 480 640

INFO: file: yolov9_e_wholebody34_with_depth_post_0100_1x3x480x640.onnx
INFO: providers: ['TensorrtExecutionProvider', 'CPUExecutionProvider']
INFO: input_name.1: input_bgr shape: [1, 3, 480, 640] dtype: float32
INFO: test_loop_count: 10
INFO: total elapsed time: 120.57948112487793 ms
INFO: avg elapsed time per pred: 12.057948112487793 ms
INFO: output_name.1: batchno_classid_score_x1y1x2y2_depth shape: [0, 8] dtype: float32
INFO: output_name.2: depth shape: [1, 1, 480, 640] dtype: float32
This example shows that with TensorRT on 480x640 input, inference runs at approximately 12 ms per frame (about 83 FPS).
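The merged model can also be driven directly from ONNXRuntime. A minimal sketch, assuming the I/O names shown in the benchmark log above and that the merged graph embeds its own preprocessing (it accepts a raw BGR frame as `input_bgr`); the demo scripts remain the reference usage:

```python
import cv2
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "yolov9_e_wholebody34_with_depth_post_0100_1x3x480x640.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

frame = cv2.resize(cv2.imread("sample.jpg"), (640, 480))        # BGR, 480x640
blob = frame.transpose(2, 0, 1)[np.newaxis].astype(np.float32)  # [1, 3, 480, 640]

boxes, depth = session.run(None, {"input_bgr": blob})
# boxes: [N, 8] rows of batchno, classid, score, x1, y1, x2, y2, depth
for batchno, classid, score, x1, y1, x2, y2, d in boxes:
    print(f"class {int(classid)}  score {score:.2f}  depth {d:.2f}")
# depth: [1, 1, 480, 640] per-pixel depth map aligned with the input frame
```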
# Prepare DepthAnythingV2 model with custom resolution
H=518 # Must be multiple of 14
W=518 # Must be multiple of 14
ONNXSIM_FIXED_POINT_ITERS=10000 onnxsim depth_anything_v2_small.onnx depth_anything_v2_small_${H}x${W}.onnx \
--overwrite-input-shape "pixel_values:1,3,${H},${W}"

For indoor environments:
python demo_yolov9_onnx_wholebody34_with_edges_with_depth.py \
-m yolov9_e_wholebody34_with_depth_metric_indoor_post_0100_1x3x480x640.onnx \
-i ./images \
-ep cuda \
-ehd # Enable head distance measurement

For outdoor environments:
python demo_yolov9_onnx_wholebody34_with_edges_with_depth.py \
-m yolov9_e_wholebody34_with_depth_metric_outdoor_post_0100_1x3x480x640.onnx \
-i ./images \
-ep cuda \
-ehd

# Run on CPU
python demo_yolov9_onnx_wholebody34_with_edges_with_depth.py \
-i ./images \
-ep cpu

Performance tips:

- Reduce the input image resolution
- Keep batch size at 1 (default)
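Provider selection also matters. A hedged sketch of how the `-ep` choices map to ONNXRuntime execution providers; the TensorRT keys below are standard ORT TensorRT EP options (engine caching avoids rebuilding engines on every run), but the demo script's actual configuration may differ:

```python
import onnxruntime as ort

providers_by_ep = {
    "cpu": ["CPUExecutionProvider"],
    "cuda": ["CUDAExecutionProvider", "CPUExecutionProvider"],
    "tensorrt": [
        ("TensorrtExecutionProvider", {
            "trt_engine_cache_enable": True,  # cache compiled engines
            "trt_engine_cache_path": ".",     # reuse the cache across runs
            "trt_fp16_enable": True,          # fp16 kernels for extra speed
        }),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
}

session = ort.InferenceSession(
    "yolov9_e_wholebody34_with_depth_post_0100_1x3x480x640.onnx",
    providers=providers_by_ep["tensorrt"],
)
```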
Verify that the required ONNX model files exist:
ls *.onnx

References:

https://github.com/PINTO0309/yolov9_wholebody34_heatmap_vis
https://github.com/liruilong940607/Pose2Seg
https://github.com/sreeharshaparuchur1/Pose2Seg-human-segmentation
https://github.com/azuraservices/human-pose-segmentation
Implementation ideas for high-accuracy Human-Instance-Segmentation (People-Instance-Segmentation, Person-Instance-Segmentation)
This project is licensed under the GNU General Public License v3.0. See the LICENSE file for details.





