
Webhawk/Catch 3.0

Unsupervised machine learning web attack detection.

Image source: https://unsplash.com/photos/i4Y9hr5dxKc (Mathew Schwartz)

About

Webhawk/Catch helps automatically find web attack traces in HTTP logs and abnormal OS processes without using any preset rules. Using unsupervised machine learning, Catch groups log lines into clusters and detects the outliers that it considers potential attack traces.

The tool can parse both raw HTTP log files (Apache, Nginx, ...) and files containing OS statistics (generated by the top command). It takes these files as input and returns a report with a list of findings.

Catch uses PCA (Principal Component Analysis) to select the most relevant features (for example: user-agent, IP address, number of transmitted parameters, etc.). It then runs the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm to obtain the possible log line clusters and the anomalous points (potential attack traces).

Advanced users can fine-tune Catch through a set of options that help optimise the clustering algorithm (for example: the minimum number of points per cluster, or the maximum distance between two points within the same cluster).
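
As an illustration only (not the tool's actual code), the core idea can be sketched with scikit-learn: standardize the encoded features, reduce them with PCA, then cluster with DBSCAN; points labelled -1 are the outliers treated as potential attack traces. The feature matrix below is random placeholder data with one column per configured feature, and the n_components and eps values are arbitrary examples.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 11))  # placeholder for the encoded log-line features

X_std = StandardScaler().fit_transform(X)         # equivalent of --standardize_data
X_pca = PCA(n_components=3).fit_transform(X_std)  # keep the most relevant components

# eps: max distance between two points of a cluster; min_samples defaults to 2 in Catch
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X_pca)
outliers = np.where(labels == -1)[0]              # candidate attack traces
clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"{clusters} clusters, {len(outliers)} outlier points")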

The current version of Webhawk/Catch generates an easy-to-read HTML report which includes all the findings, and the severity of each one.

Webhawk/Catch is an open-source tool. Catch is the unsupervised version of Webhawk, a supervised machine learning based cyber-attack detection tool. In contrast to the supervised Webhawk, Catch can be used without manually pre-training a model, which makes it a lightweight and flexible solution for easily identifying potential attack traces. Catch is available as an independent repository on GitHub, and it is also included as part of Webhawk, which has 125 stars and 68 forks.

High level design

Production services setup

Configuring the webhawk/settings.conf file

Details about setting up this configuration file can be found in the Development setup / Create a settings.conf file section below.

Building and launching Docker services

docker compose build
docker compose up

Once the above commands have been launched, three services will be running:

Webhawk

This is the service used for detection; it takes a log file as input and returns detections.

Ollama

This service is used to send prompts to LLMs and get their responses (a request sketch follows this services list).

Webhawk web application

This service runs the web application where detection results are reviewed by cyber analysts.
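
As an illustration of the Ollama service, a prompt can be sent through Ollama's standard REST API. The snippet below is a hedged sketch, not Webhawk's internal call; the model name is a placeholder and should match the one configured in settings.conf.

import requests

# Send a single, non-streaming prompt to the Ollama service and print the reply
# (use http://ollama:11434 from inside docker compose, localhost otherwise)
payload = {
    "model": "llama3",  # placeholder; pick the model configured under [LLM]
    "prompt": "Analyze this web log line for malicious activity: ...",
    "stream": False,
}
response = requests.post("http://localhost:11434/api/generate", json=payload, timeout=120)
print(response.json().get("response"))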

Launching Docker services for the first time

If you are launching the Docker services right after the first build, expect some delay on the first agent request, which you can test using the script ./webhawk/api_test.py. This delay comes from downloading the selected Ollama model; note that it disappears if you have already added pulling this model to ./ollama_get_models_and_serve.sh.
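
To check whether the selected model has already been pulled (and therefore whether the first request will be slow), you can query Ollama's standard model-listing endpoint. A minimal sketch:

import requests

# List the models already pulled by the local Ollama service
tags = requests.get("http://localhost:11434/api/tags", timeout=10).json()
print([model.get("name") for model in tags.get("models", [])])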

Development setup

The goal of this section is to help you use Webhawk detection without the web application. For the next steps you need to be in the ./webhawk folder.

Using a Python virtual env

python -m venv webhawk_venv
source webhawk_venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

Create a settings.conf file

Copy the settings.conf.template file to settings.conf and fill it with the required parameters, as in the following example.

[FEATURES]
features:length,params_number,return_code,size,upper_cases,lower_cases,special_chars,url_depth,user_agent,http_query,ip

[LOG]
apache_regex:([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) (.+) "(.*?)" "(.*?)"
apache_names:["ip","date","query","code","size","referrer","user_agent"]

nginx_regex:([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+) (.+) "(.*?)" "(.*?)"
nginx_names:["ip","date","query","code","size","referrer","user_agent"]

http_regex:^(\d*?\.\d*?)\t.*?\t(.*?)\t.*?\t.*?\t.*?\t.*?\t(.*?\t.*?\t.*?\t.*?)\t(.*?)\t.*?\t(.*?)\t(.*?)\t.*$
http_names:["date","ip","query","user_agent","size","code"]

apache_error:
nginx_error:

[PROCESS_DETAILS]
attributes:['status', 'num_ctx_switches', 'memory_full_info', 'connections', 'cmdline', 'create_time', 'num_fds', 'cpu_percent', 'terminal', 'ppid', 'cwd', 'nice', 'username', 'cpu_times', 'memory_info', 'threads', 'open_files', 'name', 'num_threads', 'exe', 'uids', 'gids', 'memory_percent', 'environ']

[CVE]
source:https://services.nvd.nist.gov/rest/json/cves/2.0?keywordSearch=
year_threshold:YYYY

[LLM]
url:http://ollama:11434 #if using docker compose
url:http://localhost:11434 #if not using docker compose
model:intigration/analyzer:latest #or select a model of your choice
prompt:Analyze this web log line for malicious activity. Provide a brief one paragraph (less than 60 words) as a response. Indicate if there is a known related attack or vulnerability. Do not start with 'This log line'

[WEBAPP]
webhawk_ui
url:http://webhawk_ui:3000/api/v1/incidents #if using docker compose
url:http://localhost:3000/api/v1/incidents #if not using docker compose
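
As a hedged sketch (not the tool's actual loader), these settings can be read with Python's configparser; the lenient options below handle the duplicated url keys, the value-less webhawk_ui entry and the inline # comments, and the loaded apache_regex is sanity-checked against a sample log line.

import ast
import configparser
import re

# Lenient parsing: strict=False lets the last duplicate 'url' win,
# allow_no_value=True tolerates bare keys such as 'webhawk_ui',
# and ' #...' inline comments are stripped from the values.
config = configparser.ConfigParser(strict=False, allow_no_value=True,
                                   inline_comment_prefixes=("#",))
config.read("settings.conf")

features = config.get("FEATURES", "features").split(",")
apache_regex = config.get("LOG", "apache_regex")
apache_names = ast.literal_eval(config.get("LOG", "apache_names"))

sample = '127.0.0.1 - - [22/Dec/2022:10:00:00 +0100] "GET /index.php?id=1 HTTP/1.1" 200 1234 "-" "Mozilla/5.0"'
match = re.match(apache_regex, sample)
print(features)
print(dict(zip(apache_names, match.groups())) if match else "no match")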

Unsupervised detection usage

Catch.py script

python catch.py -h
usage: catch.py [-h] -l LOG_FILE -t LOG_TYPE [-e EPS] [-s MIN_SAMPLES] [-j LOG_LINES_LIMIT] [-y OPT_LAMDA] [-m MINORITY_THRESHOLD] [-p] [-o] [-r] [-z] [-b] [-c] [-v] [-a] [-q] [-f]

options:
  -h, --help            show this help message and exit
  -l, --log_file LOG_FILE
                        The raw log file
  -t, --log_type LOG_TYPE
                        apache, http, nginx or os_processes
  -e, --eps EPS         DBSCAN Epsilon value (Max distance between two points)
  -s, --min_samples MIN_SAMPLES
                        Minimum number of points within the same cluster. The default value is 2
  -j, --log_lines_limit LOG_LINES_LIMIT
                        The maximum number of log lines to consider
  -y, --opt_lamda OPT_LAMDA
                        Optimization lambda step
  -m, --minority_threshold MINORITY_THRESHOLD
                        Minority clusters threshold
  -p, --show_plots      Show informative plots
  -o, --standardize_data
                        Standardize feature values
  -r, --report          Create a HTML report
  -z, --opt_silouhette  Optimize DBSCAN silhouette
  -b, --debug           Activate debug logging
  -c, --label_encoding  Use label encoding instead of frequency encoding to encode categorical features
  -v, --find_cves       Find the CVE(s) that are related to the attack traces
  -a, --get_ai_advice   Get AI advice on the detection
  -q, --quick_scan      Only most critical detection (no minority clusters)
  -f, --submit_to_app   Submit the finding to Webhawk app

Example with Apache logs

Encoding is automatic in unsupervised mode. You just need to run the catch.py script, taking inspiration from the following example:

python catch.py -l ./SAMPLE_DATA/RAW_APACHE_LOGS/access.log.2022-12-22 --log_type apache --standardize_data --report --find_cves --get_ai_advice

With the --report option, this command produces an HTML report that lists the findings together with their severity.
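
The --find_cves option used above relies on the NVD keyword-search endpoint configured in the [CVE] section. As a hedged illustration of that lookup (the keyword is only an example; the field names follow the public NVD 2.0 JSON schema, not necessarily how catch.py parses it):

import requests

# Query the NVD CVE API (the endpoint configured under [CVE]) for a keyword
base = "https://services.nvd.nist.gov/rest/json/cves/2.0?keywordSearch="
data = requests.get(base + "phpmyadmin", timeout=30).json()
for vulnerability in data.get("vulnerabilities", [])[:3]:
    cve = vulnerability.get("cve", {})
    descriptions = cve.get("descriptions", [])
    summary = descriptions[0].get("value", "") if descriptions else ""
    print(cve.get("id"), "-", summary[:100])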

Example with OS processes

Before running catch.py, you need to generate a .txt file containing the OS process statistics by taking advantage of the top command:

top > PATH/os_processes.txt

You can then run catch.py to detect potential abnormal OS processes:

python catch.py -l PATH/os_processes.txt --log_type os_processes --show_plots --standardize_data --report
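
The attribute names listed under [PROCESS_DETAILS] match psutil's per-process attributes. As a hedged sketch (assuming psutil, which may or may not be what the tool uses internally), similar statistics can be collected directly in Python:

import psutil

# Collect a few of the per-process attributes listed under [PROCESS_DETAILS]
ATTRS = ["name", "username", "status", "ppid", "cpu_percent",
         "memory_percent", "num_threads", "cmdline", "create_time"]

for process in list(psutil.process_iter(attrs=ATTRS))[:5]:
    print(process.info)  # dict of the requested attributes for this process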

Webhawk API

Webhawk API can be launched using the following command:

uvicorn app:app --reload

The API can be tested using the script api_test.py or by running the following Python commands:

import requests

# Read a sample Apache log file and send its content to the /scan endpoint
with open("./SAMPLE_DATA/RAW_APACHE_LOGS/access.log.2017-05-24", "r") as f:
    logs = str(f.read())
params = {"hostname": "nothing", "logs_content": logs}
response = requests.post("http://127.0.0.1:8000/scan", json=params)
print(response.json())

Used sample data

The data you will find in the ./SAMPLE_DATA folder comes from https://www.secrepo.com.

Generating your own testing data

You can also generate test data using the script ./TESTING_LOGS_GENERATOR/apache_http_log_generator.py

Interesting data samples

https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3QBYB5

TODO

Adding more details to the high level design diagram
Adding findings one by one to the Webhawk UI
Enhancing the UI
Decoupling data transfer using Kafka (or equivalent)

Reference

Silhouette Efficiency
https://bioinformatics-training.github.io/intro-machine-learning-2017/clustering.html


Optimal Value of Epsilon
https://towardsdatascience.com/machine-learning-clustering-dbscan-determine-the-optimal-value-for-epsilon-eps-python-example-3100091cfbc


Max curvature point
https://towardsdatascience.com/detecting-knee-elbow-points-in-a-graph-d13fc517a63c

Contribution

All feedback, testing, and contributions are very welcome! If you would like to contribute, fork the project, add your changes, and submit a pull request.
