How to calculate Shannon entropy based on HTTP header

Shannon entropy is given by the formula:

Shannon

Where Ti will be the data extracted from my network dump (dump.pcap).

The end of an HTTP header on a normal connection is marked by \r\n\r\n: full HTTP header

Example of an incomplete HTTP header (may be a denial of service attack):

insert the description of the image here

My goal is to calculate the entropy of the number of packets with \r\n\r\n and without \r\n\r\n in order to compare them.

I can read the PCAP file like this:

import pyshark

pkts = pyshark.FileCapture('dump.pcap')

The entropy based on IP numbers I did:

import numpy as np
import collections

sample_ips = [
    "131.084.001.031",
    "131.084.001.031",
    "131.284.001.031",
    "131.284.001.031",
    "131.284.001.000",
]

C = collections.Counter(sample_ips)
counts = np.array(list(C.values()),dtype=float)
#counts  = np.array(C.values(),dtype=float)
prob    = counts/counts.sum()
shannon_entropy = (-prob*np.log2(prob)).sum()
print (shannon_entropy)

Any ideas? Is it possible/does it make sense to calculate entropy based on the number of packets with \r\n\r\n and without \r\n\r\n? Or is it something that doesn't make sense? Any idea how to do the calculation?

The network dump is here: https://ufile.io/y5c7k

Some lines of it:

dump with HTTP filter

30  2017/246 11:20:00.304515    192.168.1.18    192.168.1.216   HTTP    339 GET / HTTP/1.1 


GET / HTTP/1.1
Host: 192.168.1.216
accept-language: en-US,en;q=0.5
accept-encoding: gzip, deflate
accept: */*
user-agent: Mozilla/5.0 (X11; Linux i686; rv:45.0) Gecko/20100101 Firefox/45.0
Connection: keep-alive
content-type: application/x-www-form-urlencoded; charset=UTF-8
Author: Ed S, 2017-09-01

1 answers

I don't know what the structure of your packet returned by pyshark looks like, but I imagine it has 2 info, the ip address and the contents of the packet. Imagining that you have these 2 pieces of information in a dict, you could do something like:

pkgs = [
    {
        'ip': '127.0.0.1',
        'content': 'Im a http header\r\n\r\n<html><body>',
    },
    {
        'ip': '127.0.0.1',
        'content': 'Im a not a http header',
    },
    {
        'ip': '127.0.0.2',
        'content': 'Im a http header\r\n\r\n<html><body>',
    },
    {
        'ip': '127.0.0.2',
        'content': 'Im a not a http header',
    },
    {
        'ip': '127.0.0.2',
        'content': 'Im a not a http header too',
    }
]

def is_http(content):
    return '\r\n\r\n' in content

classified_pkgs = [(p['ip'], is_http(p['content'])) for p in pkgs]
>> [('127.0.0.1', True),
>> ('127.0.0.1', False),
>> ('127.0.0.2', True),
>> ('127.0.0.2', False),
>> ('127.0.0.2', False)]

Then you just calculate the odds as you calculated before:

import numpy as np
import collections

counter = collections.Counter(classified_pkgs)
counts  = np.array(list(counter.values()),dtype=float)

prob = counts/counts.sum()
shannon_entropy = (-prob * np.log2(prob)).sum()
print (shannon_entropy)
 4
Author: Begnini, 2017-09-06 01:03:06