How to calculate Shannon entropy based on HTTP header
Shannon entropy is given by the formula:
Where Ti
will be the data extracted from my network dump (dump.pcap).
The end of an HTTP header on a normal connection is marked by \r\n\r\n
:
Example of an incomplete HTTP header (may be a denial of service attack):
My goal is to calculate the entropy of the number of packets with \r\n\r\n
and without \r\n\r\n
in order to compare them.
I can read the PCAP file like this:
import pyshark
pkts = pyshark.FileCapture('dump.pcap')
The entropy based on IP numbers I did:
import numpy as np
import collections
sample_ips = [
"131.084.001.031",
"131.084.001.031",
"131.284.001.031",
"131.284.001.031",
"131.284.001.000",
]
C = collections.Counter(sample_ips)
counts = np.array(list(C.values()),dtype=float)
#counts = np.array(C.values(),dtype=float)
prob = counts/counts.sum()
shannon_entropy = (-prob*np.log2(prob)).sum()
print (shannon_entropy)
Any ideas? Is it possible/does it make sense to calculate entropy based on the number of packets with \r\n\r\n
and without \r\n\r\n
? Or is it something that doesn't make sense?
Any idea how to do the calculation?
The network dump is here: https://ufile.io/y5c7k
Some lines of it:
30 2017/246 11:20:00.304515 192.168.1.18 192.168.1.216 HTTP 339 GET / HTTP/1.1
GET / HTTP/1.1
Host: 192.168.1.216
accept-language: en-US,en;q=0.5
accept-encoding: gzip, deflate
accept: */*
user-agent: Mozilla/5.0 (X11; Linux i686; rv:45.0) Gecko/20100101 Firefox/45.0
Connection: keep-alive
content-type: application/x-www-form-urlencoded; charset=UTF-8
1 answers
I don't know what the structure of your packet returned by pyshark looks like, but I imagine it has 2 info, the ip address and the contents of the packet. Imagining that you have these 2 pieces of information in a dict, you could do something like:
pkgs = [
{
'ip': '127.0.0.1',
'content': 'Im a http header\r\n\r\n<html><body>',
},
{
'ip': '127.0.0.1',
'content': 'Im a not a http header',
},
{
'ip': '127.0.0.2',
'content': 'Im a http header\r\n\r\n<html><body>',
},
{
'ip': '127.0.0.2',
'content': 'Im a not a http header',
},
{
'ip': '127.0.0.2',
'content': 'Im a not a http header too',
}
]
def is_http(content):
return '\r\n\r\n' in content
classified_pkgs = [(p['ip'], is_http(p['content'])) for p in pkgs]
>> [('127.0.0.1', True),
>> ('127.0.0.1', False),
>> ('127.0.0.2', True),
>> ('127.0.0.2', False),
>> ('127.0.0.2', False)]
Then you just calculate the odds as you calculated before:
import numpy as np
import collections
counter = collections.Counter(classified_pkgs)
counts = np.array(list(counter.values()),dtype=float)
prob = counts/counts.sum()
shannon_entropy = (-prob * np.log2(prob)).sum()
print (shannon_entropy)