Tracking changes in HTML pages with hashing and Python

In the digital world, data integrity plays an important role. Integrity checks are used primarily to make sure that data are transferred intact through a communication channel. Various hashing algorithms exist for verifying integrity, the most popular of which are MD5, SHA256 and SHA512. Hashing is the digital equivalent of what our brain does when we notice changes around us: we recall a past image and compare it with the current one, without maintaining a record of each and every detail.
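
As a minimal illustration of this idea, the sketch below uses Python's standard hashlib module to fingerprint two versions of some content and compare them. The sample byte strings and the helper name fingerprint are made up for this example.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of data."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical "past" and "current" snapshots of the same content
past = b"<html><body></body></html>"
current = b"<html><body><!-- x --></body></html>"

# Identical input always gives an identical digest
print(fingerprint(past) == fingerprint(past))
# Any change in the input gives a different digest
print(fingerprint(past) == fingerprint(current))
```

Only the short digests need to be stored and compared, not the content itself.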

Integrity checks can also be used to monitor attacker infrastructure and thus assist in threat intelligence. A human cannot tell by looking whether a blank HTML page, as displayed in the browser, hides content that changes over time. By hashing the response that a web server sends to clients and repeating this process frequently, it becomes possible to observe changes in the content of the tracked page.
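
The repeated check could be sketched roughly as follows. This is only an outline under assumptions: the function names fetch and watch, the rounds and delay parameters, and the injectable fetcher argument are all hypothetical and chosen for illustration.

```python
import hashlib
import time
import urllib.request


def fetch(url: str) -> bytes:
    """Download the raw response body for url."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def watch(url, rounds, delay, fetcher=fetch):
    """Hash the response `rounds` times, `delay` seconds apart.

    Returns the (0-based) round numbers in which the hash differed
    from the one seen in the previous round.
    """
    changes = []
    previous = None
    for n in range(rounds):
        digest = hashlib.sha256(fetcher(url)).hexdigest()
        if previous is not None and digest != previous:
            changes.append(n)
        previous = digest
        if n < rounds - 1:
            time.sleep(delay)
    return changes
```

A call such as watch(url, rounds=24, delay=3600) would check a page hourly for a day. Passing a different fetcher makes it possible to try the logic without touching the network.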

Threat Intel use case: Monitor attacker infrastructure

Let’s assume that a threat actor hosts malicious code on a benign-looking HTML page. For the sake of this post, let’s also assume that the malicious code is enclosed within HTML comments so that the launcher can locate it. The hosted code may change over time, as the attacker adopts new techniques, attacks new targets or makes any other changes to their infrastructure.

To effectively monitor these changes, hashing comes into play. By locating the area on a page where the malicious code is hosted, a threat intelligence analyst, or anyone who is keen to monitor attacker infrastructure, can simply hash that content and use the hash to track changes to the page.
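
To show why hashing only the delimited area is useful, here is a small in-memory sketch. The marker strings, the sample pages and the helper name region_hash are hypothetical, and the sketch slices on byte offsets for brevity, which is a simplification.

```python
import hashlib

START = b"<!--BEGIN-->"  # hypothetical start delimiter
END = b"<!--END-->"      # hypothetical end delimiter


def region_hash(page: bytes) -> str:
    """Hash only the bytes between START and END.

    Raises ValueError if either delimiter is missing.
    """
    begin = page.index(START) + len(START)
    end = page.index(END, begin)
    return hashlib.sha256(page[begin:end]).hexdigest()


# Three made-up snapshots of the same page
v1 = b"<html><!--BEGIN-->payload-1<!--END--><p>visits: 10</p></html>"
v2 = b"<html><!--BEGIN-->payload-1<!--END--><p>visits: 11</p></html>"
v3 = b"<html><!--BEGIN-->payload-2<!--END--><p>visits: 11</p></html>"

# An unrelated edit outside the region does not affect the hash
print(region_hash(v1) == region_hash(v2))
# A change inside the region is detected
print(region_hash(v2) == region_hash(v3))
```

Hashing the whole page instead would flag every cosmetic change, such as the counter in this example, as an alteration.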

Python code

This section hosts a short implementation in Python of what has been discussed above.

The code provided below:

fetches the content of a page from the provided URL
parses the page and locates the delimiters in the page
calculates the hash of the content that is enclosed within the delimiters
compares the calculated hash with a hardcoded hash

import urllib.request
from html.parser import HTMLParser
import hashlib
import io

deli_line=[]

class MyHTMLParser(HTMLParser):
    def handle_comment(self, deli):
        # record the line number of each delimiter comment
        # CHANGE ME: set the start/end delimiter comment text
        if deli in ['', '']:
            line, offset = self.getpos()
            deli_line.append(line)

def deltaf():
    # CHANGE ME
    url = ''
    req = urllib.request.Request(url)
    # the with block closes the response automatically
    with urllib.request.urlopen(req) as resp:
        data = resp.read()

    parser = MyHTMLParser()
    parser.feed(data.decode())
    parser.close()

    # parse stream into lines
    stream = io.BytesIO(data)
    lines = stream.readlines()

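    # stop early if both delimiters were not found
    if len(deli_line) < 2:
        print("[!] Delimiters not found")
        return

    # concatenate the lines strictly between the two delimiter lines
    b = b''
    for i in range(deli_line[0], deli_line[1]-1):
        b = b + lines[i]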

    # calculate hash
    h = hashlib.new('sha256')
    h.update(b)
    newhsh = h.hexdigest()

    # CHANGE ME: hardcoded hash to compare against
    hsh = ''
    if newhsh != hsh:
        print("[!] Page has been altered")
    else:
        print("[+] No change")

if __name__ == '__main__':
    deltaf()
