Archive for the ‘Malware’ Category

I’ve been a bit lax in writing this post; around a month ago Miguel Jacq got in contact to let me know about a couple of errors he encountered when running InfoSanity’s statistics script with a small data set. Basically, if your log file didn’t include any submissions, or covered a period shorter than 24 hours, the script would crash out. Not the biggest problem, as most people will be working with larger data sets, but annoying nonetheless.
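This is the classic empty-data-set failure mode. As a rough illustration only (a hypothetical sketch, not Miguel’s actual fix), a defensive version of the daily-average calculation might look like this:

```python
from datetime import datetime

def average_daily_submissions(timestamps):
    """Daily submission rate that tolerates empty or sub-24-hour logs."""
    if not timestamps:
        # no submissions logged at all: report zero instead of crashing
        return 0.0
    uptime_days = (max(timestamps) - min(timestamps)).total_seconds() / 86400.0
    if uptime_days < 1:
        # under a day of data: avoid dividing by a tiny (or zero) uptime
        return float(len(timestamps))
    return len(timestamps) / uptime_days
```

The two guard clauses are the whole point: every other statistic in the script degrades gracefully, but an average over zero days divides by zero.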

Not only did Miguel let me know about the issues, he was also gracious enough to provide a fix; the updated script can be found here. An example of the script in action is below:

cat /opt/dionaea/var/log/dionaea.log | python

Statistics engine written by Andrew Waite –

Number of submissions: 84
Number of unique samples: 39
Number of unique source IPs: 65

First sample seen: 2010-06-08 08:25:39.569003
Last sample seen: 2010-06-21 15:24:37.105594
System Uptime: 13 days, 6:58:57.536591
Average daily submissions: 6

Most recent submissions:
2010-06-21 15:24:37.105594,, emulate://, 56b8047f0f50238b62fa386ef109174e
2010-06-21 15:18:08.347568,, tftp://, fd28c5e1c38caa35bf5e1987e6167f4c
2010-06-21 15:17:08.391267,, tftp://, bb39f29fad85db12d9cf7195da0e1bfe
2010-06-21 06:29:03.565988,, tftp://, fd28c5e1c38caa35bf5e1987e6167f4c
2010-06-20 23:34:15.967299,,, 094e2eae3644691711771699f4947536

— Andrew Waite


Amun statistics

Amun has been running away quite happily in my lab since the initial install. From a statistics perspective my work has been made really easy, as Miguel Cabrerizo has previously taken one of the InfoSanity statistics scripts written for Nepenthes and Dionaea and adapted it to parse Amun’s submissions.log files.

Results generated from the script in my environment are below. If you want to get an overview of submissions from another Amun sensor, the script has been uploaded alongside the other InfoSanity resources and is available here.

~$ cat /opt/amun/logs/submissions.log* | ./

Statistics engine written by Andrew Waite ( modified by Miguel Cabrerizo (

Number of submissions      : 25
Number of unique samples   : 25
Number of unique source IPs: 18

Origin of the malware:
Ukraine :     1
None :     7
Poland :     2
Romania :     1
United States :     8
Russian Federation :     2
Hungary :     1
Norway :     1
Bulgaria :     2

Vulnerabilities exploited:
MS08067 :    13
DCOM :    12

Most recent submissions:
2010-05-31, 11:37:22,, 63.exe, acf5c09d547417fe53c163ec09199cab, MS08067
2010-05-30, 19:23:09,, 63.exe, 89b578839f1c39f79d48e5f9e70b5e2f, MS08067
2010-05-28, 10:27:03,, 63.exe, f7c4f677218070ab52d422b3c018a4ba, MS08067
2010-05-27, 16:23:14,, ssms.exe, 1f8a826b2ae94daa78f6542ad4ef173b, DCOM
2010-05-24, 19:46:35,, 63.exe, 53979f1820886f089a75689ed15ecf6e, MS08067
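For anyone curious what the adaptation involves, the core of such a parser can be sketched as follows. The field layout here is assumed from the output above (date, time, source IP, filename, MD5, vulnerability); the real submissions.log layout may differ between Amun versions.

```python
from collections import Counter

def summarise_submissions(lines):
    """Crude summary of Amun submission records, one record per line."""
    samples, sources, vulns = set(), set(), Counter()
    for line in lines:
        parts = [p.strip() for p in line.split(',')]
        if len(parts) < 6:
            continue  # skip blank or malformed lines
        _date, _time, src, _name, md5, vuln = parts[:6]
        samples.add(md5)    # unique binaries by hash
        sources.add(src)    # unique attacking hosts
        vulns[vuln] += 1    # exploit attempts per vulnerability
    return {'samples': len(samples), 'sources': len(sources),
            'vulns': dict(vulns)}
```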

A comment on a recent post asked for a comparison between different honeypots. While this is far from conclusive and only focuses on a single aspect of the technologies, one of InfoSanity’s Nepenthes sensors ‘saw’ more attacks in the last 24 hours than my Amun installation did in the almost three weeks shown above. As both are running within the same, small IP allocation, I think I’m safe in assuming that one IP isn’t actually receiving a disproportionate level of interest from the bad guys and bots that are out there.

— Andrew Waite

24hrs of HoneyD logs

After an initial setup and configuration of HoneyD I took a snapshot of the honeyd.log file after running for a 24hr period.

Running honeydsum against the log file generated some good overview information. There were over 12,000 connections made to the emulated network, averaging one connection every seven seconds. Despite the volume, each source generally initiated only a handful of connections, likely looking for a single particular service before moving on.

Top 10 Source Hosts
Rank     Source IP       Connections
1       3066
2      984
3           65
4          65
5             57
6        48
7            39
8               37
9       30
10          30

The summaries from honeydsum also suggest that the rate of incoming connections is generally constant. The only real variation was between 17:00 and 18:00, but the spike coincides with a single source IP running an ordered port sweep against a single target IP address, starting at TCP 1042 and running up to around TCP 1300. I’m not sure why anyone is scanning this particular port range (if anyone can provide additional information to slake my curiosity I’d appreciate it), but this event explains the outliers in both summary tables and highlights the dangers of working with a small data set.

Connections per Hour
Hour  Connections
00:00      329
01:00      325
02:00      281
03:00      366
04:00      360
05:00      322
06:00      300
07:00      299
08:00      258
09:00      369
10:00      317
11:00      324
12:00      423
13:00      367
14:00      351
15:00      479
16:00      486
17:00     3590
18:00      498
19:00      515
20:00      576
21:00      441
22:00      397
23:00      311
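The 17:00 sweep pattern is easy enough to spot programmatically once connections are reduced to (source, destination port) pairs; a sketch of one way to flag it (hypothetical helper, not part of honeydsum):

```python
from collections import defaultdict

def find_port_sweeps(events, min_run=50):
    """Flag sources that hit a long run of consecutive destination ports.

    events is an iterable of (source_ip, destination_port) pairs, e.g.
    parsed out of honeyd.log; min_run is the shortest consecutive run of
    ports treated as a sweep.
    """
    ports_by_src = defaultdict(set)
    for src, port in events:
        ports_by_src[src].add(port)
    sweepers = []
    for src, ports in ports_by_src.items():
        longest = run = 0
        prev = None
        for port in sorted(ports):
            # extend the run when ports are consecutive, else restart it
            run = run + 1 if prev is not None and port == prev + 1 else 1
            longest = max(longest, run)
            prev = port
        if longest >= min_run:
            sweepers.append(src)
    return sweepers
```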

The table below summarises the targeted resources within the environment. It shouldn’t come as a surprise that the most popular targets were TCP ports 445 and 135, but this is the case even though the honeyd configuration does not have any services listening on those ports. From this I would suggest that if you are trying to gather data on a particular port or service, you employ a filter (firewall/ACL/etc.) to block the noise before it reaches honeyd, keeping the log files relevant.

Top 10 Accessed Resources
Rank   Resource    Connections
1      445/tcp     7349
2      135/tcp     1086
3      8/icmp      123
4      22/tcp      102
5      1433/tcp    95
6      8080/tcp    73
7      4899/tcp    52
8      5900/tcp    39
9      10000/tcp   39
10     3/icmp      38
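If filtering in front of the sensor isn’t an option, the noise can at least be stripped from the logs after the fact. A sketch of a post-hoc filter; the field layout is assumed from honeyd’s whitespace-separated TCP/UDP log lines (timestamp, protocol, flag, source IP, source port, destination IP, destination port), so adjust the index if your build logs differently:

```python
NOISY_PORTS = {'445', '135'}

def drop_noisy_lines(log_lines, noisy=NOISY_PORTS):
    """Remove honeyd.log entries aimed at ports we aren't interested in."""
    kept = []
    for line in log_lines:
        fields = line.split()
        # destination port is assumed to be the seventh field
        if len(fields) >= 7 and fields[6] in noisy:
            continue  # known-noisy destination port: drop the line
        kept.append(line)
    return kept
```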

In addition to running honeydsum, the data set was run through InfoSanity’s script; the top 10 source countries are listed below. The results are likely skewed, as the largest ‘location’ in the result set is ‘None’ according to the GeoIP Country Lite database being used. One feature of the result set is that the country linked to the public IP addresses used by the honeyd environment did not feature in the list; as infrastructure improves and botnets become more prevalent, today’s malware no longer needs to target ‘closer’ IP addresses to remain efficient.

None:   692
United States:  196
Russian Federation:     123
Taiwan: 118
Brazil: 109
Germany:        99
Australia:      99
China:  90
Romania:        86
Italy:  82

— Andrew Waite

Categories: Honeypot, InfoSec, Malware

Book Review: Virtual Honeypots

It took longer than I had wanted, but I have just finished reading through Virtual Honeypots: From Botnet Tracking to Intrusion Detection. The book is written by Niels Provos, creator of HoneyD (among other things), and Thorsten Holz.

Given the authors, I had high expectations when the delivery came through; thankfully it didn’t disappoint. Unsurprisingly, the first chapter provides an overview of honeypotting in general, covering high- and low-interaction systems on both physical and virtual hosts; the chapter also introduces some core tools for your toolkit.

The next two chapters cover high- and low-interaction honeypots respectively. I really liked the coverage of high-interaction honeypots; it was this idea that drew me towards honeypots in the first place, as the prospect of watching an attacker carefully exploit and utilise a dummy system always appealed. The material provides a great foundation for starting with a high-interaction honeypot, along with some best-practice advice for doing so securely and safely. While I have read many reports and case studies that involved honeypots, I have had difficulty finding in-depth setup information and advice, leaving high-interaction honeypots feeling a bit like black magic. The authors’ information cuts through all the mystery, allowing the reader to get a firm understanding of the topic. The discussion of low-interaction honeypots is equally well covered, although as I’ve spent some time with low-interaction systems in the past, this chapter was more of a refresher than the source of new information I found the high-interaction section to be.

Given that Niels is one of the book’s authors, it shouldn’t be too much of a surprise that HoneyD is covered in depth. For me, this was the most useful section of the book. As HoneyD is one of the older publicly available low-interaction systems, I had mistakenly assumed that one of the newer systems would provide more functionality; after reading through the material and regularly going ‘ooh’ out loud, HoneyD is now firmly at the top of my ‘need to implement’ list.

The book also covers honeypot systems designed for specialised purposes. For malware collection the authors mainly focus on Nepenthes, but also touch on Honeytrap among others. This was the only section that I found slightly dated, as Nepenthes’ newly released spiritual successor, Dionaea, was not covered. But the fundamental material is very well explained, Nepenthes is still a very functional system, and there are inherent similarities between Nepenthes and Dionaea, so the material remains useful regardless; the chapter still provides an excellent foundation if you want to start collecting malware.

An interesting chapter covers hybrid honeypots: the idea of using low-interaction systems to monitor and handle the bulk of traffic, while forwarding anything unknown or unusual to a high-interaction system for more in-depth analysis of the attack traffic. Unfortunately, openly available hybrid systems are currently limited, with the more functional systems being kept closed by the researchers and companies that build them (though I have just found Honeybrid, which I wasn’t aware of, while looking for a good link for hybrid systems. Looks promising…).

The last chapter covering honeypot systems looks at client-side honeypots, designed to capture client-side attacks. As client-side attacks have become more prominent over the last few years this is an evolving area of research, but as the attack vector is newer, the honeypot systems aren’t as mature as their more traditional counterparts. This isn’t an area that I’m experienced with, so I can’t comment too much on the systems detailed by the authors, but they cover several honeyclient systems in great detail, and I’m intending to use the chapter as a foundation for implementing the systems and techniques proposed.

As well as detailing the use of honeypot systems, the authors provide a brilliant discussion of the ways attackers (or users) can determine that they are interacting with a honeypot. While the detailed descriptions of ways to identify a honeypot system are interesting and important from a theoretical standpoint, from previous experience running honeypot systems there are more than enough attackers and automated threats that blindly assume the system is legitimate for honeypots to still provide plenty of benefit to their administrators.

The book finishes up with a fairly detailed discussion of both tracking botnets using the information gathered from honeypot systems (this chapter is available as a sample PDF download, thanks to InformIT, here) and analysing the malware sample reports provided by CWSandbox. While both chapters are useful in the context of honeypot systems, I didn’t think there was enough room to provide anything beyond a general overview of the topics; a reader interested enough in the topic to purchase the book will likely already have a similar level of understanding to the information provided.

There is also a chapter of case studies of actual incidents captured by the book’s authors during their research. I’ve always been a fan of case studies, so I enjoyed this chapter; it definitely helps whet the appetite to implement the technologies covered by the book.

Overall I really enjoyed the book; if you’re interested in systems and network monitoring, honeypots or malware, then this book should probably be on your bookshelf.

Andrew Waite

Categories: honeyd, Honeypot, InfoSec, Malware

Fuzzy hashing, memory carving and malware identification

2009/12/15

I’ve recently been involved in a couple of discussions about different ways of identifying malware. One of the possibilities that has been brought up a couple of times is fuzzy hashing, intended to locate files based on similarities to known files. I must admit that I don’t fully understand the maths and logic behind creating or comparing fuzzy hash signatures. If you’re curious, Dustin Hurlbut has released a paper on the subject; Hurlbut’s abstract does a better job of explaining the general idea behind fuzzy hashing:

Fuzzy hashing allows the discovery of potentially incriminating documents that may not be located using traditional hashing methods. The use of the fuzzy hash is much like the fuzzy logic search; it is looking for documents that are similar but not exactly the same, called homologous files. Homologous files have identical strings of binary data; however they are not exact duplicates. An example would be two identical word processor documents, with a new paragraph added in the middle of one. To locate homologous files, they must be hashed traditionally in segments to identify the strings of identical data.
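The segment-hashing idea from the abstract can be illustrated with a deliberately crude toy (this is not ssdeep’s algorithm, just the underlying intuition): hash fixed-size blocks of each file and score similarity by how many block hashes are shared.

```python
import hashlib

def segment_hashes(data, block=64):
    """Hash fixed-size segments of a byte string."""
    return {hashlib.md5(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)}

def similarity(a, b, block=64):
    """Rough percentage of segment hashes shared by two byte strings.

    Note that fixed blocks break as soon as bytes are inserted mid-file,
    because every later block shifts out of alignment; solving that is
    exactly what ssdeep's rolling hash is for.
    """
    ha, hb = segment_hashes(a, block), segment_hashes(b, block)
    if not ha or not hb:
        return 0
    return 100 * len(ha & hb) // max(len(ha), len(hb))
```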

I have previously experimented with a tool called ssdeep, which implements the theory behind fuzzy hashing. To use ssdeep to find files similar to known malicious files, you can run ssdeep against the known samples to generate a signature hash, then run ssdeep against the files you are searching, comparing them with the previously generated sample.

One scenario I’ve used ssdeep for in the past is trying to group malware samples collected by honeypot systems based on functionality. I haven’t found this to be a promising line of research: as different malware families typically share the same or similar functionality, most of the samples showed a high level of comparison whether actually related or not.

Another scenario I had developed was running ssdeep against a clean WinXP install alongside a malicious binary. In the tests I ran, I haven’t found this to be a useful process; given the disk capacity available to modern systems, running ssdeep against a large HDD can be time consuming, and it can also generate a good number of false positives when run against the OS.

After recently reading Leon van der Eijk’s post on malware carving, I have been mulling over a method of combining techniques to improve fuzzy hashing’s ability to identify malicious files, while reducing the number of false positives and the workload required of an investigator. The theory was that, while any unexpected files on a system are undesirable, files that aren’t running in memory are less threatening than those that are active.

To test the theory I infected an XP SP2 victim with a sample of Blaster that had been harvested by my Dionaea honeypot and dumped the RAM following Leon’s methodology. Once the image was dissected by Foremost, I ran ssdeep against the extracted resources. Ssdeep successfully identified the malicious files with a 100% comparison to the malicious sample. So far so good.

Given my previous experience with ssdeep, I ran a control test, repeating the procedure against the dumped memory of a completely clean install. Unsurprisingly the comparison did not find a similar 100% match; however, it did falsely flag several files and artifacts with a 90%+ comparison, so there is still a significant risk of false positives.

I have learnt a fair deal from the process (reading and understanding Leon’s methodology was no comparison to putting it into practice), but I don’t intend to use these methods and techniques in real-world scenarios any time soon. Similar, and likely faster, results can be achieved by following Leon’s process completely and running the files carved by Foremost through an anti-virus scan.

Being able to test scenarios like this was the main reason I built up my test and development lab, which I have described previously. In particular, had I run the investigation on physical hardware, I would likely not have rebuilt the environment for the control test with a clean system, losing the additional data for comparison; virtualisation snapshots made re-running the scenario trivial.

–Andrew Waite

P.S. Big thanks to Leon for writing up the memory capture and carving process used as a foundation for testing this scenario.

Analysis: Honeypot Datasets

Earlier this week Markus released two anonymised data sets from live Dionaea installations. The full write-up and data sets can be found on the newly migrated news feed here. Perhaps unsurprisingly, I couldn’t help but run the data through my statistics scripts to get a quick idea of what was seen by the sensors.

This caused some immediate problems: before the data was released, Markus had contacted me to point out/complain that the performance of my script was far from ideal. Performance wasn’t an issue I had encountered, but the database from the sensor I run is ~1MB, while the smaller of the released data sets is ~300MB and the larger is 4.1GB. I immediately tried to rectify the problem, and am proud to report…

I failed miserably. I had tried to move some of the counting and loops out of the Python code and into more complex SQL queries, working on the theory that handling large datasets should be more efficient within the database, as databases are designed for working with sets of data. The theory was proved false, actually increasing run-time by about 20%, so I won’t be releasing the changes. Good job I’ve never claimed to be a developer. All that being said, the script still crunches through the raw data sets in 30 seconds and 3 minutes respectively.
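For reference, the kind of counting I tried pushing into SQL looks roughly like this. The table and column names below are assumed from memory of dionaea’s logsql schema and may not match your version, and an in-memory toy obviously won’t reproduce the performance behaviour of a 4.1GB file:

```python
import sqlite3

# in-memory stand-in for dionaea's logsql database (schema names assumed)
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE downloads (download_md5_hash TEXT, download_url TEXT)')
db.executemany('INSERT INTO downloads VALUES (?, ?)', [
    ('ae8705a7b4bf8c13e5d8214d374e6c34', 'tftp://host/a'),
    ('14a09a48ad23fe0ea5a180bee8cb750a', 'ftp://host/b'),
    ('14a09a48ad23fe0ea5a180bee8cb750a', 'ftp://host/b'),
])

# let the database do the counting rather than looping in Python
submissions, = db.execute('SELECT COUNT(*) FROM downloads').fetchone()
unique_samples, = db.execute(
    'SELECT COUNT(DISTINCT download_md5_hash) FROM downloads').fetchone()
```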

Without further ado, the Berlin data-set:

Statistics engine written by Andrew Waite –

Number of submissions: 2726
Number of unique samples: 133
Number of unique source IPs: 639

First sample seen: 2009-11-05 12:02:48.104760
Last sample seen: 2009-12-07 11:13:55.930130
System Running: 31 days, 23:11:07.825370
Average daily submissions: 87.935483871

Most recent submissions:
2009-12-07 11:13:55.930130,,, ae8705a7b4bf8c13e5d8214d374e6c34
2009-12-07 11:12:59.389940,, ftp://1:1@, 14a09a48ad23fe0ea5a180bee8cb750a
2009-12-07 11:10:27.296370,, tftp://, df51e3310ef609e908a6b487a28ac068
2009-12-07 10:55:24.607140,, tftp://, df51e3310ef609e908a6b487a28ac068
2009-12-07 10:43:48.872170,, ftp://1:1@, 14a09a48ad23fe0ea5a180bee8cb750a

And Paris:

Statistics engine written by Andrew Waite –

Number of submissions: 749518
Number of unique samples: 2064
Number of unique source IPs: 30808

First sample seen: 2009-11-30 03:10:24.591650
Last sample seen: 2009-12-07 08:46:23.657530
System Running: 7 days, 5:35:59.065880
Average daily submissions: 107074.0

Most recent submissions:
2009-12-07 08:46:23.657530,,, d45895e3980c96b077cb4ed8dc163db8
2009-12-07 08:46:20.985190,,, 94e689d7d6bc7c769d09a59066727497
2009-12-07 08:46:21.000540,,, 908f7f11efb709acac525c03839dc9e5
2009-12-07 08:46:18.398500,,, ed12bcac6439a640056b4795d22608da
2009-12-07 08:46:15.753080,,, 94e689d7d6bc7c769d09a59066727497

Still need to dig further into the data; there’ll be another post in the making if I uncover anything interesting…

— Andrew Waite

Categories: Dionaea, Honeypot, Malware

Expert speaker session at Northumbria University

Last week I had the pleasure of being asked to speak at Northumbria University, presenting to students of the Computer Forensics and Ethical Hacking for Computer Security programmes. As I graduated from Northumbria a few years ago, it was interesting to come back, see some familiar faces and have a look at how the facilities had developed.

Despite the nerves of having to speak in front of a crowd, I really enjoyed the event, especially as the other speakers were excellent and I enjoyed their sessions. The event kicked off with Dave Kennedy, a soon-to-retire member of Durham Police’s computer crime unit. Dave talked about his personal experience with a couple of high-profile cases, explaining some of the groundwork and behind-the-scenes activity that isn’t known to the general public. I found the information interesting, but also disturbing; given the nature of the material handled by Dave and his department, I can safely state that I wouldn’t want much experience in the area.

Next up was Phil Byrne, an internal auditor for HM Revenue and Customs (HMRC). For those that don’t know, HMRC were/are at the centre of one of the UK’s largest data-loss stories: in 2007, CDs containing approximately 25 million child benefit records were sent, unencrypted, by standard post and did not reach their intended destination (some backstory here). Phil talked openly about the incident, discussing both the incident itself and the changes made in response. One of Phil’s comments has stayed with me (if I’m mis-quoting, someone let me know):

If you put people into the process, something will go wrong at some time

Third to the stand was Gary Witts, owner of a managed services company specialising in on-line backups. The talk was very in-depth and had some interesting content, but from my perspective it felt more like a sales pitch than a technical discussion of secure backup’s place within a wider security strategy.

I took the fourth and final slot of the day, which left me in the unenviable position of standing between around 100 students and the pub, which didn’t help my usual rapid-fire presentation style. My presentation took a different focus from the previous sessions, discussing some of the real-world security incidents that can regularly be encountered, and offering some advice on handling them. I also discussed my findings from honeypot systems, introducing a less common method of monitoring an environment for malicious activity. Assuming the feedback I’ve received is genuine, the presentation seems to have been well received.

From a student’s perspective, Tom was in the audience and has been writing up his take on the event in a series of blog posts. Tom also recorded the talks; for anyone interested, a direct link to my session is available here.

Andrew Waite