Figure 5.1: Spamassassin hitcounts per machine (university mail servers)
bulk of the email is not spam. However, if an IP is found to be suspicious by the Netflow spam detection algorithm and a large percentage of the messages sent to the university mail server is labeled as spam, it is reasonable to assume that the IP is correctly detected as being a spam machine.
5.2 Validation results
The first validation was done on the first experimental steps, which are described in Chapter 3. This immediately poses a problem: the data was not recent, and DNS blacklists are, as explained, time sensitive. For this reason the results of this first validation are not discussed further in this chapter; they are available in Appendix B. After the first experimental steps in Chapter 3, the criteria for the detection mechanism are proposed in Chapter 4. Of course those should be validated, but before doing so it is interesting to get a sense of what can be expected. Therefore, two sets of 100 random IPs with outgoing SMTP connections were first picked from the live Utwente data. Those were validated against the set of 25 blacklists and the conservative set of 5 blacklists listed in Section 5.1.1. The number of IPs for which validation was negative is shown in Table 5.1.
Run    Optimistic    Conservative
1      1/100         5/100
2      7/100         11/100
Table 5.1: Number of IPs for which validation was negative, for two runs of 100 random IPs.
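The validation step above can be sketched as follows. This is a minimal illustration, not the tooling used for the thesis: it assumes the standard DNSBL convention (reversed IPv4 octets prepended to the blacklist zone, where any successful lookup means "listed"). The function names, the `resolves` callback, and the zone `bl.example.org` are hypothetical.

```python
import ipaddress
import socket
from typing import Callable, Iterable


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNSBL query name: reversed IPv4 octets followed by the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def dns_resolves(name: str) -> bool:
    """Return True if the name resolves (assumption: any answer means 'listed')."""
    try:
        socket.gethostbyname(name)
        return True
    except OSError:
        return False


def is_listed(ip: str, zones: Iterable[str],
              resolves: Callable[[str], bool] = dns_resolves) -> bool:
    """An IP validates positively if at least one blacklist lists it."""
    return any(resolves(dnsbl_query_name(ip, zone)) for zone in zones)


def count_unvalidated(ips: Iterable[str], zones: Iterable[str],
                      resolves: Callable[[str], bool] = dns_resolves) -> int:
    """Count the IPs that appear on none of the given blacklists."""
    zone_list = list(zones)
    return sum(1 for ip in ips if not is_listed(ip, zone_list, resolves))
```

Passing the resolver as a parameter keeps the counting logic testable without network access; running the same sample once with the larger zone list and once with the conservative list would give the two columns of a table like Table 5.1.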
Figure 5.2: Number of outgoing SMTP connections per IP
So picking random IPs already seems to be a good way to get a high percentage of IPs validated! Note, however, that the goal is not to get a high percentage of IPs validated, but to reliably detect spam machines. Randomly picking IPs might give a high percentage validated, but it will not exclude legitimate mail sending machines. Moreover, validation with DNS blacklists becomes difficult when such a high number of machines is already positively validated by randomly picking IPs. There is not much room for improvement left, so how can it be decided that the algorithm really works? And is it really the case that more than 90% of all IPs with outgoing SMTP connections are spamming? Figure 5.2 gives some more insight. Of the 1,064,363 IPs with outgoing SMTP connections, almost 40% have only one outgoing connection in 7 days and about 80% have fewer than 10 outgoing SMTP connections. So only 10% have more than 10 outgoing SMTP connections. Also note that this plot of all email observed in the Utwente dataset shows roughly the same picture as the plot of Spamassassin hitcounts on the University mail servers (Figure 5.1). Two notes can be made based on those plots:
1. Because about 80% of IPs with outgoing SMTP connections have fewer than 10 outgoing SMTP connections, those IPs do not have enough connections to base an analysis on. This is already handled by acceptance criterion 3 (Section 4.1.1) of the algorithm. The impact of this criterion is now known: only a small percentage of mail sending IPs remains after applying it.
2. As stated in the literature study in Section 2.2.3, botnet spam is on the rise. The percentages shown in Figures 5.1 and 5.2 closely resemble those of Abhinav Pathak et al. [11]. As they and others show, bots will send a low volume of spam to any single host to avoid detection (LVS, Low Volume Spammers). This could explain what is observed in the plots: a high number of IPs that have only a low number of outgoing connections. These are most probably IPs outside the Utwente domain that send a low volume of spam messages to our domain to avoid detection. This is effective against the proposed Netflow spam detection algorithm with the Utwente data, because probably only a small percentage of all outgoing SMTP traffic of such an IP is known if the IP is outside the Utwente domain. If the Netflow data of the main router(s) of such an IP were available, the same low number of outgoing connections would probably be observed towards a high number of distinct destinations, making detection possible again.
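The distribution discussed in these two notes can be computed along the following lines. This is a minimal sketch, assuming each flow record has been reduced to a hypothetical `(source IP, destination IP, destination port)` tuple and that SMTP is identified by destination port 25; the actual Netflow processing used for Figure 5.2 may differ.

```python
from collections import Counter
from typing import Iterable, Mapping, Tuple

SMTP_PORT = 25  # assumption: outgoing SMTP is recognised by destination port 25


def connections_per_ip(flows: Iterable[Tuple[str, str, int]]) -> Counter:
    """Count outgoing SMTP connections per source IP."""
    counts: Counter = Counter()
    for src_ip, _dst_ip, dst_port in flows:
        if dst_port == SMTP_PORT:
            counts[src_ip] += 1
    return counts


def share_below(counts: Mapping[str, int], limit: int) -> float:
    """Fraction of IPs with fewer than `limit` outgoing SMTP connections."""
    if not counts:
        return 0.0
    return sum(1 for n in counts.values() if n < limit) / len(counts)
```

For example, `share_below(counts, 2)` gives the share of IPs with a single outgoing connection, and `share_below(counts, 10)` the share that a minimum-connection threshold of 10 would discard.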
To conclude, bots outside the domain of the router(s) generating Netflow data are very difficult (possibly impossible) to detect with only the Netflow data itself when they have only a low number of outgoing SMTP connections directed at that domain. For bots inside our domain, a high number of distinct SMTP destinations will probably be observed, which is taken into account by ordering criterion 2 (Section 4.1.2). Based on those observations, another useful criterion to classify spammers seems to be the following ratio:
\[
\frac{\text{Number of outgoing connections}}{\text{Number of distinct destinations}} \tag{5.1}
\]
This has been left as future work. Anirudh Ramachandran et al. [21] have proposed very similar ideas as an approach to detect spamming machines. They used log files as the data source for their research; it would be interesting to try this approach with Netflow data as well.
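Since ratio (5.1) is left as future work, the following is only a sketch of how it could be computed, not a validated criterion. It assumes the outgoing SMTP flows are available as hypothetical `(source IP, destination IP)` pairs; no threshold is proposed here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def connection_destination_ratio(
        flows: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Per source IP: outgoing connections divided by distinct destinations."""
    total: Dict[str, int] = defaultdict(int)
    destinations: Dict[str, Set[str]] = defaultdict(set)
    for src_ip, dst_ip in flows:
        total[src_ip] += 1
        destinations[src_ip].add(dst_ip)
    # Every IP in `total` has at least one destination, so no division by zero.
    return {ip: total[ip] / len(destinations[ip]) for ip in total}
```

A ratio close to 1 means that nearly every connection goes to a different destination, whereas a machine that repeatedly delivers to a few fixed servers gets a higher value. Which range actually indicates spamming would have to be established by validation.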
After the first observations mentioned above, a parameter analysis was done on a fresh Netflow data capture of 7 days on the University routers. To avoid the low volume spammers (LVS), acceptance criterion 3 was set to at least 150 outgoing SMTP connections. The number of unvalidated machines, per criterion and combined, is shown in Table 5.2. The results for each row are described below:
1. The first row displays the number of machines that did not validate positively when the top 100 IPs with the highest number of outgoing SMTP connections are selected. As can be seen, 56 machines did not validate with the optimistic and 57 with the conservative DNS blacklist validation. Note that this is a much higher number of unvalidated machines than with randomly picking IPs! Also note that, because the top 100 IPs with the highest number of outgoing SMTP connections are selected as the base, acceptance criterion 3 (minimum number of outgoing connections) is not validated separately, as this set already satisfies this criterion.
2. The second row adds acceptance criterion 1, the ratio between incoming and outgoing SMTP connections. The optimistic validation gives a significant decrease in the number of unvalidated IPs (as expected), while the conservative validation has two more unvalidated IPs, which is a bit strange. No explanation could be found for this. However, a lot of legitimate mailing machines (such as the University mail servers) disappeared from the result set, which is definitely a good thing. Also, with this criterion a high number of IPs with suspicious behavior was found in the first experimental steps, described in Chapter 3. Altogether we conclude that this criterion is certainly a very important part of the algorithm: the optimistic validation and closer analysis show this, and the conservative observation is probably just a fluke.
                    Individual                   Combined
Criterion           Optimistic   Conservative    Optimistic   Conservative
1.  Base            56           57
2.  Acc. crit. 1    42           59
3.  Acc. crit. 2    50           67
4.  Combined 1                                   28           48
5.  Ord. crit. 1    38           55
6.  Combined 2                                   27           46
7.  Ord. crit. 2    51           66
8.  Combined 3                                   26           35
9.  Ord. crit. 3    4            4
10. Combined 4                                   27           38
11. Ord. crit. 4    9            10
12. Combined 5                                   19           22
13. Ord. crit. 5    32           44
14. Combined 6                                   13           16
15. Combined 7                                   1            1
Table 5.2: Parameter analysis results. The second column displays the results for individual parameters, the third column displays the results for sets of combined parameters.
3. The third row adds acceptance criterion 2, the number of distinct destinations. In this analysis the minimum is set to at least 5 distinct destinations (as the university has 5 load balanced mail servers). Note that this criterion is added to the base without acceptance criterion 1. The optimistic validation shows a slightly lower number of unvalidated machines, while the conservative validation shows a much higher number of unvalidated machines. By itself, this criterion does not seem to be a very good acceptance criterion. However, the idea is to combine it with acceptance criterion 1 to get the IPs with a high number of outgoing SMTP connections to several destinations and a low number of incoming SMTP connections (or none at all).
4. This row combines the criteria from the rows above. This time, both the optimistic and the conservative validation show a significant improvement compared to the base or to either criterion on its own. The combination really seems to do its work of selecting suspicious machines well. However, there is still a lot of room for improvement (28 and 48 unvalidated machines!).
5. This row shows the validation results for the first order criterion. Because this criterion simply states that suspicious machines have no incoming connections, it has been validated as if it were an acceptance criterion. The results show fewer unvalidated machines, so the assumption that machines without incoming connections are more suspicious seems to be correct.
6. When order criterion 1 is added to the algorithm, the results are slightly better, as can be expected.
7. Order criterion 2, which states that there should be at least 10 distinct destinations, is, just like order criterion 1, added as if it were an acceptance criterion. The results are very similar to those of acceptance criterion 2.
8. When order criterion 2 is added, the total result improves again, with slightly fewer unvalidated machines. Certainly when considering the improvement in the conservative validation, order criterion 2 seems to be a valid one.
9. This row shows the most surprising result. Order criterion 3 states that the higher the percentage of idle time (with respect to outgoing SMTP connections), the more suspicious an IP is. Because this is the first (and only) order criterion whose result is not binary (1 or 0) but a percentage, the validation was done by ordering the IPs that pass the acceptance criteria by the percentage of idle time (descending). The results show only 4 unvalidated machines for both the optimistic and the conservative validation! Apparently, ordering the machines that match the acceptance criteria by idle time is a large improvement compared to ordering by the number of outgoing connections!
10. Combining all criteria so far gives a slight increase in unvalidated results. Note, however, that this is the first validation where the order criteria are averaged to order the result set from the acceptance criteria. This result already indicates that better results can be obtained by just ordering by idle time instead of by the combination of the first 3 order criteria.
11. Order criterion 4 states that the standard deviation of the 5 minute time-span plot points of the number of outgoing SMTP connections should be larger than 1 for suspicious machines. This criterion also seems to be a very good measure to detect suspicious machines, with only 9 unvalidated machines in the optimistic validation and only 10 in the conservative validation.
12. Adding order criterion 4 to the algorithm results in a significant improvement. The combined result is the best so far (leaving aside ordering by order criterion 3 or 4 alone, which both give better results).
13. Order criterion 5 is a simple peak detection mechanism. As can be seen in row 13, there are fewer unvalidated IPs than in the base validation. This criterion seems to eliminate unvalidated machines, although the results are not as good as those of order criteria 3 and 4.
14. Adding order criterion 5 to the algorithm significantly reduces the number of unvalidated IPs. This is the last step of the algorithm and it yields the best results for the combined criteria: the algorithm indeed reduces the number of unvalidated IPs significantly compared to the base validation. However, there are two individual criteria analyses that give better results.
15. The best validation results were obtained by ordering the IPs from the result set of the acceptance criteria by idle time. After closer inspection of the result set of the complete algorithm ('Combined 6' in the table), it was found that almost all unvalidated machines had more than 80% of idle time. Thus it might make sense to make this an acceptance criterion, as this seems to eliminate a high number of false positives. The following criterion is defined:
\[
\frac{\text{Idle time frames}}{\text{All time frames}} > 0.8 \tag{5.2}
\]
Adding this as an acceptance criterion left only 1 unvalidated IP, so this version of the algorithm reaches a 99% validation rate.
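The idle time measure used in row 9 and in criterion (5.2) can be sketched as follows. This is an illustration under stated assumptions, not the implementation used for Table 5.2: it assumes 5 minute time frames (as used for order criterion 4), timestamps in seconds, and a frame counted as idle when it contains no outgoing SMTP connection. All function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

FRAME_SECONDS = 300  # assumption: 5 minute time frames


def frame_counts(timestamps: Sequence[float], start: float,
                 end: float) -> List[int]:
    """Number of outgoing SMTP connections in each frame of [start, end)."""
    n_frames = max(1, math.ceil((end - start) / FRAME_SECONDS))
    counts = [0] * n_frames
    for t in timestamps:
        if start <= t < end:
            counts[int((t - start) // FRAME_SECONDS)] += 1
    return counts


def idle_fraction(counts: Sequence[int]) -> float:
    """Idle time frames divided by all time frames, as in (5.2)."""
    return sum(1 for c in counts if c == 0) / len(counts)


def passes_idle_criterion(counts: Sequence[int],
                          threshold: float = 0.8) -> bool:
    """Acceptance test of (5.2): idle fraction strictly above the threshold."""
    return idle_fraction(counts) > threshold


def rank_by_idle_time(
        per_ip: Dict[str, Sequence[int]]) -> List[Tuple[str, float]]:
    """Order IPs by idle fraction, most idle (most suspicious) first."""
    scored = [(ip, idle_fraction(c)) for ip, c in per_ip.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

In this sketch, `rank_by_idle_time` corresponds to the ordering used for row 9, and `passes_idle_criterion` to the extra acceptance criterion of row 15; both would be applied only to IPs that already passed the other acceptance criteria.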