Automatic filtering in URL scanning
Puttonen, Jarmo (2020)
Degree Programme in Information Technology, MSc (Tech)
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2020-05-29
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202005255621
Abstract
Efficient black-box web security testing requires that the penetration tester detect all hidden and visible web content on the targeted web server. Hidden content is usually detected by enumerating commonly occurring web page paths from a wordlist and inspecting how the server responds to each request.
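As an illustration of the enumeration described above (this sketch is not from the thesis; the function name enumerate_paths and all parameters are hypothetical), a minimal wordlist-based scan in Python could look like this:

import urllib.error
import urllib.request

def enumerate_paths(base_url, wordlist_file):
    # Read candidate paths, skipping blank lines.
    with open(wordlist_file) as f:
        words = [line.strip() for line in f if line.strip()]
    for word in words:
        url = f"{base_url.rstrip('/')}/{word}"
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                print(resp.status, url)  # candidate exists (or appears to)
        except urllib.error.HTTPError as e:
            # A plain 404 is the RFC 7231 "not found" case; any other
            # error status may still interest the tester.
            if e.code != 404:
                print(e.code, url)
        except urllib.error.URLError:
            pass  # connection-level failure; skip this candidate

The filtering problem discussed below arises precisely because the 404 check in this naive loop cannot be trusted.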
RFC 7231 specifies that a 404 Not Found status code can be used in HTTP responses to inform the client that the requested path was not found on the web server. If all web applications strictly adhered to RFC 7231, there would be no need for this thesis. However, there are countless ways a website can react to a request for a nonexistent path. The server's reaction can depend, for example, on the web server software, the web application and its frameworks, system configuration, customized error pages, firewalls, load balancers, and so on. The HTTP response content can be almost anything: an empty page, a redirection, or a generic web page that changes slightly on each request. As servers cannot be trusted to always return 404 status codes for nonexistent paths, interpreting the responses can require manual labour. When scanning thousands of hosts, the amount of manual labour required can become prohibitive.
Various popular open-source URL scanners provide automatic filtering capabilities that attempt to reduce the required manual work. These capabilities try to detect what kind of responses the targeted server gives for nonexistent paths, and that information is then used to hide false positives from the results automatically. The implementations of these capabilities are often simplistic, and better results could be achieved by improving the designs.
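A hedged sketch of the calibration idea behind such filtering (again hypothetical, not the thesis's implementation): probe the server with a random path that should not exist, record the response's "shape", and hide later results that match it. Real scanners compare more attributes, such as redirect targets and fuzzy content similarity; this minimal version uses only the status code and body length.

import secrets
import urllib.error
import urllib.request

def fetch(url):
    # Return (status_code, body_length) for a single request.
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status, len(resp.read())
    except urllib.error.HTTPError as e:
        return e.code, len(e.read())

def calibrate(base_url):
    # Probe a random path that is extremely unlikely to exist and
    # record how the server answers it.
    random_path = secrets.token_hex(16)
    return fetch(f"{base_url.rstrip('/')}/{random_path}")

def looks_like_false_positive(baseline, status, length, tolerance=32):
    # Treat a result as a false positive when it matches the baseline
    # status and its body length is within a small tolerance; the
    # tolerance absorbs error pages that change slightly per request.
    base_status, base_length = baseline
    return status == base_status and abs(length - base_length) <= tolerance

The length tolerance is one example of the design trade-off the thesis examines: a looser comparison hides more false positives but risks hiding edge-case true positives as well.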
The beginning of this thesis introduces various improvements to the known automatic filtering capabilities. Other lesser-known methods for filtering false positives are also presented. Developers of URL scanners can apply the provided information selectively when implementing or improving automatic filtering capabilities in URL scanners.
As a result of this thesis, an adaptive URL scanner was implemented for the Bountyflow automation framework of Tmi ROT. The implemented URL scanner uses the introduced improvements and attempts to maximize false positive detection at the cost of potentially hiding edge-case true positives. The implemented URL scanner was compared against popular open-source URL scanners in a brief examination against real-world targets. In the comparison, the implemented URL scanner was the most effective at removing false positives and duplicates.