Exposing TCP watermarks in the Tor network using deep learning: Emphasis on timing-based watermarks and convolutional neural networks
Kivikangas, Marko (2024)
Master's Programme in Information Technology
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2024-09-04
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202408178152
Abstract
Transmission Control Protocol, or TCP, is a connection-oriented protocol in the Internet protocol suite (TCP/IP) that offers reliable data transfer between two communicating parties. Many large Internet applications and services, including the World Wide Web, email and file transfer protocols, use it. Although the role of TCP has decreased with the rise of other protocols such as QUIC and UDP, it still plays an important part in communications. One network that relies on TCP is The Onion Router, or Tor.
Tor is both a piece of software and an anonymity network. It achieves anonymity by routing encrypted traffic through multiple nodes, or relays. The same mechanism also allows servers and websites to remain anonymous. The downside of Tor is that it also enables criminal activity, because criminals can hide behind this anonymity and use Tor to host illegal platforms such as drug marketplaces.
We challenge the anonymity aspect of Tor with a method called watermarking. It means implanting small marks into the network flow at the sender end and attempting to detect them at the receiving end. A detected watermark indicates that the two parties are communicating, thus breaking the anonymity of the network.
The watermarking algorithms we use are called Interval-based watermarking (IBW) and Scalable watermark that is invisible and resilient to packet losses (SWIRL). The algorithms are timing-based, meaning they use the time between the packets, the inter-packet delay, to implant the watermarks. The experiments require a watermarking module, which is a modified TCP/IP stack implementation. We capture the network flow data with the packet analyser Wireshark. The collected datasets, as well as the watermarking module, are publicly available.
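The idea of a timing-based watermark can be illustrated with a minimal Python sketch. It is not the thesis's watermarking module and not a faithful implementation of IBW or SWIRL; it only assumes that time is divided into fixed-length intervals and that a mark is implanted by delaying every packet in a chosen interval to that interval's end. The function names, the interval length and the timestamps are all hypothetical.

```python
from typing import List


def inter_packet_delays(timestamps: List[float]) -> List[float]:
    """Return the gaps (in seconds) between consecutive packet timestamps."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def clear_interval(timestamps: List[float], index: int, length: float) -> List[float]:
    """Delay every packet inside interval `index` to that interval's end.

    Simplifying assumption: interval `index` covers the half-open range
    [index * length, (index + 1) * length). Packets can only be delayed,
    never sent earlier, so they are moved to the end boundary.
    """
    start = index * length
    end = start + length
    return [end if start <= t < end else t for t in timestamps]


def packets_per_interval(timestamps: List[float], length: float, count: int) -> List[int]:
    """Count packets in each of the first `count` intervals (the detector's view)."""
    counts = [0] * count
    for t in timestamps:
        slot = int(t // length)
        if 0 <= slot < count:
            counts[slot] += 1
    return counts


# Hypothetical packet send times in seconds, with one-second intervals.
original = [0.1, 0.6, 1.2, 1.4, 2.3]
marked = clear_interval(original, index=1, length=1.0)
```

In this toy flow, the interval from 1.0 s to 2.0 s is left empty and its packets arrive together at 2.0 s. An observer at the receiving end who counts packets per interval could notice the empty interval followed by a loaded one, which is the kind of pattern a detector looks for.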
To detect the watermarks, we use a convolutional neural network, or CNN. Neural networks are relatively new in the field of network flow classification. The results show that, depending on the algorithm and the available dataset size, the classification accuracy of a CNN is between 68% and 98%, with a false positive rate that is often low and a true positive rate close to 100%.
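For reference, the three reported metrics follow from the counts of a binary confusion matrix, where a positive is a flow classified as watermarked. The sketch below uses the standard definitions; the function name and the counts are made up for illustration and are not results from the thesis.

```python
from typing import Dict


def classification_rates(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, true positive rate and false positive rate.

    tp: watermarked flows classified as watermarked
    fn: watermarked flows classified as clean
    fp: clean flows classified as watermarked
    tn: clean flows classified as clean
    """
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one watermarked and one clean flow")
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }


# Illustrative counts only: 100 watermarked and 100 clean flows.
rates = classification_rates(tp=99, fp=5, tn=95, fn=1)
```

A classifier can show a true positive rate near 100% and still have modest accuracy if it flags many clean flows, which is why the false positive rate is reported alongside the other two figures.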