A Website Defacement Detection Method Based on Machine Learning Techniques
Xuan Dau Hoang
Cyber Security Lab, Faculty of Information Technology, Posts and Telecommunications Institute of Technology, Hanoi, Vietnam
ABSTRACT
Website defacement attacks have been one of the major threats to websites and web portals of private and public organizations. Such attacks can cause serious consequences to website owners, including interrupting website operations and damaging the owner's reputation, which may lead to big financial losses. A number of techniques have been proposed for website defacement monitoring and detection, such as checksum comparison, diff comparison, DOM tree analysis and complex algorithms. However, some of them only work on static web pages and others require extensive computational resources. In this paper, we propose a machine learning-based method for website defacement detection. In our method, machine learning techniques are used to build classifiers (the detection profile) that classify pages into either the Normal or the Attacked class. As the detection profile can be learned from training data, our method works well for both static and dynamic web pages. Experimental results show that our approach achieves a high detection accuracy of over 93% and a low false positive rate of less than 1%. In addition, our method does not require extensive computational resources, so it is practical for online deployment.
KEYWORDS
Website Defacement Attack, Website Defacement Detection, Anomaly-based Attack Detection, Machine Learning-based Attack Detection
SoICT ’18, December 6–7, 2018, Da Nang City, Viet Nam.
© 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6539-0/18/12...$15.00
https://doi.org/10.1145/3287921.3287975
1 INTRODUCTION
Website defacement is a form of attack that modifies a website's content and thereby changes its appearance [1][2]. Fig. 1 shows the home page of Rach Gia airport's website, which was defaced in March 2017 [1], and Fig. 2 shows a web page defaced by the "Anonymous" hacker group [3]. According to a number of reports, the number of website defacement attacks reported worldwide increased sharply in 2010-2011 and 2012-2013 and has tended to decrease in recent years [2]. Nevertheless, hundreds of websites and web portals are still defaced daily around the world [2].
Websites are defaced for many reasons, but mainly because they contain serious security holes that allow hackers to upload files to the servers or to gain access to the websites' administrative pages. Security holes in websites or hosting servers that hackers can exploit to launch defacement attacks include SQL injection (SQLi), Cross-Site Scripting (XSS), local or remote file inclusion, improper account and password administration, and out-of-date software [1][2].
Website defacement attacks can cause serious consequences to website owners. The attacks can interrupt website operations, damage the owner's reputation and cause data losses, which in turn may lead to big financial losses. Due to the prevalence of website defacement attacks and their consequences, a number of defensive measures have been developed and deployed in practice. These measures include (1) scanning for and fixing security holes, such as SQLi, XSS and file inclusion, in websites; and (2) using website defacement monitoring and detection tools, such as VNCS Web Monitoring [4] and Nagios Web Application Monitoring Software [5].
In this paper, we propose a machine learning-based method for website defacement detection. The proposed scheme treats web pages as text documents and transforms the problem of website defacement detection into a text document classification problem. In this scheme, machine learning techniques are used to build classifiers that classify monitored pages into either the Normal or the Attacked class.
The rest of this paper is organized as follows: Section 2 reviews related works; Section 3 presents our proposed method and experiments; and Section 4 concludes the paper.
2 RELATED WORKS
A number of techniques have been proposed for website defacement monitoring and detection, including signature-based detection and anomaly-based detection [1][2][8][10]. Due to the space limit, this section introduces some typical anomaly-based techniques for website defacement detection. An anomaly-based detection technique first builds a "profile" of the web pages of a website in normal operation. The web pages are then monitored and compared with the profile to find differences; if any significant difference is found, an alarm is raised. The main advantage of this approach is its potential to detect new forms of attacks. However, it is difficult to determine the "significant difference" threshold between the monitored page and the profile because the web page's content changes frequently.
Anomaly-based techniques for website defacement detection include checksum comparison, diff comparison, DOM tree analysis, and complex algorithms, such as machine learning, data mining, genetic programming and page screenshot analysis [8][10]. The following parts of this section discuss them in detail.
Website defacement detection based on checksum comparison is the simplest method to detect changes in web pages. First, a checksum of the web page's content (or part of it) is computed using a hashing algorithm such as MD5 or SHA1 and stored in the profile. The monitored page is then periodically downloaded, its new checksum is computed and compared with the corresponding checksum stored in the profile. If the two checksums differ, an attack alarm is raised. This method works well on static web pages, but it is not applicable to dynamic web pages, such as e-commerce pages [8][10].
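To make this concrete, the following is a minimal Python sketch of checksum-based monitoring; the monitored URL (https://example.org/) and the choice of SHA-1 are illustrative assumptions, not details from [8][10].

```python
import hashlib
import urllib.request

def page_checksum(url: str) -> str:
    """Download a page and return the SHA-1 digest of its HTML content."""
    html = urllib.request.urlopen(url).read()
    return hashlib.sha1(html).hexdigest()

# Profile: checksums recorded while the site is known to be clean.
profile = {"https://example.org/": page_checksum("https://example.org/")}

# Monitoring: any checksum mismatch raises a defacement alarm.
for url, stored_checksum in profile.items():
    if page_checksum(url) != stored_checksum:
        print(f"ALARM: {url} has changed")
```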
Diff is a comparison tool that finds the differences between the contents of two web pages and is widely available on Linux and Unix platforms. For defacement detection, the current content of a monitored page is diffed against a reference copy stored in the profile, and the amount of difference is compared with a per-page anomaly detection threshold. The most important task is to determine this threshold for each monitored page. The method is quite effective and works well on most dynamic websites if the anomaly detection threshold is determined properly [8][10].
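As a rough illustration, the sketch below approximates diff-based comparison with Python's difflib; the 30% threshold and the sample pages are hypothetical, and a real deployment would tune the threshold per page.

```python
import difflib

def change_ratio(reference_html: str, current_html: str) -> float:
    """Fraction of the page that differs from the stored reference copy."""
    matcher = difflib.SequenceMatcher(None, reference_html, current_html)
    return 1.0 - matcher.ratio()

THRESHOLD = 0.30  # hypothetical per-page anomaly detection threshold

reference = "<html><body><h1>Welcome</h1><p>Today's news ...</p></body></html>"
current = "<html><body><h1>Hacked by ...</h1></body></html>"

if change_ratio(reference, current) > THRESHOLD:
    print("ALARM: change exceeds the anomaly detection threshold")
```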
DOM (Document Object Model) is an application programming interface (API) that defines the logical structure of HTML documents, i.e. web pages. The DOM can be used to browse and analyze the composition of a web page. Website defacement detection based on DOM tree analysis detects changes in the page structure rather than in the page content. In general, this approach is likely to work well on web pages that have stable structures [8][10].
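A minimal sketch of the structural idea follows, assuming the sequence of opening tags is a sufficient stand-in for the DOM tree; real DOM-based detectors compare richer tree representations.

```python
from html.parser import HTMLParser

class TagSequenceExtractor(HTMLParser):
    """Collect the sequence of opening tags as a crude structural signature."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def structure_signature(html: str) -> tuple:
    parser = TagSequenceExtractor()
    parser.feed(html)
    return tuple(parser.tags)

# Text content is ignored, so routine content updates on a dynamic page
# do not trigger an alarm, while changes to the page structure do.
reference = "<html><body><div><p>old text</p></div></body></html>"
current = "<html><body><div><p>new text</p></div></body></html>"
assert structure_signature(current) == structure_signature(reference)
```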
In this part, we review some typical complex techniques for website defacement detection, namely those of Kim et al. [6], Medvet et al. [7][9] and Borgolte et al. [11].
Kim et al. [6] proposed a 2-gram method for building a profile of normal web pages for monitoring and detecting page defacement. In the training phase of this scheme, the page HTML content is vectorized using n-grams and their occurrence frequencies. Based on a statistical survey, the authors concluded that the 300 2-grams with the highest occurrence frequencies are sufficient to represent a page for defacement detection. In the detection phase, the monitored page is downloaded, vectorized using the same method and then compared with the profile's page vectors using the cosine distance. The paper also proposes an algorithm that dynamically generates and updates the detection threshold for each page. The advantage of this approach is that the dynamic detection threshold reduces the false detection rate. However, its main drawback is that it requires extensive computational resources, both for the dynamic threshold generation algorithm for each page and for computing and comparing the cosine distance between the monitored page and the large number of normal pages in the detection profile.
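To make the comparison step concrete, here is a minimal sketch of 2-gram vectorization and cosine distance; character-level 2-grams are our assumption, since the tokenization in [6] is not fully specified here.

```python
import math
from collections import Counter

def two_grams(html: str) -> Counter:
    """Count overlapping character 2-grams in the page HTML."""
    return Counter(html[i:i + 2] for i in range(len(html) - 1))

def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

# The monitored page is compared against stored normal-page vectors;
# a distance above the dynamically updated threshold raises an alarm.
normal = two_grams("<html><body>Welcome to our site</body></html>")
monitored = two_grams("<html><body>Hacked by ...</body></html>")
print(cosine_distance(normal, monitored))
```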
Medvet et al. [7][9] proposed to use genetic programming to construct the profile for website defacement detection. In this approach, 43 sensors are used to monitor and extract information from monitored pages, and each web page is transformed into a vector of 1466 elements. In the training phase, normal web pages are collected and vectorized to build the profile using genetic programming. In the detection phase, the monitored page is collected, vectorized and compared with the profile to find differences. The major disadvantage of this approach is that constructing the profile requires extensive computational resources, because of the large page vectors and the slow convergence of genetic programming.
Borgolte et al. [11] built Meerkat, a website defacement detection system based on image object recognition of web page screenshots using computer vision techniques. The inputs to the system are the URLs of the monitored web pages. Meerkat takes page screenshots and analyzes high-level screenshot features using advanced machine learning techniques, such as stacked autoencoders and deep neural networks, to find differences. Experimental results on 10 million defaced web pages and 2.5 million normal web pages show that the system achieves a high detection accuracy of 97.422% to 98.816% and a low false positive rate of 0.547% to 1.528%. Meerkat's strong points are that the detection profile can be built automatically and that it was tested on a large number of web pages. However, its major drawback is that it requires extensive computational resources for highly complex image processing and recognition. Moreover, Meerkat's processing may also be slow because a web page must be fully rendered in order to take its screenshot.
In this part, we introduce two common tools for website defacement monitoring: VNCS Web Monitoring [4] and Nagios Web Application Monitoring Software [5].
VNCS Web Monitoring. VNCS Web Monitoring [4] is a product of the Vietnam Cyber Security company that can monitor multiple websites based on real-time web log collection and processing using the Splunk platform. Fig. 3 shows the website monitoring status provided by the tool. VNCS Web Monitoring provides centralized web log management; automatic log analysis to detect website incidents, website defacements, SQL injection and XSS attacks; and real-time alerting.
The drawbacks of this tool are its high installation and operating costs and its reliance on checksum and direct page content comparison, which may generate a high volume of false positive alarms on dynamic websites.
Figure 3: Website status of VNCS Web Monitoring [4]
Nagios Web Application Monitoring Software. Nagios Web Application Monitoring Software [5] is a commercial tool that provides features including website availability monitoring, URL monitoring, HTTP status monitoring, website transaction monitoring and site content monitoring. Fig. 4 shows the main monitoring screen of Nagios XI [5].
Figure 4: Main monitoring screen of Nagios XI [5]
In addition, Nagios Web Application Monitoring Software provides a number of additional tools that allow administrators to configure website monitoring more easily, including the Website Monitoring Wizard, the Website URL Monitoring Wizard, the Website Transaction Monitoring Wizard and the HTTP Monitoring Plugins.
The disadvantages of Nagios Web Application Monitoring Software are that it is expensive, being a commercial solution, and that it only uses checksum and direct page content comparison, which may generate a high volume of false positive alarms on dynamic websites.
Based on this review of anomaly-based techniques for website defacement detection, some observations can be drawn:
Detection techniques using checksum comparison, diff comparison and DOM tree analysis can only be used effectively for static web pages. In addition, the determination of the appropriate detection threshold for each web page is challenging.
Detection techniques using statistical, machine learning and data mining techniques are promising because the detection profile and threshold can be "learned" automatically from training data. The common drawback of the proposals of Kim et al. [6], Medvet et al. [7][9] and Borgolte et al. [11] is their extensive computational requirements, caused by either large feature sets or highly complex processing algorithms. This may limit their application in practice.
Commercial website monitoring tools like VNCS Web Monitoring [4] and Nagios Web Application Monitoring Software [5] have two common drawbacks: they are expensive because they are commercial solutions, and they only use checksum and direct page content comparison, which may generate a high volume of false positive alarms on dynamic websites.
3 THE PROPOSED METHOD AND EXPERIMENTS
Our method builds on the idea of Kim et al. [6], in which web pages are vectorized using the n-gram method and the n-grams' occurrence frequencies. We use both 2-grams and 3-grams in our experiments to find the n-gram size that gives the best performance in terms of detection accuracy and computational resources. Furthermore, we use machine learning techniques to learn the detection profile from training data of normal and defaced pages. This makes our detection scheme more efficient because it eliminates the task of generating and updating a dynamic detection threshold for each page.
The proposed website defacement detection method consists of two phases: (1) the training phase and (2) the detection phase. The training phase, as described in Fig. 5, includes the following steps:
The training data set is pre-processed to extract features. It is then passed through the training process, which uses machine learning techniques to create the classifier.
The detection phase, as depicted in Fig. 6, consists of the following steps:
From the URL of the monitored web page, the page HTML code is downloaded;
The page HTML code is pre-processed to extract features. Then, it is classified using the classifier generated in the training process. The result of the detection phase is the page status of either “Normal” or “Attacked”.
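The detection phase can be summarized by the following Python sketch; `vectorizer` stands for the pre-processing step described below and `clf` for the classifier produced by the training phase. Both names are hypothetical, not identifiers from the paper.

```python
import urllib.request

def check_page(url: str, vectorizer, clf) -> str:
    """Download a monitored page, extract features and classify it."""
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")
    features = vectorizer.transform([html])   # same pre-processing as training
    label = clf.predict(features)[0]          # classifier from the training phase
    return "Attacked" if label == 1 else "Normal"
```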
The Experimental Data Set. The experimental data set consists of normal web pages and defaced web pages as follows:
Normal web pages include 100 web pages in English, collected from websites of world universities (MIT, Stanford, etc.) and Vietnamese universities (Hanoi National University, Hanoi University of Science and Technology, etc.);
Defaced web pages include 300 web pages, extracted from the Goldrake data set [9].
Pre-processing. Pre-processing is responsible for extracting page features and vectorizing them. It includes the following two steps:
Extracting page features using the n-gram method. The n-gram method is used because it is simple, fast and does not depend on the meaning of the page's content. We select 2-grams and 3-grams to extract page features.
Vectorizing page features using the Term Frequency (tf) method. For each n-gram, a tf value is calculated as follows:

tf(t, d) = f(t, d) / max{f(w, d) : w ∈ d}    (1)
where tf(t, d) is the term frequency of n-gram t in HTML file d; f(t, d) is the number of occurrences of n-gram t in HTML file d; and max{f(w, d) : w ∈ d} is the maximum number of occurrences of any n-gram in HTML file d. From formula (1), we can see that the tf(t, d) value is within the range [0, 1]. Following Kim et al. [6], we select the 300 n-grams with the highest tf values to construct the feature vector for each web page. The output of the pre-processing is a set of ARFF (Attribute-Relation File Format) data files.
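A minimal sketch of this pre-processing step under our reading of formula (1) follows; the character-level tokenization is an assumption on our part, and the ARFF serialization step is omitted.

```python
from collections import Counter

def char_ngrams(html: str, n: int = 2) -> list:
    """Overlapping character n-grams of the page HTML."""
    return [html[i:i + n] for i in range(len(html) - n + 1)]

def tf_features(html: str, n: int = 2, top_k: int = 300) -> dict:
    counts = Counter(char_ngrams(html, n))
    max_count = max(counts.values())                    # max{f(w,d) : w in d}
    tf = {g: c / max_count for g, c in counts.items()}  # formula (1)
    # Keep the top_k n-grams with the highest tf values, as in [6].
    top = sorted(tf.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return dict(top)
```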
Training. The training phase uses two machine learning algorithms, Naïve Bayes and the J48 decision tree, as implemented in the Weka machine learning tool [13]. We selected these algorithms because they are fast and therefore suitable for online attack detection systems. After the experiments, the learning algorithm that produces the higher detection accuracy is selected for the method's implementation. In addition, the training phase can be done offline, so it does not affect the performance of the detection phase.
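For illustration, the sketch below reproduces the training step with scikit-learn as a stand-in for Weka: GaussianNB approximates Naïve Bayes and DecisionTreeClassifier (CART) approximates J48 (Weka's C4.5 implementation). The random data is a placeholder for the real tf feature vectors.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data: 400 pages x 300 tf features, labels 1 = Attacked, 0 = Normal.
X = np.random.rand(400, 300)
y = np.random.randint(0, 2, 400)

for name, clf in [("Naive Bayes", GaussianNB()),
                  ("Decision tree", DecisionTreeClassifier())]:
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.4f}")
```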
Experimental Measurements. The experimental measurements used are PPV (positive predictive value), FPR (false positive rate), TPR (true positive rate), ACC (accuracy) and F1 (F1 score). They are calculated as follows:

PPV = TP / (TP + FP) * 100% (2)
FPR = FP / (FP + TN) * 100% (3)
TPR = TP / (TP + FN) * 100% (4)
ACC = (TP + TN) / (TP + FP + TN + FN) * 100% (5)
F1 = 2TP / (2TP + FP + FN) * 100% (6)
where TP, FP, FN and TN are defined as follows:
TP is the number of defaced pages that are correctly classified as Attacked.
FP is the number of normal pages that are incorrectly classified as Attacked.
FN is the number of defaced pages that are incorrectly classified as Normal.
TN is the number of normal pages that are correctly classified as Normal.
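These measurements follow directly from the confusion counts, as the short sketch below shows; the example values are the Naïve Bayes row of Table 1.

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the paper's measurements (in %) from the confusion counts."""
    return {
        "PPV": 100 * tp / (tp + fp),
        "FPR": 100 * fp / (fp + tn),
        "TPR": 100 * tp / (tp + fn),
        "ACC": 100 * (tp + tn) / (tp + fp + tn + fn),
        "F1":  100 * 2 * tp / (2 * tp + fp + fn),
    }

# Naive Bayes on Data file No.1: TP=87, FP=0, FN=13, TN=100
print(metrics(87, 0, 13, 100))  # ACC = 93.5, F1 ~ 93.05
```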
Experimental Scenarios and Results. From the training data set, we created four ARFF data files and carried out the training and detection (classification) phases using the Weka machine learning tool. The compositions of the four ARFF data files are as follows:
Data file No.1 consists of 100 normal pages and 100 defaced pages, using 2-grams.
Data file No.2 consists of 100 normal pages and 300 defaced pages, using 2-grams.
Data file No.3 consists of 100 normal pages and 100 defaced pages, using 3-grams.
Data file No.4 consists of 100 normal pages and 300 defaced pages, using 3-grams.
Each ARFF data file is fed into Weka for training and classification using the two learning algorithms, Naïve Bayes and J48 decision tree. The experimental results are the averages produced by 10-fold cross-validation. Tables 1, 2, 3 and 4 show the experimental results on Data files No.1, No.2, No.3 and No.4, respectively.
Table 1: Experimental results on Data file No.1

| Learning Algorithm | TP | FP | FN | TN | PPV (%) | FPR (%) | TPR (%) | ACC (%) | F1 (%) |
| Naïve Bayes | 87 | 0 | 13 | 100 | 100 | 0 | 87.00 | 93.50 | 93.05 |
| J48 Tree | 100 | 0 | 0 | 100 | 100 | 0 | 100 | 100 | 100 |

Table 2: Experimental results on Data file No.2

| Learning Algorithm | TP | FP | FN | TN | PPV (%) | FPR (%) | TPR (%) | ACC (%) | F1 (%) |
| Naïve Bayes | 279 | 0 | 21 | 100 | 100 | 0 | 93.00 | 94.75 | 96.37 |
| J48 Tree | 299 | 0 | 1 | 100 | 100 | 0 | 99.67 | 99.75 | 99.83 |

Table 3: Experimental results on Data file No.3

| Learning Algorithm | TP | FP | FN | TN | PPV (%) | FPR (%) | TPR (%) | ACC (%) | F1 (%) |
| Naïve Bayes | 93 | 0 | 7 | 100 | 100 | 0 | 93.00 | 96.50 | 96.37 |
| J48 Tree | 100 | 0 | 0 | 100 | 100 | 0 | 100 | 100 | 100 |

Table 4: Experimental results on Data file No.4

| Learning Algorithm | TP | FP | FN | TN | PPV (%) | FPR (%) | TPR (%) | ACC (%) | F1 (%) |
| Naïve Bayes | 289 | 1 | 11 | 99 | 99.66 | 1.00 | 96.33 | 97.00 | 97.97 |
| J48 Tree | 299 | 0 | 1 | 100 | 100 | 0 | 99.67 | 99.75 | 99.83 |
The experimental results shown in Tables 1 to 4 confirm the following:
3-grams perform better than 2-grams on the same data set, but only with Naïve Bayes; for the J48 decision tree, the results are identical. We therefore select 2-grams as our page feature extraction method, because 2-gram extraction is much faster than 3-gram extraction.
The J48 decision tree outperforms the Naïve Bayes algorithm, producing high and stable detection accuracy in all experiments. We therefore select the J48 decision tree as our learning algorithm; it is fast and widely used in text and document classification.
Overall, the proposed method achieves high detection accuracy (ACC above 93% in all cases) and a low false positive rate (FPR less than 1% in all cases). This is very promising for the practical implementation of an online website defacement monitoring and detection system.
4 CONCLUSION
This paper proposes a machine learning-based method for website defacement detection. In our approach, the detection profile is learned automatically from a training data set of both normal and defaced web pages. The experimental results show that our method produces high detection accuracy and a low false positive rate. In addition, our method does not require extensive computational resources, so it is practical for the implementation of an online website defacement monitoring and detection system.
In the future, we will carry out more experiments on a larger number of web pages in both English and Vietnamese to confirm the proposed method's detection performance.
Furthermore, we will implement the proposed method into a website defacement monitoring and detection system.
ACKNOWLEDGMENTS
This work has been supported by the Cyber Security Lab, Posts and Telecommunications Institute of Technology, Hanoi, Vietnam, and funded by the Ministry of Science and Technology, Vietnam, under grant number KC.01.05/16-20.
REFERENCES
[1] Digistar.vn. 2018. What is a web defacement attack and defensive measures. https://www.digistar.vn/tan-cong-giao-dien-deface-la-gi-va-cach-khac-phuc/, last accessed 2018/06/20.
[2] M. Romagna and N.J. van den Hout. 2017. Hacktivism and Website Defacement: Motivations, Capabilities and Potential Threats. In 27th Virus Bulletin International Conference, Vol. 1, Madrid, Spain.
[3] Wang Wei. 2018. Rise in website defacement attacks by hackers around the world. https://thehackernews.com/2013/11/rise-in-website-defacement-attacks-by.html, last accessed 2018/06/20.
[4] VNCS. 2018. VNCS – Central website monitoring solution. http://vncs.vn/portfolio/giai-phap-giam-sat-websites-tap-trung, last accessed 2018/06/20.
[5] Nagios. 2018. Nagios Web Application Monitoring Software. https://www.nagios.com/solutions/web-application-monitoring/, last accessed 2018/06/20.
[6] W. Kim, J. Lee, E. Park and S. Kim. 2006. Advanced Mechanism for Reducing False Alarm Rate in Web Page Defacement Detection. National Security Research Institute, Korea.
[7] E. Medvet, C. Fillon and A. Bartoli. 2007. Detection of Web Defacements by means of Genetic Programming. In IAS 2007, Manchester, UK.
[8] G. Davanzo, E. Medvet and A. Bartoli. 2008. A Comparative Study of Anomaly Detection Techniques in Web Site Defacement Detection. In SEC 2008: Proceedings of the IFIP TC 11 23rd International Information Security Conference, pp. 711-716. DOI: 10.1007/978-0-387-09699-5_50.
[9] A. Bartoli, G. Davanzo and E. Medvet. 2010. A Framework for Large-Scale Detection of Web Site Defacements. ACM Transactions on Internet Technology, Vol. 10, No. 3, Article 10.
[10] G. Davanzo, E. Medvet and A. Bartoli. 2011. Anomaly detection techniques for a web defacement monitoring service. Expert Systems with Applications, 38 (2011), 12521-12530. DOI: 10.1016/j.eswa.2011.04.038.
[11] K. Borgolte, C. Kruegel and G. Vigna. 2015. Meerkat: Detecting Website Defacements through Image-based Object Recognition. In Proceedings of the 24th USENIX Security Symposium (USENIX Security 2015).
[12] Zone-H. 2018. http://zone-h.org, last accessed 2018/06/20.
[13] Weka. 2018. https://www.cs.waikato.ac.nz/ml/weka/, last accessed 2018/06/20.