This chapter discussed the fundamentals of SQLIA in the following sections: SQLIA intent (an intruder's intention and purpose for SQLIA); SQLIA mechanisms (the conduit for SQLIA); SQLIA types (techniques employed to carry out SQLIA); and a review of the techniques applied in this thesis.
The scheme presented in this thesis employs a web proxy API to intercept web requests of any intent and applies predictive analytics techniques to predict SQLIA at the SQL injection point (predicate and expression locations of the SQL statement structure).
Injection into a vulnerable application can originate from web page forms, second-order injection, web-enabled server variables, query strings, and cookies. An intruder could exploit injection points using the following SQLIA types in any combination: Tautology; Union; Piggyback; Invalid/Logical queries; Time-based; Obfuscation encoding; and Stored procedure. The SQLIA types and SQL tokens are a source of SQLIA positive labelled data items in the scheme presented in this thesis. Also discussed in this chapter are the fundamentals of the techniques applied in this thesis, which include the use of FSA to generate related member strings, the Fiddler proxy to intercept web requests for AI prediction of SQLIA, and MAML as the implementation platform.
7.3 Chapter 3: Literature Review
This chapter examined and reviewed the existing research literature on SQLI detection and prevention techniques to enhance the security of web applications. The research area of SQLI detection and prevention has seen diverse methodologies proposed over the years by various researchers. In this chapter, we broadly discussed these approaches under three categories: (1) SQLIV testing and detection; (2) approaches that apply defensive coding in web application code sanitisation for SQLI prevention; and (3) dynamic runtime analysis, including taint-based methods and methods applying AI. Each section is followed by a discussion of the existing literature to establish the gap in the context of emerging computing and the contributions this thesis makes to fill that gap. We critique the existing research work on the following grounds:
• The approaches are SQLIA mitigations that predate emerging computing.
• The approaches applying ML algorithms use data sets that are not pattern-driven.
• The dependence on static source code scanning for SQLIA mitigation in most existing research work hinders intercepting web requests destined for cloud SDN for analysis, as cloud-hosted services have restricted access to source code.
• The web requests that drive internet traffic are now big data. This is beyond traditional lookup using classic programming looping constructs to search for attack strings or SQLI signatures, which are not known to be scalable when compared with ML approaches driven by robust cloud-hosted infrastructure in the context of emerging computing.
• There is a need for a functional and scalable approach to SQLIA mitigation that uses robust cloud-hosted AI platforms, such as Azure ML, for big data mining in emerging computing.
• There is a need for historical learning data relevant to a real-world web application type when applying ML techniques.
In this thesis, we presented a multi-layer ML-based SQLIA mitigation that targets the web client form for input validation and, at the SDN proxy server, applies predictive analytics to intercepted web requests to detect SQLIA. This thesis provides a runtime analysis technique for web requests that predicts at the SQLIA hotspot (the SQL predicate's expression location). Applying ML techniques provides a functional approach to SQLIA mitigation in emerging computing, where applications and services are hosted in the cloud and big data emanates from the web requests to these cloud-hosted applications. The paradigm shift in this thesis to a pattern-driven data set to train a classifier removes the reliance on antiquated test case data sets [65], [66] and on query comparison of some sort. The pattern-driven data set approach provides a technique to derive, on the fly, a data set relevant to the context of the web application type requiring ML mitigation of SQLIA. The existing approaches applying ML require full access to source code to obtain normal and malicious queries for comparison, whether in the string lookup or the SQL query matching approaches. In the scheme presented in this thesis, the pattern-driven data set, which contains attribute values of related member strings, is the subject of prediction during the capture and analysis of web requests, which looks at the substitution values destined for the predicate's expression location (SQLIA hotspot) in the SQL query structure. Understanding the expected input data and existing SQLIA signatures, including SQL tokens, allows us to
concentrate on analysing and predicting input data in transit, rather than reconstructing the full queries needed in a non-pattern-driven data set approach.
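To make the idea of looking only at substitution values concrete, the following is a minimal Python sketch of pulling candidate values out of an intercepted request without reconstructing any SQL query. It is not the Fiddler-based implementation described in this thesis; the function name `extract_substitution_values`, the example URL, and the tuple layout are assumptions introduced purely for illustration.

```python
from http.cookies import SimpleCookie
from urllib.parse import parse_qsl, urlsplit


def extract_substitution_values(url, body="", cookie_header=""):
    """Collect (source, name, value) triples from one web request.

    Hypothetical helper: these decoded values are what a classifier would
    score, since they are the data destined for the predicate's expression
    location of a SQL statement.
    """
    values = []
    # Query-string parameters (percent-encoding is decoded by parse_qsl).
    for name, value in parse_qsl(urlsplit(url).query, keep_blank_values=True):
        values.append(("query", name, value))
    # URL-encoded form fields from the request body.
    for name, value in parse_qsl(body, keep_blank_values=True):
        values.append(("form", name, value))
    # Cookie values from the Cookie request header.
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    for name, morsel in cookie.items():
        values.append(("cookie", name, morsel.value))
    return values


if __name__ == "__main__":
    request_values = extract_substitution_values(
        "http://example.test/login?user=alice&pass=%27%20OR%20%271%27%3D%271",
        body="comment=hello+world",
        cookie_header="session=abc123",
    )
    for triple in request_values:
        print(triple)
```

In this sketch each extracted value would then be vectorised and passed to a trained classifier, so the analysis needs no access to the application's source code or its query templates.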
7.4 Chapter 4: Numerical Encoding to Tame SQLIA
In this chapter, we looked at the issues of data set availability and proposed a pattern-driven data set. The availability of historical data has advanced the use of AI through the application of ML techniques. Unfortunately, as there is no unified pre-existing data set, researchers over the years have resorted to various approaches to generating sample data sets, most of which lack patterns to enhance learning and are fraught with complex computational overheads.
We investigated whether the patterns that exist in any web application type context can be encoded into vectors to train a supervised learning model. We conducted experiments and inferred patterns by numeric encoding of features to derive a data set vector of any magnitude to train a supervised learning model, testing the feasibility of mitigating SQLIA by employing AI techniques. We presented the following contributions in this chapter:
• The ontology for crafting a pattern-driven data set from the web application type with a technique to encode the data set into vectors required to train a supervised learning model.
• We trained a supervised learning classification algorithm with this pattern-driven data and validated the trained model under various classification algorithms with high performance metrics in the ROC curve, CM and cross-validation as presented in this chapter.
• The success of this conceptual approach of using the web application type as the source of a pattern-driven data set to train a classifier has led to further work in Chapters 5 and 6, which employs string hashing vectorisation in place of manual numeric encoding to provide a proof of concept of how the proposal would be applied in a real-world application scenario.
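As an illustration of what manual numeric encoding of an input string into a vector could look like, the Python sketch below maps a value to a small fixed-length list of integers. The feature set, the token list `SQL_TOKENS`, and the function name `encode` are hypothetical choices made for this example and are not the encoding scheme used in the experiments.

```python
import re

# Hypothetical, deliberately small list of SQL tokens of interest.
SQL_TOKENS = ("or", "and", "union", "select", "drop", "insert",
              "delete", "update", "sleep", "waitfor")


def encode(value):
    """Encode one input string as a fixed-length numeric feature vector."""
    words = re.findall(r"[a-z_]+", value.lower())
    return [
        len(value),                                   # input length
        sum(word in SQL_TOKENS for word in words),    # SQL token count
        value.count("'") + value.count('"'),          # quote characters
        int("--" in value or "/*" in value or "#" in value),  # comment marker
        int(";" in value),                            # statement separator
    ]


if __name__ == "__main__":
    for sample in ("alice", "' OR '1'='1"):
        print(sample, encode(sample))
```

Rows of this kind, paired with a positive or negative label, are the sort of vectors a supervised classifier can be trained on; hand-picked features like these are what string hashing vectorisation replaces in the later chapters.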
We concluded from the experimental results, with an empirical evaluation showing low prediction error, that a pattern-driven data set can be used to train a supervised learning model for ML-based SQLIA mitigation. The high performance metrics, with an AUC range of 0.944 to 1.0, set a precedent for the suitability of applying
feature hashing or vectorisation to the pattern-driven data set generated from related member strings in Chapters 5 and 6.
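The general idea behind feature hashing can be sketched in a few lines of Python: each token of an input string is hashed to a bucket index in a fixed-width vector, so no manual feature dictionary is needed. This is a generic illustration of the hashing trick under assumed choices (MD5 for a deterministic hash, 16 buckets, a simple tokeniser, and the name `hash_features`); it is not the hashing function or configuration of any particular ML platform.

```python
import hashlib
import re


def hash_features(text, n_features=16):
    """Map a string to a fixed-length count vector via the hashing trick."""
    vector = [0] * n_features
    # Tokenise into word runs and single punctuation characters.
    for token in re.findall(r"\w+|[^\w\s]", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Use the first four bytes of the digest to choose a bucket.
        index = int.from_bytes(digest[:4], "big") % n_features
        vector[index] += 1
    return vector


if __name__ == "__main__":
    print(hash_features("alice"))
    print(hash_features("' OR '1'='1"))
```

Because the vector width is fixed in advance, strings of any length map to the same number of columns, at the cost of occasional collisions between unrelated tokens.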