Credits: 5EC
Pre-requisites: One of: IN4085 Pattern Recognition (TUD), IN4170 Databases and Datamining (TUD), TI2736-C Datamining (TUD), 201300239 Machine Learning (UT), 192320201 Data warehousing & data mining (UT), 201100254 Adv. Comp. Vision & Pattern Recognition (UT)
Motivation: Data analytics provides methods and tools to process large amounts of data into knowledge. An important application of data analytics is in the field of cyber security. Behavioral profiling is used to track processes and people in order to detect misuse such as fraud and to track terrorists. Data mining provides important insights into hidden systems such as malware and the dark web. The system and communication behavior of software is automatically reverse-engineered using machine learning techniques in order to understand its inner workings, find bugs, and obtain behavioral fingerprints.
Synopsis: The course provides theoretical and practical background for applying data analytics in the field of cyber security. Cyber data analytics is a huge field with a great diversity of techniques and applications. The course is centered on a selection of five topics:
- behavioral profiling and anomaly detection;
- data stream mining and distributed data processing;
- web-crawling and text mining;
- software fuzzing and protocol reverse-engineering; and
- information fusion and collaborative knowledge discovery.
Anomaly detection is one of the main topics in cyber security. The student will learn to handle two specific difficulties: the huge amounts of data and the large number of false positives. Behavioral profiling applies to both people and software processes. Different techniques will be taught to handle the different kinds of input data used to construct these profiles, such as websites and software logs. Besides the traditional sample data sets, software code and implementations form an important source of information for cyber data analytics. In addition to training from execution logs, the student will learn how to use this information source by actively providing input and learning from the returned output.
There will be two lectures for each of the five topics, and a large lab exercise in which teams of two students will work on a use case for one of these topics. Each team is free to choose its own topic from a selection of recent research in cyber data analytics.
Aim: The aim of the course is to enable students to develop solutions for anomaly detection, knowledge discovery, threat analysis, and software diagnostics in cyber security.
Learning outcomes: The student will be able to:
- Develop and analyze algorithms that learn models from large data streams;
- Detect anomalies in system logs, e.g., for fraud detection;
- Construct behavioral profiles of both people and software;
- Learn insightful models from multiple data sources (e.g., websites, network traces, software code);
- Apply knowledge fusion and collaborative knowledge discovery methods;
- Use machine learning to discover and analyze threats in software components.
Lecturers: Dr Sicco Verwer (TUD)
Examination: One large lab assignment in teams of two students resulting in a written report (50%) and an individual summative exam on selected content (50%).
Contents: (see synopsis) Core content concerns:
- anomaly/fraud detection (clustering, frequent pattern mining, statistical profiling, semi-supervised learning, device profiling);
- stream mining (information sketches, frequent patterns, incremental SVM learning, Hoeffding trees);
- distributed data processing (Hadoop, Spark);
- web-crawling (preprocessing, detecting duplicates, page selection, dark-web mining);
- text mining (n-grams, bag-of-words, term frequency/document frequency);
- knowledge fusion (Bayesian networks, ensembles, diverse density);
- fuzzing (mutation-based, model-based, white-box, test selection); and
- active/passive state machine learning (state merging, Bayesian model merging/splitting, L*).
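To give a flavor of the "information sketches" listed under stream mining, the following is a minimal count-min sketch in Python. It is an illustration only and not taken from the course material: the class name `CountMinSketch`, the `width` and `depth` values, and the example stream of addresses are all assumptions chosen for the demonstration. The structure estimates how often an item has occurred in a stream using fixed memory, at the cost of possibly overestimating counts.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter for a data stream (illustrative only)."""

    def __init__(self, width=256, depth=4):
        # width and depth are arbitrary demo values, not recommended settings.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Salt the hash with the row number to get one column per row.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate counters, so the minimum over rows
        # is the tightest available upper bound on the true count.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


# Hypothetical stream of source addresses, e.g. taken from a network log.
stream = ["10.0.0.1"] * 50 + ["10.0.0.2"] * 5 + ["10.0.0.3"]

sketch = CountMinSketch()
for address in stream:
    sketch.add(address)

if __name__ == "__main__":
    for address in ("10.0.0.1", "10.0.0.2", "10.0.0.3"):
        print(address, sketch.estimate(address))
```

The estimate never falls below the true count and never exceeds the total number of items seen, which is why such sketches are useful for spotting unusually frequent items in streams too large to store.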
Core text: State-of-the-art literature