Credits: 5EC
Pre-requisites: Fundamentals of Data Analytics (SEN1631) or Data Mining (TI2736-C) or Pattern Recognition (IN40850) or Basic Machine Learning (201600070).
Motivation: Data analytics provides methods and tools to process large amounts of data into knowledge. An important application of data analytics is in the field of cyber security. Behavioral profiling is used to track processes and people in order to detect misuse such as fraud and to track terrorists. Data mining provides important insights into hidden systems such as malware. The system and communication behavior of software is automatically reverse-engineered using machine learning techniques in order to understand its inner workings, find bugs, and obtain behavioral fingerprints.
Synopsis: The course provides theoretical and practical background for applying data analytics in the field of cyber security. Cyber data analytics is a huge field with a great diversity of techniques and applications. The course is centered on a selection of seven such techniques:
- Anomaly detection;
- Behavioral profiling;
- Data stream mining;
- Distributed data processing;
- Automated reverse-engineering;
- Information fusion; and
- Privacy-aware data mining.
Anomaly detection is one of the main topics in cyber security. Specific difficulties that the student will learn to handle are the huge amounts of data and the large number of false positives. Behavioral profiling applies to both people and software processes. Different techniques will be taught to handle the different kinds of input data used to construct these profiles, such as websites and software logs. Data stream mining is used to learn from huge data streams, and distributed processing is used to divide the learning task over multiple processing cores. For automated reverse-engineering, software code and implementations form an important source of information in addition to traditional sample data sets. Information fusion combines data from heterogeneous sources, such as network traffic and sensor measurements, into one model. Privacy-aware data mining aims to learn useful models from data that has been transformed to preserve the privacy of the people in the data, e.g., by adding noise.
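As an illustration of the last point, the following minimal Python sketch shows how an aggregate statistic can remain usable after additive noise has masked the individual records. The data, the function name add_noise, and the noise scale are hypothetical choices made for this example only; the sketch is not course material and does not provide a formal privacy guarantee.

```python
import random
import statistics


def add_noise(values, scale, seed=None):
    """Return a copy of values with independent zero-mean Gaussian noise added."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, scale) for v in values]


# Hypothetical sensitive attribute (illustrative values only, not real data).
true_values = [20 + (i % 50) for i in range(10_000)]

# Each record is perturbed individually (assumed noise scale of 10).
noisy_values = add_noise(true_values, scale=10.0, seed=0)

# The noise has zero mean, so it largely averages out in the aggregate.
true_mean = statistics.fmean(true_values)
noisy_mean = statistics.fmean(noisy_values)

print(f"mean of original data: {true_mean:.2f}")
print(f"mean of noisy data:    {noisy_mean:.2f}")
```

With many records, the mean of the noisy data stays close to the true mean even though each individual value has been changed. A larger noise scale hides individual values better but makes the learned statistic less accurate.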
There will be one lecture for each of the seven techniques, and one lab exercise every two weeks in which teams of two students will work on real use-cases of these topics.
Aim: The aim of the course is to enable students to develop solutions for anomaly detection, knowledge discovery, threat analysis, and software diagnostics in cyber security.
Learning outcomes: The student will be able to:
- Develop and analyze algorithms that learn models from large data streams;
- Detect anomalies in system logs, e.g., for fraud detection;
- Construct behavioral profiles of both people and software, whilst respecting privacy;
- Learn models from multiple data sources (e.g., websites, network traces, code);
- Use machine learning to discover and analyze threats in software components.
Lecturers: Dr Sicco Verwer (TUD). There will be local support for students in Twente by a teaching assistant.
Examination: Lab assignment in teams of two students (50%) and an individual summative exam (50%).
Contents: Anomaly/fraud detection (ensemble methods, SMOTE, frequent pattern mining, statistical profiling, fingerprinting); stream mining (information sketches, Bloom filters, locality-sensitive hashing, Hoeffding trees); distributed data processing (MapReduce, master/slave, Hadoop, Spark); automated reverse engineering (protocol format reversing, passive state machine learning, n-grams, hidden Markov models); knowledge fusion (Bayesian networks, diverse density, collaborative intrusion detection); privacy-aware data mining (record linkage, additive noise, k-anonymity, generating synthetic data).
Core text: State-of-the-art literature